Introduction
A regular expression, or regex, is a sequence of characters that specifies a pattern to search for in text. A regex defines a set of strings, usually ones that share a common purpose. Suppose you need a way to formalize and refer to all the strings that are valid email addresses. Since the number of possible email addresses is practically unlimited, it would be impractical to enumerate them all. However, an email address has a specific structure, and we can encode that structure using regex syntax.
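To make the email example concrete, here is a minimal sketch in Java, whose `java.util.regex` package uses the syntax summarized in this document. The pattern is deliberately simplified and is an assumption for illustration only; real-world address rules are far more permissive. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class EmailPatternSketch {
    // Simplified structure (an assumption, not a full address grammar):
    // local part, '@', then a domain containing at least one dot.
    static final Pattern SIMPLE_EMAIL =
            Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    static boolean looksLikeEmail(String candidate) {
        // matches() succeeds only if the whole input fits the pattern
        return SIMPLE_EMAIL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("jane.doe@example.com")); // true
        System.out.println(looksLikeEmail("not an address"));       // false
    }
}
```

One short pattern stands in for the whole set of strings that share this structure, which is the point of the paragraph above.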
Character Classes
- [abc] : a, b, or c (simple class)
- [^abc] : Any character except a, b, or c (negation)
- [a-zA-Z] : a through z or A through Z, inclusive (range)
- [a-d[m-p]] : a through d, or m through p: [a-dm-p] (union)
- [a-z&&[def]] : d, e, or f (intersection)
- [a-z&&[^bc]] : a through z, except for b and c: [ad-z] (subtraction)
- [a-z&&[^m-p]] : a through z, and not m through p: [a-lq-z] (subtraction)
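The character classes listed above can be tried out with a short Java sketch. The helper name `matchesWhole` is hypothetical; it simply wraps `Pattern.matches`, which tests the entire input against the pattern.

```java
import java.util.regex.Pattern;

class CharacterClassSketch {
    // True only if the entire input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("[abc]", "b"));          // true  (simple class)
        System.out.println(matchesWhole("[^abc]", "b"));         // false (negation)
        System.out.println(matchesWhole("[a-zA-Z]", "Q"));       // true  (range)
        System.out.println(matchesWhole("[a-d[m-p]]", "n"));     // true  (union)
        System.out.println(matchesWhole("[a-z&&[def]]", "e"));   // true  (intersection)
        System.out.println(matchesWhole("[a-z&&[^bc]]", "b"));   // false (subtraction)
        System.out.println(matchesWhole("[a-z&&[^m-p]]", "q"));  // true  (subtraction)
    }
}
```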
Predefined Character Classes
- . : Any character (may or may not match line terminators)
- \d : A digit: [0-9]
- \D : A non-digit: [^0-9]
- \h : A horizontal whitespace character
- \H : A non-horizontal whitespace character
- \s : A whitespace character: [ \t\n\x0B\f\r]
- \S : A non-whitespace character: [^\s]
- \v : A vertical whitespace character
- \V : A non-vertical whitespace character
- \w : A word character: [a-zA-Z_0-9]
- \W : A non-word character: [^\w]
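As a rough illustration of the predefined classes above, the following Java sketch masks digits and splits on whitespace. Note that each backslash has to be doubled inside a Java string literal. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PredefinedClassSketch {
    // Replace every digit (\d) with '#'.
    static String maskDigits(String input) {
        return input.replaceAll("\\d", "#");
    }

    // Split on runs of one or more whitespace characters (\s+).
    static String[] splitOnWhitespace(String input) {
        return input.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(maskDigits("Room 42B"));                           // Room ##B
        System.out.println(Arrays.toString(splitOnWhitespace("a b\tc\n d"))); // [a, b, c, d]
        System.out.println(Pattern.matches("\\w+", "snake_case_1"));          // true
        System.out.println(Pattern.matches("\\w+", "kebab-case"));            // false: '-' is not a word character
        System.out.println(Pattern.matches("\\h", "\t"));                     // true: tab is horizontal whitespace
        System.out.println(Pattern.matches("\\v", "\n"));                     // true: newline is vertical whitespace
    }
}
```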
Boundary Matchers
- ^ : The beginning of a line
- $ : The end of a line
- \b : A word boundary
- \B : A non-word boundary
- \A : The beginning of the input
- \G : The end of the previous match
- \Z : The end of the input but for the final terminator, if any
- \z : The end of the input
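The boundary matchers above consume no characters; they only assert a position. The sketch below counts matches to show the effect, with a hypothetical helper `countMatches`. One Java-specific detail worth knowing: by default `^` and `$` match only at the start and end of the whole input, and the `Pattern.MULTILINE` flag is what makes them match at line boundaries.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundarySketch {
    // Count how many times the pattern is found in the input.
    static int countMatches(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // \b: only the standalone word "cat" is counted
        System.out.println(countMatches(Pattern.compile("\\bcat\\b"), "cat concat cats")); // 1

        String twoLines = "one two\nthree four";
        // ^ at the start of every line requires MULTILINE
        System.out.println(countMatches(Pattern.compile("^\\w+", Pattern.MULTILINE), twoLines)); // 2
        System.out.println(countMatches(Pattern.compile("^\\w+"), twoLines));                    // 1

        // \G: each match must start where the previous one ended
        System.out.println(countMatches(Pattern.compile("\\G\\d"), "123a45")); // 3

        // \Z tolerates a final line terminator, \z does not
        System.out.println(Pattern.compile("end\\Z").matcher("the end\n").find()); // true
        System.out.println(Pattern.compile("end\\z").matcher("the end\n").find()); // false
    }
}
```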
Greedy Quantifiers
- X? : X, once or not at all
- X* : X, zero or more times
- X+ : X, one or more times
- X{n} : X, exactly n times
- X{n,} : X, at least n times
- X{n,m} : X, at least n but not more than m times
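The quantifiers above are called greedy because they match as much input as they can while still letting the overall pattern succeed. A minimal Java sketch follows; the helper `firstMatch` is a hypothetical name that returns the first matched substring.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyQuantifierSketch {
    // Return the first substring matched by the regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches("colou?r", "color"));   // true  (X?)
        System.out.println(Pattern.matches("colou?r", "colour"));  // true
        System.out.println(Pattern.matches("a*", ""));             // true  (X* allows zero)
        System.out.println(Pattern.matches("a+", ""));             // false (X+ needs one)
        System.out.println(Pattern.matches("\\d{3}", "12"));       // false (X{n})
        System.out.println(Pattern.matches("\\d{2,}", "12345"));   // true  (X{n,})
        System.out.println(firstMatch("\\d{2,4}", "1234567"));     // 1234  (takes the maximum allowed)
        System.out.println(firstMatch("<.+>", "<b>bold</b>"));     // <b>bold</b> (greedy, not just <b>)
    }
}
```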
Logical Operators
- XY : X followed by Y
- X|Y : Either X or Y
- (X) : X, as a capturing group
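Concatenation, alternation, and capturing groups are usually combined. The sketch below assumes a hypothetical `YYYY-MM` input format to show how groups are read back by index (group 0 is the whole match, groups 1 and up follow the opening parentheses from left to right).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupingSketch {
    // Two capturing groups joined by a literal hyphen.
    static final Pattern YEAR_MONTH = Pattern.compile("(\\d{4})-(\\d{2})");

    // Returns {year, month}, or null if the input is not in YYYY-MM form.
    static String[] yearAndMonth(String input) {
        Matcher m = YEAR_MONTH.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = yearAndMonth("2024-05");
        System.out.println(parts[0] + " / " + parts[1]);          // 2024 / 05

        System.out.println(Pattern.matches("cat|dog", "dog"));    // true  (X|Y)
        System.out.println(Pattern.matches("gr(a|e)y", "grey"));  // true  (group limits the alternation)
        System.out.println(Pattern.matches("gra|ey", "grey"));    // false (means "gra" or "ey")
    }
}
```

The last two lines show why grouping matters: without parentheses, `|` splits the entire pattern into its alternatives.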
Precedence
The precedence of character-class operators is as follows, from highest to lowest:
- Literal escape \x
- Grouping [...]
- Range a-z
- Union [a-e][i-u]
- Intersection [a-z&&[aeiou]]
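A small Java sketch can make part of this ordering visible. Because a literal escape binds more tightly than a range, `[a\-z]` is not a range at all: it matches only `a`, `-`, or `z`. The class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

class PrecedenceSketch {
    // True only if the entire input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Range: every letter from a to z
        System.out.println(matchesWhole("[a-z]", "m"));     // true
        // Escape wins over range: the class holds just 'a', '-', 'z'
        System.out.println(matchesWhole("[a\\-z]", "m"));   // false
        System.out.println(matchesWhole("[a\\-z]", "-"));   // true
        // Escaped brackets are literals, not grouping
        System.out.println(matchesWhole("[\\[\\]]", "["));  // true
        // Ranges are formed first, then intersected
        System.out.println(matchesWhole("[a-z&&[aeiou]]", "e")); // true
        System.out.println(matchesWhole("[a-z&&[aeiou]]", "b")); // false
    }
}
```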