Introduction

A regular expression, or regex, is a sequence of characters that specifies a pattern to search for in text. A regex defines a set of strings, usually grouped for a particular purpose. Suppose you need a way to formalize and refer to all the strings that are valid email addresses. Because the number of possible email addresses is practically unlimited, enumerating them all is impractical. However, an email address has a specific structure, and that structure can be encoded using regex syntax.
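
As a rough illustration of encoding that structure, the sketch below uses Java's java.util.regex package. The pattern is a deliberately simplified assumption (a local part, an @ sign, and a dotted domain) and does not cover the full address grammar; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class EmailSketch {
    // Simplified, illustrative pattern: local part, '@', then dot-separated domain labels.
    // Real-world address syntax is considerably more permissive than this.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

    // True when the entire input has the shape described by EMAIL.
    static boolean looksLikeEmail(String s) {
        return EMAIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("alice@example.com")); // true
        System.out.println(looksLikeEmail("not an email"));      // false
    }
}
```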

Character Classes

  • [abc] : a, b, or c (simple class)
  • [^abc] : Any character except a, b, or c (negation)
  • [a-zA-Z] : a through z or A through Z, inclusive (range)
  • [a-d[m-p]] : a through d, or m through p: [a-dm-p] (union)
  • [a-z&&[def]] : d, e, or f (intersection)
  • [a-z&&[^bc]] : a through z, except for b and c: [ad-z] (subtraction)
  • [a-z&&[^m-p]] : a through z, and not m through p: [a-lq-z] (subtraction)
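
The character classes above can be tried out with a small Java sketch. It assumes the java.util.regex flavor, where nested classes and && are supported; the class name CharClassDemo and its helper are hypothetical.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True when the whole input is matched by the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[^abc]", "d"));        // true: negation
        System.out.println(matches("[a-d[m-p]]", "n"));    // true: union
        System.out.println(matches("[a-z&&[def]]", "a"));  // false: intersection keeps only d, e, f
        System.out.println(matches("[a-z&&[^bc]]", "b"));  // false: b is subtracted
        System.out.println(matches("[a-z&&[^m-p]]", "q")); // true: q lies outside m-p
    }
}
```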

Predefined Character Classes

  • . : Any character (may or may not match line terminators)
  • \d : A digit: [0-9]
  • \D : A non-digit: [^0-9]
  • \h : A horizontal whitespace character
  • \H : A non-horizontal whitespace character
  • \s : A whitespace character: [ \t\n\x0B\f\r]
  • \S : A non-whitespace character: [^\s]
  • \v : A vertical whitespace character
  • \V : A non-vertical whitespace character
  • \w : A word character: [a-zA-Z_0-9]
  • \W : A non-word character: [^\w]
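
A minimal sketch of the predefined classes in use, assuming Java, where each backslash must be doubled inside a string literal. The helper names are hypothetical. By default \d, \w, and \s match the ASCII sets listed above.

```java
class PredefinedDemo {
    // Collapse every run of whitespace (\s+) into one space.
    static String collapseWhitespace(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    // Remove every non-digit (\D), leaving only the digits.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // True when the input consists solely of word characters (\w).
    static boolean isWord(String s) {
        return s.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(collapseWhitespace("a \t\n b"));      // a b
        System.out.println(digitsOnly("+1 (555) 010-9999"));      // 15550109999
        System.out.println(isWord("snake_case_1"));               // true
        System.out.println(isWord("two words"));                  // false
    }
}
```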

Boundary Matchers

  • ^ : The beginning of a line
  • $ : The end of a line
  • \b : A word boundary
  • \B : A non-word boundary
  • \A : The beginning of the input
  • \G : The end of the previous match
  • \Z : The end of the input but for the final terminator, if any
  • \z : The end of the input
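
The boundary matchers can be illustrated with the following Java sketch. It assumes java.util.regex, where ^ and $ match at line boundaries only when Pattern.MULTILINE is set and otherwise match at the start and end of the whole input. The class and helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Collect every substring matched by the pattern, in order.
    static List<String> findAll(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // True when the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "cat concat\ncatalog cat";
        // \b: only the standalone words match, not "concat" or "catalog".
        System.out.println(findAll(Pattern.compile("\\bcat\\b"), text));             // [cat, cat]
        // ^ without MULTILINE matches at the start of the input only.
        System.out.println(findAll(Pattern.compile("^cat"), text));                  // [cat]
        // With MULTILINE, ^ also matches after the line terminator.
        System.out.println(findAll(Pattern.compile("^cat", Pattern.MULTILINE), text)); // [cat, cat]
        // \G chains each match to the end of the previous one, so scanning stops at 'a'.
        System.out.println(findAll(Pattern.compile("\\G\\d"), "123a45"));            // [1, 2, 3]
        // \Z tolerates a final line terminator; \z does not.
        System.out.println(found("end\\Z", "end\n"));                                // true
        System.out.println(found("end\\z", "end\n"));                                // false
    }
}
```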

Greedy Quantifiers

  • X? : X, once or not at all
  • X* : X, zero or more times
  • X+ : X, one or more times
  • X{n} : X, exactly n times
  • X{n,} : X, at least n times
  • X{n,m} : X, at least n but not more than m times
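
A short Java sketch of the quantifiers listed above, again assuming java.util.regex. The last call shows what "greedy" means in practice: the quantifier consumes as much input as it can while still allowing the overall match to succeed. The class and helper names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // True when the whole input is matched by the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // The first substring matched by the regex, or "" when there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(matches("colou?r", "color"));    // true: u is optional
        System.out.println(matches("ab*", "a"));            // true: zero b's allowed
        System.out.println(matches("ab+", "a"));            // false: at least one b required
        System.out.println(matches("\\d{4}", "2024"));      // true: exactly four digits
        System.out.println(matches("\\d{2,}", "7"));        // false: fewer than two digits
        System.out.println(matches("\\d{2,4}", "12345"));   // false: more than four digits
        System.out.println(firstMatch("<.+>", "<a><b>"));   // <a><b>: greedy, spans both tags
    }
}
```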

Logical Operators

  • XY : X followed by Y
  • X|Y : Either X or Y
  • (X) : X, as a capturing group
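
The sketch below combines concatenation, alternation, and a capturing group in Java. The file-name pattern and the names GroupDemo and extension are hypothetical, chosen only for illustration. It also shows that | applies to everything on either side of it unless a group limits its scope.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Group 1 captures the base name, group 2 one of the alternative extensions.
    static final Pattern IMAGE_FILE = Pattern.compile("(\\w+)\\.(png|jpe?g|gif)");

    // The captured extension, or empty when the name does not match.
    static Optional<String> extension(String fileName) {
        Matcher m = IMAGE_FILE.matcher(fileName);
        return m.matches() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extension("photo.jpeg"));             // Optional[jpeg]
        System.out.println(extension("notes.txt"));              // Optional.empty
        // With a group, the alternation covers a single letter.
        System.out.println(Pattern.matches("gr(a|e)y", "grey")); // true
        // Without a group, the alternatives are "gra" and "ey".
        System.out.println(Pattern.matches("gra|ey", "grey"));   // false
    }
}
```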

Precedence

The precedence of character-class operators is as follows, from highest to lowest:

  1. Literal escape: \x
  2. Grouping: […]
  3. Range: a-z
  4. Union: [a-e][i-u]
  5. Intersection: [a-z&&[aeiou]]
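
One way to observe this ordering is the Java sketch below, which assumes java.util.regex. In [a-ei-u&&[aeiou]] the two ranges are united first and the result is then intersected with the vowels, so a consonant inside a-e is rejected. The escaped brackets show a literal escape taking priority over grouping. The class name is hypothetical.

```java
import java.util.regex.Pattern;

class PrecedenceDemo {
    // True when the whole input is matched by the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Union before intersection: (a-e or i-u) and-also vowel.
        System.out.println(matches("[a-ei-u&&[aeiou]]", "o")); // true
        System.out.println(matches("[a-ei-u&&[aeiou]]", "b")); // false: in a-e, but not a vowel
        System.out.println(matches("[a-ei-u&&[aeiou]]", "f")); // false: outside both ranges
        // Literal escape before grouping: \[ and \] are plain characters here.
        System.out.println(matches("[\\[\\]]", "["));          // true
    }
}
```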

Reference

Class Pattern