Regular Expression | Java

Introduction

A regular expression, or regex, is a sequence of characters that specifies a pattern to search for in text. A regex defines a set of strings, usually ones that share a common purpose. Suppose you need a way to formalize and refer to every string that has the format of an email address. Because the number of possible email addresses is practically unlimited, enumerating them all is impractical. However, an email address has a specific structure, and that structure can be encoded using regex syntax.
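
As a rough illustration of the email idea above, the following Java sketch encodes a deliberately simplified address structure. The class name EmailCheck, the method looksLikeEmail, and the pattern itself are assumptions made for this example; real address rules are considerably more permissive, so treat it as a sketch rather than a validator.

```java
import java.util.regex.Pattern;

class EmailCheck {
    // Simplified structure (an assumption, not a full standard):
    // local part, '@', one or more domain labels, then a top-level domain of 2+ letters.
    static final Pattern SIMPLE_EMAIL = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*\\.[A-Za-z]{2,}");

    static boolean looksLikeEmail(String s) {
        // matches() succeeds only if the entire input fits the pattern
        return SIMPLE_EMAIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("alice@example.com")); // true
        System.out.println(looksLikeEmail("not-an-email"));      // false
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call.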

Character Classes

[abc] : a, b, or c (simple class)
[^abc] : Any character except a, b, or c (negation)
[a-zA-Z] : a through z or A through Z, inclusive (range)
[a-d[m-p]] : a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] : d, e, or f (intersection)
[a-z&&[^bc]] : a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] : a through z, and not m through p: [a-lq-z] (subtraction)
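
The character classes above can be tried out with a small Java sketch. The helper name matchesOne is hypothetical; it simply tests one single-character input against one class using Pattern.matches.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Returns true if the whole input matches the given character class.
    static boolean matchesOne(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesOne("[abc]", "b"));         // true  (simple class)
        System.out.println(matchesOne("[^abc]", "b"));        // false (negation)
        System.out.println(matchesOne("[a-d[m-p]]", "n"));    // true  (union)
        System.out.println(matchesOne("[a-z&&[def]]", "a"));  // false (intersection)
        System.out.println(matchesOne("[a-z&&[^bc]]", "d"));  // true  (subtraction)
    }
}
```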

Predefined Character Classes

. : Any character (matches line terminators only when DOTALL mode is enabled)
\d : A digit: [0-9]
\D : A non-digit: [^0-9]
\h : A horizontal whitespace character
\H : A non-horizontal whitespace character
\s : A whitespace character: [ \t\n\x0B\f\r]
\S : A non-whitespace character: [^\s]
\v : A vertical whitespace character
\V : A non-vertical whitespace character
\w : A word character: [a-zA-Z_0-9]
\W : A non-word character: [^\w]
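
A short sketch of the predefined classes in use. Note that inside a Java string literal each backslash must itself be escaped, so \d is written "\\d". The class and method names here are hypothetical, and \h assumes Java 8 or later.

```java
class PredefinedDemo {
    // Removes every non-digit character (\D).
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Splits on runs of whitespace (\s+) after trimming the ends.
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    // True if the input consists only of word characters (\w).
    static boolean isWordLike(String s) {
        return s.matches("\\w+");
    }

    // Collapses runs of horizontal whitespace (\h) into one space, leaving newlines alone.
    static String collapseHorizontal(String s) {
        return s.replaceAll("\\h+", " ");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("Tel: +1 (555) 010-9999")); // 15550109999
        System.out.println(words("  a\tb \n c ").length);         // 3
        System.out.println(isWordLike("snake_case1"));            // true
    }
}
```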

Boundary Matchers

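^ : The beginning of a line (only the beginning of the input unless MULTILINE mode is enabled)
$ : The end of a line (only the end of the input, or just before its final line terminator, unless MULTILINE mode is enabled)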
\b : A word boundary
\B : A non-word boundary
\A : The beginning of the input
\G : The end of the previous match
\Z : The end of the input but for the final terminator, if any
\z : The end of the input
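
The sketch below exercises a few of the boundary matchers: \b for whole-word search, ^ with the MULTILINE flag for per-line anchoring, and the difference between \Z and \z. The class and method names are hypothetical and chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Counts occurrences of `word` that stand alone as a whole word (\b on both sides).
    static int countWholeWord(String word, String text) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Counts lines that begin with `prefix`. With MULTILINE, ^ also matches
    // right after each line terminator, not just at the start of the input.
    static int countLinesStartingWith(String prefix, String text) {
        Matcher m = Pattern.compile("^" + Pattern.quote(prefix), Pattern.MULTILINE).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // True if the anchored regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(countWholeWord("cat", "cat concat cat's category")); // 2
        System.out.println(countLinesStartingWith("#", "# a\nb\n# c"));         // 2
        System.out.println(found("end\\Z", "the end\n")); // true: \Z ignores the final terminator
        System.out.println(found("end\\z", "the end\n")); // false: \z is the very end of input
    }
}
```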

Greedy Quantifiers

X? : X, once or not at all
X* : X, zero or more times
X+ : X, one or more times
X{n} : X, exactly n times
X{n,} : X, at least n times
X{n,m} : X, at least n but not more than m times
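
Greedy quantifiers consume as much input as they can while still allowing the overall match to succeed. A minimal sketch, using a hypothetical helper firstMatch that returns the first matched substring (or null if there is none):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first substring matched by `regex`, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("colou?r", "color colour")); // color
        System.out.println(firstMatch("a{2,3}", "aaaa"));          // aaa (takes the maximum of 3)
        System.out.println(firstMatch("\\d{4}", "year 2024!"));    // 2024
        System.out.println(firstMatch("<.+>", "<b>bold</b>"));     // <b>bold</b> (greedy: spans both tags)
    }
}
```

The last call shows the usual surprise with greedy matching: .+ runs to the final > rather than stopping at the first one.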

Logical Operators

XY : X followed by Y
X|Y : Either X or Y
(X) : X, as a capturing group
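
Concatenation, alternation, and capturing groups often appear together. The following sketch assumes a yyyy-MM-dd date layout and a small fixed set of image extensions purely for illustration; the names GroupDemo, parseDate, and isImageFile are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Three capturing groups: year, month, day.
    static final Pattern ISO_DATE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    // Returns {year, month, day}, or null if the input does not match.
    static int[] parseDate(String s) {
        Matcher m = ISO_DATE.matcher(s);
        if (!m.matches()) {
            return null;
        }
        // group(0) is the whole match; groups 1..3 are the parenthesized parts
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3))
        };
    }

    // Alternation inside a group: the name must end in .png, .jpg, or .gif.
    static boolean isImageFile(String name) {
        return name.matches(".+\\.(png|jpg|gif)");
    }

    public static void main(String[] args) {
        int[] d = parseDate("2024-03-15");
        System.out.println(d[0] + " " + d[1] + " " + d[2]); // 2024 3 15
        System.out.println(isImageFile("photo.jpg"));       // true
    }
}
```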

Precedence

The precedence of character-class operators is as follows, from highest to lowest:

Literal escape \x
Grouping […]
Range a-z
Union [a-e][i-u]
Intersection [a-z&&[aeiou]]
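
The practical effect of this ordering can be checked with a small sketch (the helper name matchesOne is hypothetical). Because intersection binds most loosely, [a-cx-z&&[b-y]] first forms the union of a-c and x-z and only then intersects it with b-y, leaving b, c, x, and y.

```java
import java.util.regex.Pattern;

class PrecedenceDemo {
    // Returns true if the whole input matches the given character class.
    static boolean matchesOne(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        // Union is applied before intersection: (a-c plus x-z) && b-y
        System.out.println(matchesOne("[a-cx-z&&[b-y]]", "b")); // true
        System.out.println(matchesOne("[a-cx-z&&[b-y]]", "a")); // false
        // Range: a-c is a range, so a literal '-' is not part of the class
        System.out.println(matchesOne("[a-c]", "-"));           // false
        // Literal escape: \[ and \] are plain characters inside the class
        System.out.println(matchesOne("[\\[\\]]", "["));        // true
    }
}
```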

Reference

Class Pattern (java.util.regex.Pattern), Java SE API documentation