## Introduction

In this article, we’ll show several algorithms for searching for a pattern in a text. The fundamental string searching (matching) problem is defined as follows: given two strings – a text and a pattern, determine whether the pattern appears in the text.

## Naive Method

For every position in the text, consider it a starting position of the pattern and see if you get a match.

// Naive Pattern Searching algorithm
// Number of comparisons in the worst case is O(m*(n-m+1))
void brute_force(char* pat, char* text)
{
int M = strlen(pat);
int N = strlen(text);

// A loop to slide pat[] one by one
for (int i = 0; i <= N - M; i++) {
int j;

// For current index i, check for pattern match
for (j = 0; j < M; j++) {
if (text[i + j] != pat[j]){
break;
}
}

// Pattern found
if (j == M){
cout << "Pattern found at index " << i << endl;
}
}
}

## Rabin Karp Algorithm

As mentioned above, naive search algorithm is very inefficient when patterns are long and when there is a lot of repeated elements of the pattern. Searching can be efficient using Rabin Karp Algorithm. This is actually the “naive” approach augmented with hash function.

The Rabin-Karp algorithm makes use of hash functions and the rolling hash technique. A rolling hash allows an algorithm to calculate a hash value without having the rehash the entire string. For example, when searching for a word in a text, as the algorithm shifts one letter to the right instead of having to calculate the hash of the section of text[i:j] as it shifts from text[i-1:j-1], the algorithm can use a rolling hash to do an operation on the hash to get the new hash from the old hash. Hashing can map data of arbitrary size to a value of fixed size. In general the idea seems quite simple, the only thing is that we need a hash function that gives different hashes for different sub-strings. This figure shows algorithm in action.

The Rabin-Karp Rolling Hash (H) is given by

H = c1a(k-1) + c2a(k-2) + c3a(k-3) + …. cka(0)

a is a constant, c1, .. ck are the input characters, k is the number of characters there are in the string we are comparing (this is the length of the word).

All the operations are done modulo a prime number M to avoid dealing with large numbers. For calculating hash we can imagine a window sliding over all the substrings in text. Calculating the hash value of the next substring only inspects two elements: the element leaving the window and the element entering the window. Finding the hash value of the next substring is now a O(1) operation. Hash function for base b and substring length L is given by

To overcome large number and range overflow, instead of the number H itself we use its remainder when divided by M.  Applying the basic rules of modular arithmetic to the above expression:

A + B = C => (A % M + B % M) % M = C % M
A * B = C => ((A % M) * (B % M)) % M = C % M
A – B = C => (A % M – B % M + k * M) % M = C % M

To avoid the mapping of two or more different strings to the same number (it is called a collision), choose M and a such that they are sufficiently large and are prime numbers.

Let’s use Rabin-Karp’s rolling hash to hash the alphabet. Here,  a will be 26, k will be 3, and c will represent the place in the alphabet where the character appears — so for “a”, c will be 1, for “z” c will be 26.

H("abc") = 1 * 26^2 + 2 * 26^1 + 3 * 26^0

// Rabin Karp Algorithm
void RabinKarp(char pat[], char text[], int M)
{
int M = strlen(pat);
int N = strlen(text);
int i, j;
int patHash = 0; // hash value for pattern
int txtHash = 0; // hash value for text
int H = 1;
int a = 256, // Number of characters in the input alphabet

// The value of H would be "pow(a, M-1)%M"
for (i = 0; i < M-1; i++)
H = (H*a)%M;

// Calculate the hash value of pattern and first window of pattern
for (i = 0; i < M; i++)
{
patHash = (a*patHash + pat[i])%M;
txtHash = (a*txtHash + text[i])%M;
}

// Slide the pattern over text one by one
for (i = 0; i <= N - M; i++)
{
// Check the hash values of current window of text and pattern.
if ( patHash == txtHash )
{
// Check for characters on by one for suprious match
for (j = 0; j < M; j++)
{
if (text[i+j] != pat[j])
break;
}

if (j == M)
printf("Pattern found at index %a \n", i);
}

// Calculate hash value for next window of text
}