## Introduction

In this article, we’ll show several algorithms for searching for a pattern in a text. The fundamental string searching (matching) problem is defined as follows: given two strings – a text and a pattern, determine whether the pattern appears in the text.

## Naive Method

For every position in the text, consider it a starting position of the pattern and see if you get a match.

// Naive Pattern Searching algorithm // Number of comparisons in the worst case is O(m*(n-m+1)) void brute_force(char* pat, char* text) { int M = strlen(pat); int N = strlen(text); // A loop to slide pat[] one by one for (int i = 0; i <= N - M; i++) { int j; // For current index i, check for pattern match for (j = 0; j < M; j++) { if (text[i + j] != pat[j]){ break; } } // Pattern found if (j == M){ cout << "Pattern found at index " << i << endl; } } }

## Rabin Karp Algorithm

As mentioned above, naive search algorithm is very inefficient when patterns are long and when there is a lot of repeated elements of the pattern. Searching can be efficient using Rabin Karp Algorithm. This is actually the “naive” approach augmented with hash function.

The Rabin-Karp algorithm makes use of hash functions and the rolling hash technique. A rolling hash allows an algorithm to calculate a hash value without having the rehash the entire string. For example, when searching for a word in a text, as the algorithm shifts one letter to the right instead of having to calculate the hash of the section of text[i:j] as it shifts from text[i-1:j-1], the algorithm can use a rolling hash to do an operation on the hash to get the new hash from the old hash. Hashing can map data of arbitrary size to a value of fixed size. In general the idea seems quite simple, the only thing is that we need a hash function that gives different hashes for different sub-strings. This figure shows algorithm in action.

The Rabin-Karp Rolling Hash (H) is given by

H = c_{1}a^{(k-1)} + c_{2}a^{(k-2)} + c_{3}a^{(k-3)} + …. c_{k}a^{(0)}

a is a constant, c_{1}, .. c_{k} are the input characters, k is the number of characters there are in the string we are comparing (this is the length of the word).

All the operations are done modulo a prime number **M** to avoid dealing with large numbers. For calculating hash we can imagine a window sliding over all the substrings in text. Calculating the hash value of the next substring only inspects two elements: the element leaving the window and the element entering the window. Finding the hash value of the next substring is now a O(1) operation. Hash function for base b and substring length L is given by

To overcome large number and range overflow, instead of the number H itself we use its remainder when divided by M. Applying the basic rules of modular arithmetic to the above expression:

A + B = C => (A % M + B % M) % M = C % M A * B = C => ((A % M) * (B % M)) % M = C % M A – B = C => (A % M – B % M + k * M) % M = C % M

To avoid the mapping of two or more different strings to the same number (it is called a collision), choose M and a such that they are sufficiently large and are prime numbers.

Let’s use Rabin-Karp’s rolling hash to hash the alphabet. Here, a will be 26, k will be 3, and c will represent the place in the alphabet where the character appears — so for “a”, c will be 1, for “z” c will be 26.

H("abc") = 1 * 26^2 + 2 * 26^1 + 3 * 26^0

// Rabin Karp Algorithm void RabinKarp(char pat[], char text[], int M) { int M = strlen(pat); int N = strlen(text); int i, j; int patHash = 0; // hash value for pattern int txtHash = 0; // hash value for text int H = 1; int a = 256, // Number of characters in the input alphabet // The value of H would be "pow(a, M-1)%M" for (i = 0; i < M-1; i++) H = (H*a)%M; // Calculate the hash value of pattern and first window of pattern for (i = 0; i < M; i++) { patHash = (a*patHash + pat[i])%M; txtHash = (a*txtHash + text[i])%M; } // Slide the pattern over text one by one for (i = 0; i <= N - M; i++) { // Check the hash values of current window of text and pattern. if ( patHash == txtHash ) { // Check for characters on by one for suprious match for (j = 0; j < M; j++) { if (text[i+j] != pat[j]) break; } if (j == M) printf("Pattern found at index %a \n", i); } // Calculate hash value for next window of text // Remove leading digit, add trailing digit if ( i < N-M ) { txtHash = (a*(txtHash - text[i]*H) + text[i+M])%M; // Converting to positive if (txtHash < 0){ txtHash = (txtHash + M); } } } }