Introduction

In this article, we’ll show several algorithms for searching for a pattern in a text. The fundamental string searching (matching) problem is defined as follows: given two strings – a text and a pattern, determine whether the pattern appears in the text.

Naive Method

For every position in the text, consider it a starting position of the pattern and see if you get a match.

// Naive Pattern Searching algorithm
// Number of comparisons in the worst case is O(m*(n-m+1))
void brute_force(char* pat, char* text) 
{ 
  int M = strlen(pat); 
  int N = strlen(text); 

  // A loop to slide pat[] one by one
  for (int i = 0; i <= N - M; i++) { 
    int j; 

    // For current index i, check for pattern match
    for (j = 0; j < M; j++) {
    	if (text[i + j] != pat[j]){
       	break; 
      }          
    }    

    // Pattern found 
    if (j == M){
    	cout << "Pattern found at index " << i << endl; 
    }
  } 
}

 

Rabin Karp Algorithm

As mentioned above, naive search algorithm is very inefficient when patterns are long and when there is a lot of repeated elements of the pattern. Searching can be efficient using Rabin Karp Algorithm. This is actually the “naive” approach augmented with hash function.

The Rabin-Karp algorithm makes use of hash functions and the rolling hash technique. A rolling hash allows an algorithm to calculate a hash value without having the rehash the entire string. For example, when searching for a word in a text, as the algorithm shifts one letter to the right instead of having to calculate the hash of the section of text[i:j] as it shifts from text[i-1:j-1], the algorithm can use a rolling hash to do an operation on the hash to get the new hash from the old hash. Hashing can map data of arbitrary size to a value of fixed size. In general the idea seems quite simple, the only thing is that we need a hash function that gives different hashes for different sub-strings. This figure shows algorithm in action.

 

The Rabin-Karp Rolling Hash (H) is given by

H = c1a(k-1) + c2a(k-2) + c3a(k-3) + …. cka(0)

a is a constant, c1, .. ck are the input characters, k is the number of characters there are in the string we are comparing (this is the length of the word).

All the operations are done modulo a prime number M to avoid dealing with large numbers. For calculating hash we can imagine a window sliding over all the substrings in text. Calculating the hash value of the next substring only inspects two elements: the element leaving the window and the element entering the window. Finding the hash value of the next substring is now a O(1) operation. Hash function for base b and substring length L is given by

 

To overcome large number and range overflow, instead of the number H itself we use its remainder when divided by M.  Applying the basic rules of modular arithmetic to the above expression:

A + B = C => (A % M + B % M) % M = C % M
A * B = C => ((A % M) * (B % M)) % M = C % M
A – B = C => (A % M – B % M + k * M) % M = C % M

To avoid the mapping of two or more different strings to the same number (it is called a collision), choose M and a such that they are sufficiently large and are prime numbers.

Let’s use Rabin-Karp’s rolling hash to hash the alphabet. Here,  a will be 26, k will be 3, and c will represent the place in the alphabet where the character appears — so for “a”, c will be 1, for “z” c will be 26.

H("abc") = 1 * 26^2 + 2 * 26^1 + 3 * 26^0

 

// Rabin Karp Algorithm
void RabinKarp(char pat[], char text[], int M) 
{ 
  int M = strlen(pat); 
  int N = strlen(text); 
  int i, j; 
  int patHash = 0; // hash value for pattern 
  int txtHash = 0; // hash value for text 
  int H = 1; 
  int a = 256, // Number of characters in the input alphabet 

  // The value of H would be "pow(a, M-1)%M" 
  for (i = 0; i < M-1; i++) 
    H = (H*a)%M; 

  // Calculate the hash value of pattern and first window of pattern 
  for (i = 0; i < M; i++) 
  { 
    patHash = (a*patHash + pat[i])%M; 
    txtHash = (a*txtHash + text[i])%M; 
  } 

  // Slide the pattern over text one by one 
  for (i = 0; i <= N - M; i++) 
  { 
    // Check the hash values of current window of text and pattern.
    if ( patHash == txtHash ) 
    { 
      // Check for characters on by one for suprious match 
      for (j = 0; j < M; j++) 
      { 
        if (text[i+j] != pat[j]) 
          break; 
      } 

      if (j == M) 
        printf("Pattern found at index %a \n", i); 
    } 

    // Calculate hash value for next window of text
    // Remove leading digit, add trailing digit 
    if ( i < N-M ) 
    { 
      txtHash = (a*(txtHash - text[i]*H) + text[i+M])%M; 

      // Converting to positive 
      if (txtHash < 0){
      	txtHash = (txtHash + M); 
      }      
    } 
  } 
}