String Matching: An Explanation of the Rabin-Karp Algorithm

Problem description:

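Rabin-Karp needs O(m) preprocessing time and has a worst-case matching time of O((n - m + 1)m). That is the same matching time as the naive algorithm, with extra preprocessing on top, so why study it at all? Although Rabin-Karp is no better than naive matching in the worst case, in practice it is usually much faster: its expected matching time is O(n) when there are few valid matches and the modulus is at least as large as m (see Introduction to Algorithms). On the other hand, Rabin-Karp has to do arithmetic at every shift, so it is generally not faster than KMP. Why learn it once we have KMP? In my view, what we learn is an idea, a way of approaching a problem. The more techniques we have seen, the wider our perspective, and the easier it is to pick a suitable algorithm for a real problem. Two-dimensional pattern matching, for example, is a case where Rabin-Karp is a good choice.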

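Rabin-Karp is also an interesting algorithm because it treats characters as digits. The basic idea: if Tm is a substring of T of length |P|, and its numeric value modulo some number (usually a prime) equals the numeric value of the pattern P modulo the same number, then Tm may be a valid match.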
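As a minimal sketch of this idea, the C function below interprets a string of decimal digits as a number and reduces it modulo q. The name value_mod, the restriction to the digits '0' to '9', and the sample modulus 13 used in the note afterwards are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: interpret the first len characters of s as a
   decimal number and return its value modulo q.
   Assumes s holds only '0'..'9' and that 10 * q fits in an int. */
int value_mod(const char *s, size_t len, int q)
{
    assert(q > 0);
    int v = 0;
    for (size_t i = 0; i < len; i++)
        v = (10 * v + (s[i] - '0')) % q;   /* reduce at every step */
    return v;
}
```

For instance, value_mod("31415", 5, 13) evaluates to 7. Two windows with different residues cannot be equal, while equal residues only mark a candidate: different strings can share a residue, which is why the text says "may be" a valid match.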

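Like the naive string-matching algorithm introduced earlier, Rabin-Karp examines every shift of the pattern against the text. The difference is that it preprocesses the characters: each window is interpreted as a number in some base and reduced modulo q, much like applying a hash function, and the comparison is made between these hash values rather than character by character. Preprocessing takes O(m) time, and matching takes O((n - m + 1)m) time in the worst case.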

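The idea of the Rabin-Karp algorithm:

Let the pattern have length M and the text have length N (N >= M);

First compute the hash of the pattern and the hash of the first M characters of the text;

Compare the two hash values; at most N - M + 1 such comparisons are made:

If the hashes differ, go on to compute the hash of the next length-M substring of the text;

If the hashes are equal, use the naive algorithm to check whether the two strings are really identical.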

Let p be the decimal value of the pattern P[1 . . m], and let $t_s$ be the decimal value of the length-m window T[s + 1 . . s + m]. We can compute p in time O(m) using Horner's rule (see Section 32.1):

$p = P[m] + 10\,(P[m-1] + 10\,(P[m-2] + \cdots + 10\,(P[2] + 10\,P[1]) \cdots ))$.

The value $t_0$ can be similarly computed from T[1 . . m] in time O(m).

To compute the remaining values $t_1, t_2, \ldots, t_{n-m}$ in time O(n - m), it suffices to observe that $t_{s+1}$ can be computed from $t_s$ in constant time, since

$t_{s+1} = 10\,(t_s - 10^{m-1}\,T[s+1]) + T[s+m+1]$.    (34.1)

For example, if m = 5 and $t_s$ = 31415, then we wish to remove the high-order digit T[s + 1] = 3 and bring in the new low-order digit (suppose it is T[s + 5 + 1] = 2) to obtain

$t_{s+1} = 10\,(31415 - 10000 \cdot 3) + 2 = 14152$.

http://net.pku.edu.cn/~course/cs101/2007/resource/Intro2Algorithm/book6/chap34.htm
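The two steps in the excerpt, Horner's rule and the constant-time window update, can be sketched in C as follows. The function names horner_value and roll_decimal are hypothetical, and the sketch assumes decimal-digit input and a window short enough that the value fits in a long.

```c
#include <assert.h>
#include <stddef.h>

/* Decimal value of the first m digits of s, evaluated with Horner's rule.
   Assumes s holds only '0'..'9' and that the result fits in a long. */
long horner_value(const char *s, size_t m)
{
    long v = 0;
    for (size_t i = 0; i < m; i++)
        v = 10 * v + (s[i] - '0');
    return v;
}

/* Constant-time update: drop the high-order digit `out` and append the
   low-order digit `in`. `high` must equal 10^(m-1). */
long roll_decimal(long ts, long high, int out, int in)
{
    assert(out >= 0 && out <= 9 && in >= 0 && in <= 9);
    return 10 * (ts - high * out) + in;
}
```

With the numbers from the example, roll_decimal(31415, 10000, 3, 2) gives 14152, the same value that horner_value("14152", 5) computes from scratch.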

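The approach above is simple, but without a modulus the values overflow: it already gives wrong results once the pattern P reaches a length of about 7, and declaring t and p as long unsigned int does not solve the underlying problem. In other words, the unreduced computation is of little practical use.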

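Define the hash of a length-m string C = c1c2...cm as H(C) = (c1*b^(m-1) + c2*b^(m-2) + ... + cm*b^0) mod h. Here b is the base, which amounts to treating the string as a number in base b, and h is the modulus. For a string S = s1s2s3...sn, the hash of the length-m substring S[k+1...k+m] starting at position k+1 can then be computed directly from the hash of the substring S[k...k+m-1] starting at position k:

H(S[k+1...k+m]) = (H(S[k...k+m-1]) * b - sk * b^m + s(k+m)) mod h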

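The difficulty with this algorithm is that p and t may become very large, which makes them awkward to work with. There is a simple remedy: compute p and t modulo a suitable number q. Each character is really just a decimal digit, so p, t, and the recurrence can all be computed modulo q. The value of p modulo q can therefore be computed in O(m) time, and all the t values modulo q in O(n - m + 1) time. See Introduction to Algorithms or http://net.pku.edu.cn/~course/cs101/2007/resource/Intro2Algorithm/book6/chap34.htm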

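The recurrence is:

$t_{s+1} = (d\,(t_s - T[s+1]\,h) + T[s+m+1]) \bmod q$

where $h \equiv d^{m-1} \pmod q$ is the value of a 1 in the high-order position of an m-digit window.

For example, with d = 10 (decimal), m = 5 and $t_s$ = 31415, we want to remove the high-order digit T[s + 1] = 3 and bring in a new low-order digit (say T[s + 5 + 1] = 2), which gives:

$t_{s+1} = 10\,(31415 - 10000 \cdot 3) + 2 = 14152$

Working modulo q = 13 instead, 31415 ≡ 7 and h = 10000 mod 13 = 3, so $t_{s+1} \equiv 10\,(7 - 3 \cdot 3) + 2 \equiv 8 \pmod{13}$, which agrees with 14152 mod 13 = 8.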
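One step of this recurrence might look as follows in C. This is only a sketch: the name roll_mod and its parameter order are assumptions, and the intermediate product is assumed to fit in a long long.

```c
#include <assert.h>

/* One step of t_{s+1} = (d * (t_s - out * h) + in) mod q.
   ts  : current window hash, in [0, q)
   h   : d^(m-1) mod q
   out : digit leaving the window, in : digit entering it
   Assumes the intermediate d * (out * h) fits in a long long. */
int roll_mod(int ts, int h, int d, int q, int out, int in)
{
    assert(q > 0);
    long long t = ((long long)d * (ts - (long long)out * h) + in) % q;
    if (t < 0)      /* the % operator in C can yield a negative remainder */
        t += q;
    return (int)t;
}
```

With d = 10, q = 13, ts = 7 and h = 3, removing the digit 3 and appending the digit 2 gives an intermediate value of -18, which the final correction maps to 8. The subtraction can go negative whenever out * h exceeds ts, so the sign fix is needed in general, not just in this example.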

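Thus, by repeatedly computing the hash of the substring whose start is shifted one position to the right, the hashes for all positions can be obtained in O(n) time, and string matching can be completed in O(n + m) time, provided a hash match is accepted without further verification. In an implementation, the hash can be computed with 64-bit unsigned integers, taking h = 2^64, so that natural overflow replaces the explicit modulo operation.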

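#include <string>
using std::string;

typedef unsigned long long ull;

// Hash base (100000007 is an assumed value)
const ull b = 100000007ULL;

// Does C occur in S? A hash match is accepted without re-checking the
// characters, so a collision can produce a false positive.
bool contain(const string &C, const string &S)
{
     int m = C.length(), n = S.length();
     if (m > n)  return false;

     // Compute b to the power m
     ull t = 1;
     for (int i = 0; i < m; i++)   t *= b;

     // Compute the hashes of C and of the length-m prefix of S
     ull Chash = 0, Shash = 0;
     for (int i = 0; i < m; i++)   Chash = Chash * b + C[i];
     for (int i = 0; i < m; i++)   Shash = Shash * b + S[i];

     // Shift the window over S one position at a time, updating the hash
     for (int i = 0; i + m <= n; i++) {
          if (Chash == Shash)  return true; // S[i .. i+m-1] is taken to equal C
          if (i + m < n)  Shash = Shash * b - S[i] * t + S[i + m];
     }
     return false;
}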
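The function above treats equal hashes as a match. As a hedged sketch of how a verification step could be added, the C variant below confirms each hash match with memcmp and returns the position of the first occurrence. The name find_verified, the return convention (-1 for no match), and the base value are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the index of the first occurrence of pat in txt, or -1.
   Hashes are computed modulo 2^64 through unsigned overflow; a hash match
   is confirmed with memcmp, so a collision cannot cause a false positive. */
long find_verified(const char *pat, const char *txt)
{
    assert(pat != NULL && txt != NULL);
    const unsigned long long base = 100000007ULL;  /* assumed base */
    size_t m = strlen(pat), n = strlen(txt);
    if (m > n)
        return -1;

    unsigned long long top = 1;        /* base^m modulo 2^64 */
    unsigned long long ph = 0, th = 0; /* pattern hash, window hash */
    for (size_t i = 0; i < m; i++) {
        top *= base;
        ph = ph * base + (unsigned char)pat[i];
        th = th * base + (unsigned char)txt[i];
    }

    for (size_t i = 0; i + m <= n; i++) {
        if (ph == th && memcmp(pat, txt + i, m) == 0)
            return (long)i;
        if (i + m < n)  /* slide the window one character to the right */
            th = th * base - (unsigned char)txt[i] * top
                 + (unsigned char)txt[i + m];
    }
    return -1;
}
```

The extra memcmp costs O(m) only when the hashes agree, so the expected running time stays close to O(n + m) as long as collisions are rare.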

Rolling hash (Rabin-Karp algorithm)

hash( txt[s+1 .. s+m] ) = ( d ( hash( txt[s .. s+m-1]) - txt[s]*h ) + txt[s + m] ) mod q

hash( txt[s .. s+m-1] ) : Hash value at shift s.
hash( txt[s+1 .. s+m] ) : Hash value at next shift (or shift s+1)
d: Number of characters in the alphabet
q: A prime number
h: d^(m-1)

/* Following program is a C implementation of the Rabin-Karp
   algorithm given in the CLRS book */
#include <stdio.h>
#include <string.h>

// d is the number of characters in the input alphabet
#define d 256

/* pat -> pattern
   txt -> text
   q   -> a prime number
*/
void search(char pat[], char txt[], int q)
{
    int M = strlen(pat);
    int N = strlen(txt);
    int i, j;
    int p = 0; // hash value for pattern
    int t = 0; // hash value for txt
    int h = 1;

    // A pattern longer than the text cannot match
    if (M > N)
        return;

    // The value of h would be "pow(d, M-1)%q"
    for (i = 0; i < M-1; i++)
        h = (h*d)%q;

    // Calculate the hash value of pattern and first
    // window of text
    for (i = 0; i < M; i++)
    {
        p = (d*p + pat[i])%q;
        t = (d*t + txt[i])%q;
    }

    // Slide the pattern over text one by one
    for (i = 0; i <= N - M; i++)
    {
        // Check the hash values of current window of text
        // and pattern. Only if the hash values match,
        // check the characters one by one
        if ( p == t )
        {
            /* Check for characters one by one */
            for (j = 0; j < M; j++)
            {
                if (txt[i+j] != pat[j])
                    break;
            }

            // if p == t and pat[0...M-1] = txt[i, i+1, ...i+M-1]
            if (j == M)
                printf("Pattern found at index %d \n", i);
        }

        // Calculate hash value for next window of text: Remove
        // leading digit, add trailing digit
        if ( i < N-M )
        {
            t = (d*(t - txt[i]*h) + txt[i+M])%q;

            // We might get negative value of t, converting it
            // to positive
            if (t < 0)
                t = (t + q);
        }
    }
}

/* Driver program to test above function */
int main()
{
    char txt[] = "GEEKS FOR GEEKS";
    char pat[] = "GEEK";
    int q = 101; // A prime number (101 is an assumed example value)
    search(pat, txt, q);
    return 0;
}
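To see why the character-by-character check after a hash match matters, and how the choice of q affects it, the sketch below counts valid matches and spurious hits separately. It is a hypothetical helper, not part of the program above: the name count_hits is an assumption, and it is restricted to strings of decimal digits with d = 10 so that the numbers are easy to follow by hand.

```c
#include <assert.h>
#include <string.h>

/* Counts valid matches of pat in txt and, through *spurious, the windows
   whose hash equals the pattern hash although the strings differ.
   Assumes both strings hold only '0'..'9' and that 100 * q fits in an int. */
int count_hits(const char *pat, const char *txt, int q, int *spurious)
{
    assert(q > 0 && spurious != NULL);
    int m = (int)strlen(pat), n = (int)strlen(txt);
    int p = 0, t = 0, h = 1, valid = 0;
    *spurious = 0;
    if (m == 0 || m > n)
        return 0;

    for (int i = 0; i < m - 1; i++)
        h = (h * 10) % q;                  /* h = 10^(m-1) mod q */
    for (int i = 0; i < m; i++) {
        p = (10 * p + (pat[i] - '0')) % q;
        t = (10 * t + (txt[i] - '0')) % q;
    }

    for (int i = 0; i <= n - m; i++) {
        if (p == t) {
            if (memcmp(pat, txt + i, (size_t)m) == 0)
                valid++;
            else
                (*spurious)++;             /* same hash, different string */
        }
        if (i < n - m) {
            t = (10 * (t - (txt[i] - '0') * h) + (txt[i + m] - '0')) % q;
            if (t < 0)
                t += q;
        }
    }
    return valid;
}
```

For the illustrative text "2359023141526739921" and pattern "31415", q = 13 yields one valid match and one spurious hit (the window 67399, which also reduces to 7), while q = 101 yields the same valid match and no spurious hits. A larger modulus makes collisions rarer, but the explicit comparison is still what guarantees correctness.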

Reference: http://www.geeksforgeeks.org/archives/11937

Reference: http://net.pku.edu.cn/~course/cs101/2007/resource/Intro2Algorithm/book6/chap34.htm

http://www.cnblogs.com/feature/articles/1813967.html (translation of the PKU