The Knuth-Morris-Pratt Algorithm in my own words(转)
For the past few days, I’ve been reading various explanations of the Knuth-Morris-Pratt string searching algorithms. For some reason, none of the explanations were doing it for me. I kept banging my head against a brick wall once I started reading “the prefix of the suffix of the prefix of the…”.
Finally, after reading the same paragraph of CLRS over and over for about 30 minutes, I decided to sit down, do a bunch of examples, and diagram them out. I now understand the algorithm, and can explain it. For those who think like me, here it is in my own words. As a side note, I’m not going to explain why it’s more efficient than na”ive string matching; that’s explained perfectly well in a multitude of places. I’m going to explain exactly how it works, as my brain understands it.
The Partial Match Table
The key to KMP, of course, is the partial match table. The main obstacle between me and understanding KMP was the fact that I didn’t quite fully grasp what the values in the partial match table really meant. I will now try to explain them in the simplest words possible.
Here’s the partial match table for the pattern “abababca”:
1 |
|
If I have an eight-character pattern (let’s say “abababca” for the duration of this example), my partial match table will have eight cells. If I’m looking at the eighth and last cell in the table, I’m interested in the entire pattern (“abababca”). If I’m looking at the seventh cell in the table, I’m only interested in the first seven characters in the pattern (“abababc”); the eighth one (“a”) is irrelevant, and can go fall off a building or something. If I’m looking at the sixth cell of the in the table… you get the idea. Notice that I haven’t talked about what each cell means yet, but just what it’s referring to.
Now, in order to talk about the meaning, we need to know about proper prefixes and proper suffixes.
Proper prefix: All the characters in a string, with one or more cut off the end. “S”, “Sn”, “Sna”, and “Snap” are all the proper prefixes of “Snape”.
Proper suffix: All the characters in a string, with one or more cut off the beginning. “agrid”, “grid”, “rid”, “id”, and “d” are all proper suffixes of “Hagrid”.
With this in mind, I can now give the one-sentence meaning of the values in the partial match table:
The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.
Let’s examine what I mean by that. Say we’re looking in the third cell. As you’ll remember from above, this means we’re only interested in the first three characters (“aba”). In “aba”, there are two proper prefixes (“a” and “ab”) and two proper suffixes (“a” and “ba”). The proper prefix “ab” does not match either of the two proper suffixes. However, the proper prefix “a” matches the proper suffix “a”. Thus, the length of the longest proper prefix that matches a proper suffix, in this case, is 1.
Let’s try it for cell four. Here, we’re interested in the first four characters (“abab”). We have three proper prefixes (“a”, “ab”, and “aba”) and three proper suffixes (“b”, “ab”, and “bab”). This time, “ab” is in both, and is two characters long, so cell four gets value 2.
Just because it’s an interesting example, let’s also try it for cell five, which concerns “ababa”. We have four proper prefixes (“a”, “ab”, “aba”, and “abab”) and four proper suffixes (“a”, “ba”, “aba”, and “baba”). Now, we have two matches: “a” and “aba” are both proper prefixes and proper suffixes. Since “aba” is longer than “a”, it wins, and cell five gets value 3.
Let’s skip ahead to cell seven (the second-to-last cell), which is concerned with the pattern “abababc”. Even without enumerating all the proper prefixes and suffixes, it should be obvious that there aren’t going to be any matches; all the suffixes will end with the letter “c”, and none of the prefixes will. Since there are no matches, cell seven gets 0.
Finally, let’s look at cell eight, which is concerned with the entire pattern (“abababca”). Since they both start and end with “a”, we know the value will be at least 1. However, that’s where it ends; at lengths two and up, all the suffixes contain a c, while only the last prefix (“abababc”) does. This seven-character prefix does not match the seven-character suffix (“bababca”), so cell eight gets 1.
How to use the Partial Match Table
We can use the values in the partial match table to skip ahead (rather than redoing unnecessary old comparisons) when we find partial matches. The formula works like this:
If a partial match of length partial_match_length is found and table[partial_match_length] > 1, we may skip ahead partial_match_length - table[partial_match_length - 1] characters.
Let’s say we’re matching the pattern “abababca” against the text “bacbababaabcbab”. Here’s our partial match table again for easy reference:
1 |
|
The first time we get a partial match is here:
1 |
|
This is a partial_match_length of 1. The value at table[partial_match_length - 1] (or table[0]) is 0, so we don’t get to skip ahead any. The next partial match we get is here:
1 |
|
This is a partial_match_length of 5. The value at table[partial_match_length - 1] (or table[4]) is 3. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 5 - table[4] or 5 - 3 or 2) characters:
1 |
|
This is a partial_match_length of 3. The value at table[partial_match_length - 1] (or table[2]) is 1. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 3 - table[2] or 3 - 1 or 2) characters:
1 |
|
At this point, our pattern is longer than the remaining characters in the text, so we know there’s no match.
Conclusion
So there you have it. Like I promised before, it’s no exhaustive explanation or formal proof of KMP; it’s a walk through my brain, with the parts I found confusing spelled out in extreme detail. If you have any questions or notice something I messed up, please leave a comment; maybe we’ll all learn something.
// C++ program for implementation of KMP pattern searching
// algorithm
#include <bits/stdc++.h> void computeLPSArray(char* pat, int M, int* lps); // Prints occurrences of txt[] in pat[]
void KMPSearch(char* pat, char* txt)
{
int M = strlen(pat);
int N = strlen(txt); // create lps[] that will hold the longest prefix suffix
// values for pattern
int lps[M]; // Preprocess the pattern (calculate lps[] array)
computeLPSArray(pat, M, lps); int i = 0; // index for txt[]
int j = 0; // index for pat[]
while (i < N) {
if (pat[j] == txt[i]) {
j++;
i++;
} if (j == M) {
printf("Found pattern at index %d ", i - j);
j = lps[j - 1];
} // mismatch after j matches
else if (i < N && pat[j] != txt[i]) {
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if (j != 0)
j = lps[j - 1];
else
i = i + 1;
}
}
} // Fills lps[] for given patttern pat[0..M-1]
void computeLPSArray(char* pat, int M, int* lps)
{
// length of the previous longest prefix suffix
int len = 0; lps[0] = 0; // lps[0] is always 0 // the loop calculates lps[i] for i = 1 to M-1
int i = 1;
while (i < M) {
if (pat[i] == pat[len]) {
len++;
lps[i] = len;
i++;
}
else // (pat[i] != pat[len])
{
// This is tricky. Consider the example.
// AAACAAAA and i = 7. The idea is similar
// to search step.
if (len != 0) {
len = lps[len - 1]; // Also, note that we do not increment
// i here
}
else // if (len == 0)
{
lps[i] = 0;
i++;
}
}
}
} // Driver program to test above function
int main()
{
char txt[] = "ABABDABACDABABCABAB";
char pat[] = "ABABCABAB";
KMPSearch(pat, txt);
return 0;
}
The Knuth-Morris-Pratt Algorithm in my own words(转)的更多相关文章
- 我所理解的 KMP(Knuth–Morris–Pratt) 算法
假设要在 haystack 中匹配 needle . 要理解 KMP 先需要理解两个概念 proper prefix 和 proper suffix,由于找到没有合适的翻译,暂时分别称真实前缀 和 真 ...
- 字符串匹配算法--KMP字符串搜索(Knuth–Morris–Pratt string-searching)C语言实现与讲解
一.前言 在计算机科学中,Knuth-Morris-Pratt字符串查找算法(简称为KMP算法)可在一个主文本字符串S内查找一个词W的出现位置.此算法通过运用对这个词在不匹配时本身就包含足够的信息 ...
- 笔试算法题(52):简介 - KMP算法(D.E. Knuth, J.H. Morris, V.R. Pratt Algorithm)
议题:KMP算法(D.E. Knuth, J.H. Morris, V.R. Pratt Algorithm) 分析: KMP算法用于在一个主串中找出特定的字符或者模式串.现在假设主串为长度n的数组T ...
- Aho - Corasick string matching algorithm
Aho - Corasick string matching algorithm 俗称:多模式匹配算法,它是对 Knuth - Morris - pratt algorithm (单模式匹配算法) 形 ...
- GO语言的开源库
Indexes and search engines These sites provide indexes and search engines for Go packages: godoc.org ...
- Go语言(golang)开源项目大全
转http://www.open-open.com/lib/view/open1396063913278.html内容目录Astronomy构建工具缓存云计算命令行选项解析器命令行工具压缩配置文件解析 ...
- 一个字符串搜索的Aho-Corasick算法
Aho和Corasick对KMP算法(Knuth–Morris–Pratt algorithm)进行了改进,Aho-Corasick算法(Aho-Corasick algorithm)利用构建树,总时 ...
- [转]Go语言(golang)开源项目大全
内容目录 Astronomy 构建工具 缓存 云计算 命令行选项解析器 命令行工具 压缩 配置文件解析器 控制台用户界面 加密 数据处理 数据结构 数据库和存储 开发工具 分布式/网格计算 文档 编辑 ...
- go语言项目汇总
Horst Rutter edited this page 7 days ago · 529 revisions Indexes and search engines These sites prov ...
- Golang优秀开源项目汇总, 10大流行Go语言开源项目, golang 开源项目全集(golang/go/wiki/Projects), GitHub上优秀的Go开源项目
Golang优秀开源项目汇总(持续更新...)我把这个汇总放在github上了, 后面更新也会在github上更新. https://github.com/hackstoic/golang-open- ...
随机推荐
- 在线编辑代码[django]版本
再国内,做什么都这么吃力.连aliyun 的ssh 都被封这是什么世道,所以做一个在线编辑代码的忙忙碌碌有点粗糙.大家见谅1. [代码]views.py #-*- coding:utf-8 -*- ...
- mysql一些操作
1.产生随机数 SELECT rand(); 随机一个0-1的随机数. 2.截取小数点后 SELECT FORMAT(RAND()*10000,2); 在返回 1,306.47 数据中“,” SE ...
- Python-单元测试unittest
Python中有一个自带的单元测试框架是unittest模块,用它来做单元测试,它里面封装好了一些校验返回的结果方法和一些用例执行前的初始化操作,概念见下: TestCase 也就是测试用例 Test ...
- Windows cmd findstr
/********************************************************************************** * Windows cmd fi ...
- 使用pjsip传输已经编码的视频,源码在github
pjsip功能很强,做sip rtp语音通话库首选.在2.0之后,也支持视频.不过,它的视频功能缺省是从视频设备采集,然后进行编译,再发送出去的.假设,我们已经有了视频源,比如IP摄像机,不需要采集和 ...
- Arc073_F Many Moves
传送门 题目大意 有$n$个格子从左到右依次挨着,一开始有两枚棋子分布在$A,B$某一个或两个格子里,有$m$个操作,第$i$次操作要求你把其中一个棋子移到$X_i$上,移动一个棋子的代价是两个格子之 ...
- [Codeforces 1139D] Steps to One
[题目链接] https://codeforces.com/contest/1139/problem/D [算法] 考虑dp 设fi表示现在gcd为i , 期望多少次gcd变为1 显然 , fi = ...
- C/C++面试题总结(1)
首先说一下,这些东西,有的是必须掌握的,有的是面试时你讲出来就是闪光点.自己把握.把握不好的都搞懂.实在不行背下来. 由于时间关系,总结的比较随意,有的就直接贴链接了,希望理解一下. 第一篇:基础(必 ...
- debian软件安装和卸载
用root身份执行如下命令: dpkg -l |grep "^rc"|awk '{print $2}' |xargs aptitude -y purge dpkg是Debian P ...
- findBug 错误修改指南
1. EC_UNRELATED_TYPESBug: Call to equals() comparing different types Pattern id: EC_UNRELATED_TYPE ...