The Knuth-Morris-Pratt Algorithm in my own words

For the past few days, I’ve been reading various explanations of the Knuth-Morris-Pratt string searching algorithms. For some reason, none of the explanations were doing it for me. I kept banging my head against a brick wall once I started reading “the prefix of the suffix of the prefix of the…”.

Finally, after reading the same paragraph of CLRS over and over for about 30 minutes, I decided to sit down, do a bunch of examples, and diagram them out. I now understand the algorithm, and can explain it. For those who think like me, here it is in my own words. As a side note, I’m not going to explain why it’s more efficient than na”ive string matching; that’s explained perfectly well in a multitude of places. I’m going to explain exactly how it works, as my brain understands it.

The Partial Match Table

The key to KMP, of course, is the partial match table. The main obstacle between me and understanding KMP was the fact that I didn’t quite fully grasp what the values in the partial match table really meant. I will now try to explain them in the simplest words possible.

Here’s the partial match table for the pattern “abababca”:

char:  | a | b | a | b | a | b | c | a |

index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

If I have an eight-character pattern (let’s say “abababca” for the duration of this example), my partial match table will have eight cells. If I’m looking at the eighth and last cell in the table, I’m interested in the entire pattern (“abababca”). If I’m looking at the seventh cell in the table, I’m only interested in the first seven characters in the pattern (“abababc”); the eighth one (“a”) is irrelevant, and can go fall off a building or something. If I’m looking at the sixth cell of the in the table… you get the idea. Notice that I haven’t talked about what each cell means yet, but just what it’s referring to.

Now, in order to talk about the meaning, we need to know about proper prefixes and proper suffixes.

Proper prefix: All the characters in a string, with one or more cut off the end. “S”, “Sn”, “Sna”, and “Snap” are all the proper prefixes of “Snape”.

Proper suffix: All the characters in a string, with one or more cut off the beginning. “agrid”, “grid”, “rid”, “id”, and “d” are all proper suffixes of “Hagrid”.

With this in mind, I can now give the one-sentence meaning of the values in the partial match table:

The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.

Let’s examine what I mean by that. Say we’re looking in the third cell. As you’ll remember from above, this means we’re only interested in the first three characters (“aba”). In “aba”, there are two proper prefixes (“a” and “ab”) and two proper suffixes (“a” and “ba”). The proper prefix “ab” does not match either of the two proper suffixes. However, the proper prefix “a” matches the proper suffix “a”. Thus, the length of the longest proper prefix that matches a proper suffix, in this case, is 1.

Let’s try it for cell four. Here, we’re interested in the first four characters (“abab”). We have three proper prefixes (“a”, “ab”, and “aba”) and three proper suffixes (“b”, “ab”, and “bab”). This time, “ab” is in both, and is two characters long, so cell four gets value 2.

Just because it’s an interesting example, let’s also try it for cell five, which concerns “ababa”. We have four proper prefixes (“a”, “ab”, “aba”, and “abab”) and four proper suffixes (“a”, “ba”, “aba”, and “baba”). Now, we have two matches: “a” and “aba” are both proper prefixes and proper suffixes. Since “aba” is longer than “a”, it wins, and cell five gets value 3.

Let’s skip ahead to cell seven (the second-to-last cell), which is concerned with the pattern “abababc”. Even without enumerating all the proper prefixes and suffixes, it should be obvious that there aren’t going to be any matches; all the suffixes will end with the letter “c”, and none of the prefixes will. Since there are no matches, cell seven gets 0.

Finally, let’s look at cell eight, which is concerned with the entire pattern (“abababca”). Since they both start and end with “a”, we know the value will be at least 1. However, that’s where it ends; at lengths two and up, all the suffixes contain a c, while only the last prefix (“abababc”) does. This seven-character prefix does not match the seven-character suffix (“bababca”), so cell eight gets 1.

How to use the Partial Match Table

We can use the values in the partial match table to skip ahead (rather than redoing unnecessary old comparisons) when we find partial matches. The formula works like this:

If a partial match of length partial_match_length is found and table[partial_match_length] > 1, we may skip ahead partial_match_length - table[partial_match_length - 1] characters.

Let’s say we’re matching the pattern “abababca” against the text “bacbababaabcbab”. Here’s our partial match table again for easy reference:

char:  | a | b | a | b | a | b | c | a |

index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

The first time we get a partial match is here:

bacbababaabcbab

 |

 abababca

This is a partial_match_length of 1. The value at table[partial_match_length - 1] (or table[0]) is 0, so we don’t get to skip ahead any. The next partial match we get is here:

bacbababaabcbab

    |||||

    abababca

This is a partial_match_length of 5. The value at table[partial_match_length - 1] (or table[4]) is 3. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 5 - table[4] or 5 - 3 or 2) characters:

// x denotes a skip

bacbababaabcbab

    xx|||

      abababca

This is a partial_match_length of 3. The value at table[partial_match_length - 1] (or table[2]) is 1. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 3 - table[2] or 3 - 1 or 2) characters:

// x denotes a skip

bacbababaabcbab

      xx|

        abababca

At this point, our pattern is longer than the remaining characters in the text, so we know there’s no match.

Conclusion

So there you have it. Like I promised before, it’s no exhaustive explanation or formal proof of KMP; it’s a walk through my brain, with the parts I found confusing spelled out in extreme detail. If you have any questions or notice something I messed up, please leave a comment; maybe we’ll all learn something.

Posted by Jake Boxer Dec 13th, 2009

转：The Knuth-Morris-Pratt Algorithm in my own words的更多相关文章

我所理解的 KMP(Knuth–Morris–Pratt) 算法
假设要在 haystack 中匹配 needle . 要理解 KMP 先需要理解两个概念 proper prefix 和 proper suffix,由于找到没有合适的翻译,暂时分别称真实前缀和真 ...
字符串匹配算法--KMP字符串搜索(Knuth–Morris–Pratt string-searching)C语言实现与讲解
一.前言在计算机科学中,Knuth-Morris-Pratt字符串查找算法(简称为KMP算法)可在一个主文本字符串S内查找一个词W的出现位置.此算法通过运用对这个词在不匹配时本身就包含足够的信息 ...
笔试算法题（52）：简介 - KMP算法（D.E. Knuth, J.H. Morris, V.R. Pratt Algorithm）
议题:KMP算法(D.E. Knuth, J.H. Morris, V.R. Pratt Algorithm) 分析: KMP算法用于在一个主串中找出特定的字符或者模式串.现在假设主串为长度n的数组T ...
Aho - Corasick string matching algorithm
Aho - Corasick string matching algorithm 俗称:多模式匹配算法,它是对 Knuth - Morris - pratt algorithm (单模式匹配算法) 形 ...
GO语言的开源库
Indexes and search engines These sites provide indexes and search engines for Go packages: godoc.org ...
Go语言(golang)开源项目大全
转http://www.open-open.com/lib/view/open1396063913278.html内容目录Astronomy构建工具缓存云计算命令行选项解析器命令行工具压缩配置文件解析 ...
一个字符串搜索的Aho-Corasick算法
Aho和Corasick对KMP算法(Knuth–Morris–Pratt algorithm)进行了改进,Aho-Corasick算法(Aho-Corasick algorithm)利用构建树,总时 ...
[转]Go语言(golang)开源项目大全
内容目录 Astronomy 构建工具缓存云计算命令行选项解析器命令行工具压缩配置文件解析器控制台用户界面加密数据处理数据结构数据库和存储开发工具分布式/网格计算文档编辑 ...
go语言项目汇总
Horst Rutter edited this page 7 days ago · 529 revisions Indexes and search engines These sites prov ...
Golang优秀开源项目汇总, 10大流行Go语言开源项目, golang 开源项目全集(golang/go/wiki/Projects), GitHub上优秀的Go开源项目
Golang优秀开源项目汇总(持续更新...)我把这个汇总放在github上了, 后面更新也会在github上更新. https://github.com/hackstoic/golang-open- ...

随机推荐

Mybatis用法小结
select 1.基本用法 <select id="selectTableOne" resultType="com.test.entity.tableOne&quo ...
类型解释器——C专家编程读书笔记
对于声明,应该按下面的步骤来进行解释: 1) 声明从它的名字开始读取,然后按照优先级顺序依次读取 2) 优先级顺序 a) 括号括起来的部分 b) 后缀操作符,()表示函数,[]表示数组 c) 前缀操作 ...
20150603_Andriod 多个窗体数据回调
package com.example.test1; import android.support.v7.app.ActionBarActivity;import android.os.Bundle; ...
C#微信开发文档
C#微信开发文档开发前准备微信公众平台链接: https://mp.weixin.qq.com/cgi-bin/home?t=home/index&lang=zh_CN 开发初期我们使用测 ...
给用户添加sudo权限
centos中默认创建的新用户是没有sudo权限的. 在文件/etc/sudoers中添加即可: ## Allow root to run any commands anywhere root ALL ...
select resharper shortcuts scheme
VS代码生成工具ReSharper提供了丰富的快捷键,可以极大地提高你的开发效率.安装ReSharper后首次启动Visual Studio时,会出现一个名为ReSharper Keyboard Sc ...
sql中decode(...)函数的用法
相当于if语句 decode函数比较1个参数时 SELECT ID,DECODE(inParam,'beComparedParam','值1' ,'值2') name FROM bank #如果第一个 ...
iOS- 详解文本属性Attributes
1.NSKernAttributeName: @10 调整字句 kerning 字句调整 2.NSFontAttributeName : [UIFont systemFontOfSize:_fontS ...
SQL数据库约束行为---防止数据乱填(即数据规范化)
防止乱填:一.Check约束.按照某种规则对数据进行检查.操作:在表的设计界面中,右击相应的列,选择“CHECK约束”在弹出的对话框中,设置约束的名称和表达式. 代码实现: create table ...
javascript/jquery判断是否为undefined或是null!
var exp = undefined; if (typeof(exp) == "undefined"){ alert("undefined");} 注意 ...