转:The Knuth-Morris-Pratt Algorithm in my own words
The Knuth-Morris-Pratt Algorithm in my own words
For the past few days, I’ve been reading various explanations of the Knuth-Morris-Pratt string searching algorithms. For some reason, none of the explanations were doing it for me. I kept banging my head against a brick wall once I started reading “the prefix of the suffix of the prefix of the…”.
Finally, after reading the same paragraph of CLRS over and over for about 30 minutes, I decided to sit down, do a bunch of examples, and diagram them out. I now understand the algorithm, and can explain it. For those who think like me, here it is in my own words. As a side note, I’m not going to explain why it’s more efficient than na”ive string matching; that’s explained perfectly well in a multitude of places. I’m going to explain exactly how it works, as my brain understands it.
The Partial Match Table
The key to KMP, of course, is the partial match table. The main obstacle between me and understanding KMP was the fact that I didn’t quite fully grasp what the values in the partial match table really meant. I will now try to explain them in the simplest words possible.
Here’s the partial match table for the pattern “abababca”:
1 |
|
If I have an eight-character pattern (let’s say “abababca” for the duration of this example), my partial match table will have eight cells. If I’m looking at the eighth and last cell in the table, I’m interested in the entire pattern (“abababca”). If I’m looking at the seventh cell in the table, I’m only interested in the first seven characters in the pattern (“abababc”); the eighth one (“a”) is irrelevant, and can go fall off a building or something. If I’m looking at the sixth cell of the in the table… you get the idea. Notice that I haven’t talked about what each cell means yet, but just what it’s referring to.
Now, in order to talk about the meaning, we need to know about proper prefixes and proper suffixes.
Proper prefix: All the characters in a string, with one or more cut off the end. “S”, “Sn”, “Sna”, and “Snap” are all the proper prefixes of “Snape”.
Proper suffix: All the characters in a string, with one or more cut off the beginning. “agrid”, “grid”, “rid”, “id”, and “d” are all proper suffixes of “Hagrid”.
With this in mind, I can now give the one-sentence meaning of the values in the partial match table:
The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.
Let’s examine what I mean by that. Say we’re looking in the third cell. As you’ll remember from above, this means we’re only interested in the first three characters (“aba”). In “aba”, there are two proper prefixes (“a” and “ab”) and two proper suffixes (“a” and “ba”). The proper prefix “ab” does not match either of the two proper suffixes. However, the proper prefix “a” matches the proper suffix “a”. Thus, the length of the longest proper prefix that matches a proper suffix, in this case, is 1.
Let’s try it for cell four. Here, we’re interested in the first four characters (“abab”). We have three proper prefixes (“a”, “ab”, and “aba”) and three proper suffixes (“b”, “ab”, and “bab”). This time, “ab” is in both, and is two characters long, so cell four gets value 2.
Just because it’s an interesting example, let’s also try it for cell five, which concerns “ababa”. We have four proper prefixes (“a”, “ab”, “aba”, and “abab”) and four proper suffixes (“a”, “ba”, “aba”, and “baba”). Now, we have two matches: “a” and “aba” are both proper prefixes and proper suffixes. Since “aba” is longer than “a”, it wins, and cell five gets value 3.
Let’s skip ahead to cell seven (the second-to-last cell), which is concerned with the pattern “abababc”. Even without enumerating all the proper prefixes and suffixes, it should be obvious that there aren’t going to be any matches; all the suffixes will end with the letter “c”, and none of the prefixes will. Since there are no matches, cell seven gets 0.
Finally, let’s look at cell eight, which is concerned with the entire pattern (“abababca”). Since they both start and end with “a”, we know the value will be at least 1. However, that’s where it ends; at lengths two and up, all the suffixes contain a c, while only the last prefix (“abababc”) does. This seven-character prefix does not match the seven-character suffix (“bababca”), so cell eight gets 1.
How to use the Partial Match Table
We can use the values in the partial match table to skip ahead (rather than redoing unnecessary old comparisons) when we find partial matches. The formula works like this:
If a partial match of length partial_match_length is found and table[partial_match_length] > 1
, we may skip ahead partial_match_length - table[partial_match_length - 1]
characters.
Let’s say we’re matching the pattern “abababca” against the text “bacbababaabcbab”. Here’s our partial match table again for easy reference:
1 |
|
The first time we get a partial match is here:
1 |
|
This is a partial_match_length of 1. The value at table[partial_match_length - 1]
(or table[0]
) is 0, so we don’t get to skip ahead any. The next partial match we get is here:
1 |
|
This is a partial_match_length of 5. The value at table[partial_match_length - 1]
(or table[4]
) is 3. That means we get to skip ahead partial_match_length - table[partial_match_length - 1]
(or 5 - table[4]
or 5 - 3
or 2
) characters:
1 |
|
This is a partial_match_length of 3. The value at table[partial_match_length - 1]
(or table[2]
) is 1. That means we get to skip ahead partial_match_length - table[partial_match_length - 1]
(or 3 - table[2]
or 3 - 1
or 2
) characters:
1 |
|
At this point, our pattern is longer than the remaining characters in the text, so we know there’s no match.
Conclusion
So there you have it. Like I promised before, it’s no exhaustive explanation or formal proof of KMP; it’s a walk through my brain, with the parts I found confusing spelled out in extreme detail. If you have any questions or notice something I messed up, please leave a comment; maybe we’ll all learn something.
Posted by Jake Boxer Dec 13th, 2009
转:The Knuth-Morris-Pratt Algorithm in my own words的更多相关文章
- 我所理解的 KMP(Knuth–Morris–Pratt) 算法
假设要在 haystack 中匹配 needle . 要理解 KMP 先需要理解两个概念 proper prefix 和 proper suffix,由于找到没有合适的翻译,暂时分别称真实前缀 和 真 ...
- 字符串匹配算法--KMP字符串搜索(Knuth–Morris–Pratt string-searching)C语言实现与讲解
一.前言 在计算机科学中,Knuth-Morris-Pratt字符串查找算法(简称为KMP算法)可在一个主文本字符串S内查找一个词W的出现位置.此算法通过运用对这个词在不匹配时本身就包含足够的信息 ...
- 笔试算法题(52):简介 - KMP算法(D.E. Knuth, J.H. Morris, V.R. Pratt Algorithm)
议题:KMP算法(D.E. Knuth, J.H. Morris, V.R. Pratt Algorithm) 分析: KMP算法用于在一个主串中找出特定的字符或者模式串.现在假设主串为长度n的数组T ...
- Aho - Corasick string matching algorithm
Aho - Corasick string matching algorithm 俗称:多模式匹配算法,它是对 Knuth - Morris - pratt algorithm (单模式匹配算法) 形 ...
- GO语言的开源库
Indexes and search engines These sites provide indexes and search engines for Go packages: godoc.org ...
- Go语言(golang)开源项目大全
转http://www.open-open.com/lib/view/open1396063913278.html内容目录Astronomy构建工具缓存云计算命令行选项解析器命令行工具压缩配置文件解析 ...
- 一个字符串搜索的Aho-Corasick算法
Aho和Corasick对KMP算法(Knuth–Morris–Pratt algorithm)进行了改进,Aho-Corasick算法(Aho-Corasick algorithm)利用构建树,总时 ...
- [转]Go语言(golang)开源项目大全
内容目录 Astronomy 构建工具 缓存 云计算 命令行选项解析器 命令行工具 压缩 配置文件解析器 控制台用户界面 加密 数据处理 数据结构 数据库和存储 开发工具 分布式/网格计算 文档 编辑 ...
- go语言项目汇总
Horst Rutter edited this page 7 days ago · 529 revisions Indexes and search engines These sites prov ...
- Golang优秀开源项目汇总, 10大流行Go语言开源项目, golang 开源项目全集(golang/go/wiki/Projects), GitHub上优秀的Go开源项目
Golang优秀开源项目汇总(持续更新...)我把这个汇总放在github上了, 后面更新也会在github上更新. https://github.com/hackstoic/golang-open- ...
随机推荐
- 在javaEE下学习web(在eclipse中开发动态的WEB工程,servlet的环境搭建,及servlet的一些方法)
一个简便的方法实现javaee版的eclipse开发动态的WEB工程(javaWEB项目)1.把开发选项切换到javaEE2. 可以在window->shou view 中找到package e ...
- LA 4126 Password Suspects
问题描述:给定m个模式串,计数包含所有模式串且长度为n的字符串的数目. 数据范围:模式串长度不超过10,m <= 10, n <= 25,此外保证答案不超过1015. 分析:既然要计数给定 ...
- Unity-Animator深入系列---Foot IK
回到 Animator深入系列总目录 最近在做一个demo,遇到了角色跑动不自然的问题(注意双腿): 后来得知勾选FootIK之后Unity会智能修复这类问题: 好像这个功能还能做到斜面地形匹配,不过 ...
- mysql 字段引号那个像单引号的撇号用法
我们知道通常的SQL查询语句是这么写的: select col from table; 这当然没问题,但如果字段名是“from”呢? select from from table; 若真的这么写,必然 ...
- java的myeclipse生成webservice的service和client
前言:朋友们开始以下教程前,请先看第五大点的注意事项,以避免不必要的重复操作. 一.准备工作(以下为本实例使用工具) 1.MyEclipse10.7.1 2.JDK 1.6.0_22 二.创建服务端 ...
- HDU2112 HDU Today 最短路+字符串哈希
HDU Today Time Limit: 15000/5000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others)Total ...
- 各种操作中心Operation Center一览
Operation Center在中国可能有很多种名称,例如指挥中心.运维室.总控中心等等,国外可能也有很多名称,不管名称如何,任何一个上规模得数据总心或者运维单位一般都有一个这样得中心,来负责所管理 ...
- [JAVA设计模式]第四部分:行为模式
声明:原创作品,转载时请注明文章来自SAP师太技术博客( 博/客/园www.cnblogs.com):www.cnblogs.com/jiangzhengjun,并以超链接形式标明文章原始出处,否则将 ...
- How to run an manually installed program from terminals in Linux / Ubuntu
Say we have installed qt programs and we want to run qtcreator from the command line. What we need h ...
- HTML笔记(七)head相关元素<base> & <meta>
<head>元素是所有头部元素的容器. 可添加的标签有:<title>.<base>.<link>.<meta>.<script> ...