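The Aho-Corasick Algorithm for String Searching
Aho and Corasick extended the KMP algorithm (Knuth–Morris–Pratt algorithm) from a single pattern to a whole set of keywords. The Aho-Corasick algorithm builds a keyword tree with a fail function; once the tree is built, searching a text of length n takes O(n) time plus the number of matches reported. The diagram below is taken from "Aho-Corasick string matching in C#":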


Building of the keyword tree (figure 1 - after the first step, figure 2 - tree with the fail function)
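The two steps named in the caption (build the keyword tree, then add the fail function) can be sketched compactly as follows. This is only an illustrative sketch that uses generic collections, not the code from the cited article; `AcSketch`, `Node`, `Build`, and `FindAll` are hypothetical names chosen for this example, and matching is case-sensitive.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only: all names here are hypothetical.
public static class AcSketch
{
    public sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;   // node for the longest proper suffix that is also a path from the root
        public readonly List<string> Output = new List<string>();
    }

    public static Node Build(IEnumerable<string> keywords)
    {
        // Step 1: insert every keyword into the tree.
        Node root = new Node();
        foreach (string word in keywords)
        {
            Node node = root;
            foreach (char c in word)
            {
                Node child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Next.Add(c, child);
                }
                node = child;
            }
            node.Output.Add(word);
        }

        // Step 2: set fail links breadth-first, so shallower nodes are finished first.
        Queue<Node> queue = new Queue<Node>();
        foreach (Node child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in node.Next)
            {
                // Follow the parent's fail chain until some node has a transition on this character.
                Node f = node.Fail;
                while (f != null && !f.Next.ContainsKey(edge.Key)) f = f.Fail;
                edge.Value.Fail = (f == null) ? root : f.Next[edge.Key];
                // Keywords that end at the fail target also end here.
                edge.Value.Output.AddRange(edge.Value.Fail.Output);
                queue.Enqueue(edge.Value);
            }
        }
        return root;
    }

    // Scans the text once and returns (start index, keyword) pairs.
    public static List<KeyValuePair<int, string>> FindAll(Node root, string text)
    {
        List<KeyValuePair<int, string>> hits = new List<KeyValuePair<int, string>>();
        Node node = root;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            while (node != root && !node.Next.ContainsKey(c)) node = node.Fail;
            Node next;
            if (node.Next.TryGetValue(c, out next)) node = next;
            foreach (string word in node.Output)
                hits.Add(new KeyValuePair<int, string>(i - word.Length + 1, word));
        }
        return hits;
    }
}
```

For example, `AcSketch.FindAll(AcSketch.Build(new[] { "he", "she", "his", "hers" }), "ushers")` reports "she" at index 1 and both "he" and "hers" at index 2. Overlapping keywords are found in a single pass because each node inherits the outputs of its fail target.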
The C# implementation can be obtained from "Aho-Corasick string matching in C#", and the algorithm is also described in a PDF document.
Here is a sample application:

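It highlights the search keywords in a loaded RTF document, and the search is fast. The sample does not implement whole-word matching. The algorithm code, in brief, is as follows: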
- /* Aho-Corasick text search algorithm implementation
- *
- * For more information visit
- * - http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf
- */
- using System;
- using System.Collections;
- namespace EeekSoft.Text
- {
- /// <summary>
- /// Interface containing all methods to be implemented
- /// by string search algorithm
- /// </summary>
- public interface IStringSearchAlgorithm
- {
- #region Methods & Properties
- /// <summary>
- /// Ignore case of letters
- /// </summary>
- bool IgnoreCase { get; set; }
- /// <summary>
- /// List of keywords to search for
- /// </summary>
- string[] Keywords { get; set; }
- /// <summary>
- /// Searches passed text and returns all occurrences of any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>Array of occurrences</returns>
- StringSearchResult[] FindAll(string text);
- /// <summary>
- /// Searches passed text and returns first occurrence of any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>First occurrence of any keyword (or StringSearchResult.Empty if text doesn't contain any keyword)</returns>
- StringSearchResult FindFirst(string text);
- /// <summary>
- /// Searches passed text and returns true if text contains any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>True when text contains any keyword</returns>
- bool ContainsAny(string text);
- #endregion
- }
- /// <summary>
- /// Structure containing results of search
- /// (keyword and position in original text)
- /// </summary>
- public struct StringSearchResult
- {
- #region Members
- private int _index;
- private string _keyword;
- /// <summary>
- /// Initialize string search result
- /// </summary>
- /// <param name="index">Index in text</param>
- /// <param name="keyword">Found keyword</param>
- public StringSearchResult(int index, string keyword)
- {
- _index = index; _keyword = keyword;
- }
- /// <summary>
- /// Returns index of found keyword in original text
- /// </summary>
- public int Index
- {
- get { return _index; }
- }
- /// <summary>
- /// Returns keyword found by this result
- /// </summary>
- public string Keyword
- {
- get { return _keyword; }
- }
- /// <summary>
- /// Returns empty search result
- /// </summary>
- public static StringSearchResult Empty
- {
- get { return new StringSearchResult(-1, ""); }
- }
- #endregion
- }
- /// <summary>
- /// Class for searching string for one or multiple
- /// keywords using efficient Aho-Corasick search algorithm
- /// </summary>
- public class StringSearch : IStringSearchAlgorithm
- {
- #region Objects
- /// <summary>
- /// Tree node representing character and its
- /// transition and failure function
- /// </summary>
- class TreeNode
- {
- #region Constructor & Methods
- /// <summary>
- /// Initialize tree node with specified character
- /// </summary>
- /// <param name="parent">Parent node</param>
- /// <param name="c">Character</param>
- public TreeNode(TreeNode parent, char c)
- {
- _char = c; _parent = parent;
- _results = new ArrayList();
- _resultsAr = new string[] { };
- _transitionsAr = new TreeNode[] { };
- _transHash = new Hashtable();
- }
- /// <summary>
- /// Adds pattern ending in this node
- /// </summary>
- /// <param name="result">Pattern</param>
- public void AddResult(string result)
- {
- if (_results.Contains(result)) return;
- _results.Add(result);
- _resultsAr = (string[])_results.ToArray(typeof(string));
- }
- /// <summary>
- /// Adds transition node
- /// </summary>
- /// <param name="node">Node</param>
- /// <param name="ignoreCase">Ignore case of letters</param>
- public void AddTransition(TreeNode node, bool ignoreCase)
- {
- if (ignoreCase) _transHash.Add(char.ToLower(node.Char), node);
- else _transHash.Add(node.Char, node);
- TreeNode[] ar = new TreeNode[_transHash.Values.Count];
- _transHash.Values.CopyTo(ar, 0);
- _transitionsAr = ar;
- }
- /// <summary>
- /// Returns transition to specified character (if exists)
- /// </summary>
- /// <param name="c">Character</param>
- /// <param name="ignoreCase">Ignore case of letters</param>
- /// <returns>Returns TreeNode or null</returns>
- public TreeNode GetTransition(char c, bool ignoreCase)
- {
- if (ignoreCase)
- return (TreeNode)_transHash[char.ToLower(c)];
- return (TreeNode)_transHash[c];
- }
- /// <summary>
- /// Returns true if node contains transition to specified character
- /// </summary>
- /// <param name="c">Character</param>
- /// <param name="ignoreCase">Ignore case of letters</param>
- /// <returns>True if transition exists</returns>
- public bool ContainsTransition(char c, bool ignoreCase)
- {
- return GetTransition(c, ignoreCase) != null;
- }
- #endregion
- #region Properties
- private char _char;
- private TreeNode _parent;
- private TreeNode _failure;
- private ArrayList _results;
- private TreeNode[] _transitionsAr;
- private string[] _resultsAr;
- private Hashtable _transHash;
- /// <summary>
- /// Character
- /// </summary>
- public char Char
- {
- get { return _char; }
- }
- /// <summary>
- /// Parent tree node
- /// </summary>
- public TreeNode Parent
- {
- get { return _parent; }
- }
- /// <summary>
- /// Failure function - node to fall back to when no transition matches
- /// </summary>
- public TreeNode Failure
- {
- get { return _failure; }
- set { _failure = value; }
- }
- /// <summary>
- /// Transition function - list of descendant nodes
- /// </summary>
- public TreeNode[] Transitions
- {
- get { return _transitionsAr; }
- }
- /// <summary>
- /// Returns list of patterns ending at this node
- /// </summary>
- public string[] Results
- {
- get { return _resultsAr; }
- }
- #endregion
- }
- #endregion
- #region Local fields
- /// <summary>
- /// Root of keyword tree
- /// </summary>
- private TreeNode _root;
- /// <summary>
- /// Keywords to search for
- /// </summary>
- private string[] _keywords;
- #endregion
- #region Initialization
- /// <summary>
- /// Initialize search algorithm (Build keyword tree)
- /// </summary>
- /// <param name="keywords">Keywords to search for</param>
- /// <param name="ignoreCase">Ignore case of letters (the default is false)</param>
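- public StringSearch(string[] keywords, bool ignoreCase)
- {
- // Set IgnoreCase first: assigning Keywords builds the tree,
- // and the tree must be built with the final case setting
- IgnoreCase = ignoreCase;
- Keywords = keywords;
- }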
- /// <summary>
- /// Initialize search algorithm (Build keyword tree)
- /// </summary>
- /// <param name="keywords">Keywords to search for</param>
- public StringSearch(string[] keywords)
- {
- Keywords = keywords;
- }
- /// <summary>
- /// Initialize search algorithm with no keywords
- /// (Use Keywords property)
- /// </summary>
- public StringSearch()
- { }
- #endregion
- #region Implementation
- /// <summary>
- /// Build tree from specified keywords
- /// </summary>
- void BuildTree()
- {
- // Build keyword tree and transition function
- _root = new TreeNode(null, ' ');
- foreach (string p in _keywords)
- {
- // add pattern to tree
- TreeNode nd = _root;
- foreach (char c in p)
- {
- TreeNode ndNew = null;
- foreach (TreeNode trans in nd.Transitions)
- {
- if (this.IgnoreCase)
- {
- if (char.ToLower(trans.Char) == char.ToLower(c)) { ndNew = trans; break; }
- }
- else
- {
- if (trans.Char == c) { ndNew = trans; break; }
- }
- }
- if (ndNew == null)
- {
- ndNew = new TreeNode(nd, c);
- nd.AddTransition(ndNew, this.IgnoreCase);
- }
- nd = ndNew;
- }
- nd.AddResult(p);
- }
- // Find failure functions
- ArrayList nodes = new ArrayList();
- // level 1 nodes - fail to root node
- foreach (TreeNode nd in _root.Transitions)
- {
- nd.Failure = _root;
- foreach (TreeNode trans in nd.Transitions) nodes.Add(trans);
- }
- // other nodes - using BFS
- while (nodes.Count != 0)
- {
- ArrayList newNodes = new ArrayList();
- foreach (TreeNode nd in nodes)
- {
- TreeNode r = nd.Parent.Failure;
- char c = nd.Char;
- while (r != null && !r.ContainsTransition(c, this.IgnoreCase)) r = r.Failure;
- if (r == null)
- nd.Failure = _root;
- else
- {
- nd.Failure = r.GetTransition(c, this.IgnoreCase);
- foreach (string result in nd.Failure.Results)
- nd.AddResult(result);
- }
- // add child nodes to BFS list
- foreach (TreeNode child in nd.Transitions)
- newNodes.Add(child);
- }
- nodes = newNodes;
- }
- _root.Failure = _root;
- }
- #endregion
- #region Methods & Properties
- /// <summary>
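- /// Ignore case of letters (set this before Keywords, because the
- /// keyword tree is built with the value in effect at that time)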
- /// </summary>
- public bool IgnoreCase
- {
- get;
- set;
- }
- /// <summary>
- /// Keywords to search for (setting this property is slow, because
- /// it requires rebuilding the keyword tree)
- /// </summary>
- public string[] Keywords
- {
- get { return _keywords; }
- set
- {
- _keywords = value;
- BuildTree();
- }
- }
- /// <summary>
- /// Searches passed text and returns all occurrences of any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>Array of occurrences</returns>
- public StringSearchResult[] FindAll(string text)
- {
- ArrayList ret = new ArrayList();
- TreeNode ptr = _root;
- int index = 0;
- while (index < text.Length)
- {
- TreeNode trans = null;
- while (trans == null)
- {
- trans = ptr.GetTransition(text[index], this.IgnoreCase);
- if (ptr == _root) break;
- if (trans == null) ptr = ptr.Failure;
- }
- if (trans != null) ptr = trans;
- foreach (string found in ptr.Results)
- ret.Add(new StringSearchResult(index - found.Length + 1, found));
- index++;
- }
- return (StringSearchResult[])ret.ToArray(typeof(StringSearchResult));
- }
- /// <summary>
- /// Searches passed text and returns first occurrence of any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>First occurrence of any keyword (or StringSearchResult.Empty if text doesn't contain any keyword)</returns>
- public StringSearchResult FindFirst(string text)
- {
- TreeNode ptr = _root;
- int index = 0;
- while (index < text.Length)
- {
- TreeNode trans = null;
- while (trans == null)
- {
- trans = ptr.GetTransition(text[index], this.IgnoreCase);
- if (ptr == _root) break;
- if (trans == null) ptr = ptr.Failure;
- }
- if (trans != null) ptr = trans;
- foreach (string found in ptr.Results)
- return new StringSearchResult(index - found.Length + 1, found);
- index++;
- }
- return StringSearchResult.Empty;
- }
- /// <summary>
- /// Searches passed text and returns true if text contains any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>True when text contains any keyword</returns>
- public bool ContainsAny(string text)
- {
- TreeNode ptr = _root;
- int index = 0;
- while (index < text.Length)
- {
- TreeNode trans = null;
- while (trans == null)
- {
- trans = ptr.GetTransition(text[index], this.IgnoreCase);
- if (ptr == _root) break;
- if (trans == null) ptr = ptr.Failure;
- }
- if (trans != null) ptr = trans;
- if (ptr.Results.Length > 0) return true;
- index++;
- }
- return false;
- }
- #endregion
- }
- }
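Because the sample does not implement whole-word matching, one possible way to add it is to filter the reported matches afterwards. The helper below is a hypothetical sketch, not part of the original sample: `WholeWordFilter` and `IsWholeWord` are made-up names, and it assumes that a "word character" is whatever `char.IsLetterOrDigit` accepts.

```csharp
using System;

// Hypothetical helper, not part of the original sample.
public static class WholeWordFilter
{
    // Returns true when the match text[index .. index + length) has no letter
    // or digit immediately before or after it (assumed definition of a whole word).
    public static bool IsWholeWord(string text, int index, int length)
    {
        if (text == null || index < 0 || length <= 0 || index + length > text.Length)
            return false;

        int end = index + length;
        bool leftOk = index == 0 || !char.IsLetterOrDigit(text[index - 1]);
        bool rightOk = end == text.Length || !char.IsLetterOrDigit(text[end]);
        return leftOk && rightOk;
    }
}
```

Each result returned by `FindAll` could then be kept only when `WholeWordFilter.IsWholeWord(text, result.Index, result.Keyword.Length)` is true. Note that `char.IsLetterOrDigit` also returns true for CJK ideographs, so this boundary test is mainly useful for space-delimited text.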
Sample download page: http://www.uushare.com/user/m2nlight/file/2722093