一个字符串搜索的Aho-Corasick算法
Aho和Corasick对KMP算法(Knuth–Morris–Pratt algorithm)进行了改进,Aho-Corasick算法(Aho-Corasick algorithm)利用构建树,总时间复杂度是O(n)。原理图如下(摘自Aho-Corasick string matching in C#):


Building of the keyword tree (figure 1 - after the first step, figure 2 - tree with the fail function)
C#版本的实现代码可以从Aho-Corasick string matching in C#得到,也可以点击这里获得该算法的PDF文档。
这是一个应用示例:

它能将载入的RTF文档中的搜索关键字高亮,检索速度较快,示例没有实现全字匹配,算法代码简要如下:
- /* Aho-Corasick text search algorithm implementation
- *
- * For more information visit
- * - http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf
- */
- using System;
- using System.Collections;
- namespace EeekSoft.Text
- {
- /// <summary>
- /// Interface containing all methods to be implemented
- /// by string search algorithm
- /// </summary>
- public interface IStringSearchAlgorithm
- {
- #region Methods & Properties
- /// <summary>
- /// Ignore case of letters
- /// </summary>
- bool IgnoreCase { get; set; }
- /// <summary>
- /// List of keywords to search for
- /// </summary>
- string[] Keywords { get; set; }
- /// <summary>
- /// Searches passed text and returns all occurrences of any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>Array of occurrences</returns>
- StringSearchResult[] FindAll(string text);
- /// <summary>
- /// Searches passed text and returns first occurrence of any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>First occurrence of any keyword (or StringSearchResult.Empty if text doesn't contain any keyword)</returns>
- StringSearchResult FindFirst(string text);
- /// <summary>
- /// Searches passed text and returns true if text contains any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>True when text contains any keyword</returns>
- bool ContainsAny(string text);
- #endregion
- }
- /// <summary>
- /// Structure containing results of search
- /// (keyword and position in original text)
- /// </summary>
- public struct StringSearchResult
- {
- #region Members
- private int _index;
- private string _keyword;
- /// <summary>
- /// Initialize string search result
- /// </summary>
- /// <param name="index">Index in text</param>
- /// <param name="keyword">Found keyword</param>
- public StringSearchResult(int index, string keyword)
- {
- _index = index; _keyword = keyword;
- }
- /// <summary>
- /// Returns index of found keyword in original text
- /// </summary>
- public int Index
- {
- get { return _index; }
- }
- /// <summary>
- /// Returns keyword found by this result
- /// </summary>
- public string Keyword
- {
- get { return _keyword; }
- }
- /// <summary>
- /// Returns empty search result
- /// </summary>
- public static StringSearchResult Empty
- {
- get { return new StringSearchResult(-1, ""); }
- }
- #endregion
- }
- /// <summary>
- /// Class for searching string for one or multiple
- /// keywords using efficient Aho-Corasick search algorithm
- /// </summary>
- public class StringSearch : IStringSearchAlgorithm
- {
- #region Objects
- /// <summary>
- /// Tree node representing character and its
- /// transition and failure function
- /// </summary>
- class TreeNode
- {
- #region Constructor & Methods
- /// <summary>
- /// Initialize tree node with specified character
- /// </summary>
- /// <param name="parent">Parent node</param>
- /// <param name="c">Character</param>
- public TreeNode(TreeNode parent, char c)
- {
- _char = c; _parent = parent;
- _results = new ArrayList();
- _resultsAr = new string[] { };
- _transitionsAr = new TreeNode[] { };
- _transHash = new Hashtable();
- }
- /// <summary>
- /// Adds pattern ending in this node
- /// </summary>
- /// <param name="result">Pattern</param>
- public void AddResult(string result)
- {
- if (_results.Contains(result)) return;
- _results.Add(result);
- _resultsAr = (string[])_results.ToArray(typeof(string));
- }
- /// <summary>
- /// Adds trabsition node
- /// </summary>
- /// <param name="node">Node</param>
- //public void AddTransition(TreeNode node)
- //{
- // AddTransition(node, false);
- //}
- /// <summary>
- /// Adds trabsition node
- /// </summary>
- /// <param name="node">Node</param>
- /// <param name="ignoreCase">Ignore case of letters</param>
- public void AddTransition(TreeNode node, bool ignoreCase)
- {
- if (ignoreCase) _transHash.Add(char.ToLower(node.Char), node);
- else _transHash.Add(node.Char, node);
- TreeNode[] ar = new TreeNode[_transHash.Values.Count];
- _transHash.Values.CopyTo(ar, 0);
- _transitionsAr = ar;
- }
- /// <summary>
- /// Returns transition to specified character (if exists)
- /// </summary>
- /// <param name="c">Character</param>
- /// <param name="ignoreCase">Ignore case of letters</param>
- /// <returns>Returns TreeNode or null</returns>
- public TreeNode GetTransition(char c, bool ignoreCase)
- {
- if (ignoreCase)
- return (TreeNode)_transHash[char.ToLower(c)];
- return (TreeNode)_transHash[c];
- }
- /// <summary>
- /// Returns true if node contains transition to specified character
- /// </summary>
- /// <param name="c">Character</param>
- /// <param name="ignoreCase">Ignore case of letters</param>
- /// <returns>True if transition exists</returns>
- public bool ContainsTransition(char c, bool ignoreCase)
- {
- return GetTransition(c, ignoreCase) != null;
- }
- #endregion
- #region Properties
- private char _char;
- private TreeNode _parent;
- private TreeNode _failure;
- private ArrayList _results;
- private TreeNode[] _transitionsAr;
- private string[] _resultsAr;
- private Hashtable _transHash;
- /// <summary>
- /// Character
- /// </summary>
- public char Char
- {
- get { return _char; }
- }
- /// <summary>
- /// Parent tree node
- /// </summary>
- public TreeNode Parent
- {
- get { return _parent; }
- }
- /// <summary>
- /// Failure function - descendant node
- /// </summary>
- public TreeNode Failure
- {
- get { return _failure; }
- set { _failure = value; }
- }
- /// <summary>
- /// Transition function - list of descendant nodes
- /// </summary>
- public TreeNode[] Transitions
- {
- get { return _transitionsAr; }
- }
- /// <summary>
- /// Returns list of patterns ending by this letter
- /// </summary>
- public string[] Results
- {
- get { return _resultsAr; }
- }
- #endregion
- }
- #endregion
- #region Local fields
- /// <summary>
- /// Root of keyword tree
- /// </summary>
- private TreeNode _root;
- /// <summary>
- /// Keywords to search for
- /// </summary>
- private string[] _keywords;
- #endregion
- #region Initialization
- /// <summary>
- /// Initialize search algorithm (Build keyword tree)
- /// </summary>
- /// <param name="keywords">Keywords to search for</param>
- /// <param name="ignoreCase">Ignore case of letters (the default is false)</param>
- public StringSearch(string[] keywords, bool ignoreCase)
- : this(keywords)
- {
- IgnoreCase = ignoreCase;
- }
- /// <summary>
- /// Initialize search algorithm (Build keyword tree)
- /// </summary>
- /// <param name="keywords">Keywords to search for</param>
- public StringSearch(string[] keywords)
- {
- Keywords = keywords;
- }
- /// <summary>
- /// Initialize search algorithm with no keywords
- /// (Use Keywords property)
- /// </summary>
- public StringSearch()
- { }
- #endregion
- #region Implementation
- /// <summary>
- /// Build tree from specified keywords
- /// </summary>
- void BuildTree()
- {
- // Build keyword tree and transition function
- _root = new TreeNode(null, ' ');
- foreach (string p in _keywords)
- {
- // add pattern to tree
- TreeNode nd = _root;
- foreach (char c in p)
- {
- TreeNode ndNew = null;
- foreach (TreeNode trans in nd.Transitions)
- {
- if (this.IgnoreCase)
- {
- if (char.ToLower(trans.Char) == char.ToLower(c)) { ndNew = trans; break; }
- }
- else
- {
- if (trans.Char == c) { ndNew = trans; break; }
- }
- }
- if (ndNew == null)
- {
- ndNew = new TreeNode(nd, c);
- nd.AddTransition(ndNew, this.IgnoreCase);
- }
- nd = ndNew;
- }
- nd.AddResult(p);
- }
- // Find failure functions
- ArrayList nodes = new ArrayList();
- // level 1 nodes - fail to root node
- foreach (TreeNode nd in _root.Transitions)
- {
- nd.Failure = _root;
- foreach (TreeNode trans in nd.Transitions) nodes.Add(trans);
- }
- // other nodes - using BFS
- while (nodes.Count != 0)
- {
- ArrayList newNodes = new ArrayList();
- foreach (TreeNode nd in nodes)
- {
- TreeNode r = nd.Parent.Failure;
- char c = nd.Char;
- while (r != null && !r.ContainsTransition(c, this.IgnoreCase)) r = r.Failure;
- if (r == null)
- nd.Failure = _root;
- else
- {
- nd.Failure = r.GetTransition(c, this.IgnoreCase);
- foreach (string result in nd.Failure.Results)
- nd.AddResult(result);
- }
- // add child nodes to BFS list
- foreach (TreeNode child in nd.Transitions)
- newNodes.Add(child);
- }
- nodes = newNodes;
- }
- _root.Failure = _root;
- }
- #endregion
- #region Methods & Properties
- /// <summary>
- /// Ignore case of letters
- /// </summary>
- public bool IgnoreCase
- {
- get;
- set;
- }
- /// <summary>
- /// Keywords to search for (setting this property is slow, because
- /// it requieres rebuilding of keyword tree)
- /// </summary>
- public string[] Keywords
- {
- get { return _keywords; }
- set
- {
- _keywords = value;
- BuildTree();
- }
- }
- /// <summary>
- /// Searches passed text and returns all occurrences of any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>Array of occurrences</returns>
- public StringSearchResult[] FindAll(string text)
- {
- ArrayList ret = new ArrayList();
- TreeNode ptr = _root;
- int index = 0;
- while (index < text.Length)
- {
- TreeNode trans = null;
- while (trans == null)
- {
- trans = ptr.GetTransition(text[index], this.IgnoreCase);
- if (ptr == _root) break;
- if (trans == null) ptr = ptr.Failure;
- }
- if (trans != null) ptr = trans;
- foreach (string found in ptr.Results)
- ret.Add(new StringSearchResult(index - found.Length + 1, found));
- index++;
- }
- return (StringSearchResult[])ret.ToArray(typeof(StringSearchResult));
- }
- /// <summary>
- /// Searches passed text and returns first occurrence of any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>First occurrence of any keyword (or StringSearchResult.Empty if text doesn't contain any keyword)</returns>
- public StringSearchResult FindFirst(string text)
- {
- ArrayList ret = new ArrayList();
- TreeNode ptr = _root;
- int index = 0;
- while (index < text.Length)
- {
- TreeNode trans = null;
- while (trans == null)
- {
- trans = ptr.GetTransition(text[index], this.IgnoreCase);
- if (ptr == _root) break;
- if (trans == null) ptr = ptr.Failure;
- }
- if (trans != null) ptr = trans;
- foreach (string found in ptr.Results)
- return new StringSearchResult(index - found.Length + 1, found);
- index++;
- }
- return StringSearchResult.Empty;
- }
- /// <summary>
- /// Searches passed text and returns true if text contains any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>True when text contains any keyword</returns>
- public bool ContainsAny(string text)
- {
- TreeNode ptr = _root;
- int index = 0;
- while (index < text.Length)
- {
- TreeNode trans = null;
- while (trans == null)
- {
- trans = ptr.GetTransition(text[index], this.IgnoreCase);
- if (ptr == _root) break;
- if (trans == null) ptr = ptr.Failure;
- }
- if (trans != null) ptr = trans;
- if (ptr.Results.Length > 0) return true;
- index++;
- }
- return false;
- }
- #endregion
- }
- }
示例下载页面:http://www.uushare.com/user/m2nlight/file/2722093
一个字符串搜索的Aho-Corasick算法的更多相关文章
- 多模字符串匹配算法-Aho–Corasick
		背景 在做实际工作中,最简单也最常用的一种自然语言处理方法就是关键词匹配,例如我们要对n条文本进行过滤,那本身是一个过滤词表的,通常进行过滤的代码如下 for (String document : d ... 
- 【ToolGood.Words】之【StringSearch】字符串搜索——基于BFS算法
		字符串搜索中,BFS算法很巧妙,个人认为BFS算法效率是最高的. [StringSearch]就是根据BFS算法并优化. 使用方法: string s = "中国|国人|zg人|fuck|a ... 
- C#算法之判断一个字符串是否是对称字符串
		记得曾经一次面试时,面试官给我电脑,让我现场写个算法,判断一个字符串是不是对称字符串.我当时用了几分钟写了一个很简单的代码. 这里说的对称字符串是指字符串的左边和右边字符顺序相反,如"abb ... 
- 基于python 3.5 所做的找出来一个字符串中最长不重复子串算法
		功能:找出来一个字符串中最长不重复子串 def find_longest_no_repeat_substr(one_str): #定义一个列表用于存储非重复字符子串 res_list=[] #获得字符 ... 
- 算法 - 给出一个字符串str,输出包含两个字符串str的最短字符串,如str为abca时,输出则为abcabca
		今天碰到一个算法题觉得比较有意思,研究后自己实现了出来,代码比较简单,如发现什么问题请指正.思路和代码如下: 基本思路:从左开始取str的最大子字符串,判断子字符串是否为str的后缀,如果是则返回st ... 
- 算法:Manacher,给定一个字符串str,返回str中最长回文子串的长度。
		[题目] 给定一个字符串str,返回str中最长回文子串的长度 [举例] str="123", 1 str="abc1234321ab" 7 [暴力破解] 从左 ... 
- 字符串模式匹配算法2 - AC算法
		上篇文章(http://www.cnblogs.com/zzqcn/p/3508442.html)里提到的BF和KMP算法都是单模式串匹配算法,也就是说,模式串只有一个.当需要在字符串中搜索多个关键字 ... 
- 字符串混淆技术应用 设计一个字符串混淆程序 可混淆.NET程序集中的字符串
		关于字符串的研究,目前已经有两篇. 原理篇:字符串混淆技术在.NET程序保护中的应用及如何解密被混淆的字符串 实践篇:字符串反混淆实战 Dotfuscator 4.9 字符串加密技术应对策略 今天来 ... 
- Aho - Corasick string matching algorithm
		Aho - Corasick string matching algorithm 俗称:多模式匹配算法,它是对 Knuth - Morris - pratt algorithm (单模式匹配算法) 形 ... 
随机推荐
- maven创建spring项目之后,启动报错java.lang.ClassNotFoundException: org.springframework.web.context.ContextLoade
			错误: org.apache.catalina.core.StandardContext listenerStart严重: Error configuring application listener ... 
- HDU 1907 John(取火柴博弈2)
			传送门 #include<iostream> #include<cstdio> #include<cstring> using namespace std; int ... 
- HttpModule的基本概念
			注:本文为个人学习摘录,原文地址:http://www.cnblogs.com/stwyhm/archive/2006/08/09/471765.html HttpModule是如何工作的 当一个HT ... 
- Python之Socket&异常处理
			Socket Socket用于描述IP地址和端口号,每个应用程序都是通过它来进行网络请求或者网络应答. socket模块和file模块有相似之处,file主要对某个文件进行打开.读写.关闭操作.soc ... 
- Linux中kettle启动spoon.sh遇到的问题
			aaarticlea/png;base64,iVBORw0KGgoAAAANSUhEUgAAAccAAAAYCAIAAAAAgaGrAAAE1klEQVR4nO1b2ZWrMAylLgpyPa7Gzb 
- PHP中使用CURL(三)
			对 post 提交的数据进行 http_build_query处理,然后再send出去,能实现更好的兼容性,更小的请求数据包. <?php /** * PHP发送Post数据 * @param ... 
- Openjudge-计算概论(A)-第二个重复出现的数
			描述: 给定一个正整数数组(元素的值都大于零),输出数组中第二个重复出现的正整数,如果没有,则输出字符串"NOT EXIST". 输入第一行为整数m,表示有m组数据.其后每组数据分 ... 
- Openjudge-计算概论(A)-数组顺序逆放
			描述: 将一个数组中的值按逆序重新存放.例如,原来的顺序为8,6,5,4,1.要求改为1,4,5,6,8.输入输入为两行:第一行数组中元素的个数n(1<n<100),第二行是n个整数,每两 ... 
- 纯计算监控(Pure computed observables)
			纯计算监控,在knockout 3.2.0里才有,提供了对性能和内存更好的管理.这是因为纯计算监控不包含对他的依赖的订阅.特点有: 防止内存泄漏 降低计算开销:值不再是observed,是一个不会重新 ... 
- Python Tools
			[TOC] Python virtualenv.fabric 和 pip 是 pythoneer 的三大神器 pip pip pip是一个安装和管理Python包的工具,是easy_install的一 ... 
