N-Gram
- Chinese name: 汉语语言模型 (Chinese language model)
- English name: N-Gram
- Definition: computes the sentence with the highest probability
- Based on: the assumption that the occurrence of a word depends only on the preceding n − 1 words
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.[1]
An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram", and so on.
Applications
An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence in the form of an (n − 1)-order Markov model.[2] n-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression. Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability – with larger n, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.
Examples
Field | Unit | Sample sequence | 1-gram sequence | 2-gram sequence | 3-gram sequence |
---|---|---|---|---|---|
Vernacular name | | | unigram | bigram | trigram |
Order of resulting Markov model | | | 0 | 1 | 2 |
Protein sequencing | amino acid | … Cys-Gly-Leu-Ser-Trp … | …, Cys, Gly, Leu, Ser, Trp, … | …, Cys-Gly, Gly-Leu, Leu-Ser, Ser-Trp, … | …, Cys-Gly-Leu, Gly-Leu-Ser, Leu-Ser-Trp, … |
DNA sequencing | base pair | …AGCTTCGA… | …, A, G, C, T, T, C, G, A, … | …, AG, GC, CT, TT, TC, CG, GA, … | …, AGC, GCT, CTT, TTC, TCG, CGA, … |
Computational linguistics | character | …to_be_or_not_to_be… | …, t, o, _, b, e, _, o, r, _, n, o, t, _, t, o, _, b, e, … | …, to, o_, _b, be, e_, _o, or, r_, _n, no, ot, t_, _t, to, o_, _b, be, … | …, to_, o_b, _be, be_, e_o, _or, or_, r_n, _no, not, ot_, t_t, _to, to_, o_b, _be, … |
Computational linguistics | word | … to be or not to be … | …, to, be, or, not, to, be, … | …, to be, be or, or not, not to, to be, … | …, to be or, be or not, or not to, not to be, … |
The table above shows several example sequences and the corresponding 1-gram, 2-gram and 3-gram sequences.
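As a quick illustration of the last two rows of the table, the following Python sketch (the helper name `ngrams` is illustrative, not a standard API) generates the word-level 1-gram, 2-gram and 3-gram sequences.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of `tokens` as a list of tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "to be or not to be".split()
print(ngrams(words, 1))  # unigrams: ('to',), ('be',), ('or',), ('not',), ('to',), ('be',)
print(ngrams(words, 2))  # bigrams:  ('to', 'be'), ('be', 'or'), ('or', 'not'), ...
print(ngrams(words, 3))  # trigrams: ('to', 'be', 'or'), ('be', 'or', 'not'), ...
```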
Here are further examples; these are word-level 3-grams and 4-grams (and counts of the number of times they appeared) from the Google n-gram corpus.[3]
3-grams
- ceramics collectables collectibles (55)
- ceramics collectables fine (130)
- ceramics collected by (52)
- ceramics collectible pottery (50)
- ceramics collectibles cooking (45)
4-grams
- serve as the incoming (92)
- serve as the incubator (99)
- serve as the independent (794)
- serve as the index (223)
- serve as the indication (72)
- serve as the indicator (120)
n-gram models
An n-gram model models sequences, notably natural languages, using the statistical properties of n-grams.
This idea can be traced to an experiment in Claude Shannon's work on information theory. Shannon posed the question: given a sequence of letters (for example, the sequence "for ex"), what is the likelihood of the next letter? From training data, one can derive a probability distribution for the next letter given a history of size $n$: a = 0.4, b = 0.00001, c = 0, …, where the probabilities of all possible "next letters" sum to 1.0.
More concisely, an n-gram model predicts $x_i$ based on $x_{i-(n-1)}, \dots, x_{i-1}$. In probability terms, this is $P(x_i \mid x_{i-(n-1)}, \dots, x_{i-1})$. When used for language modeling, independence assumptions are made so that each word depends only on the last n − 1 words. This Markov model is used as an approximation of the true underlying language. This assumption is important because it massively simplifies the problem of learning the language model from data. In addition, because of the open nature of language, it is common to group words unknown to the language model together.
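To make the Markov assumption concrete, here is a minimal sketch (not taken from any particular toolkit) of a maximum-likelihood bigram model, i.e. n = 2, which estimates P(x_i | x_{i−1}) directly from counts; smoothing and unknown-word handling, discussed below, are deliberately omitted, and all names are illustrative.

```python
from collections import defaultdict

def train_bigram_mle(sentences):
    """Estimate P(word | previous word) by maximum likelihood from raw counts."""
    context_counts = defaultdict(int)      # how often each preceding word occurs as a context
    pair_counts = defaultdict(int)         # how often each (prev, word) pair occurs
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, word)] += 1

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0                     # unseen context: no estimate without smoothing
        return pair_counts[(prev, word)] / context_counts[prev]

    return prob

corpus = ["to be or not to be", "to be is to do"]
p = train_bigram_mle(corpus)
print(p("be", "to"))   # P(be | to) = 3/4 in this toy corpus
print(p("do", "to"))   # P(do | to) = 1/4
```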
Note that in a simple n-gram language model, the probability of a word, conditioned on some number of previous words (one word in a bigram model, two words in a trigram model, etc.) can be described as following a categorical distribution (often imprecisely called a "multinomial distribution").
In practice, the probability distributions are smoothed by assigning non-zero probabilities to unseen words or n-grams; see smoothing techniques.
Applications and considerations
n-gram models are widely used in statistical natural language processing. In speech recognition, phonemes and sequences of phonemes are modeled using an n-gram distribution. For parsing, words are modeled such that each n-gram is composed of n words. For language identification, sequences of characters/graphemes (e.g., letters of the alphabet) are modeled for different languages.[4] For sequences of characters, the 3-grams (sometimes referred to as "trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth (sometimes the beginning and end of a text are modeled explicitly, adding "__g", "_go", "ng_", and "g__"). For sequences of words, the trigrams that can be generated from "the dog smelled like a skunk" are "# the dog", "the dog smelled", "dog smelled like", "smelled like a", "like a skunk" and "a skunk #".
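The examples in the paragraph above can be reproduced with a short sketch along the following lines; the padding character and the "#" boundary marker simply follow the notation used in the text, and the helper names are illustrative.

```python
def char_ngrams(text, n, pad=None):
    """Character n-grams; if `pad` is given, the start and end of the text are
    padded with n-1 copies of it so that boundaries are modelled explicitly."""
    if pad is not None:
        text = pad * (n - 1) + text + pad * (n - 1)
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(sentence, n, boundary="#"):
    """Word n-grams with a single boundary marker at each end of the sentence."""
    tokens = [boundary] + sentence.split() + [boundary]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(char_ngrams("good morning", 3))
# ['goo', 'ood', 'od ', 'd m', ' mo', 'mor', 'orn', 'rni', 'nin', 'ing']
print(char_ngrams("good morning", 3, pad="_"))
# same list plus the boundary trigrams '__g', '_go', ..., 'ng_', 'g__'
print(word_ngrams("the dog smelled like a skunk", 3))
# ['# the dog', 'the dog smelled', ..., 'like a skunk', 'a skunk #']
```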
Practitioners more interested in multiple-word terms might preprocess strings to remove spaces. Many simply collapse whitespace to a single space while preserving paragraph marks, because whitespace is frequently either an element of writing style or introduces layout or presentation not required by the prediction and deduction methodology. Punctuation is also commonly reduced or removed by preprocessing, and is frequently used to trigger functionality.
n-grams can also be used for sequences of words or almost any type of data. For example, they have been used for extracting features for clustering large sets of satellite earth images and for determining what part of the Earth a particular image came from.[5] They have also been very successful as the first pass in genetic sequence search and in the identification of the species from which short sequences of DNA originated.[6]
n-gram models are often criticized because they lack any explicit representation of long range dependency. This is because the only explicit dependency range is (n − 1) tokens for an n-gram model, and since natural languages incorporate many cases of unbounded dependencies (such as wh-movement), this means that an n-gram model cannot in principle distinguish unbounded dependencies from noise (since long range correlations drop exponentially with distance for any Markov model). For this reason, n-gram models have not made much impact on linguistic theory, where part of the explicit goal is to model such dependencies.
Another criticism that has been made is that Markov models of language, including n-gram models, do not explicitly capture the performance/competence distinction. This is because n-gram models are not designed to model linguistic knowledge as such, and make no claims to being (even potentially) complete models of linguistic knowledge; instead, they are used in practical applications.
In practice, n-gram models have been shown to be extremely effective in modeling language data, which is a core component in modern statistical language applications.
Most modern applications that rely on n-gram based models, such as machine translation applications, do not rely exclusively on such models; instead, they typically also incorporate Bayesian inference. Modern statistical models are typically made up of two parts, a prior distribution describing the inherent likelihood of a possible result and a likelihood function used to assess the compatibility of a possible result with observed data. When a language model is used, it is used as part of the prior distribution (e.g. to gauge the inherent "goodness" of a possible translation), and even then it is often not the only component in this distribution.
Handcrafted features of various sorts are also used, for example variables that represent the position of a word in a sentence or the general topic of discourse. In addition, features based on the structure of the potential result, such as syntactic considerations, are often used. Such features are also used as part of the likelihood function, which makes use of the observed data. Conventional linguistic theory can be incorporated in these features (although in practice, it is rare that features specific to generative or other particular theories of grammar are incorporated, as computational linguists tend to be "agnostic" towards individual theories of grammar).
Out-of-vocabulary words
An issue when using n-gram language models is out-of-vocabulary (OOV) words. They are encountered in computational linguistics and natural language processing when the input includes words which were not present in a system's dictionary or database during its preparation. By default, when a language model is estimated, the entire observed vocabulary is used. In some cases, it may be necessary to estimate the language model with a specific fixed vocabulary. In such a scenario, the n-grams in the corpus that contain an out-of-vocabulary word are ignored. The n-gram probabilities are smoothed over all the words in the vocabulary even if they were not observed.[7]
Nonetheless, it is essential in some cases to explicitly model the probability of out-of-vocabulary words by introducing a special token (e.g. <unk>) into the vocabulary. Out-of-vocabulary words in the corpus are effectively replaced with this special <unk> token before n-gram counts are accumulated. With this option, it is possible to estimate the transition probabilities of n-grams involving out-of-vocabulary words.[8]
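A minimal sketch of this replacement step, assuming a fixed vocabulary chosen in advance (the token <unk> follows the text above; the helper name is illustrative):

```python
from collections import Counter

def replace_oov(sentences, vocabulary, unk="<unk>"):
    """Map every out-of-vocabulary token to the special `unk` symbol
    before n-gram counts are accumulated."""
    return [[tok if tok in vocabulary else unk for tok in s.split()]
            for s in sentences]

corpus = ["the cat sat", "the dog sat on the mat"]
vocab = {"the", "cat", "sat", "on"}          # fixed vocabulary chosen in advance
mapped = replace_oov(corpus, vocab)
print(mapped)   # [['the', 'cat', 'sat'], ['the', '<unk>', 'sat', 'on', 'the', '<unk>']]

bigrams = Counter((a, b) for s in mapped for a, b in zip(s, s[1:]))
print(bigrams[("the", "<unk>")])             # transitions involving <unk> can now be counted
```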
n-grams for approximate matching
n-grams can also be used for efficient approximate matching. By converting a sequence of items to a set of n-grams, it can be embedded in a vector space, thus allowing the sequence to be compared to other sequences in an efficient manner. For example, if we convert strings with only letters in the English alphabet into single character 3-grams, we get a $26^3$-dimensional space (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters). Using this representation, we lose information about the string. For example, both the strings "abc" and "bca" give rise to exactly the same 2-gram "bc" (although {"ab", "bc"} is clearly not the same as {"bc", "ca"}). However, we know empirically that if two strings of real text have a similar vector representation (as measured by cosine distance) then they are likely to be similar. Other metrics have also been applied to vectors of n-grams with varying, sometimes better, results. For example, z-scores have been used to compare documents by examining how many standard deviations each n-gram differs from its mean occurrence in a large collection, or text corpus, of documents (which form the "background" vector). In the event of small counts, the g-score (also known as g-test) may give better results for comparing alternative models.
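As a rough sketch of this kind of embedding (pure Python, no particular library assumed), the following represents strings as sparse character-trigram count vectors and compares them with cosine similarity:

```python
from collections import Counter
from math import sqrt

def trigram_vector(text):
    """Sparse count vector of character 3-grams, keyed by the trigram itself."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

a = trigram_vector("good morning everyone")
b = trigram_vector("good morning everybody")
c = trigram_vector("completely different text")
print(cosine(a, b))   # high: the two strings share most of their trigrams
print(cosine(a, c))   # low: few trigrams in common
```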
Another method for approximate matching is signature files. The study reported in [9] shows that a bit-sliced signature file can be compressed to a smaller size than an inverted file which is the standard way of implementing a vector space approach. With a signature width less than half the number of unique n-grams, the signature file method is about as fast as the inverted file method, and significantly smaller.
It is also possible to take a more principled approach to the statistics of n-grams, modeling similarity as the likelihood that two strings came from the same source directly in terms of a problem in Bayesian inference.
n-gram-based searching can also be used for plagiarism detection.
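One simple possibility, sketched here as an illustration rather than any specific published method, is to compare documents by the overlap (Jaccard similarity) of their word n-gram sets:

```python
def word_ngram_set(text, n=3):
    """Set of word n-grams used to compare documents for overlap."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(doc_a, doc_b, n=3):
    """Fraction of shared word n-grams; a high value suggests copied passages."""
    a, b = word_ngram_set(doc_a, n), word_ngram_set(doc_b, n)
    return len(a & b) / len(a | b) if a | b else 0.0

original = "the quick brown fox jumps over the lazy dog"
suspect = "a quick brown fox jumps over a sleeping dog"
print(jaccard(original, suspect))  # shared trigrams such as ('quick', 'brown', 'fox')
```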
Other applications
n-grams find use in several areas of computer science, computational linguistics, and applied mathematics.
They have been used to:
- design kernels that allow machine learning algorithms such as support vector machines to learn from string data
- find likely candidates for the correct spelling of a misspelled word
- improve compression in compression algorithms where a small area of data requires n-grams of greater length
- assess the probability of a given word sequence appearing in text of a language of interest in pattern recognition systems, speech recognition, OCR (optical character recognition), Intelligent Character Recognition (ICR), machine translation and similar applications
- improve retrieval in information retrieval systems when it is hoped to find similar "documents" (a term for which the conventional meaning is sometimes stretched, depending on the data set) given a single query document and a database of reference documents
- improve retrieval performance in genetic sequence analysis as in the BLAST family of programs
- identify the language a text is in or the species a small sequence of DNA was taken from
- predict letters or words at random in order to create text, as in the dissociated press algorithm.
- cryptanalysis
Bias-versus-variance trade-off
To choose a value for n in an n-gram model, it is necessary to find the right trade-off between the stability of the estimate and its appropriateness. This means that a trigram model (i.e., triplets of words) is a common choice with large training corpora (millions of words), whereas a bigram model is often used with smaller ones.
Smoothing techniques
There is a problem of balancing the weight given to infrequent n-grams (for example, if a proper name appeared in the training data) and to frequent n-grams. Also, items not seen in the training data will be given a probability of 0.0 without smoothing. For unseen but plausible data from a sample, one can introduce pseudocounts. Pseudocounts are generally motivated on Bayesian grounds.
In practice it is necessary to smooth the probability distributions by also assigning non-zero probabilities to unseen words or n-grams. The reason is that models derived directly from the n-gram frequency counts have severe problems when confronted with any n-grams that have not explicitly been seen before – the zero-frequency problem. Various smoothing methods are used, from simple "add-one" (Laplace) smoothing (assign a count of 1 to unseen n-grams; see Rule of succession) to more sophisticated models, such as Good–Turing discounting or back-off models. Some of these methods are equivalent to assigning a prior distribution to the probabilities of the n-grams and using Bayesian inference to compute the resulting posterior n-gram probabilities. However, the more sophisticated smoothing models were typically not derived in this fashion, but instead through independent considerations.
- Linear interpolation (e.g., taking the weighted mean of the unigram, bigram, and trigram)
- Good–Turing discounting
- Witten–Bell discounting
- Lidstone's smoothing
- Katz's back-off model (trigram)
- Kneser–Ney smoothing
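As an illustration of the simplest of these methods, the following sketch applies add-one (Laplace) smoothing to a bigram model; the class and variable names are illustrative only.

```python
from collections import defaultdict

class LaplaceBigramModel:
    """Bigram model with add-one (Laplace) smoothing: every unseen bigram
    effectively receives a count of 1 instead of 0."""

    def __init__(self, sentences):
        self.pair_counts = defaultdict(int)
        self.context_counts = defaultdict(int)
        self.vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.pair_counts[(prev, word)] += 1
                self.context_counts[prev] += 1

    def prob(self, word, prev):
        # Add one to every bigram count; normalise by the context count plus |V|.
        v = len(self.vocab)
        return (self.pair_counts[(prev, word)] + 1) / (self.context_counts[prev] + v)

model = LaplaceBigramModel(["to be or not to be", "to be is to do"])
print(model.prob("be", "to"))      # seen bigram: its count is boosted and renormalised
print(model.prob("maybe", "to"))   # unseen bigram: small but non-zero probability
```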
Skip-gram
In the field of computational linguistics, in particular language modeling, skip-grams[10] are a generalization of n-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over.[11] They provide one way of overcoming the data sparsity problem found with conventional n-gram analysis.
Formally, an n-gram is a consecutive subsequence of length n of some sequence of tokens $w_1 \dots w_n$. A k-skip-n-gram is a length-n subsequence where the components occur at distance at most k from each other.
For example, in the input text:
- the rain in Spain falls mainly on the plain
the set of 1-skip-2-grams includes all the bigrams (2-grams), and in addition the subsequences
- the in, rain Spain, in falls, Spain mainly, falls on, mainly the, and on plain.
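A sketch of how such skip-grams can be enumerated; the function below is a straightforward reading of the definition above (each adjacent pair of chosen tokens may skip up to k intervening tokens), not a reference implementation.

```python
from itertools import combinations

def skip_grams(tokens, n, k):
    """k-skip-n-grams: length-n subsequences in which each pair of adjacent
    chosen tokens may skip up to k intervening tokens."""
    results = []
    for start in range(len(tokens) - n + 1):
        # the furthest index any member of this skip-gram could occupy
        horizon = min(len(tokens), start + 1 + (n - 1) * (k + 1))
        for rest in combinations(range(start + 1, horizon), n - 1):
            indices = (start,) + rest
            if all(b - a <= k + 1 for a, b in zip(indices, indices[1:])):
                results.append(tuple(tokens[i] for i in indices))
    return results

sentence = "the rain in Spain falls mainly on the plain".split()
for gram in skip_grams(sentence, n=2, k=1):
    print(" ".join(gram))
# prints every ordinary bigram plus the skipped pairs listed above,
# e.g. "the in", "rain Spain", "in falls", ...
```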
Syntactic n-grams
Syntactic n-grams are n-grams defined by paths in syntactic dependency or constituent trees rather than the linear structure of the text.[12][13][14] For example, the sentence "economic news has little effect on financial markets" can be transformed to syntactic n-grams following the tree structure of its dependency relations: news-economic, effect-little, effect-on-markets-financial.[12]
Syntactic n-grams are intended to reflect syntactic structure more faithfully than linear n-grams, and have many of the same applications, especially as features in a Vector Space Model. For certain tasks, syntactic n-grams give better results than standard n-grams, for example, for authorship attribution.[15]
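A toy sketch of extracting syntactic 2-grams (head-dependent pairs) is shown below; the dependency heads for the example sentence are written out by hand here, whereas a real system would obtain them from a dependency parser.

```python
def syntactic_bigrams(heads):
    """Head-dependent pairs (syntactic 2-grams) from a dependency tree given as
    a child -> head mapping; the root has head None."""
    return [(head, child) for child, head in heads.items() if head is not None]

# Hand-written dependency heads for
# "economic news has little effect on financial markets".
heads = {
    "economic": "news",
    "news": "has",
    "has": None,          # root of the sentence
    "little": "effect",
    "effect": "has",
    "on": "effect",
    "markets": "on",
    "financial": "markets",
}
for head, dep in syntactic_bigrams(heads):
    print(f"{head}-{dep}")   # e.g. news-economic, effect-little, markets-financial
```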