Trigrams and bigrams in NLTK

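When processing English text, we often need to extract phrases for topic extraction, especially from domain-specific material such as financial annual reports. The core task is extracting bigrams (two-word sequences) and trigrams (three-word sequences). This article shows how to do that.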

1. nltk.bigrams(tokens) and nltk.trigrams(tokens)

If you only need to enumerate every bigram or trigram, you can call NLTK's bigrams() or trigrams() directly, as in the following code:

 >>> import nltk

 >>> text='you are my sunshine, and all of things are so beautiful just for you.'

 >>> tokens=nltk.wordpunct_tokenize(text)

 >>> bigram=nltk.bigrams(tokens)

 >>> bigram

 <generator object bigrams at 0x025C1C10>

 >>> list(bigram)

 [('you', 'are'), ('are', 'my'), ('my', 'sunshine'), ('sunshine', ','), (',', 'and'), ('and', 'all'), ('all', 'of'), ('of', 'things'), ('things', 'are'), ('are', 'so'), ('so', 'beautiful'), ('beautiful', 'just'), ('just', 'for'), ('for', 'you'), ('you', '.')]

 >>> trigram=nltk.trigrams(tokens)

 >>> list(trigram)

 [('you', 'are', 'my'), ('are', 'my', 'sunshine'), ('my', 'sunshine', ','), ('sunshine', ',', 'and'), (',', 'and', 'all'), ('and', 'all', 'of'), ('all', 'of', 'things'), ('of', 'things', 'are'), ('things', 'are', 'so'), ('are', 'so', 'beautiful'), ('so', 'beautiful', 'just'), ('beautiful', 'just', 'for'), ('just', 'for', 'you'), ('for', 'you', '.')]

2. nltk.ngrams(tokens, n)

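To enumerate 4-grams or even longer n-grams, use the general function ngrams(tokens, n), where n is the number of words in each n-gram. It gives one uniform interface for every n, as in the following code: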

 >>> nltk.ngrams(tokens, 2)

 <generator object ngrams at 0x027AAF30>

 >>> list(nltk.ngrams(tokens,2))

 [('you', 'are'), ('are', 'my'), ('my', 'sunshine'), ('sunshine', ','), (',', 'and'), ('and', 'all'), ('all', 'of'), ('of', 'things'), ('things', 'are'), ('are', 'so'), ('so', 'beautiful'), ('beautiful', 'just'), ('just', 'for'), ('for', 'you'), ('you', '.')]

 >>> list(nltk.ngrams(tokens,3))

 [('you', 'are', 'my'), ('are', 'my', 'sunshine'), ('my', 'sunshine', ','), ('sunshine', ',', 'and'), (',', 'and', 'all'), ('and', 'all', 'of'), ('all', 'of', 'things'), ('of', 'things', 'are'), ('things', 'are', 'so'), ('are', 'so', 'beautiful'), ('so', 'beautiful', 'just'), ('beautiful', 'just', 'for'), ('just', 'for', 'you'), ('for', 'you', '.')]

 >>> list(nltk.ngrams(tokens,4))

 [('you', 'are', 'my', 'sunshine'), ('are', 'my', 'sunshine', ','), ('my', 'sunshine', ',', 'and'), ('sunshine', ',', 'and', 'all'), (',', 'and', 'all', 'of'), ('and', 'all', 'of', 'things'), ('all', 'of', 'things', 'are'), ('of', 'things', 'are', 'so'), ('things', 'are', 'so', 'beautiful'), ('are', 'so', 'beautiful', 'just'), ('so', 'beautiful', 'just', 'for'), ('beautiful', 'just', 'for', 'you'), ('just', 'for', 'you', '.')]
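To make the behaviour of ngrams(tokens, n) concrete, here is a minimal pure-Python sketch of the same sliding-window idea. The helper name sliding_ngrams is made up for illustration and is not part of NLTK; NLTK's own implementation may differ in detail (for example, it also supports padding).

```python
def sliding_ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple (illustrative helper, not NLTK)."""
    tokens = list(tokens)
    # A sequence of length L contains L - n + 1 windows of size n (none if L < n).
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


tokens = "you are my sunshine".split()
bigrams = list(sliding_ngrams(tokens, 2))
fourgrams = list(sliding_ngrams(tokens, 4))
print(bigrams)
print(fourgrams)
```

Like the NLTK functions, the sketch is a generator, so the result has to be wrapped in list() before it can be printed or indexed.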

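3. Classes in nltk.collocations

nltk.collocations provides three collocation finder classes: BigramCollocationFinder, TrigramCollocationFinder, and QuadgramCollocationFinder.

1) BigramCollocationFinder

This is a tool for finding bigrams and ranking them. You normally build a finder with from_words() rather than constructing an instance directly. The finder's main methods are: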

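above_score(self, score_fn, min_score): returns the n-grams whose score exceeds min_score, ordered from highest to lowest score. For this finder the n-grams are bigrams. Several scoring functions are available; they are described in section 5.

apply_freq_filter(self, min_freq): removes n-grams that occur fewer than min_freq times.

apply_ngram_filter(self, fn): removes n-grams that satisfy fn. The condition is evaluated on the n-gram as a whole: fn receives all of its words as separate arguments, and the n-gram is removed if fn returns true.

apply_word_filter(self, fn): removes n-grams that satisfy fn. The condition is evaluated on each word in turn: if fn returns true for any single word, the whole n-gram is removed.

nbest(self, score_fn, n): returns the n highest-scoring n-grams.

score_ngrams(self, score_fn): returns a sequence of (n-gram, score) pairs, ordered from highest to lowest score.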

 >>> finder=nltk.collocations.BigramCollocationFinder.from_words(tokens)

 >>> bigram_measures=nltk.collocations.BigramAssocMeasures()

 >>> finder.nbest(bigram_measures.pmi, 10)

 [(',', 'and'), ('all', 'of'), ('and', 'all'), ('beautiful', 'just'), ('just', 'for'), ('my', 'sunshine'), ('of', 'things'), ('so', 'beautiful'), ('sunshine', ','), ('are', 'my')]

 >>> finder.nbest(bigram_measures.pmi, 100)

 [(',', 'and'), ('all', 'of'), ('and', 'all'), ('beautiful', 'just'), ('just', 'for'), ('my', 'sunshine'), ('of', 'things'), ('so', 'beautiful'), ('sunshine', ','), ('are', 'my'), ('are', 'so'), ('for', 'you'), ('things', 'are'), ('you', '.'), ('you', 'are')]

 >>> finder.apply_ngram_filter(lambda w1,w2: w1 in [',', '.'] and w2 in [',', '.'] )

 >>> finder.nbest(bigram_measures.pmi, 100)

 [(',', 'and'), ('all', 'of'), ('and', 'all'), ('beautiful', 'just'), ('just', 'for'), ('my', 'sunshine'), ('of', 'things'), ('so', 'beautiful'), ('sunshine', ','), ('are', 'my'), ('are', 'so'), ('for', 'you'), ('things', 'are'), ('you', '.'), ('you', 'are')]

 >>> finder.apply_word_filter(lambda x: x in [',', '.'])

 >>> finder.nbest(bigram_measures.pmi, 100)

 [('all', 'of'), ('and', 'all'), ('beautiful', 'just'), ('just', 'for'), ('my', 'sunshine'), ('of', 'things'), ('so', 'beautiful'), ('are', 'my'), ('are', 'so'), ('for', 'you'), ('things', 'are'), ('you', 'are')]
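In the session above, the n-gram filter leaves the list unchanged because no bigram consists of two punctuation marks, while the word filter removes every bigram that contains one. The difference can be mimicked with plain list comprehensions. This is only a sketch of the semantics, not NLTK's implementation; the helper names and the extra candidate (',', '.') are invented for illustration.

```python
PUNCT = {",", "."}

# A few candidate bigrams; (",", ".") is a hypothetical pair added to show the n-gram filter firing.
candidates = [("sunshine", ","), (",", "and"), ("and", "all"), (",", ".")]


def keep_after_ngram_filter(ngrams, fn):
    # fn sees the whole n-gram at once, with its words passed as separate arguments.
    return [ng for ng in ngrams if not fn(*ng)]


def keep_after_word_filter(ngrams, fn):
    # fn sees one word at a time; a single match is enough to drop the n-gram.
    return [ng for ng in ngrams if not any(fn(w) for w in ng)]


both_punct = keep_after_ngram_filter(candidates, lambda w1, w2: w1 in PUNCT and w2 in PUNCT)
any_punct = keep_after_word_filter(candidates, lambda w: w in PUNCT)
print(both_punct)
print(any_punct)
```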

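2) TrigramCollocationFinder and QuadgramCollocationFinder

These are used in the same way as BigramCollocationFinder, except that TrigramCollocationFinder produces a trigram finder and QuadgramCollocationFinder produces a 4-gram finder. The methods are the same as above.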
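As a short sketch of the trigram case, the snippet below repeats the bigram workflow with TrigramCollocationFinder on the same sample sentence. It assumes NLTK is installed; the choice of PMI and of the top 5 results is arbitrary.

```python
import nltk
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

text = "you are my sunshine, and all of things are so beautiful just for you."
tokens = nltk.wordpunct_tokenize(text)

# Build the finder from the token list, as with the bigram finder.
finder = TrigramCollocationFinder.from_words(tokens)

# Drop every trigram that contains a punctuation token.
finder.apply_word_filter(lambda w: w in (",", "."))

# Rank the remaining trigrams by pointwise mutual information and keep the best 5.
top_trigrams = finder.nbest(TrigramAssocMeasures.pmi, 5)
print(top_trigrams)
```

Every trigram in this tiny sample occurs only once, so the ranking itself carries little meaning here; on a real corpus, apply_freq_filter() is usually worth applying before scoring with PMI.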

4. Counting n-gram frequencies

>>> sorted(finder.ngram_fd.items(), key=lambda t: (-t[1], t[0]))[:10]

[(('all', 'of'), 1), (('and', 'all'), 1), (('are', 'my'), 1), (('are', 'so'), 1), (('beautiful', 'just'), 1), (('for', 'you'), 1), (('just', 'for'), 1), (('my', 'sunshine'), 1), (('of', 'things'), 1), (('so', 'beautiful'), 1)]

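### Here key is the sort key: items are sorted first by t[1] (the frequency), with the minus sign giving descending order, and then by the n-gram itself (t[0]), which sorts a-z by default.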
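The same counting and sorting can be sketched without a finder, using collections.Counter from the standard library. The sample sentence here is a different, made-up one, chosen so that one bigram repeats and the two-level sort key has something to do.

```python
from collections import Counter

tokens = "to be or not to be".split()

# Pair each token with its successor to get the bigrams, then count them.
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Sort by descending frequency, breaking ties alphabetically by the bigram itself.
ranked = sorted(bigram_counts.items(), key=lambda t: (-t[1], t[0]))
print(ranked)
# [(('to', 'be'), 2), (('be', 'or'), 1), (('not', 'to'), 1), (('or', 'not'), 1)]
```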

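5. Scoring functions

NgramAssocMeasures (defined in nltk.metrics.association) and its subclasses, such as the BigramAssocMeasures class exposed through nltk.collocations, provide several scoring functions:

chi_sq(cls, n_ii, n_ix_xi_tuple, n_xx): scores each n-gram using the chi-square test.

pmi(cls, *marginals): scores each n-gram using pointwise mutual information.

likelihood_ratio(cls, *marginals): scores each n-gram using the likelihood ratio.

student_t(cls, *marginals): scores each n-gram using Student's t-test, under the hypothesis that the unigrams are independent.

These are the most commonly used scores. Many others are available as well, for example poisson_stirling, jaccard, fisher, and phi_sq.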

 >>> bigram_measures=nltk.collocations.BigramAssocMeasures()

 >>> bigram_measures.student_t(8, (15828, 4675), 14307668)

 0.9999319894802036

 >>> bigram_measures.student_t(8, (42, 20), 14307668)

 2.828406367705413

 >>> bigram_measures.chi_sq(8, (15828, 4675), 14307668)

 1.5488692067282201

 >>> bigram_measures.chi_sq(59, (67, 65), 571007)

 456399.76190356724

 >>> bigram_measures.likelihood_ratio(110, (2552, 221), 31777)

 270.721876936225

 >>> bigram_measures.pmi(110, (2552, 221), 31777)

 2.6317398492166078

 >>> bigram_measures.pmi

 <bound method type.pmi of <class 'nltk.metrics.association.BigramAssocMeasures'>>

 >>> bigram_measures.likelihood_ratio

 <bound method type.likelihood_ratio of <class 'nltk.metrics.association.BigramAssocMeasures'>>

 >>> bigram_measures.chi_sq

 <bound method type.chi_sq of <class 'nltk.metrics.association.BigramAssocMeasures'>>

 >>> bigram_measures.student_t

 <bound method type.student_t of <class 'nltk.metrics.association.BigramAssocMeasures'>>
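To show what the arguments in the calls above mean, the following sketch recomputes the bigram PMI and Student's t scores by hand. It assumes the usual reading of the marginals: n_ii is the count of the bigram (w1, w2), n_ix the count of bigrams starting with w1, n_xi the count of bigrams ending with w2, and n_xx the total number of bigrams. The functions are simplified re-derivations for illustration, not NLTK's code.

```python
import math


def pmi(n_ii, n_ix, n_xi, n_xx):
    # log2 of the observed bigram probability over the probability expected under independence.
    return math.log2(n_ii * n_xx / (n_ix * n_xi))


def student_t(n_ii, n_ix, n_xi, n_xx):
    # Bigram count expected if w1 and w2 were independent.
    expected = n_ix * n_xi / n_xx
    # Observed minus expected, scaled by the square root of the observed count.
    return (n_ii - expected) / math.sqrt(n_ii)


print(pmi(110, 2552, 221, 31777))
print(student_t(8, 15828, 4675, 14307668))
```

Up to floating-point rounding, these agree with the values 2.6317398492166078 and 0.9999319894802036 shown in the session above.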

6. Ranking and correlation

It is useful to consider the results of finding collocations as a ranking, and the rankings output using different association measures can be compared using the Spearman correlation coefficient.

Ranks can be assigned to a sorted list of results trivially by assigning strictly increasing ranks to each result:

>>> from nltk.metrics.spearman import *

>>> results_list = ['item1', 'item2', 'item3', 'item4', 'item5']

>>> print(list(ranks_from_sequence(results_list)))

[('item1', 0), ('item2', 1), ('item3', 2), ('item4', 3), ('item5', 4)]

If scores are available for each result, we may allow sufficiently similar results (differing by no more than rank_gap) to be assigned the same rank:

>>> results_scored = [('item1', 50.0), ('item2', 40.0), ('item3', 38.0),

...                   ('item4', 35.0), ('item5', 14.0)]

>>> print(list(ranks_from_scores(results_scored, rank_gap=5)))

[('item1', 0), ('item2', 1), ('item3', 1), ('item4', 1), ('item5', 4)]

The Spearman correlation coefficient gives a number from -1.0 to 1.0 comparing two rankings. A coefficient of 1.0 indicates identical rankings; -1.0 indicates exact opposite rankings.

>>> print('%0.1f' % spearman_correlation(

...         ranks_from_sequence(results_list),

...         ranks_from_sequence(results_list)))

1.0

>>> print('%0.1f' % spearman_correlation(

...         ranks_from_sequence(reversed(results_list)),

...         ranks_from_sequence(results_list)))

-1.0

>>> results_list2 = ['item2', 'item3', 'item1', 'item5', 'item4']

>>> print('%0.1f' % spearman_correlation(

...        ranks_from_sequence(results_list),

...        ranks_from_sequence(results_list2)))

0.6

>>> print('%0.1f' % spearman_correlation(

...        ranks_from_sequence(reversed(results_list)),

...        ranks_from_sequence(results_list2)))

-0.6
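For rankings without ties, the coefficient reported above follows the classic formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between an item's two ranks. A minimal sketch, assuming both orderings contain exactly the same items and at least two of them (the function name is made up and is not part of NLTK):

```python
def spearman_from_orderings(order_a, order_b):
    """Spearman correlation of two tie-free orderings of the same items (illustrative helper)."""
    rank_a = {item: rank for rank, item in enumerate(order_a)}
    rank_b = {item: rank for rank, item in enumerate(order_b)}
    n = len(rank_a)
    # Sum of squared rank differences over all items.
    d_squared = sum((rank_a[item] - rank_b[item]) ** 2 for item in rank_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


results_list = ['item1', 'item2', 'item3', 'item4', 'item5']
results_list2 = ['item2', 'item3', 'item1', 'item5', 'item4']
print('%0.1f' % spearman_from_orderings(results_list, results_list2))  # 0.6
```

For results_list and results_list2 the squared rank differences sum to 8, so the coefficient is 1 - 48/120 = 0.6, matching the NLTK result.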
