Natural Language 18.2: NLTK Named Entity Recognition

Reference: http://blog.csdn.net/u010718606/article/details/50148261

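NLTK offers out-of-the-box APIs for many natural language processing tasks, but the results are often hard to make sense of.
The examples below use NLTK for named entity recognition. In the first example, Apple is recognized as a named entity; in the second it is not. Let's find out what causes the difference.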

In [1]: import nltk

In [2]: tokens = nltk.word_tokenize('I am very excited about the next generation of Apple products.')

In [3]: tokens = nltk.pos_tag(tokens)

In [4]: print(tokens)
[('I', 'PRP'), ('am', 'VBP'), ('very', 'RB'), ('excited', 'JJ'), ('about', 'IN'), ('the', 'DT'), ('next', 'JJ'), ('generation', 'NN'), ('of', 'IN'), ('Apple', 'NNP'), ('products', 'NNS'), ('.', '.')]

In [5]: tree = nltk.ne_chunk(tokens)

In [6]: print(tree)
(S
  I/PRP
  am/VBP
  very/RB
  excited/JJ
  about/IN
  the/DT
  next/JJ
  generation/NN
  of/IN
  (GPE Apple/NNP)
  products/NNS
  ./.)

In [7]: tokens = nltk.word_tokenize('I bought these Apple products today.')

In [8]: tokens = nltk.pos_tag(tokens)

In [9]: print(tokens)
[('I', 'PRP'), ('bought', 'VBD'), ('these', 'DT'), ('Apple', 'NNP'), ('products', 'NNS'), ('today', 'NN'), ('.', '.')]

In [10]: tree = nltk.ne_chunk(tokens)

In [11]: print(tree)
(S I/PRP bought/VBD these/DT Apple/NNP products/NNS today/NN ./.)

The maximum entropy algorithm

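Notice that in both examples the word Apple is POS-tagged NNP (in the Penn Treebank II tag set, NNP marks a singular proper noun), and in both it starts with a capital letter. So why is Apple labeled GPE (geo-political entity) in example 1 but left unlabeled in example 2? And why is it labeled GPE rather than ORG (organization)?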

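NLTK's named entity recognizer is built on a MaxEnt (maximum entropy) classifier. A MaxEnt classifier follows two principles: (1) keep the distribution as uniform as possible (that is, maximize entropy), and (2) keep its probabilities consistent with the empirical data. The empirical data comes from a corpus that was labeled by hand, which is why most labeled data is not free. NLTK does not ship the corpus used to train its named entity recognizer (the training data comes from ACE, Automatic Content Extraction). What NLTK does ship is a pickle file (under nltk_data/chunkers/), and that pickle file is the trained MaxEnt classifier instance.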

➜  maxent_ne_chunker  tree
.
├── english_ace_binary.pickle
├── english_ace_multiclass.pickle
└── PY3
    ├── english_ace_binary.pickle
    └── english_ace_multiclass.pickle
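
To make the first MaxEnt principle (keep the distribution as uniform as possible) concrete, here is a small sketch that compares the entropy of a uniform distribution with a skewed one. The two distributions are made-up numbers for illustration only; they do not come from NLTK or the ACE data.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two hypothetical distributions over four labels (illustrative values only).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]

print("uniform:", entropy(uniform))
print("skewed: ", entropy(skewed))
```

The uniform distribution has the higher entropy. A MaxEnt model moves away from it only as far as the training data requires.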

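A good supervised learning algorithm depends on good features. In named entity recognition, a feature might be whether the word contains a capital letter. So which features does NLTK use? Here is the list:

- Shape of the word (contains digits / starts with a capital letter / contains symbols)
- Length of the word
- First three letters of the word
- Last three letters of the word
- POS tag of the word
- The word itself
- Whether the word appears in an English word list (en-wordlist in the output below)
- Tag assigned to the preceding word (prevtag in the output below)
- POS tag of the previous word
- POS tag of the next word
- The previous word itself
- The next word itself
- …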
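
As a rough illustration of the list above, the sketch below builds a feature dictionary for one token. The function names word_shape and token_features are hypothetical and this is not NLTK's own feature detector. The word-list feature is left out because it needs an external dictionary, and the feature names simply follow those printed in the reports further down.

```python
import re

def word_shape(word):
    """Coarse shape of a token (hypothetical helper, not NLTK's own code)."""
    if re.fullmatch(r"\d+(\.\d+)?", word):
        return "number"
    if re.fullmatch(r"\W+", word):
        return "punct"
    if re.fullmatch(r"[A-Z][a-z]+", word):
        return "upcase"
    if re.fullmatch(r"[a-z]+", word):
        return "downcase"
    return "mixedcase"

def token_features(tagged, i, history):
    """Build a feature dict for token i.

    tagged  -- list of (word, pos) pairs
    history -- entity tags already chosen for tokens 0..i-1
    """
    word, pos = tagged[i]
    prevword, prevpos = tagged[i - 1] if i > 0 else (None, None)
    nextword, nextpos = tagged[i + 1] if i + 1 < len(tagged) else (None, None)
    return {
        "shape": word_shape(word),
        "wordlen": len(word),
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        "pos": pos,
        "word": word,
        "prevtag": history[i - 1] if i > 0 else None,
        "prevpos": prevpos,
        "nextpos": nextpos,
        "prevword": prevword,
        "nextword": nextword,
    }

# A three-token window around "Apple" from the first example sentence.
tagged = [("of", "IN"), ("Apple", "NNP"), ("products", "NNS")]
features = token_features(tagged, 1, ["O"])
print(features)
```

Each key/value pair becomes one binary feature of the form name==value, and the classifier learns one weight per feature and label.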

The code below lists the labels NLTK uses and explains why each word received its tag:

import nltk

# Load the pickled chunker. The code below relies on the private _tagger
# attribute, so it may break across NLTK versions.
chunker = nltk.data.load('chunkers/maxent_ne_chunker/english_ace_multiclass.pickle')

# The underlying maximum entropy classifier
maxEnt = chunker._tagger.classifier()

def maxEnt_report():
    print("These are the labels used by the NLTK's NEC...")
    print(maxEnt.labels())
    print('')
    print('These are the most informative features found in the ACE corpora...')
    maxEnt.show_most_informative_features()

def ne_report(sentence, report_all=False):
    # Tokenize and POS-tag the sentence
    tokens = nltk.word_tokenize(sentence)
    tokens = nltk.pos_tag(tokens)
    tags = []
    for i in range(len(tokens)):
        # Features and predicted tag for token i, given the tags chosen so far
        featureset = chunker._tagger.feature_detector(tokens, i, tags)
        tag = chunker._tagger.choose_tag(tokens, i, tags)
        if tag != 'O' or report_all:
            print("\nExplanation on the why the word '" + tokens[i][0] + "' was tagged:")
            maxEnt.explain(featureset)
        tags.append(tag)

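The report below lists the labels NLTK uses. The prefixes "I-" and "B-" and the tag "O" stand for inside, begin, and outside. When a chunk starts, its first word gets the "B-" prefix to mark the beginning of the chunk. The next word, if it belongs to the same chunk, gets the "I-" prefix: it is part of the chunk but not its start. A word that belongs to no chunk is tagged "O", meaning it lies outside every chunk.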

➜  test  python dd.py
These are the labels used by the NLTK's NEC...
['I-GSP', 'B-LOCATION', 'B-GPE', 'I-ORGANIZATION', 'I-PERSON', 'O', 'I-FACILITY', 'I-LOCATION', 'B-PERSON', 'B-FACILITY', 'B-GSP', 'B-ORGANIZATION', 'I-GPE']

These are the most informative features found in the ACE corpora...
  10.125 bias==True and label is 'O'
   6.631 suffix3=='day' and label is 'O'
  -6.207 bias==True and label is 'I-GSP'
   5.628 prevtag=='O' and label is 'O'
  -4.740 shape=='upcase' and label is 'O'
   4.106 shape+prevtag=='<function shape at 0x8bde0d4>+O' and label is 'O'
  -3.994 shape=='mixedcase' and label is 'O'
   3.992 pos+prevtag=='NNP+B-PERSON' and label is 'I-PERSON'
   3.890 prevtag=='I-ORGANIZATION' and label is 'I-ORGANIZATION'
   3.879 shape+prevtag=='<function shape at 0x8bde0d4>+I-ORGANIZATION' and label is 'I-ORGANIZATION'

Note:
- GPE is Geo-Political Entity
- GSP is Geo-Socio-Political group
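
The B-/I-/O scheme behind the label list above can be sketched in a few lines. The helper to_bio and the sample sentence are hypothetical and chosen only to show a two-word entity; NLTK's chunker produces these tags internally.

```python
def to_bio(chunks):
    """Flatten chunks into (word, tag) pairs using B-/I-/O tags.

    chunks -- list whose items are either a plain word (outside any entity)
              or a (label, [words]) pair describing one entity chunk
    """
    tagged = []
    for chunk in chunks:
        if isinstance(chunk, str):
            # A bare word belongs to no chunk.
            tagged.append((chunk, "O"))
            continue
        label, words = chunk
        for position, word in enumerate(words):
            # The first word opens the chunk, the rest continue it.
            prefix = "B-" if position == 0 else "I-"
            tagged.append((word, prefix + label))
    return tagged

# Made-up sentence with one two-word entity and one single-word entity.
chunks = [("PERSON", ["Barack", "Obama"]), "visited", ("GPE", ["Paris"]), "."]
for word, tag in to_bio(chunks):
    print(word, tag)
```

A single-word entity such as Apple in example 1 only ever receives a B- tag, which is why the reports below compare B-GPE, B-ORGANIZATION and O.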

Output for example 1:

Explanation on the why the word 'Apple' was tagged:
  Feature                                            B-GPE       O B-ORGAN   B-GSP
  --------------------------------------------------------------------------------
  prevtag=='O' (1)                                   3.767
  shape=='upcase' (1)                                2.701
  pos+prevtag=='NNP+O' (1)                           2.254
  en-wordlist==False (1)                             2.095
  label is 'B-GPE' (1)                              -2.005
  bias==True (1)                                    -1.975
  prevword=='of' (1)                                 0.742
  pos=='NNP' (1)                                     0.681
  nextpos=='nns' (1)                                 0.661
  prevpos=='IN' (1)                                  0.311
  wordlen==5 (1)                                     0.113
  nextword=='products' (1)                           0.060
  bias==True (1)                                            10.125
  prevtag=='O' (1)                                           5.628
  shape=='upcase' (1)                                       -4.740
  prevpos=='IN' (1)                                         -1.668
  label is 'O' (1)                                          -1.075
  pos=='NNP' (1)                                            -1.024
  suffix3=='ple' (1)                                         0.797
  en-wordlist==False (1)                                     0.698
  wordlen==5 (1)                                            -0.449
  prevword=='of' (1)                                        -0.217
  nextpos=='nns' (1)                                         0.104
  prefix3=='app' (1)                                         0.089
  pos+prevtag=='NNP+O' (1)                                   0.011
  nextword=='products' (1)                                   0.005
  prevtag=='O' (1)                                                   3.389
  pos+prevtag=='NNP+O' (1)                                           1.725
  bias==True (1)                                                     0.955
  en-wordlist==False (1)                                             0.837
  label is 'B-ORGANIZATION' (1)                                      0.718
  nextpos=='nns' (1)                                                 0.365
  wordlen==5 (1)                                                    -0.351
  pos=='NNP' (1)                                                     0.174
  prevpos=='IN' (1)                                                 -0.139
  prevword=='of' (1)                                                 0.131
  prefix3=='app' (1)                                                -0.126
  shape=='upcase' (1)                                               -0.084
  suffix3=='ple' (1)                                                -0.077
  prevtag=='O' (1)                                                           2.925
  pos+prevtag=='NNP+O' (1)                                                   2.213
  shape=='upcase' (1)                                                        0.929
  en-wordlist==False (1)                                                     0.891
  bias==True (1)                                                            -0.592
  label is 'B-GSP' (1)                                                      -0.565
  prevpos=='IN' (1)                                                          0.410
  nextpos=='nns' (1)                                                         0.399
  pos=='NNP' (1)                                                             0.393
  prevword=='of' (1)                                                         0.184
  wordlen==5 (1)                                                             0.177
  ---------------------------------------------------------------------------------
  TOTAL:                                             9.406   8.283   7.515   7.366
  PROBS:                                             0.453   0.208   0.122   0.110

The probabilities in the last row add up to 0.893 rather than 1, because only the four most probable labels are printed.
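
The relation between the TOTAL and PROBS rows can be checked with a little arithmetic. The sketch below copies the four printed values for example 1 and compares the ratios of the printed probabilities with 2 raised to the difference of the totals. The base-2 relation is an inference from these printed numbers, not something stated in the report itself, so treat it as an assumption.

```python
# Values copied from the TOTAL and PROBS rows of the example 1 output.
totals = {"B-GPE": 9.406, "O": 8.283, "B-ORGANIZATION": 7.515, "B-GSP": 7.366}
probs = {"B-GPE": 0.453, "O": 0.208, "B-ORGANIZATION": 0.122, "B-GSP": 0.110}

# The four printed probabilities leave out the remaining labels.
shown_mass = round(sum(probs.values()), 3)
print("probability mass shown:", shown_mass)

def ratio_from_totals(a, b):
    """Odds of label a over label b, assuming P(label) is proportional to 2**total."""
    return 2 ** (totals[a] - totals[b])

# Compare printed probability ratios with the ratios implied by the totals.
for other in ("O", "B-ORGANIZATION", "B-GSP"):
    printed = probs["B-GPE"] / probs[other]
    implied = ratio_from_totals("B-GPE", other)
    print(other, round(printed, 2), round(implied, 2))
```

The two columns agree to within rounding, so a lead of roughly one point in TOTAL corresponds to about twice the probability.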

Output for example 2:

Explanation on the why the word 'Apple' was tagged:
  Feature                                                O   B-GPE B-ORGAN B-LOCAT
  --------------------------------------------------------------------------------
  bias==True (1)                                    10.125
  prevtag=='O' (1)                                   5.628
  shape=='upcase' (1)                               -4.740
  label is 'O' (1)                                  -1.075
  pos=='NNP' (1)                                    -1.024
  suffix3=='ple' (1)                                 0.797
  en-wordlist==False (1)                             0.698
  prevpos=='DT' (1)                                  0.585
  wordlen==5 (1)                                    -0.449
  nextpos=='nns' (1)                                 0.104
  prefix3=='app' (1)                                 0.089
  prevword=='these' (1)                             -0.024
  pos+prevtag=='NNP+O' (1)                           0.011
  nextword=='products' (1)                           0.005
  prevtag=='O' (1)                                           3.767
  shape=='upcase' (1)                                        2.701
  pos+prevtag=='NNP+O' (1)                                   2.254
  en-wordlist==False (1)                                     2.095
  label is 'B-GPE' (1)                                      -2.005
  bias==True (1)                                            -1.975
  pos=='NNP' (1)                                             0.681
  nextpos=='nns' (1)                                         0.661
  prevpos=='DT' (1)                                         -0.181
  wordlen==5 (1)                                             0.113
  nextword=='products' (1)                                   0.060
  prevtag=='O' (1)                                                   3.389
  pos+prevtag=='NNP+O' (1)                                           1.725
  bias==True (1)                                                     0.955
  en-wordlist==False (1)                                             0.837
  label is 'B-ORGANIZATION' (1)                                      0.718
  prevpos=='DT' (1)                                                 -0.494
  nextpos=='nns' (1)                                                 0.365
  wordlen==5 (1)                                                    -0.351
  pos=='NNP' (1)                                                     0.174
  prefix3=='app' (1)                                                -0.126
  shape=='upcase' (1)                                               -0.084
  suffix3=='ple' (1)                                                -0.077
  prevword=='these' (1)                                              0.067
  prevtag=='O' (1)                                                           2.682
  label is 'B-LOCATION' (1)                                                 -2.038
  pos+prevtag=='NNP+O' (1)                                                   1.724
  shape=='upcase' (1)                                                        1.275
  prefix3=='app' (1)                                                         1.169
  bias==True (1)                                                             0.747
  prevpos=='DT' (1)                                                          0.745
  pos=='NNP' (1)                                                             0.616
  en-wordlist==False (1)                                                    -0.309
  nextpos=='nns' (1)                                                         0.151
  wordlen==5 (1)                                                             0.041
  ---------------------------------------------------------------------------------
  TOTAL:                                            10.730   8.171   7.095   6.802
  PROBS:                                             0.697   0.118   0.056   0.046

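So the B-GPE scores in examples 1 and 2 differ only in the following three rows:

prevword=='of' (1) 0.742
prevpos=='IN' (1) 0.311
prevpos=='DT' (1) -0.181

The first two rows appear only in example 1 and the third only in example 2, so the B-GPE total drops from 9.406 to 8.171. The O column moves in the opposite direction: prevpos=='IN' costs O 1.668 in example 1, while prevpos=='DT' adds 0.585 in example 2, which lifts the O total from 8.283 to 10.730.
As a result, Apple is tagged B-GPE in example 1 and O in example 2.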
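
The comparison can be reproduced from the two reports. The sketch below sums only the rows that depend on the preceding word (prevword and prevpos) for the B-GPE and O labels, using the weights printed above, and checks that they account for the change in the TOTAL row. The dictionary layout and helper name are hypothetical.

```python
# Context-dependent rows copied from the two explain() reports above.
context_rows = {
    "example 1": {
        "B-GPE": {"prevword=='of'": 0.742, "prevpos=='IN'": 0.311},
        "O": {"prevword=='of'": -0.217, "prevpos=='IN'": -1.668},
    },
    "example 2": {
        "B-GPE": {"prevpos=='DT'": -0.181},
        "O": {"prevword=='these'": -0.024, "prevpos=='DT'": 0.585},
    },
}

# TOTAL rows copied from the same reports.
totals = {
    "example 1": {"B-GPE": 9.406, "O": 8.283},
    "example 2": {"B-GPE": 8.171, "O": 10.730},
}

def context_score(example, label):
    """Sum of the weights contributed by the preceding word."""
    return sum(context_rows[example][label].values())

for label in ("B-GPE", "O"):
    shift = context_score("example 2", label) - context_score("example 1", label)
    total_shift = totals["example 2"][label] - totals["example 1"][label]
    print(label, round(shift, 3), round(total_shift, 3))
```

Both shifts match the change in TOTAL to within rounding: swapping "of" for "these" lowers B-GPE by about 1.23 and raises O by about 2.45, which is enough to flip the decision.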

References:

[1] http://www.nltk.org/book/ch07.html
[2] http://spark-public.s3.amazonaws.com/nlp/slides/Information_Extraction_and_Named_Entity_Recognition_v2.pdf
[3] http://www.mattshomepage.com/#/blog/feb2013/liftingthehood
