Original article: http://www.one2know.cn/nlp7/

  • Named entities

    Proper nouns: person names, place names, product names

    Example sentence                                      Named entities
    Hampi is on the south bank of the Tungabhadra river   Hampi, Tungabhadra River
    Paris is famous for fashion                           Paris
    Burj Khalifa is one of the skyscrapers in Dubai       Burj Khalifa, Dubai
    Jeff Weiner is the CEO of LinkedIn                    Jeff Weiner, LinkedIn

    A named entity is a noun that names one unique thing.

    Typical categories: TIMEZONE, LOCATION, RIVERS, COSMETICS, CURRENCY, DATE, TIME, PERSON

  • Recognizing named entities with NLTK

    The data used here has already been through the following preprocessing steps (all covered earlier; a runnable sketch follows the list):

    1. Split the large document into sentences

    2. Split each sentence into words

    3. POS-tag each sentence

    4. Extract chunks (phrases) of consecutive, non-overlapping words from each sentence

    5. Assign IOB labels to the words in those chunks
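    As a quick refresher, the five steps map onto NLTK roughly like this (a minimal sketch; the sample text and the noun-phrase grammar are illustrative assumptions, not from the original):

import nltk
from nltk.chunk import tree2conlltags

text = "Jeff Weiner is the CEO of LinkedIn. He joined in 2008."
sentences = nltk.sent_tokenize(text)                # 1. split the document into sentences
words = nltk.word_tokenize(sentences[0])            # 2. split a sentence into words
tagged = nltk.pos_tag(words)                        # 3. POS-tag the words
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"                 # assumed noun-phrase chunk pattern
chunked = nltk.RegexpParser(grammar).parse(tagged)  # 4. extract non-overlapping chunks (phrases)
print(tree2conlltags(chunked))                      # 5. (word, POS, IOB label) triples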

    Analyzing the treebank corpus:
import nltk

def sampleNE():
    sent = nltk.corpus.treebank.tagged_sents()[0]  # first sentence of the corpus
    print(nltk.ne_chunk(sent))  # nltk.ne_chunk() identifies the named entities in a sentence

def sampleNE2():
    sent = nltk.corpus.treebank.tagged_sents()[0]
    print(nltk.ne_chunk(sent, binary=True))  # binary=True marks entities without assigning a category

if __name__ == "__main__":
    sampleNE()
    sampleNE2()

Output:

(S
  (PERSON Pierre/NNP)
  (ORGANIZATION Vinken/NNP)
  ,/,
  61/CD
  years/NNS
  old/JJ
  ,/,
  will/MD
  join/VB
  the/DT
  board/NN
  as/IN
  a/DT
  nonexecutive/JJ
  director/NN
  Nov./NNP
  29/CD
  ./.)
(S
  (NE Pierre/NNP Vinken/NNP)
  ,/,
  61/CD
  years/NNS
  old/JJ
  ,/,
  will/MD
  join/VB
  the/DT
  board/NN
  as/IN
  a/DT
  nonexecutive/JJ
  director/NN
  Nov./NNP
  29/CD
  ./.)
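    Both trees are nltk.Tree objects, so the entities can be read out programmatically. A small sketch (an addition, not from the original post) that collects (label, entity) pairs from the typed tree:

import nltk

def extractNE(tree):
    entities = []
    for subtree in tree:
        # entity nodes are nltk.Tree subtrees; ordinary tokens are (word, tag) tuples
        if isinstance(subtree, nltk.Tree):
            entity = " ".join(word for word, tag in subtree.leaves())
            entities.append((subtree.label(), entity))
    return entities

sent = nltk.corpus.treebank.tagged_sents()[0]
print(extractNE(nltk.ne_chunk(sent)))
# per the output above: [('PERSON', 'Pierre'), ('ORGANIZATION', 'Vinken')]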
  • Creating a dictionary, a reverse dictionary, and using them

    A dictionary is a one-to-one mapping: store each word with its POS tag once, and later lookups are fast.
import nltk

class LearningDictionary():
    def __init__(self, sentence):  # runs on instantiation and builds both dictionaries
        self.words = nltk.word_tokenize(sentence)
        self.tagged = nltk.pos_tag(self.words)
        self.buildDictionary()
        self.buildReverseDictionary()

    # map each word to its POS tag
    def buildDictionary(self):
        self.dictionary = {}
        for (word, pos) in self.tagged:
            self.dictionary[word] = pos

    # build a second dictionary with keys and values swapped
    def buildReverseDictionary(self):
        self.rdictionary = {}
        for key in self.dictionary.keys():
            value = self.dictionary[key]
            if value not in self.rdictionary:
                self.rdictionary[value] = [key]
            else:
                self.rdictionary[value].append(key)

    # is the word in the dictionary?
    def isWordPresent(self, word):
        return 'Yes' if word in self.dictionary else 'No'

    # word => POS tag
    def getPOSForWord(self, word):
        return self.dictionary[word] if word in self.dictionary else None

    # POS tag => words
    def getWordsForPOS(self, pos):
        return self.rdictionary[pos] if pos in self.rdictionary else None

# test
if __name__ == "__main__":
    # instantiate an object from a sentence
    sentence = 'All the flights got delayed due to bad weather'
    learning = LearningDictionary(sentence)

    words = ['chair', 'flights', 'delayed', 'pencil', 'weather']
    pos = ['NN', 'VBS', 'NNS']
    for word in words:
        status = learning.isWordPresent(word)
        print("Is '{}' present in dictionary ? : '{}'".format(word, status))
        if status == 'Yes':  # '==', not 'is': compare string values, not identity
            print("\tPOS For '{}' is '{}'".format(word, learning.getPOSForWord(word)))
    for pword in pos:
        print("POS '{}' has '{}' words".format(pword, learning.getWordsForPOS(pword)))

Output:

Is 'chair' present in dictionary ? : 'No'
Is 'flights' present in dictionary ? : 'Yes'
    POS For 'flights' is 'NNS'
Is 'delayed' present in dictionary ? : 'Yes'
    POS For 'delayed' is 'VBN'
Is 'pencil' present in dictionary ? : 'No'
Is 'weather' present in dictionary ? : 'Yes'
    POS For 'weather' is 'NN'
POS 'NN' has '['weather']' words
POS 'VBS' has 'None' words
POS 'NNS' has '['flights']' words
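    The reverse dictionary is really a one-to-many mapping (one tag, many words), which collections.defaultdict expresses directly; a minimal sketch of the same idea:

from collections import defaultdict
import nltk

tagged = nltk.pos_tag(nltk.word_tokenize('All the flights got delayed due to bad weather'))
rdictionary = defaultdict(list)  # unseen tags start out as empty lists
for word, pos in tagged:
    rdictionary[pos].append(word)
print(rdictionary['NNS'])  # ['flights']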
  • Choosing a feature set

    Example: classify vehicle registration numbers into the classes 'rtc', 'gov', and 'oth' using handcrafted character-position features.
import nltk
import random

# labeled data: (vehicle registration number, class)
sampledata = [
    ('KA-01-F 1034 A', 'rtc'),
    ('KA-02-F 1030 B', 'rtc'),
    ('KA-03-FA 1200 C', 'rtc'),
    ('KA-01-G 0001 A', 'gov'),
    ('KA-02-G 1004 A', 'gov'),
    ('KA-03-G 0204 A', 'gov'),
    ('KA-04-G 9230 A', 'gov'),
    ('KA-27 1290', 'oth')
]
random.shuffle(sampledata)  # shuffle into random order

testdata = [
    'KA-01-G 0109',
    'KA-02-F 9020 AC',
    'KA-02-FA 0801',
    'KA-01 9129'
]

def learnSimpleFeatures():
    def vehicleNumberFeature(vnumber):
        return {'vehicle_class': vnumber[6]}  # the 7th character as the only feature
    # list of tuples (features built from the 7th character, class)
    featuresets = [(vehicleNumberFeature(vn), cls) for (vn, cls) in sampledata]
    # train a naive Bayes model and keep it in classifier
    classifier = nltk.NaiveBayesClassifier.train(featuresets)
    # classify the test data
    for num in testdata:
        feature = vehicleNumberFeature(num)
        print('(simple) %s is type of %s' % (num, classifier.classify(feature)))

# use both the 6th and 7th characters as features
def learnFeatures():
    def vehicleNumberFeature(vnumber):
        return {
            'vehicle_class': vnumber[6],
            'vehicle_prev': vnumber[5],
        }
    featuresets = [(vehicleNumberFeature(vn), cls) for (vn, cls) in sampledata]
    classifier = nltk.NaiveBayesClassifier.train(featuresets)
    for num in testdata:
        feature = vehicleNumberFeature(num)
        print('(dual) %s is type of %s' % (num, classifier.classify(feature)))

if __name__ == "__main__":
    learnSimpleFeatures()
    learnFeatures()

Output:

(simple) KA-01-G 0109 is type of gov
(simple) KA-02-F 9020 AC is type of rtc
(simple) KA-02-FA 0801 is type of rtc
(simple) KA-01 9129 is type of gov
(dual) KA-01-G 0109 is type of gov
(dual) KA-02-F 9020 AC is type of rtc
(dual) KA-02-FA 0801 is type of rtc
(dual) KA-01 9129 is type of oth
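    The two models disagree only on 'KA-01 9129'. Its 7th character is '9', a value never seen in training, so the single-feature model likely falls back toward the most frequent class prior ('gov'); the dual-feature model also sees vehicle_prev = ' ', which among the training plates occurs only in the 'oth' example 'KA-27 1290'. A hedged sketch to verify this, to be placed at the end of learnFeatures() where classifier and vehicleNumberFeature are in scope:

# print the per-class probabilities for the ambiguous plate
dist = classifier.prob_classify(vehicleNumberFeature('KA-01 9129'))
for label in sorted(dist.samples()):
    print('{}: {:.3f}'.format(label, dist.prob(label)))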
  • Segmenting sentences with a classifier

    Cue: a '.' token followed by a word whose first letter is uppercase marks a sentence boundary.
import nltk

# feature extractor: returns (feature dict, label = whether the next word starts with an uppercase letter)
def featureExtractor(words, i):
    return ({'current-word': words[i], 'next-is-upper': words[i+1][0].isupper()}, words[i+1][0].isupper())

# build the feature set from every '.' token in the text
def getFeaturesets(sentence):
    words = nltk.word_tokenize(sentence)  # word list for the text
    featuresets = [featureExtractor(words, i) for i in range(1, len(words) - 1) if words[i] == '.']
    return featuresets

# split a text into sentences and print them
def segmentTextAndPrintSentences(data):
    words = nltk.word_tokenize(data)  # tokenize the whole text
    for i in range(0, len(words) - 1):
        if words[i] == '.':
            if classifier.classify(featureExtractor(words, i)[0]) == True:
                print(".")  # sentence boundary: end the current line
            else:
                print(words[i], end='')
        else:
            print("{} ".format(words[i]), end='')
    print(words[-1])  # print the last token
traindata = "The train and test data consist of three columns separated by spaces.Each word has been put on a separate line and there is an empty line after each sentence. The first column contains the current word, the second its part-of-speech tag as derived by the Brill tagger and the third its chunk tag as derived from the WSJ corpus. The chunk tags contain the name of the chunk type, for example I-NP for noun phrase words and I-VP for verb phrase words. Most chunk types have two types of chunk tags, B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the chunk. Here is an example of the file format."
testdata = "The baseline result was obtained by selecting the chunk tag which was most frequently associated with the current part-of-speech tag. At the workshop, all 11 systems outperformed the baseline. Most of them (six of the eleven) obtained an F-score between 91.5 and 92.5. Two systems performed a lot better: Support Vector Machines used by Kudoh and Matsumoto [KM00] and Weighted Probability Distribution Voting used by Van Halteren [Hal00]. The papers associated with the participating systems can be found in the reference section below."
traindataset = getFeaturesets(traindata)
classifier = nltk.NaiveBayesClassifier.train(traindataset)
segmentTextAndPrintSentences(testdata)

Output:

The baseline result was obtained by selecting the chunk tag which was most frequently associated with the current part-of-speech tag .
At the workshop , all 11 systems outperformed the baseline .
Most of them ( six of the eleven ) obtained an F-score between 91.5 and 92.5 .
Two systems performed a lot better : Support Vector Machines used by Kudoh and Matsumoto [ KM00 ] and Weighted Probability Distribution Voting used by Van Halteren [ Hal00 ] .
The papers associated with the participating systems can be found in the reference section below .
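    Because getFeaturesets() returns labeled (features, label) pairs, the boundary classifier can also be scored numerically instead of eyeballed. A small sketch (an addition, appended after the training code above):

# score the classifier on the labeled '.' tokens extracted from testdata
testfeatureset = getFeaturesets(testdata)
print(nltk.classify.accuracy(classifier, testfeatureset))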
  • Text classification

    Example: classifying RSS (Rich Site Summary) feeds.
import nltk
import random
import feedparser

# two Yahoo Sports RSS feeds
urls = {
    'mlb': 'http://sports.yahoo.com/mlb/rss.xml',
    'nfl': 'http://sports.yahoo.com/nfl/rss.xml',
}

feedmap = {}  # dictionary holding the downloaded feeds
stopwords = nltk.corpus.stopwords.words('english')  # stopword list

# given a word list, return a feature dict whose keys are the non-stopwords, each with value True
def featureExtractor(words):
    features = {}
    for word in words:
        if word not in stopwords:
            features["word({})".format(word)] = True
    return features

# empty list for the correctly labeled sentences
sentences = []
for category in urls.keys():
    feedmap[category] = feedparser.parse(urls[category])  # download the feed into feedmap
    print("downloading {}".format(urls[category]))
    for entry in feedmap[category]['entries']:  # iterate over all RSS entries
        data = entry['summary']
        words = data.split()
        sentences.append((category, words))  # store (category, word list) tuples

# turn each (category, word list) into (feature dict over all its words, category)
featuresets = [(featureExtractor(words), category) for category, words in sentences]

# shuffle, then split: half training set, half test set
random.shuffle(featuresets)
total = len(featuresets)
off = int(total / 2)
trainset = featuresets[off:]
testset = featuresets[:off]

# build a classifier with NaiveBayesClassifier.train()
classifier = nltk.NaiveBayesClassifier.train(trainset)

# print the accuracy
print(nltk.classify.accuracy(classifier, testset))

# print the most informative features
classifier.show_most_informative_features(5)

# test on the first 4 nfl entries, classified from their titles
for (i, entry) in enumerate(feedmap['nfl']['entries']):
    if i < 4:
        features = featureExtractor(entry['title'].split())
        category = classifier.classify(features)
        print('{} -> {}'.format(category, entry['summary']))

Output:

downloading http://sports.yahoo.com/mlb/rss.xml
downloading http://sports.yahoo.com/nfl/rss.xml
0.9148936170212766
Most Informative Features
          word(NFL) = True              nfl : mlb = 8.6 : 1.0
  word(quarterback) = True              nfl : mlb = 3.7 : 1.0
         word(team) = True              nfl : mlb = 2.9 : 1.0
          word(two) = True              mlb : nfl = 2.4 : 1.0
    word(Wednesday) = True              mlb : nfl = 2.4 : 1.0
nfl -> The Cowboys RB will not be suspended for his role in an incident in May in Las Vegas.
nfl -> Giants defensive lineman Dexter Lawrence was 6 years old when Eli Manning began his NFL career. Manning is entering his 16th season, while Lawrence is arriving as a first-round draft pick. Age isn't always "just a number." "In the locker room, I feel their age," Manning said,
nfl -> Hue Jackson compiled a 3-36-1 record in two-and-a-half seasons with the Cleveland Browns before later joining division rival the Cincinnati Bengals.
nfl -> NFL Network's David Carr and free agent defensive lineman Andre Fluellen predict every game on the Minnesota Vikings' 2019 schedule.
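    Once trained, the model can score any word list, not just feed entries. A sketch (an addition; the headline text is invented for illustration):

# classify an arbitrary headline
headline = 'Quarterback signs new contract before the season opener'
print(classifier.classify(featureExtractor(headline.split())))  # expected: 'nfl'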
  • POS tagging using context
import nltk

# example sentences where 'address' and 'laugh' appear with more than one POS
sentences = [
    "What is your address when you're in Beijing?",
    "the president's address on the state of economy.",
    "He addressed his remarks to the lawyers in the audience.",
    "In order to address an assembly, we should be ready",
    "He laughed inwardly at the scene.",
    "After all the advance publicity, the prizefight turned out to be a laugh.",
    "We can learn to laugh a little at even our most serious foibles.",
]

# put each sentence's (word, POS) pairs into a list, yielding a 2-D list
def getSentenceWords():
    sentwords = []
    for sentence in sentences:
        words = nltk.pos_tag(nltk.word_tokenize(sentence))
        sentwords.append(words)
    return sentwords

# POS tagging without context
def noContextTagger():
    # build a baseline system
    tagger = nltk.UnigramTagger(getSentenceWords())
    print(tagger.tag('the little remarks towards assembly are laughable'.split()))

# POS tagging with context
def withContextTagger():
    # returns a dict of 4 feature:value pairs
    def wordFeatures(words, wordPosInSentence):
        # the last 1, 2 and 3 letters of the word as features
        endFeatures = {
            'last(1)': words[wordPosInSentence][-1],
            'last(2)': words[wordPosInSentence][-2:],
            'last(3)': words[wordPosInSentence][-3:],
        }
        # from the third token onward, use the preceding word as a feature
        if wordPosInSentence > 1:
            endFeatures['prev'] = words[wordPosInSentence - 1]
        else:
            endFeatures['prev'] = '|NONE|'
        return endFeatures

    allsentences = getSentenceWords()  # 2-D list
    featureddata = []  # will hold (feature dict, tag) tuples
    for sentence in allsentences:
        untaggedSentence = nltk.tag.untag(sentence)
        featuredsentence = [(wordFeatures(untaggedSentence, index), tag) for index, (word, tag) in enumerate(sentence)]
        featureddata.extend(featuredsentence)

    breakup = int(len(featureddata) * 0.5)
    traindata = featureddata[breakup:]
    testdata = featureddata[:breakup]
    classifier = nltk.NaiveBayesClassifier.train(traindata)
    print("Classifier accuracy : {}".format(nltk.classify.accuracy(classifier, testdata)))

if __name__ == "__main__":
    noContextTagger()
    withContextTagger()

Output:

[('the', 'DT'), ('little', 'JJ'), ('remarks', 'NNS'), ('towards', None), ('assembly', 'NN'), ('are', None), ('laughable', None)]
Classifier accuracy : 0.38461538461538464
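    For comparison, NLTK ships a classifier-based tagger that derives similar suffix-and-context features automatically. A sketch (an addition; trained on the same tiny sentence set, so its accuracy is similarly limited):

from nltk.tag.sequential import ClassifierBasedPOSTagger

tagger = ClassifierBasedPOSTagger(train=getSentenceWords())  # trains a naive Bayes model by default
print(tagger.tag('the little remarks towards assembly are laughable'.split()))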
