NLP（七）信息抽取和文本分类

原文链接：http://www.one2know.cn/nlp7/

命名实体

专有名词：人名地名产品名

例句	命名实体
Hampi is on the South Bank of Tungabhabra river	Hampi,Tungabhabra River
Paris is famous for Fashion	Paris
Burj Khalifa is one of the SKyscrapers in Dubai	Burj Khalifa,Dubai
Jeff Weiner is the CEO of LinkedIn	Jeff Weiner,LinkedIn

命名实体是独一无二的名词

分类：TIMEZONE,LOCATION,RIVERS,COSMETICS(化妆品),CURRENCY(货币),DATE,TIME,PERSON

NLTK识别命名实体

使用的数据已经经过以下预处理（之前学过的）：

1.将大文档分割成句子

2.将句子分割成词

3.对句子进行词性标注

4.从句子中提取包含连续词（非重叠）的组块（短语）

5.给这些组块包含的词标注IOB标签

分析treebank语料库：

import nltk

def sampleNE():

    sent = nltk.corpus.treebank.tagged_sents()[0] # 语料库第一句

    print(nltk.ne_chunk(sent)) # nltk.ne_chunk()函数分析识别一个句子的命名实体

def sampleNE2():

    sent = nltk.corpus.treebank.tagged_sents()[0]

    print(nltk.ne_chunk(sent,binary=True))  # 包含识别无类别的命名实体

if __name__ == "__main__":

    sampleNE()

    sampleNE2()

输出：

(S

  (PERSON Pierre/NNP)

  (ORGANIZATION Vinken/NNP)

  ,/,

  61/CD

  years/NNS

  old/JJ

  ,/,

  will/MD

  join/VB

  the/DT

  board/NN

  as/IN

  a/DT

  nonexecutive/JJ

  director/NN

  Nov./NNP

  29/CD

  ./.)

(S

  (NE Pierre/NNP Vinken/NNP)

  ,/,

  61/CD

  years/NNS

  old/JJ

  ,/,

  will/MD

  join/VB

  the/DT

  board/NN

  as/IN

  a/DT

  nonexecutive/JJ

  director/NN

  Nov./NNP

  29/CD

  ./.)

创建字典、逆序字典和使用字典

字典：一对一映射，将词和词性一一对应放入字典，下次可高效查找

import nltk

class LearningDictionary():

    def __init__(self,sentence): # 实例化时直接运行,建立了两个字典

        self.words = nltk.word_tokenize(sentence)

        self.tagged = nltk.pos_tag(self.words)

        self.buildDictionary()

        self.buildReverseDictionary()

    # 将词和词性放到字典

    def buildDictionary(self):

        self.dictionary = {}

        for (word,pos) in self.tagged:

            self.dictionary[word] = pos

    # 在原来的字典基础上，新建一个key和value调过来的字典

    def buildReverseDictionary(self):

        self.rdictionary = {}

        for key in self.dictionary.keys():

            value = self.dictionary[key]

            if value not in self.rdictionary:

                self.rdictionary[value] = [key]

            else:

                self.rdictionary[value].append(key)

    # 判断词是否在字典里

    def isWordPresent(self,word):

        return 'Yes' if word in self.dictionary else 'No'

    # 词 => 词性

    def getPOSForWord(self,word):

        return self.dictionary[word] if word in self.dictionary else None

    # 词性 => 词

    def getWordsForPOS(self,pos):

        return self.rdictionary[pos] if pos in self.rdictionary else None

# 测试

if __name__ == "__main__":

    # 以sentence实例化一个对象

    sentence = 'All the flights got delayed due to bad weather'

    learning = LearningDictionary(sentence)

    words = ['chair','flights','delayed','pencil','weather']

    pos = ['NN','VBS','NNS']

    for word in words:

        status = learning.isWordPresent(word)

        print("It '{}' present in dictionary ? : '{}'".format(word,status))

        if status is 'Yes':

            print("\tPOS For '{}' is '{}'".format(word,learning.getPOSForWord(word)))

    for pword in pos:

        print("POS '{}' has '{}' words".format(pword,learning.getWordsForPOS(pword)))

输出：

It 'chair' present in dictionary ? : 'No'

It 'flights' present in dictionary ? : 'Yes'

	POS For 'flights' is 'NNS'

It 'delayed' present in dictionary ? : 'Yes'

	POS For 'delayed' is 'VBN'

It 'pencil' present in dictionary ? : 'No'

It 'weather' present in dictionary ? : 'Yes'

	POS For 'weather' is 'NN'

POS 'NN' has '['weather']' words

POS 'VBS' has 'None' words

POS 'NNS' has '['flights']' words

特征集合选择

import nltk

import random

sampledata = [

    ('KA-01-F 1034 A','rtc'),

    ('KA-02-F 1030 B','rtc'),

    ('KA-03-FA 1200 C','rtc'),

    ('KA-01-G 0001 A','gov'),

    ('KA-02-G 1004 A','gov'),

    ('KA-03-G 0204 A','gov'),

    ('KA-04-G 9230 A','gov'),

    ('KA-27 1290','oth')

]

random.shuffle(sampledata) # 随机排序

testdata = [

    'KA-01-G 0109',

    'KA-02-F 9020 AC',

    'KA-02-FA 0801',

    'KA-01 9129'

]

def learnSimpleFeatures():

    def vehicleNumberFeature(vnumber):

        return {'vehicle_class':vnumber[6]} # 返回第7个字母

    # 元组（第7个字母作为特征，类别）构成的列表

    featuresets = [(vehicleNumberFeature(vn),cls) for (vn,cls) in sampledata]

    # 朴素贝叶斯训练数据 将分类器保存在classifier中

    classifier = nltk.NaiveBayesClassifier.train(featuresets)

    # 测试数据

    for num in testdata:

        feature = vehicleNumberFeature(num)

        print('(simple) %s is type of %s'%(num,classifier.classify(feature)))

def learnFeatures(): # 用6，7两位作为特征

    def vehicleNumberFeature(vnumber):

        return {

            'vehicle_class':vnumber[6],

            'vehicle_prev':vnumber[5],

        }

    featuresets = [(vehicleNumberFeature(vn),cls) for (vn,cls) in sampledata]

    classifier = nltk.NaiveBayesClassifier.train(featuresets)

    for num in testdata:

        feature = vehicleNumberFeature(num)

        print('(dual) %s is type of %s'%(num,classifier.classify(feature)))

if __name__ == "__main__":

    learnSimpleFeatures()

    learnFeatures()

输出：

(simple) KA-01-G 0109 is type of gov

(simple) KA-02-F 9020 AC is type of rtc

(simple) KA-02-FA 0801 is type of rtc

(simple) KA-01 9129 is type of gov

(dual) KA-01-G 0109 is type of gov

(dual) KA-02-F 9020 AC is type of rtc

(dual) KA-02-FA 0801 is type of rtc

(dual) KA-01 9129 is type of oth

利用分类器分割句子

依据：以'.'结尾，下一单词首字母大写

import nltk

# 定义特征 返回（字典，下一个句子首字母是否为大写的布尔值）

def featureExtractor(words,i):

    return ({'current-word':words[i],'next-is-upper':words[i+1][0].isupper()},words[i+1][0].isupper())

# 得到特征集合

def getFeaturesets(sentence):

    words = nltk.word_tokenize(sentence) # 得到句子的单词数组

    featuresets = [featureExtractor(words,i) for i in range(1,len(words)-1) if words[i] == '.']

    return featuresets

# 将文章分句的函数

def segmentTextAndPrintSentences(data):

    words = nltk.word_tokenize(data) # 整个文章分词

    for i in range(0,len(words)-1):

        if words[i] == '.':

            if classifier.classify(featureExtractor(words,i)[0]) == True:

                print(".")

            else:

                print(words[i],end='')

        else:

            print("{} ".format(words[i]),end='')

    print(words[-1]) # 输出最后一个标点

traindata = "The train and test data consist of three columns separated by spaces.Each word has been put on a separate line and there is an empty line after each sentence. The first column contains the current word, the second its part-of-speech tag as derived by the Brill tagger and the third its chunk tag as derived from the WSJ corpus. The chunk tags contain the name of the chunk type, for example I-NP for noun phrase words and I-VP for verb phrase words. Most chunk types have two types of chunk tags, B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the chunk. Here is an example of the file format."

testdata = "The baseline result was obtained by selecting the chunk tag which was most frequently associated with the current part-of-speech tag. At the workshop, all 11 systems outperformed the baseline. Most of them (six of the eleven) obtained an F-score between 91.5 and 92.5. Two systems performed a lot better: Support Vector Machines used by Kudoh and Matsumoto [KM00] and Weighted Probability Distribution Voting used by Van Halteren [Hal00]. The papers associated with the participating systems can be found in the reference section below."

traindataset = getFeaturesets(traindata)

classifier = nltk.NaiveBayesClassifier.train(traindataset)

segmentTextAndPrintSentences(testdata)

输出：

The baseline result was obtained by selecting the chunk tag which was most frequently associated with the current part-of-speech tag .

At the workshop , all 11 systems outperformed the baseline .

Most of them ( six of the eleven ) obtained an F-score between 91.5 and 92.5 .

Two systems performed a lot better : Support Vector Machines used by Kudoh and Matsumoto [ KM00 ] and Weighted Probability Distribution Voting used by Van Halteren [ Hal00 ] .

The papers associated with the participating systems can be found in the reference section below .

文本分类

以RSS(丰富站点，Rich Site Summary)源的分类为例

import nltk

import random

import feedparser

# 两个跟雅虎体育相关的RSS源

urls = {

    'mlb':'http://sports.yahoo.com/mlb/rss.xml',

    'nfl':'http://sports.yahoo.com/nfl/rss.xml',

}

feedmap = {} # 字典存RSS源

stopwords = nltk.corpus.stopwords.words('english') # 停用词

# 输入单词列表 返回特征字典 key是非停用词 value是True

def featureExtractor(words):

    features = {}

    for word in words:

        if word not in stopwords:

            features["word({})".format(word)] = True

    return features

# 空列表 用于储存正确标注的句子

sentences = []

for category in urls.keys():

    feedmap[category] = feedparser.parse(urls[category]) # 下载数据源存到feedmap字典中

    print("downloading {}".format(urls[category]))

    for entry in feedmap[category]['entries']: # 遍历所有RSS条目

        data = entry['summary']

        words = data.split()

        sentences.append((category,words)) # 将类别和所有单词以元组形式存到sentences中

# 将 （类别，单词列表） 转化成 '所有单词的特征：类别' 组成的字典

featuresets = [(featureExtractor(words),category) for category,words in sentences]

# 打乱 一半训练集 一半测试集

random.shuffle(featuresets)

total = len(featuresets)

off = int(total/2)

trainset = featuresets[off:]

testset = featuresets[:off]

# 调用NaiveBayesClassifier模块train()函数 构造一个分类器

classifier = nltk.NaiveBayesClassifier.train(trainset)

# 打印准确率

print(nltk.classify.accuracy(classifier,testset))

# 打印数据的有效特征

classifier.show_most_informative_features(5)

for (i,entry) in enumerate(feedmap['nfl']['entries']):

    if i < 4: # 从nfl随机选取4个样本测试

        features = featureExtractor(entry['title'].split())

        category = classifier.classify(features)

        print('{} -> {}'.format(category,entry['summary']))

输出：

downloading http://sports.yahoo.com/mlb/rss.xml

downloading http://sports.yahoo.com/nfl/rss.xml

0.9148936170212766

Most Informative Features

               word(NFL) = True              nfl : mlb    =      8.6 : 1.0

       word(quarterback) = True              nfl : mlb    =      3.7 : 1.0

              word(team) = True              nfl : mlb    =      2.9 : 1.0

               word(two) = True              mlb : nfl    =      2.4 : 1.0

         word(Wednesday) = True              mlb : nfl    =      2.4 : 1.0

nfl -> The Cowboys RB will not be suspended for his role in an incident in May in Las Vegas.

nfl -> Giants defensive lineman Dexter Lawrence was 6 years old when Eli Manning began his NFL career. Manning is entering his 16th season, while Lawrence is arriving as a first-round draft pick. Age isn't always "just a number." "In the locker room, I feel their age," Manning said,

nfl -> Hue Jackson compiled a 3-36-1 record in two-and-a-half seasons with the Cleveland Browns before later joining division rival the Cincinnati Bengals.

nfl -> NFL Network's David Carr and free agent defensive lineman Andre Fluellen predict every game on the Minnesota Vikings' 2019 schedule.

利用上下文进行词性标注

import nltk

# 给出一些包含双词性的例句 address laugh

sentences = [

    "What is your address when you're in Beijing?",

    "the president's address on the state of economy.",

    "He addressed his remarks to the lawyers in the audience.",

    "In order to address an assembly, we should be ready",

    "He laughed inwardly at the scene.",

    "After all the advance publicity, the prizefight turned out to be a laugh.",

    "We can learn to laugh a little at even our most serious foibles.",

]

# 将每句话的 词和词性 放到列表中，构成一个二维列表

def getSentenceWords():

    sentwords = []

    for sentence in sentences:

        words = nltk.pos_tag(nltk.word_tokenize(sentence))

        sentwords.append(words)

    return sentwords

# 无上下文词性标注

def noContextTagger():

    # 构建一个基准系统

    tagger = nltk.UnigramTagger(getSentenceWords())

    print(tagger.tag('the little remarks towards assembly are laughable'.split()))

# 有上下文词性标注

def withContextTagger():

    # 返回字典：  4 x 特征:特征值

    def wordFeatures(words,wordPosInSentence):

        # 单词的倒数1，2，3个字母作为特征

        endFeatures = {

            'last(1)':words[wordPosInSentence][-1],

            'last(2)':words[wordPosInSentence][-2:],

            'last(3)':words[wordPosInSentence][-3:],

        }

        # 如果一个词不是句子中第一个 用前面的词决定

        if wordPosInSentence > 1:

            endFeatures['prev'] = words[wordPosInSentence - 1]

        else:

            endFeatures['prev'] = '|NONE|'

        return endFeatures

    allsentences = getSentenceWords() # 二维列表

    featureddata = [] # 准备放元组，元组包括 特征信息(featurelist)和标记(tag)

    for sentence in allsentences:

        untaggedSentence = nltk.tag.untag(sentence)

        featuredsentence = [(wordFeatures(untaggedSentence,index),tag) for index,(word,tag) in enumerate(sentence)]

        featureddata.extend(featuredsentence)

    breakup = int(len(featureddata) * 0.5)

    traindata = featureddata[breakup:]

    testdata = featureddata[:breakup]

    classifier = nltk.NaiveBayesClassifier.train(traindata)

    print("分类器准确率 : {}".format(nltk.classify.accuracy(classifier,testdata)))

if __name__ == "__main__":

    noContextTagger()

    withContextTagger()

输出：

[('the', 'DT'), ('little', 'JJ'), ('remarks', 'NNS'), ('towards', None), ('assembly', 'NN'), ('are', None), ('laughable', None)]

分类器准确率 : 0.38461538461538464

NLP（七）信息抽取和文本分类的更多相关文章

NLP学习（2）----文本分类模型
实战:https://github.com/jiangxinyang227/NLP-Project 一.简介: 1.传统的文本分类方法:[人工特征工程+浅层分类模型] (1)文本预处理: ①(中文) ...
百度开源其NLP主题模型工具包，文本分类等场景可直接使用L——LDA进行主题选择本质就是降维，然后用于推荐或者分类
2017年7月4日,百度开源了一款主题模型项目,名曰:Familia. InfoQ记者第一时间联系到百度Familia项目负责人姜迪并对他进行采访,在本文中,他将为我们解析Familia项目的技术细节 ...
万字总结Keras深度学习中文文本分类
摘要:文章将详细讲解Keras实现经典的深度学习文本分类算法,包括LSTM.BiLSTM.BiLSTM+Attention和CNN.TextCNN. 本文分享自华为云社区<Keras深度学习中文 ...
NLP大赛冠军总结：300万知乎多标签文本分类任务(附深度学习源码)
NLP大赛冠军总结:300万知乎多标签文本分类任务(附深度学习源码) 七月,酷暑难耐,认识的几位同学参加知乎看山杯,均取得不错的排名.当时天池AI医疗大赛初赛结束,官方正在为复赛进行平台调 ...
NLP系列(2)_用朴素贝叶斯进行文本分类(上)
作者:龙心尘 && 寒小阳时间:2016年1月. 出处: http://blog.csdn.net/longxinchen_ml/article/details/50597149 h ...
fastText、TextCNN、TextRNN……这里有一套NLP文本分类深度学习方法库供你选择
https://mp.weixin.qq.com/s/_xILvfEMx3URcB-5C8vfTw 这个库的目的是探索用深度学习进行NLP文本分类的方法. 它具有文本分类的各种基准模型,还支持多标签分 ...
NLP文本分类
引言其实最近挺纠结的,有一点点焦虑,因为自己一直都期望往自然语言处理的方向发展,梦想成为一名NLP算法工程师,也正是我喜欢的事,而不是为了生存而工作.我觉得这也是我这辈子为数不多的剩下的可以自己去追 ...
NLP系列(3)_用朴素贝叶斯进行文本分类(下)
作者: 龙心尘 && 寒小阳时间:2016年2月. 出处: http://blog.csdn.net/longxinchen_ml/article/details/50629110 ...
NLP（十六）轻松上手文本分类
背景介绍文本分类是NLP中的常见的重要任务之一,它的主要功能就是将输入的文本以及文本的类别训练出一个模型,使之具有一定的泛化能力,能够对新文本进行较好地预测.它的应用很广泛,在很多领域发挥着重要 ...

随机推荐

TreeSet类的排序
TreeSet支持两种排序方法:自然排序和定制排序.TreeSet默认采用自然排序. 1.自然排序 TreeSet会调用集合元素的compareTo(Object obj)方法来比较元素之间大小关系, ...
Adapter适配器模式--图解设计模式
第二章: Adapter 模式 Adapter模式分为两种: 1.类适配器模式 2.委托适配器我看的是<图解设计模式>这本书,这小鬼子说的话真难懂,只能好好看代码理解. 先说适配器模式要 ...
Activiti6系列（3）- 快速体验
一.部署启动activiti 1.部署,将两个war包拷贝到Tomcat下即可. 2.启动tomcat,访问http://127.0.0.1:8080/activiti-app 默认账号密码:admi ...
每个程序员都可以「懂」一点 Linux
提到 Linux,作为程序员来说一定都不陌生.但如果说到「懂」Linux,可能就没有那么多人有把握了.到底用 Linux 离懂 Linux 有多远?如果决定学习 Linux,应该怎么开始?要学到什么程 ...
Json串与实体的相互转换 (不依赖于jar包只需Eclipse环境即可)
Json串与实体的相互转换 (不依赖于jar包只需Eclipse环境即可) 最近学习了javaWeb开发,用的是ssh框架里面自己整合了hibernate 和Struts2 和spring框架,其中 ...
认识Redies
既然是作为了解性文章,那必然不会做很深入的解读.深入的解读以后会加上. 我们先来回答两个问题.通过这两个问题来开始我们的Redies入门之旅. Redies是什么? Redies有什么作用? Redi ...
《机器学习技法》---核型SVM
(本文内容和图片来自林轩田老师<机器学习技法>) 1. 核技巧引入如果要用SVM来做非线性的分类,我们采用的方法是将原来的特征空间映射到另一个更高维的空间,在这个更高维的空间做线性的SV ...
『深度应用』一小时教你上手MaskRCNN·Keras开源实战（Windows&Linux）
0. 前言介绍开源地址:https://github.com/matterport/Mask_RCNN 个人主页:http://www.yansongsong.cn/ MaskRCNN是何凯明基于以 ...
Docker之- 使用Docker 镜像和仓库
目录使用Docker 镜像和仓库什么是 Docker 镜像列出 Docker 镜像 tag 标签 Docker Hub 拉取镜像查找镜像构建镜像创建Docker Hub 账号使用 Doc ...
用python写排序算法
希尔排序希尔排序通过将比较的全部元素分为几个区域来提升插入排序的性能.这样可以让一个元素可以一次性地朝最终位置前进一大步.然后算法再取越来越小的步长进行排序,算法的最后一步就是普通的插入排序,但是到 ...

NLP（七） 信息抽取和文本分类

NLP（七） 信息抽取和文本分类的更多相关文章

随机推荐

热门专题

NLP（七）信息抽取和文本分类

NLP（七）信息抽取和文本分类的更多相关文章