NLP (5): Part-of-Speech Tagging and Grammars
Original link: http://www.one2know.cn/nlp5/
- NLTK's built-in POS tagger
Use the nltk.pos_tag() function to tag parts of speech.
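import nltk
# word_tokenize needs the 'punkt' tokenizer models; pos_tag needs the perceptron tagger model.
# Newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
simpleSentence = 'Bangalore is the capital of Karnataka.'
# Tokenize
wordsInSentence = nltk.word_tokenize(simpleSentence)
print(wordsInSentence)
# Tag parts of speech
partsOfSpeechTags = nltk.pos_tag(wordsInSentence)
print(partsOfSpeechTags)
Output: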
['Bangalore', 'is', 'the', 'capital', 'of', 'Karnataka', '.']
[('Bangalore', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('capital', 'NN'), ('of', 'IN'), ('Karnataka', 'NNP'), ('.', '.')]
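Because nltk.pos_tag() returns a plain list of (word, tag) tuples, the result can be filtered with ordinary Python. A minimal sketch, using the output above as literal data so that it runs without NLTK; the variable names `tagged`, `proper_nouns` and `by_tag` are illustrative only:

```python
# Tagged output copied from the example above
tagged = [('Bangalore', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('capital', 'NN'),
          ('of', 'IN'), ('Karnataka', 'NNP'), ('.', '.')]

# Keep only proper nouns (Penn Treebank tags NNP and NNPS)
proper_nouns = [word for word, tag in tagged if tag in ('NNP', 'NNPS')]
print(proper_nouns)

# Group the words by their tag
by_tag = {}
for word, tag in tagged:
    by_tag.setdefault(tag, []).append(word)
print(by_tag)
```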
- Writing your own POS taggers
import nltk
# Default tagger: assigns 'NN' to every token
def learnDefaultTagger(simpleSentence):
    wordsInSentence = nltk.word_tokenize(simpleSentence)
    tagger = nltk.DefaultTagger('NN')
    posEnabledTags = tagger.tag(wordsInSentence)
    print(posEnabledTags)
# Regular-expression tagger
def learnRETagger(simpleSentence):
    # List of (pattern, tag) tuples, tried in order; the first match wins.
    # Raw strings (r'...') keep backslashes in a pattern from being read as Python escapes.
    customPatterns = [
        (r'.*ing$', 'ADJECTIVE'),
        (r'.*ly$', 'ADVERB'),
        (r'.*ion$', 'NOUN'),
        (r'(.*ate|.*en|is)$', 'VERB'),
        (r'^an$', 'INDEFINITE-ARTICLE'),
        (r'^(with|on|at)$', 'PREPOSITION'),
        (r'^[0-9]*$', 'NUMBER'),
        (r'.*$', None),  # catch-all: everything else stays untagged
    ]
    tagger = nltk.RegexpTagger(customPatterns)
    wordsInSentence = nltk.word_tokenize(simpleSentence)
    posEnabledTags = tagger.tag(wordsInSentence)
    print(posEnabledTags)
# Lookup (dictionary) tagger
def learnLookupTagger(simpleSentence):
    mapping = {
        '.': '.', 'place': 'NN', 'on': 'IN', 'earth': 'NN', 'Mysore': 'NNP',
        'is': 'VBZ', 'an': 'DT', 'amazing': 'JJ',
    }
    tagger = nltk.UnigramTagger(model=mapping)
    wordsInSentence = nltk.word_tokenize(simpleSentence)
    posEnabledTags = tagger.tag(wordsInSentence)
    print(posEnabledTags)
if __name__ == "__main__":
    testSentence = 'Mysore is an amazing place on earth. I have visited Mysore 10 times.'
    learnDefaultTagger(testSentence)
    learnRETagger(testSentence)
    learnLookupTagger(testSentence)
Output:
[('Mysore', 'NN'), ('is', 'NN'), ('an', 'NN'), ('amazing', 'NN'), ('place', 'NN'), ('on', 'NN'), ('earth', 'NN'), ('.', 'NN'), ('I', 'NN'), ('have', 'NN'), ('visited', 'NN'), ('Mysore', 'NN'), ('10', 'NN'), ('times', 'NN'), ('.', 'NN')]
[('Mysore', None), ('is', 'VERB'), ('an', 'INDEFINITE-ARTICLE'), ('amazing', 'ADJECTIVE'), ('place', None), ('on', 'PREPOSITION'), ('earth', None), ('.', None), ('I', None), ('have', None), ('visited', None), ('Mysore', None), ('10', 'NUMBER'), ('times', None), ('.', None)]
[('Mysore', 'NNP'), ('is', 'VBZ'), ('an', 'DT'), ('amazing', 'JJ'), ('place', 'NN'), ('on', 'IN'), ('earth', 'NN'), ('.', '.'), ('I', None), ('have', None), ('visited', None), ('Mysore', 'NNP'), ('10', None), ('times', None), ('.', '.')]
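Each of the three taggers leaves gaps on its own (None for unknown words, or 'NN' for everything). NLTK's sequential taggers accept a `backoff` argument, so they can be chained. A minimal sketch, assuming a shortened pattern list and whitespace splitting instead of nltk.word_tokenize so that no tokenizer data is needed; the variable names are illustrative:

```python
import nltk

# Lookup table reused from learnLookupTagger; patterns are a subset of learnRETagger
mapping = {
    '.': '.', 'place': 'NN', 'on': 'IN', 'earth': 'NN', 'Mysore': 'NNP',
    'is': 'VBZ', 'an': 'DT', 'amazing': 'JJ',
}
patterns = [
    (r'.*ing$', 'ADJECTIVE'),
    (r'.*ly$', 'ADVERB'),
    (r'^[0-9]+$', 'NUMBER'),
]

# Chain: dictionary lookup -> regular expressions -> default 'NN'
default_tagger = nltk.DefaultTagger('NN')
regexp_tagger = nltk.RegexpTagger(patterns, backoff=default_tagger)
lookup_tagger = nltk.UnigramTagger(model=mapping, backoff=regexp_tagger)

# Split on whitespace so the sketch needs no tokenizer data
tokens = 'I have visited Mysore 10 times .'.split()
result = lookup_tagger.tag(tokens)
print(result)
```

With this chain no token is left as None: words missing from the dictionary are passed to the patterns, and anything the patterns do not match ends up as 'NN'. Note that the article's two tag sets (Penn Treebank tags and the custom names such as NUMBER) get mixed in the result.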
- Training your own POS tagger
import nltk
import pickle
# Training sentences
def sampleData():
    return [
        'Bangalore is the capital of Karnataka.',
        'Steve Jobs was the CEO of Apple.',
        'iPhone was Invented by Apple.',
        'Books can be purchased in Market.',
    ]
# Tokenize and tag each sentence, then store every word and its tag in a dictionary
def buildDictionary():
    dictionary = {}
    for sent in sampleData():
        partsOfSpeechTags = nltk.pos_tag(nltk.word_tokenize(sent))
        for word, pos in partsOfSpeechTags:
            dictionary[word] = pos
    return dictionary
# Serialize the tagger to a binary file
def saveMyTagger(tagger, fileName):
    with open(fileName, 'wb') as fileHandle:
        pickle.dump(tagger, fileHandle)
# Build a tagger from the learned dictionary and save it
def saveMyTraining(fileName):
    tagger = nltk.UnigramTagger(model=buildDictionary())
    saveMyTagger(tagger, fileName)
# Load the saved model
def loadMyTagger(fileName):
    with open(fileName, 'rb') as fileHandle:
        return pickle.load(fileHandle)
# 'IPhone' differs in case from the 'iPhone' in the training sentences, so it is not found
sentence = 'IPhone is purchased by Steve Jobs in Bangalore Market.'
fileName = 'myTagger.pickle'
saveMyTraining(fileName)
myTagger = loadMyTagger(fileName)
print(myTagger.tag(nltk.word_tokenize(sentence)))
Output:
[('IPhone', None), ('is', 'VBZ'), ('purchased', 'VBN'), ('by', 'IN'), ('Steve', 'NNP'), ('Jobs', 'NNP'), ('in', 'IN'), ('Bangalore', 'NNP'), ('Market', 'NNP'), ('.', '.')]
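The tagger above is built from a lookup dictionary passed as `model=`. nltk.UnigramTagger can also be trained directly on tagged sentences through its `train=` argument, optionally with a backoff tagger for unseen words. A minimal sketch, assuming a one-sentence hand-written training set that reuses the tags printed earlier in this article; `train_sents` is an illustrative name:

```python
import nltk

# Tiny tagged corpus: a list of sentences, each a list of (word, tag) pairs
train_sents = [
    [('Bangalore', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('capital', 'NN'),
     ('of', 'IN'), ('Karnataka', 'NNP'), ('.', '.')],
]

# Words not seen in training fall back to 'NN' instead of None
tagger = nltk.UnigramTagger(train=train_sents, backoff=nltk.DefaultTagger('NN'))

result = tagger.tag(['Karnataka', 'is', 'the', 'capital', 'of', 'India', '.'])
print(result)
```

Lookups are case-sensitive in both variants, which is why 'IPhone' is tagged None in the output above even though 'iPhone' occurs in the training sentences.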
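- Writing your own grammar
A context-free grammar (CFG) consists of:
1. A start symbol/token
2. A set of terminal symbols
3. A set of nonterminal symbols
4. Rules (productions) that define the start symbol and the nonterminals
5. When the language is English, a-z are the symbols/tokens/letters
6. When the language is numeric, 0-9 are the symbols/tokens/letters
Productions are written in Backus-Naur Form (BNF).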
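To make these components concrete, the sketch below writes a small grammar as plain Python data and expands it by hand. It only illustrates how productions are applied and is not how NLTK implements generation; the names `rules`, `start` and `expand` are made up for this example.

```python
from itertools import product

# Nonterminals are the dictionary keys; every other symbol is a terminal.
# Each right-hand side is a list of symbols.
rules = {
    'ROOT': [['WORD']],
    'WORD': [[' '], ['NUMBER', 'LETTER'], ['LETTER', 'NUMBER']],
    'NUMBER': [['0'], ['1'], ['2'], ['3']],
    'LETTER': [['a'], ['b'], ['c'], ['d']],
}
start = 'ROOT'

def expand(symbol):
    """Yield every terminal string derivable from symbol (the grammar must not be recursive)."""
    if symbol not in rules:          # a terminal expands to itself
        yield symbol
        return
    for rhs in rules[symbol]:        # try each production of this nonterminal
        # Combine every expansion of each symbol on the right-hand side
        for parts in product(*(list(expand(s)) for s in rhs)):
            yield ''.join(parts)

words = list(expand(start))
print(len(words))
print(words[:5])
```

The grammar derives 33 strings in total: the single space, 16 digit-letter pairs and 16 letter-digit pairs.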
import nltk
import string
from nltk.parse.generate import generate
# Define a grammar whose start symbol is ROOT
productions = [
    'ROOT -> WORD',
    'WORD -> \' \'',
    'WORD -> NUMBER LETTER',
    'WORD -> LETTER NUMBER',
]
# Add the productions NUMBER -> '0' | '1' | '2' | '3', one per line
digits = list(string.digits)  # the digits as strings
for digit in digits[:4]:
    productions.append('NUMBER -> \'{w}\''.format(w=digit))
# Add the production LETTER -> 'a' | 'b' | 'c' | 'd'
letters = "' | '".join(list(string.ascii_lowercase)[:4])
productions.append('LETTER -> \'{w}\''.format(w=letters))
# Join the productions into one string, one production per line
grammarString = '\n'.join(productions)
# Create the grammar object and print it
grammar = nltk.CFG.fromstring(grammarString)
print(grammar)
# Generate at most 5 sentences with a maximum derivation depth of 4
for sentence in generate(grammar, n=5, depth=4):
    word = ''.join(sentence).replace(' ', '')
    print('Generated Word: {} , Size : {}'.format(word, len(word)))
Output:
Grammar with 12 productions (start state = ROOT)
ROOT -> WORD
WORD -> ' '
WORD -> NUMBER LETTER
WORD -> LETTER NUMBER
NUMBER -> '0'
NUMBER -> '1'
NUMBER -> '2'
NUMBER -> '3'
LETTER -> 'a'
LETTER -> 'b'
LETTER -> 'c'
LETTER -> 'd'
Generated Word: , Size : 0
Generated Word: 0a , Size : 2
Generated Word: 0b , Size : 2
Generated Word: 0c , Size : 2
Generated Word: 0d , Size : 2
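A grammar can also be used in the other direction, to check whether a given token sequence is derivable. A minimal sketch using nltk.ChartParser on the same grammar, written out here as one string; the helper name `derivable` is hypothetical:

```python
import nltk

# Same grammar as above, written directly as a string
grammar = nltk.CFG.fromstring("""
ROOT -> WORD
WORD -> ' '
WORD -> NUMBER LETTER
WORD -> LETTER NUMBER
NUMBER -> '0' | '1' | '2' | '3'
LETTER -> 'a' | 'b' | 'c' | 'd'
""")
parser = nltk.ChartParser(grammar)

def derivable(tokens):
    """Return True if the parser finds at least one parse tree for tokens."""
    return any(True for _ in parser.parse(tokens))

print(derivable(['0', 'a']))   # digit followed by letter
print(derivable(['a', '3']))   # letter followed by digit
print(derivable(['a', 'b']))   # two letters: no production covers this
```

Every token passed to the parser should be a terminal of the grammar; as far as I know, NLTK raises a ValueError for tokens the grammar does not cover.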
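- Probabilistic context-free grammars
For each nonterminal (the left-hand side), the probabilities of all its productions must sum to 1.
| Description | Content |
|---|---|
| Start symbol | ROOT |
| Nonterminals | WORD, P1, P2, P3, P4 |
| Terminals | 'A','B','C','D','E','F','G','H' |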
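The sum-to-1 constraint can be checked mechanically. A minimal sketch in plain Python, with the grammar written as a dictionary (an illustrative representation, not NLTK's); the probabilities are the ones used in this section, and `bad_nonterminals` is a made-up helper:

```python
import math

# Left-hand side -> list of (right-hand side, probability)
pcfg = {
    'ROOT': [('WORD', 1.0)],
    'WORD': [('P1', 0.25), ('P1 P2', 0.25), ('P1 P2 P3', 0.25), ('P1 P2 P3 P4', 0.25)],
    'P1': [("'A'", 1.0)],
    'P2': [("'B'", 0.5), ("'C'", 0.5)],
    'P3': [("'D'", 0.3), ("'E'", 0.3), ("'F'", 0.4)],
    'P4': [("'G'", 0.9), ("'H'", 0.1)],
}

def bad_nonterminals(grammar):
    """Return the nonterminals whose production probabilities do not sum to 1."""
    return [lhs for lhs, prods in grammar.items()
            if not math.isclose(sum(p for _, p in prods), 1.0)]

print(bad_nonterminals(pcfg))

# Dropping P4 -> 'H' leaves P4 with a total of 0.9
broken = dict(pcfg, P4=[("'G'", 0.9)])
print(bad_nonterminals(broken))
```

To my knowledge nltk.PCFG performs a comparable check itself and raises a ValueError when the productions of a nonterminal do not sum to 1.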
import nltk
from nltk.parse.generate import generate
productions = [
    "ROOT -> WORD [1.0]",
    "WORD -> P1 [0.25]",
    "WORD -> P1 P2 [0.25]",
    "WORD -> P1 P2 P3 [0.25]",
    "WORD -> P1 P2 P3 P4 [0.25]",
    "P1 -> 'A' [1.0]",
    "P2 -> 'B' [0.5]",
    "P2 -> 'C' [0.5]",
    "P3 -> 'D' [0.3]",
    "P3 -> 'E' [0.3]",
    "P3 -> 'F' [0.4]",
    "P4 -> 'G' [0.9]",
    "P4 -> 'H' [0.1]",
]
grammarString = '\n'.join(productions)
# Create the grammar object
grammar = nltk.PCFG.fromstring(grammarString)
print(grammar)
# Generate at most 5 strings with a maximum derivation depth of 4
for sentence in generate(grammar, n=5, depth=4):
    text = ''.join(sentence).replace(' ', '')
    print('String : {} , Size : {}'.format(text, len(text)))
Output:
Grammar with 13 productions (start state = ROOT)
ROOT -> WORD [1.0]
WORD -> P1 [0.25]
WORD -> P1 P2 [0.25]
WORD -> P1 P2 P3 [0.25]
WORD -> P1 P2 P3 P4 [0.25]
P1 -> 'A' [1.0]
P2 -> 'B' [0.5]
P2 -> 'C' [0.5]
P3 -> 'D' [0.3]
P3 -> 'E' [0.3]
P3 -> 'F' [0.4]
P4 -> 'G' [0.9]
P4 -> 'H' [0.1]
String : A , Size : 1
String : AB , Size : 2
String : AC , Size : 2
String : ABD , Size : 3
String : ABE , Size : 3
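The strings above are printed without their probabilities. Under a PCFG, the probability of a derivation is the product of the probabilities of the rules it uses, e.g. ABD uses ROOT -> WORD (1.0), WORD -> P1 P2 P3 (0.25), P1 -> 'A' (1.0), P2 -> 'B' (0.5) and P3 -> 'D' (0.3). A minimal plain-Python sketch of that arithmetic for this particular grammar; `slots`, `length_prob` and `string_prob` are illustrative names:

```python
import math
from itertools import product

# Terminal choices and their probabilities for each position, from the grammar above
slots = [
    {'A': 1.0},                        # P1
    {'B': 0.5, 'C': 0.5},              # P2
    {'D': 0.3, 'E': 0.3, 'F': 0.4},    # P3
    {'G': 0.9, 'H': 0.1},              # P4
]
# WORD -> P1 | P1 P2 | P1 P2 P3 | P1 P2 P3 P4, each with probability 0.25
length_prob = 0.25

string_prob = {}
for length in range(1, len(slots) + 1):
    for combo in product(*(slot.items() for slot in slots[:length])):
        text = ''.join(symbol for symbol, _ in combo)
        # ROOT -> WORD contributes 1.0, then one WORD rule, then one rule per terminal
        string_prob[text] = length_prob * math.prod(p for _, p in combo)

for text in ['A', 'AB', 'AC', 'ABD', 'ABE']:
    print(text, round(string_prob[text], 4))
# Number of distinct strings and the total probability mass
print(len(string_prob), round(sum(string_prob.values()), 10))
```

The grammar derives 21 distinct strings, and their probabilities add up to 1, as expected for a grammar in which every nonterminal's productions sum to 1 and every derivation terminates.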
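- Writing a recursive context-free grammar
As an example, generate palindromes recursively. A palindrome reads the same forwards and backwards, e.g. 010010 over the alphabet {0, 1}.
# Generate even-length digit palindromes
import nltk
import string
from nltk.parse.generate import generate
productions = [
    'ROOT -> WORD',
    "WORD -> ' '",
]
alphabets = list(string.digits)
# Recursive rule: wrap a WORD in the same digit on both sides
for alphabet in alphabets:
    productions.append("WORD -> '{w}' WORD '{w}'".format(w=alphabet))
grammarString = '\n'.join(productions)
grammar = nltk.CFG.fromstring(grammarString)
print(grammar)
# Generate at most 5 strings with a maximum derivation depth of 5
for sentence in generate(grammar, n=5, depth=5):
    palindrome = ''.join(sentence).replace(' ', '')
    print('Palindrome : {} , Size : {}'.format(palindrome, len(palindrome)))
Output: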
Grammar with 12 productions (start state = ROOT)
ROOT -> WORD
WORD -> ' '
WORD -> '0' WORD '0'
WORD -> '1' WORD '1'
WORD -> '2' WORD '2'
WORD -> '3' WORD '3'
WORD -> '4' WORD '4'
WORD -> '5' WORD '5'
WORD -> '6' WORD '6'
WORD -> '7' WORD '7'
WORD -> '8' WORD '8'
WORD -> '9' WORD '9'
Palindrome : , Size : 0
Palindrome : 00 , Size : 2
Palindrome : 0000 , Size : 4
Palindrome : 0110 , Size : 4
Palindrome : 0220 , Size : 4
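The recursive rule WORD -> 'd' WORD 'd' maps directly onto a recursive function. A minimal plain-Python sketch that mirrors the grammar; `expand_word` is a made-up name, and its `depth` argument counts applications of the recursive rule, which is not necessarily the same as the `depth` parameter of NLTK's generate():

```python
import string

def expand_word(depth):
    """Mirror WORD -> ' ' | d WORD d, applying the recursive rule at most depth times."""
    yield ''                                   # WORD -> ' ' (the space is stripped, as above)
    if depth > 0:
        for d in string.digits:                # WORD -> 'd' WORD 'd'
            for inner in expand_word(depth - 1):
                yield d + inner + d

palindromes = list(expand_word(2))
print(palindromes[:5])
print(len(palindromes))

# Every generated string is an even-length palindrome
print(all(p == p[::-1] and len(p) % 2 == 0 for p in palindromes))
```

Because each application of the rule adds one digit on each side, only even-length palindromes can be produced; the first five strings match the output above.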