【448】NLP, NER, PoS
Contents:
- Stop words
- Prepositions (part of speech)
- Named Entity Recognition (NER)
  3.1 Stanford NER
  3.2 spaCy
  3.3 NLTK
- Word extraction from a sentence
1. Stop words
ref: Removing stop words with NLTK in Python
ref: Remove Stop Words
import nltk
# nltk.download('stopwords')  # uncomment on the first run to fetch the corpus
from nltk.corpus import stopwords
print(stopwords.words('english'))
output:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
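A list like the one above is typically used to filter tokens. The following is a minimal sketch, not the referenced articles' code: `STOP_WORDS` is a small hand-copied subset of the list, and `remove_stop_words` is a hypothetical helper. In practice the set would probably be built from `stopwords.words('english')`.

```python
# Small subset of the NLTK English stop-word list printed above (assumption:
# hard-coded so the sketch runs without downloading the corpus).
STOP_WORDS = {"i", "me", "my", "this", "is", "a", "an", "the", "of", "and", "to", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not a stop word (hypothetical helper)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "This is an example showing the removal of stop words".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

The comparison lowercases each token because every entry in the NLTK list is lowercase, and a set keeps each membership test cheap.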
2. Prepositions (part of speech)
ref: How do I remove verbs, prepositions, conjunctions etc from my text? [closed]
ref: Alphabetical list of part-of-speech tags used in the Penn Treebank Project:
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]
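Once the tokens are tagged, prepositions can be dropped by filtering on the tag. A minimal sketch, assuming the six tagged pairs shown above as input; `DROP_TAGS` is a hypothetical name. In the Penn Treebank tag set, `IN` covers prepositions and subordinating conjunctions.

```python
# First six (token, tag) pairs copied from the nltk.pos_tag output above.
tagged = [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
          ('Thursday', 'NNP'), ('morning', 'NN')]

# Tags to remove (hypothetical name); add e.g. 'CC' to drop conjunctions too.
DROP_TAGS = {'IN'}

# Keep only the words whose tag is not in DROP_TAGS.
kept = [word for word, tag in tagged if tag not in DROP_TAGS]
print(kept)
```

The same pattern removes verbs or conjunctions by adding their tags (`VB*`, `CC`) to the set.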
3. Named Entity Recognition (NER)
ref: Introduction to Named Entity Recognition
ref: Named Entity Recognition with NLTK and SpaCy
- Stanford NER
- spaCy
- NLTK
3.1 Stanford NER
article = '''
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.'''
import nltk
from nltk.tag import StanfordNERTagger
print('NLTK Version: %s' % nltk.__version__)
# First argument: the trained CRF classifier; second: the Stanford NER jar (requires Java).
stanford_ner_tagger = StanfordNERTagger(
    r"D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\classifiers\english.muc.7class.distsim.crf.ser.gz",
    r"D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\stanford-ner-3.9.2.jar"
)
# tag() expects a list of tokens; a plain whitespace split is used here.
results = stanford_ner_tagger.tag(article.split())
print('Original Sentence: %s' % (article))
for result in results:
    tag_value = result[0]
    tag_type = result[1]
    # 'O' marks tokens that are outside any named entity.
    if tag_type != 'O':
        print('Type: %s, Value: %s' % (tag_type, tag_value))
output:
NLTK Version: 3.4
Original Sentence:
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.
Type: DATE, Value: Tuesday
Type: LOCATION, Value: Europe
Type: ORGANIZATION, Value: Asia-Pacific
Type: LOCATION, Value: Japan
Type: PERCENT, Value: 1.7
Type: PERCENT, Value: percent
Type: ORGANIZATION, Value: Nikkei
Type: PERCENT, Value: 3.1
Type: PERCENT, Value: percent
Type: LOCATION, Value: European
Type: LOCATION, Value: Union
Type: PERSON, Value: Theresa
Type: PERSON, Value: May
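The tagger labels each token separately, so "Theresa May" and "European Union" come out as two rows each. One way to merge them is to group consecutive tokens that share a tag. This is a sketch under assumptions: `merge_entities` is a hypothetical helper, and the sample pairs are copied by hand from the output above, with `O` for untagged tokens.

```python
from itertools import groupby

# (token, tag) pairs in the shape returned by the tagger; 'O' means "no entity".
results = [('European', 'LOCATION'), ('Union', 'LOCATION'), ('over', 'O'),
           ('Brexit,', 'O'), ('British', 'O'), ('Prime', 'O'), ('Minister', 'O'),
           ('Theresa', 'PERSON'), ('May', 'PERSON'), ('said', 'O')]

def merge_entities(tagged_tokens):
    """Join runs of consecutive tokens that carry the same non-'O' tag."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((tag, ' '.join(token for token, _ in group)))
    return entities

entities = merge_entities(results)
print(entities)
```

Because the tags carry no begin/inside markers, two different entities of the same type that sit next to each other would be merged into one span.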
3.2 spaCy
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp(article)
for X in doc.ents:
    print('Value: %s, Type: %s' % (X.text, X.label_))
output:
Value: Asian, Type: NORP
Value: Tuesday, Type: DATE
Value: Europe, Type: LOC
Value: MSCI’s, Type: ORG
Value: Asia-Pacific, Type: LOC
Value: Japan, Type: GPE
Value: 1.7 percent, Type: PERCENT
Value: 1-1/2, Type: CARDINAL
Value: Australian, Type: NORP
Value: 1.6 percent, Type: PERCENT
Value: Japan, Type: GPE
Value: 3.1 percent, Type: PERCENT
Value: Apple, Type: ORG
Value: 1.286, Type: MONEY
Value: three, Type: CARDINAL
Value: Nov.1, Type: NORP
Value: the
European Union, Type: ORG
Value: Brexit, Type: GPE
Value: British, Type: NORP
Value: Theresa May, Type: PERSON
Value: Monday, Type: DATE
Label meanings: https://spacy.io/api/annotation#pos-tagging
| Type | Description |
|---|---|
| PERSON | People, including fictional. |
| NORP | Nationalities or religious or political groups. |
| FAC | Buildings, airports, highways, bridges, etc. |
| ORG | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK_OF_ART | Titles of books, songs, etc. |
| LAW | Named documents made into laws. |
| LANGUAGE | Any named language. |
| DATE | Absolute or relative dates or periods. |
| TIME | Times smaller than a day. |
| PERCENT | Percentage, including "%". |
| MONEY | Monetary values, including unit. |
| QUANTITY | Measurements, as of weight or distance. |
| ORDINAL | "first", "second", etc. |
| CARDINAL | Numerals that do not fall under another type. |
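The spaCy snippet imports `Counter` without using it. One plausible use is to count how often each label occurs. A minimal sketch, assuming the labels are copied by hand from the spaCy output above, in order; with a live `doc`, the list would be `[X.label_ for X in doc.ents]`.

```python
from collections import Counter

# Entity labels copied from the spaCy output above, one per printed entity.
labels = ['NORP', 'DATE', 'LOC', 'ORG', 'LOC', 'GPE', 'PERCENT', 'CARDINAL',
          'NORP', 'PERCENT', 'GPE', 'PERCENT', 'ORG', 'MONEY', 'CARDINAL',
          'NORP', 'ORG', 'GPE', 'NORP', 'PERSON', 'DATE']

# Tally how many entities were found for each label.
label_counts = Counter(labels)
print(label_counts.most_common(3))
```

The tally also exposes likely mislabels worth checking against the table, such as `Nov.1` tagged `NORP` and `Brexit` tagged `GPE` in the output.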
3.3 NLTK
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
def fn_preprocess(art):
    # Split the text into tokens, then attach a Penn Treebank POS tag to each.
    art = nltk.word_tokenize(art)
    art = nltk.pos_tag(art)
    return art
art_processed = fn_preprocess(article)
print(art_processed)
output:
[('Asian', 'JJ'), ('shares', 'NNS'), ('skidded', 'VBN'), ('on', 'IN'), ('Tuesday', 'NNP'), ('after', 'IN'), ('a', 'DT'), ('rout', 'NN'), ('in', 'IN'), ('tech', 'JJ'), ('stocks', 'NNS'), ('put', 'VBD'), ('Wall', 'NNP'), ('Street', 'NNP'), ('to', 'TO'), ('the', 'DT'), ('sword', 'NN'), (',', ','), ('while', 'IN'), ('a', 'DT'), ('sharp', 'JJ'), ('drop', 'NN'), ('in', 'IN'), ('oil', 'NN'), ('prices', 'NNS'), ('and', 'CC'), ('political', 'JJ'), ('risks', 'NNS'), ('in', 'IN'), ('Europe', 'NNP'), ('pushed', 'VBD'), ('the', 'DT'), ('dollar', 'NN'), ('to', 'TO'), ('16-month', 'JJ'), ('highs', 'NNS'), ('as', 'IN'), ('investors', 'NNS'), ('dumped', 'VBD'), ('riskier', 'JJR'), ('assets', 'NNS'), ('.', '.'), ('MSCI', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('broadest', 'JJS'), ('index', 'NN'), ('of', 'IN'), ('Asia-Pacific', 'NNP'), ('shares', 'NNS'), ('outside', 'IN'), ('Japan', 'NNP'), ('dropped', 'VBD'), ('1.7', 'CD'), ('percent', 'NN'), ('to', 'TO'), ('a', 'DT'), ('1-1/2', 'JJ'), ('week', 'NN'), ('trough', 'NN'), (',', ','), ('with', 'IN'), ('Australian', 'JJ'), ('shares', 'NNS'), ('sinking', 'VBG'), ('1.6', 'CD'), ('percent', 'NN'), ('.', '.'), ('Japan', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('Nikkei', 'NNP'), ('dived', 'VBD'), ('3.1', 'CD'), ('percent', 'NN'), ('led', 'VBN'), ('by', 'IN'), ('losses', 'NNS'), ('in', 'IN'), ('electric', 'JJ'), ('machinery', 'NN'), ('makers', 'NNS'), ('and', 'CC'), ('suppliers', 'NNS'), ('of', 'IN'), ('Apple', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('iphone', 'NN'), ('parts', 'NNS'), ('.', '.'), ('Sterling', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('$', '$'), ('1.286', 'CD'), ('after', 'IN'), ('three', 'CD'), ('straight', 'JJ'), ('sessions', 'NNS'), ('of', 'IN'), ('losses', 'NNS'), ('took', 'VBD'), ('it', 'PRP'), ('to', 'TO'), ('the', 'DT'), ('lowest', 'JJS'), ('since', 'IN'), ('Nov.1', 'NNP'), ('as', 'IN'), ('there', 'EX'), ('were', 'VBD'), ('still', 'RB'), ('considerable', 'JJ'), ('unresolved', 'JJ'), ('issues', 'NNS'), ('with', 'IN'), ('the', 'DT'), 
('European', 'NNP'), ('Union', 'NNP'), ('over', 'IN'), ('Brexit', 'NNP'), (',', ','), ('British', 'NNP'), ('Prime', 'NNP'), ('Minister', 'NNP'), ('Theresa', 'NNP'), ('May', 'NNP'), ('said', 'VBD'), ('on', 'IN'), ('Monday', 'NNP'), ('.', '.')]
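The tagged output above stops at part-of-speech tags. The imported `ne_chunk` is NLTK's tool for turning such tags into named entities. As a rough stand-in that needs no NLTK models, the sketch below collects runs of consecutive `NNP` tokens as candidate entities. This is a crude heuristic and not what `ne_chunk` does; `proper_noun_runs` is a hypothetical helper, and the input is the tail of the tagged output above.

```python
# Tail of the POS-tagged output above, copied by hand.
tagged = [('the', 'DT'), ('European', 'NNP'), ('Union', 'NNP'), ('over', 'IN'),
          ('Brexit', 'NNP'), (',', ','), ('British', 'NNP'), ('Prime', 'NNP'),
          ('Minister', 'NNP'), ('Theresa', 'NNP'), ('May', 'NNP'),
          ('said', 'VBD'), ('on', 'IN'), ('Monday', 'NNP'), ('.', '.')]

def proper_noun_runs(tagged_tokens):
    """Collect maximal runs of consecutive NNP tokens as candidate entities."""
    runs, current = [], []
    for token, tag in tagged_tokens:
        if tag == 'NNP':
            current.append(token)
        elif current:
            # A non-NNP token ends the current run.
            runs.append(' '.join(current))
            current = []
    if current:
        runs.append(' '.join(current))
    return runs

candidates = proper_noun_runs(tagged)
print(candidates)
```

The heuristic assigns no entity type and over-merges ("British Prime Minister Theresa May" becomes one span). It would also pick up the curly apostrophe that the tagger labelled `NNP` earlier in the output.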
4. Word extraction from a sentence
ref: An introduction to Bag of Words and how to code it in Python for NLP
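import re
def word_extraction(sentence):
    # Words to drop (a tiny hand-written stop list).
    ignore = ['a', "the", "is"]
    # Replace every non-word character with a space, then split on whitespace.
    words = re.sub(r"[^\w]", " ", sentence).split()
    # Compare in lowercase so that "The" and "Is" are ignored as well.
    cleaned_text = [w.lower() for w in words if w.lower() not in ignore]
    return cleaned_text
a = "alex is. good guy."
print(word_extraction(a))
output: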
['alex', 'good', 'guy']
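The referenced article uses this kind of extraction as the first step of a bag-of-words model. A minimal standard-library sketch of that next step follows. `bag_of_words` and the second sample sentence are assumptions made for illustration, and `word_extraction` is repeated so the block runs on its own.

```python
import re
from collections import Counter

def word_extraction(sentence, ignore=("a", "the", "is")):
    """Lowercased words of the sentence, minus the ignored ones."""
    words = re.sub(r"[^\w]", " ", sentence).split()
    return [w.lower() for w in words if w.lower() not in ignore]

def bag_of_words(sentences):
    """Return (vocabulary, count vectors), one vector per sentence (hypothetical helper)."""
    tokenized = [word_extraction(s) for s in sentences]
    # The vocabulary is the sorted set of all words seen in any sentence.
    vocab = sorted({w for words in tokenized for w in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        # Counter returns 0 for words that do not occur in this sentence.
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

sentences = ["alex is. good guy.", "The guy is a good, good friend."]
vocab, vectors = bag_of_words(sentences)
print(vocab)
print(vectors)
```

Each vector has one slot per vocabulary word, so sentences of different lengths map to vectors of the same size.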