【448】NLP, NER, PoS

1. Stop words (stopwords)
2. Prepositions (part of speech)
3. Named Entity Recognition (NER)
    3.1 Stanford NER
    3.2 spaCy
    3.3 NLTK
4. Word extraction from a sentence

1. Stop words (stopwords)

ref: Removing stop words with NLTK in Python

import nltk

# Run once to fetch the stop-word corpus.
# nltk.download('stopwords')

from nltk.corpus import stopwords

print(stopwords.words('english'))

output:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
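The list is usually used as a filter over tokens. Here is a minimal sketch of that step. It assumes a small hand-picked subset of the list (`SAMPLE_STOPWORDS`, a name made up for this note) so that it runs without any download, and it uses a simple regex tokenizer rather than NLTK's.

```python
import re

# Assumption: a small subset of the NLTK list, hard-coded so the sketch
# runs without downloading the corpus.
SAMPLE_STOPWORDS = {"i", "me", "my", "the", "a", "an", "is", "are",
                    "of", "in", "on", "and", "to"}

def remove_stopwords(text, stop_words=SAMPLE_STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stop_words]

filtered = remove_stopwords("The cat is on the mat and I like it")
print(filtered)  # ['cat', 'mat', 'like', 'it']
```

To filter against the full list, `set(stopwords.words('english'))` can be passed as `stop_words`. A set makes each membership check cheap.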

2. Prepositions (part of speech)

ref: How do I remove verbs, prepositions, conjunctions etc from my text? [closed]

ref: Alphabetical list of part-of-speech tags used in the Penn Treebank Project:

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]
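Once the tokens are tagged, removing prepositions comes down to filtering on the tag. The sketch below copies the tagged pairs from the session above as a literal, so it runs without NLTK. `DROP_TAGS` and `drop_tags` are names made up for this note. In the Penn Treebank tag set, `IN` covers prepositions and subordinating conjunctions, `CC` is a coordinating conjunction, and `TO` is the word "to".

```python
# Tagged pairs copied from the session above, so no NLTK models are needed here.
tagged = [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
          ('Thursday', 'NNP'), ('morning', 'NN')]

# Assumption: the tags chosen for removal; extend the set (e.g. with verb tags) as needed.
DROP_TAGS = {'IN', 'CC', 'TO'}

def drop_tags(tagged_tokens, drop=DROP_TAGS):
    """Keep only the words whose POS tag is not in the drop set."""
    return [word for word, tag in tagged_tokens if tag not in drop]

kept = drop_tags(tagged)
print(kept)  # ['eight', "o'clock", 'Thursday', 'morning']
```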

3. Named Entity Recognition (NER)

ref: Introduction to Named Entity Recognition

ref: Named Entity Recognition with NLTK and SpaCy

Stanford NER
spaCy
NLTK

3.1 Stanford NER

article = '''
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.'''

import nltk
from nltk.tag import StanfordNERTagger

print('NLTK Version: %s' % nltk.__version__)

# The classifier and jar paths point to a local Stanford NER download; adjust them for your machine.
stanford_ner_tagger = StanfordNERTagger(
    r"D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\classifiers\english.muc.7class.distsim.crf.ser.gz",
    r"D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\stanford-ner-3.9.2.jar"
)

# Tag the whitespace-separated tokens; 'O' marks a token outside any entity.
results = stanford_ner_tagger.tag(article.split())

print('Original Sentence: %s' % (article))
for tag_value, tag_type in results:
    if tag_type != 'O':
        print('Type: %s, Value: %s' % (tag_type, tag_value))

output:

NLTK Version: 3.4

Original Sentence:
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.
Type: DATE, Value: Tuesday
Type: LOCATION, Value: Europe
Type: ORGANIZATION, Value: Asia-Pacific
Type: LOCATION, Value: Japan
Type: PERCENT, Value: 1.7
Type: PERCENT, Value: percent
Type: ORGANIZATION, Value: Nikkei
Type: PERCENT, Value: 3.1
Type: PERCENT, Value: percent
Type: LOCATION, Value: European
Type: LOCATION, Value: Union
Type: PERSON, Value: Theresa
Type: PERSON, Value: May
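The tagger labels one token at a time, so multi-word entities such as "Theresa May" and "European Union" come back as separate rows. A rough way to stitch them together is to group consecutive tokens that share the same non-`O` tag. The sketch below assumes a hand-written sample shaped like the `(token, tag)` pairs in the output above. `merge_entities` is a helper made up for this note and is not part of NLTK.

```python
from itertools import groupby

# Assumption: a hand-written sample in the same (token, tag) shape as the tagger output.
sample = [('the', 'O'), ('European', 'LOCATION'), ('Union', 'LOCATION'),
          ('over', 'O'), ('Brexit,', 'O'), ('British', 'O'), ('Prime', 'O'),
          ('Minister', 'O'), ('Theresa', 'PERSON'), ('May', 'PERSON'), ('said', 'O')]

def merge_entities(tagged_tokens):
    """Join runs of adjacent tokens that carry the same non-'O' tag."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(token for token, _ in group), tag))
    return entities

merged = merge_entities(sample)
print(merged)  # [('European Union', 'LOCATION'), ('Theresa May', 'PERSON')]
```

One caveat: two different entities of the same type that sit next to each other would be merged into one span, so this is only a heuristic.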

3.2 spaCy

import spacy
import en_core_web_sm

# Assumes the small English model is installed (python -m spacy download en_core_web_sm).
nlp = en_core_web_sm.load()
doc = nlp(article)

for ent in doc.ents:
    print('Value: %s, Type: %s' % (ent.text, ent.label_))

output:

Value: Asian, Type: NORP
Value: Tuesday, Type: DATE
Value: Europe, Type: LOC
Value: MSCI’s, Type: ORG
Value: Asia-Pacific, Type: LOC
Value: Japan, Type: GPE
Value: 1.7 percent, Type: PERCENT
Value: 1-1/2, Type: CARDINAL
Value: Australian, Type: NORP
Value: 1.6 percent, Type: PERCENT
Value: Japan, Type: GPE
Value: 3.1 percent, Type: PERCENT
Value: Apple, Type: ORG
Value: 1.286, Type: MONEY
Value: three, Type: CARDINAL
Value: Nov.1, Type: NORP
Value: the
European Union, Type: ORG
Value: Brexit, Type: GPE
Value: British, Type: NORP
Value: Theresa May, Type: PERSON
Value: Monday, Type: DATE

Label meanings: https://spacy.io/api/annotation#pos-tagging

Type	Description
`PERSON`	People, including fictional.
`NORP`	Nationalities or religious or political groups.
`FAC`	Buildings, airports, highways, bridges, etc.
`ORG`	Companies, agencies, institutions, etc.
`GPE`	Countries, cities, states.
`LOC`	Non-GPE locations, mountain ranges, bodies of water.
`PRODUCT`	Objects, vehicles, foods, etc. (Not services.)
`EVENT`	Named hurricanes, battles, wars, sports events, etc.
`WORK_OF_ART`	Titles of books, songs, etc.
`LAW`	Named documents made into laws.
`LANGUAGE`	Any named language.
`DATE`	Absolute or relative dates or periods.
`TIME`	Times smaller than a day.
`PERCENT`	Percentage, including “%”.
`MONEY`	Monetary values, including unit.
`QUANTITY`	Measurements, as of weight or distance.
`ORDINAL`	“first”, “second”, etc.
`CARDINAL`	Numerals that do not fall under another type.
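With the label meanings in hand, a quick way to summarize a document is to count how often each label occurs and to pull out the entities of one type. The sketch below assumes a hand-copied subset of the `(text, label)` pairs from the spaCy output above, so it runs without spaCy installed. With a real `doc`, the same pairs would come from `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter

# Assumption: a hand-copied subset of the (text, label) pairs printed above.
ents = [("Asian", "NORP"), ("Tuesday", "DATE"), ("Europe", "LOC"),
        ("Japan", "GPE"), ("1.7 percent", "PERCENT"), ("Japan", "GPE"),
        ("Theresa May", "PERSON"), ("Monday", "DATE")]

# Count how many entities fall under each label.
label_counts = Counter(label for _, label in ents)

# Collect the text of every entity tagged as a country, city, or state.
gpe_names = [text for text, label in ents if label == "GPE"]

print(label_counts["DATE"], label_counts["GPE"])  # 2 2
print(gpe_names)  # ['Japan', 'Japan']
```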

3.3 NLTK

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# One-time downloads: tokenizer and tagger models, plus the data ne_chunk relies on.
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')

def fn_preprocess(art):
    # Tokenize the text, then attach a Penn Treebank POS tag to each token.
    tokens = word_tokenize(art)
    return pos_tag(tokens)

art_processed = fn_preprocess(article)
print(art_processed)

output:

[('Asian', 'JJ'), ('shares', 'NNS'), ('skidded', 'VBN'), ('on', 'IN'), ('Tuesday', 'NNP'), ('after', 'IN'), ('a', 'DT'), ('rout', 'NN'), ('in', 'IN'), ('tech', 'JJ'), ('stocks', 'NNS'), ('put', 'VBD'), ('Wall', 'NNP'), ('Street', 'NNP'), ('to', 'TO'), ('the', 'DT'), ('sword', 'NN'), (',', ','), ('while', 'IN'), ('a', 'DT'), ('sharp', 'JJ'), ('drop', 'NN'), ('in', 'IN'), ('oil', 'NN'), ('prices', 'NNS'), ('and', 'CC'), ('political', 'JJ'), ('risks', 'NNS'), ('in', 'IN'), ('Europe', 'NNP'), ('pushed', 'VBD'), ('the', 'DT'), ('dollar', 'NN'), ('to', 'TO'), ('16-month', 'JJ'), ('highs', 'NNS'), ('as', 'IN'), ('investors', 'NNS'), ('dumped', 'VBD'), ('riskier', 'JJR'), ('assets', 'NNS'), ('.', '.'), ('MSCI', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('broadest', 'JJS'), ('index', 'NN'), ('of', 'IN'), ('Asia-Pacific', 'NNP'), ('shares', 'NNS'), ('outside', 'IN'), ('Japan', 'NNP'), ('dropped', 'VBD'), ('1.7', 'CD'), ('percent', 'NN'), ('to', 'TO'), ('a', 'DT'), ('1-1/2', 'JJ'), ('week', 'NN'), ('trough', 'NN'), (',', ','), ('with', 'IN'), ('Australian', 'JJ'), ('shares', 'NNS'), ('sinking', 'VBG'), ('1.6', 'CD'), ('percent', 'NN'), ('.', '.'), ('Japan', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('Nikkei', 'NNP'), ('dived', 'VBD'), ('3.1', 'CD'), ('percent', 'NN'), ('led', 'VBN'), ('by', 'IN'), ('losses', 'NNS'), ('in', 'IN'), ('electric', 'JJ'), ('machinery', 'NN'), ('makers', 'NNS'), ('and', 'CC'), ('suppliers', 'NNS'), ('of', 'IN'), ('Apple', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('iphone', 'NN'), ('parts', 'NNS'), ('.', '.'), ('Sterling', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('$', '$'), ('1.286', 'CD'), ('after', 'IN'), ('three', 'CD'), ('straight', 'JJ'), ('sessions', 'NNS'), ('of', 'IN'), ('losses', 'NNS'), ('took', 'VBD'), ('it', 'PRP'), ('to', 'TO'), ('the', 'DT'), ('lowest', 'JJS'), ('since', 'IN'), ('Nov.1', 'NNP'), ('as', 'IN'), ('there', 'EX'), ('were', 'VBD'), ('still', 'RB'), ('considerable', 'JJ'), ('unresolved', 'JJ'), ('issues', 'NNS'), ('with', 'IN'), ('the', 'DT'), 
('European', 'NNP'), ('Union', 'NNP'), ('over', 'IN'), ('Brexit', 'NNP'), (',', ','), ('British', 'NNP'), ('Prime', 'NNP'), ('Minister', 'NNP'), ('Theresa', 'NNP'), ('May', 'NNP'), ('said', 'VBD'), ('on', 'IN'), ('Monday', 'NNP'), ('.', '.')]
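The tagged list above is only the input to entity recognition. NLTK's own chunker is `ne_chunk`, which takes a tagged list like this one and needs the downloaded models. As a rough stand-in that runs without those models, the sketch below collects runs of consecutive proper-noun tags (`NNP`, `NNPS`) as candidate entity names. This is a heuristic written for this note, not what `ne_chunk` does, and `proper_noun_runs` is a made-up helper name.

```python
from itertools import groupby

# Tail of the tagged output above, copied as a literal.
tagged_tail = [('British', 'NNP'), ('Prime', 'NNP'), ('Minister', 'NNP'),
               ('Theresa', 'NNP'), ('May', 'NNP'), ('said', 'VBD'),
               ('on', 'IN'), ('Monday', 'NNP'), ('.', '.')]

def proper_noun_runs(tagged_tokens):
    """Return each maximal run of adjacent proper nouns as one phrase."""
    runs = []
    for is_proper, group in groupby(tagged_tokens,
                                    key=lambda pair: pair[1] in ('NNP', 'NNPS')):
        if is_proper:
            runs.append(' '.join(word for word, _ in group))
    return runs

candidates = proper_noun_runs(tagged_tail)
print(candidates)  # ['British Prime Minister Theresa May', 'Monday']
```

The result shows the limit of the heuristic: it cannot split "British Prime Minister" from "Theresa May" and it assigns no entity type, which is what a trained chunker adds.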

4. Word extraction from a sentence

ref: An introduction to Bag of Words and how to code it in Python for NLP

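import re

def word_extraction(sentence):
    ignore = ['a', "the", "is"]
    # Replace every non-word character with a space, then split on whitespace.
    words = re.sub(r"[^\w]", " ", sentence).split()
    # Lowercase before the check so that "The" and "Is" are dropped too.
    cleaned_text = [w.lower() for w in words if w.lower() not in ignore]
    return cleaned_text

a = "alex is. good guy."
print(word_extraction(a))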

output:

['alex', 'good', 'guy']
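The referenced article goes on from word extraction to a bag-of-words model. A minimal sketch of that next step follows: build a sorted vocabulary across all sentences, then count each vocabulary word per sentence. `tokenize` and `bag_of_words` are names made up for this note, and the two sentences are invented examples.

```python
import re
from collections import Counter

def tokenize(sentence, ignore=("a", "the", "is")):
    """Same idea as word_extraction: strip punctuation, lowercase, drop ignored words."""
    words = re.sub(r"[^\w]", " ", sentence).split()
    return [w.lower() for w in words if w.lower() not in ignore]

def bag_of_words(sentences):
    """Return the sorted vocabulary and one count vector per sentence."""
    tokenized = [tokenize(s) for s in sentences]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["Alex is a good guy.",
                               "The guy is good, good indeed."])
print(vocab)    # ['alex', 'good', 'guy', 'indeed']
print(vectors)  # [[1, 1, 1, 0], [0, 2, 1, 1]]
```

Each vector has one slot per vocabulary word, so sentences of different lengths become fixed-length numeric rows.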
