【448】NLP, NER, PoS

1. Stopwords
2. Prepositions (part of speech)
3. Named Entity Recognition (NER)
    3.1 Stanford NER
    3.2 spaCy
    3.3 NLTK
4. Word extraction from a sentence

1. Stopwords

ref: Removing stop words with NLTK in Python

import nltk

# nltk.download('stopwords')

from nltk.corpus import stopwords

print(stopwords.words('english'))

output:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
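
The list above is normally used as a filter over tokenized text. Below is a minimal sketch of that step. It hard-codes a small subset of the list so that it runs without the NLTK corpus; the names `STOPWORDS` and `remove_stopwords` are illustrative and are not part of NLTK.

```python
# A small subset of the English stopword list printed above, hard-coded so the
# sketch runs without NLTK. With the corpus downloaded, this could instead be
# set(stopwords.words('english')).
STOPWORDS = {'i', 'me', 'my', 'a', 'an', 'the', 'is', 'are',
             'of', 'in', 'on', 'to', 'and', 'this', 'very'}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens whose lowercase form is not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

sample = "This is a very simple example of stopword filtering".split()
filtered = remove_stopwords(sample)
print(filtered)
```

A set is used rather than a list so that each membership test is a constant-time lookup, which matters once the full list is applied to a large corpus.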

2. Prepositions (part of speech)

ref: How do I remove verbs, prepositions, conjunctions etc from my text? [closed]

ref: Alphabetical list of part-of-speech tags used in the Penn Treebank Project:

>>> import nltk

>>> sentence = """At eight o'clock on Thursday morning

... Arthur didn't feel very good."""

>>> tokens = nltk.word_tokenize(sentence)

>>> tokens

['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',

'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

>>> tagged = nltk.pos_tag(tokens)

>>> tagged[0:6]

[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),

('Thursday', 'NNP'), ('morning', 'NN')]
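
Once tokens carry Penn Treebank tags, removing prepositions comes down to filtering on the tag. The sketch below reuses the tagged pairs shown above; the choice of tags in `DROP_TAGS` and the helper name `drop_tags` are assumptions made for illustration.

```python
# (token, tag) pairs copied from the nltk.pos_tag output above.
tagged = [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
          ('Thursday', 'NNP'), ('morning', 'NN')]

# Penn Treebank tags to drop:
#   IN = preposition or subordinating conjunction
#   TO = "to"
#   CC = coordinating conjunction
# This particular set is an illustrative choice, not a fixed rule.
DROP_TAGS = {'IN', 'TO', 'CC'}

def drop_tags(tagged_tokens, drop=DROP_TAGS):
    """Keep only the tokens whose tag is not in the drop set."""
    return [tok for tok, tag in tagged_tokens if tag not in drop]

kept = drop_tags(tagged)
print(kept)
```

To remove verbs as well, tags starting with `VB` could be added to the filter in the same way.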

3. Named Entity Recognition (NER)

ref: Introduction to Named Entity Recognition

ref: Named Entity Recognition with NLTK and SpaCy

Stanford NER
spaCy
NLTK

3.1 Stanford NER

article = '''

Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a

sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped

riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2

week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in

electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight

sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the

European Union over Brexit, British Prime Minister Theresa May said on Monday.'''

import nltk

from nltk.tag import StanfordNERTagger

print('NLTK Version: %s' % nltk.__version__)

# Paths to the classifier model and the Stanford NER jar; adjust to your installation

stanford_ner_tagger = StanfordNERTagger(

    r"D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\classifiers\english.muc.7class.distsim.crf.ser.gz",

    r"D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\stanford-ner-3.9.2.jar"

)

# Naive whitespace tokenization: punctuation stays attached to the tokens

results = stanford_ner_tagger.tag(article.split())

print('Original Sentence: %s' % (article))

for result in results:

    tag_value = result[0]

    tag_type = result[1]

    # 'O' marks tokens that are outside any named entity

    if tag_type != 'O':

        print('Type: %s, Value: %s' % (tag_type, tag_value))

output:

NLTK Version: 3.4

Original Sentence:

Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a

sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped

riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2

week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in

electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight

sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the

European Union over Brexit, British Prime Minister Theresa May said on Monday.

Type: DATE, Value: Tuesday

Type: LOCATION, Value: Europe

Type: ORGANIZATION, Value: Asia-Pacific

Type: LOCATION, Value: Japan

Type: PERCENT, Value: 1.7

Type: PERCENT, Value: percent

Type: ORGANIZATION, Value: Nikkei

Type: PERCENT, Value: 3.1

Type: PERCENT, Value: percent

Type: LOCATION, Value: European

Type: LOCATION, Value: Union

Type: PERSON, Value: Theresa

Type: PERSON, Value: May
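
The tagger labels each token separately, so "European Union" and "Theresa May" come out as two rows each. One simple way to rebuild multi-word entities is to merge adjacent tokens that share the same non-`O` tag. The sketch below assumes `(token, tag)` pairs in the shape shown above; `merge_entities` is a hypothetical helper, not part of NLTK or Stanford NER.

```python
from itertools import groupby

# (token, tag) pairs in the shape returned by the tagger; the values follow
# the end of the output above ('O' means "not part of an entity").
results = [('the', 'O'), ('European', 'LOCATION'), ('Union', 'LOCATION'),
           ('over', 'O'), ('Brexit,', 'O'), ('British', 'O'), ('Prime', 'O'),
           ('Minister', 'O'), ('Theresa', 'PERSON'), ('May', 'PERSON'),
           ('said', 'O')]

def merge_entities(tagged_tokens):
    """Join runs of adjacent tokens that share the same non-'O' tag."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((tag, ' '.join(tok for tok, _ in group)))
    return entities

merged = merge_entities(results)
for tag, text in merged:
    print('Type: %s, Value: %s' % (tag, text))
```

A limitation of this approach: two different entities of the same type that sit directly next to each other would be merged into one.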

3.2 spaCy

import spacy

from spacy import displacy

from collections import Counter

import en_core_web_sm

nlp = en_core_web_sm.load()

doc = nlp(article)

for X in doc.ents:

	print('Value: %s, Type: %s' % (X.text, X.label_))

output:

Value: Asian, Type: NORP

Value: Tuesday, Type: DATE

Value: Europe, Type: LOC

Value: MSCI’s, Type: ORG

Value: Asia-Pacific, Type: LOC

Value: Japan, Type: GPE

Value: 1.7 percent, Type: PERCENT

Value: 1-1/2, Type: CARDINAL

Value: Australian, Type: NORP

Value: 1.6 percent, Type: PERCENT

Value: Japan, Type: GPE

Value: 3.1 percent, Type: PERCENT

Value: Apple, Type: ORG

Value: 1.286, Type: MONEY

Value: three, Type: CARDINAL

Value: Nov.1, Type: NORP

Value: the

European Union, Type: ORG

Value: Brexit, Type: GPE

Value: British, Type: NORP

Value: Theresa May, Type: PERSON

Value: Monday, Type: DATE

Label meanings: https://spacy.io/api/annotation#pos-tagging

Type	Description
`PERSON`	People, including fictional.
`NORP`	Nationalities or religious or political groups.
`FAC`	Buildings, airports, highways, bridges, etc.
`ORG`	Companies, agencies, institutions, etc.
`GPE`	Countries, cities, states.
`LOC`	Non-GPE locations, mountain ranges, bodies of water.
`PRODUCT`	Objects, vehicles, foods, etc. (Not services.)
`EVENT`	Named hurricanes, battles, wars, sports events, etc.
`WORK_OF_ART`	Titles of books, songs, etc.
`LAW`	Named documents made into laws.
`LANGUAGE`	Any named language.
`DATE`	Absolute or relative dates or periods.
`TIME`	Times smaller than a day.
`PERCENT`	Percentage, including "%".
`MONEY`	Monetary values, including unit.
`QUANTITY`	Measurements, as of weight or distance.
`ORDINAL`	“first”, “second”, etc.
`CARDINAL`	Numerals that do not fall under another type.
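
A quick way to summarize the spaCy result is to count how often each label occurs. The sketch below hard-codes the `(text, label)` pairs from the output above so that it runs without spaCy; with a real `doc`, the same list could be built as `[(X.text, X.label_) for X in doc.ents]`.

```python
from collections import Counter

# (text, label) pairs copied from the spaCy output above.
entities = [
    ('Asian', 'NORP'), ('Tuesday', 'DATE'), ('Europe', 'LOC'),
    ('MSCI’s', 'ORG'), ('Asia-Pacific', 'LOC'), ('Japan', 'GPE'),
    ('1.7 percent', 'PERCENT'), ('1-1/2', 'CARDINAL'), ('Australian', 'NORP'),
    ('1.6 percent', 'PERCENT'), ('Japan', 'GPE'), ('3.1 percent', 'PERCENT'),
    ('Apple', 'ORG'), ('1.286', 'MONEY'), ('three', 'CARDINAL'),
    ('Nov.1', 'NORP'), ('the European Union', 'ORG'), ('Brexit', 'GPE'),
    ('British', 'NORP'), ('Theresa May', 'PERSON'), ('Monday', 'DATE'),
]

# Count how many entities were found for each label.
label_counts = Counter(label for _, label in entities)

for label, count in label_counts.most_common():
    print('%s: %d' % (label, count))
```

The counts also make questionable labels easier to spot. For example, `Nov.1` was tagged `NORP` and `Brexit` was tagged `GPE`, neither of which matches the descriptions in the table.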

3.3 NLTK

import nltk

from nltk import word_tokenize, pos_tag, ne_chunk

nltk.download('words')

nltk.download('averaged_perceptron_tagger')

nltk.download('punkt')

nltk.download('maxent_ne_chunker')

def fn_preprocess(art):

    art = nltk.word_tokenize(art)

    art = nltk.pos_tag(art)

    return art

art_processed = fn_preprocess(article)

print(art_processed)

output:

[('Asian', 'JJ'), ('shares', 'NNS'), ('skidded', 'VBN'), ('on', 'IN'), ('Tuesday', 'NNP'), ('after', 'IN'), ('a', 'DT'), ('rout', 'NN'), ('in', 'IN'), ('tech', 'JJ'), ('stocks', 'NNS'), ('put', 'VBD'), ('Wall', 'NNP'), ('Street', 'NNP'), ('to', 'TO'), ('the', 'DT'), ('sword', 'NN'), (',', ','), ('while', 'IN'), ('a', 'DT'), ('sharp', 'JJ'), ('drop', 'NN'), ('in', 'IN'), ('oil', 'NN'), ('prices', 'NNS'), ('and', 'CC'), ('political', 'JJ'), ('risks', 'NNS'), ('in', 'IN'), ('Europe', 'NNP'), ('pushed', 'VBD'), ('the', 'DT'), ('dollar', 'NN'), ('to', 'TO'), ('16-month', 'JJ'), ('highs', 'NNS'), ('as', 'IN'), ('investors', 'NNS'), ('dumped', 'VBD'), ('riskier', 'JJR'), ('assets', 'NNS'), ('.', '.'), ('MSCI', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('broadest', 'JJS'), ('index', 'NN'), ('of', 'IN'), ('Asia-Pacific', 'NNP'), ('shares', 'NNS'), ('outside', 'IN'), ('Japan', 'NNP'), ('dropped', 'VBD'), ('1.7', 'CD'), ('percent', 'NN'), ('to', 'TO'), ('a', 'DT'), ('1-1/2', 'JJ'), ('week', 'NN'), ('trough', 'NN'), (',', ','), ('with', 'IN'), ('Australian', 'JJ'), ('shares', 'NNS'), ('sinking', 'VBG'), ('1.6', 'CD'), ('percent', 'NN'), ('.', '.'), ('Japan', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('Nikkei', 'NNP'), ('dived', 'VBD'), ('3.1', 'CD'), ('percent', 'NN'), ('led', 'VBN'), ('by', 'IN'), ('losses', 'NNS'), ('in', 'IN'), ('electric', 'JJ'), ('machinery', 'NN'), ('makers', 'NNS'), ('and', 'CC'), ('suppliers', 'NNS'), ('of', 'IN'), ('Apple', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('iphone', 'NN'), ('parts', 'NNS'), ('.', '.'), ('Sterling', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('$', '$'), ('1.286', 'CD'), ('after', 'IN'), ('three', 'CD'), ('straight', 'JJ'), ('sessions', 'NNS'), ('of', 'IN'), ('losses', 'NNS'), ('took', 'VBD'), ('it', 'PRP'), ('to', 'TO'), ('the', 'DT'), ('lowest', 'JJS'), ('since', 'IN'), ('Nov.1', 'NNP'), ('as', 'IN'), ('there', 'EX'), ('were', 'VBD'), ('still', 'RB'), ('considerable', 'JJ'), ('unresolved', 'JJ'), ('issues', 'NNS'), ('with', 'IN'), ('the', 'DT'), 
('European', 'NNP'), ('Union', 'NNP'), ('over', 'IN'), ('Brexit', 'NNP'), (',', ','), ('British', 'NNP'), ('Prime', 'NNP'), ('Minister', 'NNP'), ('Theresa', 'NNP'), ('May', 'NNP'), ('said', 'VBD'), ('on', 'IN'), ('Monday', 'NNP'), ('.', '.')]

4. Word extraction from a sentence

ref: An introduction to Bag of Words and how to code it in Python for NLP

import re

def word_extraction(sentence):

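	ignore = ['a', "the", "is"]

	# Replace every non-word character with a space, then split on whitespace

	words = re.sub(r"[^\w]", " ", sentence).split()

	# Lowercase before filtering so that "The" and "Is" are ignored too

	cleaned_text = [w.lower() for w in words if w.lower() not in ignore]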

	return cleaned_text

a = "alex is. good guy."

print(word_extraction(a))

output:

['alex', 'good', 'guy']
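
The referenced article uses word extraction as the first step of a bag-of-words model. Here is a minimal sketch of the next steps, building a vocabulary and a count vector for each sentence. `word_extraction` is redefined so that the block is self-contained, with the ignore check applied after lowercasing; `build_vocabulary`, `bag_of_words`, and the second sample sentence are illustrative assumptions.

```python
import re

def word_extraction(sentence):
    """Split a sentence into lowercase words, dropping a few ignored words."""
    ignore = {'a', 'the', 'is'}
    words = re.sub(r"[^\w]", " ", sentence).lower().split()
    return [w for w in words if w not in ignore]

def build_vocabulary(sentences):
    """Collect the sorted set of all extracted words across the sentences."""
    vocab = set()
    for s in sentences:
        vocab.update(word_extraction(s))
    return sorted(vocab)

def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    words = word_extraction(sentence)
    return [words.count(term) for term in vocab]

sentences = ["alex is. good guy.", "The guy is a good good friend."]
vocab = build_vocabulary(sentences)
vectors = [bag_of_words(s, vocab) for s in sentences]

print(vocab)
for s, v in zip(sentences, vectors):
    print(s, v)
```

Each vector has one position per vocabulary word, so sentences of different lengths become fixed-length numeric inputs.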
