目录:

  1. 停用词 —— stopwords
  2. 介词 —— prepositions —— part of speech
  3. Named Entity Recognition (NER)  3.1 Stanford NER
      3.2 spaCy
      3.3 NLTK
  4. 句子中单词提取(Word extraction)

1. 停用词(stopwords)

ref: Removing stop words with NLTK in Python

ref: Remove Stop Words

import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords
print(stopwords.words('english')) output:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

2. 介词(prepositions, part of speech)

ref: How do I remove verbs, prepositions, conjunctions etc from my text? [closed]

ref: Alphabetical list of part-of-speech tags used in the Penn Treebank Project:

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]

3. Named Entity Recognition (NER)

ref: Introduction to Named Entity Recognition

ref: Named Entity Recognition with NLTK and SpaCy

  • Standford NER
  • spaCy
  • NLTK

3.1 Stanford NER

article = '''
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.''' import nltk
from nltk.tag import StanfordNERTagger print('NTLK Version: %s' % nltk.__version__) stanford_ner_tagger = StanfordNERTagger(
r"D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\classifiers\english.muc.7class.distsim.crf.ser.gz",
r"D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\stanford-ner-3.9.2.jar"
) results = stanford_ner_tagger.tag(article.split()) print('Original Sentence: %s' % (article))
for result in results:
tag_value = result[0]
tag_type = result[1]
if tag_type != 'O':
print('Type: %s, Value: %s' % (tag_type, tag_value)) output:
NTLK Version: 3.4
Original Sentence:
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.
Type: DATE, Value: Tuesday
Type: LOCATION, Value: Europe
Type: ORGANIZATION, Value: Asia-Pacific
Type: LOCATION, Value: Japan
Type: PERCENT, Value: 1.7
Type: PERCENT, Value: percent
Type: ORGANIZATION, Value: Nikkei
Type: PERCENT, Value: 3.1
Type: PERCENT, Value: percent
Type: LOCATION, Value: European
Type: LOCATION, Value: Union
Type: PERSON, Value: Theresa
Type: PERSON, Value: May

3.2 spaCy

import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp(article)
for X in doc.ents:
print('Value: %s, Type: %s' % (X.text, X.label_)) output:
Value: Asian, Type: NORP
Value: Tuesday, Type: DATE
Value: Europe, Type: LOC
Value: MSCI’s, Type: ORG
Value: Asia-Pacific, Type: LOC
Value: Japan, Type: GPE
Value: 1.7 percent, Type: PERCENT
Value: 1-1/2, Type: CARDINAL
Value: Australian, Type: NORP
Value: 1.6 percent, Type: PERCENT
Value: Japan, Type: GPE
Value: 3.1 percent, Type: PERCENT
Value: Apple, Type: ORG
Value: 1.286, Type: MONEY
Value: three, Type: CARDINAL
Value: Nov.1, Type: NORP
Value: the
European Union, Type: ORG
Value: Brexit, Type: GPE
Value: British, Type: NORP
Value: Theresa May, Type: PERSON
Value: Monday, Type: DATE

标签含义:https://spacy.io/api/annotation#pos-tagging

Type Description
PERSON People, including fictional.
NORP Nationalities or religious or political groups.
FAC Buildings, airports, highways, bridges, etc.
ORG Companies, agencies, institutions, etc.
GPE Countries, cities, states.
LOC Non-GPE locations, mountain ranges, bodies of water.
PRODUCT Objects, vehicles, foods, etc. (Not services.)
EVENT Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART Titles of books, songs, etc.
LAW Named documents made into laws.
LANGUAGE Any named language.
DATE Absolute or relative dates or periods.
TIME Times smaller than a day.
PERCENT Percentage, including ”%“.
MONEY Monetary values, including unit.
QUANTITY Measurements, as of weight or distance.
ORDINAL “first”, “second”, etc.
CARDINAL Numerals that do not fall under another type.

3.3 NLTK

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('maxent_ne_chunker') def fn_preprocess(art):
art = nltk.word_tokenize(art)
art = nltk.pos_tag(art)
return art
art_processed = fn_preprocess(article)
print(art_processed) output:
[('Asian', 'JJ'), ('shares', 'NNS'), ('skidded', 'VBN'), ('on', 'IN'), ('Tuesday', 'NNP'), ('after', 'IN'), ('a', 'DT'), ('rout', 'NN'), ('in', 'IN'), ('tech', 'JJ'), ('stocks', 'NNS'), ('put', 'VBD'), ('Wall', 'NNP'), ('Street', 'NNP'), ('to', 'TO'), ('the', 'DT'), ('sword', 'NN'), (',', ','), ('while', 'IN'), ('a', 'DT'), ('sharp', 'JJ'), ('drop', 'NN'), ('in', 'IN'), ('oil', 'NN'), ('prices', 'NNS'), ('and', 'CC'), ('political', 'JJ'), ('risks', 'NNS'), ('in', 'IN'), ('Europe', 'NNP'), ('pushed', 'VBD'), ('the', 'DT'), ('dollar', 'NN'), ('to', 'TO'), ('16-month', 'JJ'), ('highs', 'NNS'), ('as', 'IN'), ('investors', 'NNS'), ('dumped', 'VBD'), ('riskier', 'JJR'), ('assets', 'NNS'), ('.', '.'), ('MSCI', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('broadest', 'JJS'), ('index', 'NN'), ('of', 'IN'), ('Asia-Pacific', 'NNP'), ('shares', 'NNS'), ('outside', 'IN'), ('Japan', 'NNP'), ('dropped', 'VBD'), ('1.7', 'CD'), ('percent', 'NN'), ('to', 'TO'), ('a', 'DT'), ('1-1/2', 'JJ'), ('week', 'NN'), ('trough', 'NN'), (',', ','), ('with', 'IN'), ('Australian', 'JJ'), ('shares', 'NNS'), ('sinking', 'VBG'), ('1.6', 'CD'), ('percent', 'NN'), ('.', '.'), ('Japan', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('Nikkei', 'NNP'), ('dived', 'VBD'), ('3.1', 'CD'), ('percent', 'NN'), ('led', 'VBN'), ('by', 'IN'), ('losses', 'NNS'), ('in', 'IN'), ('electric', 'JJ'), ('machinery', 'NN'), ('makers', 'NNS'), ('and', 'CC'), ('suppliers', 'NNS'), ('of', 'IN'), ('Apple', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('iphone', 'NN'), ('parts', 'NNS'), ('.', '.'), ('Sterling', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('$', '$'), ('1.286', 'CD'), ('after', 'IN'), ('three', 'CD'), ('straight', 'JJ'), ('sessions', 'NNS'), ('of', 'IN'), ('losses', 'NNS'), ('took', 'VBD'), ('it', 'PRP'), ('to', 'TO'), ('the', 'DT'), ('lowest', 'JJS'), ('since', 'IN'), ('Nov.1', 'NNP'), ('as', 'IN'), ('there', 'EX'), ('were', 'VBD'), ('still', 'RB'), ('considerable', 'JJ'), ('unresolved', 'JJ'), ('issues', 'NNS'), ('with', 'IN'), ('the', 'DT'), ('European', 'NNP'), ('Union', 'NNP'), ('over', 'IN'), ('Brexit', 'NNP'), (',', ','), ('British', 'NNP'), ('Prime', 'NNP'), ('Minister', 'NNP'), ('Theresa', 'NNP'), ('May', 'NNP'), ('said', 'VBD'), ('on', 'IN'), ('Monday', 'NNP'), ('.', '.')]

  

4. 句子中单词提取(Word extraction)

ref: An introduction to Bag of Words and how to code it in Python for NLP

import re
def word_extraction(sentence):
ignore = ['a', "the", "is"]
words = re.sub("[^\w]", " ", sentence).split()
cleaned_text = [w.lower() for w in words if w not in ignore]
return cleaned_text a = "alex is. good guy."
print(word_extraction(a)) output:
['alex', 'good', 'guy']

【448】NLP, NER, PoS的更多相关文章

  1. 【数据处理】各门店POS销售导入

    --抓取西部POS数据DELETE FROM POSLSBF INSERT INTO POSLSBFselect * from [192.168.1.100].[SCMIS].DBO.possrlbf ...

  2. 论文笔记【一】Chinese NER Using Lattice LSTM

    论文:Chinese NER Using Lattice LSTM 论文链接:https://arxiv.org/abs/1805.02023 论文作者:Yue Zhang∗and Jie Yang∗ ...

  3. 【LDA】nlp

    http://pythonhosted.org/lda/getting_started.html http://radimrehurek.com/gensim/

  4. 448. Find All Numbers Disappeared in an Array【easy】

    448. Find All Numbers Disappeared in an Array[easy] Given an array of integers where 1 ≤ a[i] ≤ n (n ...

  5. 机器学习(Machine Learning)&深度学习(Deep Learning)资料【转】

    转自:机器学习(Machine Learning)&深度学习(Deep Learning)资料 <Brief History of Machine Learning> 介绍:这是一 ...

  6. 【Nodejs】理想论坛帖子爬虫1.01

    用Nodejs把Python实现过的理想论坛爬虫又实现了一遍,但是怎么判断所有回调函数都结束没有好办法,目前的spiderCount==spiderFinished判断法在多页情况下还是会提前中止. ...

  7. 【BZOJ-1146】网络管理Network DFS序 + 带修主席树

    1146: [CTSC2008]网络管理Network Time Limit: 50 Sec  Memory Limit: 162 MBSubmit: 3495  Solved: 1032[Submi ...

  8. 通用js函数集锦<来源于网络> 【二】

    通用js函数集锦<来源于网络> [二] 1.数组方法集2.cookie方法集3.url方法集4.正则表达式方法集5.字符串方法集6.加密方法集7.日期方法集8.浏览器检测方法集9.json ...

  9. 【BZOJ3940】【BZOJ3942】[Usaco2015 Feb]Censoring AC自动机/KMP/hash+栈

    [BZOJ3942][Usaco2015 Feb]Censoring Description Farmer John has purchased a subscription to Good Hoov ...

随机推荐

  1. Linux操作系统启动故障排错之/boot目录被删除恢复案例

    Linux操作系统启动故障排错之"/boot"目录被删除恢复案例 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.模拟删除/boot分区 1>.查看/b ...

  2. Linux操作系统的进程管理

    Linux操作系统的进程管理 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.进程相关概念 1>.进程概述 内核的功用: 进程管理.文件系统.网络功能.内存管理.驱动程序. ...

  3. django项目登录中使用图片验证码

    应用下创建untils文件夹放置封装图片验证码的函数 创建validCode.py文件定义验证码规则 import random def get_random_color(): return (ran ...

  4. Kotlin协程重要概念详解【纯理论】

    在之前对Kotlin的反射进行了详细的学习,接下来进入一个全新的篇章,就是关于Koltin的协程[coroutine],在正式撸码之前先对它有一个全面理论化的了解: 协程的定义: 协和通过将复杂性放入 ...

  5. 51nod 1053 最大M子段和 V2

    N个整数组成的序列a[1],a[2],a[3],…,a[n],将这N个数划分为互不相交的M个子段,并且这M个子段的和是最大的.如果M >= N个数中正数的个数,那么输出所有正数的和. 例如:-2 ...

  6. 动态创建自绘的CListBox注意事项

    Create(WS_VISIBLE|WS_CHILD|LBS_NOTIFY|LBS_OWNERDRAWFIXED|LBS_HASSTRINGS|LBS_NOINTEGRALHEIGHT ,rcWnd, ...

  7. git添加doc文件维护

    原文地址:https://www.cnblogs.com/yezuhui/p/6853271.html 说明: git 一般只能对纯文本文件进行版本控制,但是如果有其他中间转化软件的协助,就可以对任意 ...

  8. 日常SQL总结

    THEN '理财帐户' ELSE '其他' end case后可以加入条件在写when,length(String) 为string的字符长度 length(),括号里不仅可以是string的字符串, ...

  9. MySQL备份的三中方式

    一.备份的目的 做灾难恢复:对损坏的数据进行恢复和还原需求改变:因需求改变而需要把数据还原到改变以前测试:测试新功能是否可用 二.备份需要考虑的问题 可以容忍丢失多长时间的数据:恢复数据要在多长时间内 ...

  10. javaWeb开发中entityBean的习惯用法

    entity bean的作用是将与数据库交互的过程封装成对象,在servelet编写的过程中,只需要会写java,不必考虑与数据库交互细节. 实体类: 基本与数据库的表相对应,表示一个实在的对象. 例 ...