spaCy

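Official documentation: https://spacy.io/api

Overview of spaCy features

spaCy handles tokenization, named entity recognition, part-of-speech tagging, and more. Before using it, you need to download a pretrained model: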

pip install --user spacy

python -m spacy download en_core_web_sm

pip install neuralcoref

pip install textacy

sentencizer

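Sentence segmentation splits a text into sentences. spaCy marks the boundaries by setting the is_sent_start attribute to True on certain tokens, namely the tokens that begin a sentence. With en_core_web_sm loaded as below, the boundaries come by default from the model's dependency parse; the rule-based sentencizer component is a lighter alternative that splits on punctuation.

import spacy

nlp = spacy.load('en_core_web_sm')  # load the pretrained model
txt = "some text read from one paper ..."
doc = nlp(txt)

for sent in doc.sents:
    print(sent)
    print('#' * 50)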
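
The is_sent_start idea described above can be illustrated without spaCy. The following is only a toy sketch: ToyToken, mark_sentence_starts, and iter_sentences are hypothetical names invented for this example, and the rule "a token starts a sentence if it follows ., ! or ?" is a simplification, not spaCy's actual logic.

```python
import re
from dataclasses import dataclass


@dataclass
class ToyToken:
    """Hypothetical stand-in for a spaCy token: text plus a boundary flag."""
    text: str
    is_sent_start: bool = False


def mark_sentence_starts(words):
    """Flag the first token and every token that follows '.', '!' or '?'."""
    tokens = [ToyToken(w) for w in words]
    for i, tok in enumerate(tokens):
        tok.is_sent_start = i == 0 or words[i - 1] in {".", "!", "?"}
    return tokens


def iter_sentences(tokens):
    """Group tokens into sentences, opening a new one at each flagged token."""
    current = []
    for tok in tokens:
        if tok.is_sent_start and current:
            yield current
            current = []
        current.append(tok)
    if current:
        yield current


text = "Monopoles are hypothetical. Nobody has seen one! Will we?"
words = re.findall(r"\w+|[^\w\s]", text)  # crude split into words and punctuation
sentences = [
    " ".join(tok.text for tok in sent)
    for sent in iter_sentences(mark_sentence_starts(words))
]
print(sentences)
```

Iterating over doc.sents in spaCy plays the same role as iter_sentences here: the sentences are recovered purely from the per-token flags.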

Tokenization

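Tokenization splits a sentence into tokens. English words are usually separated by spaces, and spaCy additionally splits off punctuation, so the final period becomes its own token.

import spacy

nlp = spacy.load('en_core_web_sm')
txt = "A magnetic monopole is a hypothetical elementary particle."
doc = nlp(txt)
tokens = [token for token in doc]
print(tokens)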
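
To see why splitting on spaces alone is not quite enough, here is a small comparison written with the standard library only. The regular expression is an assumption chosen for illustration and is far simpler than spaCy's tokenizer rules.

```python
import re

txt = "A magnetic monopole is a hypothetical elementary particle."

# Naive approach: split on whitespace, punctuation stays glued to the word.
whitespace_tokens = txt.split()

# Slightly better: words and single punctuation marks become separate tokens.
regex_tokens = re.findall(r"\w+|[^\w\s]", txt)

print(whitespace_tokens[-1])
print(regex_tokens[-2:])
```

The whitespace split leaves the period attached to the last word, while the regex version yields nine tokens, the same count as the token list shown in the part-of-speech section.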

Part-of-speech tagging

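Part-of-speech tagging labels each word in the sentence with its word class: noun, verb, adjective, and so on.

pos = [token.pos_ for token in doc]
print(pos)
>>> ['DET', 'ADJ', 'NOUN', 'VERB', 'DET', 'ADJ', 'ADJ', 'NOUN', 'PUNCT']
# In words: [determiner, adjective, noun, verb, determiner, adjective, adjective, noun, punctuation]
# The original tokens are [A, magnetic, monopole, is, a, hypothetical, elementary, particle, .]
# Tags depend on the model version; more recent models may tag "is" as AUX rather than VERB.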
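
Once the tags are available, they are usually paired with the tokens and filtered or counted. The sketch below hard-codes the tokens and tags printed above instead of calling spaCy, so it runs without a model; with a real Doc you would read token.text and token.pos_ instead.

```python
from collections import Counter

# Values copied from the output shown above.
tokens = ["A", "magnetic", "monopole", "is", "a",
          "hypothetical", "elementary", "particle", "."]
pos = ["DET", "ADJ", "NOUN", "VERB", "DET", "ADJ", "ADJ", "NOUN", "PUNCT"]

tagged = list(zip(tokens, pos))                    # [(token, tag), ...]
nouns = [tok for tok, tag in tagged if tag == "NOUN"]
tag_counts = Counter(pos)                          # how often each tag occurs

print(nouns)
print(tag_counts.most_common(2))
```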

Lemmatization

Lemmatization reduces each word to its base form (lemma): am, is, are, and been become be; plurals become singular (cats -> cat); past tense becomes present tense (had -> have). In code, use token.lemma_.

lem = [token.lemma_ for token in doc]
print(lem)
>>> ['a', 'magnetic', 'monopole', 'be', 'a', 'hypothetical', 'elementary', 'particle', '.']
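
As a rough mental model of lemmatization, consider a lookup table for irregular forms plus a single suffix rule. This is a deliberately naive sketch: IRREGULAR and toy_lemma are hypothetical names, and spaCy's lemmatizer is considerably more sophisticated than this.

```python
# Hypothetical lookup table for a few irregular forms.
IRREGULAR = {
    "am": "be", "is": "be", "are": "be", "been": "be",
    "had": "have", "has": "have",
}


def toy_lemma(word):
    """Return a crude lemma: table lookup first, then strip a plural 's'."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


lemmas = [toy_lemma(w) for w in ["am", "is", "are", "cats", "had", "particle"]]
print(lemmas)
```

The suffix rule fails on many real words, which is one reason to rely on token.lemma_ rather than hand-written rules.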

Stop words

Identifies stop words such as a and the.

stop_words = [token.is_stop for token in doc]
print(stop_words)
>>> [True, False, False, True, True, False, False, False, False]
# In this magnetic monopole example, the stop words are a (twice) and is.
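
A common use of the stop-word flags is to keep only the content words. The sketch below reuses the tokens and the flags printed above as plain lists, so it does not need spaCy; with a real Doc you would test token.is_stop and token.is_punct directly.

```python
# Values copied from the example output above.
tokens = ["A", "magnetic", "monopole", "is", "a",
          "hypothetical", "elementary", "particle", "."]
is_stop = [True, False, False, True, True, False, False, False, False]

# Tokens flagged as stop words.
stop_words = [tok for tok, flag in zip(tokens, is_stop) if flag]

# Everything else, minus punctuation (str.isalpha stands in for is_punct here).
content_words = [tok for tok, flag in zip(tokens, is_stop)
                 if not flag and tok.isalpha()]

print(stop_words)
print(content_words)
```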

Dependency Parsing

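Dependency parsing labels the grammatical role of each word, such as subject, predicate, object, or connective. In code, use token.dep_.

dep = [token.dep_ for token in doc]
print(dep)
>>> ['det', 'amod', 'nsubj', 'ROOT', 'det', 'amod', 'amod', 'attr', 'punct']

spaCy's dependency parser uses the ClearNLP dependency labels (ClearNLP Dependency Labels). According to that label set, the output above reads, in plain terms: [determiner, adjectival modifier, nominal subject, root, determiner, adjectival modifier, adjectival modifier, attribute, punctuation]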
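
The labels become more useful together with the head of each token, which spaCy exposes as token.head. The sketch below rebuilds the tree for the example sentence from plain lists. The dep list is copied from the output above, but the heads list is an assumption made for this illustration (the post does not print it), and children and subtree are hypothetical helper names.

```python
tokens = ["A", "magnetic", "monopole", "is", "a",
          "hypothetical", "elementary", "particle", "."]
dep = ["det", "amod", "nsubj", "ROOT", "det", "amod", "amod", "attr", "punct"]
# Assumed head index for each token; the ROOT token points to itself.
heads = [2, 2, 3, 3, 7, 7, 7, 3, 3]

# Invert the head links into a parent -> children mapping.
children = {i: [] for i in range(len(tokens))}
for i, h in enumerate(heads):
    if i != h:
        children[h].append(i)


def subtree(i):
    """Return the sorted token indices of the subtree rooted at index i."""
    out = [i]
    for c in children[i]:
        out.extend(subtree(c))
    return sorted(out)


root = dep.index("ROOT")
subject = next(c for c in children[root] if dep[c] == "nsubj")
attribute = next(c for c in children[root] if dep[c] == "attr")

subject_phrase = " ".join(tokens[j] for j in subtree(subject))
attribute_phrase = " ".join(tokens[j] for j in subtree(attribute))
print(subject_phrase, "|", tokens[root], "|", attribute_phrase)
```

Reading the subtree under nsubj and under attr splits the sentence into "what" and "what it is", which is the basis for the noun chunk and knowledge extraction sections further down.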

Noun Chunks

Extracts noun phrases. In code, use doc.noun_chunks.

noun_chunks = [nc for nc in doc.noun_chunks]
print(noun_chunks)
>>> [A magnetic monopole, a hypothetical elementary particle]
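
For intuition, the same two chunks can be recovered from the part-of-speech tags alone by grouping runs of determiners, adjectives, and nouns that end in a noun. This is a toy heuristic under that assumption, with hypothetical names NOMINAL and toy_noun_chunks; spaCy derives noun_chunks from the dependency parse rather than from a tag pattern like this.

```python
tokens = ["A", "magnetic", "monopole", "is", "a",
          "hypothetical", "elementary", "particle", "."]
pos = ["DET", "ADJ", "NOUN", "VERB", "DET", "ADJ", "ADJ", "NOUN", "PUNCT"]

NOMINAL = {"DET", "ADJ", "NOUN"}  # tags allowed inside a toy chunk


def toy_noun_chunks(tokens, pos):
    """Collect maximal DET/ADJ/NOUN runs that end in a NOUN."""
    chunks, current = [], []
    # The sentinel pair at the end flushes a chunk that closes the sentence.
    for tok, tag in list(zip(tokens, pos)) + [(None, "END")]:
        if tag in NOMINAL:
            current.append((tok, tag))
            continue
        # Drop trailing determiners or adjectives that modify nothing.
        while current and current[-1][1] != "NOUN":
            current.pop()
        if current:
            chunks.append(" ".join(t for t, _ in current))
        current = []
    return chunks


chunks = toy_noun_chunks(tokens, pos)
print(chunks)
```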

Named Entity Recognition

Named entity recognition identifies names of people, places, and organizations, as well as dates, times, monetary amounts, events, products, and so on. In code, use doc.ents.

txt = '''European authorities fined Google a record $5.1 billion
on Wednesday for abusing its power in the mobile phone market and
ordered the company to alter its practices
'''
doc = nlp(txt)
ners = [(ent.text, ent.label_) for ent in doc.ents]
print(ners)
>>> [('European', 'NORP'), ('Google', 'ORG'), ('$5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]

A more detailed list of the named entity label abbreviations:

https://upload-images.jianshu.io/upload_images/11452592-d7776c24334f0a94.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/720/format/webp
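
The (text, label) pairs are often regrouped by label before further processing. The sketch below works on the list printed above, hard-coded so that it runs without a model.

```python
from collections import defaultdict

# Output of the named entity example above, copied as plain data.
ners = [("European", "NORP"), ("Google", "ORG"),
        ("$5.1 billion", "MONEY"), ("Wednesday", "DATE")]

# Map each label to the list of entity texts that carry it.
by_label = defaultdict(list)
for text, label in ners:
    by_label[label].append(text)

for label in sorted(by_label):
    print(label, by_label[label])
```

If an abbreviation is unfamiliar, spaCy also provides spacy.explain(label), which returns a short description of a label such as NORP.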

Coreference Resolution

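Coreference resolution finds the entity that a pronoun such as he, she, or it refers to. This module needs the pretrained neural coreference model; if you did not install it earlier, run: pip install neuralcoref. Note that neuralcoref was written for spaCy 2.x, so it may not work with a newer spaCy release.

import neuralcoref

txt = "My sister has a son and she loves him."

# add the pretrained neural coreference component to the spaCy pipeline
neuralcoref.add_to_pipe(nlp)
doc = nlp(txt)
doc._.coref_clusters
>>> [My sister: [My sister, she], a son: [a son, him]]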
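
Given clusters like the ones above, resolving the text means replacing every mention with the main mention of its cluster. The sketch below does this with plain string substitution on a hard-coded copy of the clusters; it is a naive illustration only (it would replace every occurrence of a pronoun, whichever entity it refers to). neuralcoref itself exposes the resolved text as doc._.coref_resolved.

```python
import re

txt = "My sister has a son and she loves him."

# Clusters from the output above: main mention -> all mentions.
clusters = {
    "My sister": ["My sister", "she"],
    "a son": ["a son", "him"],
}

resolved = txt
for main, mentions in clusters.items():
    for mention in mentions:
        if mention == main:
            continue
        # Whole-word replacement of the mention by the main mention.
        resolved = re.sub(r"\b%s\b" % re.escape(mention), main, resolved)

print(resolved)
```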

Display

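Visualization. This feature gets its own section because it is so cool. Here are two simple examples. The first visualizes the dependency parse:

from spacy import displacy

txt = '''In particle physics, a magnetic monopole is a
hypothetical elementary particle.'''
displacy.render(nlp(txt), style='dep', jupyter=True,
                options={'distance': 90})

The second visualizes named entity recognition:

# doc should hold a text that contains entities, e.g. the Google example from the NER section
displacy.render(doc, style='ent', jupyter=True)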

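Knowledge extraction

This part uses textacy, which is installed with pip. The semistructured_statements() function in textacy.extract can extract every fact whose subject is Magnetic Monopole and whose verb lemma is be. First, copy the text of the Wikipedia article on magnetic monopoles into magnetic_monopole.txt.

import spacy
import textacy.extract

nlp = spacy.load('en_core_web_sm')

with open("magnetic_monopole.txt", "r", encoding="utf-8") as fin:
    txt = fin.read()

doc = nlp(txt)
# This call follows the older textacy signature; newer releases may expect
# keyword arguments such as entity= and cue=, so check your installed version.
statements = textacy.extract.semistructured_statements(doc, "monopole")

for statement in statements:
    subject, verb, fact = statement
    print(f" - {fact}")

Searching for Magnetic Monopole returns only the third item below; searching for monopole gives the following: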

- a singular solution of Maxwell's equation (because it requires removing the worldline from spacetime

- a [[topological defect]] in a compact U(1) gauge theory

- a new [[elementary particle]], and would violate [[Gauss's law for magnetism
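
The idea behind this kind of extraction can be sketched with the standard library: find sentences whose subject mentions the entity and whose verb is a form of be, then keep what follows the verb. The function toy_statements and its cues parameter are hypothetical names for this illustration; textacy works on the parsed Doc rather than on raw strings, so its results will differ.

```python
import re


def toy_statements(text, entity, cues=("is", "are", "was", "were")):
    """Return (subject, cue, fact) triples for sentences of the form 'X <be> Y'."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = sentence.rstrip(".!?").split()
        for i, word in enumerate(words):
            if word.lower() in cues:
                subject = " ".join(words[:i])
                if entity.lower() in subject.lower():
                    results.append((subject, word, " ".join(words[i + 1:])))
                break  # only the first cue in a sentence is considered
    return results


sample = ("A magnetic monopole is a hypothetical elementary particle. "
          "Maxwell wrote four equations. "
          "Electric charges are common.")

for subject, cue, fact in toy_statements(sample, "monopole"):
    print(f" - {fact}")
```

On this sample only the first sentence qualifies: the second has no form of be, and the subject of the third does not mention the entity.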

The steps above can be combined into one script:

import spacy
from spacy import displacy

# the 'en' shortcut only works on older spaCy versions; the full model name is safer
nlp = spacy.load("en_core_web_sm")

filename = "test.txt"
with open(filename, encoding="utf-8") as fin:
    document = nlp(fin.read())

# visualization
displacy.render(document, style="ent", jupyter=True)
displacy.render(document, style='dep', jupyter=True,
                options={'distance': 90})

# tokenization, skipping punctuation and whitespace tokens
print([token.orth_ for token in document if not (token.is_punct or token.is_space)])

# part-of-speech tagging: .pos_ gives the coarse-grained tag, .tag_ the fine-grained one
all_tags = {w.pos: w.pos_ for w in document}
print(all_tags)

# named entity recognition
labels = set(w.label_ for w in document.ents)
print([(i, i.label_, i.label) for i in document.ents])
