# -*- coding: utf-8 -*-
"""
Created on Sun Nov 13 09:14:13 2016 @author: daxiong
"""
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize,word_tokenize #英文停止词,set()集合函数消除重复项
list_stopWords=list(set(stopwords.words('english')))
example_text="Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation. This momentous decree came as a great beacon light of hope to millions of Negro slaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the long night of bad captivity."
#分句
list_sentences=sent_tokenize(example_text)
#分词
list_words=word_tokenize(example_text)
#过滤停止词
filtered_words=[w for w in list_words if not w in list_stopWords]

Stop words with NLTK

The idea of Natural Language Processing is to do some form of
analysis, or processing, where the machine can understand, at least to
some level, what the text means, says, or implies.

This is an obviously massive challenge, but there are steps to
doing it that anyone can follow. The main idea, however, is that
computers simply do not, and will not, ever understand words directly.
Humans don't either *shocker*. In humans, memory is broken down into
electrical signals in the brain, in the form of neural groups that fire
in patterns. There is a lot about the brain that remains unknown, but,
the more we break down the human brain to the basic elements, we find
out basic the elements really are. Well, it turns out computers store
information in a very similar way! We need a way to get as close to that
as possible if we're going to mimic how humans read and understand
text. Generally, computers use numbers for everything, but we often see
directly in programming where we use binary signals (True or False,
which directly translate to 1 or 0, which originates directly from
either the presence of an electrical signal (True, 1), or not (False,
0)). To do this, we need a way to convert words to values, in numbers,
or signal patterns. The process of converting data to something a
computer can understand is referred to as "pre-processing." One of the
major forms of pre-processing is going to be filtering out useless data.
In natural language processing, useless words (data), are referred to
as stop words.

Immediately, we can recognize ourselves that some words carry more
meaning than other words. We can also see that some words are just
plain useless, and are filler words. We use them in the English
language, for example, to sort of "fluff" up the sentence so it is not
so strange sounding. An example of one of the most common, unofficial,
useless words is the phrase "umm." People stuff in "umm" frequently,
some more than others. This word means nothing, unless of course we're
searching for someone who is maybe lacking confidence, is confused, or
hasn't practiced much speaking. We all do it, you can hear me saying
"umm" or "uhh" in the videos plenty of ...uh ... times. For most
analysis, these words are useless.

We would not want these words taking up space in our database, or
taking up valuable processing time. As such, we call these words "stop
words" because they are useless, and we wish to do nothing with them.
Another version of the term "stop words" can be more literal: Words we
stop on.

For example, you may wish to completely cease analysis if you
detect words that are commonly used sarcastically, and stop immediately.
Sarcastic words, or phrases are going to vary by lexicon and corpus.
For now, we'll be considering stop words as words that just contain no
meaning, and we want to remove them.

You can do this easily, by storing a list of words that you
consider to be stop words. NLTK starts you off with a bunch of words
that they consider to be stop words, you can access it via the NLTK
corpus with:

from nltk.corpus import stopwords

Here is the list:

>>> set(stopwords.words('english'))
{'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there',
'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they',
'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into',
'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who',
'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below',
'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me',
'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our',
'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she',
'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and',
'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then',
'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not',
'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too',
'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't',
'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it',
'how', 'further', 'was', 'here', 'than'}

Here is how you might incorporate using the stop_words set to remove the stop words from your text:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize example_sent = "This is a sample sentence, showing off the stop words filtration." stop_words = set(stopwords.words('english')) word_tokens = word_tokenize(example_sent) filtered_sentence = [w for w in word_tokens if not w in stop_words] filtered_sentence = [] for w in word_tokens:
if w not in stop_words:
filtered_sentence.append(w) print(word_tokens)
print(filtered_sentence)

Our output here:
['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']

Our database thanks us. Another form of data pre-processing is 'stemming,' which is what we're going to be talking about next.

 

自然语言13_Stop words with NLTK的更多相关文章

  1. 自然语言处理(1)之NLTK与PYTHON

    自然语言处理(1)之NLTK与PYTHON 题记: 由于现在的项目是搜索引擎,所以不由的对自然语言处理产生了好奇,再加上一直以来都想学Python,只是没有机会与时间.碰巧这几天在亚马逊上找书时发现了 ...

  2. 自然语言23_Text Classification with NLTK

    QQ:231469242 欢迎喜欢nltk朋友交流 https://www.pythonprogramming.net/text-classification-nltk-tutorial/?compl ...

  3. 自然语言20_The corpora with NLTK

    QQ:231469242 欢迎喜欢nltk朋友交流 https://www.pythonprogramming.net/nltk-corpus-corpora-tutorial/?completed= ...

  4. 自然语言19.1_Lemmatizing with NLTK(单词变体还原)

    QQ:231469242 欢迎喜欢nltk朋友交流 https://www.pythonprogramming.net/lemmatizing-nltk-tutorial/?completed=/na ...

  5. 自然语言14_Stemming words with NLTK

    https://www.pythonprogramming.net/stemming-nltk-tutorial/?completed=/stop-words-nltk-tutorial/ # -*- ...

  6. 自然语言处理2.1——NLTK文本语料库

    1.获取文本语料库 NLTK库中包含了大量的语料库,下面一一介绍几个: (1)古腾堡语料库:NLTK包含古腾堡项目电子文本档案的一小部分文本.该项目目前大约有36000本免费的电子图书. >&g ...

  7. python自然语言处理函数库nltk从入门到精通

    1. 关于Python安装的补充 若在ubuntu系统中同时安装了Python2和python3,则输入python或python2命令打开python2.x版本的控制台:输入python3命令打开p ...

  8. Python自然语言处理实践: 在NLTK中使用斯坦福中文分词器

    http://www.52nlp.cn/python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E5%AE%9E%E8%B7%B5-% ...

  9. 推荐《用Python进行自然语言处理》中文翻译-NLTK配套书

    NLTK配套书<用Python进行自然语言处理>(Natural Language Processing with Python)已经出版好几年了,但是国内一直没有翻译的中文版,虽然读英文 ...

随机推荐

  1. php 实现创建文件并追加数据

    最近因为后台有其他事情忙,所以我最近又开始学习php的内容了. (不过话说回来从客户端写到后台的感觉还是很爽的,嘿嘿) 需求是这样:从前台发来一些信息,存成文本文档,以后再统一处理(比如,存入用户账户 ...

  2. linux基础-第十五单元 软件包的管理

    使用RPM安装及移除软件 什么是RPM rpm的文件名 rpm软件安装与移除工作中经常使用的选项 查看RPM软件包中的信息 查询已安装的软件包信息 RPM包的属性依赖性问题 什么是RPM包的属性依赖性 ...

  3. 转 浅谈算法和数据结构: 十 平衡查找树之B树

    前面讲解了平衡查找树中的2-3树以及其实现红黑树.2-3树种,一个节点最多有2个key,而红黑树则使用染色的方式来标识这两个key. 维基百科对B树的定义为"在计算机科学中,B树(B-tre ...

  4. EF(Entity Framework)发生错误”正在创建模型,此时不可使用上下文“的解决办法。 正在创建模型,此时不可使用上下文。如果在 OnModelCreating 方法内使用上下文或如果多个线程同时访问同一上下文实例,可能引发此异常。请注意不保证 DbContext 的实例成员和相关类是线程安全的。 临时解决了这个问题,在Context的构造函数中,禁用了自动初始化:

    解决方案: 禁止上下创建. 修改.删除,默认为true public DataDbContext() : base("name=DataDbContext") {  this.Da ...

  5. bzoj3998: [TJOI2015]弦论

    SAM小裸题qwq #include <iostream> #include <cstdio> #include <cmath> #include <cstr ...

  6. python 列表与元组的操作简介

    上一篇:Python 序列通用操作介绍 列表 列表是可变的(mutable)--可以改变列表的内容,这不同于字符串和元组,字符串和元组都是不可变的.接下来讨论一下列表所提供的方法. list函数 可以 ...

  7. VisualSVNServerTools(在线修改VisualSVN密码)

    采用的是apache htpasswd的命令行参数进行修改,部署时,采用独立的apache server进行. 源码:https://github.com/easonjim/VisualSVNServ ...

  8. hdu5175 gcd 求约数

    题意:求满足条件GCD(N,M) = N XOR M的M的个数 sol:和uva那题挺像的.若gcd(a,b)=a xor b=c,则b=a-c 暴力枚举N的所有约数K,令M=NxorK,再判断gcd ...

  9. Discuz X1.5 X2.5 X3 UC_KEY Getshell Write PHPCODE into config/config_ucenter.php Via /api/uc.php Vul

    目录 . 漏洞描述 . 漏洞触发条件 . 漏洞影响范围 . 漏洞代码分析 . 防御方法 . 攻防思考 1. 漏洞描述 在Discuz中,uc_key是UC客户端与服务端通信的通信密钥.因此使用uc_k ...

  10. [SVN Mac的SVN使用]

    在Windows环境中,我们一般使用TortoiseSVN来搭建svn环境.在Mac环境下,由于Mac自带了svn的服务器端和客户端功能,所以我们可以在不装任何第三方软件的前提下使用svn功能,不过还 ...