Natural Language 14_Stemming words with NLTK
# -*- coding: utf-8 -*-
"""
Created on Sun Nov 13 09:14:13 2016
@author: daxiong
"""
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize,word_tokenize
# Create a Porter stemmer instance
ps=PorterStemmer()
'''
ps.stem('emancipation')
Out[17]: 'emancip'
ps.stem('love')
Out[18]: 'love'
ps.stem('loved')
Out[19]: 'love'
ps.stem('loving')
Out[20]: 'love'
'''
example_words=['python','pythoner','pythoning','pythoned','pythonly']
example_text="Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation. This momentous decree came as a great beacon light of hope to millions of Negro slaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the long night of bad captivity."
# Split the text into sentences
list_sentences=sent_tokenize(example_text)
# Split the text into words
list_words=word_tokenize(example_text)
# Stem each word
list_stemWords=[ps.stem(w) for w in example_words]
''' ['python', 'python', 'python', 'python', 'pythonli']'''
list_stemWords1=[ps.stem(w) for w in list_words]

Stemming words with NLTK
Stemming is a form of normalization. Many variations of a word
carry the same meaning and differ only in tense or some other
inflection.
The reason why we stem is to shorten the lookup, and normalize sentences.
Consider:
I was taking a ride in the car.
I was riding in the car.
These two sentences mean the same thing. "in the car" is the same. "I was"
is the same. Both describe the same past activity, so is it truly
necessary to differentiate between ride and riding when we are just
trying to figure out what that activity was?
No, not really.
This is just one minor example, but imagine every word in the English
language, every possible tense and affix you can put on a word. Having
individual dictionary entries per version would be highly redundant and
inefficient, especially since, once we convert to numbers, the "value"
is going to be identical.
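To make the "convert to numbers" point concrete, here is a minimal sketch that assigns one integer ID per distinct stem. The token list and the names `stem_to_id` and `token_ids` are made up for illustration; only `PorterStemmer` and its `stem` method come from NLTK.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Hypothetical token stream; "ride" and "riding" are the forms discussed above.
tokens = ["ride", "riding", "rides", "car"]

# Assign one integer ID per distinct stem, in order of first appearance.
stem_to_id = {}
for token in tokens:
    stem = ps.stem(token)
    if stem not in stem_to_id:
        stem_to_id[stem] = len(stem_to_id)

# Look up the ID that each original token would be encoded as.
token_ids = {token: stem_to_id[ps.stem(token)] for token in tokens}
print(token_ids)
```

The different forms of "ride" end up sharing a single ID, so the vocabulary needs one entry for them instead of three.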
One of the most popular stemming algorithms is the Porter stemmer, which has been around since 1979.
First, we're going to grab and define our stemmer:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()
Now, let's choose some words with a similar stem, like:
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
Next, we can easily stem by doing something like:
for w in example_words:
    print(ps.stem(w))
Our output:
python
python
python
python
pythonli
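Note that a stem does not have to be a real word: "pythonly" becomes "pythonli" rather than joining the other forms. One way to see which words collapse together is to group them by stem. This is a small sketch; the `groups` dictionary is an illustrative helper, not part of NLTK.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]

# Collect every surface form under the stem it reduces to.
groups = defaultdict(list)
for w in example_words:
    groups[ps.stem(w)].append(w)

for stem, forms in groups.items():
    print(stem, "<-", forms)
```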
Now let's try stemming a typical sentence, rather than some words:
new_text = "It is important to be very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(new_text)
for w in words:
    print(ps.stem(w))
Now our result is:
It
is
import
to
be
veri
pythonli
while
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc
.
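The output above still contains the punctuation tokens, and depending on the NLTK version the capitalized words may come back lowercased (recent releases lowercase by default). If only the word stems are wanted, one option is to tokenize with a pattern that keeps word characters only. The sketch below assumes NLTK's `RegexpTokenizer` with the pattern `\w+` as a simplified stand-in for `word_tokenize`, and lowercases the stems explicitly so the result does not depend on the version.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

ps = PorterStemmer()

# Keep runs of word characters only, which drops "." and other punctuation.
tokenizer = RegexpTokenizer(r"\w+")

sentence = "All pythoners have pythoned poorly at least once."
tokens = tokenizer.tokenize(sentence)

# Lowercase explicitly so the output is the same across NLTK versions.
stems = [ps.stem(t).lower() for t in tokens]
print(stems)
```

This keeps the stems for the second sentence of the example while leaving out the trailing period.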
Next up, we're going to discuss something a bit more advanced from the NLTK module, Part of Speech tagging, where we can use the NLTK module to identify the parts of speech for each word in a sentence.

