# -*- coding: utf-8 -*-
"""
Created on Sun Nov 13 09:14:13 2016 @author: daxiong
"""
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize,word_tokenize
#生成波特词干算法实例
ps=PorterStemmer()
'''
ps.stem('emancipation')
Out[17]: 'emancip' ps.stem('love')
Out[18]: 'love' ps.stem('loved')
Out[19]: 'love' ps.stem('loving')
Out[20]: 'love' ''' example_words=['python','pythoner','pythoning','pythoned','pythonly']
example_text="Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation. This momentous decree came as a great beacon light of hope to millions of Negro slaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the long night of bad captivity."
#分句
list_sentences=sent_tokenize(example_text)
#分词
list_words=word_tokenize(example_text)
#词干提取
list_stemWords=[ps.stem(w) for w in example_words]
''' ['python', 'python', 'python', 'python', 'pythonli']''' list_stemWords1=[ps.stem(w) for w in list_words]

Stemming words with NLTK

The idea of stemming is a sort of normalizing method. Many
variations of words carry the same meaning, other than when tense is
involved.

The reason why we stem is to shorten the lookup, and normalize sentences.

Consider:

I was taking a ride in the car.
I was riding in the car.

This sentence means the same thing. in the car is the same. I was is
the same. the ing denotes a clear past-tense in both cases, so is it
truly necessary to differentiate between ride and riding, in the case of
just trying to figure out the meaning of what this past-tense activity
was?

No, not really.

This is just one minor example, but imagine every word in the English
language, every possible tense and affix you can put on a word. Having
individual dictionary entries per version would be highly redundant and
inefficient, especially since, once we convert to numbers, the "value"
is going to be identical.

One of the most popular stemming algorithms is the Porter stemmer, which has been around since 1979.

First, we're going to grab and define our stemmer:

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize ps = PorterStemmer()

Now, let's choose some words with a similar stem, like:

example_words = ["python","pythoner","pythoning","pythoned","pythonly"]

Next, we can easily stem by doing something like:

for w in example_words:
print(ps.stem(w))

Our output:

python
python
python
python
pythonli

Now let's try stemming a typical sentence, rather than some words:

new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(new_text)

for w in words:
print(ps.stem(w))

Now our result is:

It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc
.

Next up, we're going to discuss something a bit more advanced from the NLTK module, Part of Speech tagging, where we can use the NLTK module to identify the parts of speech for each word in a sentence.

 

自然语言14_Stemming words with NLTK的更多相关文章

  1. 自然语言处理(1)之NLTK与PYTHON

    自然语言处理(1)之NLTK与PYTHON 题记: 由于现在的项目是搜索引擎,所以不由的对自然语言处理产生了好奇,再加上一直以来都想学Python,只是没有机会与时间.碰巧这几天在亚马逊上找书时发现了 ...

  2. 自然语言23_Text Classification with NLTK

    QQ:231469242 欢迎喜欢nltk朋友交流 https://www.pythonprogramming.net/text-classification-nltk-tutorial/?compl ...

  3. 自然语言20_The corpora with NLTK

    QQ:231469242 欢迎喜欢nltk朋友交流 https://www.pythonprogramming.net/nltk-corpus-corpora-tutorial/?completed= ...

  4. 自然语言19.1_Lemmatizing with NLTK(单词变体还原)

    QQ:231469242 欢迎喜欢nltk朋友交流 https://www.pythonprogramming.net/lemmatizing-nltk-tutorial/?completed=/na ...

  5. 自然语言13_Stop words with NLTK

    https://www.pythonprogramming.net/stop-words-nltk-tutorial/?completed=/tokenizing-words-sentences-nl ...

  6. 自然语言处理2.1——NLTK文本语料库

    1.获取文本语料库 NLTK库中包含了大量的语料库,下面一一介绍几个: (1)古腾堡语料库:NLTK包含古腾堡项目电子文本档案的一小部分文本.该项目目前大约有36000本免费的电子图书. >&g ...

  7. python自然语言处理函数库nltk从入门到精通

    1. 关于Python安装的补充 若在ubuntu系统中同时安装了Python2和python3,则输入python或python2命令打开python2.x版本的控制台:输入python3命令打开p ...

  8. Python自然语言处理实践: 在NLTK中使用斯坦福中文分词器

    http://www.52nlp.cn/python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E5%AE%9E%E8%B7%B5-% ...

  9. 推荐《用Python进行自然语言处理》中文翻译-NLTK配套书

    NLTK配套书<用Python进行自然语言处理>(Natural Language Processing with Python)已经出版好几年了,但是国内一直没有翻译的中文版,虽然读英文 ...

随机推荐

  1. Mvc多级Views目录 asp.net mvc4 路由重写及 修改view 的寻找视图的规则

    一般我们在mvc开发过程中,都会碰到这样的问题.页面总是写在Views文件夹下,而且还只能一个Controller的页面只能写在相应的以Controller名命名的文件夹下.如果我们写到别处呢?那么肯 ...

  2. [转]Hibernate查询对象所有字段,单个字段 ,几个字段取值的问题

    原文地址:http://www.ablanxue.com/prone_3552_1.html 1. 查询整个映射对象所有字段 Java代码 //直接from查询出来的是一个映射对象,即:查询整个映射对 ...

  3. [转]Java_List元素的遍历和删除

    原文地址:http://blog.csdn.net/insistgogo/article/details/19619645 1.创建一个ArrayList List<Integer> li ...

  4. [转]Hibernate时间总结

    原文地址:http://blog.csdn.net/woshisap/article/details/6543027 1:Hibernate操作时间需要注意的问题 hibernate很大的一个特点就是 ...

  5. URI 中特殊字符处理

    一.问题阐述 今天写 url 请求时,不管是get 请求还是 post 请求,如果参数中带有 + % # 等特殊符号,就无法正常获得参数 具体现象就是 用URL传参数的时候,用&符号连接,如果 ...

  6. Java算法-插入排序

    插入排序的基本思想是在遍历数组的过程中,假设在序号 i 之前的元素即 [0..i-1] 都已经排好序,本趟需要找到 i 对应的元素 x 的正确位置 k ,并且在寻找这个位置 k 的过程中逐个将比较过的 ...

  7. linux如何查看系统占用磁盘空间最大的文件及让文件按大小排序

    [root@localhost web_bak]  find / -type f -size +10G在Linux下如何让文件让按大小单位为M,G等易读格式,S size大小排序. [root@loc ...

  8. 【BZOJ-2503】相框 并查集 + 分类讨论

    2503: 相框 Time Limit: 3 Sec  Memory Limit: 128 MBSubmit: 71  Solved: 31[Submit][Status][Discuss] Desc ...

  9. 【BZOJ-1218】激光炸弹 前缀和 + 枚举

    1218: [HNOI2003]激光炸弹 Time Limit: 10 Sec  Memory Limit: 162 MBSubmit: 1778  Solved: 833[Submit][Statu ...

  10. 【uoj261】 NOIP2016—天天爱跑步

    http://uoj.ac/problem/261 (题目链接) 题意 给出一棵树,给出一些起点和终点,没走一条路径耗费时间1,每个节点上有一个权值w,问有多少条路径经过这个节点时所用的时间恰好是w. ...