Installing and using NLTK in Python (English tokenization, stemming, etc.) (Green VPN)

Once pip is installed, install NLTK and then download its data from the Python interpreter:

sudo pip install -U pyyaml nltk

import nltk
nltk.download()

Wait for the download to finish.

The NLTK data server was not reachable from my network at the time, so I used Green VPN:

http://www.evergreenvpn.com/ubuntu-pptp-vpn-setting/

Using NLTK

http://www.cnblogs.com/yuxc/archive/2011/08/29/2157415.html

http://blog.csdn.net/huyoo/article/details/12188573

http://www.52nlp.cn/tag/nltk

1. Tokenizing English on whitespace with .split (built into Python)

>>> slower
'we all like the book'
>>> ssplit = slower.split()
>>> ssplit
['we', 'all', 'like', 'the', 'book']
>>>

Or, with NLTK's tokenizer:

>>> import nltk
>>> s = u"我们都Like the book"
>>> m = [word for word in nltk.tokenize.word_tokenize(s)]
>>> for word in m:
...     print word
...
我们都Like
the
book

Or, calling it through the top-level shortcut:

>>> tokens = nltk.word_tokenize(s)
>>> tokens
[u'\u6211\u4eec\u90fdLike', u'the', u'book']
>>> for word in tokens
  File "<stdin>", line 1
    for word in tokens
                     ^
SyntaxError: invalid syntax
>>> for word in tokens:
...     print word
...
我们都Like
the
book

2. Part-of-speech tagging

>>> tagged = nltk.pos_tag(tokens)
>>> for word in tagged:
...     print word
...
(u'\u6211\u4eec\u90fdLike', 'IN')
(u'the', 'DT')
(u'book', 'NN')
>>>

3. Named-entity chunking (shallow parsing)

>>> entities= nltk.chunk.ne_chunk(tagged)
>>> entities
Tree('S', [(u'\u6211\u4eec\u90fdLike', 'IN'), (u'the', 'DT'), (u'book', 'NN')])
>>>

---------------------------------------------------------------------------------------------------------------------------------------------------------

4. Converting to lowercase (built into Python)

>>> s
'We all like the book'
>>> slower = s.lower()
>>> slower
'we all like the book'
>>>

5. Tokenizing English on whitespace with .split (built into Python)

>>> slower
'we all like the book'
>>> ssplit = slower.split()
>>> ssplit
['we', 'all', 'like', 'the', 'book']
>>>

6. Separating punctuation from words

>>> s
'we all like the book,it\xe2\x80\x98s so interesting.'
>>> s = 'we all like the book, it is so interesting.'
>>> wordtoken = nltk.tokenize.word_tokenize(s)
>>> wordtoken
['we', 'all', 'like', 'the', 'book', ',', 'it', 'is', 'so', 'interesting', '.']
>>> wordtoken = nltk.word_tokenize(s)
>>> wordtoken
['we', 'all', 'like', 'the', 'book', ',', 'it', 'is', 'so', 'interesting', '.']
>>> wordsplit = s.split()
>>> wordsplit
['we', 'all', 'like', 'the', 'book,', 'it', 'is', 'so', 'interesting.']
>>>
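
The difference between the two approaches above can be reproduced without NLTK. The following is a minimal Python 3 sketch, assuming a simple regular expression is good enough; the helper name split_punctuation is hypothetical, and the pattern is only a rough stand-in for nltk.word_tokenize, which treats contractions and quotes differently.

```python
import re

def split_punctuation(text):
    """Return word tokens and punctuation marks as separate items."""
    # \w+ matches a run of word characters; [^\w\s] matches a single
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = split_punctuation('we all like the book, it is so interesting.')
print(tokens)
# ['we', 'all', 'like', 'the', 'book', ',', 'it', 'is', 'so', 'interesting', '.']
```

Unlike s.split(), this keeps 'book' and ',' as separate tokens, which is what the later stop-word and punctuation filters rely on.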

7. Removing stop words (the NLTK version used here ships 127 English stop words; the count varies between releases)

>>> wordEngStop = nltk.corpus.stopwords.words('english')
>>> wordEngStop
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now']
>>> len(wordEngStop)
127
>>>

>>> s
'we all like the book, it is so interesting.'
>>> wordtoken
['we', 'all', 'like', 'the', 'book', ',', 'it', 'is', 'so', 'interesting', '.']
>>> for word in wordtoken:
...     if word not in wordEngStop:
...         print word
...
like
book
,
interesting
.
>>>
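
The same filter can be written as a reusable function. This is a Python 3 sketch under stated assumptions: SAMPLE_STOPWORDS is a hypothetical, hand-picked subset used only so the example runs without the NLTK corpus, and remove_stopwords is an illustrative name, not an NLTK function. Using a set instead of a list makes each membership check constant time.

```python
# Hypothetical subset for illustration only; in the post the full list comes
# from nltk.corpus.stopwords.words('english').
SAMPLE_STOPWORDS = frozenset(['we', 'all', 'the', 'it', 'is', 'so'])

def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Keep only the tokens that are not in the stop-word collection."""
    return [tok for tok in tokens if tok not in stopwords]

wordtoken = ['we', 'all', 'like', 'the', 'book', ',', 'it', 'is', 'so', 'interesting', '.']
filtered = remove_stopwords(wordtoken)
print(filtered)
# ['like', 'book', ',', 'interesting', '.']
```

With NLTK installed, passing set(nltk.corpus.stopwords.words('english')) as the stopwords argument should give the same result for this sentence.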

8. Removing punctuation

>>> english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '!', '@', '#', '%', '$', '*']
>>> wordtoken
['we', 'all', 'like', 'the', 'book', ',', 'it', 'is', 'so', 'interesting', '.']
>>> for word in wordtoken:
...     if word not in english_punctuations:
...         print word
...
we
all
like
the
book
it
is
so
interesting
>>>
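
A hand-written punctuation list is easy to leave incomplete. As an alternative, here is a Python 3 sketch that relies on the standard library's string.punctuation instead; the function name remove_punctuation is hypothetical, and the approach covers ASCII punctuation only.

```python
import string

ASCII_PUNCTUATION = frozenset(string.punctuation)

def remove_punctuation(tokens, punctuation=ASCII_PUNCTUATION):
    """Drop tokens made up entirely of punctuation characters."""
    # all() over an empty string is True, so empty tokens are dropped as well.
    return [tok for tok in tokens
            if not all(ch in punctuation for ch in tok)]

wordtoken = ['we', 'all', 'like', 'the', 'book', ',', 'it', 'is', 'so', 'interesting', '.']
words_only = remove_punctuation(wordtoken)
print(words_only)
# ['we', 'all', 'like', 'the', 'book', 'it', 'is', 'so', 'interesting']
```

Checking every character of a token means multi-character tokens such as '...' are removed too, while a word that merely contains a punctuation mark is kept.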

9. Stemming

"We stem these English words. NLTK provides several tool interfaces to choose from; see http://nltk.org/api/nltk.stem.html for details. The options include well-known English stemmers such as the Lancaster Stemmer and the Porter Stemmer. Here we use LancasterStemmer:" Source: 52nlp (我爱自然语言处理) http://www.52nlp.cn/%E5%A6%82%E4%BD%95%E8%AE%A1%E7%AE%97%E4%B8%A4%E4%B8%AA%E6%96%87%E6%A1%A3%E7%9A%84%E7%9B%B8%E4%BC%BC%E5%BA%A6%E4%B8%89

http://lutaf.com/212.htm (mainstream approaches to stemming)

http://blog.sina.com.cn/s/blog_6d65717d0100z4hu.html

>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> wordtoken
['we', 'all', 'like', 'the', 'book', ',', 'it', 'is', 'so', 'interesting', '.']
>>> st.stem(wordtoken)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/stem/lancaster.py", line 195, in stem
AttributeError: 'list' object has no attribute 'lower'
>>> for word in wordtoken:
...     print st.stem(word)
...
we
al
lik
the
book
,
it
is
so
interest
.
>>>

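Each of the two stemmers has its strengths and weaknesses. In the output above, Lancaster is the more aggressive one (all becomes al, like becomes lik). Note also that stem() takes a single word, not a list. The same words with the Porter stemmer: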

>>> from nltk.stem import PorterStemmer
>>> wordtoken
['we', 'all', 'like', 'the', 'book', ',', 'it', 'is', 'so', 'interesting', '.']
>>> PorterStemmer().stem(wordtoken)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/stem/porter.py", line 632, in stem
AttributeError: 'list' object has no attribute 'lower'
>>> PorterStemmer().stem('all')
u'all'
>>> for word in wordtoken:
...     print PorterStemmer().stem(word)
...
we
all
like
the
book
,
it
is
so
interest
.
>>> PorterStemmer().stem("better")
u'better'
>>> PorterStemmer().stem("supplies")
u'suppli'
>>> st.stem('supplies')
u'supply'
>>>

Putting the steps together in a Python 2 script:

# -*- coding: utf-8 -*-
import nltk

wordEngStop = nltk.corpus.stopwords.words('english')
english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '!', '@', '#', '%', '$', '*', '=', 'abstract=', '{', '}']
porterStem = nltk.stem.PorterStemmer()
lancasterStem = nltk.stem.lancaster.LancasterStemmer()

fin = open('/home/xdj/myOutput.txt', 'r')
fout = open('/home/xdj/myOutputLancasterStemmer.txt', 'w')
for eachLine in fin:
    eachLine = eachLine.lower().decode('utf-8', 'ignore')  # lowercase, then decode bytes to unicode
    tokens = nltk.word_tokenize(eachLine)                  # tokenize (splits punctuation from words)
    wordLine = ''
    for word in tokens:
        if word not in english_punctuations:               # drop punctuation
            if word not in wordEngStop:                    # drop stop words
                # word = porterStem.stem(word)             # alternative: Porter stemmer
                word = lancasterStem.stem(word)
                wordLine += word + ' '
    fout.write(wordLine.encode('utf-8') + '\n')            # one cleaned line per input line
fin.close()
fout.close()
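
For readers on Python 3, where strings are already Unicode and the explicit decode/encode calls are not needed, the per-line logic can be sketched as a small function. Everything here is an assumption for illustration: clean_line is a hypothetical name, the regular expression stands in for nltk.word_tokenize, SAMPLE_STOPWORDS is a hand-picked subset, and the default stem callable leaves words unchanged so the example runs without NLTK.

```python
import re
import string

# Hypothetical stand-ins so the sketch runs without NLTK or its data.
SAMPLE_STOPWORDS = frozenset(['we', 'all', 'the', 'it', 'is', 'so'])
PUNCTUATION = frozenset(string.punctuation)

def clean_line(line, stopwords=SAMPLE_STOPWORDS, stem=lambda word: word):
    """Lowercase, tokenize, drop punctuation and stop words, then stem."""
    # The pattern emits punctuation one character at a time, so a plain
    # membership test against PUNCTUATION is enough to filter it out.
    tokens = re.findall(r"\w+|[^\w\s]", line.lower())
    kept = [tok for tok in tokens
            if tok not in PUNCTUATION and tok not in stopwords]
    return ' '.join(stem(tok) for tok in kept)

result = clean_line('We all like the book, it is so interesting.')
print(result)
# like book interesting
```

With NLTK installed, passing stem=LancasterStemmer().stem and stopwords=set(nltk.corpus.stopwords.words('english')) would bring the sketch closer to the script above. Taking the stemmer as a parameter also makes it easy to switch between the Porter and Lancaster stemmers.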
