Natural Language 27: Converting Words to Features with NLTK
https://www.pythonprogramming.net/words-as-features-nltk-tutorial/
Converting words to Features with NLTK
In this tutorial, we're going to be building off the previous
video and compiling feature lists of words from positive reviews and
words from the negative reviews to hopefully see trends in specific
types of words in positive or negative reviews.
To start, our code:
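import nltk
import random
from nltk.corpus import movie_reviews

# Pair each review's word list with its category ('pos' or 'neg')
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

# Lowercase every word in the corpus, then count occurrences
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

# Keep the 3,000 most frequent words; slicing keys() would not sort by frequency
word_features = [w for (w, count) in all_words.most_common(3000)]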
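The word_features selection can be tried out on a toy word list. This is only a minimal sketch: it assumes nltk.FreqDist behaves like collections.Counter (in NLTK 3 it is a subclass of Counter), so Counter stands in here to avoid downloading the corpus. The names toy_words and top_n are made up for illustration. The sketch contrasts most_common(), which sorts by count, with slicing keys(), which does not.

```python
from collections import Counter

# Hypothetical toy data standing in for movie_reviews.words()
toy_words = ["The", "movie", "was", "great", "the", "plot", "great", "THE", "end"]

# Lowercase and count, as the tutorial does with nltk.FreqDist
freq = Counter(w.lower() for w in toy_words)

top_n = 2  # the tutorial uses 3000

# most_common() orders entries by count, highest first
most_frequent = [w for w, count in freq.most_common(top_n)]

# keys() follows insertion order, not frequency
first_keys = list(freq.keys())[:top_n]

print(most_frequent)
print(first_keys)
```

On this toy input the two lists differ, which is why the choice between them matters when the goal is the most common words.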
The setup is mostly the same as before, with one new variable, word_features, which holds the 3,000 most common words. Next, we're going to build a quick function that looks for each of these 3,000 words in a document and marks it as True if present and False if absent:
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features
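To make the shape of the result concrete, here is a small sketch of the same idea on made-up data. The three-word vocabulary and toy_review are hypothetical stand-ins for the real 3,000-word list and a corpus document, and the dictionary comprehension is just a compact way of writing the loop above.

```python
# Hypothetical three-word vocabulary standing in for the 3,000-word word_features
word_features = ["great", "boring", "plot"]

def find_features(document):
    # A set removes duplicates and makes each membership check fast
    words = set(document)
    # Map every vocabulary word to True or False depending on its presence
    return {w: (w in words) for w in word_features}

toy_review = ["the", "plot", "was", "great", "great"]
features = find_features(toy_review)
print(features)
```

Every vocabulary word gets an entry, including the ones that do not occur in the document, so all feature dictionaries share the same keys.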
Next, we can print one feature set like:
print(find_features(movie_reviews.words('neg/cv000_29416.txt')))
Then we can do this for all of our documents, saving the feature existence booleans and their respective positive or negative categories by doing:
featuresets = [(find_features(rev), category) for (rev, category) in documents]
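The list comprehension above can be sketched on toy data to show what each entry looks like. The vocabulary and the two documents below are invented for illustration; only the structure (a feature dictionary paired with a category label) mirrors the tutorial.

```python
# Hypothetical tiny vocabulary standing in for the real word_features
word_features = ["great", "boring", "plot"]

def find_features(document):
    words = set(document)
    return {w: (w in words) for w in word_features}

# Hypothetical (word list, category) pairs shaped like the tutorial's documents
documents = [
    (["a", "great", "plot"], "pos"),
    (["boring", "and", "slow"], "neg"),
]

# One (feature dictionary, category) tuple per document
featuresets = [(find_features(rev), category) for (rev, category) in documents]

for features, category in featuresets:
    print(category, features)
```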
Awesome, now that we have our features and labels, what is next? Typically the next step is to go ahead and train an algorithm, then test it. So, let's go ahead and do that, starting with the Naive Bayes classifier in the next tutorial!
# -*- coding: utf-8 -*-
"""
Created on Sun Dec 4 09:27:48 2016

@author: daxiong
"""
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

# dict_allWords is a dictionary-like FreqDist that stores the frequency of every word
dict_allWords = nltk.FreqDist(all_words)
# keys() lists all words, and [:3000] keeps the first three thousand of them.
# Note: keys() is not sorted by frequency; dict_allWords.most_common(3000)
# would return the 3,000 most frequent words instead.
word_features = list(dict_allWords.keys())[:3000]
'''
 'combating',
 'mouthing',
 'markings',
 'directon',
 'ppk',
 'vanishing',
 'victories',
 'huddleston',
 ...]
'''

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

words = movie_reviews.words('neg/cv000_29416.txt')
'''
Out[78]: ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
type(words)
Out[65]: nltk.corpus.reader.util.StreamBackedCorpusView
'''

# Remove duplicates; words1 is a set
words1 = set(words)
'''
words1
{'!',
 '"',
 '&',
 "'",
 '(',
 ')',
 .......
 'witch',
 'with',
 'world',
 'would',
 'wrapped',
 'write',
 'years',
 'you',
 'your'}
'''
features = {}

# The word 'victories' is not in words1, so this evaluates to False
('victories' in words1)
'''
Out[73]: False
'''

features['victories'] = ('victories' in words1)
'''
features
Out[75]: {'victories': False}
'''

print(find_features(movie_reviews.words('neg/cv000_29416.txt')))
'''
 'schwarz': False,
 'supervisors': False,
 'geyser': False,
 'site': False,
 'fevered': False,
 'acknowledged': False,
 'ronald': False,
 'wroth': False,
 'degredation': False,
 ...}
'''

featuresets = [(find_features(rev), category) for (rev, category) in documents]

featuresets holds 2,000 entries, one per review file. Each entry is a tuple containing a feature dictionary (for example, "glory": False) and the neg/pos category.