Natural Language 27_Converting words to Features with NLTK

https://www.pythonprogramming.net/words-as-features-nltk-tutorial/
Converting words to Features with NLTK
In this tutorial, we're going to be building off the previous
video and compiling feature lists of words from positive reviews and
words from the negative reviews to hopefully see trends in specific
types of words in positive or negative reviews.
To start, our code:
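import nltk
import random
from nltk.corpus import movie_reviews

# Pair each review's word list with its category ('pos' or 'neg').
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

# Collect every lowercased word in the corpus.
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

# most_common() sorts by count; slicing keys() would not in NLTK 3.
word_features = [w for (w, count) in all_words.most_common(3000)]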
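Whether a slice really yields the most common words depends on how the frequency distribution orders its entries. In NLTK 3, FreqDist subclasses collections.Counter, so keys() is not sorted by count, while most_common(n) is. The sketch below illustrates the difference with a plain Counter as a stand-in and a made-up word list (toy_words is an assumption for illustration, not part of the tutorial):

```python
from collections import Counter

# Made-up toy corpus; the tutorial uses movie_reviews.words() instead.
toy_words = ["a", "rare", "the", "film", "the", "film", "the"]

# Stand-in for nltk.FreqDist, which subclasses Counter in NLTK 3.
freq = Counter(w.lower() for w in toy_words)

# keys() follows first-seen order (Python 3.7+), not frequency.
first_two_keys = list(freq.keys())[:2]

# most_common(n) returns (word, count) pairs sorted by descending count.
top_two = [word for word, _count in freq.most_common(2)]

print(first_two_keys)  # ['a', 'rare']
print(top_two)         # ['the', 'film']
```

On the real corpus the same pattern would be applied to the FreqDist built from movie_reviews.words().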
The code is mostly the same as before, with one new variable, word_features, which holds the 3,000 most common words. Next, we're going to build a quick function that looks for each of these 3,000 words in a document and marks its presence as True or False:
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features
Next, we can print one feature set like:
print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))
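To see the shape of that output without loading the corpus, here is a minimal sketch of the same function on made-up data. The four-word vocabulary and the toy review are assumptions for illustration only; in the tutorial, word_features holds 3,000 words.

```python
# Hypothetical miniature vocabulary standing in for the 3,000-word list.
word_features = ["great", "boring", "plot", "funny"]

def find_features(document):
    """Map every vocabulary word to True/False for one document."""
    words = set(document)  # a set makes each membership test cheap
    return {w: (w in words) for w in word_features}

# Made-up tokenized review, not taken from the corpus.
toy_review = ["the", "plot", "was", "boring", "and", "the", "plot", "dragged"]
toy_features = find_features(toy_review)

print(toy_features)
# {'great': False, 'boring': True, 'plot': True, 'funny': False}
```

Every document gets a dictionary with the same keys, so the feature sets are directly comparable across reviews.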
Then we can do this for all of our documents, saving the feature existence booleans and their respective positive or negative categories by doing:
featuresets = [(find_features(rev), category) for (rev, category) in documents]
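As a rough illustration of what featuresets ends up holding, the sketch below runs the same list comprehension over three made-up (tokens, category) pairs. The vocabulary and the toy documents are assumptions, chosen only to mirror the shape of the real data.

```python
from collections import Counter

word_features = ["great", "boring", "plot"]  # hypothetical vocabulary

def find_features(document):
    words = set(document)
    return {w: (w in words) for w in word_features}

# Made-up (tokens, category) pairs in the same shape as `documents`.
documents = [
    (["great", "plot"], "pos"),
    (["boring", "plot"], "neg"),
    (["great", "fun"], "pos"),
]

featuresets = [(find_features(rev), category) for (rev, category) in documents]

# Each entry is a (feature dictionary, label) tuple.
first_features, first_label = featuresets[0]
label_counts = Counter(label for _features, label in featuresets)

print(first_features, first_label)
print(label_counts)
```

Counting the labels this way is a quick sanity check that no category was lost while building the list.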
Awesome, now that we have our features and labels, what is next? Typically the next step is to go ahead and train an algorithm, then test it. So, let's go ahead and do that, starting with the Naive Bayes classifier in the next tutorial!
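Before any training, the feature sets are usually divided into a part to train on and a part held back for testing. A minimal sketch of that split follows, on made-up feature sets; the 80/20 ratio, the fixed seed, and the names training_set and testing_set are assumptions here, not something this tutorial specifies.

```python
import random

# Made-up feature sets in the (features_dict, label) shape built above.
featuresets = [({"great": i % 2 == 0}, "pos" if i % 2 == 0 else "neg")
               for i in range(10)]

random.seed(0)  # fixed seed only so the sketch is repeatable
random.shuffle(featuresets)

# Hold out the last 20% for testing; the ratio is an arbitrary choice here.
split_at = int(len(featuresets) * 0.8)
training_set = featuresets[:split_at]
testing_set = featuresets[split_at:]

print(len(training_set), len(testing_set))  # 8 2
```

Shuffling before slicing matters because the documents are built category by category, so an unshuffled slice could contain only one label.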
# -*- coding: utf-8 -*-
"""
Created on Sun Dec 4 09:27:48 2016

@author: daxiong
"""
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

# dict_allWords is a dictionary-like object storing the frequency distribution of all words
dict_allWords = nltk.FreqDist(all_words)
# keys() lists all the words; [:3000] takes the first three thousand of them
# (in NLTK 3 keys() is not sorted by frequency, as the sample below shows)
word_features = list(dict_allWords.keys())[:3000]
'''
'combating',
'mouthing',
'markings',
'directon',
'ppk',
'vanishing',
'victories',
'huddleston',
...]
'''

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

words = movie_reviews.words('neg/cv000_29416.txt')
'''
Out[78]: ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
type(words)
Out[65]: nltk.corpus.reader.util.StreamBackedCorpusView
'''

# Remove duplicates; words1 is a set
words1 = set(words)
'''
words1
{'!',
'"',
'&',
"'",
'(',
')',
...
'witch',
'with',
'world',
'would',
'wrapped',
'write',
'years',
'you',
'your'}
'''
features = {}

# The word 'victories' is not in words1, so this evaluates to False
('victories' in words1)
'''
Out[73]: False
'''

features['victories'] = ('victories' in words1)
'''
features
Out[75]: {'victories': False}
'''

print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))
'''
'schwarz': False,
'supervisors': False,
'geyser': False,
'site': False,
'fevered': False,
'acknowledged': False,
'ronald': False,
'wroth': False,
'degredation': False,
...}
'''

featuresets = [(find_features(rev), category) for (rev, category) in documents]
# featuresets has 2000 entries, one per file; each entry is a tuple holding a
# feature dictionary (e.g. 'glory': False) and its neg/pos category


