自然语言27_Converting words to Features with NLTK
sklearn实战-乳腺癌细胞数据挖掘(博客主亲自录制视频教程)
https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share

https://www.pythonprogramming.net/words-as-features-nltk-tutorial/
Converting words to Features with NLTK
In this tutorial, we're going to be building off the previous
video and compiling feature lists of words from positive reviews and
words from the negative reviews to hopefully see trends in specific
types of words in positive or negative reviews.
To start, our code:
import nltk
import random
from nltk.corpus import movie_reviews documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)] random.shuffle(documents) all_words = [] for w in movie_reviews.words():
all_words.append(w.lower()) all_words = nltk.FreqDist(all_words) word_features = list(all_words.keys())[:3000]
Mostly the same as before, only with now a new variable, word_features, which contains the top 3,000 most common words. Next, we're going to build a quick function that will find these top 3,000 words in our positive and negative documents, marking their presence as either positive or negative:
def find_features(document):
words = set(document)
features = {}
for w in word_features:
features[w] = (w in words) return features
Next, we can print one feature set like:
print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))
Then we can do this for all of our documents, saving the feature existence booleans and their respective positive or negative categories by doing:
featuresets = [(find_features(rev), category) for (rev, category) in documents]
Awesome, now that we have our features and labels, what is next? Typically the next step is to go ahead and train an algorithm, then test it. So, let's go ahead and do that, starting with the Naive Bayes classifier in the next tutorial!
# -*- coding: utf-8 -*-
"""
Created on Sun Dec 4 09:27:48 2016 @author: daxiong
"""
import nltk
import random
from nltk.corpus import movie_reviews documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)] random.shuffle(documents) all_words = [] for w in movie_reviews.words():
all_words.append(w.lower()) #dict_allWords是一个字典,存储所有文字的频率分布
dict_allWords = nltk.FreqDist(all_words)
#字典keys()列出所有单词,[:3000]表示列出前三千文字
word_features = list(dict_allWords.keys())[:3000]
'''
'combating',
'mouthing',
'markings',
'directon',
'ppk',
'vanishing',
'victories',
'huddleston',
...]
''' def find_features(document):
words = set(document)
features = {}
for w in word_features:
features[w] = (w in words) return features words=movie_reviews.words('neg/cv000_29416.txt')
'''
Out[78]: ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
type(words)
Out[65]: nltk.corpus.reader.util.StreamBackedCorpusView ''' #去重,words1为集合形式
words1 = set(words)
'''
words1 {'!',
'"',
'&',
"'",
'(',
')',.......
'witch',
'with',
'world',
'would',
'wrapped',
'write',
'world',
'would',
'wrapped',
'write',
'years',
'you',
'your'}
'''
features = {} #victories单词不在words1,输出false
('victories' in words1)
'''
Out[73]: False
''' features['victories'] = ('victories' in words1)
'''
features
Out[75]: {'victories': False}
''' print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))
'''
'schwarz': False,
'supervisors': False,
'geyser': False,
'site': False,
'fevered': False,
'acknowledged': False,
'ronald': False,
'wroth': False,
'degredation': False,
...}
''' featuresets = [(find_features(rev), category) for (rev, category) in documents]
featuresets 特征集合一共有2000个文件,每个文件是一个元组,元组包含字典(“glory”:False)和neg/pos分类


python风控评分卡建模和风控常识(博客主亲自录制视频教程)
自然语言27_Converting words to Features with NLTK的更多相关文章
- 自然语言18.1_Named Entity Recognition with NLTK
QQ:231469242 欢迎nltk爱好者交流 https://www.pythonprogramming.net/named-entity-recognition-nltk-tutorial/?c ...
- 自然语言15_Part of Speech Tagging with NLTK
https://www.pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/?completed=/stemming-nltk-tut ...
- 自然语言12_Tokenizing Words and Sentences with NLTK
https://www.pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/ # -*- coding: utf-8 -*- ...
- 自然语言处理NLP程序包(NLTK/spaCy)使用总结
NLTK和SpaCy是NLP的Python应用,提供了一些现成的处理工具和数据接口.下面介绍它们的一些常用功能和特性,便于对NLP研究的组成形式有一个基本的了解. NLTK Natural Langu ...
- Python 自然语言处理(1) 计数词汇
Python有一个自然语言处理的工具包,叫做NLTK(Natural Language ToolKit),可以帮助你实现自然语言挖掘,语言建模等等工作.但是没有NLTK,也一样可以实现简单的词类统计. ...
- 【Python自然语言处理】第一章学习笔记——搜索文本、计数统计和字符串链表
这本书主要是基于Python和一个自然语言工具包(Natural Language Toolkit, NLTK)的开源库进行讲解 NLTK 介绍:NLTK是一个构建Python程序以处理人类语言数据的 ...
- python笔记10-----便捷网络数据NLTK语料库
1.NLTK的概念 NLTK:Natural language toolkit,是一套基于python的自然语言处理工具. 2.NLTK中集成了语料与模型等的包管理器,通过在python编辑器中执行. ...
- 【JulyEdu-Python基础】第 1 课:入门基础
一些学习资源的收集: 可汗学院 视频 公开课 Grossin 编程教室: 一个非常简单,对初学者非常友好的教程和在线联系 廖雪峰教程 书籍: Python核心编程: 这本书应该是最清楚.最深入全面的书 ...
- python文件打开模式&time&python第三方库
r:以只读方式打开文件.文件的指针将会放在文件的开头.这是默认模式. w:打开一个文件只用于写入.如果该文件已存在则将其覆盖.如果该文件不存在,创建新文件. a:打开一个文件用于追加.如果该文件已存在 ...
随机推荐
- Python基础1
本节内容2016-05-30 Python介绍 发展史 Python 2 0r 3? 安装 Hello word程序 变量 用户输入 模块初识 .pyc? 数据类型初识 数据运算 if...else语 ...
- 烂泥:学习tomcat之通过shell批量管理多个tomcat
本文由ilanniweb提供友情赞助,首发于烂泥行天下 想要获得更多的文章,可以关注我的微信ilanniweb 公司的业务是使用tomcat做web容器,为了更有效的利用服务器的性能,我们一般部署多个 ...
- WPF 自定义绕圈进度条
在设计界面时,有时会遇到进度条,本次讲解如何设计自定义的绕圈进度条,直接上代码: 1.控件界面 <UserControl x:Class="ProgressBarControl&quo ...
- Linux第一天 ssh登录和软件安装详解
Linux学习第一天 操作环境: Ubuntu 16.04 Win10系统,使用putty_V0.63 本身学习Linux就是想在服务器上使用的.实际情况,可能我很难直接到坐在服务器前,使用界面操作系 ...
- 关于stm32定时器的理解
TIM_OCInitStructure.TIM_OCPolarity = TIM_OCPolarity_High; 表面意思是输出控制极性为高,但是意思是定时器输入0,不反相,输出0: 输出控制极性为 ...
- 2015.1.25 Delphi打开网址链接的几种方法
Delphi打开网址链接的几种方法1.使用shellapi打开系统中默认的浏览器 首先需在头部引用 shellapi单元即在uses中添加shellapi,这里我们需要知道有 ...
- windows下安装并配置mysql
前言:前面三篇文章将django的环境搭建完后,还只能编写静态网页,如果要用到数据库编写动态网页,那么还需要数据库 本章讲解mysql5.6数据库的安装和配置,对于其他版本仅供参考,不一定试用!推荐使 ...
- 基于.net开发chrome核心浏览器【七】
这是一个系列的文章,前面六篇文章的地址如下: 基于.net开发chrome核心浏览器[六] 基于.net开发chrome核心浏览器[五] 基于.net开发chrome核心浏览器[四] 基于.net开发 ...
- NOIP模拟赛20161007
%hzwer http://hzwer.com/7602.html 题目名称 “与” 小象涂色 行动!行动! 输入文件 and.in elephant.in move.in 输出文件 and.out ...
- 第37课 深度解析QMap与QHash
1. QMap深度解析 (1)QMap是一个以升序键顺序存储键值对的数据结构 ①QMap原型为 class QMap<K, T>模板 ②QMap中的键值对根据Key进行了排序 ③QMap中 ...
