自然语言27_Converting words to Features with NLTK
sklearn实战-乳腺癌细胞数据挖掘(博客主亲自录制视频教程)
https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share

https://www.pythonprogramming.net/words-as-features-nltk-tutorial/
Converting words to Features with NLTK
In this tutorial, we're going to be building off the previous
video and compiling feature lists of words from positive reviews and
words from the negative reviews to hopefully see trends in specific
types of words in positive or negative reviews.
To start, our code:
import nltk
import random
from nltk.corpus import movie_reviews documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)] random.shuffle(documents) all_words = [] for w in movie_reviews.words():
all_words.append(w.lower()) all_words = nltk.FreqDist(all_words) word_features = list(all_words.keys())[:3000]
Mostly the same as before, only with now a new variable, word_features, which contains the top 3,000 most common words. Next, we're going to build a quick function that will find these top 3,000 words in our positive and negative documents, marking their presence as either positive or negative:
def find_features(document):
words = set(document)
features = {}
for w in word_features:
features[w] = (w in words) return features
Next, we can print one feature set like:
print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))
Then we can do this for all of our documents, saving the feature existence booleans and their respective positive or negative categories by doing:
featuresets = [(find_features(rev), category) for (rev, category) in documents]
Awesome, now that we have our features and labels, what is next? Typically the next step is to go ahead and train an algorithm, then test it. So, let's go ahead and do that, starting with the Naive Bayes classifier in the next tutorial!
# -*- coding: utf-8 -*-
"""
Created on Sun Dec 4 09:27:48 2016 @author: daxiong
"""
import nltk
import random
from nltk.corpus import movie_reviews documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)] random.shuffle(documents) all_words = [] for w in movie_reviews.words():
all_words.append(w.lower()) #dict_allWords是一个字典,存储所有文字的频率分布
dict_allWords = nltk.FreqDist(all_words)
#字典keys()列出所有单词,[:3000]表示列出前三千文字
word_features = list(dict_allWords.keys())[:3000]
'''
'combating',
'mouthing',
'markings',
'directon',
'ppk',
'vanishing',
'victories',
'huddleston',
...]
''' def find_features(document):
words = set(document)
features = {}
for w in word_features:
features[w] = (w in words) return features words=movie_reviews.words('neg/cv000_29416.txt')
'''
Out[78]: ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
type(words)
Out[65]: nltk.corpus.reader.util.StreamBackedCorpusView ''' #去重,words1为集合形式
words1 = set(words)
'''
words1 {'!',
'"',
'&',
"'",
'(',
')',.......
'witch',
'with',
'world',
'would',
'wrapped',
'write',
'world',
'would',
'wrapped',
'write',
'years',
'you',
'your'}
'''
features = {} #victories单词不在words1,输出false
('victories' in words1)
'''
Out[73]: False
''' features['victories'] = ('victories' in words1)
'''
features
Out[75]: {'victories': False}
''' print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))
'''
'schwarz': False,
'supervisors': False,
'geyser': False,
'site': False,
'fevered': False,
'acknowledged': False,
'ronald': False,
'wroth': False,
'degredation': False,
...}
''' featuresets = [(find_features(rev), category) for (rev, category) in documents]
featuresets 特征集合一共有2000个文件,每个文件是一个元组,元组包含字典(“glory”:False)和neg/pos分类


python风控评分卡建模和风控常识(博客主亲自录制视频教程)
自然语言27_Converting words to Features with NLTK的更多相关文章
- 自然语言18.1_Named Entity Recognition with NLTK
QQ:231469242 欢迎nltk爱好者交流 https://www.pythonprogramming.net/named-entity-recognition-nltk-tutorial/?c ...
- 自然语言15_Part of Speech Tagging with NLTK
https://www.pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/?completed=/stemming-nltk-tut ...
- 自然语言12_Tokenizing Words and Sentences with NLTK
https://www.pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/ # -*- coding: utf-8 -*- ...
- 自然语言处理NLP程序包(NLTK/spaCy)使用总结
NLTK和SpaCy是NLP的Python应用,提供了一些现成的处理工具和数据接口.下面介绍它们的一些常用功能和特性,便于对NLP研究的组成形式有一个基本的了解. NLTK Natural Langu ...
- Python 自然语言处理(1) 计数词汇
Python有一个自然语言处理的工具包,叫做NLTK(Natural Language ToolKit),可以帮助你实现自然语言挖掘,语言建模等等工作.但是没有NLTK,也一样可以实现简单的词类统计. ...
- 【Python自然语言处理】第一章学习笔记——搜索文本、计数统计和字符串链表
这本书主要是基于Python和一个自然语言工具包(Natural Language Toolkit, NLTK)的开源库进行讲解 NLTK 介绍:NLTK是一个构建Python程序以处理人类语言数据的 ...
- python笔记10-----便捷网络数据NLTK语料库
1.NLTK的概念 NLTK:Natural language toolkit,是一套基于python的自然语言处理工具. 2.NLTK中集成了语料与模型等的包管理器,通过在python编辑器中执行. ...
- 【JulyEdu-Python基础】第 1 课:入门基础
一些学习资源的收集: 可汗学院 视频 公开课 Grossin 编程教室: 一个非常简单,对初学者非常友好的教程和在线联系 廖雪峰教程 书籍: Python核心编程: 这本书应该是最清楚.最深入全面的书 ...
- python文件打开模式&time&python第三方库
r:以只读方式打开文件.文件的指针将会放在文件的开头.这是默认模式. w:打开一个文件只用于写入.如果该文件已存在则将其覆盖.如果该文件不存在,创建新文件. a:打开一个文件用于追加.如果该文件已存在 ...
随机推荐
- linux
托瓦兹跟BBS上面一堆工秳师一样, 他发现Minix虽然真癿很棒,但是谭宁邦教授就是丌愿意迚行功能癿加强,导致一堆工秳师在操作系统功能上面癿欲求丌满! 这个时候年轻癿托瓦兹就想:『既然如此,那我何丌自 ...
- Chrome/Firefox 中头toFixed方法四舍五入兼容性问题
每个Number的toFixed()方法可把 Number 四舍五入为指定小数位数的数字.四舍五入顾名思义,4及以下舍去,5及以上加1. 四舍 1.31.toFixed(1) // 1.3 1.32. ...
- ELF Format 笔记(十一)—— 程序头结构
ilocker:关注 Android 安全(新手) QQ: 2597294287 程序头表 (program header table) 是一个结构体数组,数组中的每个结构体元素是一个程序头 (pro ...
- Linux下定时执行脚本(转自Decode360)
文章来自:http://www.blogjava.net/decode360/archive/2009/09/18/287743.html Decode360's Blog 老师(业精于勤而荒于嬉 ...
- 【WPF系列】基础学习-XAML
引言 WPF框架中已经提到,WPF框架提供XAML基本服务.WPF中XAML的引入向开发者提供UI设计和代码分离的编程型.XAML是WPF中提出的一个具有重要意义的新技术,基本涉及WPF中所有UI开发 ...
- HTML总结
几个知识点: HTML 指的是超文本标记语言 (Hyper Text Markup Language) HTML框架结构: <!DOCTYPE html> <html> < ...
- Apache、nginx配置的网站127.0.0.1可以正常访问,内外网的ip地址无法访问,谁的锅?
最近做开发,发现一个比较尴尬的问题.因为我是一个web开发者,经常要用到Apache或者nginx等服务器软件,经过我测试发现,只要我打开了adsafe,我便不能通过ip地址访问我本地的网站了,比如我 ...
- CODEVS1643 线段覆盖3[贪心]
1643 线段覆盖 3 时间限制: 2 s 空间限制: 256000 KB 题目等级 : 黄金 Gold 题解 题目描述 Description 在一个数轴上有n条线段,现要选 ...
- 数据库_MYSQL
数据库简介 数据库分类:关系型数据库.非关系型数据库 常用的关系型数据库有:orcale .mysql .sql server等等 常用的非关系型数据库有: Memcached.redis.mong ...
- JS实现Observable观察者模式
欢迎讨论与交流 : ) 注 代码参考自——汇智网 RxJS教程 前言 Observable观察者模式令小白笔者眼前一亮.数据生产者(observable)负责生产新鲜的数据,同时在生产完毕后'通知“消 ...
