sklearn实战-乳腺癌细胞数据挖掘(博客主亲自录制视频教程)

https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share

https://www.pythonprogramming.net/words-as-features-nltk-tutorial/

Converting words to Features with NLTK

In this tutorial, we're going to be building off the previous
video and compiling feature lists of words from positive reviews and
words from the negative reviews to hopefully see trends in specific
types of words in positive or negative reviews.

To start, our code:

import nltk
import random
from nltk.corpus import movie_reviews documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)] random.shuffle(documents) all_words = [] for w in movie_reviews.words():
all_words.append(w.lower()) all_words = nltk.FreqDist(all_words) word_features = list(all_words.keys())[:3000]

Mostly the same as before, only with now a new variable, word_features, which contains the top 3,000 most common words. Next, we're going to build a quick function that will find these top 3,000 words in our positive and negative documents, marking their presence as either positive or negative:

def find_features(document):
words = set(document)
features = {}
for w in word_features:
features[w] = (w in words) return features

Next, we can print one feature set like:

print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))

Then we can do this for all of our documents, saving the feature existence booleans and their respective positive or negative categories by doing:

featuresets = [(find_features(rev), category) for (rev, category) in documents]

Awesome, now that we have our features and labels, what is next? Typically the next step is to go ahead and train an algorithm, then test it. So, let's go ahead and do that, starting with the Naive Bayes classifier in the next tutorial!

# -*- coding: utf-8 -*-
"""
Created on Sun Dec 4 09:27:48 2016 @author: daxiong
"""
import nltk
import random
from nltk.corpus import movie_reviews documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)] random.shuffle(documents) all_words = [] for w in movie_reviews.words():
all_words.append(w.lower()) #dict_allWords是一个字典,存储所有文字的频率分布
dict_allWords = nltk.FreqDist(all_words)
#字典keys()列出所有单词,[:3000]表示列出前三千文字
word_features = list(dict_allWords.keys())[:3000]
'''
'combating',
'mouthing',
'markings',
'directon',
'ppk',
'vanishing',
'victories',
'huddleston',
...]
''' def find_features(document):
words = set(document)
features = {}
for w in word_features:
features[w] = (w in words) return features words=movie_reviews.words('neg/cv000_29416.txt')
'''
Out[78]: ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
type(words)
Out[65]: nltk.corpus.reader.util.StreamBackedCorpusView ''' #去重,words1为集合形式
words1 = set(words)
'''
words1 {'!',
'"',
'&',
"'",
'(',
')',.......
'witch',
'with',
'world',
'would',
'wrapped',
'write',
'world',
'would',
'wrapped',
'write',
'years',
'you',
'your'}
'''
features = {} #victories单词不在words1,输出false
('victories' in words1)
'''
Out[73]: False
''' features['victories'] = ('victories' in words1)
'''
features
Out[75]: {'victories': False}
''' print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))
'''
'schwarz': False,
'supervisors': False,
'geyser': False,
'site': False,
'fevered': False,
'acknowledged': False,
'ronald': False,
'wroth': False,
'degredation': False,
...}
''' featuresets = [(find_features(rev), category) for (rev, category) in documents]

featuresets 特征集合一共有2000个文件,每个文件是一个元组,元组包含字典(“glory”:False)和neg/pos分类

python风控评分卡建模和风控常识(博客主亲自录制视频教程)

自然语言27_Converting words to Features with NLTK的更多相关文章

  1. 自然语言18.1_Named Entity Recognition with NLTK

    QQ:231469242 欢迎nltk爱好者交流 https://www.pythonprogramming.net/named-entity-recognition-nltk-tutorial/?c ...

  2. 自然语言15_Part of Speech Tagging with NLTK

    https://www.pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/?completed=/stemming-nltk-tut ...

  3. 自然语言12_Tokenizing Words and Sentences with NLTK

    https://www.pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/ # -*- coding: utf-8 -*- ...

  4. 自然语言处理NLP程序包(NLTK/spaCy)使用总结

    NLTK和SpaCy是NLP的Python应用,提供了一些现成的处理工具和数据接口.下面介绍它们的一些常用功能和特性,便于对NLP研究的组成形式有一个基本的了解. NLTK Natural Langu ...

  5. Python 自然语言处理(1) 计数词汇

    Python有一个自然语言处理的工具包,叫做NLTK(Natural Language ToolKit),可以帮助你实现自然语言挖掘,语言建模等等工作.但是没有NLTK,也一样可以实现简单的词类统计. ...

  6. 【Python自然语言处理】第一章学习笔记——搜索文本、计数统计和字符串链表

    这本书主要是基于Python和一个自然语言工具包(Natural Language Toolkit, NLTK)的开源库进行讲解 NLTK 介绍:NLTK是一个构建Python程序以处理人类语言数据的 ...

  7. python笔记10-----便捷网络数据NLTK语料库

    1.NLTK的概念 NLTK:Natural language toolkit,是一套基于python的自然语言处理工具. 2.NLTK中集成了语料与模型等的包管理器,通过在python编辑器中执行. ...

  8. 【JulyEdu-Python基础】第 1 课:入门基础

    一些学习资源的收集: 可汗学院 视频 公开课 Grossin 编程教室: 一个非常简单,对初学者非常友好的教程和在线联系 廖雪峰教程 书籍: Python核心编程: 这本书应该是最清楚.最深入全面的书 ...

  9. python文件打开模式&time&python第三方库

    r:以只读方式打开文件.文件的指针将会放在文件的开头.这是默认模式. w:打开一个文件只用于写入.如果该文件已存在则将其覆盖.如果该文件不存在,创建新文件. a:打开一个文件用于追加.如果该文件已存在 ...

随机推荐

  1. django ORM

    http://www.cnblogs.com/alex3714/articles/5512568.html 常用ORM操作 一.示例Models from django.db import model ...

  2. JVM之方法区

     基本特性: 线程共享区域,存储被JVM加载的类信息.常量.静态变量.即时编译器编译的代码等 堆的逻辑部分,不限定方法去内的内存位置和编译代码的管理策略,不限定实现垃圾回收 容量可不定也可动态扩展,不 ...

  3. sql server2014不允许保存更改。阻止保存要求重新创建表的更改

    错误描述: SQL Server2014在原有的数据表中修改表结构后,保存数据表,提示错误如下: 不允许保存更改.您所做的更改要求删除并重新创建以下您对无法重新创建的表进行了更改或启用了"阻 ...

  4. C++STL - 类模板

    类的成员变量,成员函数,成员类型,以及基类中如果包含参数化的类型,那么该类就是一个类模板   1.定义 template<typename 类型形参1, typename 类型形参2,...&g ...

  5. 安装Ubuntu的那些事儿(续)

    由于我的第一篇Blog并没有给出完全解决进Ubuntu系统时显卡所造成的问题,至于那个装显卡驱动的方法本人也没有去做,感兴趣的朋友可以在网上教程试一试. 至于我的那个在高系选项中进行配置也不是好的方法 ...

  6. android xml 布局错误

    最近重新安装了下android开发环境,发现在调整页面的时候 ,老是报以下错误,导致无法静态显示ui效果. Missing styles. Is the correct theme chosen fo ...

  7. python数据库操作对主机批量管理

    import paramiko import MySQLdb conn = MySQLdb.connect(host=',db='host') cur = conn.cursor(cursorclas ...

  8. 第12章 Java字符串

    1.什么是Java中的字符串 字符串String并不是一种数据类型,而是一个类对象,在java.lang包中,只不过在默认情况下java都是自动导入的,所以可以直接使用创建一个String对象的方法有 ...

  9. SortedMap接口:进行排序操作。

    回顾:SortedSet是TreeSet的实现接口,此接口可以排序. SortedMap接口同样可以排序,是TreeMap的实现接口,父类. 定义如下: public class TreeMap< ...

  10. 15-前端开发之JavaScript

    什么是 JavaScript ? JavaScript是一门编程语言,浏览器内置了JavaScript语言的解释器,所以在浏览器上按照JavaScript语言的规则编写相应代码之,浏览器可以解释并做出 ...