Natural Language Processing with Python

Chapter 6.1

由于nltk.FreqDist的排序问题,获取电影文本特征词的代码有些微改动。

 import nltk
from nltk.corpus import movie_reviews as mr def document_features(document,words_features):
document_words=set(document)
features={}
for word in words_features:
features['has(%s)' %word] = (word in document_words)
return features def test_doc_classification():
documents=[(list(mr.words(fileid)),category)
for category in mr.categories()
for fileid in mr.fileids(categories=category)]
all_words_dist=nltk.FreqDist(w.lower() for w in mr.words())
words_freq =sorted(all_words_dist.items(), key=lambda x: (-1*x[1], x[0]))[:2000]
words_features=[word[0] for word in words_freq] featuresets=[(document_features(doc,words_features),c) for (doc,c) in
documents] train_set, test_set= featuresets[100:],featuresets[:100]
classifier=nltk.NaiveBayesClassifier.train(train_set) print nltk.classify.accuracy(classifier,test_set) classifier.show_most_informative_features(5)

结果如下,accuracy为0.86:

0.86
Most Informative Features
has(outstanding) = True pos : neg = 10.4 : 1.0
has(seagal) = True neg : pos = 8.7 : 1.0
has(mulan) = True pos : neg = 8.1 : 1.0
has(wonderfully) = True pos : neg = 6.3 : 1.0
has(damon) = True pos : neg = 5.7 : 1.0

Document Classification的更多相关文章

  1. Support Vector Machines for classification

    Support Vector Machines for classification To whet your appetite for support vector machines, here’s ...

  2. Classification of text documents: using a MLComp dataset

    注:原文代码链接http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html ...

  3. [Tensorflow] RNN - 04. Work with CNN for Text Classification

    Ref: Combining CNN and RNN for spoken language identification Ref: Convolutional Methods for Text [1 ...

  4. 论文列表——text classification

    https://blog.csdn.net/BitCs_zt/article/details/82938086 列出自己阅读的text classification论文的列表,以后有时间再整理相应的笔 ...

  5. Link-based Classification相关数据集

    Link-based Classification相关数据集 Datasets Document Classification Datasets: CiteSeer: The CiteSeer dat ...

  6. #论文阅读# Universial language model fine-tuing for text classification

    论文链接:https://aclweb.org/anthology/P18-1031 对文章内容的总结 文章研究了一些在general corous上pretrain LM,然后把得到的model t ...

  7. Text Classification

    Text Classification For purpose of word embedding extrinsic evaluation, especially downstream task. ...

  8. Machine Learning Algorithms Study Notes(2)--Supervised Learning

    Machine Learning Algorithms Study Notes 高雪松 @雪松Cedro Microsoft MVP 本系列文章是Andrew Ng 在斯坦福的机器学习课程 CS 22 ...

  9. Similarity-based Learning

    Similarity-based approaches to machine learning come from the idea that the best way to make a predi ...

随机推荐

  1. java 内存分配全面解析

    JVM是什么? 首先要知道的是Java程序运行在JVM(Java Virtual Machine,Java虚拟机)上;可以把JVM理解成Java程序和操作系统之间的桥梁,JVM实现了 Java的平台无 ...

  2. redis 工具类 单个redis、JedisPool 及多个redis、shardedJedisPool与spring的集成配置

    http://www.cnblogs.com/edisonfeng/p/3571870.html http://javacrazyer.iteye.com/blog/1840161 http://ww ...

  3. Review Board的使用

    代码审核工具.先在命令行界面,进入到工程的Main目录下,然后使用命令 svn diff>yus.diff  这样就将Main里面的所有内容生成了,然后在浏览器里进入到自己的Review Boa ...

  4. java设计原则:16种原则

    一   类的设计原则   1 依赖倒置原则-Dependency Inversion Principle (DIP) 2 里氏替换原则-Liskov Substitution Principle (L ...

  5. itext操作PDF文件添加水印

    功能描述:添加图片和文字水印 /** * * [功能描述:添加图片和文字水印] [功能详细描述:功能详细描述] * @param srcFile 待加水印文件 * @param destFile 加水 ...

  6. HDFS在Linux下的命令

    1.对hdfs操作的命令格式是 1.1hadoop fs  -ls <path> 表示对hdfs下一级目录的查看 1.2 hadoop fs -lsr <path> 表示对hd ...

  7. 解决phpmyadmin上传文件大小限制的配置方法(转)

    phpmyadmin导入SQL文件时涉及到phpmyadmin上传文件大小限制问题,默认phpmyadmin上传文件大小为2M,如果想要 phpmyadmin上传超过2M大文件,就需要修改phpmya ...

  8. 1084:XX开公司<回溯>

    Description 2020年,xx开了一家拥有N个员工的大公司.每天,xx都要分配N项工作给他的员工,但是,由于能力的不同,每个人对处理相同工作所需要的时间有快有慢.众所周知,xx是一个非常重视 ...

  9. jquery奇怪的问题

    Jquery中 $("#data_table4 tr:eq(0)").after("<tr><td>" +result+data.row ...

  10. Split()特殊字符

    关于点的问题是用string.split("[.]") 解决. 关于竖线的问题用 string.split("\\|")解决. 关于星号的问题用 string. ...