Natural Language Processing with Python

Chapter 6.1

由于nltk.FreqDist的排序问题,获取电影文本特征词的代码有些微改动。

 import nltk
from nltk.corpus import movie_reviews as mr def document_features(document,words_features):
document_words=set(document)
features={}
for word in words_features:
features['has(%s)' %word] = (word in document_words)
return features def test_doc_classification():
documents=[(list(mr.words(fileid)),category)
for category in mr.categories()
for fileid in mr.fileids(categories=category)]
all_words_dist=nltk.FreqDist(w.lower() for w in mr.words())
words_freq =sorted(all_words_dist.items(), key=lambda x: (-1*x[1], x[0]))[:2000]
words_features=[word[0] for word in words_freq] featuresets=[(document_features(doc,words_features),c) for (doc,c) in
documents] train_set, test_set= featuresets[100:],featuresets[:100]
classifier=nltk.NaiveBayesClassifier.train(train_set) print nltk.classify.accuracy(classifier,test_set) classifier.show_most_informative_features(5)

结果如下,accuracy为0.86:

0.86
Most Informative Features
has(outstanding) = True pos : neg = 10.4 : 1.0
has(seagal) = True neg : pos = 8.7 : 1.0
has(mulan) = True pos : neg = 8.1 : 1.0
has(wonderfully) = True pos : neg = 6.3 : 1.0
has(damon) = True pos : neg = 5.7 : 1.0

Document Classification的更多相关文章

  1. Support Vector Machines for classification

    Support Vector Machines for classification To whet your appetite for support vector machines, here’s ...

  2. Classification of text documents: using a MLComp dataset

    注:原文代码链接http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html ...

  3. [Tensorflow] RNN - 04. Work with CNN for Text Classification

    Ref: Combining CNN and RNN for spoken language identification Ref: Convolutional Methods for Text [1 ...

  4. 论文列表——text classification

    https://blog.csdn.net/BitCs_zt/article/details/82938086 列出自己阅读的text classification论文的列表,以后有时间再整理相应的笔 ...

  5. Link-based Classification相关数据集

    Link-based Classification相关数据集 Datasets Document Classification Datasets: CiteSeer: The CiteSeer dat ...

  6. #论文阅读# Universial language model fine-tuing for text classification

    论文链接:https://aclweb.org/anthology/P18-1031 对文章内容的总结 文章研究了一些在general corous上pretrain LM,然后把得到的model t ...

  7. Text Classification

    Text Classification For purpose of word embedding extrinsic evaluation, especially downstream task. ...

  8. Machine Learning Algorithms Study Notes(2)--Supervised Learning

    Machine Learning Algorithms Study Notes 高雪松 @雪松Cedro Microsoft MVP 本系列文章是Andrew Ng 在斯坦福的机器学习课程 CS 22 ...

  9. Similarity-based Learning

    Similarity-based approaches to machine learning come from the idea that the best way to make a predi ...

随机推荐

  1. Checking the Calendar

    Checking the Calendar time limit per test 1 second memory limit per test 256 megabytes input standar ...

  2. POJ 1182 食物链 经典并查集+关系向量简单介绍

    题目: 动物王国中有三类动物A,B,C,这三类动物的食物链构成了有趣的环形.A吃B, B吃C,C吃A. 现有N个动物,以1-N编号.每个动物都是A,B,C中的一种,但是我们并不知道它到底是哪一种. 有 ...

  3. HDU 5768 Lucky7 (容斥原理 + 中国剩余定理 + 状态压缩 + 带膜乘法)

    题意:……应该不用我说了,看起来就很容斥原理,很中国剩余定理…… 方法:因为题目中的n最大是15,使用状态压缩可以将所有的组合都举出来,然后再拆开成数组,进行中国剩余定理的运算,中国剩余定理能够求出同 ...

  4. 线关节(Line Joint)

    package{ import Box2D.Common.Math.b2Vec2; import Box2D.Dynamics.b2Body; import Box2D.Dynamics.Joints ...

  5. java synchronized内置锁的可重入性和分析总结

    最近在读<<Java并发编程实践>>,在第二章中线程安全中降到线程锁的重进入(Reentrancy) 当一个线程请求其它的线程已经占有的锁时,请求线程将被阻塞.然而内部锁是可重 ...

  6. POJ 3307 Smart Sister

    先找出所有的数,排序,然后o(1)效率询问 #include<cstdio> #include<cstring> #include<cmath> #include& ...

  7. 帝国CMS系统结合项图文教程

    为了使信息列表可实现按多种条件输出数据,帝国CMS独创可设置无限条件的模型结合项功能.帝国CMS的结合项功能是指按模型多个字段内容来结合显示对应的信息. 二.结合项的语法说明 结合项访问地址: /e/ ...

  8. 实现RGB,CMY(K),YUV,YIQ,YCbCr颜色的转换算法

    源:http://blog.sina.com.cn/s/blog_4d80055a01000atu.html import java.lang.Math; import java.awt.*; pub ...

  9. tabBarItem动画

    1.有时,我们需要为tabBarItem设置一些动画.在网上查了半天,没有结果.自己写了一个简单的动画 代码如下: - (void)tabBarController:(UITabBarControll ...

  10. WeakSelf和StrongSelf

    转载自:http://sherlockyao.com/blog/2015/08/08/weakself-and-strongself-in-blocks/ 现在我们用 Objective-C 写代码时 ...