Natural Language Processing with Python

Chapter 6.1

Because of how nltk.FreqDist orders its entries, the code that selects feature words from the movie review corpus needs a small change.
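The ordering fix can be seen in isolation in a small sketch. It assumes the issue is that a frequency distribution's keys are not guaranteed to come back in descending-frequency order, so the top words are selected with an explicit sort. `collections.Counter` stands in for `nltk.FreqDist` here so the example runs without the corpus; the helper name `top_words` and the toy sentence are hypothetical.

```python
from collections import Counter


def top_words(counts, n):
    """Return the n most frequent words, breaking ties alphabetically."""
    # Negating the count sorts by descending frequency; the word itself
    # is the secondary key, so equal counts come out in a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Toy counts standing in for nltk.FreqDist(w.lower() for w in mr.words())
toy_counts = Counter("the film the plot a film the end a".split())
print(top_words(toy_counts, 3))  # ['the', 'a', 'film']
```

Sorting on the pair (negative count, word) makes the selected vocabulary reproducible between runs, which a plain slice of the keys would not guarantee.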

    import nltk
    from nltk.corpus import movie_reviews as mr

    def document_features(document, words_features):
        """Map a document to {'has(word)': bool} for each feature word."""
        document_words = set(document)  # set lookup is faster than list lookup
        features = {}
        for word in words_features:
            features['has(%s)' % word] = (word in document_words)
        return features

    def test_doc_classification():
        # One (word list, category) pair per review, in corpus order.
        documents = [(list(mr.words(fileid)), category)
                     for category in mr.categories()
                     for fileid in mr.fileids(categories=category)]
        all_words_dist = nltk.FreqDist(w.lower() for w in mr.words())
        # Sort by descending frequency, then alphabetically; keep the top 2000.
        words_freq = sorted(all_words_dist.items(), key=lambda x: (-1 * x[1], x[0]))[:2000]
        words_features = [word[0] for word in words_freq]
        featuresets = [(document_features(doc, words_features), c) for (doc, c) in documents]
        # Hold out the first 100 documents (corpus order, no shuffling) for testing.
        train_set, test_set = featuresets[100:], featuresets[:100]
        classifier = nltk.NaiveBayesClassifier.train(train_set)
        print(nltk.classify.accuracy(classifier, test_set))
        classifier.show_most_informative_features(5)

    test_doc_classification()
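To make the shape of a feature set concrete, here is a minimal sketch that applies the same feature extractor to a toy document. The word list and the three-word vocabulary are made up for illustration; the real code uses the 2000 most frequent corpus words.

```python
def document_features(document, words_features):
    """Map a document to {'has(word)': bool} for each feature word."""
    document_words = set(document)
    return {'has(%s)' % word: (word in document_words) for word in words_features}


# Hypothetical inputs, not taken from the movie_reviews corpus.
toy_doc = ['a', 'wonderfully', 'acted', 'film']
toy_vocab = ['film', 'seagal', 'wonderfully']

features = document_features(toy_doc, toy_vocab)
print(features)
# {'has(film)': True, 'has(seagal)': False, 'has(wonderfully)': True}
```

Each document becomes a dictionary with one boolean entry per feature word, which is the input format that nltk.NaiveBayesClassifier.train expects when paired with a label.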

Running test_doc_classification() gives the following output, with an accuracy of 0.86:

    0.86
    Most Informative Features
        has(outstanding) = True    pos : neg = 10.4 : 1.0
             has(seagal) = True    neg : pos =  8.7 : 1.0
              has(mulan) = True    pos : neg =  8.1 : 1.0
        has(wonderfully) = True    pos : neg =  6.3 : 1.0
              has(damon) = True    pos : neg =  5.7 : 1.0
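Each ratio in this table compares how likely a feature value is under one label versus the other. The sketch below shows the arithmetic under stated assumptions: add-0.5 (expected likelihood) smoothing, which is believed to be the default estimator of NLTK's Naive Bayes classifier, and two possible values per feature (True and False). The counts are hypothetical and are not taken from the corpus.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with add-gamma smoothing."""
    return (count + gamma) / (total + gamma * bins)


def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(value | label a) to P(value | label b)."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)


# Hypothetical counts: the word occurs in 40 of 1000 "pos" training
# documents and in 4 of 1000 "neg" training documents.
ratio = likelihood_ratio(40, 1000, 4, 1000)
print("pos : neg = %.1f : 1.0" % ratio)  # pos : neg = 9.0 : 1.0
```

A large ratio means the word is rare overall but concentrated in one class, which is why names such as "seagal" or "mulan" can rank above more common sentiment words.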
