Document Classification

Natural Language Processing with Python

Chapter 6.1

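Because of how nltk.FreqDist orders its entries, the code that extracts the feature words from the movie review texts has been changed slightly.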

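 import nltk
 from nltk.corpus import movie_reviews as mr

 def document_features(document, words_features):
     # Map each candidate feature word to whether it occurs in the document.
     document_words = set(document)
     features = {}
     for word in words_features:
         features['has(%s)' % word] = (word in document_words)
     return features

 def test_doc_classification():
     # Pair each review's word list with its category label.
     documents = [(list(mr.words(fileid)), category)
                  for category in mr.categories()
                  for fileid in mr.fileids(categories=category)]
     all_words_dist = nltk.FreqDist(w.lower() for w in mr.words())
     # Take the 2000 most frequent words: count descending, ties alphabetical.
     words_freq = sorted(all_words_dist.items(), key=lambda x: (-x[1], x[0]))[:2000]
     words_features = [word for word, _ in words_freq]
     featuresets = [(document_features(doc, words_features), c)
                    for (doc, c) in documents]
     # Hold out the first 100 feature sets for testing. The documents are not
     # shuffled here, so these all come from the first category listed; the
     # book calls random.shuffle(documents) before making this split.
     train_set, test_set = featuresets[100:], featuresets[:100]
     classifier = nltk.NaiveBayesClassifier.train(train_set)
     print(nltk.classify.accuracy(classifier, test_set))
     classifier.show_most_informative_features(5)

 if __name__ == '__main__':
     test_doc_classification()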

The output is shown below; the accuracy is 0.86:

0.86
Most Informative Features
has(outstanding) = True pos : neg = 10.4 : 1.0
has(seagal) = True neg : pos = 8.7 : 1.0
has(mulan) = True pos : neg = 8.1 : 1.0
has(wonderfully) = True pos : neg = 6.3 : 1.0
has(damon) = True pos : neg = 5.7 : 1.0
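
The sorting step used to pick the feature words can be tried on its own. The following is only a minimal sketch: it uses collections.Counter as a stand-in for nltk.FreqDist (in NLTK 3, FreqDist is a subclass of Counter), the helper name top_words is hypothetical, and the token list is made up for illustration rather than taken from movie_reviews.

```python
from collections import Counter


def top_words(counts, n):
    """Return the n most frequent words, breaking ties alphabetically.

    Hypothetical helper: `counts` is any mapping of word -> count,
    such as a Counter (or, by assumption, an nltk.FreqDist).
    """
    # Sort by count descending, then by word ascending so the result
    # does not depend on the mapping's internal ordering.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Toy tokens, invented for this example.
tokens = ["good", "bad", "good", "plot", "bad", "film", "good", "plot"]
counts = Counter(tokens)
print(top_words(counts, 3))  # ['good', 'bad', 'plot']
```

Sorting on the pair (-count, word) makes the selection deterministic: words with equal counts always come out in alphabetical order, so the same 2000 feature words are chosen on every run.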
