Note: the original code is at http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html

Output of the run:

Loading 20 newsgroups training set...
20 newsgroups dataset for document classification (http://people.csail.mit.edu/jrennie/20Newsgroups)
13180 documents
20 categories
Extracting features from the dataset using a sparse vectorizer
done in 139.231000s
n_samples: 13180, n_features: 130274
Loading 20 newsgroups test set...
done in 0.000000s
Predicting the labels of the test set...
5648 documents
20 categories
Extracting features from the dataset using the same vectorizer
done in 7.082000s
n_samples: 5648, n_features: 130274
Testbenching a linear classifier...
parameters: {'penalty': 'l2', 'loss': 'hinge', 'alpha': 1e-05, 'fit_intercept': True, 'n_iter': 50}
done in 22.012000s
Percentage of non zeros coef: 30.074190
Predicting the outcomes of the testing set
done in 0.172000s
Classification report on test set for classifier:
SGDClassifier(alpha=1e-05, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', n_iter=50, n_jobs=1,
penalty='l2', power_t=0.5, random_state=None, shuffle=True,
verbose=0, warm_start=False)
                          precision  recall  f1-score  support
alt.atheism                    0.95    0.93      0.94      245
comp.graphics                  0.85    0.91      0.88      298
comp.os.ms-windows.misc        0.88    0.88      0.88      292
comp.sys.ibm.pc.hardware       0.82    0.80      0.81      301
comp.sys.mac.hardware          0.90    0.92      0.91      256
comp.windows.x                 0.92    0.88      0.90      297
misc.forsale                   0.87    0.89      0.88      290
rec.autos                      0.93    0.94      0.94      324
rec.motorcycles                0.97    0.97      0.97      294
rec.sport.baseball             0.97    0.97      0.97      315
rec.sport.hockey               0.98    0.99      0.99      302
sci.crypt                      0.97    0.96      0.96      297
sci.electronics                0.87    0.89      0.88      313
sci.med                        0.97    0.97      0.97      277
sci.space                      0.97    0.97      0.97      305
soc.religion.christian         0.95    0.96      0.95      293
talk.politics.guns             0.94    0.94      0.94      246
talk.politics.mideast          0.97    0.99      0.98      296
talk.politics.misc             0.96    0.92      0.94      236
talk.religion.misc             0.89    0.84      0.86      171
avg / total                    0.93    0.93      0.93     5648
Confusion matrix:
[[227 0 0 0 0 0 0 0 0 0 0 1 2 1 1 1 0 1
0 11]
[ 0 271 3 8 2 5 2 0 0 1 0 0 3 1 1 0 0 1
0 0]
[ 0 7 256 14 5 6 1 0 0 0 0 0 2 0 1 0 0 0
0 0]
[ 1 8 12 240 9 3 12 2 0 0 0 1 12 0 0 1 0 0
0 0]
[ 0 1 3 6 235 2 4 0 0 0 0 1 3 0 1 0 0 0
0 0]
[ 0 17 9 4 0 260 0 0 1 1 0 0 2 0 2 0 1 0
0 0]
[ 0 1 3 7 3 0 257 7 2 0 0 1 8 0 1 0 0 0
0 0]
[ 0 0 0 2 1 0 5 305 2 3 0 0 4 1 0 0 1 0
0 0]
[ 0 0 0 0 1 0 3 3 285 0 0 0 1 0 0 1 0 0
0 0]
[ 0 0 0 0 0 0 3 2 0 305 2 1 1 0 0 0 0 0
1 0]
[ 0 0 0 0 0 0 1 0 1 0 300 0 0 0 0 0 0 0
0 0]
[ 0 0 1 1 0 2 0 1 0 0 0 284 0 1 1 0 2 2
1 1]
[ 0 2 2 10 2 2 6 5 1 0 1 1 279 1 1 0 0 0
0 0]
[ 0 3 0 0 1 1 1 0 0 0 0 0 0 269 0 1 1 0
0 0]
[ 0 5 0 0 1 0 0 0 0 0 2 0 1 0 295 0 0 0
1 0]
[ 1 1 1 0 0 1 0 1 0 0 0 0 0 1 1 282 1 0
0 3]
[ 0 0 1 0 0 0 0 0 1 3 0 0 1 0 0 1 232 1
5 1]
[ 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0 293
0 0]
[ 0 2 0 0 0 0 2 0 0 1 0 1 0 1 0 0 7 4
216 2]
[ 11 0 0 0 0 0 0 0 0 0 0 1 0 2 0 9 2 1
2 143]]
Testbenching a MultinomialNB classifier...
parameters: {'alpha': 0.01}
done in 0.608000s
Percentage of non zeros coef: 100.000000
Predicting the outcomes of the testing set
done in 0.203000s
Classification report on test set for classifier:
MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
                          precision  recall  f1-score  support
alt.atheism                    0.90    0.92      0.91      245
comp.graphics                  0.81    0.89      0.85      298
comp.os.ms-windows.misc        0.87    0.83      0.85      292
comp.sys.ibm.pc.hardware       0.82    0.83      0.83      301
comp.sys.mac.hardware          0.90    0.92      0.91      256
comp.windows.x                 0.90    0.89      0.89      297
misc.forsale                   0.90    0.84      0.87      290
rec.autos                      0.93    0.94      0.93      324
rec.motorcycles                0.98    0.97      0.97      294
rec.sport.baseball             0.97    0.97      0.97      315
rec.sport.hockey               0.97    0.99      0.98      302
sci.crypt                      0.95    0.95      0.95      297
sci.electronics                0.90    0.86      0.88      313
sci.med                        0.97    0.96      0.97      277
sci.space                      0.95    0.97      0.96      305
soc.religion.christian         0.91    0.97      0.94      293
talk.politics.guns             0.89    0.96      0.93      246
talk.politics.mideast          0.95    0.98      0.97      296
talk.politics.misc             0.93    0.87      0.90      236
talk.religion.misc             0.92    0.74      0.82      171
avg / total                    0.92    0.92      0.92     5648
Confusion matrix:
[[226 0 0 0 0 0 0 0 0 1 0 0 0 0 2 7 0 0
0 9]
[ 1 266 7 4 1 6 2 2 0 0 0 3 4 1 1 0 0 0
0 0]
[ 0 11 243 22 4 7 1 0 0 0 0 1 2 0 0 0 0 0
1 0]
[ 0 7 12 250 8 4 9 0 0 1 1 0 9 0 0 0 0 0
0 0]
[ 0 3 3 5 235 2 3 1 0 0 0 2 1 0 1 0 0 0
0 0]
[ 0 19 5 3 2 263 0 0 0 0 0 1 0 1 1 0 2 0
0 0]
[ 0 1 4 9 3 1 243 9 2 3 1 0 8 0 0 0 2 2
2 0]
[ 0 0 0 1 1 0 5 304 1 2 0 0 3 2 3 1 1 0
0 0]
[ 0 0 0 0 0 2 2 3 285 0 0 0 1 0 0 0 0 0
0 1]
[ 0 1 0 0 0 1 1 3 0 304 5 0 0 0 0 0 0 0
0 0]
[ 0 0 0 0 0 0 0 0 1 2 299 0 0 0 0 0 0 0
0 0]
[ 0 2 2 1 0 1 2 0 0 0 0 283 1 0 0 0 2 1
2 0]
[ 0 11 1 9 3 1 3 5 1 0 1 4 270 1 3 0 0 0
0 0]
[ 0 2 0 1 1 1 0 0 0 0 0 1 0 266 2 1 0 0
2 0]
[ 0 2 0 0 1 0 0 0 0 0 0 2 1 1 296 0 1 1
0 0]
[ 3 1 0 0 0 0 0 0 0 0 1 0 0 2 0 283 0 1
2 0]
[ 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 237 1
3 1]
[ 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 3 0 291
0 0]
[ 1 1 0 0 1 1 0 1 0 0 0 0 0 0 1 1 17 6
206 0]
[ 18 1 0 0 0 0 0 0 0 1 0 0 0 0 0 14 4 2
4 127]]

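The steps are:

I. Preprocessing

1. Load the training set.

2. Extract features from the training set with TfidfVectorizer, obtaining X_train and y_train.

3. Load the test set.

4. Extract features from the test set with the same TfidfVectorizer, obtaining X_test and y_test.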
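The feature extraction in steps 2 and 4 can be sketched as follows. This is a minimal illustration, not the original script: the documents are a hypothetical in-memory toy corpus standing in for the 20 newsgroups files, so the feature count will be far smaller than the 130274 shown in the output. The point is that the vectorizer is fitted on the training documents only and then reused on the test documents, which is why both matrices report the same n_features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the 20 newsgroups documents.
train_docs = [
    "the team won the hockey game",
    "the pitcher threw the baseball",
    "the rocket reached orbit in space",
    "nasa launched a new space probe",
]
test_docs = ["the hockey team lost the game", "a probe in orbit"]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training documents only.
X_train = vectorizer.fit_transform(train_docs)
# Reuse the fitted vocabulary so the test columns line up with the training columns.
X_test = vectorizer.transform(test_docs)

print("n_samples: %d, n_features: %d" % X_train.shape)
print("n_samples: %d, n_features: %d" % X_test.shape)
```

Both matrices are sparse, and words that appear only in the test documents are ignored because they are not in the fitted vocabulary.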

II. Define the benchmark for the classifiers

5. Train: clf = clf_class(**params).fit(X_train, y_train)

6. Predict: pred = clf.predict(X_test)

7. Print the classification report on the test set: print(classification_report(y_test, pred,target_names=news_test.target_names))

8. Compute the confusion matrix: cm = confusion_matrix(y_test, pred)
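To see how the classification report in step 7 relates to the confusion matrix in step 8, the short check below recomputes the alt.atheism row of the SGDClassifier report from the matrix printed in the output above. It relies on the scikit-learn convention that rows are true labels and columns are predicted labels; the two lists are the first row and first column of that matrix, copied by hand, and the variable names are only illustrative.

```python
# First row and first column of the SGDClassifier confusion matrix
# printed in the output above (class alt.atheism).
# Rows are true labels, columns are predicted labels.
row_true_atheism = [227, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 1, 1, 0, 1, 0, 11]
col_pred_atheism = [227, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 11]

true_positives = row_true_atheism[0]
support = sum(row_true_atheism)    # documents whose true label is alt.atheism
predicted = sum(col_pred_atheism)  # documents predicted as alt.atheism

recall = true_positives / support
precision = true_positives / predicted
f1 = 2 * precision * recall / (precision + recall)

print("support=%d precision=%.2f recall=%.2f f1=%.2f"
      % (support, precision, recall, f1))
# prints: support=245 precision=0.95 recall=0.93 f1=0.94
```

The result matches the alt.atheism line of the report. The 11 in the last position of both lists shows that most of the errors for this class are confusions with talk.religion.misc, in both directions.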

III. Training

9. Run the benchmark with two classifiers, SGDClassifier and MultinomialNB.
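Steps 5 to 9 can be put together in a sketch like the one below. It is an approximation of the flow described above, not the original example: the data is a hypothetical two-class toy corpus, target_names replaces news_test.target_names, and the SGDClassifier parameters use max_iter with tol=None because recent scikit-learn releases no longer accept the n_iter argument shown in the output. random_state is added only to make the run repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: label 0 = sport, label 1 = space.
train_docs = [
    "the team won the hockey game",
    "the pitcher threw the baseball",
    "the rocket reached orbit in space",
    "nasa launched a new space probe",
]
y_train = [0, 0, 1, 1]
test_docs = ["the hockey team lost the game", "a space probe in orbit"]
y_test = [0, 1]
target_names = ["sport", "space"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)


def benchmark(clf_class, params, name):
    """Fit one classifier, predict the test set, print the report and matrix."""
    print("Testbenching a %s classifier..." % name)
    print("parameters:", params)
    clf = clf_class(**params).fit(X_train, y_train)    # step 5: train
    pred = clf.predict(X_test)                         # step 6: predict
    print(classification_report(y_test, pred, target_names=target_names))  # step 7
    cm = confusion_matrix(y_test, pred)                # step 8
    print("Confusion matrix:")
    print(cm)
    return pred, cm


# Step 9: run the same benchmark with both classifiers.
sgd_params = {"loss": "hinge", "penalty": "l2", "alpha": 1e-5,
              "max_iter": 50, "tol": None, "random_state": 0}
sgd_pred, sgd_cm = benchmark(SGDClassifier, sgd_params, "linear")

nb_params = {"alpha": 0.01}
nb_pred, nb_cm = benchmark(MultinomialNB, nb_params, "MultinomialNB")
```

Passing the class and its parameters separately, as in clf_class(**params), lets one function time and report on any estimator that follows the fit/predict interface.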
