sohu_news: Sohu news category classification

Data acquisition

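The data comes from the news XML data released by Sohu News. After a series of processing steps it was converted into a single Excel file.
The XML file was processed separately with pandas; that step is omitted here.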

import numpy as np

import pandas as pd

Read the news text file and compute the length of each text

df=pd.read_excel('sohu_data.xlsx')

df['length']=df['content'].apply(lambda x: len(x)).values

Drop texts shorter than 50 characters

df_data = df[df['length']>=50][['content','category']]

Check the distribution of news categories; there are 9 classes

df_data['category'].value_counts()

# The classes are clearly imbalanced: the largest class (health) is about 17 times the size of the smallest (business).

health      30929

news        27613

auto        22841

stock       18152

it          13875

yule        13785

women        4667

book         4411

business     1769

Name: category, dtype: int64
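To put a number on the imbalance, here is a minimal sketch that recomputes the ratio from the counts printed above. It also derives per-class weights with the common "balanced" heuristic, n_samples / (n_classes * count). The weights are only an illustration of how the imbalance could be compensated; the original post does not use class weights.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "health": 30929, "news": 27613, "auto": 22841, "stock": 18152,
    "it": 13875, "yule": 13785, "women": 4667, "book": 4411, "business": 1769,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

# "Balanced" heuristic (an assumption, not used in the post):
# weight_c = n_samples / (n_classes * count_c), so rare classes weigh more.
class_weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

print(f"total samples: {n_samples}")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
for name, weight in sorted(class_weights.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {weight:.2f}")
```

With these counts the rarest class, business, receives the largest weight and health the smallest.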

Encode the class labels with LabelEncoder from sklearn's preprocessing module

from sklearn.preprocessing import LabelEncoder

class_le=LabelEncoder()

# Note: the encoder is fit on df (all rows), not on the filtered df_data
class_le.fit(np.unique(df['category'].values))

print(list(class_le.classes_))

y=class_le.transform(df['category'].values)

# Look at the classes of the first 20 news samples

y[:20]

['auto', 'book', 'business', 'health', 'it', 'news', 'stock', 'women', 'yule']

array([7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7], dtype=int64)
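The integer codes are easier to read once you see how they are assigned. The sketch below reproduces the mapping in plain Python, assuming only that the encoder sorts the unique labels and numbers them from 0, which matches the classes_ list printed above. The helper names are made up for illustration.

```python
# The nine category names from the value_counts() output, in arbitrary order.
categories = ["health", "news", "auto", "stock", "it",
              "yule", "women", "book", "business"]

# Sort the unique labels and number them from 0, as in the printed classes_ list.
classes = sorted(set(categories))
to_id = {name: i for i, name in enumerate(classes)}

def encode(labels):
    """Map category names to integer ids (hypothetical helper)."""
    return [to_id[name] for name in labels]

def decode(ids):
    """Map integer ids back to category names (hypothetical helper)."""
    return [classes[i] for i in ids]

print(classes)
print(encode(["women", "business"]))
print(decode([7, 2]))
```

So the run of 7s in the first 20 samples means those rows all belong to the women category, and index 2 is business.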

Import jieba and segment the text into words

import jieba

def chinese_word_cut(mytext):

    return " ".join(jieba.cut(mytext))

X=pd.DataFrame()

X['cut_content']=df["content"].apply(chinese_word_cut)

X['cut_content'].head()

Building prefix dict from the default dictionary ...

Loading model from cache C:\Users\HUANG_~1\AppData\Local\Temp\jieba.cache

Loading model cost 0.966 seconds.

Prefix dict has been built succesfully.

1    产品名称 ：  规格 及 价格 ： ３ ０ ｍ ｌ ／ ３ ０ ０ 　 元  羽西 当归...

2    常见问题  Ｑ ： 为什么 我 提交 不了 试用 申请 　 Ａ ： 试用 申请 必须 同时...

3    产品名称 ： 肌醇 （ Ｐ ｕ ｒ ｅ 　 Ｓ ｋ ｉ ｎ ） 深层 卸妆 凝胶  规格 ...

4    欧诗漫 的 试用装 终于 延期 而 至 ， 果然 不负 所望 ， 包装 很 精美 。 从 快...

5    试用 申请 步骤  １ 注册 并 完善 个人资料 　 登入 搜狐 试用 频道 ， 填写 并...

Name: cut_content, dtype: object
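The segmented output above contains full-width digits and letters (３ ０ ｍ ｌ) that jieba splits into single characters. One optional preprocessing step, not part of the original pipeline, is to fold them to their ASCII forms with Unicode NFKC normalization before segmentation. A minimal sketch, where normalize_width is a hypothetical helper name:

```python
import unicodedata

def normalize_width(text: str) -> str:
    """Fold full-width letters, digits, and punctuation to their ASCII forms."""
    return unicodedata.normalize("NFKC", text)

sample = "规格及价格：３０ｍｌ／３００　元"
print(normalize_width(sample))
```

Note that NFKC also converts full-width punctuation such as "：" and the ideographic space, so a stop-word list containing the full-width forms would need the same treatment.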

Build bag-of-words features, removing stop words and numeric tokens, then classify with naive Bayes

from sklearn.model_selection import  train_test_split

X_train,X_test,y_train,y_test= train_test_split(X,y,random_state=42,test_size=0.25)

def get_custom_stopwords(stop_words_file):

    with open(stop_words_file,encoding="utf-8") as f:

        custom_stopwords_list=[i.strip() for i in f.readlines()]

    return custom_stopwords_list

stop_words_file = "stopwords.txt"

stopwords = get_custom_stopwords(stop_words_file) # load the stop words

from sklearn.feature_extraction.text import  CountVectorizer

# stop_words is documented as a list, so pass the list directly instead of a frozenset
vect = CountVectorizer(token_pattern=r'(?u)\b[^\d\W]\w+\b',stop_words=stopwords)
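The token_pattern is what removes the numeric features. A quick way to see its effect is to run the same regular expression with re.findall on a space-joined string like the ones jieba produces. The sample string below is made up for illustration.

```python
import re

# Same pattern as the vectorizer: a token must start with a non-digit
# word character and be at least two characters long.
TOKEN_PATTERN = r"(?u)\b[^\d\W]\w+\b"

segmented = "产品 价格 300 元 30ml 试用 申请 a1"
tokens = re.findall(TOKEN_PATTERN, segmented)
print(tokens)  # ['产品', '价格', '试用', '申请', 'a1']
```

Pure numbers ("300"), tokens that start with a digit ("30ml"), and single-character tokens ("元") are all dropped, while a token such as "a1" survives because only the first character is checked for digits.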

from sklearn.naive_bayes import MultinomialNB

nb=MultinomialNB()
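For readers who want to see what a multinomial naive Bayes classifier computes, here is a small pure-Python sketch with Laplace smoothing (alpha=1.0). It is an illustration of the technique on toy data, not scikit-learn's implementation; the function names and the toy documents are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: list of class names."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    model = {}
    for label, n_docs in class_docs.items():
        # Smoothed denominator: tokens in the class plus alpha per vocabulary word.
        denom = sum(token_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_lik": {t: math.log((token_counts[label][t] + alpha) / denom)
                        for t in vocab},
        }
    return model

def predict(model, doc):
    """Return the class with the highest log posterior for a token list."""
    scores = {}
    for label, params in model.items():
        # Tokens never seen in training are skipped.
        scores[label] = params["log_prior"] + sum(
            params["log_lik"][t] for t in doc if t in params["log_lik"])
    return max(scores, key=scores.get)

# Toy training data (made up for illustration).
docs = [["股票", "上涨", "基金"], ["基金", "股票", "下跌"],
        ["汽车", "发动机", "油耗"], ["汽车", "试驾", "发动机"]]
labels = ["stock", "stock", "auto", "auto"]

model = train_multinomial_nb(docs, labels)
print(predict(model, ["股票", "基金"]))
print(predict(model, ["发动机", "试驾"]))
```

Each class score is the log prior plus the sum of smoothed log likelihoods of the tokens in the document, which is why word counts from the bag-of-words step are all the model needs.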

from sklearn.pipeline import make_pipeline

pipe=make_pipeline(vect,nb)

pipe.fit(X_train.cut_content, y_train)

y_pred = pipe.predict(X_test.cut_content)

from sklearn import  metrics

print(metrics.accuracy_score(y_test,y_pred))

metrics.confusion_matrix(y_test,y_pred)

0.897216216938

array([[6266,  163,    2,  249,    5,  345,   66,   74,   53],

       [   5, 1118,    0,    0,    0,   31,    2,    5,   37],

       [   8,    4,   15,    0,    0,  104,  329,    5,    3],

       [   4,    1,    0, 8230,    0,   64,    6,    1,    0],

       [  59,   29,    0,   10, 3672,   66,   29,   26,   45],

       [  72,   71,    6,   26,    1, 5683,  756,   60,  193],

       [  28,    0,   10,    0,    0,  381, 4275,    0,    2],

       [   9,   90,    0,    5,    1,   74,    5,  890,  132],

       [   2,   38,    1,    2,    0,   44,    1,   11, 3467]], dtype=int64)

Naive Bayes reaches close to 90% accuracy on the test set.
Evaluated on the training set, its accuracy is about 91%.
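Overall accuracy hides how unevenly the classes are handled. The sketch below takes the naive Bayes confusion matrix printed above (rows are true classes, columns are predictions, in the order of class_le.classes_) and recomputes the accuracy together with the recall of each class.

```python
labels = ["auto", "book", "business", "health", "it",
          "news", "stock", "women", "yule"]

# Naive Bayes confusion matrix copied from the output above.
cm = [
    [6266, 163, 2, 249, 5, 345, 66, 74, 53],
    [5, 1118, 0, 0, 0, 31, 2, 5, 37],
    [8, 4, 15, 0, 0, 104, 329, 5, 3],
    [4, 1, 0, 8230, 0, 64, 6, 1, 0],
    [59, 29, 0, 10, 3672, 66, 29, 26, 45],
    [72, 71, 6, 26, 1, 5683, 756, 60, 193],
    [28, 0, 10, 0, 0, 381, 4275, 0, 2],
    [9, 90, 0, 5, 1, 74, 5, 890, 132],
    [2, 38, 1, 2, 0, 44, 1, 11, 3467],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

# Recall per class: correct predictions divided by the row total.
recall = {labels[i]: cm[i][i] / sum(cm[i]) for i in range(len(cm))}

print(f"accuracy: {accuracy:.6f}")
for name, value in sorted(recall.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {value:.3f}")
```

The recomputed accuracy matches the printed score, and the business row stands out: only 15 of its 468 test samples are classified correctly, with 329 of them predicted as stock.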

y_pred = pipe.predict(X_train.cut_content)

from sklearn import  metrics

print(metrics.accuracy_score(y_train,y_pred))

0.913158362989

from sklearn.linear_model import LogisticRegression

Next, fit a logistic regression model and predict on the test set; test accuracy reaches about 94%

lr=LogisticRegression()

from sklearn.pipeline import make_pipeline

pipe=make_pipeline(vect,lr)

pipe.fit(X_train.cut_content, y_train)

y_pred = pipe.predict(X_test.cut_content)

from sklearn import  metrics

print(metrics.accuracy_score(y_test,y_pred))

metrics.confusion_matrix(y_test,y_pred)

0.944644620599

array([[7079,    3,    3,    5,   19,   62,   27,   10,   15],

       [  43, 1131,    1,    0,    3,    3,    4,    6,    7],

       [  16,    0,   36,    1,    1,  106,  296,    7,    5],

       [   7,    0,    0, 8298,    0,    1,    0,    0,    0],

       [  48,    1,    0,    0, 3870,    9,    2,    1,    5],

       [  60,   12,   22,   14,    9, 6453,  218,   35,   45],

       [  36,    1,  140,    0,    7,  415, 4090,    3,    4],

       [  48,   28,    1,    1,   10,   54,    6, 1008,   50],

       [  44,   12,    0,    1,   10,   38,    4,   29, 3428]], dtype=int64)

from sklearn.tree import DecisionTreeClassifier

tree=DecisionTreeClassifier(criterion='entropy',random_state=1)

from sklearn.ensemble import BaggingClassifier

# Note: scikit-learn 1.2 renamed base_estimator to estimator (the old name was removed in 1.4)
bag=BaggingClassifier(base_estimator=tree,

                     n_estimators=10,

                     max_samples=1.0,

                     max_features=1.0,

                     bootstrap=True,

                     bootstrap_features=False,

                     n_jobs=4,random_state=1)

pipe=make_pipeline(vect,bag)

pipe.fit(X_train.cut_content, y_train)

y_pred = pipe.predict(X_test.cut_content)

metrics.accuracy_score(y_test,y_pred)

Ensembling decision trees with bagging gives an accuracy about 1.5 percentage points lower than logistic regression

0.9294045426642111
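The two ingredients of bagging are bootstrap resampling and combining the predictions of the individual trees. With bootstrap=True and max_samples=1.0, each of the 10 trees is trained on a sample of the same size as the training set, drawn with replacement. The sketch below shows both steps in plain Python. It uses a hard majority vote as a simplification (scikit-learn averages predicted probabilities when the base estimator provides them), and the helper names and vote values are made up.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(predictions):
    """Return the most frequent prediction among the base estimators."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(1)
n_samples = 8       # toy training-set size
n_estimators = 10   # same number of trees as in the post

# One bootstrap sample per tree; some rows repeat, others are left out.
samples = [bootstrap_indices(n_samples, rng) for _ in range(n_estimators)]
for indices in samples[:3]:
    print(sorted(indices))

# Hypothetical predictions of five trees for a single test document.
votes = ["stock", "business", "stock", "news", "stock"]
print(majority_vote(votes))
```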
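Analyzing the confusion matrices shows that the third class, business, is misclassified most often, mostly as stock (329 of its 468 test samples under naive Bayes, 296 under logistic regression). Possible improvements to the model:
- Use TF-IDF for text features
- Combine word2vec with deep learning and test the classification performance
- LSTM
- Embeddings
- Other NLP methods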
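As a starting point for the first item in the list, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is the default behavior documented for scikit-learn's TfidfVectorizer. The toy documents are made up, and this is an illustration rather than a drop-in replacement for the vectorizer.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns (idf, list of L2-normalized weight dicts)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in doc_freq.items()}

    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        vec = {t: count * idf[t] for t, count in term_freq.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        weighted.append({t: v / norm for t, v in vec.items()})
    return idf, weighted

# Toy segmented documents (made up for illustration).
docs = [["公司", "股票", "上涨"],
        ["公司", "企业", "管理"],
        ["公司", "股票", "基金"]]

idf, weighted = tfidf(docs)
for token, value in sorted(idf.items(), key=lambda kv: kv[1]):
    print(token, round(value, 3))
```

A word that appears in every document ("公司") gets the minimum weight of 1.0, while rarer words are weighted up. That may help separate business from stock, whose vocabularies overlap heavily, though it would need to be tested.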
