Meituan shop review text processing and classification (TF-IDF, SVM, decision tree, random forest, KNN, ensemble)

SVM classification
SVM grid search
K-nearest neighbors
Decision tree
Random forest
Bagging

import pandas as pd

import numpy as np

import  matplotlib.pyplot as  plt

import time

df=pd.read_excel("all_data_meituan.xlsx")[["comment","star"]]

df.head()


	comment	star
0	还行吧，建议不要排队那个烤鸭和羊肉串，因为烤肉时间本来就不够，排那个要半小时，然后再回来吃烤...	40
1	去过好几次了东西还是老样子没增添什么新花样环境倒是挺不错离我们这也挺近味道还可以 ...	40
2	一个字：好！！！ #羊肉串# #五花肉# #牛舌# #很好吃# #鸡软骨# #拌菜# #抄河...	50
3	第一次来吃，之前看过好多推荐说这个好吃，真的抱了好大希望，排队的人挺多的，想吃得趁早来啊。还...	20
4	羊肉串真的不太好吃，那种说膻不膻说臭不臭的味。烤鸭还行，大虾没少吃，也就到那吃大虾了，吃完了...	30

df.shape

(17400, 2)

df['sentiment']=df['star'].apply(lambda x:1 if x>30 else 0)

df=df.drop_duplicates()  # drop duplicate reviews

df=df.dropna()

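# Each review is concatenated three times here, so the random split further down
# can place copies of the same review in both the training and the test set,
# which tends to inflate the test accuracy.
X=pd.concat([df[['comment']],df[['comment']],df[['comment']]])

y=pd.concat([df.sentiment,df.sentiment,df.sentiment])

X.columns=['comment']

X=X.reset_index(drop=True)  # renumber rows 0..n-1 after the concat

y=y.reset_index(drop=True)  # keep the labels on the same index as X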

X.shape

(3138, 1)
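The shape (3138, 1) is three times the size of the de-duplicated frame, because every review was concatenated three times. One side effect worth checking is how often a random split then puts copies of the same review on both sides. The sketch below uses only the standard library and placeholder strings: the figure of 1046 unique reviews is inferred from 3138 / 3, and `split_indices` is a hypothetical stand-in for `train_test_split`, so it illustrates the effect rather than reproducing the exact split.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row positions and cut off a test share (hypothetical helper)."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = round(n_rows * test_size)
    return positions[n_test:], positions[:n_test]

unique_reviews = [f"review {i}" for i in range(1046)]  # assumed: 3138 / 3
tripled = unique_reviews * 3                            # mirrors pd.concat of three copies

train_pos, test_pos = split_indices(len(tripled), test_size=0.25, seed=42)
train_texts = {tripled[i] for i in train_pos}
test_texts = {tripled[i] for i in test_pos}

# Distinct test reviews that also have at least one copy in the training set
leaked = test_texts & train_texts
leak_share = len(leaked) / len(test_texts)
print(f"{leak_share:.0%} of distinct test reviews also appear in training")
```

If the aim of the tripling was only to enlarge the data, splitting first and duplicating the training part afterwards would avoid this overlap.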

import jieba

def chinese_word_cut(mytext):

    return " ".join(jieba.cut(mytext))

X['cut_comment']=X["comment"].apply(chinese_word_cut)

X['cut_comment'].head()

Building prefix dict from the default dictionary ...

Loading model from cache C:\Users\FRED-H~1\AppData\Local\Temp\jieba.cache

Loading model cost 0.651 seconds.

Prefix dict has been built succesfully.

0    还行 吧 ， 建议 不要 排队 那个 烤鸭 和 羊肉串 ， 因为 烤肉 时间 本来 就 不够...

1    去过 好 几次 了   东西 还是 老 样子   没 增添 什么 新花样   环境 倒 是 ...

2    一个 字 ： 好 ！ ！ ！   # 羊肉串 #   # 五花肉 #   # 牛舌 #   ...

3    第一次 来 吃 ， 之前 看过 好多 推荐 说 这个 好吃 ， 真的 抱 了 好 大 希望 ...

4    羊肉串 真的 不太 好吃 ， 那种 说 膻 不 膻 说 臭 不 臭 的 味 。 烤鸭 还 行...

Name: cut_comment, dtype: object

from sklearn.model_selection import  train_test_split

X_train,X_test,y_train,y_test= train_test_split(X,y,random_state=42,test_size=0.25)

def get_custom_stopwords(stop_words_file):

    with open(stop_words_file,encoding="utf-8") as f:

        custom_stopwords_list=[i.strip() for i in f.readlines()]

    return custom_stopwords_list

stop_words_file = "stopwords.txt"

stopwords = get_custom_stopwords(stop_words_file)

stopwords[-10:]

['100', '01', '02', '03', '04', '05', '06', '07', '08', '09']

from sklearn.feature_extraction.text import  CountVectorizer

vect=CountVectorizer()

vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',

        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',

        lowercase=True, max_df=1.0, max_features=None, min_df=1,

        ngram_range=(1, 1), preprocessor=None, stop_words=None,

        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',

        tokenizer=None, vocabulary=None)

vect.fit_transform(X_train["cut_comment"])

<2353x1965 sparse matrix of type '<class 'numpy.int64'>'

	with 20491 stored elements in Compressed Sparse Row format>

vect.fit_transform(X_train["cut_comment"]).toarray().shape

(2353, 1965)
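The `token_pattern` printed in the repr above decides which pieces of the jieba-segmented string become features. As a rough illustration, the default pattern and the stricter pattern that the post passes to `CountVectorizer` in the next cell can be tried directly with the standard `re` module. The input string is made up for this example.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"       # CountVectorizer default, as printed in the repr
NO_DIGIT_START = r"(?u)\b[^\d\W]\w+\b"   # pattern used with the stop-word vectorizer

segmented = "还行 吧 ， 建议 不要 排队 100 2人 ktv"  # made-up, jieba-style spacing

# The default keeps every run of two or more word characters,
# so single-character words and punctuation disappear.
default_tokens = re.findall(DEFAULT_PATTERN, segmented)

# The stricter pattern additionally rejects tokens whose first character is a digit.
filtered_tokens = re.findall(NO_DIGIT_START, segmented)

print(default_tokens)
print(filtered_tokens)
```

Note that both patterns discard single-character words such as 好, which can carry sentiment in short reviews.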

# pd.DataFrame(vect.fit_transform(X_train["cut_comment"]).toarray(),columns=vect.get_feature_names()).iloc[:10,:22]

# print(vect.get_feature_names())

# The feature dimension is 1965, which is not very large (no stop words applied yet)

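vect = CountVectorizer(token_pattern=u'(?u)\\b[^\\d\\W]\\w+\\b',stop_words=frozenset(stopwords)) # remove stop words

# On scikit-learn >= 1.0 use vect.get_feature_names_out(); get_feature_names() was removed in 1.2
pd.DataFrame(vect.fit_transform(X_train['cut_comment']).toarray(), columns=vect.get_feature_names()).head()

# 1691 columns; dropping tokens that start with a digit removed three columns, bringing the total to 1691

# max_df = 0.8 # drop terms that appear in more than this share of documents (too common)

# min_df = 3 # drop terms that appear in fewer than this many documents (too rare)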


	amazing	happy	ktv	pm2	一万个	一个多	一个月	一串	一人	一件	...	麻烦	麻酱	黄喉	黄桃	黄花鱼	黄金	黑乎乎	黑椒	黑胡椒	齐全
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 1691 columns
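The `max_df` and `min_df` options mentioned in the comments above are left unused in this post. As a minimal sketch of what they would do, the function below filters a toy corpus by document frequency in plain Python. The function name and the toy documents are invented for illustration; `CountVectorizer` applies the same kind of bounds internally when these options are set.

```python
from collections import Counter

def filter_by_document_frequency(docs, max_df=0.8, min_df=3):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    # Count each term once per document, however often it repeats inside it
    doc_freq = Counter(term for doc in docs for term in set(doc))
    upper = max_df * n_docs  # max_df given as a share of all documents
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= upper)

docs = [
    ["tasty", "ambience", "good"],
    ["tasty", "queue", "slow"],
    ["tasty", "ambience", "average"],
    ["tasty", "service", "good"],
    ["tasty", "ambience", "queue"],
]
kept = filter_by_document_frequency(docs, max_df=0.8, min_df=3)
print(kept)
```

Here "tasty" occurs in all five documents and is dropped as too common, while most other words occur in fewer than three documents and are dropped as too rare.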

from sklearn.pipeline import make_pipeline

from sklearn.svm import SVC

from sklearn import  metrics

svc_cl=SVC()

pipe=make_pipeline(vect,svc_cl)

pipe.fit(X_train.cut_comment, y_train)

Pipeline(memory=None,

     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',

        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',

        lowercase=True, max_df=1.0, max_features=None, min_df=1,

        ngram_range=(1, 1), preprocessor=None,

        stop_words=...,

  max_iter=-1, probability=False, random_state=None, shrinking=True,

  tol=0.001, verbose=False))])

y_pred = pipe.predict(X_test.cut_comment)

metrics.accuracy_score(y_test,y_pred)

0.6318471337579618

metrics.confusion_matrix(y_test,y_pred)

array([[  0, 289],

       [  0, 496]], dtype=int64)
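The confusion matrix above has an empty first column, which means the default `SVC` predicted class 1 for every test review. The short calculation below, using only the numbers printed above, shows that the 0.63 accuracy is then just the share of positive reviews in the test set.

```python
# Rows are true classes, columns are predicted classes (scikit-learn's layout)
confusion = [[0, 289],   # true 0: predicted 0, predicted 1
             [0, 496]]   # true 1: predicted 0, predicted 1

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total

# Accuracy of a trivial model that always answers with the larger class
majority_share = max(sum(row) for row in confusion) / total

# Fraction of truly negative reviews that were recognised as negative
negative_recall = confusion[0][0] / sum(confusion[0])

print(accuracy, majority_share, negative_recall)
```

Since accuracy equals the majority share and the negative recall is zero, this first model has learned nothing useful yet, which motivates the parameter search that follows.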

SVM classification

from sklearn.svm import SVC

svc_cl=SVC() # instantiate the classifier

pipe=make_pipeline(vect,svc_cl)

pipe.fit(X_train.cut_comment, y_train)

y_pred = pipe.predict(X_test.cut_comment)

metrics.accuracy_score(y_test,y_pred)

0.6318471337579618

SVM grid search

from sklearn.model_selection import GridSearchCV

from sklearn.svm import SVC

from sklearn.pipeline import  Pipeline

# svc=SVC(random_state=1)

from sklearn.linear_model import SGDClassifier

from sklearn.feature_extraction.text import TfidfTransformer

tfidf=TfidfTransformer()

# ('tfidf',

#                       TfidfTransformer()),

#                      ('clf',

#                       SGDClassifier(max_iter=1000)),

# svc=SGDClassifier(max_iter=1000)

svc=SVC()

# pipe=make_pipeline(vect,SVC)

pipe_svc=Pipeline([("scl",vect),('tfidf',tfidf),("clf",svc)])

para_range=[0.0001,0.001,0.01,0.1,1.0,10,100,1000]

para_grid=[

    {'clf__C':para_range,

    'clf__kernel':['linear']},

    {'clf__gamma':para_range,

    'clf__kernel':['rbf']}

]

gs=GridSearchCV(estimator=pipe_svc,param_grid=para_grid,cv=10,n_jobs=-1)

gs.fit(X_train.cut_comment,y_train)

GridSearchCV(cv=10, error_score='raise',

       estimator=Pipeline(memory=None,

     steps=[('scl', CountVectorizer(analyzer='word', binary=False, decode_error='strict',

        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',

        lowercase=True, max_df=1.0, max_features=None, min_df=1,

        ngram_range=(1, 1), preprocessor=None,

        stop_words=frozenset({'...,

  max_iter=-1, probability=False, random_state=None, shrinking=True,

  tol=0.001, verbose=False))]),

       fit_params=None, iid=True, n_jobs=-1,

       param_grid=[{'clf__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000], 'clf__kernel': ['linear']}, {'clf__gamma': [0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000], 'clf__kernel': ['rbf']}],

       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',

       scoring=None, verbose=0)

gs.best_estimator_.fit(X_train.cut_comment,y_train)

Pipeline(memory=None,

     steps=[('scl', CountVectorizer(analyzer='word', binary=False, decode_error='strict',

        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',

        lowercase=True, max_df=1.0, max_features=None, min_df=1,

        ngram_range=(1, 1), preprocessor=None,

        stop_words=frozenset({'...,

  max_iter=-1, probability=False, random_state=None, shrinking=True,

  tol=0.001, verbose=False))])

y_pred = gs.best_estimator_.predict(X_test.cut_comment)

metrics.accuracy_score(y_test,y_pred)

0.9503184713375796
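To get a feel for the cost of the search above, the grid can be expanded by hand. The sketch below copies `para_range` and `para_grid` from the post and counts the candidates in plain Python; `expand_grid` is a hypothetical helper written for this illustration (scikit-learn offers `ParameterGrid` for the same purpose).

```python
from itertools import product

para_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000]
para_grid = [
    {"clf__C": para_range, "clf__kernel": ["linear"]},
    {"clf__gamma": para_range, "clf__kernel": ["rbf"]},
]

def expand_grid(grid):
    """List every parameter combination; each dict is expanded on its own."""
    candidates = []
    for sub_grid in grid:
        names = sorted(sub_grid)
        for values in product(*(sub_grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates

candidates = expand_grid(para_grid)
cv_folds = 10
n_fits = len(candidates) * cv_folds + 1  # +1 for the final refit (refit=True)
print(len(candidates), n_fits)
```

Two details follow from this layout. In the RBF sub-grid only `gamma` varies, so `C` stays at its default there. And because `refit=True`, `gs.best_estimator_` has already been fitted on the full training data, so fitting it again before predicting is not required.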

K-nearest neighbors

from sklearn.neighbors import  KNeighborsClassifier

knn=KNeighborsClassifier(n_neighbors=5,p=2,metric='minkowski')

pipe=make_pipeline(vect,knn)

pipe.fit(X_train.cut_comment, y_train)

Pipeline(memory=None,

     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',

        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',

        lowercase=True, max_df=1.0, max_features=None, min_df=1,

        ngram_range=(1, 1), preprocessor=None,

        stop_words=...owski',

           metric_params=None, n_jobs=1, n_neighbors=5, p=2,

           weights='uniform'))])

y_pred = pipe.predict(X_test.cut_comment)

metrics.accuracy_score(y_test,y_pred)

0.7070063694267515

metrics.confusion_matrix(y_test,y_pred)

array([[ 87, 202],

       [ 28, 468]], dtype=int64)
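With `metric='minkowski'` and `p=2`, the classifier above measures plain Euclidean distance between word-count vectors and lets the nearest neighbours vote with equal weight. A minimal from-scratch sketch of that idea follows; the toy vectors and the `knn_predict` helper are invented for illustration and are not how scikit-learn implements the search internally.

```python
from collections import Counter
from math import dist

def knn_predict(train_vectors, train_labels, query, k=5):
    """Majority vote among the k training vectors closest in Euclidean distance."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: dist(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy count vectors over a hypothetical vocabulary ["tasty", "bad", "queue"]
train_vectors = [[2, 0, 0], [1, 0, 1], [1, 0, 0],
                 [0, 2, 0], [0, 1, 1], [0, 1, 0]]
train_labels = [1, 1, 1, 0, 0, 0]

positive = knn_predict(train_vectors, train_labels, [1, 0, 0], k=3)
negative = knn_predict(train_vectors, train_labels, [0, 1, 0], k=3)
print(positive, negative)
```

On real review data the count vectors are long and sparse, so many reviews end up at similar distances from each other, which may be one reason KNN trails the tree-based models here.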

Decision tree

from sklearn.tree import DecisionTreeClassifier

tree=DecisionTreeClassifier(criterion='entropy',random_state=1)

pipe=make_pipeline(vect,tree)

pipe.fit(X_train.cut_comment, y_train)

y_pred = pipe.predict(X_test.cut_comment)

metrics.accuracy_score(y_test,y_pred)

0.9388535031847134

metrics.confusion_matrix(y_test,y_pred)

array([[256,  33],

       [ 15, 481]], dtype=int64)
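The tree above is built with `criterion='entropy'`, meaning each split is chosen to reduce the entropy of the class labels as much as possible. The following sketch computes that reduction (information gain) for a boolean "word present" feature on made-up labels. The helper names are illustrative, and a real tree also searches thresholds over count values.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(labels, feature_present):
    """Entropy reduction obtained by splitting on a boolean feature."""
    left = [y for y, f in zip(labels, feature_present) if f]
    right = [y for y, f in zip(labels, feature_present) if not f]
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted

labels = [1, 1, 1, 1, 0, 0, 0, 0]
separating_word = [True, True, True, True, False, False, False, False]
uninformative_word = [True, False, True, False, True, False, True, False]

gain_separating = information_gain(labels, separating_word)
gain_uninformative = information_gain(labels, uninformative_word)
print(gain_separating, gain_uninformative)
```

A word that separates the classes perfectly yields a gain of one bit, while a word spread evenly over both classes yields none, so the tree would split on the first.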

Random forest



from sklearn.ensemble import RandomForestClassifier

forest=RandomForestClassifier(criterion='entropy',random_state=1,n_jobs=2)

pipe=make_pipeline(vect,forest)

pipe.fit(X_train.cut_comment, y_train)

y_pred = pipe.predict(X_test.cut_comment)

metrics.accuracy_score(y_test,y_pred)

# Adding TF-IDF actually lowers the accuracy from 96.5% to 95.0%

0.9656050955414013

metrics.confusion_matrix(y_test,y_pred)

array([[265,  24],

       [  3, 493]], dtype=int64)
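The comment above compares results with and without the TF-IDF step. As a reminder of what that step does to the count matrix, here is a small from-scratch sketch of the weighting that `TfidfTransformer` applies with its default settings (`smooth_idf=True`, `norm='l2'`). The toy count matrix is invented, and the function is an illustration rather than the library code.

```python
from math import log, sqrt

def tfidf_rows(count_rows):
    """TF-IDF with idf = ln((1 + n) / (1 + df)) + 1 and L2-normalised rows."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents each term occurs
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    weighted_rows = []
    for row in count_rows:
        weighted = [count * w for count, w in zip(row, idf)]
        norm = sqrt(sum(v * v for v in weighted)) or 1.0  # guard against empty rows
        weighted_rows.append([v / norm for v in weighted])
    return weighted_rows

# Three documents, three terms; term 0 occurs in every document
counts = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
weights = tfidf_rows(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

Terms that occur everywhere are weighted down relative to rarer ones, and every row is scaled to unit length. Tree-based models split on one feature at a time and are less sensitive to such rescaling than distance- or margin-based models, which may be part of why the effect differs between classifiers.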

Bagging

from sklearn.ensemble import BaggingClassifier

from sklearn.tree import DecisionTreeClassifier

tree=DecisionTreeClassifier(criterion='entropy',random_state=1)

bag=BaggingClassifier(base_estimator=tree,

                     n_estimators=10,

                     max_samples=1.0,

                     max_features=1.0,

                     bootstrap=True,

                     bootstrap_features=False,

                     n_jobs=1,random_state=1)

pipe=make_pipeline(vect,tfidf,bag)

pipe.fit(X_train.cut_comment, y_train)

y_pred = pipe.predict(X_test.cut_comment)

metrics.accuracy_score(y_test,y_pred)  # without the TF-IDF transform: 93.2%; with it, accuracy rises to 95.5%

0.9554140127388535

metrics.confusion_matrix(y_test,y_pred)

array([[260,  29],

       [  6, 490]], dtype=int64)
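Bagging, as configured above with `bootstrap=True` and `n_estimators=10`, trains each base model on a sample drawn with replacement from the training data and then combines the predictions. The sketch below shows that mechanism in plain Python with a very simple base learner. The one-feature stump, the toy data, and all function names are assumptions made for illustration. scikit-learn's `BaggingClassifier` averages predicted class probabilities when the base estimator provides them, so the hard majority vote here is a simplification.

```python
import random
from collections import Counter

def fit_stump(rows, labels):
    """Pick the single binary feature (and polarity) that classifies best."""
    best = None
    for j in range(len(rows[0])):
        for polarity in (1, 0):  # polarity 1: feature present -> class 1
            correct = sum((polarity if row[j] else 1 - polarity) == y
                          for row, y in zip(rows, labels))
            if best is None or correct > best[0]:
                best = (correct, j, polarity)
    _, feature, polarity = best
    return lambda row: polarity if row[feature] else 1 - polarity

def fit_bagging(rows, labels, n_estimators=10, seed=1):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        picks = [rng.randrange(len(rows)) for _ in rows]
        models.append(fit_stump([rows[i] for i in picks],
                                [labels[i] for i in picks]))
    return models

def predict_bagging(models, row):
    """Majority vote over the individual predictions."""
    return Counter(model(row) for model in models).most_common(1)[0][0]

# Toy features: [contains a praising word, contains a complaining word]
rows = [[1, 0]] * 4 + [[0, 1]] * 4
labels = [1, 1, 1, 1, 0, 0, 0, 0]

models = fit_bagging(rows, labels)
print(predict_bagging(models, [1, 0]), predict_bagging(models, [0, 1]))
```

With a full decision tree as the base model, the individual trees differ more from one bootstrap sample to the next, and averaging them is what reduces variance compared with a single tree.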
