Chinese Sentiment Analysis with GloVe + LSTM

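I recently tried my hand at Chinese sentiment analysis.

The approach mainly uses GloVe and an LSTM. The dataset is a corpus of Chinese hotel reviews.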

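1. First, train GloVe to obtain word vectors (300-dimensional here). This step uses jieba for word segmentation and Chinese Wikipedia as the training corpus.

2. Clean the Chinese hotel review corpus and segment it into words, then convert the segmented words to their word-vector representation.

3. Train an LSTM network on the result.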
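
Step 2 can be sketched on toy data as follows. This is a minimal illustration rather than the full script: the three-dimensional vectors, the tiny vocabulary, and the helper names `parse_glove_text`, `pad_tokens`, and `tokens_to_vectors` are made up for the example, while the real pipeline uses 300-dimensional vectors and jieba tokens. It assumes the GloVe vectors are stored as plain text, one word per line followed by its components, and that words missing from the vocabulary map to a zero vector.

```python
import io

DIM = 3  # toy dimension; the post uses 300


def parse_glove_text(handle):
    """Read 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) != DIM + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def pad_tokens(tokens, length, pad="<pad>"):
    """Return a new list with exactly `length` tokens (truncate or right-pad)."""
    return list(tokens[:length]) + [pad] * max(0, length - len(tokens))


def tokens_to_vectors(tokens, vectors):
    """Look up each token; unknown tokens (including padding) become zeros."""
    zero = [0.0] * DIM
    return [vectors.get(tok, zero) for tok in tokens]


# Hypothetical vectors file content, kept in memory for the example.
toy_file = io.StringIO("酒店 0.1 0.2 0.3\n不错 0.4 0.5 0.6\n")
glove = parse_glove_text(toy_file)

tokens = ["酒店", "很", "不错"]  # what a word segmenter might return
matrix = tokens_to_vectors(pad_tokens(tokens, 5), glove)

print(len(matrix), len(matrix[0]))  # 5 3
print(matrix[1])                    # [0.0, 0.0, 0.0] ("很" is not in the toy vocabulary)
```

Every review ends up as a fixed-size matrix (sequence length by vector dimension), which is the input shape a recurrent layer expects.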

The final accuracy is around 91%.
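
The script shuffles the data without a fixed random seed and holds out the last 20% for validation, so the exact figure will vary from run to run. Below is a minimal sketch of a reproducible hold-out split, assuming a fixed seed is acceptable. The function name `split_holdout`, the seed value, and the sample data are illustrative and not part of the original code.

```python
import random


def split_holdout(samples, labels, validation_split=0.2, seed=42):
    """Shuffle samples and labels together, then hold out the tail for validation."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # local RNG: same seed, same order
    n_val = int(validation_split * len(samples))
    cut = len(indices) - n_val            # also correct when n_val is 0
    train_idx, val_idx = indices[:cut], indices[cut:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(samples, train_idx), pick(labels, train_idx),
            pick(samples, val_idx), pick(labels, val_idx))


# Made-up data: ten "reviews" with alternating labels.
samples = ["review %d" % i for i in range(10)]
labels = [i % 2 for i in range(10)]
x_train, y_train, x_val, y_val = split_holdout(samples, labels)
print(len(x_train), len(x_val))  # 8 2
```

Calling `split_holdout` again with the same seed returns the same partition, which makes accuracy figures from different runs easier to compare.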

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed May 30 13:52:23 2018

@author: xyli

Process the hotel review corpus: segment the text into words
and convert the words to GloVe vectors.
"""

import os
import re

import jieba
import numpy as np
from keras.layers import Dense, Dropout, LSTM, Bidirectional
from keras.models import Sequential

# Only needed if the commented-out layers in the model are re-enabled:
# from keras.layers import Activation
# from Attention_layer import Attention_layer

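def loadGLoveModel(filename):
    """Load GloVe vectors from a text file into a dict of word -> float32 array."""
    embeddings_index = {}
    # Assumes the vectors file is UTF-8 encoded.
    with open(filename, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            if not values:
                continue  # skip blank lines
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index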

def word2Glovec(List, model):
    """Map each token to its GloVe vector; unknown tokens get a zero vector."""
    vec = []
    insert = np.zeros(300, dtype='float32')  # 300 is the vector dimension
    for w in List:
        v = model.get(w)
        vec.append(insert if v is None else v)
    return vec

def clean_str(string):
    """
    Strip backslashes, quotes, carriage returns, and common ASCII and
    full-width punctuation from the text.
    """
    string = re.sub(r"[\\'\"]", "", string)
    string = re.sub(r"\r\n", "", string)
    string = re.sub(r"[\r,.，。（）()“”]", "", string)
    return string.strip()

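def fitList(List, n):
    """Pad or truncate a token list to exactly n tokens."""
    # '!' is the padding token; word2Glovec turns it into a zero vector
    # as long as '!' is not in the GloVe vocabulary.
    insert = '!'
    L = len(List)
    if L < n:
        List = List + [insert] * (n - L)
    elif L > n:
        List = List[:n]
    return List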

def readData(filename):
    """Read one review file, clean it, and segment it with jieba."""
    with open(filename, 'rb') as f:
        data = f.read()
    data = data.decode('gb18030', 'ignore')
    data = clean_str(data)
    seg_list = jieba.cut(data)  # accurate mode is the default
    return [clean_str(s) for s in seg_list]

def loadData():
    """Load the corpus; return tokenized reviews and their one-hot labels."""
    Corpus_DIR = "data/ChnSentiCorp_htl_unba_10000"
    # Negative reviews are labelled [0.0, 1.0], positive reviews [1.0, 0.0].
    DIR = [('/neg', [0.0, 1.0]), ('/pos', [1.0, 0.0])]
    commentList = []
    labelList = []
    for subdir, label in DIR:
        rootdir = Corpus_DIR + subdir
        for name in os.listdir(rootdir):  # every entry in the folder
            path = os.path.join(rootdir, name)
            if os.path.isfile(path):
                commentList.append(readData(path))
                # Add a label only when a review was read, so both lists stay aligned.
                labelList.append(list(label))
    return commentList, labelList

if __name__=='__main__':
    List, labelList = loadData()  # load the corpus
    gloveModel = loadGLoveModel('model/zhs_wiki_glove.vectors.300d.txt')  # load the GloVe vectors
    countList = []  # original review lengths, kept for inspection
    commentVecList = []
    n = 100  # every review is padded or truncated to n tokens
    for c in List:
        countList.append(len(c))
        glovec = word2Glovec(fitList(c, n), gloveModel)
        commentVecList.append(glovec)

    VALIDATION_SPLIT = 0.2
    commentVecList = np.array(commentVecList)
    labelList = np.array(labelList)

    # Shuffle, then hold out the last 20% for validation.
    indices = np.arange(commentVecList.shape[0])
    np.random.shuffle(indices)
    data = commentVecList[indices]
    labels = labelList[indices]
    nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
    x_train = data[:-nb_validation_samples]
    y_train = labels[:-nb_validation_samples]
    x_val = data[-nb_validation_samples:]
    y_val = labels[-nb_validation_samples:]

    model = Sequential()
    model.add(LSTM(120, input_shape=(x_train.shape[1], x_train.shape[2]), return_sequences=True))
#    model.add(Activation('relu'))  # activation layer
#    model.add(Attention_layer())
    model.add(Bidirectional(LSTM(60, return_sequences=True)))
#    model.add(Attention_layer())
#    model.add(Activation('relu'))  # activation layer
    model.add(Dropout(0.3))  # randomly deactivate units
    model.add(Bidirectional(LSTM(30, return_sequences=False)))
    model.add(Dropout(0.3))  # randomly deactivate units
    model.add(Dense(y_train.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              epochs=25, batch_size=200)
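
In the script above, negative reviews are labelled `[0.0, 1.0]` and positive reviews `[1.0, 0.0]`, so index 0 of the softmax output corresponds to the positive class. Here is a small sketch of turning predicted probability pairs into class names and an accuracy figure. The helper names `decode` and `accuracy` and the probability values are hypothetical; in practice the pairs would come from something like `model.predict(x_val)`.

```python
CLASS_NAMES = ("pos", "neg")  # index 0 = positive, index 1 = negative


def decode(probabilities):
    """Map one [p_pos, p_neg] pair to a class name via argmax."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return CLASS_NAMES[best]


def accuracy(predicted, one_hot_labels):
    """Fraction of rows whose argmax matches the one-hot label."""
    hits = sum(decode(p) == decode(y) for p, y in zip(predicted, one_hot_labels))
    return hits / len(one_hot_labels)


# Made-up softmax outputs and their true labels.
predicted = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
truth = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

print([decode(p) for p in predicted])  # ['pos', 'neg', 'pos', 'neg']
print(accuracy(predicted, truth))      # 0.75
```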

This post is still a work in progress...
