Sentiment analysis in nlp

The goal of the program is to analysis the article title is Sarcasm or not, i use tensorflow 2.5 to solve this problem.

Dataset download url: https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection/home

a sample of the dataset:

{

  "article_link": "https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5",

  "headline": "former versace store clerk sues over secret 'black code' for minority shoppers",

  "is_sarcastic": 0

}

we want to depend on headline to predict the is_sarcastic, 1 means True,0 means False.

preprocessing

use pandas to read json file.

import pandas as pd

# lines = True means headle the json for each line

df = pd.read_json("Sarcasm_Headlines_Dataset_v2.json" ,lines="True")

df

'''

   is_sarcastic      headline                     article_link

0      1         thirtysomething sci...         https://www.theonion.co...

1      0             dem rep. totally ...       https://www.huffingtonpos..

'''

build list for each column

labels = []

sentences = []

urls = []

# a tips for convert series to list

'''

    type(df['is_sarcastic'])

    # Series

    type(df['is_sarcastic'].values)

    # ndarray

    type(df['is_sarcastic'].values.tolist())

    # list

'''

labels = df['is_sarcastic'].values.tolist()

sentences = df['headline'].values.tolist()

urls = df['article_link'].values.tolist()

len(labels) # 28619

len(sentences) # 28619

split dataset into train set and test set

# train size is the 2/3 of the all dataset.

train_size = int(len(labels) / 3 * 2)

train_sentences = sentences[0: train_size]

test_sentences = sentences[train_size:]

train_y = labels[0:train_size]

test_y = labels[train_size:]

init some parameter

# some parameter

vocab_size = 10000

# input layer to embedding

embedding_dim = 16

# each input sentence length

max_length = 100

# padding method

trunc_type='post'

padding_type='post'

# token the unfamiliar word

oov_tok = "<OOV>"

preprocessing on train set and test set

# processing on train set and test set

import numpy as np

from tensorflow.keras.preprocessing.text import Tokenizer

from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(oov_token = oov_tok)

tokenizer.fit_on_texts(train_sentences)

train_X = tokenizer.texts_to_sequences(train_sentences)

# padding the data

train_X = pad_sequences(train_X,

                       maxlen = max_length,

                       truncating = trunc_type,

                       padding = padding_type)

train_X[:2]

# convery the list to nparray

train_y = np.array(train_y)

# same operator to test set

test_X = tokenizer.texts_to_sequences(test_sentences)

test_X = pad_sequences(test_X ,

                      maxlen = max_length,

                      truncating = trunc_type,

                      padding = padding_type)

test_y = np.array(test_y)

build the model

some important functions and args:

tf.keras.layers.Dense # Denseimplements the operation:output = activation(dot(input, kernel) + bias) , a NN layer
- activation # Activation function to use. If you don't specify anything, no activation is applied (ie. "linear" activation: a(x) = x).
- use_bias # Boolean, whether the layer uses a bias vector.
tf.keras.Sequential # contain a linear stack of layer into a tf.keras.Model.
tf.keras.Model # to train and predict
- config the model with losses and metrics with model.compile(args)
  - optimizer
    - some args Adam RMSprop SGD Adagrad
  - loss # The loss value that will be minimized by the model will then be the sum of all individual losses.
    - some args BinaryCrossentropy MeanAbsoluteError MeanSquaredError
  - metrices # List of metrics to be evaluated by the model during training and testing.
    - some args AUC Accuracy Recall
- train the model with model.fit(x=None,y=None)
  - batch_size # Number of samples per gradient update. If unspecified, batch_size will default to 32.
  - epochs # Number of epochs to train the model
  - verbose # Verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch,verbose=2 is recommended when not running interactively
  - validation_data #( valid_X, valid_y )
tf.keras.layers.Embedding # Turns positive integers (indexes) into dense vectors of fixed size. as shown in following figure
- the purpose of the embedding is making the 1-dim integer proceed the muti-dim vectors add. can find the hide feature and connect to predict the labels. in this program ,every word's emotion direction can be trained many times.
tf.keras.layer.GlobalAveragePooling1D # add all muti-dim vectors ,if the output layer shape is (32, 10, 64), after the pooling, the shape will be changed as (32,64), as shown in following figure
-

code is more simple then theory

# build the model

model = tf.keras.Sequential(

        [

            # make a word became a 64-dim vector

            tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length = max_length),

            # add all word vector

            tf.keras.layers.GlobalAveragePooling1D(),

            # NN

            tf.keras.layers.Dense(24, activation = 'relu'),

            tf.keras.layers.Dense(1, activation = 'sigmoid')

        ]

)

model.compile(loss = 'binary_crossentropy', optimizer = 'adam' , metrics = ['accuracy'])

train the model

num_epochs = 30

history = model.fit(train_X, train_y, epochs = num_epochs,

                   validation_data = (test_X, test_y),

                   verbose = 2)

after the 30 epochs

Epoch 30/30

597/597 - 8s - loss: 1.8816e-04 - accuracy: 1.0000 - val_loss: 1.2858 - val_accuracy: 0.8216

predict our sentence

mytest_sentence = ["you are so cute", "you are so cute but looks like stupid"]

mytest_X = tokenizer.texts_to_sequences(mytest_sentence)

mytest_X = pad_sequences(mytest_X ,

                      maxlen = max_length,

                      truncating = trunc_type,

                      padding = padding_type)



mytest_y = model.predict(mytest_X)

# if result is bigger then 0.5 ,it means the title is Sarcasm

print(mytest_y > 0.5)

'''

[[False]

 [ True]]

'''

reference:

tensorflow API: https://www.tensorflow.org/api_docs/python/tf/keras/Sequential

colab: bit.ly/tfw-sarcembed

Sentiment analysis in nlp的更多相关文章

Sentiment Analysis resources
Wikipedia: Sentiment analysis (also known as opinion mining) refers to the use of natural language p ...
NAACL 2013 Paper Mining User Relations from Online Discussions using Sentiment Analysis and PMF
中文简单介绍:本文对怎样基于情感分析和概率矩阵分解从网络论坛讨论中挖掘用户关系进行了深入研究. 论文出处:NAACL'13. 英文摘要: Advances in sentiment analysis ...
【Deep Learning Nanodegree Foundation笔记】第 10 课：Sentiment Analysis with Andrew Trask
In this lesson, Andrew Trask, the author of Grokking Deep Learning, will walk you through using neur ...
论文阅读：Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis
论文标题:Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis 论文链接:http://arxi ...
使用RNN进行imdb影评情感识别--use RNN to sentiment analysis
原创帖子,转载请说明出处一.RNN神经网络结构 RNN隐藏层神经元的连接方式和普通神经网路的连接方式有一个非常明显的区别,就是同一层的神经元的输出也成为了这一层神经元的输入.当然同一时刻的输出是不可 ...
Deep Learning for NLP 文章列举
Deep Learning for NLP 文章列举原文链接:http://www.xperseverance.net/blogs/2013/07/2124/ 大部分文章来自: http://w ...
转 Deep Learning for NLP 文章列举
原文链接:http://www.xperseverance.net/blogs/2013/07/2124/ 大部分文章来自: http://www.socher.org/ http://deepl ...
Standford CoreNLP--Sentiment Analysis初探
Stanford CoreNLP功能之一是Sentiment Analysis(情感分析),可以标识出语句的正面或者负面情绪,包括:Positive,Neutral,Negative三个值. 运行有两 ...
Java自然语言处理NLP工具包
1. Java自然语言处理 LingPipe LingPipe是一个自然语言处理的Java开源工具包.LingPipe目前已有很丰富的功能,包括主题分类(Top Classification).命名实 ...

随机推荐

Codeforces Round #742 (Div. 2) B. MEXor Mixup
题目链接 Problem - B - Codeforces 题意: 给出MEX 和 XOR(分别表示1. 本串数不存在的最小非负数 2. 本串数所有数异或后的结果) 求出这串数最少有几个数, 1 ≤ ...
爬虫亚马逊Bestselling类别产品数据TOP100
1 # -*- coding: utf-8 -*- 2 # @Time : 2020/9/11 16:23 3 # @Author : Chunfang 4 # @Email : 3470959534 ...
『现学现忘』Git基础 — 14、Git基础操作的总结与补充
目录 1.Git本地版本库结构 2.Git常用操作方法 3.补充:添加多个文件到暂存区 4.补充:提交操作未写备注 5.补充:从工作区直接提交到版本库 1.Git本地版本库结构如下图所示: 工作区( ...
聊聊Lock接口的lock()和lockInterruptible()有什么区别？
lock()和lockInterruptible()都表示获取锁,唯一区别是,当A线程调用lock()或lockInterruptible()方法获取锁没有成功而进入等待锁的状态时,若接着调用该A线程 ...
数据交换格式 JSON
1. 什么是 JSON 概念 : JSON 的英文全称是 JavaScript ObjEct Notation, 即 "JavaScript 对象表示法" . 简单来讲 : JSO ...
团队Beta演示
组长博客本组(组名)所有成员短学号姓名 2236 王耀鑫(组长) 2210 陈超颖 2209 陈湘怡 2228 许培荣 2204 滕佳 2205 何佳琳 2237 沈梓耀 2233 陈志荣 22 ...
忘带U盘了？？别急！一行python代码即可搞定文件传输
近日发现了python一个很有趣的功能,今天在这里给大伙儿做一下分享需求前提 1.想要拷贝电脑的文件到另一台电脑但是又没有U盘2.手机上想获取到存储在电脑的文件3.忘带U盘- 您也太丢三落四了吧,但 ...
CRM项目的整理---第一篇
CRM:cunstomer relationship management 客户管理系统 1.项目的使用者:销售班主任讲师助教 2.项目的需求分析 2.1.注册 2.2.登录 2.3 ...
Ansible的参数介绍
安装完成ansible后查看ansible的参数:ansible -h ansible 命令格式:Usage: ansible <host-pattern> [options] ansib ...
【mq】从零开始实现 mq-12-消息的批量发送与回执
前景回顾 [mq]从零开始实现 mq-01-生产者.消费者启动 [mq]从零开始实现 mq-02-如何实现生产者调用消费者? [mq]从零开始实现 mq-03-引入 broker 中间人 [mq]从零 ...

Sentiment analysis in nlp

Sentiment analysis in nlp

preprocessing

build the model

code is more simple then theory

Sentiment analysis in nlp的更多相关文章

随机推荐

热门专题