Sentiment analysis in NLP

The goal of this program is to determine whether an article headline is sarcastic or not. I use TensorFlow 2.5 to solve this problem.

Dataset download url: https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection/home

A sample record from the dataset:

{

  "article_link": "https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5",

  "headline": "former versace store clerk sues over secret 'black code' for minority shoppers",

  "is_sarcastic": 0

}

We want to use the headline to predict is_sarcastic, where 1 means sarcastic and 0 means not sarcastic.

Preprocessing

Use pandas to read the JSON file.

import pandas as pd

# lines=True tells pandas to parse each line of the file as a separate JSON object

df = pd.read_json("Sarcasm_Headlines_Dataset_v2.json", lines=True)

df

'''

   is_sarcastic      headline                     article_link

0      1         thirtysomething sci...         https://www.theonion.co...

1      0             dem rep. totally ...       https://www.huffingtonpos..

'''
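To make the effect of lines=True more concrete, here is a minimal sketch that uses only the standard json module. It assumes the file is in JSON Lines format (one JSON object per line), which is the layout lines=True expects. The first headline is the sample shown above; the second record is made up for illustration and is not from the dataset.

```python
import json

# Two records in JSON Lines format: one complete JSON object per line.
# The second record is hypothetical, not taken from the dataset.
raw_text = (
    '{"is_sarcastic": 0, "headline": "former versace store clerk sues over '
    'secret \'black code\' for minority shoppers"}\n'
    '{"is_sarcastic": 1, "headline": "a made-up sarcastic headline"}\n'
)

# Parse each non-empty line on its own, which is conceptually what lines=True does.
records = [json.loads(line) for line in raw_text.splitlines() if line.strip()]

sample_labels = [r["is_sarcastic"] for r in records]
sample_headlines = [r["headline"] for r in records]

print(len(records), sample_labels)
```

Parsing the whole file with a single json.loads call would fail here, because the file as a whole is not one valid JSON document.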

Build a list for each column

labels = []

sentences = []

urls = []

# a tip for converting a Series to a list

'''

    type(df['is_sarcastic'])

    # Series

    type(df['is_sarcastic'].values)

    # ndarray

    type(df['is_sarcastic'].values.tolist())

    # list

'''

labels = df['is_sarcastic'].values.tolist()

sentences = df['headline'].values.tolist()

urls = df['article_link'].values.tolist()

len(labels) # 28619

len(sentences) # 28619

Split the dataset into a training set and a test set

# the training set is the first 2/3 of the whole dataset

train_size = int(len(labels) / 3 * 2)

train_sentences = sentences[0: train_size]

test_sentences = sentences[train_size:]

train_y = labels[0:train_size]

test_y = labels[train_size:]
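As a quick check of the split arithmetic, the sketch below plugs in the dataset size reported above (28619). The batch size of 32 is an assumption: it is the Keras default, used here because the training call in this post does not set batch_size.

```python
import math

n_samples = 28619                     # len(labels) reported above
train_size = int(n_samples / 3 * 2)   # same formula as in the post
test_size = n_samples - train_size

batch_size = 32                       # assumed: Keras default when batch_size is not given
steps_per_epoch = math.ceil(train_size / batch_size)

print(train_size, test_size, steps_per_epoch)  # 19079 9540 597
```

The 597 steps per epoch agree with the 597/597 counter in the training log further down.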

Initialize some parameters

# maximum number of words kept in the vocabulary

vocab_size = 10000

# dimension of each word embedding vector

embedding_dim = 16

# length of every input sequence after padding or truncation

max_length = 100

# truncate and pad at the end of each sequence

trunc_type='post'

padding_type='post'

# token that replaces out-of-vocabulary words

oov_tok = "<OOV>"

Preprocess the training set and the test set

# tokenize and pad the training set and the test set

import numpy as np

from tensorflow.keras.preprocessing.text import Tokenizer

from tensorflow.keras.preprocessing.sequence import pad_sequences

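# num_words keeps word ids below vocab_size so they stay in range for the Embedding layer

tokenizer = Tokenizer(num_words = vocab_size, oov_token = oov_tok)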

tokenizer.fit_on_texts(train_sentences)

train_X = tokenizer.texts_to_sequences(train_sentences)

# pad (or truncate) every sequence to max_length

train_X = pad_sequences(train_X,

                       maxlen = max_length,

                       truncating = trunc_type,

                       padding = padding_type)

train_X[:2]

# convert the list to a NumPy array

train_y = np.array(train_y)

# apply the same operations to the test set

test_X = tokenizer.texts_to_sequences(test_sentences)

test_X = pad_sequences(test_X ,

                      maxlen = max_length,

                      truncating = trunc_type,

                      padding = padding_type)

test_y = np.array(test_y)
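To show what the tokenizing and padding steps produce, here is a simplified pure-Python stand-in. It is not the Keras implementation: the helpers build_word_index, to_sequences and pad_post are hypothetical names written for this sketch, and they skip details such as punctuation filtering. They only mirror three ideas used above: words get integer ids by frequency, unknown words map to the OOV id, and sequences are truncated and zero-padded at the end.

```python
from collections import Counter

def build_word_index(sentences, oov_token="<OOV>"):
    """Assign integer ids by descending word frequency; id 1 is reserved for OOV."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def to_sequences(sentences, word_index, oov_token="<OOV>"):
    """Replace each word by its id; unseen words become the OOV id."""
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in s.lower().split()] for s in sentences]

def pad_post(sequences, maxlen):
    """Truncate at the end, then pad with zeros at the end."""
    padded = []
    for seq in sequences:
        seq = seq[:maxlen]
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded

train = ["i love my dog", "i love my cat"]
word_index = build_word_index(train)

# "fish" was never seen during fitting, so it maps to the OOV id 1.
seqs = to_sequences(["i love my fish", "my dog"], word_index)
padded = pad_post(seqs, maxlen=5)

print(word_index)
print(padded)  # [[2, 3, 4, 1, 0], [4, 5, 0, 0, 0]]
```

Note that the word index is built from the training sentences only, which is why the test set is converted with the same fitted tokenizer rather than a new one.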

Build the model

Some important functions and arguments:

tf.keras.layers.Dense # a fully connected NN layer. Dense implements the operation: output = activation(dot(input, kernel) + bias)
- activation # Activation function to use. If you don't specify anything, no activation is applied (i.e. "linear" activation: a(x) = x).
- use_bias # Boolean, whether the layer uses a bias vector.
tf.keras.Sequential # groups a linear stack of layers into a tf.keras.Model.
tf.keras.Model # used to train and predict
- configure the model with losses and metrics using model.compile(args)
  - optimizer
    - some options: Adam, RMSprop, SGD, Adagrad
  - loss # The loss function to minimize. If the model has multiple outputs, the value minimized is the sum of all individual losses.
    - some options: BinaryCrossentropy, MeanAbsoluteError, MeanSquaredError
  - metrics # List of metrics to be evaluated by the model during training and testing.
    - some options: AUC, Accuracy, Recall
- train the model with model.fit(x=None, y=None)
  - batch_size # Number of samples per gradient update. If unspecified, batch_size will default to 32.
  - epochs # Number of epochs to train the model.
  - verbose # Verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch. verbose=2 is recommended when not running interactively.
  - validation_data # a tuple (valid_X, valid_y) evaluated at the end of each epoch
tf.keras.layers.Embedding # Turns positive integers (indexes) into dense vectors of fixed size.
- The purpose of the embedding is to turn each one-dimensional integer into a multi-dimensional vector that can be added to, or averaged with, other word vectors. This lets the model find hidden features and connect them to the labels. In this program, each word's sentiment direction is adjusted many times during training.
tf.keras.layers.GlobalAveragePooling1D # averages the vectors over the sequence dimension. If the input shape is (32, 10, 64), the output shape after pooling is (32, 64).
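The layers described above can be sketched in NumPy to make the shapes visible. This is an illustration under assumptions, not how Keras implements them: the weights are random instead of trained, the sizes are toy values chosen to reproduce the (32, 10, 64) to (32, 64) example, and dense, relu and sigmoid are helper names defined only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen to match the shape example in the text.
vocab_size, embedding_dim, seq_len, batch = 50, 64, 10, 32

# Embedding: a lookup table with one row (vector) per word id.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))
token_ids = rng.integers(0, vocab_size, size=(batch, seq_len))
embedded = embedding_table[token_ids]            # shape (32, 10, 64)

# GlobalAveragePooling1D: mean over the sequence axis.
pooled = embedded.mean(axis=1)                   # shape (32, 64)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, kernel, bias, activation):
    """Dense layer: activation(dot(input, kernel) + bias)."""
    return activation(x @ kernel + bias)

hidden = dense(pooled, rng.normal(size=(embedding_dim, 24)), np.zeros(24), relu)
output = dense(hidden, rng.normal(size=(24, 1)), np.zeros(1), sigmoid)

print(embedded.shape, pooled.shape, hidden.shape, output.shape)
```

Because the pooling step averages over word positions, the sentence vector has the same size no matter how long the padded sequence is.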

The code is simpler than the theory

import tensorflow as tf

# build the model

model = tf.keras.Sequential(

        [

            # turn each word id into a 16-dim vector (embedding_dim = 16)

            tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length = max_length),

            # average all word vectors in the sentence

            tf.keras.layers.GlobalAveragePooling1D(),

            # fully connected layers; the sigmoid output is the probability of sarcasm

            tf.keras.layers.Dense(24, activation = 'relu'),

            tf.keras.layers.Dense(1, activation = 'sigmoid')

        ]

)

model.compile(loss = 'binary_crossentropy', optimizer = 'adam' , metrics = ['accuracy'])
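As a sanity check on the architecture, the number of trainable parameters can be worked out by hand from the values set earlier (vocab_size = 10000, embedding_dim = 16). This is plain arithmetic rather than a call into TensorFlow; calling model.summary() on the model above should report the same total if the layers are built as written.

```python
vocab_size = 10000
embedding_dim = 16

embedding_params = vocab_size * embedding_dim   # one 16-dim vector per word id
pooling_params = 0                              # averaging has no weights
dense1_params = embedding_dim * 24 + 24         # kernel (16 x 24) + bias (24)
dense2_params = 24 * 1 + 1                      # kernel (24 x 1) + bias (1)

total_params = embedding_params + pooling_params + dense1_params + dense2_params
print(total_params)  # 160433
```

Almost all of the parameters sit in the embedding table, so vocab_size and embedding_dim dominate the model size.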

Train the model

num_epochs = 30

history = model.fit(train_X, train_y, epochs = num_epochs,

                   validation_data = (test_X, test_y),

                   verbose = 2)

After 30 epochs:

Epoch 30/30

597/597 - 8s - loss: 1.8816e-04 - accuracy: 1.0000 - val_loss: 1.2858 - val_accuracy: 0.8216
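One way to read this log: accuracy only counts whether a prediction lands on the correct side of 0.5, while binary cross-entropy also grows with how confident a wrong prediction is. That is one plausible reason a val_loss above 1 can sit next to a val_accuracy of about 0.82, together with the gap between training and validation numbers, which suggests overfitting. The sketch below shows the per-sample loss formula; the probabilities are made-up values, and the clipping constant eps is an assumption added for numerical safety.

```python
import math

def binary_crossentropy(y_true, p, eps=1e-7):
    """Per-sample loss: -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# True label is 1 (sarcastic) in both cases; only the predicted probability differs.
loss_confident_right = binary_crossentropy(1, 0.9)    # about 0.105
loss_confident_wrong = binary_crossentropy(1, 0.001)  # about 6.908

print(loss_confident_right, loss_confident_wrong)
```

A small number of very confident mistakes on the validation set can therefore raise the average loss considerably without changing the accuracy much.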

Predict our own sentences

mytest_sentence = ["you are so cute", "you are so cute but looks like stupid"]

mytest_X = tokenizer.texts_to_sequences(mytest_sentence)

mytest_X = pad_sequences(mytest_X ,

                      maxlen = max_length,

                      truncating = trunc_type,

                      padding = padding_type)



mytest_y = model.predict(mytest_X)

# a result greater than 0.5 means the sentence is predicted to be sarcastic

print(mytest_y > 0.5)

'''

[[False]

 [ True]]

'''
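If integer labels are more convenient than booleans, the thresholding step can be written as below. The probabilities are hypothetical stand-ins for the sigmoid outputs, not values produced by the trained model.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (2, 1), like the array returned by model.predict.
probs = np.array([[0.12], [0.87]])

# Compare against 0.5, convert True/False to 1/0, and flatten to a 1-D array.
predicted_labels = (probs > 0.5).astype(int).ravel()

print(predicted_labels.tolist())  # [0, 1]
```

The resulting values use the same encoding as is_sarcastic in the dataset: 1 for sarcastic, 0 for not sarcastic.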

References:

tensorflow API: https://www.tensorflow.org/api_docs/python/tf/keras/Sequential

colab: bit.ly/tfw-sarcembed
