In the previous article, 《TextCNN论文解读》, we covered the theory behind TextCNN; this article puts the model into practice with TensorFlow 2.0.

Dataset: IFLYTEK, from the CLUE Chinese NLP benchmark.
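
Each file stores one JSON object per line; the fields used below are sentence (the text to classify) and label. An illustrative line (made up to show the shape, not an actual record):

{"label": "107", "sentence": "提供国内外热门景点的旅游攻略与门票预订服务"}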

Imports

import os
import re
import json
import jieba
import datetime
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.initializers import Constant
from sklearn.model_selection import train_test_split
from gensim.models.keyedvectors import KeyedVectors
random_seed = 100

Data preprocessing

Set the data paths

Dir = './data/iflytek_public/'
label_json_path = os.path.join(Dir, 'labels.json')
train_json_path = os.path.join(Dir, 'train.json')
test_json_path = os.path.join(Dir, 'test.json')
dev_json_path = os.path.join(Dir, 'dev.json')
  • read_json: reads line-delimited JSON from a file
  • ReplacePunct: a class that strips punctuation with a regular expression
  • string2list: parses the loaded JSON records, extracting the character sequences and class labels
def read_json(path):
    json_data = []
    with open(path, encoding='utf-8') as f:
        for line in f.readlines():
            json_data.append(json.loads(line))
    return json_data


class ReplacePunct:
    def __init__(self):
        self.pattern = re.compile(r"[!?',.:;!?’、,。:;「」~~○]")

    def replace(self, string):
        return re.sub(self.pattern, "", string, count=0)


Replacer = ReplacePunct()


def string2list(data_json):
    '''
    Args:
        data_json: list of sample dicts parsed from the json file
    Returns:
        data_text: list of character lists
        data_label: list of integer labels
    '''
    data_text = [list(Replacer.replace(text['sentence'])) for text in data_json]
    data_label = [int(text['label']) for text in data_json]
    return data_text, data_label

Read the data, strip punctuation, convert each sentence into a character sequence, and extract the labels.
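
A quick sanity check with a hand-made record (illustrative values only):

sample_json = [{'label': '3', 'sentence': '今天天气很好!'}]
sample_text, sample_label = string2list(sample_json)
print(sample_text)   # [['今', '天', '天', '气', '很', '好']]
print(sample_label)  # [3]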

Print the sizes of the training and dev sets

label_json = read_json(label_json_path)
train_json = read_json(train_json_path)
dev_json = read_json(dev_json_path)
print('train:{} | dev:{}'.format(len(train_json), len(dev_json)))

train_text, train_label = string2list(train_json)
dev_text, dev_label = string2list(dev_json)
train:12133 | dev:2599

Define the tokenizer and fit it on the prepared text sequences

tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=None,
    filters=' ',
    lower=True,
    split=' ',
    char_level=False,
    oov_token='UNKNOWN',   # spelling fixed; any consistent string works as the oov token
    document_count=0
)
tokenizer.fit_on_texts(train_text)
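
A quick look at the fitted tokenizer (the exact ids depend on character frequencies in the training text):

print(len(tokenizer.word_index))                     # number of distinct characters, plus the oov token
print(tokenizer.word_index['UNKNOWN'])               # 1 -- Keras always assigns the oov token index 1
print(tokenizer.texts_to_sequences([['今', '天']]))  # e.g. [[52, 31]] -- ids are frequency-ranked
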
  • Define the batch size and the maximum sequence length
  • Convert the character sequences to integer sequences
  • Pad the sequences to the maximum length
  • Prepare the label tensors
  • Build train_dataset and dev_dataset
BATCH_SIZE = 64
MAX_LEN = 500
BUFFER_SIZE = tf.constant(len(train_text), dtype=tf.int64)

# text to lists of ints
train_sequence = tokenizer.texts_to_sequences(train_text)
dev_sequence = tokenizer.texts_to_sequences(dev_text)

# pad the sequences
train_sequence_padded = pad_sequences(train_sequence, padding='post', maxlen=MAX_LEN)
dev_sequence_padded = pad_sequences(dev_sequence, padding='post', maxlen=MAX_LEN)

# convert the labels to tensors
train_label_tensor = tf.convert_to_tensor(train_label, dtype=tf.float32)
dev_label_tensor = tf.convert_to_tensor(dev_label, dtype=tf.float32)

# create the datasets
train_dataset = (tf.data.Dataset.from_tensor_slices((train_sequence_padded, train_label_tensor))
                 .shuffle(BUFFER_SIZE)
                 .batch(BATCH_SIZE, drop_remainder=True)
                 .prefetch(BUFFER_SIZE))
dev_dataset = (tf.data.Dataset.from_tensor_slices((dev_sequence_padded, dev_label_tensor))
               .batch(BATCH_SIZE, drop_remainder=True)
               .prefetch(BUFFER_SIZE))
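
Because drop_remainder=True, the last partial batch is discarded each epoch: with 2599 dev samples and a batch size of 64, evaluation covers 40 full batches (2560 samples). Note also that shuffle uses the full training-set size as its buffer (a perfect shuffle), and prefetch is given the same constant; tf.data.experimental.AUTOTUNE is the more common prefetch setting.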

An example input/label batch

example_input, example_output = next(iter(train_dataset))
example_input.shape, example_output.shape
(TensorShape([64, 500]), TensorShape([64]))

Building the model

Define constants

VOCAB_SIZE = len(tokenizer.index_word) + 1   # vocabulary size (+1: id 0 is reserved for padding)
EMBEDDING_DIM = 300          # word-vector dimension
FILTERS = [3, 4, 5]          # convolution kernel sizes
FILTER_NUM = 256             # number of kernels per size
CLASS_NUM = len(label_json)  # number of classes
DROPOUT_RATE = 0.8           # dropout rate (the fraction of units dropped)
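
It is worth printing the derived constants once: VOCAB_SIZE depends on the characters seen during fitting, while CLASS_NUM comes from labels.json (IFLYTEK defines 119 application categories):

print('vocab: {} | classes: {}'.format(VOCAB_SIZE, CLASS_NUM))
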
  • get_embeddings: loads the pretrained word vectors into an embedding matrix
  • PretrainedEmbedding: an Embedding layer initialized from the pretrained vectors that can still be fine-tuned
def get_embeddings():
    pretrained_vec_path = "./saved_model/sgns.baidubaike.bigram-char"
    word_vectors = KeyedVectors.load_word2vec_format(pretrained_vec_path, binary=False)
    word_vocab = set(word_vectors.vocab.keys())
    embeddings = np.zeros((VOCAB_SIZE, EMBEDDING_DIM), dtype=np.float32)
    # row 0 stays zero for the padding id; characters missing from the
    # pretrained vocabulary also stay zero
    for i in range(1, len(tokenizer.index_word) + 1):
        word = tokenizer.index_word[i]
        if word in word_vocab:
            embeddings[i, :] = word_vectors.get_vector(word)
    return embeddings


class PretrainedEmbedding(tf.keras.layers.Layer):
    def __init__(self, VOCAB_SIZE, EMBEDDING_DIM, embeddings, rate=0.1):
        super(PretrainedEmbedding, self).__init__()
        self.VOCAB_SIZE = VOCAB_SIZE
        self.EMBEDDING_DIM = EMBEDDING_DIM
        self.embeddings_initializer = tf.constant_initializer(embeddings)
        self.dropout = tf.keras.layers.Dropout(rate)

    def build(self, input_shape):
        # created in build so the matrix is registered as a trainable
        # weight and can be fine-tuned
        self.embeddings = self.add_weight(
            shape=(self.VOCAB_SIZE, self.EMBEDDING_DIM),
            initializer=self.embeddings_initializer,
            dtype=tf.float32
        )

    def call(self, x, trainable=None):
        output = tf.nn.embedding_lookup(
            params=self.embeddings,
            ids=x
        )
        return self.dropout(output, training=trainable)


embeddings = get_embeddings()
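
Note that word_vectors.vocab belongs to the pre-4.0 gensim API. On gensim >= 4.0 (an assumption about the local environment, not the author's setup), the vocabulary lookup in get_embeddings would be written as:

# gensim >= 4.0 removed KeyedVectors.vocab; key_to_index is the replacement
word_vocab = set(word_vectors.key_to_index.keys())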

Define the model

class TextCNN(tf.keras.Model):
    def __init__(self, VOCAB_SIZE, EMBEDDING_DIM, FILTERS, FILTER_NUM, CLASS_NUM, DROPOUT_RATE, embeddings):
        super(TextCNN, self).__init__()
        self.VOCAB_SIZE = VOCAB_SIZE
        self.EMBEDDING_DIM = EMBEDDING_DIM
        self.FILTERS = FILTERS
        self.FILTER_NUM = FILTER_NUM
        self.CLASS_NUM = CLASS_NUM
        self.DROPOUT_RATE = DROPOUT_RATE
        # self.embed = tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM,
        #     embeddings_initializer=tf.keras.initializers.Constant(embeddings))
        self.embed = PretrainedEmbedding(self.VOCAB_SIZE, self.EMBEDDING_DIM, embeddings)
        self.convs = []
        self.max_pools = []
        for i, FILTER in enumerate(self.FILTERS):
            conv = tf.keras.layers.Conv1D(self.FILTER_NUM, FILTER,
                                          padding='same', activation='relu', use_bias=True)
            # note: the TextCNN paper uses max-over-time pooling (GlobalMaxPooling1D);
            # this implementation pools with the average instead
            max_pool = tf.keras.layers.GlobalAveragePooling1D()
            self.convs.append(conv)
            self.max_pools.append(max_pool)
        self.dropout = tf.keras.layers.Dropout(self.DROPOUT_RATE)
        self.fc = tf.keras.layers.Dense(self.CLASS_NUM, activation='softmax')

    def call(self, x):
        # trainable=True keeps the embedding dropout active whenever the model is called
        x = self.embed(x, trainable=True)
        conv_results = []
        for conv, max_pool in zip(self.convs, self.max_pools):
            conv_results.append(max_pool(conv(x)))   # each branch yields (batch, FILTER_NUM)
        x = tf.concat(conv_results, axis=1)          # (batch, FILTER_NUM * len(FILTERS))
        x = self.dropout(x)
        x = self.fc(x)
        return x


textcnn = TextCNN(VOCAB_SIZE, EMBEDDING_DIM, FILTERS, FILTER_NUM, CLASS_NUM, DROPOUT_RATE, embeddings)
out = textcnn(example_input)
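
The forward pass on the example batch doubles as a shape check:

print(out.shape)  # (BATCH_SIZE, CLASS_NUM): one probability distribution per sample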

Define the loss function, optimizer, and metrics

loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(0.0005)

train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
eval_loss = tf.keras.metrics.Mean(name='eval_loss')
eval_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='eval_accuracy')
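
SparseCategoricalCrossentropy takes integer class ids directly, so the labels never need one-hot encoding; its default from_logits=False matches the softmax that the model's final Dense layer already applies.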

Define the per-step train and eval functions

@tf.function
def train_step(input_tensor, label_tensor):
    with tf.GradientTape() as tape:
        prediction = textcnn(input_tensor)
        loss = loss_object(label_tensor, prediction)
    gradients = tape.gradient(loss, textcnn.trainable_variables)
    optimizer.apply_gradients(zip(gradients, textcnn.trainable_variables))
    train_loss(loss)
    train_accuracy(label_tensor, prediction)


@tf.function
def eval_step(input_tensor, label_tensor):
    prediction = textcnn(input_tensor)
    loss = loss_object(label_tensor, prediction)
    eval_loss(loss)
    eval_accuracy(label_tensor, prediction)

Define summary writers that log metrics for visualization in TensorBoard.

current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
train_log_dir = 'logs/' + current_time + '/train'
test_log_dir = 'logs/' + current_time + '/test'
train_summary_writer = tf.summary.create_file_writer(train_log_dir)
test_summary_writer = tf.summary.create_file_writer(test_log_dir)

Train the model and save the weights

EPOCHS = 10

for epoch in range(EPOCHS):
    train_loss.reset_states()
    train_accuracy.reset_states()
    eval_loss.reset_states()
    eval_accuracy.reset_states()

    for batch_idx, (train_input, train_label) in enumerate(train_dataset):
        train_step(train_input, train_label)
    with train_summary_writer.as_default():
        tf.summary.scalar('loss', train_loss.result(), step=epoch)
        tf.summary.scalar('accuracy', train_accuracy.result(), step=epoch)

    for batch_idx, (dev_input, dev_label) in enumerate(dev_dataset):
        eval_step(dev_input, dev_label)
    with test_summary_writer.as_default():
        tf.summary.scalar('loss', eval_loss.result(), step=epoch)
        tf.summary.scalar('accuracy', eval_accuracy.result(), step=epoch)

    template = 'Epoch {}, Loss: {:.4f}, Accuracy: {:.4f}, Test Loss: {:.4f}, Test Accuracy: {:.4f}'
    print(template.format(epoch + 1,
                          train_loss.result().numpy(),
                          train_accuracy.result().numpy() * 100,
                          eval_loss.result().numpy(),
                          eval_accuracy.result().numpy() * 100))
    textcnn.save_weights('./saved_model/weights_{}.h5'.format(epoch))
Epoch 1, Loss: 3.7328, Accuracy: 22.9497, Test Loss: 3.2937, Test Accuracy: 28.2422
Epoch 2, Loss: 2.9424, Accuracy: 33.8790, Test Loss: 2.7973, Test Accuracy: 35.1953
Epoch 3, Loss: 2.5407, Accuracy: 40.1620, Test Loss: 2.5324, Test Accuracy: 41.0156
Epoch 4, Loss: 2.3023, Accuracy: 44.6759, Test Loss: 2.4003, Test Accuracy: 43.1641
Epoch 5, Loss: 2.1400, Accuracy: 47.5942, Test Loss: 2.2732, Test Accuracy: 45.2344
Epoch 6, Loss: 2.0264, Accuracy: 49.5784, Test Loss: 2.2155, Test Accuracy: 45.1172
Epoch 7, Loss: 1.9319, Accuracy: 51.7361, Test Loss: 2.1572, Test Accuracy: 48.2812
Epoch 8, Loss: 1.8622, Accuracy: 53.1415, Test Loss: 2.1201, Test Accuracy: 48.7109
Epoch 9, Loss: 1.7972, Accuracy: 54.2411, Test Loss: 2.0863, Test Accuracy: 49.1016
Epoch 10, Loss: 1.7470, Accuracy: 55.2331, Test Loss: 2.1074, Test Accuracy: 48.8281
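
To run inference later, the saved weights can be restored into a fresh model. A subclassed Keras model must be called once so that its variables exist before load_weights; a minimal sketch, assuming the final epoch's checkpoint file:

restored = TextCNN(VOCAB_SIZE, EMBEDDING_DIM, FILTERS, FILTER_NUM, CLASS_NUM, DROPOUT_RATE, embeddings)
restored(example_input)                              # one forward pass creates the variables
restored.load_weights('./saved_model/weights_9.h5')  # checkpoint written after the last epoch
probs = restored(dev_sequence_padded[:1])            # shape (1, CLASS_NUM)
print(tf.argmax(probs, axis=1).numpy())              # predicted label id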

Visualization

tensorboard --logdir logs/
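
TensorBoard serves on http://localhost:6006 by default. Because the train and test writers log to sibling directories under logs/<timestamp>/, the two loss and accuracy curves show up as separate runs that can be overlaid on the same charts.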
