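NER (BiLSTM+CRF, Keras)

The dataset is the BosonNLP named-entity recognition dataset.

The code currently runs end to end; optimization will follow later.

Project repository: https://github.com/cyandn/practice/tree/master/NER

Steps:

Data preprocessing: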

def data_process():
    # Punctuation that ends a sample. Note that '……' is two characters,
    # so it never matches the single character tested below.
    zh_punctuation = ['，', '。', '？', '；', '！', '……']
    with open('data/BosonNLP_NER_6C_process.txt', 'w', encoding='utf8') as fw:
        with open('data/BosonNLP_NER_6C.txt', encoding='utf8') as fr:
            for line in fr:
                # Remove all whitespace and any literal "\n" sequences in the text
                line = ''.join(line.split()).replace('\\n', '')
                i = 0
                while i < len(line):
                    word = line[i]
                    if word in zh_punctuation:
                        # Punctuation is tagged O and closes the current sample
                        fw.write(word + '/O')
                        fw.write('\n')
                        i += 1
                        continue
                    if word == '{':
                        # Entities are marked up as {{type:text}}; skip the opening "{{"
                        i += 2
                        temp = ''
                        while line[i] != '}':
                            temp += line[i]
                            i += 1
                        i += 2  # skip the closing "}}"
                        # Split on the first colon only, so the entity text may contain ':'
                        etype, entity = temp.split(':', 1)
                        fw.write(entity[0] + '/B_' + etype + ' ')
                        for item in entity[1:]:
                            fw.write(item + '/I_' + etype + ' ')
                    else:
                        fw.write(word + '/O ')
                        i += 1
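
To make the output format concrete, here is a minimal sketch of the same tagging scheme applied to a single string held in memory. The helper name `tag_markup` and the sample sentence are hypothetical (the sentence is not taken from the dataset), and the sketch uses `str.index` rather than the character-by-character scan above.

```python
def tag_markup(text):
    """Convert '{{type:entity}}' markup into character-level 'char/TAG' tokens."""
    tokens = []
    i = 0
    while i < len(text):
        if text.startswith('{{', i):
            end = text.index('}}', i)  # position of the closing braces
            etype, entity = text[i + 2:end].split(':', 1)
            tokens.append(entity[0] + '/B_' + etype)  # first character opens the entity
            tokens.extend(ch + '/I_' + etype for ch in entity[1:])
            i = end + 2
        else:
            tokens.append(text[i] + '/O')  # everything else is outside any entity
            i += 1
    return tokens


# Hypothetical sample sentence, not taken from the dataset.
sample = '{{person_name:张三}}在{{location:北京}}。'
print(' '.join(tag_markup(sample)))
# 张/B_person_name 三/I_person_name 在/O 北/B_location 京/I_location 。/O
```

Each character gets exactly one tag, which is what lets the loading step split a line into a character sequence and a label sequence of equal length.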

Loading the data:

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split

import config  # the project's own settings module


def load_data(self):
    maxlen = 0
    with open('data/BosonNLP_NER_6C_process.txt', encoding='utf8') as f:
        for line in f:
            word_list = line.strip().split()
            # Each token is "char/TAG"; split on the last '/' so a '/' character survives
            one_sample, one_label = zip(
                *[word.rsplit('/', 1) for word in word_list])
            maxlen = max(maxlen, len(one_sample))
            one_sample = ' '.join(one_sample)
            one_label = [config.classes[label] for label in one_label]
            self.total_sample.append(one_sample)
            self.total_label.append(one_label)

    # filters='' keeps ASCII punctuation as tokens, so inputs stay aligned with labels
    tok = Tokenizer(filters='')
    tok.fit_on_texts(self.total_sample)
    self.vocabulary = len(tok.word_index) + 1  # +1 for the padding index 0
    self.total_sample = tok.texts_to_sequences(self.total_sample)
    self.total_sample = np.array(pad_sequences(
        self.total_sample, maxlen=maxlen, padding='post', truncating='post'))
    # Add a trailing axis because the CRF layer is used with sparse targets
    self.total_label = np.array(pad_sequences(
        self.total_label, maxlen=maxlen, padding='post', truncating='post'))[:, :, None]
    print('total_sample shape:', self.total_sample.shape)
    print('total_label shape:', self.total_label.shape)

    # Hold out the test set first, then carve the validation set out of the remainder
    X_train, self.X_test, y_train, self.y_test = train_test_split(
        self.total_sample, self.total_label, test_size=config.proportion['test'], random_state=666)
    self.X_train, self.X_val, self.y_train, self.y_val = train_test_split(
        X_train, y_train, test_size=config.proportion['val'], random_state=666)
    print('X_train shape:', self.X_train.shape)
    print('y_train shape:', self.y_train.shape)
    print('X_val shape:', self.X_val.shape)
    print('y_val shape:', self.y_val.shape)
    print('X_test shape:', self.X_test.shape)
    print('y_test shape:', self.y_test.shape)

    del self.total_sample
    del self.total_label
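
The parsing and padding logic can be tried in isolation. The sketch below uses only the standard library; the `classes` mapping is a hypothetical stand-in for `config.classes`, and `pad_post` is an illustrative helper that mimics `pad_sequences(..., padding='post', truncating='post')` for one sequence rather than calling Keras.

```python
# Hypothetical tag-to-id mapping; the real one lives in config.classes.
classes = {'O': 0, 'B_location': 1, 'I_location': 2}


def parse_line(line):
    """Split 'char/TAG' tokens into a character list and a list of label ids."""
    pairs = [token.rsplit('/', 1) for token in line.strip().split()]
    chars, tags = zip(*pairs)
    return list(chars), [classes[tag] for tag in tags]


def pad_post(seq, maxlen, value=0):
    """Right-pad (or right-truncate) a sequence to exactly maxlen items."""
    return (list(seq) + [value] * maxlen)[:maxlen]


chars, labels = parse_line('在/O 北/B_location 京/I_location 。/O')
print(chars)                # ['在', '北', '京', '。']
print(pad_post(labels, 6))  # [0, 1, 2, 0, 0, 0]
```

Splitting with `rsplit('/', 1)` matters when the character itself is a slash: the token `//O` still yields the character `/` and the tag `O`.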

Building the model:

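from keras.layers import LSTM, Bidirectional, Embedding
from keras.models import Sequential
from keras.optimizers import Adam
# The CRF layer, loss and metric come from the keras-contrib package
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy

import config  # the project's own settings module


def build_model(self):
    model = Sequential()
    # mask_zero=True makes downstream layers ignore the padded positions (index 0)
    model.add(Embedding(self.vocabulary, 100, mask_zero=True))
    model.add(Bidirectional(LSTM(64, return_sequences=True)))
    # One CRF output per tag; sparse_target=True accepts integer label ids
    model.add(CRF(len(config.classes), sparse_target=True))
    model.summary()

    opt = Adam(lr=config.hyperparameter['learning_rate'])
    model.compile(opt, loss=crf_loss, metrics=[crf_viterbi_accuracy])
    self.model = model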
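
As background on what the CRF layer adds on top of the BiLSTM: a CRF scores whole tag sequences using per-position emission scores plus tag-to-tag transition scores, and the best sequence is typically found with the Viterbi algorithm. The plain-Python sketch below illustrates that decoding step only. It is not the keras-contrib implementation, and the scores are made up for the example.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path.

    emissions[t][k]   : score of tag k at position t
    transitions[j][k] : score of moving from tag j to tag k
    """
    n_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for k in range(n_tags):
            # Best previous tag for a path that ends in tag k at this position
            best_prev = max(range(n_tags), key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k] + emit[k])
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag
    best = max(range(n_tags), key=lambda k: score[k])
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Made-up scores for tags 0 = O, 1 = B, 2 = I; the O -> I transition is penalised.
emissions = [[1.0, 0.5, 0.0], [0.0, 0.2, 1.0], [1.0, 0.0, 0.0]]
transitions = [[0, 0, -10], [0, 0, 0], [0, 0, 0]]
print(viterbi_decode(emissions, transitions))  # [1, 2, 0]
```

Picking the best tag at each position independently would give O, I, O here, which starts an entity with an I tag. The transition penalty steers the decoder to B, I, O instead.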

Training:

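import os

from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, TensorBoard

import config  # the project's own settings module


def train(self):
    save_dir = os.path.join(os.getcwd(), 'saved_models')
    # Checkpoint names encode the epoch and the validation accuracy
    model_name = '{epoch:03d}_{val_crf_viterbi_accuracy:.4f}.h5'
    os.makedirs(save_dir, exist_ok=True)

    tensorboard = TensorBoard()
    # Save a checkpoint only when the validation accuracy improves
    checkpoint = ModelCheckpoint(os.path.join(save_dir, model_name),
                                 monitor='val_crf_viterbi_accuracy',
                                 mode='max',
                                 save_best_only=True)
    # Multiply the learning rate by 0.2 after 10 epochs without improvement
    lr_reduce = ReduceLROnPlateau(
        monitor='val_crf_viterbi_accuracy', mode='max', factor=0.2, patience=10)

    self.model.fit(self.X_train, self.y_train,
                   batch_size=config.hyperparameter['batch_size'],
                   epochs=config.hyperparameter['epochs'],
                   callbacks=[tensorboard, checkpoint, lr_reduce],
                   validation_data=(self.X_val, self.y_val))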
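
The checkpoint filename template is an ordinary Python format string that is filled in with the epoch number and the logged metrics. The sketch below formats it by hand with made-up metric values to show what the saved files look like and why sorting them by name orders them by epoch.

```python
model_name = '{epoch:03d}_{val_crf_viterbi_accuracy:.4f}.h5'

# Made-up (epoch, validation accuracy) pairs, for illustration only.
history = [(12, 0.95), (1, 0.9012), (7, 0.9345)]
names = [model_name.format(epoch=epoch, val_crf_viterbi_accuracy=acc)
         for epoch, acc in history]
print(sorted(names))
# ['001_0.9012.h5', '007_0.9345.h5', '012_0.9500.h5']
```

Because the epoch is zero-padded to three digits, lexicographic order matches epoch order for up to 999 epochs. With `save_best_only=True`, the last file in that order is therefore the best model seen during the run.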

Evaluation on the test set:

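import os

from keras.models import load_model
# The CRF layer, loss and metric come from the keras-contrib package
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy


def evaluate(self):
    # Only improved models are saved, so the last name in sorted order
    # (the highest epoch prefix) is the best checkpoint
    best_model_name = sorted(os.listdir('saved_models')).pop()
    # The custom CRF layer, loss and metric must be registered to load the model
    self.best_model = load_model(os.path.join('saved_models', best_model_name),
                                 custom_objects={'CRF': CRF,
                                                 'crf_loss': crf_loss,
                                                 'crf_viterbi_accuracy': crf_viterbi_accuracy})
    scores = self.best_model.evaluate(self.X_test, self.y_test)
    print('test loss:', scores[0])
    print('test accuracy:', scores[1])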
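
The evaluation above reports per-character accuracy. To turn predicted tags into actual entities, the B_/I_ tags have to be grouped back into spans. The helper below is a hypothetical sketch of that step and is not part of the original project; it assumes the predicted label ids have already been mapped back to tag names.

```python
def extract_entities(chars, tags):
    """Group B_/I_ tagged characters into (entity_text, entity_type) pairs."""
    entities = []
    current, current_type = '', None
    for ch, tag in zip(chars, tags):
        if tag.startswith('B_'):
            # A B_ tag always starts a new entity, closing any open one
            if current:
                entities.append((current, current_type))
            current, current_type = ch, tag[2:]
        elif tag.startswith('I_') and current and tag[2:] == current_type:
            current += ch  # continue the open entity of the same type
        else:
            # O tag, or an I_ tag that does not continue the open entity
            if current:
                entities.append((current, current_type))
            current, current_type = '', None
    if current:
        entities.append((current, current_type))
    return entities


# Hypothetical prediction for a made-up sentence.
chars = list('张三在北京。')
tags = ['B_person_name', 'I_person_name', 'O', 'B_location', 'I_location', 'O']
print(extract_entities(chars, tags))
# [('张三', 'person_name'), ('北京', 'location')]
```

In this sketch an I_ tag that does not follow a matching B_ tag is simply dropped, which is one of several reasonable ways to handle an inconsistent prediction.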

References:

https://zhuanlan.zhihu.com/p/44042528

https://blog.csdn.net/buppt/article/details/81180361

https://github.com/stephen-v/zh-NER-keras

http://www.voidcn.com/article/p-pykfinyn-bro.html
