Building training data for BERT+LSTM+CRF entity recognition

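1. BERT+LSTM+CRF has recently become a common approach to entity recognition. BERT can serve as a fixed embedding layer, or it can be fine-tuned jointly with the other models. Input to BERT has to follow a certain format: a single sentence is wrapped with "[CLS]" at the start and "[SEP]" at the end, an attention mask is required, and so on. The script in this section uses pad_sequences to truncate and pad the sentences so that every input has the same length. Once the training set is built, download a Chinese pre-trained model, load the model together with its vocabulary (vocab) and configuration, and finally use ALBERT to extract the sentence embeddings. These embeddings can then be combined with other models in a downstream task to train for a specific objective.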
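As a library-free illustration of the input layout described above, the following sketch builds the padded ids and the attention mask for one sentence. `build_input` is a hypothetical helper, not part of the original script; the ids 101, 102 and 0 for [CLS], [SEP] and [PAD] are the ones used in this post, and the token ids are taken from its printed output.

```python
from typing import List, Tuple

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids used in this post's vocabulary


def build_input(token_ids: List[int], max_len: int) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_len long."""
    body = token_ids[: max_len - 2]        # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                  # 1 marks a real token
    n_pad = max_len - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad


# Token ids of "我爱祖国" as they appear in the printed output of this post
ids, mask = build_input([2769, 4263, 4862, 1744], max_len=10)
print(ids)   # [101, 2769, 4263, 4862, 1744, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

Truncating before the special tokens are added keeps [SEP] in place even for sentences longer than `max_len - 2`, which is also why the script passes `MAX_LEN-2` to pad_sequences.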

 import os

 import torch
 from keras.preprocessing.sequence import pad_sequences
 from torch.utils.data import TensorDataset, DataLoader, RandomSampler

 # Project-local modules (config, ALBERT model and tokenizer)
 from configs.base import config
 from model.modeling_albert import BertConfig, BertModel
 from model.tokenization_bert import BertTokenizer

 device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

 MAX_LEN = 10  # final sequence length, including [CLS] and [SEP]

 if __name__ == '__main__':
     bert_config = BertConfig.from_pretrained(str(config['albert_config_path']), share_type='all')
     base_path = os.getcwd()
     VOCAB = base_path + '/configs/vocab.txt'  # your path for model and vocab
     tokenizer = BertTokenizer.from_pretrained(VOCAB)

     # Encode the text
     tag2idx = {'[SOS]': 101, '[EOS]': 102, '[PAD]': 0, 'B_LOC': 1, 'I_LOC': 2, 'O': 3}
     sentences = ['我是中华人民共和国国民', '我爱祖国']
     # One tag per character (the first sentence has 11 characters, so 11 tags)
     tags = ['O O B_LOC I_LOC I_LOC I_LOC I_LOC I_LOC I_LOC O O', 'O O O O']
     tokenized_text = [tokenizer.tokenize(sent) for sent in sentences]

     # Truncate and pad the sequences with pad_sequences.
     # pad_sequences expects 2-D input (a list of samples), not a single sequence;
     # open question: would loading the whole dataset into memory at once be too much?
     input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_text],
                               maxlen=MAX_LEN-2,  # leave room for [CLS] and [SEP]
                               truncating='post',
                               padding='post',
                               value=0)
     tag_ids = pad_sequences([[tag2idx.get(tok) for tok in tag.split()] for tag in tags],
                             maxlen=MAX_LEN-2,
                             padding="post",
                             truncating="post",
                             value=0)

     # BERT needs [CLS] (id 101) before and [SEP] (id 102) after each sentence
     def add_cls_sep(rows, cls_id=101, sep_id=102):
         """Turn each post-padded row into [CLS] tokens [SEP] pad..., length MAX_LEN."""
         wrapped = []
         for row in rows:
             real = [int(x) for x in row if x > 0]  # non-padding ids
             n_pad = len(row) - len(real)           # padding moves behind [SEP]
             wrapped.append([cls_id] + real + [sep_id] + [0] * n_pad)
         return wrapped

     input_ids_cls_sep = add_cls_sep(input_ids)
     # The tag rows get 101/102 ('[SOS]'/'[EOS]' in tag2idx) at the same positions
     tag_ids_cls_sep = add_cls_sep(tag_ids)

     # 1 for real tokens (including [CLS] and [SEP]), 0 for padding
     attention_masks = [[int(tok > 0) for tok in line] for line in input_ids_cls_sep]
     print('---------------------------')
     print('input_ids:{}'.format(input_ids_cls_sep))
     print('tag_ids:{}'.format(tag_ids_cls_sep))
     print('attention_masks:{}'.format(attention_masks))

     # With add_special_tokens=True the tokenizer adds [CLS] and [SEP] at the start and end itself
     # input_ids = torch.tensor([tokenizer.encode('我 是 中 华 人 民 共 和 国 国 民', add_special_tokens=True)])
     # print('input_ids:{}, size:{}'.format(input_ids, len(input_ids)))
     # print('attention_masks:{}, size:{}'.format(attention_masks, len(attention_masks)))

     inputs_tensor = torch.tensor(input_ids_cls_sep)
     tags_tensor = torch.tensor(tag_ids_cls_sep)
     masks_tensor = torch.tensor(attention_masks)
     train_data = TensorDataset(inputs_tensor, tags_tensor, masks_tensor)
     train_sampler = RandomSampler(train_data)
     train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=2)

     model = BertModel.from_pretrained(config['bert_dir'], config=bert_config)
     model.to(device)
     model.eval()

     with torch.no_grad():
         # Notes:
         # 1. With "output_hidden_states": "True" and "output_attentions": "True" set in the config,
         #    the model returns sequence_output, pooled_output, (hidden_states), (attentions),
         #    where the last two cover all layers, so:
         #        all_hidden_states, all_attentions = model(input_ids)[-2:]
         # 2. Without those two settings, only the last layer's outputs are returned:
         #    (sequence_output, pooled_output).
         for index, batch in enumerate(train_dataloader):
             batch = tuple(t.to(device) for t in batch)
             # Unpack in the order the tensors were given to TensorDataset: ids, tags, masks.
             # (Unpacking as ids, masks, tags would pass the tag ids as the attention mask.)
             b_input_ids, b_labels, b_input_mask = batch
             outputs = model(input_ids=b_input_ids, attention_mask=b_input_mask)
             print(len(outputs))
             # hidden_states and attentions of all layers
             all_hidden_states, all_attentions = outputs[-2:]
             print(all_hidden_states[-2].shape)  # shape of the second-to-last layer's hidden states
             print(all_hidden_states[-2])

2. Printed output

input_ids:[[101, 2769, 3221, 704, 1290, 782, 3696, 1066, 1469, 102], [101, 2769, 4263, 4862, 1744, 102, 0, 0, 0, 0]]
tag_ids:[[101, 3, 3, 1, 2, 2, 2, 2, 2, 102], [101, 3, 3, 3, 3, 102, 0, 0, 0, 0]]
attention_masks:[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]
4
torch.Size([2, 10, 768])
tensor([[[-1.1074, -0.0047, 0.4608, ..., -0.1816, -0.6379, 0.2295],
         [-0.1930, -0.4629, 0.4127, ..., -0.5227, -0.2401, -0.1014],
         [ 0.2682, -0.6617, 0.2744, ..., -0.6689, -0.4464, 0.1460],
         ...,
         [-0.1723, -0.7065, 0.4111, ..., -0.6570, -0.3490, -0.5541],
         [-0.2028, -0.7025, 0.3954, ..., -0.6566, -0.3653, -0.5655],
         [-0.2026, -0.6831, 0.3778, ..., -0.6461, -0.3654, -0.5523]],

        [[-1.3166, -0.0052, 0.6554, ..., -0.2217, -0.5685, 0.4270],
         [-0.2755, -0.3229, 0.4831, ..., -0.5839, -0.1757, -0.1054],
         [-1.4941, -0.1436, 0.8720, ..., -0.8316, -0.5213, -0.3893],
         ...,
         [-0.7022, -0.4104, 0.5598, ..., -0.6664, -0.1627, -0.6270],
         [-0.7389, -0.2896, 0.6083, ..., -0.7895, -0.2251, -0.4088],
         [-0.0351, -0.9981, 0.0660, ..., -0.4606, 0.4439, -0.6745]]])
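Before training on data like this, it can help to confirm that the ids, tags and mask line up position by position. The sketch below is only an illustration: `check_alignment` is a hypothetical helper, and it assumes the conventions visible in the printed output above (101/102 at the [CLS]/[SEP] positions of both the id row and the tag row, 0 for padding).

```python
from typing import List

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0
SPECIAL = (CLS_ID, SEP_ID, PAD_ID)


def check_alignment(ids: List[int], tags: List[int], mask: List[int]) -> bool:
    """True if the three rows have equal length and agree on special/pad positions."""
    if not (len(ids) == len(tags) == len(mask)):
        return False
    for tok, tag, m in zip(ids, tags, mask):
        if m != int(tok != PAD_ID):      # mask must be 1 exactly on non-pad tokens
            return False
        if tok in SPECIAL and tag != tok:
            return False                 # special positions carry the same id in the tag row
        if tok not in SPECIAL and tag in SPECIAL:
            return False                 # a real token must carry a real label
    return True


# Values copied from the printed output above
input_ids = [[101, 2769, 3221, 704, 1290, 782, 3696, 1066, 1469, 102],
             [101, 2769, 4263, 4862, 1744, 102, 0, 0, 0, 0]]
tag_ids = [[101, 3, 3, 1, 2, 2, 2, 2, 2, 102],
           [101, 3, 3, 3, 3, 102, 0, 0, 0, 0]]
attention_masks = [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                   [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]

for row in zip(input_ids, tag_ids, attention_masks):
    print(check_alignment(*row))  # True for both samples
```

A check of this kind catches misaligned truncation or padding early, before the rows are turned into tensors.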
