Basic Principles

Loss Function

A (linear-chain) CRF is commonly used for sequence labeling. For an input sequence \(x\) and a label sequence \(y\), define the matching score:

\[s(x,y) = \sum_{i=0}^l T(y_i, y_{i+1}) + \sum_{i=1}^l U(x_i, y_i)
\]

Here \(l\) is the sequence length, and \(T\) and \(U\) are learnable parameters: \(T(y_i, y_{i+1})\) is the transition score for the label being \(y_i\) at step \(i\) and \(y_{i+1}\) at step \(i+1\), and \(U(x_i, y_i)\) is the emission score for input \(x_i\) at step \(i\) being labeled \(y_i\). Note that when computing the transition scores \(T\), the chain of states is \(y_0\rightarrow y_1 \rightarrow \dots \rightarrow y_l \rightarrow y_{l+1}\), because the artificial START_TAG and STOP_TAG labels are added at the two ends.
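To make the formula concrete, here is a minimal sketch with made-up numbers; the tensors U and T, the tag indices, and the helper path_score are all hypothetical, and the transition convention follows the formula, i.e. T[a, b] is the score of moving from tag a to tag b:

import torch

torch.manual_seed(0)
num_tags = 5                         # hypothetical tag set: B, I, O plus START and STOP
START, STOP = 3, 4
U = torch.randn(3, num_tags)         # emission scores for a toy input of length l = 3
T = torch.randn(num_tags, num_tags)  # T[a, b]: score of transitioning from tag a to tag b

def path_score(U, T, tags):
    """s(x, y): transitions along START -> y_1 -> ... -> y_l -> STOP plus emissions."""
    score = T[START, tags[0]]
    for i, t in enumerate(tags):
        score = score + U[i, t]
        if i + 1 < len(tags):
            score = score + T[t, tags[i + 1]]
    return score + T[tags[-1], STOP]

print(path_score(U, T, [0, 1, 2]))   # score of the tag sequence B, I, O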

To avoid the label bias problem, the CRF normalizes globally. Concretely, the probability that input \(x\) has label sequence \(y\) is defined as:

\[P(y|x)=\frac{e^{s(x,y)}}{Z(x)} = \frac{e^{s(x,y)}}{\sum_{\tilde{y}\in Y_x}e^{s(x,\tilde{y})}}
\]

The troublesome part is therefore computing the partition function \(Z(x)\), because it has to sum over every possible label path.
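To see why this is expensive, the sketch below (toy tensors again, purely illustrative, with the START/STOP transitions omitted for brevity) enumerates all \(N^l\) label sequences, which is only feasible for tiny \(N\) and \(l\):

import itertools
import torch

torch.manual_seed(0)
N, l = 3, 4                          # hypothetical: 3 tags (B, I, O), sequence length 4
U = torch.randn(l, N)                # emission scores U[i, tag]
T = torch.randn(N, N)                # transition scores T[a, b]: from tag a to tag b

# Z(x) = sum over all N ** l paths of exp(s(x, y))
scores = []
for path in itertools.product(range(N), repeat=l):
    s = sum(U[i, t] for i, t in enumerate(path))
    s = s + sum(T[a, b] for a, b in zip(path, path[1:]))
    scores.append(s)
log_Z = torch.logsumexp(torch.stack(scores), dim=0)
print(len(scores), log_Z.item())     # 81 paths even for this tiny example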

During training we want to maximize the log-probability of the correct label sequence:

\[\log P(y|x)=\log \frac{e^{s(x,y)}}{Z(x)} = s(x,y) - \log Z(x) = s(x,y) - \log \left(\sum_{\tilde{y}\in Y_x}e^{s(x,\tilde{y})}\right)
\]

Equivalently, we minimize the negative log-likelihood, so the loss function is:

\[-\log P(y|x)= -\log \frac{e^{s(x,y)}}{Z(x)} = \log \left(\sum_{\tilde{y}\in Y_x}e^{s(x,\tilde{y})}\right) - s(x,y)
\]

Computing the Partition Function

Next we discuss how to compute \(Z(x)\). We use the forward algorithm; the pseudocode is as follows:

1. Initialization. For every possible value \(y_2^*\) of \(y_2\), define

\[\alpha_1(y_2^*) = \sum_{y_1^*} \exp(U(x_1, y_1^*) + T(y_1^*, y_2^*))
\]

Here \(y_k\) denotes the label at step \(k\); its range is the tag space (e.g. B, I, O), and a specific value is written \(y_k^*\). \(\alpha_k(y_{k+1}^*)\) can be interpreted as an unnormalized probability at step \(k\). Although only a single label \(y_{k+1}^*\) appears here, the computation is carried out over the whole tag space, once for every possible value of \(y_{k+1}\).

2. For \(k = 2, 3, \dots, l-1\) and every value \(y_{k+1}^*\) of \(y_{k+1}\):

\[\log (\alpha_k(y_{k+1}^*)) = \log \sum_{y_k^*}\exp \left(U(x_k, y_k^*)+T(y_k^*, y_{k+1}^*) + \log(\alpha_{k-1}(y_k^*)) \right)
\]

Here \(y_k^*\) and \(y_{k+1}^*\) are both specific values, so this step costs \(O(N^2)\) per position, where \(N\) is the number of tags.

3. Finally:

\[Z(x) = \sum_{y_l^*} \exp \left(U(x_l, y_l^*) + \log(\alpha_{l-1}(y_l^*)) \right)
\]
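The following is a minimal vectorized sketch of the three steps above, under the same simplification as the pseudocode (START_TAG/STOP_TAG transitions omitted; the full model below adds them back). The forward variable here is indexed by the current tag rather than the next one, which yields the same \(Z(x)\), and the result agrees with brute-force enumeration:

import itertools
import torch

torch.manual_seed(0)
N, l = 3, 4                                # hypothetical toy sizes
U = torch.randn(l, N)                      # emission scores U[k, tag]
T = torch.randn(N, N)                      # transition scores T[prev, next]

# Forward algorithm in log space, O(l * N^2) overall.
log_alpha = U[0]                           # initialization (step 1, in log space)
for k in range(1, l):
    # log_alpha[next] = logsumexp_prev(log_alpha[prev] + T[prev, next]) + U[k, next]  (step 2)
    log_alpha = torch.logsumexp(log_alpha.unsqueeze(1) + T, dim=0) + U[k]
log_Z = torch.logsumexp(log_alpha, dim=0)  # step 3

# Brute-force enumeration for comparison.
brute = torch.logsumexp(torch.stack([
    sum(U[i, t] for i, t in enumerate(p)) + sum(T[a, b] for a, b in zip(p, p[1:]))
    for p in itertools.product(range(N), repeat=l)
]), dim=0)
print(log_Z.item(), brute.item())          # the two values match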

Note that step 2 of the pseudocode is the so-called logsumexp, which can cause trouble: if the exponents are very large, the exponentials overflow. A small trick keeps the computation numerically stable:

\[\log \sum_k \exp(z_k) = \max (\mathbf{z}) + \log \sum_k \exp(z_k - \max(\mathbf{z}))
\]

Proof:

\[\log \sum_k \exp(z_k) = \log \sum_k (\exp(z_k -c) \cdot \exp(c)) = \log[\exp(c) \cdot \sum_k \exp(z_k -c)] = c + \log \sum_k \exp(z_k -c) \qquad \text{with} \ c = \max(\mathbf{z})
\]
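A quick illustration of the trick with made-up numbers (torch.logsumexp applies the same shift internally):

import torch

z = torch.tensor([1000.0, 1000.5, 999.0])

# Naive computation overflows: exp(1000) is inf in float32.
print(torch.log(torch.sum(torch.exp(z))))          # inf

# Stable version: subtract the max before exponentiating, then add it back.
c = z.max()
print(c + torch.log(torch.sum(torch.exp(z - c))))  # ~1001.10

print(torch.logsumexp(z, dim=0))                   # same value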

Implementation

The code below is based on the PyTorch tutorial on Bi-LSTM+CRF. First, import the required modules:

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(1)

To keep the model readable, define a few helper functions first:

def argmax(vec):
    """Return the argmax as a python int."""
    _, idx = torch.max(vec, 1)
    return idx.item()


def prepare_sequence(seq, to_ix):
    """word2id: convert a list of words into a tensor of word indices."""
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


def log_sum_exp(vec):
    """Compute log sum exp in a numerically stable way for the forward algorithm.
    PyTorch and TensorFlow both provide this; the tutorial re-implements it for clarity.
    """
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + \
        torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))

Next, define the full model:

class BiLSTM_CRF(nn.Module):

    def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim):
        super(BiLSTM_CRF, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.tag_to_ix = tag_to_ix
        self.tagset_size = len(tag_to_ix)

        self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2,
                            num_layers=1, bidirectional=True)

        # Map the LSTM output to tag space; this produces the emission scores U in the formulas
        self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)

        # Transition matrix: transitions[i, j] is the score of transitioning *to* tag i *from* tag j
        # tagset_size includes the artificial START_TAG and STOP_TAG
        self.transitions = nn.Parameter(
            torch.randn(self.tagset_size, self.tagset_size))

        # These two constraints forbid transitioning to START_TAG and transitioning from STOP_TAG
        self.transitions.data[tag_to_ix[START_TAG], :] = -10000
        self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000

        self.hidden = self.init_hidden()

    def init_hidden(self):
        """Initialize the LSTM hidden state."""
        return (torch.randn(2, 1, self.hidden_dim // 2),
                torch.randn(2, 1, self.hidden_dim // 2))

    def _forward_alg(self, feats):
        """Compute the partition function Z(x) (in log space) with the forward algorithm."""
        # Corresponds to step 1 of the pseudocode
        init_alphas = torch.full((1, self.tagset_size), -10000.)
        # START_TAG has all of the score.
        init_alphas[0][self.tag_to_ix[START_TAG]] = 0.

        # Wrap in a variable so that we will get automatic backprop
        forward_var = init_alphas

        # The loop of step 2 of the pseudocode: iterate through the sentence
        for feat in feats:
            alphas_t = []  # The forward tensors at this timestep
            for next_tag in range(self.tagset_size):
                # broadcast the emission score: it is the same regardless of
                # the previous tag
                emit_score = feat[next_tag].view(
                    1, -1).expand(1, self.tagset_size)
                # the ith entry of trans_score is the score of transitioning to
                # next_tag from i
                trans_score = self.transitions[next_tag].view(1, -1)
                # The ith entry of next_tag_var is the value for the
                # edge (i -> next_tag) before we do log-sum-exp
                # This is the sum of the three terms in step 2 of the pseudocode
                next_tag_var = forward_var + trans_score + emit_score
                # The forward variable for this tag is log-sum-exp of all the scores.
                alphas_t.append(log_sum_exp(next_tag_var).view(1))
            forward_var = torch.cat(alphas_t).view(1, -1)
        # Step 3 of the pseudocode; the loss needs log Z(x), so this is another logsumexp
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        alpha = log_sum_exp(terminal_var)
        return alpha

    def _get_lstm_features(self, sentence):
        """Run the LSTM to get the hidden state of every token. Any feature
        function could be substituted here; the features returned by the LSTM
        play the role of x in the formulas.
        """
        self.hidden = self.init_hidden()
        embeds = self.word_embeds(sentence).view(len(sentence), 1, -1)
        lstm_out, self.hidden = self.lstm(embeds, self.hidden)
        lstm_out = lstm_out.view(len(sentence), self.hidden_dim)
        lstm_feats = self.hidden2tag(lstm_out)
        return lstm_feats

    def _score_sentence(self, feats, tags):
        """Compute the matching score of a given input sequence and tag sequence,
        i.e. the function s in the formulas."""
        score = torch.zeros(1)
        tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
        for i, feat in enumerate(feats):
            score = score + \
                self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
        score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
        return score

    def _viterbi_decode(self, feats):
        """Viterbi decoding: given the input x and the relevant parameters (emission
        and transition scores), find the most probable tag sequence.
        """
        backpointers = []

        # Initialize the viterbi variables in log space
        init_vvars = torch.full((1, self.tagset_size), -10000.)
        init_vvars[0][self.tag_to_ix[START_TAG]] = 0

        # forward_var at step i holds the viterbi variables for step i-1
        forward_var = init_vvars
        for feat in feats:
            bptrs_t = []  # holds the backpointers for this step
            viterbivars_t = []  # holds the viterbi variables for this step

            for next_tag in range(self.tagset_size):
                # next_tag_var[i] holds the viterbi variable for tag i at the
                # previous step, plus the score of transitioning
                # from tag i to next_tag.
                # We don't include the emission scores here because the max
                # does not depend on them (we add them in below)
                next_tag_var = forward_var + self.transitions[next_tag]
                best_tag_id = argmax(next_tag_var)
                bptrs_t.append(best_tag_id)
                viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
            # Now add in the emission scores, and assign forward_var to the set
            # of viterbi variables we just computed
            forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
            backpointers.append(bptrs_t)

        # Transition to STOP_TAG
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        best_tag_id = argmax(terminal_var)
        path_score = terminal_var[0][best_tag_id]

        # Follow the back pointers to decode the best path.
        best_path = [best_tag_id]
        for bptrs_t in reversed(backpointers):
            best_tag_id = bptrs_t[best_tag_id]
            best_path.append(best_tag_id)
        # Pop off the start tag (we dont want to return that to the caller)
        start = best_path.pop()
        assert start == self.tag_to_ix[START_TAG]  # Sanity check
        best_path.reverse()
        return path_score, best_path

    def neg_log_likelihood(self, sentence, tags):
        """Loss = log Z(x) - s(x, y)."""
        feats = self._get_lstm_features(sentence)
        forward_score = self._forward_alg(feats)
        gold_score = self._score_sentence(feats, tags)
        return forward_score - gold_score

    def forward(self, sentence):
        """Prediction. Note that this is different from _forward_alg:
        given a sentence, it predicts the most likely tag sequence.
        """
        # Get the emission scores from the BiLSTM
        lstm_feats = self._get_lstm_features(sentence)

        # Find the best path, given the features.
        score, tag_seq = self._viterbi_decode(lstm_feats)
        return score, tag_seq

Finally, assemble everything above into a complete runnable example (not explained line by line):

START_TAG = "<START>"
STOP_TAG = "<STOP>"
EMBEDDING_DIM = 5
HIDDEN_DIM = 4

# Make up some training data
training_data = [(
    "the wall street journal reported today that apple corporation made money".split(),
    "B I I I O O O B I O O".split()
), (
    "georgia tech is a university in georgia".split(),
    "B I O O O O B".split()
)]

word_to_ix = {}
for sentence, tags in training_data:
    for word in sentence:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}

model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM)
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Check predictions before training
with torch.no_grad():
    precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
    precheck_tags = torch.tensor([tag_to_ix[t] for t in training_data[0][1]], dtype=torch.long)
    print(model(precheck_sent))

# prepare_sequence was defined among the helper functions above
for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is,
        # turn them into Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long)

        # Step 3. Run our forward pass.
        loss = model.neg_log_likelihood(sentence_in, targets)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss.backward()
        optimizer.step()

# Check predictions after training
with torch.no_grad():
    precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
    print(model(precheck_sent))
# We got it!
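model(precheck_sent) returns the Viterbi score together with a list of tag indices. Below is a small sketch (the ix_to_tag helper is not part of the tutorial) that maps those indices back to tag names:

# Invert tag_to_ix to recover tag names from the predicted indices
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}

with torch.no_grad():
    score, tag_ids = model(prepare_sequence(training_data[0][0], word_to_ix))
    print([ix_to_tag[ix] for ix in tag_ids])  # after training this should recover B I I I O O O B I O O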

References

[1]. https://towardsdatascience.com/implementing-a-linear-chain-conditional-random-field-crf-in-pytorch-16b0b9c4b4ea

[2]. https://zhuanlan.zhihu.com/p/27338210
