Basic Principles

Loss Function

A (linear-chain) CRF is commonly used for sequence labeling. For an input sequence \(x\) and a label sequence \(y\), define the matching score:

\[s(x,y) = \sum_{i=0}^l T(y_i, y_{i+1}) + \sum_{i=1}^l U(x_i, y_i)
\]

Here \(l\) is the sequence length, and \(T\) and \(U\) are learnable parameters: \(T(y_i, y_{i+1})\) is the transition score for the label being \(y_i\) at step \(i\) and \(y_{i+1}\) at step \(i+1\), and \(U(x_i, y_i)\) is the emission score for input \(x_i\) at step \(i\) being labeled \(y_i\). Note that when computing the transition scores \(T\), the chain of states is \(y_0\rightarrow y_1 \rightarrow \dots \rightarrow y_l \rightarrow y_{l+1}\), because the artificial START_TAG and STOP_TAG labels are added at the two ends.
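To make the formula concrete, here is a minimal sketch with made-up numbers; the tensors U and T, the tag indices, and the helper path_score are all hypothetical, and the transition convention follows the formula, i.e. T[a, b] is the score of moving from tag a to tag b:

import torch

torch.manual_seed(0)
num_tags = 5                         # hypothetical tag set: B, I, O plus START and STOP
START, STOP = 3, 4
U = torch.randn(3, num_tags)         # emission scores for a toy input of length l = 3
T = torch.randn(num_tags, num_tags)  # T[a, b]: score of transitioning from tag a to tag b

def path_score(U, T, tags):
    """s(x, y): transitions along START -> y_1 -> ... -> y_l -> STOP plus emissions."""
    score = T[START, tags[0]]
    for i, t in enumerate(tags):
        score = score + U[i, t]
        if i + 1 < len(tags):
            score = score + T[t, tags[i + 1]]
    return score + T[tags[-1], STOP]

print(path_score(U, T, [0, 1, 2]))   # score of the tag sequence B, I, O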

To avoid the label bias problem, the CRF normalizes globally. Concretely, the probability that input \(x\) has label sequence \(y\) is defined as:

\[P(y|x)=\frac{e^{s(x,y)}}{Z(x)} = \frac{e^{s(x,y)}}{\sum_{\tilde{y}\in Y_x}e^{s(x,\tilde{y})}}
\]

The troublesome part is therefore computing the partition function \(Z(x)\), because it has to sum over every possible label path.
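To see why this is expensive, the sketch below (toy tensors again, purely illustrative, with the START/STOP transitions omitted for brevity) enumerates all \(N^l\) label sequences, which is only feasible for tiny \(N\) and \(l\):

import itertools
import torch

torch.manual_seed(0)
N, l = 3, 4                          # hypothetical: 3 tags (B, I, O), sequence length 4
U = torch.randn(l, N)                # emission scores U[i, tag]
T = torch.randn(N, N)                # transition scores T[a, b]: from tag a to tag b

# Z(x) = sum over all N ** l paths of exp(s(x, y))
scores = []
for path in itertools.product(range(N), repeat=l):
    s = sum(U[i, t] for i, t in enumerate(path))
    s = s + sum(T[a, b] for a, b in zip(path, path[1:]))
    scores.append(s)
log_Z = torch.logsumexp(torch.stack(scores), dim=0)
print(len(scores), log_Z.item())     # 81 paths even for this tiny example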

During training we want to maximize the log-probability of the correct label sequence:

\[\log P(y|x)=\log \frac{e^{s(x,y)}}{Z(x)} = s(x,y) - \log Z(x) = s(x,y) - \log \left(\sum_{\tilde{y}\in Y_x}e^{s(x,\tilde{y})}\right)
\]

Equivalently, we minimize the negative log-likelihood, so the loss function is:

\[-\log P(y|x)= -\log \frac{e^{s(x,y)}}{Z(x)} = \log \left(\sum_{\tilde{y}\in Y_x}e^{s(x,\tilde{y})}\right) - s(x,y)
\]

Computing the Partition Function

Next we discuss how to compute \(Z(x)\). We use the forward algorithm; the pseudocode is as follows:

1. Initialization. For every possible value \(y_2^*\) of \(y_2\), define

\[\alpha_1(y_2^*) = \sum_{y_1^*} \exp(U(x_1, y_1^*) + T(y_1^*, y_2^*))
\]

Here \(y_k\) denotes the label at step \(k\); its range is the tag space (e.g. B, I, O), and a specific value is written \(y_k^*\). \(\alpha_k(y_{k+1}^*)\) can be interpreted as an unnormalized probability at step \(k\). Although only a single label \(y_{k+1}^*\) appears here, the computation is carried out over the whole tag space, once for every possible value of \(y_{k+1}\).

2. For \(k = 2, 3, \dots, l-1\) and every value \(y_{k+1}^*\) of \(y_{k+1}\):

\[\log (\alpha_k(y_{k+1}^*)) = \log \sum_{y_k^*}\exp \left(U(x_k, y_k^*)+T(y_k^*, y_{k+1}^*) + \log(\alpha_{k-1}(y_k^*)) \right)
\]

Here \(y_k^*\) and \(y_{k+1}^*\) are both specific values, so this step costs \(O(N^2)\) per position, where \(N\) is the number of tags.

3. Finally:

\[Z(x) = \sum_{y_l^*} \exp \left(U(x_l, y_l^*) + \log(\alpha_{l-1}(y_l^*)) \right)
\]
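The following is a minimal vectorized sketch of the three steps above, under the same simplification as the pseudocode (START_TAG/STOP_TAG transitions omitted; the full model below adds them back). The forward variable here is indexed by the current tag rather than the next one, which yields the same \(Z(x)\), and the result agrees with brute-force enumeration:

import itertools
import torch

torch.manual_seed(0)
N, l = 3, 4                                # hypothetical toy sizes
U = torch.randn(l, N)                      # emission scores U[k, tag]
T = torch.randn(N, N)                      # transition scores T[prev, next]

# Forward algorithm in log space, O(l * N^2) overall.
log_alpha = U[0]                           # initialization (step 1, in log space)
for k in range(1, l):
    # log_alpha[next] = logsumexp_prev(log_alpha[prev] + T[prev, next]) + U[k, next]  (step 2)
    log_alpha = torch.logsumexp(log_alpha.unsqueeze(1) + T, dim=0) + U[k]
log_Z = torch.logsumexp(log_alpha, dim=0)  # step 3

# Brute-force enumeration for comparison.
brute = torch.logsumexp(torch.stack([
    sum(U[i, t] for i, t in enumerate(p)) + sum(T[a, b] for a, b in zip(p, p[1:]))
    for p in itertools.product(range(N), repeat=l)
]), dim=0)
print(log_Z.item(), brute.item())          # the two values match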

Note that step 2 of the pseudocode is the so-called logsumexp, which can cause trouble: if the exponents are very large, the exponentials overflow. A small trick keeps the computation numerically stable:

\[\log \sum_k \exp(z_k) = \max (\mathbf{z}) + \log \sum_k \exp(z_k - \max(\mathbf{z}))
\]

Proof:

\[\log \sum_k \exp(z_k) = \log \sum_k (\exp(z_k -c) \cdot \exp(c)) = \log[\exp(c) \cdot \sum_k \exp(z_k -c)] = c + \log \sum_k \exp(z_k -c) \qquad \text{with} \ c = \max(\mathbf{z})
\]
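A quick illustration of the trick with made-up numbers (torch.logsumexp applies the same shift internally):

import torch

z = torch.tensor([1000.0, 1000.5, 999.0])

# Naive computation overflows: exp(1000) is inf in float32.
print(torch.log(torch.sum(torch.exp(z))))          # inf

# Stable version: subtract the max before exponentiating, then add it back.
c = z.max()
print(c + torch.log(torch.sum(torch.exp(z - c))))  # ~1001.10

print(torch.logsumexp(z, dim=0))                   # same value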

Implementation

The code below is based on the PyTorch tutorial on Bi-LSTM+CRF. First, import the required modules:

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(1)

To keep the model readable, define a few helper functions first:

def argmax(vec):
    """Return the argmax as a python int."""
    _, idx = torch.max(vec, 1)
    return idx.item()


def prepare_sequence(seq, to_ix):
    """word2id: convert a list of words into a tensor of word indices."""
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


def log_sum_exp(vec):
    """Compute log sum exp in a numerically stable way for the forward algorithm.
    PyTorch and TensorFlow both provide this; the tutorial re-implements it for clarity.
    """
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + \
        torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))

Next, define the full model:

class BiLSTM_CRF(nn.Module):

    def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim):
        super(BiLSTM_CRF, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.tag_to_ix = tag_to_ix
        self.tagset_size = len(tag_to_ix)

        self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2,
                            num_layers=1, bidirectional=True)

        # Map the LSTM output to tag space; this produces the emission scores U in the formulas
        self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)

        # Transition matrix: transitions[i, j] is the score of transitioning *to* tag i *from* tag j
        # tagset_size includes the artificial START_TAG and STOP_TAG
        self.transitions = nn.Parameter(
            torch.randn(self.tagset_size, self.tagset_size))

        # These two constraints forbid transitioning to START_TAG and transitioning from STOP_TAG
        self.transitions.data[tag_to_ix[START_TAG], :] = -10000
        self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000

        self.hidden = self.init_hidden()

    def init_hidden(self):
        """Initialize the LSTM hidden state."""
        return (torch.randn(2, 1, self.hidden_dim // 2),
                torch.randn(2, 1, self.hidden_dim // 2))

    def _forward_alg(self, feats):
        """Compute the partition function Z(x) (in log space) with the forward algorithm."""
        # Corresponds to step 1 of the pseudocode
        init_alphas = torch.full((1, self.tagset_size), -10000.)
        # START_TAG has all of the score.
        init_alphas[0][self.tag_to_ix[START_TAG]] = 0.

        # Wrap in a variable so that we will get automatic backprop
        forward_var = init_alphas

        # The loop of step 2 of the pseudocode: iterate through the sentence
        for feat in feats:
            alphas_t = []  # The forward tensors at this timestep
            for next_tag in range(self.tagset_size):
                # broadcast the emission score: it is the same regardless of
                # the previous tag
                emit_score = feat[next_tag].view(
                    1, -1).expand(1, self.tagset_size)
                # the ith entry of trans_score is the score of transitioning to
                # next_tag from i
                trans_score = self.transitions[next_tag].view(1, -1)
                # The ith entry of next_tag_var is the value for the
                # edge (i -> next_tag) before we do log-sum-exp
                # This is the sum of the three terms in step 2 of the pseudocode
                next_tag_var = forward_var + trans_score + emit_score
                # The forward variable for this tag is log-sum-exp of all the scores.
                alphas_t.append(log_sum_exp(next_tag_var).view(1))
            forward_var = torch.cat(alphas_t).view(1, -1)
        # Step 3 of the pseudocode; the loss needs log Z(x), so this is another logsumexp
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        alpha = log_sum_exp(terminal_var)
        return alpha

    def _get_lstm_features(self, sentence):
        """Run the LSTM to get the hidden state of every token. Any feature
        function could be substituted here; the features returned by the LSTM
        play the role of x in the formulas.
        """
        self.hidden = self.init_hidden()
        embeds = self.word_embeds(sentence).view(len(sentence), 1, -1)
        lstm_out, self.hidden = self.lstm(embeds, self.hidden)
        lstm_out = lstm_out.view(len(sentence), self.hidden_dim)
        lstm_feats = self.hidden2tag(lstm_out)
        return lstm_feats

    def _score_sentence(self, feats, tags):
        """Compute the matching score of a given input sequence and tag sequence,
        i.e. the function s in the formulas."""
        score = torch.zeros(1)
        tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
        for i, feat in enumerate(feats):
            score = score + \
                self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
        score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
        return score

    def _viterbi_decode(self, feats):
        """Viterbi decoding: given the input x and the relevant parameters (emission
        and transition scores), find the most probable tag sequence.
        """
        backpointers = []

        # Initialize the viterbi variables in log space
        init_vvars = torch.full((1, self.tagset_size), -10000.)
        init_vvars[0][self.tag_to_ix[START_TAG]] = 0

        # forward_var at step i holds the viterbi variables for step i-1
        forward_var = init_vvars
        for feat in feats:
            bptrs_t = []  # holds the backpointers for this step
            viterbivars_t = []  # holds the viterbi variables for this step

            for next_tag in range(self.tagset_size):
                # next_tag_var[i] holds the viterbi variable for tag i at the
                # previous step, plus the score of transitioning
                # from tag i to next_tag.
                # We don't include the emission scores here because the max
                # does not depend on them (we add them in below)
                next_tag_var = forward_var + self.transitions[next_tag]
                best_tag_id = argmax(next_tag_var)
                bptrs_t.append(best_tag_id)
                viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
            # Now add in the emission scores, and assign forward_var to the set
            # of viterbi variables we just computed
            forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
            backpointers.append(bptrs_t)

        # Transition to STOP_TAG
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        best_tag_id = argmax(terminal_var)
        path_score = terminal_var[0][best_tag_id]

        # Follow the back pointers to decode the best path.
        best_path = [best_tag_id]
        for bptrs_t in reversed(backpointers):
            best_tag_id = bptrs_t[best_tag_id]
            best_path.append(best_tag_id)
        # Pop off the start tag (we dont want to return that to the caller)
        start = best_path.pop()
        assert start == self.tag_to_ix[START_TAG]  # Sanity check
        best_path.reverse()
        return path_score, best_path

    def neg_log_likelihood(self, sentence, tags):
        """Loss = log Z(x) - s(x, y)."""
        feats = self._get_lstm_features(sentence)
        forward_score = self._forward_alg(feats)
        gold_score = self._score_sentence(feats, tags)
        return forward_score - gold_score

    def forward(self, sentence):
        """Prediction. Note that this is different from _forward_alg:
        given a sentence, it predicts the most likely tag sequence.
        """
        # Get the emission scores from the BiLSTM
        lstm_feats = self._get_lstm_features(sentence)

        # Find the best path, given the features.
        score, tag_seq = self._viterbi_decode(lstm_feats)
        return score, tag_seq

Finally, assemble everything above into a complete runnable example (not explained line by line):

START_TAG = "<START>"
STOP_TAG = "<STOP>"
EMBEDDING_DIM = 5
HIDDEN_DIM = 4

# Make up some training data
training_data = [(
    "the wall street journal reported today that apple corporation made money".split(),
    "B I I I O O O B I O O".split()
), (
    "georgia tech is a university in georgia".split(),
    "B I O O O O B".split()
)]

word_to_ix = {}
for sentence, tags in training_data:
    for word in sentence:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}

model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM)
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Check predictions before training
with torch.no_grad():
    precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
    precheck_tags = torch.tensor([tag_to_ix[t] for t in training_data[0][1]], dtype=torch.long)
    print(model(precheck_sent))

# prepare_sequence was defined among the helper functions above
for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is,
        # turn them into Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long)

        # Step 3. Run our forward pass.
        loss = model.neg_log_likelihood(sentence_in, targets)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss.backward()
        optimizer.step()

# Check predictions after training
with torch.no_grad():
    precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
    print(model(precheck_sent))
# We got it!
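model(precheck_sent) returns the Viterbi score together with a list of tag indices. Below is a small sketch (the ix_to_tag helper is not part of the tutorial) that maps those indices back to tag names:

# Invert tag_to_ix to recover tag names from the predicted indices
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}

with torch.no_grad():
    score, tag_ids = model(prepare_sequence(training_data[0][0], word_to_ix))
    print([ix_to_tag[ix] for ix in tag_ids])  # after training this should recover B I I I O O O B I O O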

References

[1]. https://towardsdatascience.com/implementing-a-linear-chain-conditional-random-field-crf-in-pytorch-16b0b9c4b4ea

[2]. https://zhuanlan.zhihu.com/p/27338210
