王树森Transformer学习笔记

Transformer
BERT：Bidirectional Encoder Representations from Transformers
Summary
Reference

Transformer

Transformer是完全由Attention和Self-Attention结构搭建的深度神经网络结构。

其中最为重要的就是Attention和Self-Attention结构。

Attention结构

Attention Layer接收两个输入\(X = [x_1, x_2, x_3, ..., x_m]\)，Decoder的输入为 \(X' = [x_1^{'}, x_2^{'}, x_3^{'}, ...,x_t^{'}]\)，得到一个输出\(C = [c_1, c_2, c_3, ..., c_t]\)，包含三个参数：\(W_Q, W_K, W_V\)。

具体的计算流程为：

首先，使用Encoder的输入来计算Key和Value向量，得到m个k向量和v向量：\(k_{:i} = W_Kx_{:i}, v_{:i} = W_vx_{:i}\)
然后，对Decoder的输入做线性变换，得到t个q向量：\(q_{:j} = W_Qx_{:j}^{'}\)
计算权重：\(\alpha_{:1} = Softmax(K^Tq_{:1})\)
计算Context vector：\(c_{:1} = \alpha_{11}v_{:1} + \alpha_{21}v_{:2} + ...\alpha_{m1}v_{:m} = V\alpha_{:1} = VSoftmax(K^Tq_{:1})\)
用相同的方式计算\(c_2, c_3, ..., c_t\)，得到\(C = [c_1, c_2, ..., c_t]\)

Key：表示待匹配的值，Query表示查找值，这m个\(\alpha_{:j}\)就说明是query（\(q_j\)）和所有key（\([k_{:1}, k_{:2}, ..., k_{:m}]\)）之间的匹配程度。匹配程度越高，权重越大。V是对输入的一个线性变化，使用权重对其进行加权平均得到相关矩阵\(C\)。在Attention+RNN的结构中，是对输入状态进行加权平均，这里\(V\)相当于对\([h_1, h_2, ..., h_m]\)进行线性变换。

Self-Attention结构

Attention结构接收两个输入得到一个输出，Self-Attention结构接收一个输入得到一个输出，如下图所示。中间的计算过程与Attention完全一致。

Multi-head Self-Attention

上述的Self-Attention结构被称为单头Self-Attention（Single-Head Self-Attention）结构，Multi-Head Self-Attention就是将多个Single-Head Self-Attention的结构进行堆叠，结果Concatenate到一块儿。

假如有\(l\)个Single-Head Self-Attention组成一个Multi-Head Self-Attention，Single-Head Self-Attention的输入为\(X = [x_{:1}, x_{:2}, x_{:3}, ..., x_{:m}]\)，输出为\(C = [c_{:1}, c_{:2}, c_{:3}, ..., c_{:m}]\)维度为\(dm\)，

则，Multi-Head Self-Attention的输出维度为\((ld)*m\)，参数量为\(l\)个\(W_Q, W_K, W_V\)即\(3l\)个参数矩阵。

Multi-Head Attention操作一致，就是进行多次相同的操作，将结果Concatenate到一块儿。

BERT：Bidirectional Encoder Representations from Transformers

BERT的提出是为了预训练Transformer的Encoder网络【BERT[4] is for pre-training Transformer's[3] encoder.】，通过两个任务（1）预测被遮挡的单词（2）预测下一个句子，这两个任务不需要人工标注数据，从而达到使用海量数据训练超级大模型的目的。

BERT有两种任务：

Task 1: Predict the masked word，预测被遮挡的单词

输入：the [MASK] sat on the mat

groundTruth：cat

损失函数：交叉熵损失

Task 2: Predict the next sentence，预测下一个句子，判断两句话在文中是否真实相邻

输入：[CLS, first sentence, SEP, second sentence]

输出：true or false

损失函数：交叉熵损失

这样做二分类可以让Encoder学习并强化句子之间的相关性。

好处：

BERT does not need manually labeled data. (Nice, Manual labeling is expensive.)
Use large-scale data, e.g., English Wikipedia (2.5 billion words)
task 1: Randomly mask works(with some tricks)
task 2: 50% of the next sentences are real. (the other 50% are fake.)
BERT将上述两个任务结合起来预训练Transformer模型
想法简单且非常有效

消耗极大【普通人玩不起，但是BERT训练出来的模型参数是公开的，可以拿来使用】：

BERT Base
- 110M parameters
- 16 TPUs, 4 days of training
BERT Large
- 235M parameters
- 64 TPUs, 4days of training

Summary

Transformer：

Transformer is a Seq2Seq model, it has an encoder and a decoder
Transformer model is not RNN
Transfomer is purely based on attention and dense layers(全连接层)
Transformer outperforms all the state-of-the-art RNN models

Attention的发展：

Attention was originally developed for Seq2Seq RNN models[1].
Self-Attention: attention for all the RNN models(not necessarily for Seq2Seq models)[2].
Attention can be used without RNN[3].

Reference

王树森的Transformer模型

[1] Bahdanau, Cho, & Bengio, Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

[2] Cheng, Dong, & Lapata. Long Short-Term Memory-Networks for Machine Reading. In EMNLP, 2016.

[3] Vaswani et al. Attention Is All You Need. In NIPS, 2017.

[4] Devlin, Chang, Lee, and Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In ACL, 2019.