Link of the Paper: https://arxiv.org/abs/1706.03762

Motivation:

  • The inherently sequential nature of recurrent models precludes parallelization within training examples, which becomes critical at longer sequence lengths.
  • Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.

Innovation:

  • The Transformer is the first sequence transduction model relying entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolutions. The Transformer follows an overall encoder-decoder architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1 of the paper, respectively.

    • Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. The authors employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512. (A minimal sketch of this residual sub-layer pattern follows this list.)
    • Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, the authors employ residual connections around each of the sub-layers, followed by layer normalization. They also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
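
A minimal NumPy sketch (not the authors' code) of the sub-layer pattern LayerNorm(x + Sublayer(x)) described above; the toy shapes and the identity sub-layer are illustrative assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's d_model-dimensional vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # Output of each sub-layer: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

# Toy input of shape (sequence_length, d_model); d_model = 512 as in the paper.
d_model = 512
x = np.random.randn(10, d_model)
# Any sub-layer that preserves d_model (self-attention or feed-forward) can be wrapped.
identity_sublayer = lambda h: h      # placeholder sub-layer, purely illustrative
out = residual_sublayer(x, identity_sublayer)
print(out.shape)  # (10, 512)
```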

  • Scaled Dot-Product Attention and Multi-Head Attention. The Transformer uses multi-head attention in three different ways (a sketch of the underlying scaled dot-product attention follows this list):

    • In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [ Google’s neural machine translation system: Bridging the gap between human and machine translation. ].
    • The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder. [ More about Attention Definition ]
    • Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. They need to prevent leftward information flow in the decoder to preserve the auto-regressive property. They implement this inside of scaled dot-product attention by masking out (setting to -∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
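
As a concrete reference, scaled dot-product attention computes Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Below is a minimal NumPy sketch (an illustrative assumption, not the paper's implementation) that also shows the masking, where illegal connections are set to -∞ before the softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # compatibility of each query with each key
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # illegal connections get -inf before the softmax
    weights = softmax(scores, axis=-1)            # normalized attention weights
    return weights @ V                            # weighted sum of the values

# Decoder self-attention on a toy sequence of length 4: position i may only attend to j <= i.
n, d_k, d_v = 4, 64, 64
Q, K, V = np.random.randn(n, d_k), np.random.randn(n, d_k), np.random.randn(n, d_v)
causal_mask = np.tril(np.ones((n, n), dtype=bool))
out = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(out.shape)  # (4, 64)
```

Multi-head attention runs h = 8 such attention functions in parallel on learned linear projections of the queries, keys, and values, and concatenates the results.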

In terms of the encoder-decoder attention, the query Q usually comes from the hidden state of the decoder, while the key K comes from the hidden state of the encoder; the softmax-normalized compatibility of Q with each K determines how much attention the corresponding value V receives in the weighted sum.    -- The Transformer - Attention is all you need.

    

  • Positional Encoding: The authors add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks; the encodings have the same dimension d_model as the embeddings, so the two can be summed. In this work, they use sine and cosine functions of different frequencies (a sketch follows the formulas below):

    • PE(pos, 2i) = sin( pos / 10000^(2i/d_model) )
    • PE(pos, 2i+1) = cos( pos / 10000^(2i/d_model) )
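
A minimal NumPy sketch of these sinusoids (illustrative; the maximum length is an assumed value), where each dimension of the encoding corresponds to a sinusoid of a different frequency:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, (2 * i) / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
# The encoding is simply added to the input embeddings: x = token_embeddings + pe[:seq_len]
print(pe.shape)  # (50, 512)
```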

Improvement:

  • Position-wise Feed-Forward Networks: In addition to attention sub-layers, each of the layers in the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between: FFN(x) = max(0, xW1 + b1)W2 + b2, where the input and output have dimension d_model = 512 and the inner layer has dimension d_ff = 2048 (a sketch is given below).
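
A minimal NumPy sketch of this position-wise feed-forward network (the random weights are placeholders, purely illustrative):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
W1 = np.random.randn(d_model, d_ff) * 0.02; b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * 0.02; b2 = np.zeros(d_model)

x = np.random.randn(10, d_model)          # (sequence_length, d_model)
out = position_wise_ffn(x, W1, b1, W2, b2)
print(out.shape)                           # (10, 512)
```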

General Points:

  • An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
  • Why Self-Attention:
    • One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
    • The third is the path length between long-range dependencies in the network: the shorter the paths between any two positions, the easier it is to learn long-range dependencies (see the comparison below).
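
For reference, Table 1 of the paper compares the three criteria across layer types (n = sequence length, d = representation dimension, k = convolution kernel size):

    Layer Type      | Complexity per Layer | Sequential Operations | Maximum Path Length
    Self-Attention  | O(n^2 · d)           | O(1)                  | O(1)
    Recurrent       | O(n · d^2)           | O(n)                  | O(n)
    Convolutional   | O(k · n · d^2)       | O(1)                  | O(log_k(n))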
