Transformer source code
At training time:
1. The ground-truth target tokens are fed in and the whole output sequence is decoded in a single pass (teacher forcing).
At inference time:
1. In the first step a single token is fed in and one word is decoded.
In the second step, the first input together with the word just decoded is fed in, decoding the third token; decoding proceeds step by step like this until the maximum length or a <pad> symbol is reached, and then it stops.
Feeding the full target at training time is equivalent to feeding tokens one by one at inference time, because the causal mask prevents each position from seeing the future. A toy sketch of the inference loop follows.
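A toy Python sketch of that autoregressive inference loop (illustration only; decode_step is a hypothetical stand-in for one forward pass of the trained decoder):

def greedy_decode(decode_step, start_id, end_id, maxlen):
    # decode_step takes the tokens decoded so far and returns the next token id;
    # in the real model this would be one forward pass through the decoder.
    ys = [start_id]
    for _ in range(maxlen):
        next_id = decode_step(ys)
        ys.append(next_id)
        if next_id == end_id:  # stop at the end / <pad> symbol
            break
    return ys

# Toy decode_step for illustration only: counts up and then emits 0 (= <pad>).
print(greedy_decode(lambda ys: ys[-1] + 1 if ys[-1] < 4 else 0,
                    start_id=1, end_id=0, maxlen=10))  # -> [1, 2, 3, 4, 0]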
1. Word embeddings need to be passed in
def __init__(self, hp):
    self.hp = hp
    self.token2idx, self.idx2token = load_vocab(hp.vocab)  # in a real application, pass in your own vocabulary here
    self.embeddings = get_token_embeddings(self.hp.vocab_size, self.hp.d_model, zero_pad=True)  # the author trains the embedding matrix as a model variable; in production we could use word2vec, BERT, etc. instead
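For reference, a minimal sketch of what such a zero-padded embedding table can look like in TensorFlow 1.x (this is an assumption about get_token_embeddings, not necessarily the author's exact code; the scope name and initializer are illustrative):

import tensorflow as tf  # TensorFlow 1.x

def get_token_embeddings_sketch(vocab_size, d_model, zero_pad=True):
    # Trainable (vocab_size, d_model) embedding matrix; with zero_pad the row
    # for id 0 (<pad>) is forced to zeros so padding contributes nothing.
    with tf.variable_scope("shared_weight_matrix"):
        embeddings = tf.get_variable("weight_mat",
                                     shape=[vocab_size, d_model],
                                     dtype=tf.float32,
                                     initializer=tf.random_normal_initializer(stddev=0.01))
        if zero_pad:
            embeddings = tf.concat((tf.zeros(shape=[1, d_model]),
                                    embeddings[1:, :]), axis=0)
    return embeddings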
2. positional_encoding
def positional_encoding(inputs,
                        num_units,
                        zero_pad=True,
                        scale=True,
                        scope="positional_encoding",
                        reuse=None):
    '''Sinusoidal Positional Encoding.

    Args:
      inputs: A 2d Tensor with shape of (N, T).
      num_units: Output dimensionality.
      zero_pad: Boolean. If True, all the values of the first row (id = 0) should be constant zeros.
      scale: Boolean. If True, the output will be multiplied by sqrt(num_units) (check details in the paper).
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer by the same name.

    Returns:
      A 'Tensor' with one more rank than the inputs', whose last dimensionality is 'num_units'.
    '''
    N, T = inputs.get_shape().as_list()
    with tf.variable_scope(scope, reuse=reuse):
        position_ind = tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1])

        # First part of the PE function: the sin and cos argument
        position_enc = np.array([
            [pos / np.power(10000, 2.*i/num_units) for i in range(num_units)]
            for pos in range(T)])

        # Second part: apply sin to the even columns and cos to the odd ones
        position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # dim 2i
        position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # dim 2i+1

        # Convert to a tensor
        lookup_table = tf.convert_to_tensor(position_enc)

        if zero_pad:
            lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                      lookup_table[1:, :]), 0)
        outputs = tf.nn.embedding_lookup(lookup_table, position_ind)

        if scale:
            outputs = outputs * num_units**0.5

        return outputs
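Since the sinusoid table is built with plain NumPy, the same construction can be run standalone to inspect the values (toy sizes; this mirrors the table computed inside positional_encoding above):

import numpy as np

T, num_units = 5, 8  # toy sequence length and model dimension
position_enc = np.array([
    [pos / np.power(10000, 2. * i / num_units) for i in range(num_units)]
    for pos in range(T)])
position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # even dims: sin
position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # odd dims: cos
print(position_enc.shape)  # (5, 8)
print(position_enc[0])     # position 0: sin terms are 0, cos terms are 1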
3. multihead_attention
def multihead_attention(queries,
                        keys,
                        num_units=None,
                        num_heads=8,
                        dropout_rate=0,
                        is_training=True,
                        causality=False,
                        scope="multihead_attention",
                        reuse=None):
    '''Applies multihead attention.

    Args:
      queries: A 3d tensor with shape of [N, T_q, C_q].
      keys: A 3d tensor with shape of [N, T_k, C_k].
      num_units: A scalar. Attention size.
      dropout_rate: A floating point number.
      is_training: Boolean. Controller of mechanism for dropout.
      causality: Boolean. If true, units that reference the future are masked.
      num_heads: An int. Number of heads.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer by the same name.

    Returns:
      A 3d tensor with shape of (N, T_q, C)
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Set the fallback option for num_units
        if num_units is None:
            num_units = queries.get_shape().as_list()[-1]

        # Linear projections
        Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu)  # (N, T_q, C); C is num_units, which is not set here and so equals C_q
        K = tf.layers.dense(keys, num_units, activation=tf.nn.relu)     # (N, T_k, C)
        V = tf.layers.dense(keys, num_units, activation=tf.nn.relu)     # (N, T_k, C)

        # Split and concat
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0)  # (h*N, T_q, C/h)
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h)
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h)

        # Multiplication
        outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))  # (h*N, T_q, T_k)

        # Scale
        outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)

        # Key Masking
        key_masks = tf.sign(tf.reduce_sum(tf.abs(keys), axis=-1))  # (N, T_k)
        key_masks = tf.tile(key_masks, [num_heads, 1])  # (h*N, T_k)
        key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1])  # (h*N, T_q, T_k)

        # paddings has the same shape as outputs and is filled with a very small value.
        # tf.where then compares: wherever key_masks is 0, i.e. the key is padding and must
        # be masked, the corresponding attention score in outputs is replaced by that very
        # small value; otherwise the original score is kept. After this key mask, outputs
        # still has shape (h*N, T_q, T_k); only the scores of masked keys have become tiny.
        paddings = tf.ones_like(outputs) * (-2**32 + 1)
        outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)  # (h*N, T_q, T_k)

        # Causality = Future blinding
        if causality:  # whether to block information from the future
            diag_vals = tf.ones_like(outputs[0, :, :])  # (T_q, T_k)
            tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense()  # (T_q, T_k)
            masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])  # (h*N, T_q, T_k)

            paddings = tf.ones_like(masks) * (-2**32 + 1)
            outputs = tf.where(tf.equal(masks, 0), paddings, outputs)  # (h*N, T_q, T_k)

        # Activation
        outputs = tf.nn.softmax(outputs)  # (h*N, T_q, T_k)

        # Query Masking
        query_masks = tf.sign(tf.reduce_sum(tf.abs(queries), axis=-1))  # (N, T_q)
        query_masks = tf.tile(query_masks, [num_heads, 1])  # (h*N, T_q)
        query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]])  # (h*N, T_q, T_k)
        outputs *= query_masks  # broadcasting. (h*N, T_q, T_k); the upstream comment said (N, T_q, C), which is wrong

        # Dropouts
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))

        # Weighted sum
        outputs = tf.matmul(outputs, V_)  # (h*N, T_q, C/h)

        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2)  # (N, T_q, C)

        # Residual connection
        outputs += queries

        # Normalize
        outputs = normalize(outputs)  # (N, T_q, C)

        return outputs
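A tiny NumPy illustration of the causality (future-blinding) mask used above, separate from the TensorFlow code (toy scores for a single head):

import numpy as np

T = 4
scores = np.random.randn(T, T)                  # raw attention scores, (T_q, T_k)
tril = np.tril(np.ones((T, T)))                 # lower-triangular mask: 1 = allowed
masked = np.where(tril == 0, -2.0**32 + 1, scores)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
print(np.round(weights, 3))  # upper-triangle entries are ~0: a query cannot attend to future keys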
4. feedforward
def feedforward(inputs,
                num_units=[2048, 512],
                scope="multihead_attention",
                reuse=None):
    '''Point-wise feed forward net.

    Args:
      inputs: A 3d tensor with shape of [N, T, C].
      num_units: A list of two integers.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer by the same name.

    Returns:
      A 3d tensor with the same shape and dtype as inputs
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Inner layer
        params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1,
                  "activation": tf.nn.relu, "use_bias": True}
        outputs = tf.layers.conv1d(**params)

        # Readout layer
        params = {"inputs": outputs, "filters": num_units[1], "kernel_size": 1,
                  "activation": None, "use_bias": True}
        outputs = tf.layers.conv1d(**params)

        # Residual connection
        outputs += inputs

        # Normalize
        outputs = normalize(outputs)

        return outputs
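Because kernel_size is 1, each conv1d above is just a dense layer applied independently at every time position. A minimal NumPy sketch of that position-wise computation (illustrative shapes only, not the TF code):

import numpy as np

N, T, C, d_ff = 2, 5, 4, 8
x = np.random.randn(N, T, C)
W1, b1 = np.random.randn(C, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, C), np.zeros(C)

hidden = np.maximum(x @ W1 + b1, 0)  # inner layer: ReLU, applied at every position independently
out = hidden @ W2 + b2               # readout layer back to C channels
print(out.shape)                     # (2, 5, 4) -- same (N, T, C) as the input, so the residual add works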
5. normalize
def normalize(inputs,
              epsilon=1e-8,
              scope="ln",
              reuse=None):
    '''Applies layer normalization.

    Args:
      inputs: A tensor with 2 or more dimensions, where the first dimension has `batch_size`.
      epsilon: A floating point number. A very small number for preventing ZeroDivision Error.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer by the same name.

    Returns:
      A tensor with the same shape and data dtype as `inputs`.
    '''
    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]

        mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
        beta = tf.Variable(tf.zeros(params_shape))
        gamma = tf.Variable(tf.ones(params_shape))
        normalized = (inputs - mean) / ((variance + epsilon) ** .5)
        outputs = gamma * normalized + beta

        return outputs
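A quick NumPy check of what this layer normalization does (with gamma = 1 and beta = 0): every (batch, time) position ends up with roughly zero mean and unit variance over the channel axis:

import numpy as np

x = np.random.randn(2, 3, 6)               # (N, T, C)
mean = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True)
normed = (x - mean) / np.sqrt(var + 1e-8)  # gamma = 1, beta = 0
print(np.round(normed.mean(axis=-1), 6))   # ~0 for every position
print(np.round(normed.std(axis=-1), 3))    # ~1 for every position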
6. encoder-decoder
with tf.variable_scope("encoder"):
## Embedding
self.enc = embedding(self.x,
vocab_size=len(de2idx),
num_units=hp.hidden_units,
scale=True,
scope="enc_embed")
# key_masks = tf.expand_dims(tf.sign(tf.reduce_sum(tf.abs(self.enc), axis=-1)), -1)
## Positional Encoding
if hp.sinusoid:
self.enc += tf.cast(positional_encoding(self.x,
num_units=hp.hidden_units,
zero_pad=False,
scale=False,
scope="enc_pe"), tf.float32)
else:
self.enc += embedding(tf.tile(tf.expand_dims(tf.range(tf.shape(self.x)[1]), 0), [tf.shape(self.x)[0], 1]),
vocab_size=hp.maxlen,
num_units=hp.hidden_units,
zero_pad=False,
scale=False,
scope="enc_pe")
# self.enc *= key_masks
## Dropout
self.enc = tf.layers.dropout(self.enc,
rate=hp.dropout_rate,
training=tf.convert_to_tensor(is_training))
## Blocks
for i in range(hp.num_blocks):
with tf.variable_scope("num_blocks_{}".format(i)):
### Multihead Attention
self.enc = multihead_attention(queries=self.enc,
keys=self.enc,
num_units=hp.hidden_units,
num_heads=hp.num_heads,
dropout_rate=hp.dropout_rate,
is_training=is_training,
causality=False)
### Feed Forward
self.enc = feedforward(self.enc, num_units=[4*hp.hidden_units, hp.hidden_units])
# Decoder
with tf.variable_scope("decoder"):
    ## Embedding
    self.dec = embedding(self.decoder_inputs,
                         vocab_size=len(en2idx),
                         num_units=hp.hidden_units,
                         scale=True,
                         scope="dec_embed")
    self.dec_ = self.dec
    # key_masks = tf.expand_dims(tf.sign(tf.reduce_sum(tf.abs(self.dec), axis=-1)), -1)

    ## Positional Encoding
    if hp.sinusoid:
        self.dec += tf.cast(positional_encoding(self.decoder_inputs,
                                                num_units=hp.hidden_units,
                                                zero_pad=False,
                                                scale=False,
                                                scope="dec_pe"), tf.float32)
    else:
        self.dec += embedding(tf.tile(tf.expand_dims(tf.range(tf.shape(self.decoder_inputs)[1]), 0), [tf.shape(self.decoder_inputs)[0], 1]),
                              vocab_size=hp.maxlen,
                              num_units=hp.hidden_units,
                              zero_pad=False,
                              scale=False,
                              scope="dec_pe")
    # self.dec *= key_masks

    ## Dropout
    self.dec = tf.layers.dropout(self.dec,
                                 rate=hp.dropout_rate,
                                 training=tf.convert_to_tensor(is_training))

    ## Blocks
    for i in range(hp.num_blocks):
        with tf.variable_scope("num_blocks_{}".format(i)):
            ## Multihead Attention (self-attention)
            self.dec = multihead_attention(queries=self.dec,
                                           keys=self.dec,
                                           num_units=hp.hidden_units,
                                           num_heads=hp.num_heads,
                                           dropout_rate=hp.dropout_rate,
                                           is_training=is_training,
                                           causality=True,
                                           scope="self_attention")

            ## Multihead Attention (vanilla attention)
            self.dec = multihead_attention(queries=self.dec,
                                           keys=self.enc,
                                           num_units=hp.hidden_units,
                                           num_heads=hp.num_heads,
                                           dropout_rate=hp.dropout_rate,
                                           is_training=is_training,
                                           causality=False,
                                           scope="vanilla_attention")

            ## Feed Forward
            self.dec = feedforward(self.dec, num_units=[4*hp.hidden_units, hp.hidden_units])

# Final linear projection
self.logits = tf.layers.dense(self.dec, len(en2idx))
self.preds = tf.to_int32(tf.arg_max(self.logits, dimension=-1))
self.istarget = tf.to_float(tf.not_equal(self.y, 0))
self.acc = tf.reduce_sum(tf.to_float(tf.equal(self.preds, self.y)) * self.istarget) / (tf.reduce_sum(self.istarget))
tf.summary.scalar('acc', self.acc)
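The istarget mask above excludes <pad> positions (id 0) from the accuracy. A small NumPy example of the same masked-accuracy computation:

import numpy as np

y     = np.array([[4, 7, 2, 0, 0]])      # 0 is the <pad> id
preds = np.array([[4, 9, 2, 5, 0]])
istarget = (y != 0).astype(np.float32)   # 1 for real tokens, 0 for padding
acc = ((preds == y).astype(np.float32) * istarget).sum() / istarget.sum()
print(acc)  # 2 of the 3 non-pad tokens match -> 0.666...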
7. train
if is_training:
    # Loss
    self.y_smoothed = label_smoothing(tf.one_hot(self.y, depth=len(en2idx)))
    self.loss = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=self.y_smoothed)
    self.mean_loss = tf.reduce_sum(self.loss * self.istarget) / (tf.reduce_sum(self.istarget))

    # Training Scheme
    self.global_step = tf.Variable(0, name='global_step', trainable=False)
    self.optimizer = tf.train.AdamOptimizer(learning_rate=hp.lr, beta1=0.9, beta2=0.98, epsilon=1e-8)
    self.train_op = self.optimizer.minimize(self.mean_loss, global_step=self.global_step)

    # Summary
    tf.summary.scalar('mean_loss', self.mean_loss)
    self.merged = tf.summary.merge_all()
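label_smoothing is called above but not shown; a plausible sketch (an assumption, not necessarily the author's exact helper) mixes the one-hot targets with a uniform distribution over the vocabulary:

import tensorflow as tf  # TensorFlow 1.x

def label_smoothing(inputs, epsilon=0.1):
    # Assumed implementation of the label-smoothing helper used above:
    # soften the one-hot targets by blending in a uniform distribution over K classes.
    K = inputs.get_shape().as_list()[-1]  # number of classes (vocabulary size)
    return ((1 - epsilon) * inputs) + (epsilon / K)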