【tf.keras】AdamW: Adam with Weight decay

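The paper Decoupled Weight Decay Regularization points out that, for Adam, L2 regularization and weight decay are not equivalent, and it proposes AdamW. According to the paper, when a neural network needs a regularization term, replacing Adam + L2 with AdamW gives better performance.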
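To make the difference concrete, here is a minimal sketch of a single update on one scalar weight, written in plain Python. The function names (adam_step, adam_l2_step, adamw_step) are hypothetical and are not part of any library. The sketch also assumes the decay term is applied as wd * w without being multiplied by the learning rate; other formulations scale it differently.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar weight; returns (new_w, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(w, grad, m, v, t, lr=1e-3, l2=1e-4):
    # L2 regularization: the penalty gradient l2 * w is added to the loss
    # gradient, so it is rescaled by Adam's adaptive denominator as well.
    return adam_step(w, grad + l2 * w, m, v, t, lr=lr)


def adamw_step(w, grad, m, v, t, lr=1e-3, wd=1e-4):
    # Decoupled weight decay: Adam only sees the loss gradient, and the
    # weight is shrunk separately (assumption: decay is not scaled by lr).
    new_w, m, v = adam_step(w, grad, m, v, t, lr=lr)
    return new_w - wd * w, m, v


if __name__ == "__main__":
    w_l2, _, _ = adam_l2_step(1.0, 0.5, 0.0, 0.0, t=1)
    w_dw, _, _ = adamw_step(1.0, 0.5, 0.0, 0.0, t=1)
    print("Adam + L2:", w_l2)
    print("AdamW:    ", w_dw)
```

On the very first step Adam normalizes the gradient to roughly unit size, so the L2 term barely changes the update, while the decoupled decay still shrinks the weight by wd * w.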

For TensorFlow 2.x, AdamW is implemented in the tensorflow_addons library. You can install it with pip install tensorflow_addons (on Windows this requires TF 2.1), or download the repository and use it directly.

Below is an example program that uses AdamW (TF 2.0, tf.keras) together with learning rate decay. (In this program AdamW performs worse than Adam. The model is fairly simple, so the extra regularization hurts performance.)

import os

import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import Callback
from tensorflow_addons.optimizers import AdamW


def lr_schedule(epoch):
    """Learning Rate Schedule

    Learning rate is scheduled to be reduced after 20, 30 epochs.
    Called automatically every epoch as part of callbacks during training.

    # Arguments
        epoch (int): index of the current epoch

    # Returns
        lr (float): learning rate
    """
    lr = 1e-3
    if epoch >= 30:
        lr *= 1e-2
    elif epoch >= 20:
        lr *= 1e-1
    print('Learning rate: ', lr)
    return lr


def wd_schedule(epoch):
    """Weight Decay Schedule

    Weight decay is scheduled to be reduced after 20, 30 epochs.
    Called automatically every epoch as part of callbacks during training.

    # Arguments
        epoch (int): index of the current epoch

    # Returns
        wd (float): weight decay
    """
    wd = 1e-4
    if epoch >= 30:
        wd *= 1e-2
    elif epoch >= 20:
        wd *= 1e-1
    print('Weight decay: ', wd)
    return wd


# Adapted from the implementation of LearningRateScheduler,
# with the learning rate replaced by weight decay.
class WeightDecayScheduler(Callback):
    """Weight Decay Scheduler.

    Arguments:
        schedule: a function that takes an epoch index as input
            (integer, indexed from 0) and returns a new
            weight decay as output (float).
        verbose: int. 0: quiet, 1: update messages.

    ```python
    # This function keeps the weight decay at 0.001 for the first ten epochs
    # and decreases it exponentially after that.
    def scheduler(epoch):
      if epoch < 10:
        return 0.001
      else:
        # Return a Python float; a tensor would fail the type check below.
        return 0.001 * float(np.exp(0.1 * (10 - epoch)))

    callback = WeightDecayScheduler(scheduler)
    model.fit(data, labels, epochs=100, callbacks=[callback],
              validation_data=(val_data, val_labels))
    ```
    """

    def __init__(self, schedule, verbose=0):
        super().__init__()
        self.schedule = schedule
        self.verbose = verbose

    def on_epoch_begin(self, epoch, logs=None):
        if not hasattr(self.model.optimizer, 'weight_decay'):
            raise ValueError('Optimizer must have a "weight_decay" attribute.')
        try:  # new API: schedule(epoch, current_weight_decay)
            weight_decay = float(K.get_value(self.model.optimizer.weight_decay))
            weight_decay = self.schedule(epoch, weight_decay)
        except TypeError:  # old API: schedule(epoch)
            weight_decay = self.schedule(epoch)
        if not isinstance(weight_decay, (float, np.float32, np.float64)):
            raise ValueError('The output of the "schedule" function '
                             'should be float.')
        K.set_value(self.model.optimizer.weight_decay, weight_decay)
        if self.verbose > 0:
            print('\nEpoch %05d: WeightDecayScheduler reducing weight '
                  'decay to %s.' % (epoch + 1, weight_decay))

    def on_epoch_end(self, epoch, logs=None):
        # Record the current weight decay so it shows up in the logs.
        logs = logs or {}
        logs['weight_decay'] = K.get_value(self.model.optimizer.weight_decay)


if __name__ == '__main__':
    # Select which GPU to use; adjust or remove this line for your machine.
    os.environ["CUDA_VISIBLE_DEVICES"] = '1'

    # Allocate GPU memory on demand instead of reserving it all up front.
    gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, enable=True)
    print(gpus)

    cifar10 = tf.keras.datasets.cifar10
    (x_train, y_train), (x_test, y_test) = cifar10.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(16, (3, 3), padding='same', activation='relu', input_shape=(32, 32, 3)),
        tf.keras.layers.AveragePooling2D(),
        tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
        tf.keras.layers.AveragePooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

    # Initial values come from the schedules at epoch 0.
    optimizer = AdamW(learning_rate=lr_schedule(0), weight_decay=wd_schedule(0))
    # optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

    tb_callback = tf.keras.callbacks.TensorBoard(os.path.join('logs', 'adamw'),
                                                 profile_batch=0)
    lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)
    wd_callback = WeightDecayScheduler(wd_schedule)

    model.compile(optimizer=optimizer,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    model.fit(x_train, y_train, epochs=40, validation_split=0.1,
              callbacks=[tb_callback, lr_callback, wd_callback])

    model.evaluate(x_test, y_test, verbose=2)

The code above uses AdamW together with learning rate decay, although the decay can only be applied at epoch granularity.
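If finer granularity is wanted, one possible direction is to compute a decay multiplier per training step instead of per epoch. The sketch below is plain Python and only computes the multiplier; global_step and cosine_factor are hypothetical helper names, and the cosine shape is just one illustrative choice, not something the original program uses.

```python
import math


def global_step(epoch, batch, steps_per_epoch):
    """Index of the current training step, counted from the start of training."""
    return epoch * steps_per_epoch + batch


def cosine_factor(step, total_steps):
    """Multiplier that decays smoothly from 1.0 at step 0 to 0.0 at total_steps."""
    progress = min(max(step, 0), total_steps) / total_steps
    return 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    steps_per_epoch, epochs = 100, 40
    total = steps_per_epoch * epochs
    for epoch, batch in [(0, 0), (20, 0), (39, 99)]:
        step = global_step(epoch, batch, steps_per_epoch)
        print(step, cosine_factor(step, total))
```

Inside a tf.keras callback, such a multiplier could be applied to both the learning rate and the weight decay in on_train_batch_begin with K.set_value, in the same way WeightDecayScheduler does it once per epoch.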

When using AdamW with learning rate decay, the weight_decay value must be decayed by the same schedule as the learning rate; otherwise training breaks down.
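One way to guarantee that the two schedules never drift apart is to derive both from a single decay multiplier. The following is a minimal sketch under that assumption; decay_factor, shared_lr_schedule and shared_wd_schedule are hypothetical names, and the base values and epoch thresholds simply mirror the ones used in the program above.

```python
BASE_LR = 1e-3  # initial learning rate, as in lr_schedule above
BASE_WD = 1e-4  # initial weight decay, as in wd_schedule above


def decay_factor(epoch):
    """Shared multiplier: 1.0, then 0.1 from epoch 20, then 0.01 from epoch 30."""
    if epoch >= 30:
        return 1e-2
    if epoch >= 20:
        return 1e-1
    return 1.0


def shared_lr_schedule(epoch):
    return BASE_LR * decay_factor(epoch)


def shared_wd_schedule(epoch):
    return BASE_WD * decay_factor(epoch)


if __name__ == "__main__":
    for epoch in (0, 20, 30):
        print(epoch, shared_lr_schedule(epoch), shared_wd_schedule(epoch))
```

Because both functions take a single epoch argument and return a Python float, they could be passed to LearningRateScheduler and WeightDecayScheduler in place of lr_schedule and wd_schedule.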

References

How to use AdamW correctly? -- wuliytTaotao

Loshchilov, I., & Hutter, F. Decoupled Weight Decay Regularization. ICLR 2019. Retrieved from http://arxiv.org/abs/1711.05101
