梯度修剪

梯度修剪主要避免训练梯度爆炸的问题,一般来说使用了 Batch Normalization 就不必要使用梯度修剪了,但还是有必要理解下实现的

In TensorFlow, the optimizer’s minimize() function takes care of both computing the gradients and applying them, so you must instead call the optimizer’s compute_gradients() method first, then create an operation to clip the gradients using the clip_by_value() function, and finally create an operation to apply the clipped gradients using the optimizer’s apply_gradients() method:

threshold = 1.0
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
         for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)

例子:

import tensorflow as tf

def Swish(features):
return features*tf.nn.sigmoid(features) # 1. create data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('../MNIST_data', one_hot=True) X = tf.placeholder(tf.float32, shape=(None, 784), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')
is_training = tf.placeholder(tf.bool, None, name='is_training') # 2. define network
he_init = tf.contrib.layers.variance_scaling_initializer()
with tf.name_scope('dnn'):
hidden1 = tf.layers.dense(X, 300, kernel_initializer=he_init, name='hidden1')
# hidden1 = tf.layers.batch_normalization(hidden1, momentum=0.9)
hidden1 = tf.nn.relu(hidden1)
hidden2 = tf.layers.dense(hidden1, 100, kernel_initializer=he_init, name='hidden2')
# hidden2 = tf.layers.batch_normalization(hidden2, training=is_training, momentum=0.9)
hidden2 = tf.nn.relu(hidden2)
logits = tf.layers.dense(hidden2, 10, kernel_initializer=he_init, name='output')
# prob = tf.layers.dense(hidden2, 10, tf.nn.softmax, name='prob') # 3. define loss
with tf.name_scope('loss'):
# tf.losses.sparse_softmax_cross_entropy() label is not one_hot and dtype is int*
# xentropy = tf.losses.sparse_softmax_cross_entropy(labels=tf.argmax(y, axis=1), logits=logits)
# tf.nn.sparse_softmax_cross_entropy_with_logits() label is not one_hot and dtype is int*
# xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tf.argmax(y, axis=1), logits=logits)
# loss = tf.reduce_mean(xentropy)
loss = tf.losses.softmax_cross_entropy(onehot_labels=y, logits=logits) # label is one_hot # 4. define optimizer
learning_rate = 0.01
with tf.name_scope('train'):
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) # for batch normalization
with tf.control_dependencies(update_ops):
# optimizer_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
threshold = 1.0
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
for grad, var in grads_and_vars]
optimizer_op = optimizer.apply_gradients(capped_gvs) with tf.name_scope('eval'):
correct = tf.nn.in_top_k(logits, tf.argmax(y, axis=1), 1) # 目标是否在前K个预测中, label's dtype is int*
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32)) # 5. initialize
init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
saver = tf.train.Saver()
# =================
print([v.name for v in tf.trainable_variables()])
print([v.name for v in tf.global_variables()])
# =================
# 5. train & test
n_epochs = 20
n_batches = 50
batch_size = 50 with tf.Session() as sess:
sess.run(init_op)
for epoch in range(n_epochs):
for iteration in range(mnist.train.num_examples // batch_size):
X_batch, y_batch = mnist.train.next_batch(batch_size)
sess.run(optimizer_op, feed_dict={X: X_batch, y: y_batch, is_training:True})
# =================
# for grad, var in grads_and_vars:
# grad = grad.eval(feed_dict={X: X_batch, y: y_batch, is_training:True})
# var = var.eval()
# =================
acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch, is_training:False}) # 最后一个 batch 的 accuracy
acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels, is_training:False})
loss_test = loss.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels, is_training:False})
print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test, "Test loss:", loss_test)
save_path = saver.save(sess, "./my_model_final.ckpt") with tf.Session() as sess:
sess.run(init_op)
saver.restore(sess, "./my_model_final.ckpt")
acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels, is_training:False})
loss_test = loss.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels, is_training:False})
print("Test accuracy:", acc_test, ", Test loss:", loss_test)

下面我们来看看上面这个例子里所涉及的一些东西

compute_gradients

compute_gradients 是任何一个优化器都有的方法:

compute_gradients(
loss,
var_list=None,
gate_gradients=GATE_OP,
aggregation_method=None,
colocate_gradients_with_ops=False,
grad_loss=None
)

计算 loss 中可训练的 var_list 中的梯度。
相当于minimize() 的第一步,返回 (gradient, variable) 列表。

获得了梯度后我们就可以手动进行梯度裁剪了,下面这句话就是将梯度限制到 [-threshold, threshold] 的范围内:

capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
for grad, var in grads_and_vars]

apply_gradients

apply_gradients 同样是任何一个优化器都有的方法:

apply_gradients(
grads_and_vars,
global_step=None,
name=None
)

minimize() 的第二部分,返回一个执行梯度更新的 ops。

Max-Norm Regularization

对于每个节点,max-norm regularization 会对权重 $\mathbf{w}$ 进行限制 $\lVert \mathbf{w} \rVert_2 \le r$:

\begin{equation}
\label{a}
\mathbf{w} \gets \mathbf{w} \frac{r}{\lVert \mathbf{w} \rVert_2}
\end{equation}

实例代码:

import tensorflow as tf

# =================
def max_norm_regularizer(threshold=1.0, axes=1, name="max_norm",
collection="max_norm"):
def max_norm(weights):
clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
clip_weights = tf.assign(weights, clipped, name=name)
tf.add_to_collection(collection, clip_weights)
return None # there is no regularization loss term
return max_norm
max_norm_reg = max_norm_regularizer(threshold=1.0)
# ================= # 1. create data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('../MNIST_data', one_hot=True) X = tf.placeholder(tf.float32, shape=(None, 784), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')
is_training = tf.placeholder(tf.bool, None, name='is_training') # 2. define network
he_init = tf.contrib.layers.variance_scaling_initializer()
with tf.name_scope('dnn'):
hidden1 = tf.layers.dense(X, 300, kernel_initializer=he_init,
kernel_regularizer=max_norm_reg, name='hidden1')
# hidden1 = tf.layers.batch_normalization(hidden1, momentum=0.9)
hidden1 = tf.nn.relu(hidden1)
hidden2 = tf.layers.dense(hidden1, 100, kernel_initializer=he_init,
kernel_regularizer=max_norm_reg, name='hidden2')
# hidden2 = tf.layers.batch_normalization(hidden2, training=is_training, momentum=0.9)
hidden2 = tf.nn.relu(hidden2)
logits = tf.layers.dense(hidden2, 10, kernel_initializer=he_init, name='output') # 3. define loss
with tf.name_scope('loss'):
loss = tf.losses.softmax_cross_entropy(onehot_labels=y, logits=logits) # label is one_hot # 4. define optimizer
learning_rate_init = 0.01
global_step = tf.Variable(0, trainable=False)
with tf.name_scope('train'):
learning_rate = tf.train.polynomial_decay( # 多项式衰减
learning_rate=learning_rate_init, # 初始学习率
global_step=global_step, # 当前迭代次数
decay_steps=22000, # 在迭代到该次数实际,学习率衰减为 learning_rate * dacay_rate
end_learning_rate=learning_rate_init / 10, # 最小的学习率
power=0.9,
cycle=False
)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) # for batch normalization
with tf.control_dependencies(update_ops):
optimizer_op = tf.train.MomentumOptimizer(
learning_rate=learning_rate, momentum=0.9).minimize(
loss=loss,
var_list=tf.trainable_variables(),
global_step=global_step # 不指定的话学习率不更新
)
# ================= clip gradient
# threshold = 1.0
# optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# grads_and_vars = optimizer.compute_gradients(loss)
# capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
# for grad, var in grads_and_vars]
# optimizer_op = optimizer.apply_gradients(capped_gvs)
# ================= with tf.name_scope('eval'):
correct = tf.nn.in_top_k(logits, tf.argmax(y, axis=1), 1) # 目标是否在前K个预测中, label's dtype is int*
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32)) # 5. initialize
init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
saver = tf.train.Saver() # =================
clip_all_weights = tf.get_collection("max_norm")
# ================= # 6. train & test
n_epochs = 20
batch_size = 50 with tf.Session() as sess:
sess.run(init_op)
# saver.restore(sess, './my_model_final.ckpt')
for epoch in range(n_epochs):
for iteration in range(mnist.train.num_examples // batch_size):
X_batch, y_batch = mnist.train.next_batch(batch_size)
sess.run([optimizer_op, learning_rate], feed_dict={X: X_batch, y: y_batch, is_training:True})
sess.run(clip_all_weights)
# ================= check gradient
# for grad, var in grads_and_vars:
# grad = grad.eval(feed_dict={X: X_batch, y: y_batch, is_training:True})
# var = var.eval()
# =================
learning_rate_cur = learning_rate.eval()
acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch, is_training:False}) # 最后一个 batch 的 accuracy
acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels, is_training:False})
loss_test = loss.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels, is_training:False})
print(epoch, "Current learning rate:", learning_rate_cur, "Train accuracy:", acc_train, "Test accuracy:", acc_test, "Test loss:", loss_test)
save_path = saver.save(sess, "./my_model_final.ckpt")

TensorFlow使用记录 (八): 梯度修剪 和 Max-Norm Regularization的更多相关文章

  1. TensorFlow使用记录 (六): 优化器

    0. tf.train.Optimizer tensorflow 里提供了丰富的优化器,这些优化器都继承与 Optimizer 这个类.class Optimizer 有一些方法,这里简单介绍下: 0 ...

  2. TensorFlow 学习(八)—— 梯度计算(gradient computation)

    maxpooling 的 max 函数关于某变量的偏导也是分段的,关于它就是 1,不关于它就是 0: BP 是反向传播求关于参数的偏导,SGD 则是梯度更新,是优化算法: 1. 一个实例 relu = ...

  3. 『PyTorch x TensorFlow』第八弹_基本nn.Module层函数

    『TensorFlow』网络操作API_上 『TensorFlow』网络操作API_中 『TensorFlow』网络操作API_下 之前也说过,tf 和 t 的层本质区别就是 tf 的是层函数,调用即 ...

  4. Tensorflow安装记录

    一.安装Ubantu环境 下载ios 网址:http://cn.ubuntu.com/download/ 2.配合虚拟机进行安装环境 虚拟机直接百度下载即可 虚拟机采用 具体安装,虚拟机百度中很多记录 ...

  5. linux 配置tensorflow 全过程记录

    前几天刚下一个deepin系统,是基于linux 内核的,界面的设计有些mac的feel 感觉还是挺不错的,之后就赶紧配置了一下tensorflow ,尽管之前配置过,但是这次还是遇到点儿问题,所以说 ...

  6. TensorFlow使用记录 (七): BN 层及 Dropout 层的使用

    参考:tensorflow中的batch_norm以及tf.control_dependencies和tf.GraphKeys.UPDATE_OPS的探究 1. Batch Normalization ...

  7. TensorFlow使用记录 (五): 激活函数和初始化方式

    In general ELU > leaky ReLU(and its variants) > ReLU > tanh > logistic. If you care a lo ...

  8. TensorFlow实战第八课(卷积神经网络CNN)

    首先我们来简单的了解一下什么是卷积神经网路(Convolutional Neural Network) 卷积神经网络是近些年逐步兴起的一种人工神经网络结构, 因为利用卷积神经网络在图像和语音识别方面能 ...

  9. TensorFlow学习记录(一)

    windows下的安装: 首先访问https://storage.googleapis.com/tensorflow/ 找到对应操作系统下,对应python版本,对应python位数的whl,下载. ...

随机推荐

  1. CentOS7-部署kubernetes

    1 环境准备   节点 主机名 IP OS Master     k8s-master        192.168.57.1       centos 7        Node1  k8s-nod ...

  2. Codeforces 1244E. Minimizing Difference

    传送门 首先减的顺序是无关紧要的,那么有一个显然的贪心 每次减都减最大或者最小的,因为如果不这样操作,最大的差值不会变小 那么直接把序列排序一下然后模拟一下操作过程即可,别一次只减 $1$ 就好 #i ...

  3. Python 最常见的 170 道面试题解析:2019 最新

    Python 最常见的 170 道面试题解析:2019 最新 2019年06月03日 23:30:10 GitChat的博客 阅读数 21329 文章标签: PythonPython入门Python面 ...

  4. 怎样使用 v-on 指令?

    1. Vue 中的 v-on 指令用于绑定 dom 事件 的监听函数. 下面代码实现的是 点击更改文字颜色 的功能. <!DOCTYPE html> <html lang=" ...

  5. js点击发送验证码 xx秒后重新发送

    用于一些注册类的场景,点击发送验证码,xx秒后重新发送. 利用 setTimeout 方法,xx秒后执行指定的方法,修改button的属性值,disabled为true时为灰色,不可点击. <! ...

  6. 常用的Java工具类——十六种

    常用的Java工具类——十六种 在Java中,工具类定义了一组公共方法,这篇文章将介绍Java中使用最频繁及最通用的Java工具类.以下工具类.方法按使用流行度排名,参考数据来源于Github上随机选 ...

  7. Nginx作为静态资源web服务之防盗链

    Nginx作为静态资源web服务之防盗链 首先,为什么需要防盗链,因为有些资源存在竞争对手的关系,比如淘宝的商品图片,不会轻易的让工具来爬虫爬走收集.但是如果使用防盗链,需要知道上一个访问的资源,然后 ...

  8. zabbix4.2Proxy安装文档

    zabbix4.2Proxy安装文档 目录 zabbix4.2Proxy安装文档    1 一.安装    2 1.前期安装准备    2 2.安装zabbix RPM源    3 2.1下载zabb ...

  9. html页面嵌入php代码不显示内容

    新建一个html文件,内容如下: <html> <head> <title>Example</title> </head> <body ...

  10. js 中判断对象是否为空

      var dd = function (S_object, name) { console.log(name + '第一步执行结果:' + S_object); if (typeof S_objec ...