Ho J., Jain A. and Abbeel P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NIPS), 2020.

[Page E. Approximating to the cumulative normal function and its inverse for use on a pocket calculator. Applied Statistics, vol. 26, pp. 75-76, 1977.]

Yerukala R., Boiroju N. K. Approximations to standard normal distribution function. Journal of Scientific and Engineering Research, vol. 6, pp. 515-518, 2015.

diffusion model和变分界的结合.

对抗鲁棒性上已经有多篇论文用DDPM生成的数据用于训练了, 可见其强大.

主要内容

Diffusion models

reverse process

从\(p(x_T) = \mathcal{N}(x_T; 0, I)\)出发:

\[p_{\theta}(x_{0:T}) := p(X_T) \prod_{t=1}^T p_{\theta}(x_{t-1}|x_t), \quad p_{\theta}(x_{t-1}|x_t) := \mathcal{N}(x_{t-1}; \mu_{\theta}(x_{t}, t), \Sigma_{\theta}(x_t, t)),
\]

注意这个过程我们拟合均值\(\mu_{\theta}\)和协方差矩阵\(\Sigma_{\theta}\).

这部分的过程逐步将噪声'恢复'为图片(信号)\(x_0\).

forward process

\[q(x_{1:T}|x_0) := \prod_{t=1}^{T}q(x_t|x_{t-1}), \quad q(x_t|x_{t-1}):= \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I).
\]

其中\(\beta_t\)是可训练的参数或者人为给定的超参数.

这部分为将图片(信号)逐步添加噪声的过程.

变分界

对于参数\(\theta\), 很自然地我们希望通过最小化其负对数似然来优化:

\[\begin{array}{ll}
\mathbb{E}_{p_{data}(x_0)} \bigg[-\log p_{\theta}(x_0) \bigg]
&=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \int p_{\theta}(x_{0:T}) \mathrm{d}x_{0:T} \bigg] \\
&=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \int q(x_{1:T}|x_0)\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \mathrm{d}x_{0:T} \bigg] \\
&=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \mathbb{E}_{q(x_{1:T}|x_0)} \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\
&\le -\mathbb{E}_{p_{data}(x_0)}\mathbb{E}_{q(x_{1:T}|x_0)} \bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=1}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} \bigg] \\
&= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} \cdot \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log \frac{p(x_T)}{q(x_T|x_0)} + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} + \log p_{\theta}(x_0|x_1) \bigg] \\
\end{array}
\]

注: \(q=q(x_{1:T}|x_0)p_{data}(x_0)\), 下面另\(q(x_0) := p_{data}(x_0)\).

\[\begin{array}{ll}
\mathbb{E}_q [\log \frac{q(x_T|x_0)}{p(x_T)}]
&= \int q(x_0, x_T) \log \frac{q(x_T|x_0)}{p(x_T)} \mathrm{d}x_0 \mathrm{d}x_T \\
&= \int q(x_0) q(x_T|x_0) \log \frac{q(x_T|x_0)}{p(x_T)} \mathrm{d}x_0 \mathrm{d}x_T \\
&= \int q(x_0) \mathrm{D_{KL}}(q(x_T|x_0) \| p(x_T)) \mathrm{d}x_0 \\
&= \int q(x_{0:T}) \mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T)) \mathrm{d}x_{0:T} \\
&= \mathbb{E}_q \bigg[\mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T)) \bigg].
\end{array}
\]

\[\begin{array}{ll}
\mathbb{E}_q [\log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)}]
&=\int q(x_0, x_{t-1}, x_t) \log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)} \mathrm{d}x_0 \mathrm{d}x_{t-1}\mathrm{d}x_t\\
&=\int q(x_0, x_t) \mathrm{D_{KL}}(q(x_{t-1}|x_t, x_0)\| p_{\theta}(x_{t-1}|x_t)) \mathrm{d}x_0 \mathrm{d}x_t\\
&=\mathbb{E}_q\bigg[\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\| p_{\theta}(x'_{t-1}|x_t)) \bigg].
\end{array}
\]

故最后:

\[\mathcal{L} := \mathbb{E}_q \bigg[
\underbrace{\mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T))}_{L_T} +
\sum_{t=2}^T \underbrace{\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\| p_{\theta}(x'_{t-1}|x_t))}_{L_{t-1}}
\underbrace{-\log p_{\theta}(x_0|x_1)}_{L_0}.
\bigg]
\]

损失求解

因为无论forward, 还是 reverse process都是基于高斯分布的, 我们可以显示求解上面的各项损失:

首先, 对于forward process中的\(x_t\):

\[\begin{array}{ll}
x_t
&= \sqrt{1 - \beta_t} x_{t-1} + \sqrt{\beta_t} \epsilon, \: \epsilon \sim \mathcal{N}(0, I) \\
&= \sqrt{1 - \beta_t} (\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{\beta_{t-1}} \epsilon') + \sqrt{\beta} \epsilon \\
&= \sqrt{1 - \beta_t}\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{1 - \beta_t}\sqrt{\beta_{t-1}} \epsilon' + \sqrt{\beta} \epsilon \\
&= \sqrt{1 - \beta_t}\sqrt{1 - \beta_{t-1}} x_{t-2} +
\sqrt{1 - (1 - \beta_t)(1 - \beta_{t-1})} \epsilon \\
&= \cdots \\
&= (\prod_{s=1}^t \sqrt{1 - \beta_s}) x_0 + \sqrt{1 - \prod_{s=1}^t (1 - \beta_s)} \epsilon,
\end{array}
\]

\[q(x_t|x_0) = \mathcal{N}(x_t|\sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)I), \: \bar{\alpha}_t := \prod_{s=1}^t \alpha_s, \alpha_s := 1 - \beta_s.
\]

对于后验分布\(q(x_{t-1}|x_t, x_0)\), 我们有

\[\begin{array}{ll}
q(x_{t-1}|x_t, x_0)
&= \frac{q(x_t|x_{t-1})q(x_{t-1}|x_0)}{q(x_t|x_0)} \\
&\propto q(x_t|x_{t-1})q(x_{t-1}|x_0) \\
&\propto \exp\Bigg\{-\frac{1}{2 (1 - \bar{\alpha}_{t-1})\beta_t} \bigg[(1 - \bar{\alpha}_{t-1}) \|x_t - \sqrt{1 - \beta_t} x_{t-1}\|^2 + \beta_t \|x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}x_0\|^2 \bigg]\Bigg\} \\
&\propto \exp\Bigg\{-\frac{1}{2 (1 - \bar{\alpha}_{t-1})\beta_t} \bigg[(1 - \bar{\alpha}_t)\|x_{t-1}\|^2 - 2(1 - \bar{\alpha}_{t-1}) \sqrt{\alpha_t} x_t^Tx_{t-1} - 2 \sqrt{\bar{\alpha}_{t-1}} \beta_t \bigg]\Bigg\} \\
\end{array}
\]

所以

\[q(x_{t-1}|x_t, x_0) \sim \mathcal{N}(x_{t-1}|\tilde{u}_t(x_t, x_0), \tilde{\beta}_t I),
\]

其中

\[\tilde{u}_t(x_t,x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t,
\]
\[\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t.
\]

\(L_{t}\)

\(L_T\)与\(\theta\)无关, 舍去.

作者假设\(\Sigma_{\theta}(x_t, t) = \sigma_t^2 I\)为非训练的参数, 其中

\[\sigma_t^2 = \beta_t | \tilde{\beta}_t,
\]

分别为\(x_0 \sim \mathcal{N}(0, I)\)和\(x_0\)为固定值时, 期望下KL散度的最优参数(作者说在实验中二者差不多).

\[L_{t} = \frac{1}{2 \sigma^2_t} \| \mu_{\theta}(x_t, t) - \tilde{u}_t(x_t, x_0)\|^2 +C, \quad t = 1,2,\cdots, T-1.
\]

\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \Rightarrow x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}x_t - \frac{\sqrt{1 - \bar{\alpha}_t} }{\sqrt{\bar{\alpha}_t}} \epsilon.
\]

所以

\[\begin{array}{ll}
\mathbb{E}_q [L_{t-1} - C]
&= \mathbb{E}_{x_0, \epsilon} \bigg\{
\frac{1}{2 \sigma_t^2} \| \mu_{\theta}(x_t, t) - \tilde{u}_t\big( x_t, (\frac{1}{\sqrt{\bar{\alpha}_t}}x_t - \frac{\sqrt{1 - \bar{\alpha}_t} }{\sqrt{\bar{\alpha}_t}} \epsilon) \big)\|^2 \bigg\} \\
&= \mathbb{E}_{x_0, \epsilon} \bigg\{
\frac{1}{2 \sigma^2_t} \| \mu_{\theta}(x_t, t) -
\frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon \big) \bigg\} \\
\end{array}
\]

注: 上式子中\(x_t\)由\(x_0, \epsilon\)决定, 实际上\(x_t = x_t(x_0, \epsilon)\), 故期望实际上是对\(x_t\)求期望.

既然如此, 我们不妨直接参数化\(\mu_{\theta}\)为

\[\mu_{\theta}(x_t, t):=
\frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \big),
\]

即直接建模残差\(\epsilon\).

此时损失可简化为:

\[\mathbb{E}_{x_0, \epsilon} \bigg\{
\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \|\epsilon_{\theta}(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon, t) - \epsilon\|^2
\bigg\}
\]

这个实际上时denoising score matching.

类似地, 从\(p_{\theta}(x_{t-1}|x_t)\)采样则为:

\[x_{t-1} =
\frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \big) + \sigma_t z, \: z \sim \mathcal{N}(0, I),
\]

这是Langevin dynamic的形式(步长和权重有点变化)

注: 这部分见here.

\(L_0\)

最后我们要处理\(L_0\), 这里作者假设\(x_0|x_1\)满足一个离散分布, 首先图片取值于\(\{0, 1, 2, \cdots, 255\}\), 并标准化至\([-1, 1]\). 假设

\[p_{\theta}(x_0|x_1) = \prod_{i=1}^D \int_{\delta_{-}(x_0^i)}^{\delta_+(x_0^i) } \mathcal{N}(x; \mu_{\theta}^i(x_1, 1), \sigma_1^2) \mathrm{d}x, \\
\delta_+(x) =
\left \{
\begin{array}{ll}
+\infty & \text{if } x = 1, \\
x + \frac{1}{255} & \text{if } x < 1.
\end{array}
\right .
\delta_- (x)
\left \{
\begin{array}{ll}
-\infty & \text{if } x = -1, \\
x - \frac{1}{255} & \text{if } x > -1.
\end{array}
\right .
\]

实际上就是将普通的正态分布划分为:

\[(-\infty, -1 + 1/255], (-1 + 1 / 255, -1 + 3/255], \cdots, (1 - 3/255, 1 - 1/255], (1 - 1 / 255, +\infty)
\]

各取值落在其中之一.

在实际代码编写中, 会遇到高斯函数密度函数估计的问题(直接求是做不到的), 作者选择用下列的估计方式:

\[\Phi(x) \approx \frac{1}{2} \Bigg\{1 + \tanh \bigg(\sqrt{2/\pi} (1 + 0.044715 x^2) \bigg) \Bigg\}.
\]

这样梯度也就能够回传了.

注: 该估计属于Page.

最后的算法

注: \(t=1\)对应\(L_0\), \(t=2,\cdots, T\)对应\(L_{1}, \cdots, L_{T-1}\).

注: 对于\(L_t\)作者省略了开始的系数, 这反而是一种加权.

作者在实际中是采样损失用以训练的.

细节

注意到, 作者的\(\epsilon_{\theta}(\cdot, t)\)是有显示强调\(t\), 作者在实验中是通过attention中的位置编码实现的, 假设位置编码为\(P\):

  1. $ t = \text{Linear}(\text{ACT}(\text{Linear}(t * P)))$, 即通过两层的MLP来转换得到time_steps;
  2. 作者用的是U-Net结构, 在每个residual 模块中:
\[x += \text{Linear}(\text{ACT}(t)).
\]
参数
\(T\) 1000
\(\beta_t\) \([0.0001, 0.02]\), 线性增长\(1,2,\cdots, T\).
backbone U-Net

注: 作者在实现中还用到了EMA等技巧.

代码

原文代码

lucidrains-denoising-diffusion-pytorch

Denoising Diffusion Probabilistic Models (DDPM)的更多相关文章

  1. (转) RNN models for image generation

    RNN models for image generation MARCH 3, 2017   Today we’re looking at the remaining papers from the ...

  2. {ICIP2014}{收录论文列表}

    This article come from HEREARS-L1: Learning Tuesday 10:30–12:30; Oral Session; Room: Leonard de Vinc ...

  3. CVPR 2015 papers

    CVPR2015 Papers震撼来袭! CVPR 2015的文章可以下载了,如果链接无法下载,可以在Google上通过搜索paper名字下载(友情提示:可以使用filetype:pdf命令). Go ...

  4. ICLR 2014 International Conference on Learning Representations深度学习论文papers

    ICLR 2014 International Conference on Learning Representations Apr 14 - 16, 2014, Banff, Canada Work ...

  5. cvpr2015papers

    @http://www-cs-faculty.stanford.edu/people/karpathy/cvpr2015papers/ CVPR 2015 papers (in nicer forma ...

  6. Official Program for CVPR 2015

    From:  http://www.pamitc.org/cvpr15/program.php Official Program for CVPR 2015 Monday, June 8 8:30am ...

  7. Machine and Deep Learning with Python

    Machine and Deep Learning with Python Education Tutorials and courses Supervised learning superstiti ...

  8. Notes(一)

    Numerous experimental measurements in spatially complex systems have revealed anomalous diffusion in ...

  9. 斯坦福CS课程列表

    http://exploredegrees.stanford.edu/coursedescriptions/cs/ CS 101. Introduction to Computing Principl ...

随机推荐

  1. 日常Java 2021/9/28

    字符串反转 package m; public class m { public static void main(String[] args) { //定义一个字符串 String str = &q ...

  2. 日常Java(测试 (二柱)修改版)2021/9/22

    题目: 一家软件公司程序员二柱的小孩上了小学二年级,老师让家长每天出30道四则运算题目给小学生做. 二柱一下打印出好多份不同的题目,让孩子做了.老师看了作业之后,对二柱赞许有加.别的老师闻讯, 问二柱 ...

  3. 基于MQTT协议实现远程控制的"智能"车

    智能,但不完全智能 虽然我不觉得这玩意儿有啥智能的,但都这么叫就跟着叫喽. 时隔好几天才写的 其实在写这篇博文的时候我已经在做升级了,并且已经到了中后期阶段了. 主要是业余时间做着玩,看时间了. 规格 ...

  4. Hbase(一)【入门安装及高可用】

    目录 一.Zookeeper正常部署 二.Hadoop正常部署 三.Hbase部署 1.下载 2.解压 3.相关配置 4.分发文件 5.启动.关闭 6.验证 四.HMaster的高可用 一.Zooke ...

  5. 利用python爬取城市公交站点

    利用python爬取城市公交站点 页面分析 https://guiyang.8684.cn/line1 爬虫 我们利用requests请求,利用BeautifulSoup来解析,获取我们的站点数据.得 ...

  6. Docker学习(六)——Dockerfile文件详解

    Docker学习(六)--Dockerfile文件详解 一.环境介绍 1.Dockerfile中所用的所有文件一定要和Dockerfile文件在同一级父目录下,可以为Dockerfile父目录的子目录 ...

  7. ubantu上编辑windows程序

    命令简记 cd $GOROOT/src cp -r $GOROOT /root/go1.4 CGO_ENABLED=0 GOOS=windows GOARCH=amd64 ./make.bash 操作 ...

  8. SVN终端演练-版本回退

    1. 版本回退概念以及原因?    概念: 是指将代码(本地代码或者服务器代码), 回退到之前记录的某一特定版本    原因: 如果代码做错了, 想返回之前某个状态重做;2. 修改了,但未提交的情况下 ...

  9. 关于python中的随机种子——random_state

    random_state是一个随机种子,是在任意带有随机性的类或函数里作为参数来控制随机模式.当random_state取某一个值时,也就确定了一种规则. random_state可以用于很多函数,我 ...

  10. RabbitMQ,RocketMQ,Kafka 几种消息队列的对比

    常用的几款消息队列的对比 前言 RabbitMQ 优点 缺点 RocketMQ 优点 缺点 Kafka 优点 缺点 如何选择合适的消息队列 参考 常用的几款消息队列的对比 前言 消息队列的作用: 1. ...