Ho J., Jain A. and Abbeel P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Page E. Approximating to the cumulative normal function and its inverse for use on a pocket calculator. Applied Statistics, vol. 26, pp. 75-76, 1977.

Yerukala R., Boiroju N. K. Approximations to standard normal distribution function. Journal of Scientific and Engineering Research, vol. 6, pp. 515-518, 2015.

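This paper combines diffusion models with a variational bound.

Several papers on adversarial robustness already train on data generated by DDPM, which shows how strong the model is.

Main content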

Diffusion models

reverse process

Starting from \(p(x_T) = \mathcal{N}(x_T; 0, I)\):

\[p_{\theta}(x_{0:T}) := p(x_T) \prod_{t=1}^T p_{\theta}(x_{t-1}|x_t), \quad p_{\theta}(x_{t-1}|x_t) := \mathcal{N}(x_{t-1}; \mu_{\theta}(x_{t}, t), \Sigma_{\theta}(x_t, t)).
\]

Note that in this process we fit the mean \(\mu_{\theta}\) and the covariance matrix \(\Sigma_{\theta}\).

This process gradually "restores" the noise into an image (signal) \(x_0\).

forward process

\[q(x_{1:T}|x_0) := \prod_{t=1}^{T}q(x_t|x_{t-1}), \quad q(x_t|x_{t-1}):= \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I).
\]

Here \(\beta_t\) is either a trainable parameter or a hand-picked hyperparameter.

This process gradually adds noise to the image (signal).
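
As an illustration of the forward process, here is a minimal sketch in plain Python. It assumes a fixed linear schedule with \(T = 1000\) and \(\beta_t \in [0.0001, 0.02]\) (the values listed later in these notes); the function names and the toy constant "image" are made up for the example.

```python
import math
import random

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (returned as a 0-indexed list)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I), elementwise."""
    scale = math.sqrt(1.0 - beta_t)
    std = math.sqrt(beta_t)
    return [scale * v + std * rng.gauss(0.0, 1.0) for v in x_prev]

def forward_chain(x0, betas, rng):
    """Apply all T noising steps in order and return x_T."""
    x = list(x0)
    for beta_t in betas:
        x = forward_step(x, beta_t, rng)
    return x

rng = random.Random(0)
betas = linear_beta_schedule()
x0 = [1.0] * 500                      # toy constant "image" with 500 pixels
xT = forward_chain(x0, betas, rng)

# Empirical mean and variance of x_T across pixels.
mean_T = sum(xT) / len(xT)
var_T = sum((v - mean_T) ** 2 for v in xT) / len(xT)
```

After all steps the pixels should look roughly like standard normal noise, which is why the reverse process can start from \(\mathcal{N}(0, I)\).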

Variational bound

For the parameters \(\theta\), it is natural to optimize by minimizing the negative log-likelihood:

\[\begin{array}{ll}
\mathbb{E}_{p_{data}(x_0)} \bigg[-\log p_{\theta}(x_0) \bigg]
&=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \int p_{\theta}(x_{0:T}) \mathrm{d}x_{1:T} \bigg] \\
&=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \int q(x_{1:T}|x_0)\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \mathrm{d}x_{1:T} \bigg] \\
&=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \mathbb{E}_{q(x_{1:T}|x_0)} \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\
&\le -\mathbb{E}_{p_{data}(x_0)}\mathbb{E}_{q(x_{1:T}|x_0)} \bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=1}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} \bigg] \\
&= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=2}^T \log \bigg( \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} \cdot \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} \bigg) + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log \frac{p(x_T)}{q(x_T|x_0)} + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} + \log p_{\theta}(x_0|x_1) \bigg].
\end{array}
\]
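
The last two equalities go by quickly, so here is the intermediate reasoning spelled out (a sketch using only the Markov property of the forward process and Bayes' rule). For \(t \ge 2\),

\[q(x_t|x_{t-1}) = q(x_t|x_{t-1}, x_0) = \frac{q(x_{t-1}|x_t, x_0)\, q(x_t|x_0)}{q(x_{t-1}|x_0)},
\]

and the extra ratios telescope over \(t\):

\[\sum_{t=2}^T \log \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} = \log \frac{q(x_1|x_0)}{q(x_T|x_0)}.
\]

The \(\log q(x_1|x_0)\) produced here cancels against the denominator of \(\log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)}\), while \(-\log q(x_T|x_0)\) combines with \(\log p(x_T)\).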

Note: \(q=q(x_{1:T}|x_0)p_{data}(x_0)\); below we set \(q(x_0) := p_{data}(x_0)\).

\[\begin{array}{ll}
\mathbb{E}_q [\log \frac{q(x_T|x_0)}{p(x_T)}]
&= \int q(x_0, x_T) \log \frac{q(x_T|x_0)}{p(x_T)} \mathrm{d}x_0 \mathrm{d}x_T \\
&= \int q(x_0) q(x_T|x_0) \log \frac{q(x_T|x_0)}{p(x_T)} \mathrm{d}x_0 \mathrm{d}x_T \\
&= \int q(x_0) \mathrm{D_{KL}}(q(x_T|x_0) \| p(x_T)) \mathrm{d}x_0 \\
&= \int q(x_{0:T}) \mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T)) \mathrm{d}x_{0:T} \\
&= \mathbb{E}_q \bigg[\mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T)) \bigg].
\end{array}
\]

\[\begin{array}{ll}
\mathbb{E}_q [\log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)}]
&=\int q(x_0, x_{t-1}, x_t) \log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)} \mathrm{d}x_0 \mathrm{d}x_{t-1}\mathrm{d}x_t\\
&=\int q(x_0, x_t) \mathrm{D_{KL}}(q(x_{t-1}|x_t, x_0)\| p_{\theta}(x_{t-1}|x_t)) \mathrm{d}x_0 \mathrm{d}x_t\\
&=\mathbb{E}_q\bigg[\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\| p_{\theta}(x'_{t-1}|x_t)) \bigg].
\end{array}
\]

Hence, finally:

\[\mathcal{L} := \mathbb{E}_q \bigg[
\underbrace{\mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T))}_{L_T} +
\sum_{t=2}^T \underbrace{\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\| p_{\theta}(x'_{t-1}|x_t))}_{L_{t-1}}
\underbrace{-\log p_{\theta}(x_0|x_1)}_{L_0}
\bigg].
\]

Computing the loss

Since both the forward and the reverse process are Gaussian, each of the loss terms above can be computed explicitly.

First, for \(x_t\) in the forward process:

\[\begin{array}{ll}
x_t
&= \sqrt{1 - \beta_t} x_{t-1} + \sqrt{\beta_t} \epsilon, \: \epsilon \sim \mathcal{N}(0, I) \\
&= \sqrt{1 - \beta_t} (\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{\beta_{t-1}} \epsilon') + \sqrt{\beta_t} \epsilon \\
&= \sqrt{1 - \beta_t}\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{1 - \beta_t}\sqrt{\beta_{t-1}} \epsilon' + \sqrt{\beta_t} \epsilon \\
&= \sqrt{1 - \beta_t}\sqrt{1 - \beta_{t-1}} x_{t-2} +
\sqrt{1 - (1 - \beta_t)(1 - \beta_{t-1})} \epsilon \\
&= \cdots \\
&= (\prod_{s=1}^t \sqrt{1 - \beta_s}) x_0 + \sqrt{1 - \prod_{s=1}^t (1 - \beta_s)} \epsilon,
\end{array}
\]

where each merge of two independent Gaussian terms reuses the symbol \(\epsilon\) for a fresh standard normal variable, since \((1 - \beta_t)\beta_{t-1} + \beta_t = 1 - (1 - \beta_t)(1 - \beta_{t-1})\). Hence

\[q(x_t|x_0) = \mathcal{N}(x_t|\sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)I), \: \bar{\alpha}_t := \prod_{s=1}^t \alpha_s, \alpha_s := 1 - \beta_s.
\]
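
The closed form can be checked numerically. The sketch below (plain Python, helper names made up for the example) tracks the mean coefficient and the variance through the one-step updates and compares them with \(\sqrt{\bar{\alpha}_t}\) and \(1 - \bar{\alpha}_t\); the linear schedule is again an assumption.

```python
import math

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def marginal_by_recursion(betas):
    """Track x_t = m_t * x_0 + sqrt(v_t) * eps through the one-step updates."""
    m, v, stats = 1.0, 0.0, []
    for b in betas:
        m = math.sqrt(1.0 - b) * m      # the mean coefficient shrinks
        v = (1.0 - b) * v + b           # independent Gaussian variances add
        stats.append((m, v))
    return stats

def q_sample(x0, t, eps, abar):
    """x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with t in 1..T."""
    a = abar[t - 1]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
abar = alpha_bars(betas)
stats = marginal_by_recursion(betas)

# Largest disagreement between the recursion and the closed form.
max_gap = max(max(abs(m - math.sqrt(a)), abs(v - (1.0 - a)))
              for (m, v), a in zip(stats, abar))
```

With this schedule \(\bar{\alpha}_T\) is tiny, so \(q(x_T|x_0)\) is very close to \(\mathcal{N}(0, I)\) regardless of \(x_0\).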

For the posterior \(q(x_{t-1}|x_t, x_0)\), we have

\[\begin{array}{ll}
q(x_{t-1}|x_t, x_0)
&= \frac{q(x_t|x_{t-1})q(x_{t-1}|x_0)}{q(x_t|x_0)} \\
&\propto q(x_t|x_{t-1})q(x_{t-1}|x_0) \\
&\propto \exp\Bigg\{-\frac{1}{2 (1 - \bar{\alpha}_{t-1})\beta_t} \bigg[(1 - \bar{\alpha}_{t-1}) \|x_t - \sqrt{1 - \beta_t} x_{t-1}\|^2 + \beta_t \|x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}x_0\|^2 \bigg]\Bigg\} \\
&\propto \exp\Bigg\{-\frac{1}{2 (1 - \bar{\alpha}_{t-1})\beta_t} \bigg[(1 - \bar{\alpha}_t)\|x_{t-1}\|^2 - 2(1 - \bar{\alpha}_{t-1}) \sqrt{\alpha_t} x_t^Tx_{t-1} - 2 \sqrt{\bar{\alpha}_{t-1}} \beta_t x_0^T x_{t-1} \bigg]\Bigg\}.
\end{array}
\]

Therefore

\[q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}|\tilde{u}_t(x_t, x_0), \tilde{\beta}_t I),
\]

where

\[\tilde{u}_t(x_t,x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t,
\]
\[\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t.
\]
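
As a sanity check on these two expressions, the sketch below computes the posterior mean and variance twice for a single coordinate: once from the closed form, once by completing the square (adding the precisions of the two Gaussian factors). All numeric values are toy values chosen for the example.

```python
import math

def posterior_closed_form(x0, xt, beta_t, abar_t, abar_prev):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from the closed-form expressions."""
    alpha_t = 1.0 - beta_t
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return coef_x0 * x0 + coef_xt * xt, var

def posterior_by_completing_square(x0, xt, beta_t, abar_prev):
    """Same quantities via a precision-weighted combination of the two factors."""
    alpha_t = 1.0 - beta_t
    # q(x_t | x_{t-1}) viewed as a function of x_{t-1}: precision alpha_t / beta_t
    prec_lik = alpha_t / beta_t
    # q(x_{t-1} | x_0): precision 1 / (1 - abar_{t-1})
    prec_prior = 1.0 / (1.0 - abar_prev)
    var = 1.0 / (prec_lik + prec_prior)
    mean = var * (math.sqrt(alpha_t) / beta_t * xt
                  + math.sqrt(abar_prev) / (1.0 - abar_prev) * x0)
    return mean, var

beta_t, abar_prev = 0.01, 0.5            # toy values
abar_t = abar_prev * (1.0 - beta_t)      # abar_t = alpha_t * abar_{t-1}
x0, xt = 0.8, -0.3

mean_a, var_a = posterior_closed_form(x0, xt, beta_t, abar_t, abar_prev)
mean_b, var_b = posterior_by_completing_square(x0, xt, beta_t, abar_prev)
```

Because the covariance is isotropic, the same scalar computation applies to every coordinate independently.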

\(L_{t}\)

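\(L_T\) does not depend on \(\theta\) (when \(\beta_t\) is fixed), so it is dropped.

The authors take \(\Sigma_{\theta}(x_t, t) = \sigma_t^2 I\) with non-trainable \(\sigma_t^2\), where

\[\sigma_t^2 = \beta_t \quad \text{or} \quad \sigma_t^2 = \tilde{\beta}_t.
\]

These are the optimal choices, in terms of the expected KL divergence, when \(x_0 \sim \mathcal{N}(0, I)\) and when \(x_0\) is a fixed value, respectively (the authors report that the two perform about the same in experiments).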

\[L_{t-1} = \frac{1}{2 \sigma^2_t} \| \mu_{\theta}(x_t, t) - \tilde{u}_t(x_t, x_0)\|^2 +C, \quad t = 2,3,\cdots, T.
\]
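
The squared-error form comes from the closed-form KL divergence between two isotropic Gaussians: with both variances fixed, only the distance between the means depends on \(\theta\). A small sketch (toy numbers, helper names made up for the example) checks that changing the model mean changes the KL by exactly the change in \(\|\tilde{u}_t - \mu_{\theta}\|^2 / (2\sigma_t^2)\).

```python
import math

def kl_isotropic_gaussians(mu1, var1, mu2, var2):
    """KL( N(mu1, var1 * I) || N(mu2, var2 * I) ) for vectors given as lists."""
    d = len(mu1)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    return 0.5 * (d * math.log(var2 / var1) + (d * var1 + sq_dist) / var2 - d)

def mean_matching_term(mu_tilde, mu_theta, sigma2):
    """The theta-dependent part: ||mu_tilde - mu_theta||^2 / (2 * sigma_t^2)."""
    return sum((a - b) ** 2 for a, b in zip(mu_tilde, mu_theta)) / (2.0 * sigma2)

mu_tilde, beta_tilde = [0.3, -1.2, 0.5], 0.008   # toy posterior mean and variance
sigma2 = 0.01                                     # toy fixed model variance
mu_a, mu_b = [0.0, 0.0, 0.0], [0.2, -1.0, 0.4]    # two candidate model means

kl_a = kl_isotropic_gaussians(mu_tilde, beta_tilde, mu_a, sigma2)
kl_b = kl_isotropic_gaussians(mu_tilde, beta_tilde, mu_b, sigma2)
gap_kl = kl_a - kl_b
gap_sq = (mean_matching_term(mu_tilde, mu_a, sigma2)
          - mean_matching_term(mu_tilde, mu_b, sigma2))
```

The remaining terms of the KL (the log-ratio of variances and the trace term) make up the constant \(C\).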

\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \Rightarrow x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}x_t - \frac{\sqrt{1 - \bar{\alpha}_t} }{\sqrt{\bar{\alpha}_t}} \epsilon.
\]

Therefore

\[\begin{array}{ll}
\mathbb{E}_q [L_{t-1} - C]
&= \mathbb{E}_{x_0, \epsilon} \bigg\{
\frac{1}{2 \sigma_t^2} \| \mu_{\theta}(x_t, t) - \tilde{u}_t\big( x_t, (\frac{1}{\sqrt{\bar{\alpha}_t}}x_t - \frac{\sqrt{1 - \bar{\alpha}_t} }{\sqrt{\bar{\alpha}_t}} \epsilon) \big)\|^2 \bigg\} \\
&= \mathbb{E}_{x_0, \epsilon} \bigg\{
\frac{1}{2 \sigma^2_t} \| \mu_{\theta}(x_t, t) -
\frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon \big) \|^2 \bigg\}.
\end{array}
\]

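Note: in the expression above \(x_t\) is determined by \(x_0, \epsilon\), i.e. \(x_t = x_t(x_0, \epsilon)\), so the expectation is effectively taken over \(x_t\).

Given this, we may as well parameterize \(\mu_{\theta}\) directly as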

\[\mu_{\theta}(x_t, t):=
\frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \big),
\]

that is, the network directly models the noise (residual) \(\epsilon\).

The loss then simplifies to:

\[\mathbb{E}_{x_0, \epsilon} \bigg\{
\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \|\epsilon_{\theta}(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon, t) - \epsilon\|^2
\bigg\}
\]
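
To see that nothing is lost in this rewriting, the sketch below evaluates the loss both ways for toy values: once as a distance between means, once as a weighted distance between noises. The "network output" `eps_hat` is made up, and \(\sigma_t^2 = \beta_t\) is assumed.

```python
import math

def mu_from_eps(xt, eps, beta_t, abar_t):
    """(x_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(alpha_t), elementwise."""
    alpha_t = 1.0 - beta_t
    c = beta_t / math.sqrt(1.0 - abar_t)
    return [(x - c * e) / math.sqrt(alpha_t) for x, e in zip(xt, eps)]

def sq_norm(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

beta_t, abar_t = 0.01, 0.495          # toy values with abar_{t-1} = 0.5
sigma2 = beta_t                       # assumes sigma_t^2 = beta_t
x0 = [0.5, -0.25, 1.0]
eps = [0.1, -0.7, 1.3]                # the true noise
eps_hat = [0.0, -0.5, 1.0]            # a made-up network output
xt = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e for x, e in zip(x0, eps)]

# Loss written in terms of means ...
loss_mu = sq_norm(mu_from_eps(xt, eps, beta_t, abar_t),
                  mu_from_eps(xt, eps_hat, beta_t, abar_t)) / (2.0 * sigma2)
# ... and in terms of noises, with the weight from the simplified formula.
weight = beta_t ** 2 / (2.0 * sigma2 * (1.0 - beta_t) * (1.0 - abar_t))
loss_eps = weight * sq_norm(eps, eps_hat)
```

The two numbers agree up to floating-point error, since the \(x_t\) terms cancel inside the norm.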

This is in fact denoising score matching.

Similarly, sampling from \(p_{\theta}(x_{t-1}|x_t)\) becomes:

\[x_{t-1} =
\frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \big) + \sigma_t z, \: z \sim \mathcal{N}(0, I),
\]
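
A minimal sketch of the resulting sampling loop follows. `eps_model` is a hypothetical callable standing in for the trained network \(\epsilon_{\theta}\); the placeholder used here always predicts zero noise, so the output is not a meaningful sample. The choice \(\sigma_t^2 = \beta_t\) and the short 100-step schedule are assumptions made to keep the example small.

```python
import math
import random

def p_sample_loop(eps_model, dim, betas, rng):
    """Ancestral sampling x_T -> x_0 with sigma_t^2 = beta_t.

    eps_model(x, t) is a hypothetical stand-in for eps_theta; t runs over 1..T.
    """
    abar, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        abar.append(prod)

    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]         # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        beta_t, abar_t = betas[t - 1], abar[t - 1]
        eps_hat = eps_model(x, t)
        c = beta_t / math.sqrt(1.0 - abar_t)
        mean = [(xi - c * ei) / math.sqrt(1.0 - beta_t) for xi, ei in zip(x, eps_hat)]
        sigma_t = math.sqrt(beta_t)
        # No noise is added at the final step (t = 1), as in the paper's Algorithm 2.
        x = [m + (sigma_t * rng.gauss(0.0, 1.0) if t > 1 else 0.0) for m in mean]
    return x

def toy_eps_model(x, t):
    """Placeholder 'network' that always predicts zero noise (illustration only)."""
    return [0.0 for _ in x]

betas = [1e-4 + (0.02 - 1e-4) * i / 99 for i in range(100)]   # short toy schedule
sample = p_sample_loop(toy_eps_model, dim=4, betas=betas, rng=random.Random(0))
```

In a real setup `eps_model` would be the trained U-Net and `x` an image tensor; only the per-step arithmetic is meant to carry over.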

This has the form of Langevin dynamics (with slightly different step sizes and weights).


\(L_0\)

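Finally we have to handle \(L_0\). Here the authors assume that \(x_0|x_1\) follows a discrete distribution: pixel values lie in \(\{0, 1, 2, \cdots, 255\}\) and are rescaled linearly to \([-1, 1]\). Assume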

\[p_{\theta}(x_0|x_1) = \prod_{i=1}^D \int_{\delta_{-}(x_0^i)}^{\delta_+(x_0^i) } \mathcal{N}(x; \mu_{\theta}^i(x_1, 1), \sigma_1^2) \mathrm{d}x, \\
\delta_+(x) =
\left \{
\begin{array}{ll}
+\infty & \text{if } x = 1, \\
x + \frac{1}{255} & \text{if } x < 1,
\end{array}
\right .
\quad
\delta_- (x) =
\left \{
\begin{array}{ll}
-\infty & \text{if } x = -1, \\
x - \frac{1}{255} & \text{if } x > -1.
\end{array}
\right .
\]

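In effect, the real line under an ordinary normal distribution is partitioned into the bins

\[(-\infty, -1 + 1/255], (-1 + 1 / 255, -1 + 3/255], \cdots, (1 - 3/255, 1 - 1/255], (1 - 1 / 255, +\infty)
\]

and each pixel value falls into exactly one of them.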
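
A small sketch of this discretized likelihood for a single pixel, using the exact normal CDF from `math.erf`. The values of `mu` and `sigma` are toy stand-ins for \(\mu_{\theta}^i(x_1, 1)\) and \(\sigma_1\), and the function names are made up for the example.

```python
import math

def std_normal_cdf(z):
    """Exact standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pixel_to_unit(k):
    """Map an integer pixel value k in {0, ..., 255} to [-1, 1]."""
    return k / 127.5 - 1.0

def discretized_gaussian_prob(k, mu, sigma):
    """Probability that N(mu, sigma^2) falls into the bin of pixel value k."""
    x = pixel_to_unit(k)
    # The two edge bins extend to +infinity and -infinity respectively.
    upper = 1.0 if k == 255 else std_normal_cdf((x + 1.0 / 255.0 - mu) / sigma)
    lower = 0.0 if k == 0 else std_normal_cdf((x - 1.0 / 255.0 - mu) / sigma)
    return upper - lower

mu, sigma = 0.1, 0.05                      # toy mean and standard deviation
probs = [discretized_gaussian_prob(k, mu, sigma) for k in range(256)]
total = sum(probs)
most_likely = max(range(256), key=lambda k: probs[k])
```

Because adjacent bins share their edges, the 256 probabilities sum to one, so \(p_{\theta}(x_0|x_1)\) is a proper discrete distribution per pixel and \(-\log p_{\theta}(x_0|x_1)\) is a valid codelength.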

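In the actual code one has to evaluate the Gaussian cumulative distribution function, which has no closed form, so the authors use the following approximation:

\[\Phi(x) \approx \frac{1}{2} \Bigg\{1 + \tanh \bigg(\sqrt{2/\pi}\, x (1 + 0.044715 x^2) \bigg) \Bigg\}.
\]

This way gradients can also be backpropagated.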
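
How good is such an approximation? The sketch below compares the tanh form, written here as \(\frac{1}{2}\{1 + \tanh(\sqrt{2/\pi}(x + 0.044715 x^3))\}\), with the exact CDF computed from `math.erf` on a grid over \([-6, 6]\). The grid and the function names are choices made for this example.

```python
import math

def approx_std_normal_cdf(x):
    """Tanh-based approximation of the standard normal CDF."""
    return 0.5 * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def exact_std_normal_cdf(x):
    """Reference value computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

grid = [i / 100.0 for i in range(-600, 601)]
max_abs_err = max(abs(approx_std_normal_cdf(x) - exact_std_normal_cdf(x)) for x in grid)
```

On this grid the maximum absolute error stays below \(10^{-3}\), which is small compared with the width of a typical pixel bin's probability mass.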

Note: this approximation is due to Page.

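The final algorithm

Note: \(t=1\) corresponds to \(L_0\), and \(t=2,\cdots, T\) correspond to \(L_{1}, \cdots, L_{T-1}\).

Note: for \(L_t\) the authors drop the leading coefficient, which in effect acts as a reweighting of the terms.

In practice the authors train on sampled loss terms: each step draws a time step \(t\) uniformly at random instead of summing over all \(t\).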
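
One sampled loss term can be sketched as follows. `eps_model` is a hypothetical stand-in for \(\epsilon_{\theta}\) (here a placeholder that predicts zero), the loss is averaged over coordinates, and the gradient step that would follow in a real framework is left out.

```python
import math
import random

def training_loss_sample(x0, eps_model, betas, rng):
    """One Monte Carlo sample of the unweighted loss ||eps - eps_theta(x_t, t)||^2.

    eps_model(x, t) is a hypothetical stand-in for the network eps_theta.
    The squared error is averaged over the coordinates of x0.
    """
    T = len(betas)
    t = rng.randint(1, T)                            # t ~ Uniform({1, ..., T})
    abar_t = 1.0
    for b in betas[:t]:
        abar_t *= 1.0 - b
    eps = [rng.gauss(0.0, 1.0) for _ in x0]          # eps ~ N(0, I)
    xt = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e
          for x, e in zip(x0, eps)]
    eps_hat = eps_model(xt, t)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat)) / len(x0)

def zero_eps_model(x, t):
    """Placeholder model that always predicts zero noise (illustration only)."""
    return [0.0 for _ in x]

rng = random.Random(0)
betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]
x0 = [0.2, -0.4, 0.9, 0.0, -1.0, 1.0, 0.5, -0.5]     # toy data point in [-1, 1]
losses = [training_loss_sample(x0, zero_eps_model, betas, rng) for _ in range(2000)]
avg_loss = sum(losses) / len(losses)
```

A model that always predicts zero has an expected loss of 1 per coordinate (the variance of \(\epsilon\)), which gives a baseline a trained network should beat.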

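Details

Note that \(\epsilon_{\theta}(\cdot, t)\) depends explicitly on \(t\). In the experiments the authors implement this with the positional encoding used in attention models. Writing the positional encoding as \(P\):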

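  1. \(t = \text{Linear}(\text{ACT}(\text{Linear}(t * P)))\), i.e. the time-step embedding is obtained by passing the encoding through a two-layer MLP;
  2. the authors use a U-Net architecture, and in each residual block:
\[x += \text{Linear}(\text{ACT}(t)).
\]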
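
A rough sketch of the time-step conditioning in plain Python. The sinusoidal form below is one common Transformer-style encoding and is an assumption, as are the swish activation, the layer sizes and all function names; the notes only say that a positional encoding followed by a two-layer MLP is used.

```python
import math

def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Transformer-style sinusoidal encoding of an integer time step t.

    Returns dim values: the first half are sines, the second half cosines.
    dim is assumed to be even and at least 4.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / (half - 1)) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def linear(x, weight, bias):
    """Dense layer y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def swish(x):
    """x * sigmoid(x), used here as the activation ACT (an assumption)."""
    return [v / (1.0 + math.exp(-v)) for v in x]

def time_mlp(t, dim, w1, b1, w2, b2):
    """Linear(ACT(Linear(encoding(t)))): the two-layer MLP on the encoding."""
    return linear(swish(linear(sinusoidal_embedding(t, dim), w1, b1)), w2, b2)

dim = 4
identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
zeros = [0.0] * dim
emb0 = sinusoidal_embedding(0, dim)                    # sines are 0, cosines are 1
out0 = time_mlp(0, dim, identity, zeros, identity, zeros)
```

With identity weights the MLP output is just the activation applied to the encoding; in a trained model the weights are learned and the result is added inside each residual block as described above.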

Parameters

\(T\): 1000
\(\beta_t\): increases linearly from 0.0001 to 0.02 over \(t = 1,2,\cdots, T\)
backbone: U-Net

Note: the implementation also uses tricks such as EMA (an exponential moving average of the weights).
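
For readers unfamiliar with EMA, a minimal sketch of the update rule on a dictionary of scalar "weights". The default decay of 0.9999 and the dictionary layout are assumptions made for illustration, not taken from these notes.

```python
def ema_update(ema_params, params, decay=0.9999):
    """In-place update: ema <- decay * ema + (1 - decay) * param."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params

params = {"w": 1.0, "b": -2.0}          # toy scalar "weights"
ema = dict(params)                       # start the average at the initial weights
params["w"], params["b"] = 3.0, 0.0      # pretend an optimizer step changed them
ema_update(ema, params, decay=0.9)       # a small decay so the change is visible
```

The averaged copy `ema` is the one typically used for sampling and evaluation, while the optimizer keeps updating `params`.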

Code

Official code

lucidrains-denoising-diffusion-pytorch
