Ho J., Jain A. and Abbeel P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Page E. Approximating to the cumulative normal function and its inverse for use on a pocket calculator. Applied Statistics, vol. 26, pp. 75-76, 1977.

Yerukala R., Boiroju N. K. Approximations to standard normal distribution function. Journal of Scientific and Engineering Research, vol. 6, pp. 515-518, 2015.

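This paper combines diffusion models with a variational bound.

Several papers on adversarial robustness already train on data generated by DDPM, which shows how powerful the model is.

Main content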

Diffusion models

reverse process

Starting from \(p(x_T) = \mathcal{N}(x_T; 0, I)\):

\[p_{\theta}(x_{0:T}) := p(x_T) \prod_{t=1}^T p_{\theta}(x_{t-1}|x_t), \quad p_{\theta}(x_{t-1}|x_t) := \mathcal{N}(x_{t-1}; \mu_{\theta}(x_{t}, t), \Sigma_{\theta}(x_t, t)),
\]

Note that in this process we fit the mean \(\mu_{\theta}\) and the covariance matrix \(\Sigma_{\theta}\).

This process gradually 'restores' noise into an image (signal) \(x_0\).

forward process

\[q(x_{1:T}|x_0) := \prod_{t=1}^{T}q(x_t|x_{t-1}), \quad q(x_t|x_{t-1}):= \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I).
\]

Here \(\beta_t\) is either a trainable parameter or a hand-set hyperparameter.

This process gradually adds noise to the image (signal).
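
As a rough illustration of the forward process, the sketch below runs one noising chain in plain Python. It assumes a linear, hand-set \(\beta_t\) schedule and treats the image as a short list of floats; the function names (`linear_betas`, `forward_step`, `forward_chain`) are made up for this example and do not come from the paper's code.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (an assumed, hand-set schedule)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I), elementwise."""
    scale = math.sqrt(1.0 - beta_t)
    std = math.sqrt(beta_t)
    return [scale * v + std * rng.gauss(0.0, 1.0) for v in x_prev]

def forward_chain(x0, betas, rng):
    """Return the whole trajectory [x_0, x_1, ..., x_T]."""
    xs = [list(x0)]
    for beta_t in betas:
        xs.append(forward_step(xs[-1], beta_t, rng))
    return xs

rng = random.Random(0)
betas = linear_betas(1000)
trajectory = forward_chain([1.0, -1.0, 0.5], betas, rng)
```

The last entry of `trajectory` plays the role of \(x_T\), which should be close to a standard normal sample when \(T\) is large.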

Variational bound

For the parameters \(\theta\), it is natural to optimize by minimizing the negative log-likelihood:

\[\begin{array}{ll}
\mathbb{E}_{p_{data}(x_0)} \bigg[-\log p_{\theta}(x_0) \bigg]
&=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \int p_{\theta}(x_{0:T}) \mathrm{d}x_{1:T} \bigg] \\
&=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \int q(x_{1:T}|x_0)\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \mathrm{d}x_{1:T} \bigg] \\
&=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \mathbb{E}_{q(x_{1:T}|x_0)} \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\
&\le -\mathbb{E}_{p_{data}(x_0)}\mathbb{E}_{q(x_{1:T}|x_0)} \bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=1}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} \bigg] \\
&= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} \cdot \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log \frac{p(x_T)}{q(x_T|x_0)} + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} + \log p_{\theta}(x_0|x_1) \bigg] \\
\end{array}
\]

Note: here \(q=q(x_{1:T}|x_0)p_{data}(x_0)\) is the joint distribution; below we let \(q(x_0) := p_{data}(x_0)\).

\[\begin{array}{ll}
\mathbb{E}_q [\log \frac{q(x_T|x_0)}{p(x_T)}]
&= \int q(x_0, x_T) \log \frac{q(x_T|x_0)}{p(x_T)} \mathrm{d}x_0 \mathrm{d}x_T \\
&= \int q(x_0) q(x_T|x_0) \log \frac{q(x_T|x_0)}{p(x_T)} \mathrm{d}x_0 \mathrm{d}x_T \\
&= \int q(x_0) \mathrm{D_{KL}}(q(x_T|x_0) \| p(x_T)) \mathrm{d}x_0 \\
&= \int q(x_{0:T}) \mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T)) \mathrm{d}x_{0:T} \\
&= \mathbb{E}_q \bigg[\mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T)) \bigg].
\end{array}
\]

\[\begin{array}{ll}
\mathbb{E}_q [\log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)}]
&=\int q(x_0, x_{t-1}, x_t) \log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)} \mathrm{d}x_0 \mathrm{d}x_{t-1}\mathrm{d}x_t\\
&=\int q(x_0, x_t) \mathrm{D_{KL}}(q(x_{t-1}|x_t, x_0)\| p_{\theta}(x_{t-1}|x_t)) \mathrm{d}x_0 \mathrm{d}x_t\\
&=\mathbb{E}_q\bigg[\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\| p_{\theta}(x'_{t-1}|x_t)) \bigg].
\end{array}
\]

Hence, finally:

\[\mathcal{L} := \mathbb{E}_q \bigg[
\underbrace{\mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T))}_{L_T} +
\sum_{t=2}^T \underbrace{\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\| p_{\theta}(x'_{t-1}|x_t))}_{L_{t-1}}
+ \underbrace{(-\log p_{\theta}(x_0|x_1))}_{L_0}
\bigg].
\]

Computing the loss

Since both the forward and the reverse process are Gaussian, each loss term above can be evaluated in closed form.

First, for \(x_t\) in the forward process:

\[\begin{array}{ll}
x_t
&= \sqrt{1 - \beta_t} x_{t-1} + \sqrt{\beta_t} \epsilon, \: \epsilon \sim \mathcal{N}(0, I) \\
&= \sqrt{1 - \beta_t} (\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{\beta_{t-1}} \epsilon') + \sqrt{\beta_t} \epsilon \\
&= \sqrt{1 - \beta_t}\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{1 - \beta_t}\sqrt{\beta_{t-1}} \epsilon' + \sqrt{\beta_t} \epsilon \\
&= \sqrt{1 - \beta_t}\sqrt{1 - \beta_{t-1}} x_{t-2} +
\sqrt{1 - (1 - \beta_t)(1 - \beta_{t-1})} \epsilon \\
&= \cdots \\
&= (\prod_{s=1}^t \sqrt{1 - \beta_s}) x_0 + \sqrt{1 - \prod_{s=1}^t (1 - \beta_s)} \epsilon,
\end{array}
\]

\[q(x_t|x_0) = \mathcal{N}(x_t|\sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)I), \: \bar{\alpha}_t := \prod_{s=1}^t \alpha_s, \alpha_s := 1 - \beta_s.
\]
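
The closed form can be checked numerically. The sketch below (plain Python, with an assumed linear schedule and illustrative helper names) propagates the mean coefficient and the variance of \(x_t | x_0\) one step at a time and compares them with \(\sqrt{\bar{\alpha}_t}\) and \(1 - \bar{\alpha}_t\).

```python
import math

def alpha_bars(betas):
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def recursive_moments(betas):
    """Propagate the mean coefficient and variance of x_t | x_0 step by step."""
    coef, var = 1.0, 0.0
    moments = []
    for b in betas:
        coef = math.sqrt(1.0 - b) * coef   # mean: sqrt(1 - beta_t) * previous mean
        var = (1.0 - b) * var + b          # variance: scaled old variance + beta_t
        moments.append((coef, var))
    return moments

# Assumed linear schedule from 1e-4 to 0.02 with T = 1000.
betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]
abar = alpha_bars(betas)
moments = recursive_moments(betas)
max_err = max(
    max(abs(c - math.sqrt(a)), abs(v - (1.0 - a)))
    for (c, v), a in zip(moments, abar)
)
```

Here `max_err` stays at floating-point noise level, and `abar[-1]` is close to zero, which is why \(x_T\) is nearly independent of \(x_0\).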

For the posterior \(q(x_{t-1}|x_t, x_0)\), we have

\[\begin{array}{ll}
q(x_{t-1}|x_t, x_0)
&= \frac{q(x_t|x_{t-1})q(x_{t-1}|x_0)}{q(x_t|x_0)} \\
&\propto q(x_t|x_{t-1})q(x_{t-1}|x_0) \\
&\propto \exp\Bigg\{-\frac{1}{2 (1 - \bar{\alpha}_{t-1})\beta_t} \bigg[(1 - \bar{\alpha}_{t-1}) \|x_t - \sqrt{1 - \beta_t} x_{t-1}\|^2 + \beta_t \|x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}x_0\|^2 \bigg]\Bigg\} \\
&\propto \exp\Bigg\{-\frac{1}{2 (1 - \bar{\alpha}_{t-1})\beta_t} \bigg[(1 - \bar{\alpha}_t)\|x_{t-1}\|^2 - 2(1 - \bar{\alpha}_{t-1}) \sqrt{\alpha_t} x_t^Tx_{t-1} - 2 \sqrt{\bar{\alpha}_{t-1}} \beta_t x_0^T x_{t-1} \bigg]\Bigg\} \\
\end{array}
\]

Therefore

\[q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}|\tilde{u}_t(x_t, x_0), \tilde{\beta}_t I),
\]

where

\[\tilde{u}_t(x_t,x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t,
\]
\[\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t.
\]
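
A small sketch of these posterior parameters for scalar inputs, in plain Python. The function name `posterior_params` and the 1-indexed `t` convention are choices made for this example. A convenient sanity check: if \(x_t = \sqrt{\bar{\alpha}_t} x_0\) (zero noise), the posterior mean should equal \(\sqrt{\bar{\alpha}_{t-1}} x_0\).

```python
import math

def posterior_params(x_t, x0, t, betas):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for scalars; t is 1-indexed."""
    alphas = [1.0 - b for b in betas]
    abar_t = math.prod(alphas[:t])
    abar_prev = math.prod(alphas[:t - 1])  # empty product is 1 when t == 1
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    mean = coef_x0 * x0 + coef_xt * x_t
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return mean, var

# Toy schedule, chosen only to keep the arithmetic easy to follow.
betas = [0.1, 0.2, 0.3]
x0 = 2.0
x3_noise_free = math.sqrt(0.9 * 0.8 * 0.7) * x0
mean3, var3 = posterior_params(x3_noise_free, x0, 3, betas)
mean1, var1 = posterior_params(0.3, x0, 1, betas)
```

For \(t = 1\) the variance \(\tilde{\beta}_1\) is zero and the mean collapses to \(x_0\), since \(\bar{\alpha}_0 = 1\).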

\(L_{t}\)

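\(L_T\) does not depend on \(\theta\), so it is dropped.

The authors fix \(\Sigma_{\theta}(x_t, t) = \sigma_t^2 I\) as an untrained quantity, where

\[\sigma_t^2 = \beta_t \quad \text{or} \quad \sigma_t^2 = \tilde{\beta}_t,
\]

which are the optimal choices, in terms of the expected KL divergence, when \(x_0 \sim \mathcal{N}(0, I)\) and when \(x_0\) is a fixed point, respectively (the authors report that the two perform similarly in experiments).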

\[L_{t-1} = \frac{1}{2 \sigma^2_t} \| \mu_{\theta}(x_t, t) - \tilde{u}_t(x_t, x_0)\|^2 +C, \quad t = 2,3,\cdots, T.
\]
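
For reference, this form follows from the standard KL divergence between two isotropic Gaussians, written here in the notation of this post with \(D\) denoting the data dimension:

\[\mathrm{D_{KL}}\big(\mathcal{N}(\tilde{u}_t, \tilde{\beta}_t I) \,\|\, \mathcal{N}(\mu_{\theta}, \sigma_t^2 I)\big)
= \frac{1}{2\sigma_t^2}\|\tilde{u}_t - \mu_{\theta}\|^2
+ \frac{D}{2}\Big(\frac{\tilde{\beta}_t}{\sigma_t^2} - 1 - \log \frac{\tilde{\beta}_t}{\sigma_t^2}\Big).
\]

The second term contains no \(\theta\), so it is the constant \(C\); it vanishes when \(\sigma_t^2 = \tilde{\beta}_t\).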

\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \Rightarrow x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}x_t - \frac{\sqrt{1 - \bar{\alpha}_t} }{\sqrt{\bar{\alpha}_t}} \epsilon.
\]

Therefore

\[\begin{array}{ll}
\mathbb{E}_q [L_{t-1} - C]
&= \mathbb{E}_{x_0, \epsilon} \bigg\{
\frac{1}{2 \sigma_t^2} \| \mu_{\theta}(x_t, t) - \tilde{u}_t\big( x_t, (\frac{1}{\sqrt{\bar{\alpha}_t}}x_t - \frac{\sqrt{1 - \bar{\alpha}_t} }{\sqrt{\bar{\alpha}_t}} \epsilon) \big)\|^2 \bigg\} \\
&= \mathbb{E}_{x_0, \epsilon} \bigg\{
\frac{1}{2 \sigma^2_t} \| \mu_{\theta}(x_t, t) -
\frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon \big) \|^2 \bigg\} \\
\end{array}
\]

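Note: in the expression above \(x_t\) is determined by \(x_0\) and \(\epsilon\), i.e. \(x_t = x_t(x_0, \epsilon)\), so the expectation over \((x_0, \epsilon)\) is in effect an expectation over \(x_t\).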
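
The substitution step can be spelled out as follows, using only the definition of \(\tilde{u}_t\) together with \(\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}\) and \(\beta_t = 1 - \alpha_t\):

\[\begin{array}{ll}
\tilde{u}_t\big(x_t, \frac{1}{\sqrt{\bar{\alpha}_t}}(x_t - \sqrt{1 - \bar{\alpha}_t}\epsilon)\big)
&= \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{(1 - \bar{\alpha}_t)\sqrt{\bar{\alpha}_t}} (x_t - \sqrt{1 - \bar{\alpha}_t}\epsilon) + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t \\
&= \frac{\beta_t + \alpha_t(1 - \bar{\alpha}_{t-1})}{\sqrt{\alpha_t}(1 - \bar{\alpha}_t)} x_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1 - \bar{\alpha}_t}} \epsilon \\
&= \frac{1}{\sqrt{\alpha_t}} \Big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon \Big),
\end{array}
\]

where the last line uses \(\beta_t + \alpha_t(1 - \bar{\alpha}_{t-1}) = 1 - \alpha_t + \alpha_t - \bar{\alpha}_t = 1 - \bar{\alpha}_t\).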

Given this, we may as well parameterize \(\mu_{\theta}\) directly as

\[\mu_{\theta}(x_t, t):=
\frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \big),
\]

that is, the network models the noise (residual) \(\epsilon\) directly.

The loss then simplifies to:

\[\mathbb{E}_{x_0, \epsilon} \bigg\{
\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \|\epsilon_{\theta}(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon, t) - \epsilon\|^2
\bigg\}
\]
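
To get a feel for the coefficient in front of the squared error, the sketch below evaluates it for every \(t\). It assumes the choice \(\sigma_t^2 = \beta_t\) and a linear schedule from 0.0001 to 0.02 with \(T = 1000\); `bound_weights` is an illustrative name.

```python
def bound_weights(betas):
    """Weights beta_t^2 / (2 sigma_t^2 alpha_t (1 - abar_t)), taking sigma_t^2 = beta_t."""
    weights, abar = [], 1.0
    for beta_t in betas:
        alpha_t = 1.0 - beta_t
        abar *= alpha_t                 # running product abar_t
        sigma2 = beta_t                 # assumed variance choice
        weights.append(beta_t ** 2 / (2.0 * sigma2 * alpha_t * (1.0 - abar)))
    return weights

betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]
weights = bound_weights(betas)
```

Under these assumptions the weight is roughly 0.5 at \(t = 1\) and roughly 0.01 at \(t = T\), so replacing it by a constant shifts relative emphasis toward the larger, noisier timesteps.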

This is in fact denoising score matching.

Similarly, sampling from \(p_{\theta}(x_{t-1}|x_t)\) becomes:

\[x_{t-1} =
\frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \big) + \sigma_t z, \: z \sim \mathcal{N}(0, I),
\]

This has the form of Langevin dynamics (with somewhat different step sizes and weights).

Note: for this part, see here.
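
The sampling rule above can be sketched as a loop from \(t = T\) down to \(t = 1\). This is a minimal plain-Python version assuming \(\sigma_t^2 = \beta_t\) and no added noise on the last step; `eps_model` is a hypothetical stand-in for the trained network, and `make_point_mass_model` is a toy predictor that is exact only when all data sit at a single point.

```python
import math
import random

def sample(eps_model, dim, betas, rng):
    """Ancestral sampling x_T -> x_0; eps_model(x, t) takes a list and a 1-indexed t."""
    alphas = [1.0 - b for b in betas]
    abars, prod = [], 1.0
    for a in alphas:
        prod *= a
        abars.append(prod)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        beta_t, alpha_t, abar_t = betas[t - 1], alphas[t - 1], abars[t - 1]
        eps = eps_model(x, t)
        coef = beta_t / math.sqrt(1.0 - abar_t)
        mean = [(xi - coef * ei) / math.sqrt(alpha_t) for xi, ei in zip(x, eps)]
        sigma = math.sqrt(beta_t) if t > 1 else 0.0  # z = 0 on the final step
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x

def make_point_mass_model(target, betas):
    """Toy noise predictor: exact when every data point equals `target`."""
    abars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        abars.append(prod)
    def eps_model(x, t):
        abar_t = abars[t - 1]
        return [(xi - math.sqrt(abar_t) * ci) / math.sqrt(1.0 - abar_t)
                for xi, ci in zip(x, target)]
    return eps_model

betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]
target = [0.5, -0.25]
x0_sample = sample(make_point_mass_model(target, betas), 2, betas, random.Random(0))
```

With the toy predictor the chain ends at `target`, which is a quick way to check that the coefficients in the update are wired correctly before plugging in a real network.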

\(L_0\)

Finally we need to handle \(L_0\). Here the authors assume that \(x_0|x_1\) follows a discrete distribution: pixel values lie in \(\{0, 1, 2, \cdots, 255\}\) and are scaled to \([-1, 1]\). Assume

\[p_{\theta}(x_0|x_1) = \prod_{i=1}^D \int_{\delta_{-}(x_0^i)}^{\delta_+(x_0^i) } \mathcal{N}(x; \mu_{\theta}^i(x_1, 1), \sigma_1^2) \mathrm{d}x,
\]
\[\delta_+(x) =
\left \{
\begin{array}{ll}
+\infty & \text{if } x = 1, \\
x + \frac{1}{255} & \text{if } x < 1,
\end{array}
\right .
\quad
\delta_- (x) =
\left \{
\begin{array}{ll}
-\infty & \text{if } x = -1, \\
x - \frac{1}{255} & \text{if } x > -1.
\end{array}
\right .
\]

In effect, this splits the support of an ordinary normal distribution into the bins:

\[(-\infty, -1 + 1/255], (-1 + 1 / 255, -1 + 3/255], \cdots, (1 - 3/255, 1 - 1/255], (1 - 1 / 255, +\infty)
\]

and each pixel value falls into exactly one of them.
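
A minimal sketch of this binned likelihood for a single pixel, in plain Python. For clarity it uses the exact Gaussian CDF via `math.erf` rather than any approximation; the function names and the example mean and standard deviation are illustrative.

```python
import math

def normal_cdf(x, mean, std):
    """Exact Gaussian CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def discretized_gaussian_prob(x0, mean, std):
    """P(pixel = x0) under N(mean, std^2) with the bins above; x0 is on the grid in [-1, 1]."""
    # Edge bins extend to infinity, so their CDF values are 1 and 0 respectively.
    cdf_upper = 1.0 if x0 >= 1.0 else normal_cdf(x0 + 1.0 / 255.0, mean, std)
    cdf_lower = 0.0 if x0 <= -1.0 else normal_cdf(x0 - 1.0 / 255.0, mean, std)
    return cdf_upper - cdf_lower

# The 256 pixel levels {0, ..., 255} mapped to [-1, 1].
grid = [k / 127.5 - 1.0 for k in range(256)]
total = sum(discretized_gaussian_prob(x, 0.1, 0.05) for x in grid)
```

Because the bins tile the real line, `total` is 1 up to rounding, and \(L_0\) for one image would be the sum of `-math.log(prob)` over its \(D\) pixels.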

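In the actual code, one has to evaluate the Gaussian cumulative distribution function \(\Phi\), which has no closed form in elementary functions; the authors use the following approximation:

\[\Phi(x) \approx \frac{1}{2} \Bigg\{1 + \tanh \bigg(\sqrt{2/\pi}\, x (1 + 0.044715 x^2) \bigg) \Bigg\}.
\]

This way gradients can also propagate back through the likelihood.

Note: this approximation is due to Page (1977).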
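
As a quick check, the sketch below compares the tanh approximation \(\Phi(x) \approx \frac{1}{2}\{1 + \tanh(\sqrt{2/\pi}(x + 0.044715 x^3))\}\) with the exact CDF computed from `math.erf` on a grid over \([-5, 5]\). The grid and function names are choices made for this example.

```python
import math

def approx_standard_normal_cdf(x):
    """Tanh-based approximation to the standard normal CDF."""
    return 0.5 * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def exact_standard_normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Largest absolute deviation on a grid of step 0.01 over [-5, 5].
max_abs_err = max(
    abs(approx_standard_normal_cdf(k / 100.0) - exact_standard_normal_cdf(k / 100.0))
    for k in range(-500, 501)
)
```

The deviation stays well below \(10^{-3}\) on this grid, which is small compared with the probability mass of a typical pixel bin.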

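The final algorithm

Note: \(t=1\) corresponds to \(L_0\), and \(t=2,\cdots, T\) correspond to \(L_{1}, \cdots, L_{T-1}\).

Note: for \(L_t\) the authors drop the leading coefficient, which in effect acts as a reweighting of the terms.

In practice the authors train on a sampled loss: at each step a timestep \(t\) is drawn at random instead of summing over all \(t\).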
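
One stochastic evaluation of the simplified loss might look like the sketch below (plain Python, no gradients). It assumes \(t\) is drawn uniformly from \(\{1, \cdots, T\}\) and that the leading coefficient is dropped; `eps_model` is a hypothetical stand-in for the network.

```python
import math
import random

def simple_loss_sample(eps_model, x0, betas, rng):
    """One Monte Carlo sample of || eps - eps_model(x_t, t) ||^2."""
    T = len(betas)
    t = rng.randint(1, T)  # uniform over {1, ..., T}, both ends included
    abar_t = math.prod(1.0 - b for b in betas[:t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    # Closed-form forward sample x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps.
    x_t = [math.sqrt(abar_t) * xi + math.sqrt(1.0 - abar_t) * ei
           for xi, ei in zip(x0, eps)]
    pred = eps_model(x_t, t)
    return sum((p - e) ** 2 for p, e in zip(pred, eps))

betas = [0.1, 0.2, 0.3]   # toy schedule for illustration
x0 = [0.5, -1.0]

def zero_model(x, t):
    """Baseline that always predicts zero noise."""
    return [0.0] * len(x)

def oracle_model(x, t):
    """Recovers eps exactly, but only because it is told the true x0."""
    abar_t = math.prod(1.0 - b for b in betas[:t])
    return [(xi - math.sqrt(abar_t) * ci) / math.sqrt(1.0 - abar_t)
            for xi, ci in zip(x, x0)]

loss_zero = simple_loss_sample(zero_model, x0, betas, random.Random(1))
loss_oracle = simple_loss_sample(oracle_model, x0, betas, random.Random(1))
```

In a real training loop this scalar would be averaged over a minibatch and differentiated with respect to the network parameters.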

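Details

Note that the authors' \(\epsilon_{\theta}(\cdot, t)\) depends explicitly on \(t\). In the experiments this is implemented with the positional encoding used in attention models. Let the positional encoding be \(P\):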

  1. \(t = \text{Linear}(\text{ACT}(\text{Linear}(t * P)))\), i.e. a two-layer MLP transforms the encoded timestep into time_steps;
  2. The authors use a U-Net architecture, and in each residual block:
\[x += \text{Linear}(\text{ACT}(t)).
\]
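
A sketch of a sinusoidal timestep encoding in plain Python. The exact frequency spacing differs between implementations, so the formula below is one common variant and an assumption of this example, not necessarily what the authors' code uses; the two Linear layers and the activation are left out.

```python
import math

def sinusoidal_embedding(t, dim):
    """Sinusoidal features of a scalar timestep t; dim is assumed to be even."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1/10000.
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb0 = sinusoidal_embedding(0, 8)
emb10 = sinusoidal_embedding(10, 8)
```

Each residual block would then add a learned projection of this vector to its feature map, as in the update for \(x\) above.
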

Parameters
\(T\): 1000
\(\beta_t\): in \([0.0001, 0.02]\), increasing linearly over \(t = 1,2,\cdots, T\)
backbone: U-Net

Note: the authors' implementation also uses tricks such as EMA (an exponential moving average of the weights).
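
The EMA trick keeps a smoothed copy of the parameters that is used at sampling time. A minimal sketch on a flat list of floats follows; the default decay value is an assumption for illustration, and `ema_update` is a made-up name.

```python
def ema_update(ema_params, params, decay=0.9999):
    """Return decay * ema + (1 - decay) * current, elementwise."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy usage with a small decay so the effect is visible in one step.
ema = ema_update([0.0, 2.0], [1.0, 2.0], decay=0.5)
```

The update would be called once after every optimizer step, with `params` being the freshly updated weights.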

Code

Original code

lucidrains-denoising-diffusion-pytorch
