Variational Autoencoder: Intuition and Implementation

There are two generative models competing neck and neck in the data generation business right now: the Generative Adversarial Net (GAN) and the Variational Autoencoder (VAE). The two models take different approaches to training. GAN is rooted in game theory; its objective is to find the Nash equilibrium between a discriminator net and a generator net. VAE, on the other hand, is rooted in Bayesian inference: it wants to model the underlying probability distribution of the data so that it can sample new data from that distribution.

In this post, we will look at the intuition of VAE model and its implementation in Keras.

VAE: Formulation and Intuition

Suppose we want to generate some data. A good way to do it is to first decide what kind of data we want to generate, then actually generate it. For example, say we want to generate an animal. First, we imagine the animal: it must have four legs, and it must be able to swim. Having those criteria, we can then actually generate the animal by sampling from the animal kingdom. Lo and behold, we get a platypus!

In the story above, our imagination is analogous to a latent variable. It is often useful to decide on the latent variable first in generative models, as the latent variable describes our data. Without a latent variable, it is as if we were generating data blindly. And this is the difference between GAN and VAE: VAE uses a latent variable, hence it’s an expressive model.

Alright, that fable is great and all, but how do we model it? Well, let’s talk about probability distributions.

Let’s define some notation:

  1. X: the data that we want to model, a.k.a. the animal
  2. z: the latent variable, a.k.a. our imagination
  3. P(X): the probability distribution of the data, i.e. the animal kingdom
  4. P(z): the probability distribution of the latent variable, i.e. our brain, the source of our imagination
  5. P(X|z): the distribution of generating data given the latent variable, e.g. turning imagination into a real animal

Our objective here is to model the data, hence we want to find P(X). Using the law of total probability, we can find it in relation to z as follows:

P(X) = ∫ P(X|z) P(z) dz

that is, we marginalize out z from the joint probability distribution P(X, z).

Now, if only we knew P(X, z), or equivalently, P(X|z) and P(z)…

The idea of VAE is to infer P(z) using P(z|X). This makes a lot of sense if we think about it: we want to make our latent variable likely under our data. In terms of our fable, we want to limit our imagination to the animal kingdom, so we shouldn’t imagine things like roots, leaves, tyres, glass, GPUs, refrigerators, doormats, … as it’s unlikely that those things have anything to do with the animal kingdom. Right?

But the problem is, we have to infer that distribution P(z|X), as we don’t know it yet. In VAE, as its name suggests, we infer P(z|X) using a method called Variational Inference (VI). VI is one of the popular choices of method in Bayesian inference, the other one being MCMC methods. The main idea of VI is to pose the inference as an optimization problem. How? By modeling the true distribution P(z|X) with a simpler distribution that is easy to evaluate, e.g. a Gaussian, and minimizing the difference between those two distributions using the KL divergence, which tells us how different P and Q are.
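As a quick, throwaway illustration of what the KL divergence measures (the distributions and numbers below are made up for this example only), here is a tiny numpy sketch comparing two discrete distributions against a reference p:

import numpy as np

p = np.array([0.7, 0.2, 0.1])        # reference distribution
q_close = np.array([0.6, 0.25, 0.15])  # similar to p
q_far = np.array([0.1, 0.2, 0.7])      # very different from p

def kl(q, p):
    # D_KL[q || p] = sum_i q_i * log(q_i / p_i)
    return np.sum(q * np.log(q / p))

print(kl(q_close, p))  # small value: the distributions are similar
print(kl(q_far, p))    # much larger value: the distributions differ a lot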

Alright, now let’s say we want to infer P(z|X) using Q(z|X). The KL divergence is then formulated as follows:

D_KL[Q(z|X) ∥ P(z|X)] = ∑_z Q(z|X) log (Q(z|X) / P(z|X))
                      = E[log (Q(z|X) / P(z|X))]
                      = E[log Q(z|X) − log P(z|X)]

Recall the notation above: there are three things that we haven’t used yet, namely P(X), P(X|z), and P(z). But, with Bayes’ rule, we can make them appear in the equation:

D_KL[Q(z|X) ∥ P(z|X)] = E[log Q(z|X) − log (P(X|z) P(z) / P(X))]
                      = E[log Q(z|X) − (log P(X|z) + log P(z) − log P(X))]
                      = E[log Q(z|X) − log P(X|z) − log P(z) + log P(X)]

Notice that the expectation is over z, and P(X) doesn’t depend on z, so we can move it outside of the expectation:

D_KL[Q(z|X) ∥ P(z|X)] = E[log Q(z|X) − log P(X|z) − log P(z)] + log P(X)
D_KL[Q(z|X) ∥ P(z|X)] − log P(X) = E[log Q(z|X) − log P(X|z) − log P(z)]

If we look carefully at the right hand side of the equation, we notice that it can be rewritten as another KL divergence. So let’s do that by first rearranging the signs.

D_KL[Q(z|X) ∥ P(z|X)] − log P(X) = E[log Q(z|X) − log P(X|z) − log P(z)]
log P(X) − D_KL[Q(z|X) ∥ P(z|X)] = E[log P(X|z) − (log Q(z|X) − log P(z))]
                                 = E[log P(X|z)] − E[log Q(z|X) − log P(z)]
                                 = E[log P(X|z)] − D_KL[Q(z|X) ∥ P(z)]

And this is it, the VAE objective function:

log P(X) − D_KL[Q(z|X) ∥ P(z|X)] = E[log P(X|z)] − D_KL[Q(z|X) ∥ P(z)]

At this point, what do we have? Let’s enumerate:

  1. Q(z|X), which projects our data X into the latent variable space
  2. z, the latent variable
  3. P(X|z), which generates data given the latent variable

We might feel familiar with this kind of structure. And guess what, it’s the same structure as seen in an Autoencoder! That is, Q(z|X) is the encoder net, z is the encoded representation, and P(X|z) is the decoder net! Well, well, no wonder the name of this model is Variational Autoencoder!

VAE: Dissecting the Objective

It turns out that the VAE objective function has a very nice interpretation. That is, we want to model our data, which is described by log P(X), under some error D_KL[Q(z|X) ∥ P(z|X)]. In other words, VAE tries to find a lower bound of log P(X), which in practice is good enough, as trying to find the exact distribution is often intractable.

That model can then be found by maximizing the mapping from latent variable to data, log P(X|z), and minimizing the difference between our simple distribution Q(z|X) and the true latent distribution P(z).

As we might already know, maximizing E[log P(X|z)] is maximum likelihood estimation. We see it all the time in discriminative supervised models, for example Logistic Regression, SVM, or Linear Regression. In other words, given an input z and an output X, we want to maximize the conditional distribution P(X|z) under some model parameters. So we could implement it by using any classifier with input z and output X, then optimizing the objective function using, for example, log loss or regression loss.
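To make the connection to log loss concrete, here is a small numpy sketch (with made-up numbers) showing that the binary cross-entropy we will use later as the reconstruction loss is just the negative log-likelihood of a Bernoulli P(X|z):

import numpy as np

x = np.array([1., 0., 1., 1.])           # hypothetical binary pixel values X
x_hat = np.array([0.9, 0.2, 0.8, 0.6])   # decoder outputs, i.e. Bernoulli parameters of P(X|z)

# Negative log-likelihood of x under Bernoulli(x_hat) == binary cross-entropy
nll = -np.sum(x * np.log(x_hat) + (1. - x) * np.log(1. - x_hat))
print(nll)  # maximizing E[log P(X|z)] is the same as minimizing this reconstruction loss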

What about D_KL[Q(z|X) ∥ P(z)]? Here, P(z) is the latent variable distribution. We might want to sample from P(z) later, so the easiest choice is N(0, 1). Hence, we want to make Q(z|X) as close as possible to N(0, 1) so that we can sample from it easily.

Having P(z) = N(0, 1) also adds another benefit. Let’s say we also want Q(z|X) to be Gaussian with parameters μ(X) and Σ(X), i.e. the mean and variance given X. Then, the KL divergence between those two distributions can be computed in closed form!

D_KL[N(μ(X), Σ(X)) ∥ N(0, 1)] = 1/2 (tr(Σ(X)) + μ(X)ᵀ μ(X) − k − log det(Σ(X)))

Above, k is the dimension of our Gaussian, and tr(X) is the trace function, i.e. the sum of the diagonal of matrix X. The determinant of a diagonal matrix can be computed as the product of its diagonal. So really, we can implement Σ(X) as just a vector, as it’s a diagonal matrix:

D_KL[N(μ(X), Σ(X)) ∥ N(0, 1)] = 1/2 (∑_k Σ(X) + ∑_k μ²(X) − ∑_k 1 − log ∏_k Σ(X))
                              = 1/2 (∑_k Σ(X) + ∑_k μ²(X) − ∑_k 1 − ∑_k log Σ(X))
                              = 1/2 ∑_k (Σ(X) + μ²(X) − 1 − log Σ(X))

In practice, however, it’s better to model Σ(X) as log Σ(X), as taking the exponential is more numerically stable than computing the log. Hence, our final KL divergence term (with Σ(X) now denoting the log-variance) is:

D_KL[N(μ(X), Σ(X)) ∥ N(0, 1)] = 1/2 ∑_k (exp(Σ(X)) + μ²(X) − 1 − Σ(X))
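As a sanity check on this closed form (purely illustrative; the values for the mean and log-variance below are made up), we can compare it against a Monte Carlo estimate of the KL divergence in numpy:

import numpy as np

np.random.seed(0)
k = 2
mu_x = np.array([0.5, -0.3])      # hypothetical mean mu(X)
log_var = np.array([-0.2, 0.4])   # hypothetical log-variance log Sigma(X)
var = np.exp(log_var)             # diagonal of Sigma(X)

# Closed-form KL between N(mu_x, diag(var)) and N(0, I)
kl_closed = 0.5 * np.sum(np.exp(log_var) + mu_x**2 - 1. - log_var)

# Monte Carlo estimate of E_Q[log Q(z|X) - log P(z)] with z ~ Q(z|X)
z = mu_x + np.sqrt(var) * np.random.randn(100000, k)
log_q = -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu_x)**2 / var, axis=1)
log_p = -0.5 * np.sum(np.log(2 * np.pi) + z**2, axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # the two numbers should be close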

Implementation in Keras

First, let’s implement the encoder net Q(z|X), which takes input X and outputs two things: μ(X) and Σ(X), the parameters of the Gaussian.

 

from tensorflow.examples.tutorials.mnist import input_data
from keras.layers import Input, Dense, Lambda
from keras.models import Model
from keras.objectives import binary_crossentropy
from keras.callbacks import LearningRateScheduler

import numpy as np
import matplotlib.pyplot as plt
import keras.backend as K
import tensorflow as tf

m = 50
n_z = 2
n_epoch = 10

# Q(z|X) -- encoder
inputs = Input(shape=(784,))
h_q = Dense(512, activation='relu')(inputs)
mu = Dense(n_z, activation='linear')(h_q)
log_sigma = Dense(n_z, activation='linear')(h_q)

That is, our Q(z|X) is a neural net with one hidden layer. In this implementation, the latent variable is two dimensional, so that we can easily visualize it. In practice, though, more dimensions in the latent variable should work better.

However, we are now facing a problem: how do we get z from the encoder outputs? Obviously, we could sample z from a Gaussian whose parameters are the outputs of the encoder. Alas, sampling directly won’t do if we want to train the VAE with gradient descent, as the sampling operation doesn’t have a gradient!

There is, however, a trick called the reparameterization trick, which makes the network differentiable. The reparameterization trick basically diverts the non-differentiable operation out of the network, so that, even though we still involve something non-differentiable, it is at least outside of the network, and hence the network can still be trained.

The reparameterization trick is as follows. Recall that if we have x ∼ N(μ, Σ) and then standardize it so that μ = 0, Σ = 1, we can revert it back to the original distribution by reverting the standardization process. Hence, we have this equation:

x = μ + Σ^(1/2) x_std
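Here is a quick numpy check of that identity (illustrative only, with arbitrary values for the mean and variance):

import numpy as np

np.random.seed(0)
mean, var = 2.0, 9.0               # hypothetical target mean and variance

x_std = np.random.randn(100000)    # standardized samples, x_std ~ N(0, 1)
x = mean + np.sqrt(var) * x_std    # revert the standardization

print(x.mean(), x.var())           # should be close to 2.0 and 9.0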

With that in mind, we can extend it. If we sample from a standard normal distribution, we can convert it to any Gaussian we want if we know the mean and the variance. Hence, we can implement our sampling operation of z by:

z = μ(X) + Σ^(1/2)(X) ϵ

where ϵ ∼ N(0, 1).

Now, during backpropagation, we no longer care about the sampling process, as it is now outside of the network, i.e. it doesn’t depend on anything in the net, hence the gradient doesn’t need to flow through it.

 

def sample_z(args):
    mu, log_sigma = args
    eps = K.random_normal(shape=(m, n_z), mean=0., std=1.)
    return mu + K.exp(log_sigma / 2) * eps


# Sample z ~ Q(z|X)
z = Lambda(sample_z)([mu, log_sigma])

Now we create the decoder net P(X|z):

 

# P(X|z) -- decoder
decoder_hidden = Dense(512, activation='relu')
decoder_out = Dense(784, activation='sigmoid')

h_p = decoder_hidden(z)
outputs = decoder_out(h_p)

Lastly, from this model, we can do three things: reconstruct inputs, encode inputs into latent variables, and generate data from latent variables. So, we have three Keras models:

 

# Overall VAE model, for reconstruction and training
vae = Model(inputs, outputs)

# Encoder model, to encode input into latent variable
# We use the mean as the output as it is the center point, the representative of the gaussian
encoder = Model(inputs, mu)

# Generator model, generate new data given latent variable z
d_in = Input(shape=(n_z,))
d_h = decoder_hidden(d_in)
d_out = decoder_out(d_h)
decoder = Model(d_in, d_out)

Then, we need to translate our loss into Keras code:

 

def vae_loss(y_true, y_pred):
    """ Calculate loss = reconstruction loss + KL loss for each data in minibatch """
    # E[log P(X|z)]
    recon = K.sum(K.binary_crossentropy(y_pred, y_true), axis=1)
    # D_KL(Q(z|X) || P(z)); calculated in closed form as both distributions are Gaussian
    kl = 0.5 * K.sum(K.exp(log_sigma) + K.square(mu) - 1. - log_sigma, axis=1)
    return recon + kl

and then train it:

 

vae.compile(optimizer='adam', loss=vae_loss)
vae.fit(X_train, X_train, batch_size=m, nb_epoch=n_epoch)
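The fit call above assumes X_train is already in memory; the loading code isn’t shown in this excerpt. One possible sketch, using the MNIST loader imported at the top (the old tensorflow.examples tutorials module; the directory name here is arbitrary), would be:

# Load MNIST and use the flattened 784-dimensional images as X_train.
# read_data_sets downloads the data on first use; pixel values are already in [0, 1].
mnist = input_data.read_data_sets('MNIST_data', one_hot=False)
X_train = mnist.train.images   # shape: (55000, 784)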

And that’s it, the implementation of VAE in Keras!

Implementation on MNIST Data

We could use any dataset really, but like always, we will use MNIST as an example.

After we have trained our VAE model, we can visualize the latent variable space Q(z|X):
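The scatter plot itself is not reproduced here, but a sketch of how it could be produced with the encoder model above (assuming the test split is loaded from the same mnist object as before) looks like this:

# Project the test images into the 2-D latent space and colour each point by its digit label
X_test = mnist.test.images
y_test = mnist.test.labels

z_test = encoder.predict(X_test, batch_size=m)

plt.figure(figsize=(6, 6))
plt.scatter(z_test[:, 0], z_test[:, 1], c=y_test, s=2)
plt.colorbar()
plt.show()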

As we can see, in the latent space, the representations of data points that share the same characteristics, e.g. the same label, are close to each other. Notice that during training, we never provide any label information.

We could also look at the data reconstruction by running the data through the overall VAE net:
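The reconstruction figure is not reproduced here either; a sketch of how it could be generated (note that, because sample_z uses a fixed shape of (m, n_z), the full VAE should be fed batches of exactly m examples) is:

# Reconstruct a batch of test images by passing them through the full VAE
batch = X_test[:m]
recon = vae.predict(batch, batch_size=m)

n = 10
plt.figure(figsize=(2 * n, 4))
for i in range(n):
    plt.subplot(2, n, i + 1)
    plt.imshow(batch[i].reshape(28, 28), cmap='gray')   # original
    plt.axis('off')
    plt.subplot(2, n, n + i + 1)
    plt.imshow(recon[i].reshape(28, 28), cmap='gray')   # reconstruction
    plt.axis('off')
plt.show()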

Lastly, we could generate new samples by first sampling z ∼ N(0, 1) and feeding it into our decoder net:
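A sketch of that sampling code (the decoder model has no fixed batch size, so any number of samples works) might look like this:

# Sample latent vectors from the prior N(0, 1) and decode them into images
z_sample = np.random.randn(m, n_z)
generated = decoder.predict(z_sample)

n = 10
plt.figure(figsize=(2 * n, 2))
for i in range(n):
    plt.subplot(1, n, i + 1)
    plt.imshow(generated[i].reshape(28, 28), cmap='gray')
    plt.axis('off')
plt.show()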

If we look closely at the reconstructed and generated data, we notice that some of it is ambiguous. For example, the digit 5 looks like a 3 or an 8. That’s because our latent variable space is a continuous distribution (i.e. N(0, 1)), hence there are bound to be some smooth transitions at the edges of the clusters. Also, clusters of digits are close to each other if the digits are somewhat similar. That’s why, in the latent space, 5 is close to 3.
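One way to see these smooth transitions directly (just an illustrative sketch, not part of the original implementation) is to decode points along a straight line between the latent codes of two digits:

# Decode points along a straight line between the latent codes of two test digits
codes = encoder.predict(X_test[:m], batch_size=m)
z_a, z_b = codes[0], codes[1]

steps = 10
plt.figure(figsize=(2 * steps, 2))
for i, t in enumerate(np.linspace(0., 1., steps)):
    z_interp = ((1. - t) * z_a + t * z_b).reshape(1, n_z)
    img = decoder.predict(z_interp)
    plt.subplot(1, steps, i + 1)
    plt.imshow(img.reshape(28, 28), cmap='gray')
    plt.axis('off')
plt.show()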

Conclusion

In this post we looked at the intuition behind Variational Autoencoder (VAE), its formulation, and its implementation in Keras.

We also saw the difference between VAE and GAN, the two most popular generative models nowadays.

For more math on VAE, be sure to check out the original paper by Kingma & Welling, 2014. There is also an excellent tutorial on VAE by Carl Doersch. Check out the references section below.

The full code is available in my repo:

https://github.com/wiseodd/generative-models

References

  • Doersch, Carl. “Tutorial on variational autoencoders.” arXiv preprint arXiv:1606.05908 (2016).
  • Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).
  • “Building Autoencoders in Keras.” The Keras Blog. https://blog.keras.io/building-autoencoders-in-keras.html
