Study notes for Sparse Coding

Sparse Coding

Sparse coding is a class of unsupervised methods for learning over-complete sets of basis vectors to represent data efficiently. The aim of sparse coding is to find a set of $k$ basis vectors $\phi_i$ such that an input vector $x\in \mathbb{R}^n$ can be represented as a linear combination of these basis vectors, where the set is over-complete (i.e., $k>n$):

$x=\sum_{i=1}^k a_i \phi_i$
The advantage of having an over-complete basis is that our basis vectors $\phi_i$ are better able to capture structures and patterns inherent in the input data $x$ .
However, with an over-complete basis, the coefficients are no longer uniquely determined by the input vector. Therefore, in sparse coding, we introduce the additional criterion of sparsity to resolve the degeneracy introduced by over-completeness.
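To make the non-uniqueness concrete, here is a minimal sketch in plain Python. The three basis vectors in $\mathbb{R}^2$, the input $x$, and the helper names `reconstruct` and `l1_norm` are illustrative assumptions, not part of the original notes. Three different coefficient vectors reproduce the same $x$ exactly, and a sparsity criterion is one way to choose among them.

```python
# Three basis vectors in R^2 (k = 3 > n = 2), i.e. an over-complete basis.
# The vectors and the input x are made up for illustration.
basis = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
x = (2.0, 3.0)

def reconstruct(coeffs, basis):
    """Return sum_i a_i * phi_i as a tuple."""
    n = len(basis[0])
    return tuple(sum(a * phi[d] for a, phi in zip(coeffs, basis)) for d in range(n))

def l1_norm(coeffs):
    """Sum of absolute values of the coefficients."""
    return sum(abs(a) for a in coeffs)

code_a = (2.0, 3.0, 0.0)  # uses phi_1 and phi_2
code_b = (0.0, 1.0, 2.0)  # uses phi_2 and phi_3
code_c = (1.0, 2.0, 1.0)  # uses all three basis vectors

# Every code reconstructs x, but the L1 norms differ.
for code in (code_a, code_b, code_c):
    print(code, reconstruct(code, basis), l1_norm(code))
```

All three codes give the same reconstruction, so the input alone does not determine the coefficients. In this toy case an $L_1$ criterion would prefer `code_b`.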
The sparse coding cost function is defined on a set of m input vectors as:
$\mbox{minimize}_{a_i^{(j)}, \phi_i} \sum_{j=1}^m \left( \|x^{(j)}-\sum_{i=1}^k a_i^{(j)} \phi_i\|^2 + \lambda \sum_{i=1}^k S(a_i^{(j)}) \right)$

where $S(.)$ is a sparsity function which penalizes $a_i$ for being far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of $x$ , and the second term as a sparsity penalty which forces our representation of $x$ (i.e., the learned features) to be sparse. The constant $\lambda$ is a scaling constant to determine the relative importance of these two contributions.
Although the most direct measure of sparsity is the $L_0$ norm, it is non-differentiable and difficult to optimize in general. In practice, common choices for the sparsity cost $S(\cdot)$ are the $L_1$ penalty $S(a_i)=|a_i|$ and the log penalty $S(a_i)=\log(1+a_i^2)$.
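As a small numerical check of these penalties, the sketch below evaluates the $L_0$ count, the $L_1$ penalty and the log penalty on two coefficient vectors with the same Euclidean norm. The two vectors and the function names are assumptions chosen for illustration.

```python
import math

def l0_penalty(a):
    """Number of non-zero coefficients."""
    return sum(1 for v in a if v != 0)

def l1_penalty(a):
    """Sum of |a_i|."""
    return sum(abs(v) for v in a)

def log_penalty(a):
    """Sum of log(1 + a_i^2)."""
    return sum(math.log(1.0 + v * v) for v in a)

# Two codes with the same Euclidean norm (both equal to 1).
sparse_code = [1.0, 0.0, 0.0, 0.0]
dense_code = [0.5, 0.5, 0.5, 0.5]

for name, code in (("sparse", sparse_code), ("dense", dense_code)):
    print(name, l0_penalty(code), l1_penalty(code), log_penalty(code))
```

All three penalties are lower for the sparse code, even though both codes have the same energy. That is the behaviour the sparsity term relies on.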
It is also possible to make the sparsity penalty arbitrarily small by scaling down $a_i$ and scaling up $\phi_i$ by some large constant. To prevent this from happening, we constrain $\|\phi_i\|^2$ to be less than some constant $C$. The full sparse coding cost function is hence:
$\mbox{minimize}_{a_i^{(j)}, \phi_i} \sum_{j=1}^m \left( \|x^{(j)}-\sum_{i=1}^k a_i^{(j)} \phi_i\|^2 + \lambda \sum_{i=1}^k S(a_i^{(j)}) \right)$

$\mbox{subject to } \|\phi_i\|^2\le C, \forall i=1, \ldots, k$

where the constant is usually set to $C=1$.
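The scaling degeneracy and the effect of the norm constraint can be seen with a single basis vector and a single coefficient. The sketch below is only an illustration: the numbers and the helper `project_to_ball` (which rescales a vector so that its squared norm is at most $C$) are assumptions, not something the original notes prescribe.

```python
import math

def reconstruct(a, phi):
    """Reconstruction a * phi for a single basis vector."""
    return [a * p for p in phi]

def project_to_ball(phi, C=1.0):
    """Rescale phi so that ||phi||^2 <= C; leave it unchanged if already inside."""
    norm_sq = sum(p * p for p in phi)
    if norm_sq <= C:
        return list(phi)
    scale = math.sqrt(C / norm_sq)
    return [p * scale for p in phi]

phi = [0.6, 0.8]  # unit-norm basis vector
a = 2.0           # its coefficient

# Without a constraint: shrink the coefficient, grow the basis vector.
c = 100.0
a_scaled = a / c
phi_scaled = [c * p for p in phi]

print(reconstruct(a, phi), abs(a))                # original pair
print(reconstruct(a_scaled, phi_scaled), abs(a_scaled))  # same fit, smaller penalty

# With the constraint, the inflated basis vector is pulled back to the ball,
# so the tiny coefficient no longer reproduces the input.
phi_projected = project_to_ball(phi_scaled, C=1.0)
print(reconstruct(a_scaled, phi_projected))
```

The reconstruction is unchanged by the rescaling while the $L_1$ penalty drops by a factor of $c$. This is why the basis vectors need a norm bound.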
One problem is that this constraint cannot be enforced using simple gradient-based methods. Hence, in practice, the constraint is weakened to a "weight decay" term designed to keep the entries of $\phi_i$ small:
$\mbox{minimize}_{a_i^{(j)}, \phi_i} \sum_{j=1}^m \left( \|x^{(j)}-\sum_{i=1}^k a_i^{(j)} \phi_i\|^2 + \lambda \sum_{i=1}^k S(a_i^{(j)}) \right)+\gamma \|\phi\|_2^2$
Another problem is that the $L_1$ norm is not differentiable at 0, which poses a problem for gradient-based methods. We "smooth out" the $L_1$ norm with an approximation that allows us to use gradient descent: we use $\sqrt{x^2+\epsilon}$ in place of $|x|$, where $\epsilon$ is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when $\epsilon$ is large compared to $x^2$, the sum $x^2+\epsilon$ is dominated by $\epsilon$, and taking the square root yields approximately $\sqrt{\epsilon}$).
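The smoothed absolute value and its derivative $x/\sqrt{x^2+\epsilon}$ are easy to check numerically. The sketch below uses assumed function names and an assumed value of $\epsilon$. It shows that the approximation is close to $|x|$ away from zero and has a well-defined gradient at zero.

```python
import math

def smooth_abs(x, eps):
    """Smooth approximation sqrt(x^2 + eps) of |x|."""
    return math.sqrt(x * x + eps)

def smooth_abs_grad(x, eps):
    """Derivative of sqrt(x^2 + eps) with respect to x."""
    return x / math.sqrt(x * x + eps)

eps = 1e-4  # illustrative smoothing parameter
for x in (-3.0, -0.01, 0.0, 0.01, 3.0):
    print(x, abs(x), smooth_abs(x, eps), smooth_abs_grad(x, eps))
```

At $x=0$ the smoothed value is $\sqrt{\epsilon}$ and the gradient is 0, whereas $|x|$ has no gradient there. For $|x|$ much larger than $\sqrt{\epsilon}$ the gradient approaches $\pm 1$, as for $|x|$.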
Hence, the final objective function is:
$J(\phi, A)=\sum_{j=1}^m \left( \|x^{(j)}-\sum_{i=1}^k a_i^{(j)} \phi_i\|^2 + \lambda \sum_{i=1}^k \sqrt{(a_i^{(j)})^2+\epsilon} \right)+\gamma \|\phi\|_2^2$
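In matrix form this objective is short to write down. The NumPy sketch below makes several assumptions of its own: the basis vectors are stacked as the columns of `Phi` (shape $n \times k$), the coefficients of example $j$ form column $j$ of `A` (shape $k \times m$), the inputs form the columns of `X`, and $\|\phi\|_2^2$ is read as the sum of squared entries over all basis vectors.

```python
import numpy as np

def sparse_coding_objective(Phi, A, X, lam, gamma, eps):
    """Smoothed sparse coding objective J(Phi, A).

    Phi: (n, k) array, columns are the basis vectors phi_i.
    A:   (k, m) array, column j holds the coefficients of example j.
    X:   (n, m) array, column j is the input x^(j).
    """
    residual = X - Phi @ A
    reconstruction = np.sum(residual ** 2)           # sum_j ||x^(j) - Phi a^(j)||^2
    sparsity = lam * np.sum(np.sqrt(A ** 2 + eps))   # smoothed L1 penalty on all codes
    weight_decay = gamma * np.sum(Phi ** 2)          # assumed reading of ||phi||_2^2
    return reconstruction + sparsity + weight_decay

# Tiny illustrative call with made-up data.
X = np.array([[1.0, -2.0], [0.0, 3.0]])
Phi = np.eye(2)
A = X.copy()
print(sparse_coding_objective(Phi, A, X, lam=0.5, gamma=0.1, eps=0.0))
```

With `Phi` equal to the identity and `A = X`, the reconstruction term vanishes, so only the sparsity and weight decay terms contribute.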
The set of basis vectors is called the "dictionary" ($D$). $D$ is "adapted" to $x$ if it can represent $x$ with a few basis vectors, that is, if there exists a sparse vector $\alpha$ in $\mathbb{R}^p$ such that $x=D\alpha=\sum_{i=1}^p \alpha_i d_i$, where $d_i$ is the $i$-th column of $D$. We call $\alpha$ the sparse code.

Learning

Learning a set of basis vectors using sparse coding consists of alternating between two separate optimizations (i.e., an alternating optimization method):
- The first is an optimization over the coefficients $a_i$ for each training example $x$.
- The second is an optimization over the basis vectors $\phi$ across many training examples at once.
Classical optimization that alternates between $D$ and $\alpha$ can achieve good results, but it is very slow.
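The alternating scheme can be sketched as follows. This is a rough illustration, not the procedure of any particular reference. It assumes the smoothed objective with weight decay, takes one plain gradient step per block, and uses a simple backtracking rule (halve the step until the objective does not increase). The function names, hyperparameters and random data are all assumptions.

```python
import numpy as np

def objective(Phi, A, X, lam, gamma, eps):
    """Smoothed sparse coding objective with weight decay."""
    R = X - Phi @ A
    return np.sum(R ** 2) + lam * np.sum(np.sqrt(A ** 2 + eps)) + gamma * np.sum(Phi ** 2)

def grad_A(Phi, A, X, lam, eps):
    """Gradient of the objective with respect to the coefficients A."""
    return -2.0 * Phi.T @ (X - Phi @ A) + lam * A / np.sqrt(A ** 2 + eps)

def grad_Phi(Phi, A, X, gamma):
    """Gradient of the objective with respect to the basis Phi."""
    return -2.0 * (X - Phi @ A) @ A.T + 2.0 * gamma * Phi

def backtracking_step(f, x, g, step=1.0, shrink=0.5, max_tries=30):
    """Take x - step * g, halving step until f does not increase."""
    f0 = f(x)
    for _ in range(max_tries):
        x_new = x - step * g
        if f(x_new) <= f0:
            return x_new
        step *= shrink
    return x  # no acceptable step found; keep the current point

def alternating_sparse_coding(X, k, lam=0.1, gamma=0.01, eps=1e-6, n_outer=50, seed=0):
    """Alternate gradient steps on the codes A and on the basis Phi."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    Phi = 0.1 * rng.standard_normal((n, k))
    A = 0.1 * rng.standard_normal((k, m))
    history = [objective(Phi, A, X, lam, gamma, eps)]
    for _ in range(n_outer):
        # Step 1: update the coefficients with the basis held fixed.
        A = backtracking_step(lambda A_: objective(Phi, A_, X, lam, gamma, eps),
                              A, grad_A(Phi, A, X, lam, eps))
        # Step 2: update the basis with the coefficients held fixed.
        Phi = backtracking_step(lambda P_: objective(P_, A, X, lam, gamma, eps),
                                Phi, grad_Phi(Phi, A, X, gamma))
        history.append(objective(Phi, A, X, lam, gamma, eps))
    return Phi, A, history

# Illustrative run on random data: n = 4 dimensions, m = 20 examples, k = 8 basis vectors.
X = np.random.default_rng(1).standard_normal((4, 20))
Phi, A, history = alternating_sparse_coding(X, k=8)
print(history[0], history[-1])
```

Because each block update is only accepted when it does not increase the objective, the recorded values never go up. Practical implementations typically solve each sub-problem more thoroughly than a single gradient step.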
A significant limitation of sparse coding is that even after a set of basis vectors has been learnt, optimization must still be performed to "encode" a new data example, i.e., to obtain the required coefficients. This significant "runtime" cost means that sparse coding is computationally expensive to implement even at test time, especially compared to typical feed-forward architectures.
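To illustrate this test-time cost, the sketch below encodes one new example against a fixed dictionary using ISTA (iterative soft-thresholding). ISTA is a standard solver for the $L_1$-penalized problem, swapped in here for illustration; the notes themselves do not name a solver. The dictionary, the input and the hyperparameters are made-up assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def encode_ista(x, Phi, lam=0.1, n_iter=200):
    """Approximately minimize ||x - Phi a||^2 + lam * ||a||_1 over a.

    Phi has shape (n, k) with basis vectors as columns; x has shape (n,).
    """
    # Lipschitz constant of the gradient of the reconstruction term.
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * Phi.T @ (x - Phi @ a)
        a = soft_threshold(a - grad / L, lam / L)
    return a

# Hypothetical fixed dictionary with unit-norm columns (n = 5, k = 10).
rng = np.random.default_rng(0)
Phi = rng.standard_normal((5, 10))
Phi /= np.linalg.norm(Phi, axis=0)

# A new example built from a single basis vector.
x_new = 2.0 * Phi[:, 3]
a_new = encode_ista(x_new, Phi)
print("non-zero coefficients:", np.count_nonzero(np.abs(a_new) > 1e-6))
```

Every new input needs its own iterative loop like this one, whereas a feed-forward encoder would need a single matrix multiplication and a non-linearity.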

Remarks

In my view, because sparseness is enforced on the sparse code during dictionary learning, the reconstructed matrix removes some of the noise in the original matrix, i.e., it has a denoising effect. Hence, sparse coding could be used to denoise images.

References

Sparse coding: http://ufldl.stanford.edu/wiki/index.php/Sparse_Coding
Sparse coding: autoencoder interpretation: http://ufldl.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_Interpretation
Sparse coding: exercise: http://ufldl.stanford.edu/wiki/index.php/Exercise:Sparse_Coding
Sparse coding and dictionary learning for image analysis, ICCV 2010 tutorial.
