Notes on Representation Learning

Tags: Deep Learning, Neural Networks
TL;DR: Representation learning can eliminate the need for large labeled data sets to train deep neural networks, opening up new domains to machine learning and transforming the practice of Data Science.
Check out “Notes on Representation Learning” in these three parts.
- Notes on Representation Learning
- Notes on Representation Learning Continued
- Representation Learning Bonus Material
Deep Learning and Labeled Datasets
The greatest strength of Deep Learning (DL) is also one of its biggest weaknesses. DL models frequently have many millions of parameters. The extreme number of parameters—compared to other sorts of machine learning models—gives DL models tremendous flexibility to learn arbitrarily complex functions that simpler models cannot learn. But this flexibility makes it very easy to “overfit” on a training set (essentially, memorize specific examples instead of learning underlying patterns that allow generalization to examples not in the training set).
The conceptually simplest way to prevent overfitting is to train on very large datasets. If the dataset is big in relation to the number of parameters, then the network will not have enough capacity to memorize examples and will be “forced” to instead learn underlying patterns when optimizing a loss function. But creating large, labeled datasets for every task we want to perform is cost-prohibitive (and may even be impossible if the goal is general-purpose intelligent agents).
This need for large training sets is often the biggest obstacle to applying DL to real-world problems. On small datasets, other types of models can outperform DL to the extent that the constraints of those models match the task at hand. For instance, if there is a simple linear relationship in the data, a linear regression can greatly outperform a DL model trained on a small dataset because the linear constraint of the model corresponds to the data.
Figure 1. Neural nets have a tendency to overfit when datasets are too small. Here the true relationship between the height and weight of an animal and whether it is a dog or a cat is essentially linear. A linear classifier assumes this relationship and uses the data merely to determine the slope and intercept. A large neural network will require much more data to learn a straight-line partition. With a dataset that is small relative to the size of the network, it will overfit on unusual examples, reducing predictive performance. (Source: https://kevinbinz.files.wordpress.com/2014/08/ml-svm-after-comparison.png)
That correspondence allows the model to learn from a small dataset much more efficiently than a DL model because a DL model needs to learn the linear relationship whereas a linear regression simply assumes it. Simple linear classifiers are sufficient for a simple problem like the one illustrated above; more complex problems, however, require models capable of modeling complex relationships within the data. Much of the work in applying machine learning involves choosing models with constraints and power that match the dataset. While DL has dramatically outperformed all other models on many tasks, to a large extent it has only done so for complex problems where big labeled datasets are available for training.
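To make the data-efficiency contrast concrete, here is a minimal sketch (in Python with scikit-learn, not from the original post) of a nearly linear two-feature classification problem like the height/weight example above. The dataset, model sizes, and hyperparameters are all illustrative assumptions.

```python
# Compare a linear classifier with a much larger neural network on a small,
# roughly linearly separable toy dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: the class depends (almost) linearly on two features.
X = rng.normal(size=(60, 2))
y = (X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.3, size=60) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=20, random_state=0)

linear = LogisticRegression().fit(X_train, y_train)
big_net = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=2000,
                        random_state=0).fit(X_train, y_train)

print("linear model test accuracy:", linear.score(X_test, y_test))
print("large net    test accuracy:", big_net.score(X_test, y_test))
```

With only 20 training points, the linear model's built-in constraint typically matches or beats the heavily over-parameterized network, which is exactly the trade-off described above.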
Representation Learning
This blog post describes how the need for large, labeled datasets to train DL models is coming to an end. Over the last year there have been many research results demonstrating how DL models can learn much more efficiently than other models—outperforming alternatives even with very small labeled training sets. Indeed, in some remarkable cases, described below, DL models can learn to perform complex tasks with only a single labeled example (“one-shot learning”) or even without any labeled data at all (“zero-shot learning”). Over the next few years, these research results will be rolled out to production systems, and further innovations will continue to improve data efficiency even more.
The key to this progress is what DL researchers call “representation learning”—a topic considered so important that prominent researchers named the premier DL conference the International Conference on Learning Representations. Part of the enthusiasm for learning representations is that rather than training DL models on labeled data specific to a target task, you can train them on labeled data for a different problem, or more importantly, on unlabeled data. In the process of training on unlabeled data, the model builds up a reusable internal representation of the data. For instance, in an image classification example (further described below), a network first learns to generate bedroom scenes. To do this convincingly it must develop an internal representation of the world: its 3-dimensional structure, visual perspective, interior design, typical bedroom furniture, etc. In other words, using unsupervised learning (on unlabeled data) the model builds an understanding of how the world of bedrooms actually works in order to produce pictures of bedrooms. Once a network has an internal representation like this it can learn to recognize objects in images much more easily. Learning to recognize a “bed” could become almost as simple as learning to associate the word “bed” with an object that the network already knows a lot about—its 3-dimensional shape, colors, location in rooms, typical surrounding furniture, etc. As a result, instead of needing hundreds or thousands of labeled examples of objects, the model could learn from just a handful of examples.
Figure 2. Bedroom scene generated by a DL model. No information about bedrooms, bedroom furniture, lighting, visual perspective, etc. was programmed into the network but it learns enough about those things to produce realistic looking images and plausible bedroom arrangements purely by training on bedroom images. (Source: https://arxiv.org/pdf/1511.06434v2.pdf)
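As a rough illustration of this workflow—unsupervised pretraining followed by learning from only a handful of labels—here is a hedged Python sketch. `pretrained_feature_extractor` is a hypothetical stand-in (for example, the convolutional part of a network trained to generate or reconstruct images, with its output layer removed), and the random data is a placeholder so the sketch runs.

```python
# Extract features from a model pre-trained without labels, then fit a small
# classifier on only a handful of labeled examples.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pretrained_feature_extractor(images: np.ndarray) -> np.ndarray:
    # Placeholder standing in for the representation learned during
    # unsupervised training (here just a fixed random projection + ReLU).
    rng = np.random.default_rng(0)
    W = rng.normal(size=(images.shape[1], 64))
    return np.maximum(images @ W, 0.0)

# A "handful" of labeled examples: 10 flattened images with bed / no-bed labels.
rng = np.random.default_rng(1)
few_images = rng.normal(size=(10, 4096))
few_labels = np.array([0, 1, 0, 1, 1, 0, 1, 0, 0, 1])

features = pretrained_feature_extractor(few_images)
clf = LogisticRegression(max_iter=1000).fit(features, few_labels)
print(clf.predict(pretrained_feature_extractor(rng.normal(size=(3, 4096)))))
```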
Breakthroughs in representation learning herald a sea change in machine learning that will help unlock the insights of the big-data era. Today data scientists work by carefully choosing machine learning models with constraints that match their problem domain and then painstakingly tuning those models to squeeze out every last drop of learning available from small labeled datasets. Over the coming years that workflow will move to selecting DL models pre-trained on enormous unlabeled datasets to build up internal representations, and then training on just a handful of labeled examples to solve the task at hand. Instead of just choosing the right model, machine learning practitioners will choose a model and a prepackaged representation already trained up on related data. This workflow is already common in image recognition, where deep learning has been dominant for some time, and in certain NLP domains, like parsing, and is spreading to other domains.
As we continue to transition to this new paradigm, the number of problems we can solve with machine learning will explode. Right now, we are bumping up against the limits of what simple, highly constrained models with no learned internal representation of the world can accomplish. We can’t squeeze more blood from that stone; big jumps in capability will instead come from models that have some understanding of the world and that can thus interpret data within a larger, more meaningful context. The way forward is not magic new machine learning models that can squeeze more accuracy out of labeled datasets without any understanding of the world those datasets come from, but rather pre-trained models that bring an understanding of the world in which they are operating.
Examples of Recent Representation Learning Progress
Here I describe some of the remarkable advances being made in representation learning and how they are increasing the data efficiency of DL models. In all of these examples, DL models were able to learn with much less labeled data than simpler alternatives require. Though this is a small sample, I’ve tried to select examples across different problem domains (natural language processing, image classification, and intelligent agents) and learning types (supervised, semi-supervised, unsupervised, and reinforcement) to illustrate the variety of approaches which are seeing impressive success.
Transfer Learning with Progressive Neural Networks
Progressive Neural Networks (“PNNs”) are DL models specially modified to be able to (1) learn multiple tasks from different datasets in sequence without forgetting tasks learned earlier in the sequence and (2) reuse the representations learned from earlier tasks to accelerate the learning of subsequent tasks. Reusing representations like this from one task to another in order to accelerate subsequent learning is called “transfer learning”, a recurring theme in the examples below. Progressive Neural Networks employ transfer learning to improve data efficiency when learning new tasks—i.e. new tasks are learned with much less labeled data.
To understand the value of transfer learning, you could imagine, for example, a convolutional network learning low-level features like edges that are aggregated up to parts of faces like ears and mouths, and ultimately to whole faces. Later, you may retrain that network on a different visual recognition task, perhaps recognition of cars. In that case, the network may be able to mostly reuse the low-level edge features while overwriting the higher-level features to aggregate edges into car parts instead of into face parts.
Figure 3. (Source: http://web.eecs.umich.edu/~honglak/icml09-ConvolutionalDeepBeliefNetworks.pdf)
Retraining a normal convolutional network like this, and, in the process, overwriting some previously learned features, is called fine-tuning. While transfer learning by fine-tuning has seen very successful application, it has some important drawbacks. Most importantly, we may want to transfer knowledge from multiple tasks to a new task. However, during fine-tuning, the ability to do the first task can be catastrophically forgotten when learning the second one. Imagine that after training a model on faces and then cars, as in the example above, you subsequently want to train the model to recognize people in their cars (perhaps for a traffic enforcement application). Many of the features that aggregate edges into human facial features like eyes, ears, and mouths could have been reused, but unfortunately they were destroyed when learning car features. This means that when learning a third task the network may not be able to draw on useful features learned in the first task. The goal behind PNNs is to have a network that can continue learning from diverse datasets, continually expanding its knowledge as it goes.
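For readers who want to see what transfer learning by fine-tuning looks like in practice, here is a hedged sketch using PyTorch and torchvision. The model choice, the layer split, and the two-class “person in car” head are illustrative assumptions, not the setup of any particular paper.

```python
# Reuse low-level features learned on one task and retrain the top of the
# network for a new one.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)   # representation learned on ImageNet
                                           # (newer torchvision uses the weights= argument)

# Freeze the early layers (generic edge/texture features we want to reuse)...
for param in model.parameters():
    param.requires_grad = False

# ...and replace the task-specific head. The old head is overwritten, which is
# exactly the forgetting risk discussed above.
num_new_classes = 2                        # illustrative: "person in car" vs. not
model.fc = nn.Linear(model.fc.in_features, num_new_classes)

# Only the new head's parameters are trained on the small labeled dataset.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```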
If you already understand simple Feedforward Neural Networks, PNNs are not hard to conceptualize. Each task the network learns is allocated a “column” of the network, which is a full multi-layered feedforward network. After learning a task, the associated column is frozen so that it cannot be affected by training on future tasks, and a new column is added for the next task. Each layer in the new column gets input not only from lower layers within the new column, but also from lower layers within the frozen columns previously trained on other tasks. This allows the network to take advantage of features it has learned for other tasks, and repurpose them for new tasks without losing knowledge about the previous tasks.
Figure 4. (Source: https://arxiv.org/pdf/1606.04671v3.pdf)
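A minimal sketch of the progressive-network idea follows, under the simplifying assumption of two small feedforward columns with two hidden layers each: the first column is frozen after task 1, and each layer of the new column also receives the frozen column's activations from the layer below through lateral connections.

```python
import torch
import torch.nn as nn

class Column(nn.Module):
    """A plain two-hidden-layer column (task 1)."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.l1 = nn.Linear(in_dim, hidden_dim)
        self.l2 = nn.Linear(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        h1 = torch.relu(self.l1(x))
        h2 = torch.relu(self.l2(h1))
        return self.out(h2)

class ProgressiveColumn(nn.Module):
    """Task-2 column: same shape, plus lateral connections from the frozen column."""
    def __init__(self, frozen, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.frozen = frozen
        for p in self.frozen.parameters():      # task-1 weights can no longer change
            p.requires_grad = False
        self.l1 = nn.Linear(in_dim, hidden_dim)
        self.l2 = nn.Linear(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, out_dim)
        self.lat2 = nn.Linear(hidden_dim, hidden_dim)   # lateral: frozen layer 1 -> new layer 2
        self.lat_out = nn.Linear(hidden_dim, out_dim)   # lateral: frozen layer 2 -> new output

    def forward(self, x):
        with torch.no_grad():                   # frozen column still provides features
            f1 = torch.relu(self.frozen.l1(x))
            f2 = torch.relu(self.frozen.l2(f1))
        h1 = torch.relu(self.l1(x))
        h2 = torch.relu(self.l2(h1) + self.lat2(f1))    # reuse task-1 features
        return self.out(h2) + self.lat_out(f2)

task1 = Column(in_dim=10, hidden_dim=32, out_dim=4)     # pretend this was trained on task 1
task2 = ProgressiveColumn(task1, in_dim=10, hidden_dim=32, out_dim=3)
print(task2(torch.randn(5, 10)).shape)                  # torch.Size([5, 3])
```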
The result is a network architecture that can often learn new skills with much less training data than a network learning from scratch, or even than a network pre-trained on one previous task and then fine-tuned. The authors of the PNN paper demonstrated dramatic improvements in the data efficiency of their AI agents:
Figure 5. Tests of Progressive Neural Networks on variations of the Atari game Pong, illustrating how they learn more efficiently compared to two baselines: Base1, a single column trained on the target task, and Base3, a single column pre-trained on a source task and then fine-tuned on the target task. (Source: https://arxiv.org/pdf/1606.04671v3.pdf)
Zero-Shot Natural Language Translation
Recently there have also been some great examples of transfer learning in the NLP space. A couple of months ago Google announced that it is rolling out DL models for machine translation—called Google Neural Machine Translation (GNMT)—to replace the phrase-based models that used to be state-of-the-art. GNMT models use a pair of recurrent neural networks: (1) an encoder that reads in words one at a time and produces a series of vectors representing all words read to that point, and (2) a decoder that reads the encoded vectors and outputs the translation (with an attention mechanism allowing the decoder to focus on the most important encoded vectors for each word it outputs). This method resulted in dramatic translation improvements for all language pairs, in some cases approaching human level.
Figure 6. According to their paper, GNMT “reduced translation errors by an average of 60% compared to Google’s phrase-based production system.” (Source: https://research.googleblog.com/2016/09/a-neural-network-for-machine.html)
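Here is a much-simplified, hedged sketch of the encoder/decoder-with-attention pattern described above. GNMT itself uses deep LSTM stacks and a learned attention module; the single GRU layers and dot-product attention below are simplifying assumptions meant only to show how the decoder attends over the encoder's per-word vectors.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(2 * dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder: one vector per source word (the "series of vectors" above).
        enc_out, enc_state = self.encoder(self.src_emb(src_ids))
        # Decoder: reads the previous target words...
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), enc_state)
        # ...and attends over the encoder vectors at every output position.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))      # (B, T_tgt, T_src)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights, enc_out)                     # weighted encoder vectors
        return self.proj(torch.cat([dec_out, context], dim=-1))   # next-word logits

model = TinySeq2Seq(src_vocab=32000, tgt_vocab=32000)
src = torch.randint(0, 32000, (4, 12))    # a batch of 4 source sentences
tgt = torch.randint(0, 32000, (4, 10))    # shifted target sentences (teacher forcing)
logits = model(src, tgt)                  # shape (4, 10, 32000)
```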
A few weeks ago, Google researchers published an impressive paper describing how they made a trivial modification of their GNMT architecture that allowed them to use a single network to translate all language pairs, instead of training a separate network for each language pair. To accomplish this, they simply modified their network to accept a token representing which language pair was being translated, and then trained on multiple language pairs at once. This token provides the additional information that the decoder network needs in order to output the appropriate language. Not only were they able to train a single network to translate between many different languages, but they used the same size network as they would normally use for a single language pair, thereby dramatically reducing the number of parameters used for the entire collection of languages.
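A minimal sketch of that language-token trick might look like the following; the exact token spelling and the helper name are assumptions—the essential point is simply that an artificial token prepended to the source sentence tells the shared model which target language to produce.

```python
# Hypothetical helper: prepend a target-language token to the source text.
def add_target_token(source_sentence: str, target_lang: str) -> str:
    return f"<2{target_lang}> {source_sentence}"

# One shared model is trained on mixed pairs such as:
print(add_target_token("How are you?", "es"))   # "<2es> How are you?" -> Spanish output
print(add_target_token("How are you?", "ja"))   # "<2ja> How are you?" -> Japanese output
# At inference time the same trick can request a pair never seen in training,
# e.g. Korean -> Japanese, which is the zero-shot case discussed next.
```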
The really interesting part is that after training on many language pairs, the network was able to translate between language pairs it had never seen or been trained on. In other words, it achieved “zero-shot” translation. The implication is that after training on a number of language pairs, the network develops its own “universal interlingua representation” of the meaning of source sentences independent of the source language. Once it has this representation of the meaning of the sentence it can translate it to any target language it knows about, regardless of whether it has ever seen the source-target combination.
To verify that the neural network actually creates this interlingua representation, the authors used t-SNE to plot a 2-dimensional representation of the intermediate vectors connecting the encoding and decoding networks. Below, in figure (7a) each color represents the intermediate vectors produced when translating semantically identical sentences in English, Korean and Japanese. (Each vector is a dot, and vectors produced in a series as part of translating a single sentence from one language are connected by a line.) The fact that similarly colored (and thus semantically identical) sentences are clustered near each other illustrates that the neural network has understood them to have similar meanings and therefore produces similar intermediate vectors (in its interlingua representation). Figure (7b) zooms in on one example, and figure (7c) re-colors that example to distinguish between the semantically identical sentences in the three different languages.
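The visualization step itself is straightforward to sketch: project the intermediate vectors to two dimensions with t-SNE and color them by sentence. In the sketch below (scikit-learn and matplotlib), random vectors stand in for the real encoder outputs so the code runs; in the actual analysis each group would hold the vectors produced while translating semantically identical sentences in English, Korean, and Japanese.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder data: each group stands in for the intermediate vectors from
# one sentence translated from three source languages.
rng = np.random.default_rng(0)
sentence_groups = {
    "sentence A (en/ko/ja)": rng.normal(size=(3 * 15, 1024)),
    "sentence B (en/ko/ja)": rng.normal(size=(3 * 12, 1024)) + 5.0,
}

all_vecs = np.concatenate(list(sentence_groups.values()))
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(all_vecs)

# Plot each sentence's points in its own color, as in Figure 7.
start = 0
for label, vecs in sentence_groups.items():
    n = len(vecs)
    plt.scatter(coords[start:start + n, 0], coords[start:start + n, 1], s=10, label=label)
    start += n
plt.legend()
plt.title("t-SNE of encoder-decoder intermediate vectors (sketch)")
plt.show()
```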
Figure 7. (Source: https://arxiv.org/pdf/1611.04558v1.pdf)
The takeaway here is that the network developed an internal representation of the problem domain—of the meaning (semantics) of the sentences represented independently of the particular vocabulary or grammar of a language. That representation turned out to be so rich that it enabled the network to translate between language pairs with no labeled training data. The network transferred its learning from language pairs it had seen to pairs that it had never seen before.
The second in the series is available here, and some bonus material here.
© ODSC 2017