Source: https://opendatascience.com/blog/notes-on-representation-learning-1/

 

Notes on Representation Learning

By Zac Kriegman, Senior Data Scientist in the Thomson Reuters Data Innovation Lab | 02/07/2017

Tags: Deep Learning , Neural Networks

 

TL;DR: Representation learning can eliminate the need for large labeled data sets to train deep neural networks, opening up new domains to machine learning and transforming the practice of Data Science.

Check out “Notes on Representation Learning” in these three parts.

  1. Notes on Representation Learning
  2. Notes on Representation Learning Continued
  3. Representation Learning Bonus Material

Deep Learning and Labeled Datasets

The greatest strength of Deep Learning (DL) is also one of its biggest weaknesses. DL models frequently have many millions of parameters. The extreme number of parameters—compared to other sorts of machine learning models—gives DL models tremendous flexibility to learn arbitrarily complex functions that simpler models cannot learn. But this flexibility makes it very easy to “overfit” on a training set (essentially, memorize specific examples instead of learning underlying patterns that allow generalization to examples not in the training set).

The conceptually simplest way to prevent overfitting is to train on very large datasets.  If the dataset is big in relation to the number of parameters, then the network will not have enough capacity to memorize examples and will be “forced” to instead learn underlying patterns when optimizing a loss function.  But creating large, labeled datasets for every task we want to perform is cost prohibitive (and may even be impossible if the goal is general purpose intelligent agents).

This need for large training sets is often the biggest obstacle to applying DL to real-world problems. On small datasets, other types of models can outperform DL to the extent that the constraints of those models match the task at hand. For instance, if there is a simple linear relationship in the data, a linear regression can greatly outperform a DL model trained on a small dataset because the linear constraint of the model corresponds to the structure of the data.

Figure 1. Neural nets have a tendency to overfit when datasets are too small. Here the true relationship between the height and weight of an animal and whether it is a dog or a cat is essentially linear. A linear classifier assumes this relationship and uses the data merely to determine the slope and intercept. A large neural network requires much more data to learn a straight-line partition. With a dataset that is small relative to the size of the neural network, it will overfit on unusual examples, reducing predictive performance. (Source: https://kevinbinz.files.wordpress.com/2014/08/ml-svm-after-comparison.png)

That correspondence allows the model to learn from a small dataset much more efficiently than a DL model, because the DL model needs to learn the linear relationship whereas the linear regression simply assumes it. Simple linear classifiers are sufficient for a simple problem like the one illustrated above; more complex problems, however, require models capable of modeling complex relationships within the data. Much of the work in applying machine learning involves choosing models whose constraints and power match the dataset. While DL has dramatically outperformed all other models on many tasks, to a large extent it has only done so on complex problems where big labeled datasets are available for training.
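To make this concrete, here is a minimal sketch, with made-up data, comparing a linear classifier against a generic multi-layer network on a tiny, essentially linear dataset (scikit-learn is assumed). With so few examples, the linear model's built-in assumption usually gives it the edge.

    # Hypothetical comparison: linear model vs. small neural net on little data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Synthetic "height/weight" features; the true decision boundary is linear,
    # with a little label noise sprinkled in.
    X = rng.normal(size=(60, 2))
    y = (1.5 * X[:, 0] + X[:, 1] > 0).astype(int)
    y = np.where(rng.random(60) < 0.1, 1 - y, y)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=30, random_state=0)

    linear = LogisticRegression().fit(X_train, y_train)
    net = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=2000,
                        random_state=0).fit(X_train, y_train)

    # The flexible network has far more capacity than 30 points can constrain,
    # so it is the one more likely to fit the noisy labels.
    print("linear test accuracy:", linear.score(X_test, y_test))
    print("net test accuracy:   ", net.score(X_test, y_test))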

Representation Learning

This blog post describes how the need for large, labeled datasets to train DL models is coming to an end. Over the last year there have been many research results demonstrating how DL models can learn much more efficiently than other models—outperforming alternatives even with very small labeled training sets. Indeed, in some remarkable cases, described below, DL models can learn to perform complex tasks with only a single labeled example (“one-shot learning”) or even without any labeled data at all (“zero-shot learning”). Over the next few years, these research results will be rolled out to production systems, and further innovations will continue to improve data efficiency even more.

The key to this progress is what DL researchers call “representation learning”—a topic considered so important that prominent researchers named the premier DL conference the International Conference on Learning Representations. Part of the enthusiasm for learning representations is that rather than training DL models on labeled data specific to a target task, you can train them on labeled data for a different problem, or more importantly, on unlabeled data. In the process of training on unlabeled data, the model builds up a reusable internal representation of the data. For instance, in an image classification example (further described below), a network first learns to generate bedroom scenes. To do this convincingly it must develop an internal representation of the world: its 3-dimensional structure, visual perspective, interior design, typical bedroom furniture, etc. In other words, using unsupervised learning (on unlabeled data) the model builds an understanding of how the world of bedrooms actually works in order to produce pictures of bedrooms. Once a network has an internal representation like this it can learn to recognize objects in images much more easily. Learning to recognize a “bed” could become almost as simple as learning to associate the word “bed” with an object that the network already knows a lot about—its 3-dimensional shape, colors, location in rooms, typical surrounding furniture, etc. As a result, instead of needing hundreds or thousands of labeled examples of objects, the model could learn from just a handful of examples.
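As a rough illustration of the two-stage idea (a much simpler stand-in for the generative model described above), the PyTorch sketch below pretrains a small autoencoder on unlabeled images, then freezes its encoder and trains a classifier head on just ten labeled examples. All shapes and data here are placeholders.

    # Stage 1: unsupervised pretraining on unlabeled data (random stand-ins here).
    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
                            nn.Linear(256, 64))
    decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                            nn.Linear(256, 28 * 28))

    unlabeled = torch.rand(512, 1, 28, 28)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    for _ in range(5):
        recon = decoder(encoder(unlabeled))          # reconstruct the input
        loss = nn.functional.mse_loss(recon, unlabeled.flatten(1))
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: supervised learning from a handful of labels, reusing the
    # representation the encoder learned without any labels.
    labeled_x = torch.rand(10, 1, 28, 28)
    labeled_y = torch.randint(0, 2, (10,))
    for p in encoder.parameters():
        p.requires_grad_(False)                      # freeze the learned representation
    head = nn.Linear(64, 2)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(20):
        loss = nn.functional.cross_entropy(head(encoder(labeled_x)), labeled_y)
        opt.zero_grad()
        loss.backward()
        opt.step()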

Figure 2.  Bedroom scene generated by a DL model.  No information about bedrooms, bedroom furniture, lighting, visual perspective, etc. was programmed into the network but it learns enough about those things to produce realistic looking images and plausible bedroom arrangements purely by training on bedroom images.  (Source: https://arxiv.org/pdf/1511.06434v2.pdf)

Breakthroughs in representation learning herald a sea change in machine learning that will help unlock the insights of the big-data era. Today data scientists work by carefully choosing machine learning models with constraints that match their problem domain and then painstakingly tuning those models to squeeze out every last drop of learning available from small labeled datasets. Over the coming years that workflow will shift to selecting DL models pre-trained on enormous unlabeled datasets to build up internal representations, and then training on just a handful of labeled examples to solve the task at hand. Instead of just choosing the right model, machine learning practitioners will choose a model and a prepackaged representation already trained up on related data. This workflow is already common in image recognition, where deep learning has been dominant for some time, and in certain NLP domains, like parsing, and is spreading to other domains.
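For image recognition, that workflow often looks roughly like the sketch below: load a backbone pre-trained on ImageNet (torchvision's resnet18 is assumed here; newer torchvision versions take a weights= argument instead of pretrained=True), freeze it, and train only a tiny head on a handful of labeled examples. The images below are random stand-ins.

    # Reusing a prepackaged representation: frozen pre-trained backbone + small head.
    import torch
    import torch.nn as nn
    from torchvision import models

    backbone = models.resnet18(pretrained=True)   # downloads ImageNet weights
    backbone.fc = nn.Identity()                   # drop the original classification head
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)

    head = nn.Linear(512, 2)                      # resnet18 features are 512-dimensional
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)

    x = torch.rand(8, 3, 224, 224)                # a handful of labeled examples
    y = torch.randint(0, 2, (8,))
    for _ in range(50):
        with torch.no_grad():
            feats = backbone(x)                   # features from the pre-trained representation
        loss = nn.functional.cross_entropy(head(feats), y)
        opt.zero_grad()
        loss.backward()
        opt.step()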

As we continue to transition to this new paradigm, the number of problems we can solve with machine learning will explode.  Right now, we are bumping up against the limits of what simple, highly constrained models with no learned internal representation of the world can accomplish. We can’t squeeze more blood from that stone; big jumps in capability will instead come from models that have some understanding of the world and that can thus interpret data within a larger, more meaningful context. The way forward is not magic new machine learning models that can squeeze more accuracy out of labeled datasets without any understanding of the world those datasets come from, but rather pre-trained models that bring an understanding of the world in which they are operating.

Examples of Recent Representation Learning Progress

Here I describe some of the remarkable advances being made in representation learning and how they are increasing the data efficiency of DL models.  In all of these examples, DL models were able to learn with much less labeled data than simpler alternatives require.  Though this is a small sample, I’ve tried to select examples across different problem domains (natural language processing, image classification, and intelligent agents) and learning types (supervised, semi-supervised, unsupervised, and reinforcement) to illustrate the variety of approaches which are seeing impressive success.

Transfer Learning with Progressive Neural Networks

Progressive Neural Networks (“PNNs”) are DL models specially modified to be able to (1) learn multiple tasks from different datasets in sequence without forgetting tasks learned earlier in the sequence and (2) reuse the representations learned from earlier tasks to accelerate the learning of subsequent tasks. Reusing representations like this from one task to another in order to accelerate subsequent learning is called “transfer learning”, a recurring theme in the examples below.  Progressive Neural Networks employ transfer learning to improve data efficiency when learning new tasks—i.e. new tasks are learned with much less labeled data.

To understand the value of transfer learning, you could imagine, for example, a convolutional network learning low-level features like edges that are aggregated up into parts of faces like ears and mouths, and ultimately into whole faces. Later, you may retrain that network on a different visual recognition task, perhaps recognition of cars. In that case, the network may be able to mostly reuse the low-level edge features while overwriting the higher-level features to aggregate edges into car parts instead of face parts.

Figure 3. (Source: http://web.eecs.umich.edu/~honglak/icml09-ConvolutionalDeepBeliefNetworks.pdf)

Retraining a normal convolutional network like this, and in the process overwriting some previously learned features, is called fine-tuning. While transfer learning by fine-tuning has seen very successful application, it has some important drawbacks. Most importantly, we may want to transfer knowledge from multiple tasks to a new task, but during fine-tuning the ability to do the first task can be catastrophically forgotten while learning the second one. Imagine that after training a model on faces and then on cars, as in the example above, you subsequently want to train the model to recognize people while in their cars (perhaps for a traffic enforcement application). Many of the features that aggregate edges into human facial features like eyes, ears, and mouths could have been reused, but unfortunately they were destroyed when learning car features. This means that when learning a third task the network may not be able to draw on useful features learned in the first task. The goal behind PNNs is to have a network that can continue learning from diverse datasets, continually expanding its knowledge as it goes.

If you already understand simple Feedforward Neural Networks, PNNs are not hard to conceptualize. Each task the network learns is allocated a “column” of the network, which is a full multi-layered feedforward network. After learning a task, the associated column is frozen so that it cannot be affected by training on future tasks, and a new column is added for the next task. Each layer in the new column gets input not only from lower layers within the new column, but also from lower layers within the frozen columns previously trained on other tasks. This allows the network to take advantage of features it has learned for other tasks, and repurpose them for new tasks without losing knowledge about the previous tasks.
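A stripped-down PyTorch sketch of the column idea (a simplification of the paper's architecture, without its adapter layers) might look like the following: column 1 is trained on the first task and frozen, and column 2's upper layer also receives column 1's lower-layer activations through a lateral connection.

    # Two-column Progressive Neural Network sketch (simplified, no adapters).
    import torch
    import torch.nn as nn

    class Column(nn.Module):
        def __init__(self, in_dim, hidden, out_dim, lateral_dim=0):
            super().__init__()
            self.l1 = nn.Linear(in_dim, hidden)
            # The upper layer sees its own hidden activations plus, optionally,
            # the hidden activations of a previously trained (frozen) column.
            self.l2 = nn.Linear(hidden + lateral_dim, out_dim)

        def forward(self, x, lateral_h1=None):
            h1 = torch.relu(self.l1(x))
            h = h1 if lateral_h1 is None else torch.cat([h1, lateral_h1], dim=1)
            return self.l2(h), h1

    col1 = Column(in_dim=16, hidden=32, out_dim=4)       # trained on task 1, then frozen
    for p in col1.parameters():
        p.requires_grad_(False)

    col2 = Column(in_dim=16, hidden=32, out_dim=4, lateral_dim=32)   # new column for task 2
    opt = torch.optim.Adam(col2.parameters(), lr=1e-3)   # only the new column is trainable

    x = torch.rand(8, 16)
    with torch.no_grad():
        _, h1_frozen = col1(x)           # task-1 features: reused, never overwritten
    out2, _ = col2(x, lateral_h1=h1_frozen)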

Figure 4.  (Source: https://arxiv.org/pdf/1606.04671v3.pdf)

The result is a network architecture that can often learn new skills with much less training data than a network learning from scratch, or even than a network pre-trained on one previous task and then fine-tuned. The authors of the PNN paper demonstrated dramatic improvements in the data efficiency of their AI agents:

Figure 5. Tests of Progressive Neural Networks on variations of the Atari game Pong, illustrating how they learn more efficiently compared to two baselines: Base1, a single column trained on the target task, and Base3, a single column pre-trained on a source task and then fine-tuned on the target task. (Source: https://arxiv.org/pdf/1606.04671v3.pdf)

Zero Shot Natural Language Translation

Recently there have also been some great examples of transfer learning in the NLP space. A couple of months ago Google announced that it is rolling out DL models for machine translation—called Google Neural Machine Translation (GNMT)—to replace the phrase-based models that used to be state-of-the-art. GNMT models use a pair of recurrent neural networks: (1) an encoder that reads in words one at a time and produces a series of vectors representing all words read to that point, and (2) a decoder which reads the encoded vectors and outputs the translation (with an attention mechanism allowing the decoder to focus on the most important encoded vectors for each word it outputs). This method resulted in dramatic translation improvements for all language pairs, in some cases approaching human level.
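The toy sketch below shows the general encoder/decoder-with-attention pattern (tiny dimensions, greedy decoding, and not Google's actual GNMT configuration): the encoder produces one vector per source token, and at each step the decoder weights those vectors by relevance to build a context for the next output word.

    # Toy encoder-decoder with dot-product attention (illustrative only).
    import torch
    import torch.nn as nn

    vocab, emb, hid = 1000, 64, 128
    embed = nn.Embedding(vocab, emb)
    encoder = nn.GRU(emb, hid, batch_first=True)
    decoder_cell = nn.GRUCell(emb + hid, hid)      # input: previous target embedding + context
    out_proj = nn.Linear(hid, vocab)

    src = torch.randint(0, vocab, (1, 7))          # a 7-token source sentence
    enc_outs, enc_last = encoder(embed(src))       # one encoded vector per source position

    dec_h = enc_last[0]                            # initialise the decoder from the encoder state
    prev = torch.zeros(1, dtype=torch.long)        # hypothetical <start> token with id 0
    output_tokens = []
    for _ in range(10):                            # generate up to 10 target tokens
        # Attention: weight each encoded vector by its similarity to the decoder state.
        scores = torch.einsum("bh,bsh->bs", dec_h, enc_outs)
        attn = torch.softmax(scores, dim=1)
        context = torch.einsum("bs,bsh->bh", attn, enc_outs)

        dec_h = decoder_cell(torch.cat([embed(prev), context], dim=1), dec_h)
        prev = out_proj(dec_h).argmax(dim=1)       # greedy choice of the next word
        output_tokens.append(prev.item())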

Figure 6. According to their paper, GNMT “reduced translation errors by an average of 60% compared to Google’s phrase-based production system.” (Source: https://research.googleblog.com/2016/09/a-neural-network-for-machine.html)

A few weeks ago, Google researchers published an impressive paper describing how they made a trivial modification of their GNMT architecture that allowed them to use a single network to translate all language pairs, instead of training a separate network for each language pair.  To accomplish this, they simply modified their network to accept a token representing which language pair was being translated, and then trained on multiple language pairs at once.  This token provides the additional information that the decoder network needs in order to output the appropriate language.  Not only were they able to train a single network to translate between many different languages, but they used the same size network as they would normally use for a single language pair, thereby dramatically reducing the number of parameters used for the entire collection of languages.
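On the input side, the change really is that small: the source sentence simply gains an extra token naming the target language (the token format below is illustrative, not Google's exact vocabulary).

    # Multilingual trick: prepend a target-language token to the source sentence.
    def add_language_token(source_tokens, target_lang):
        return [f"<2{target_lang}>"] + source_tokens

    print(add_language_token(["Hello", ",", "how", "are", "you", "?"], "es"))
    # ['<2es>', 'Hello', ',', 'how', 'are', 'you', '?']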

The really interesting part is that after training on many language pairs, the network was able to translate between language pairs it had never seen or been trained on. In other words, it achieved “zero-shot” translation.  The implication is that after training on a number of language pairs, the network develops its own “universal interlingua representation” of the meaning of source sentences independent of the source language.  Once it has this representation of the meaning of the sentence it can translate it to any target language it knows about, regardless of whether it has ever seen the source-target combination.

To verify that the neural network actually creates this interlingua representation, the authors used t-SNE to plot a 2-dimensional representation of the intermediate vectors connecting the encoding and decoding networks.  Below, in figure (7a) each color represents the intermediate vectors produced when translating semantically identical sentences in English, Korean and Japanese.  (Each vector is a dot, and vectors produced in a series as part of translating a single sentence from one language are connected by a line.)  The fact that similarly colored (and thus semantically identical) sentences are clustered near each other illustrates that the neural network has understood them to have similar meanings and therefore produces similar intermediate vectors (in its interlingua representation).  Figure (7b) zooms in on one example, and figure (7c) re-colors that example to distinguish between the semantically identical sentences in the three different languages.
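If you have extracted such intermediate vectors yourself, the visualization step is a standard t-SNE projection; the sketch below (scikit-learn and matplotlib assumed) uses random stand-in vectors in place of real encoder/decoder states.

    # Project intermediate translation vectors to 2-D and colour them by sentence.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    # Stand-in data: 3 sentences x 3 languages x 5 intermediate vectors, 128-dim each.
    vectors = rng.normal(size=(3 * 3 * 5, 128))
    sentence_id = np.repeat(np.arange(3), 3 * 5)   # which sentence each vector came from

    coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(vectors)
    for s in range(3):
        pts = coords[sentence_id == s]
        plt.plot(pts[:, 0], pts[:, 1], marker="o", label=f"sentence {s}")
    plt.legend()
    plt.show()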

Figure 7. (Source: https://arxiv.org/pdf/1611.04558v1.pdf)

The takeaway here is that the network developed an internal representation of the problem domain—of the meaning (semantics) of the sentences represented independently of the particular vocabulary or grammar of a language.  That representation turned out to be so rich that it enabled the network to translate between language pairs with no labeled training data.  The network transferred its learning from language pairs it had seen to pairs that it had never seen before.

The second in the series is available here, and some bonus material here.


©ODSC2017

 
