DARTS

2019-ICLR-DARTS Differentiable Architecture Search

  • Hanxiao Liu, Karen Simonyan, Yiming Yang
  • GitHub: 2.8k stars
  • Citations: 557

Motivation

Current NAS methods:

  • Computationally expensive: around 2000 to 3000 GPU days per search.
  • They search over a discrete space, which requires a large number of architecture evaluations.

Contribution

  • A differentiable NAS method based on gradient descent.
  • Applicable to both CNNs (CV) and RNNs (NLP).
  • Results competitive with the state of the art on CIFAR-10 and PTB.
  • Efficiency: 4 GPU days versus 2000 GPU days.
  • Transferable: CIFAR-10 to ImageNet, and PTB to WikiText-2.

Method

Search Space

Search for a cell as the building block of the final architecture.

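The learned cell could either be stacked to form a CNN or recursively connected to form an RNN.

A cell is a directed acyclic graph (DAG) consisting of an ordered sequence of N nodes. Each node \(x^{(i)}\) is a latent representation (e.g. a feature map), and each directed edge (i, j) is associated with an operation \(o^{(i, j)}\) that transforms \(x^{(i)}\). Each intermediate node is computed from all of its predecessors:

\(x^{(j)}=\sum_{i<j} o^{(i, j)}\left(x^{(i)}\right)\)

To make the search space continuous, the categorical choice of an operation on each edge is relaxed to a softmax over all candidate operations in \(\mathcal{O}\), weighted by the architecture parameters \(\alpha^{(i, j)}\):

\(\bar{o}^{(i, j)}(x)=\sum_{o \in \mathcal{O}} \frac{\exp \left(\alpha_{o}^{(i, j)}\right)}{\sum_{o^{\prime} \in \mathcal{O}} \exp \left(\alpha_{o^{\prime}}^{(i, j)}\right)} o(x)\)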
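The two formulas above can be sketched in plain Python on scalar inputs. This is only a minimal illustration, not the authors' implementation: the candidate set `OPS` uses toy scalar functions standing in for convolutions and pooling, and the names `softmax`, `mixed_op`, and `forward_cell` are hypothetical.

```python
import math

# Toy candidate operations on scalars (assumption: stand-ins for conv/pool ops).
OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def softmax(alphas):
    """Numerically stable softmax over a list of floats."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, alpha):
    """Softmax-weighted sum of all candidate ops on one edge (i, j)."""
    weights = softmax([alpha[name] for name in OPS])
    return sum(w * op(x) for w, op in zip(weights, OPS.values()))


def forward_cell(inputs, alphas, num_intermediate):
    """Each new node j sums the mixed ops applied to all predecessors i < j.

    `alphas[(i, j)]` maps op name -> architecture parameter for edge (i, j).
    """
    states = list(inputs)
    for _ in range(num_intermediate):
        j = len(states)
        states.append(sum(mixed_op(states[i], alphas[(i, j)]) for i in range(j)))
    return states


# One input node, two intermediate nodes, all alphas equal (uniform mixing).
uniform = {name: 0.0 for name in OPS}
alphas = {(0, 1): uniform, (0, 2): uniform, (1, 2): uniform}
states = forward_cell([3.0], alphas, num_intermediate=2)
```

With uniform α every candidate gets weight 1/3, so each edge outputs the average of the candidate outputs. Raising one α on an edge shifts that edge toward the corresponding operation, which is what makes the architecture choice differentiable.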

Optimization Target

Our goal is to jointly learn the architecture α and the weights w within all the mixed operations (e.g. weights of the convolution filters).

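\(\min _{\alpha} \mathcal{L}_{val}\left(w^{*}(\alpha), \alpha\right)\) (3)

s.t. \(w^{*}(\alpha)=\operatorname{argmin}_{w} \mathcal{L}_{train}(w, \alpha)\) (4)

This is a bilevel optimization problem with α as the upper-level variable and w as the lower-level variable. The idea is to approximate w∗(α) by adapting w using only a single training step, without solving the inner optimization (equation 4) completely by training until convergence:

\(\nabla_{\alpha} \mathcal{L}_{val}\left(w^{*}(\alpha), \alpha\right)\) (5)

\(\approx \nabla_{\alpha} \mathcal{L}_{val}\left(w-\xi \nabla_{w} \mathcal{L}_{train}(w, \alpha), \alpha\right)\) (6)

Here ξ is the learning rate of the inner step. Applying the chain rule to equation 6, with \(w^{\prime}=w-\xi \nabla_{w} \mathcal{L}_{train}(w, \alpha)\), gives

\(\nabla_{\alpha} \mathcal{L}_{val}\left(w^{\prime}, \alpha\right)-\xi \nabla_{\alpha, w}^{2} \mathcal{L}_{train}(w, \alpha) \nabla_{w^{\prime}} \mathcal{L}_{val}\left(w^{\prime}, \alpha\right)\) (7)

  • When ξ = 0, the second-order derivative in equation 7 disappears and the architecture gradient reduces to \(\nabla_{\alpha} \mathcal{L}_{val}(w, \alpha)\).
  • ξ = 0 is referred to as the first-order approximation.
  • ξ > 0 is referred to as the second-order approximation.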
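The effect of ξ can be checked on a toy problem with scalar w and α. The sketch below is only an illustration under stated assumptions: the quadratic losses and all function names are made up for this example, and the chain-rule form of the gradient of equation 6 is compared against a finite-difference estimate.

```python
def train_loss_grad_w(w, alpha):
    """d/dw of the toy training loss L_train(w, alpha) = 0.5 * (w - alpha)**2."""
    return w - alpha


def val_loss(w, alpha):
    """Toy validation loss with a direct dependence on alpha (assumption)."""
    return 0.5 * (w - 1.0) ** 2 + 0.05 * alpha ** 2


def unrolled_val_loss(w, alpha, xi):
    """L_val at the one-step-adapted weights w' = w - xi * grad_w L_train."""
    w_prime = w - xi * train_loss_grad_w(w, alpha)
    return val_loss(w_prime, alpha)


def arch_grad_numeric(w, alpha, xi, eps=1e-6):
    """Central-difference estimate of d/d(alpha) of the unrolled loss (equation 6)."""
    upper = unrolled_val_loss(w, alpha + eps, xi)
    lower = unrolled_val_loss(w, alpha - eps, xi)
    return (upper - lower) / (2.0 * eps)


def arch_grad_chain_rule(w, alpha, xi):
    """Direct term minus xi * mixed second derivative * grad_w' L_val.

    For this toy L_train, d^2 L_train / (d alpha d w) is the constant -1.
    """
    w_prime = w - xi * train_loss_grad_w(w, alpha)
    direct = 0.1 * alpha          # d L_val / d alpha at fixed w'
    grad_w_prime = w_prime - 1.0  # d L_val / d w'
    mixed_second = -1.0
    return direct - xi * mixed_second * grad_w_prime


w, alpha = 0.0, 3.0
first_order = arch_grad_chain_rule(w, alpha, xi=0.0)   # second-order term vanishes
second_order = arch_grad_chain_rule(w, alpha, xi=0.5)  # includes the correction
```

In this toy setting the first-order gradient only sees the direct dependence of the validation loss on α, while the second-order variant also accounts for how α changes w through the training step.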

Discrete Arch

To form each node in the discrete architecture, we retain the top-k strongest operations (from distinct nodes) among all non-zero candidate operations collected from all the previous nodes.

We use k = 2 for convolutional cells and k = 1 for recurrent cells.

The strength of an operation is defined as \(\frac{\exp \left(\alpha_{o}^{(i, j)}\right)}{\sum_{o^{\prime} \in \mathcal{O}} \exp \left(\alpha_{o^{\prime}}^{(i, j)}\right)}\)
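A minimal sketch of this selection rule for a single node, assuming the α values are stored per incoming edge as a dictionary. The function `derive_node`, the op names, and the numbers are hypothetical and chosen only for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def derive_node(incoming_alphas, k):
    """Pick the k strongest incoming edges of one node, one op per edge.

    `incoming_alphas` maps predecessor index -> {op name: alpha}.
    The "zero" op takes part in the softmax but is never selected.
    """
    candidates = []
    for pred, alpha in incoming_alphas.items():
        names = list(alpha)
        strengths = softmax([alpha[n] for n in names])
        best_strength, best_op = max(
            (s, n) for s, n in zip(strengths, names) if n != "zero"
        )
        candidates.append((best_strength, pred, best_op))
    # Each predecessor contributes one candidate, so the top-k come from distinct nodes.
    candidates.sort(reverse=True)
    return [(pred, op) for _, pred, op in candidates[:k]]


incoming = {
    0: {"sep_conv_3x3": 1.2, "max_pool_3x3": 0.1, "zero": 2.0},
    1: {"sep_conv_3x3": 0.3, "max_pool_3x3": 0.2, "zero": 0.0},
    2: {"sep_conv_3x3": -0.5, "max_pool_3x3": 1.5, "zero": 0.1},
}
selected = derive_node(incoming, k=2)
```

Note that edge 0 is dropped even though its raw α for `sep_conv_3x3` is larger than the one on edge 1: strength is the softmax probability, and the large α of the zero op on edge 0 pushes the other probabilities down.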

Experiments

We include the following operations in O:

  • 3 × 3 and 5 × 5 separable convolutions,
  • 3 × 3 and 5 × 5 dilated separable convolutions,
  • 3 × 3 max pooling,
  • 3 × 3 average pooling,
  • identity (skip connection),
  • zero.

All operations are of stride one (if applicable), and the feature maps are padded to preserve their spatial resolution.

Setup details:

  • Convolutional operations use the ReLU-Conv-BN order.
  • Each separable convolution is always applied twice.
  • The convolutional cell consists of N = 7 nodes; the output node is defined as the depthwise concatenation of all the intermediate nodes (input nodes excluded).
  • The first and second nodes of cell k are set equal to the outputs of cell k−2 and cell k−1, respectively.
  • Cells located at 1/3 and 2/3 of the total depth of the network are reduction cells, in which all the operations adjacent to the input nodes are of stride two.
  • The architecture encoding is therefore (αnormal, αreduce), where αnormal is shared by all the normal cells and αreduce is shared by all the reduction cells.
  • To determine the architecture for final evaluation, we run DARTS four times with different random seeds and pick the best cell based on its validation performance obtained by training from scratch for a short period (100 epochs on CIFAR-10 and 300 epochs on PTB). This is particularly important for recurrent cells, as the optimization outcomes can be initialization-sensitive (Fig. 3).
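The cell layout described above can be made concrete with a small counting sketch. It rests on two assumptions that the notes do not state explicitly: N = 7 is read as 2 input nodes, 4 intermediate nodes, and 1 output node, and the 1/3 and 2/3 positions are computed with integer division. The helper names and the 8-cell stack are hypothetical.

```python
def intermediate_edges(num_inputs, num_intermediate):
    """List every edge (i, j) with i < j that feeds an intermediate node."""
    edges = []
    for j in range(num_inputs, num_inputs + num_intermediate):
        edges.extend((i, j) for i in range(j))
    return edges


def reduction_positions(num_cells):
    """Cell indices at 1/3 and 2/3 of the depth (integer division is an assumption)."""
    return [num_cells // 3, 2 * num_cells // 3]


# Assumed split of the N = 7 nodes: 2 inputs + 4 intermediate + 1 output.
edges = intermediate_edges(num_inputs=2, num_intermediate=4)

# Eight candidate operations are listed for the set O.
num_ops = 8
num_arch_params_per_cell_type = len(edges) * num_ops

# Example stack of 8 cells; normal cells share one alpha, reduction cells another.
num_cells = 8
layout = [
    "reduce" if i in reduction_positions(num_cells) else "normal"
    for i in range(num_cells)
]
```

Under these assumptions each cell type has 14 edges with 8 candidate operations each, so the whole encoding (αnormal, αreduce) is small compared with the network weights w.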

Arch Evaluation

  • To evaluate the selected architecture, we randomly initialize its weights (weights learned during the search process are discarded), train it from scratch, and report its performance on the test set.

Result Analysis

  • DARTS achieved results comparable with the state of the art while using roughly three orders of magnitude less computation (1.5 or 4 GPU days versus 2000 GPU days for NASNet and 3150 GPU days for AmoebaNet).
  • The reported search time includes four repetitions of the search process for cell selection, which makes it longer than a single run. This practice is less important for convolutional cells, however, because the performance of discovered architectures does not strongly depend on initialization (Fig. 3).

  • It is also interesting to note that random search is competitive for both convolutional and recurrent models, which reflects the importance of the search space design.
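As a quick arithmetic check of the "three orders of magnitude" claim, the GPU-day figures quoted above can be compared on a log scale. The variable names are only illustrative.

```python
import math

# GPU-day figures quoted in the notes above.
darts_gpu_days = [1.5, 4.0]
baselines = {"NASNet": 2000.0, "AmoebaNet": 3150.0}

# Orders of magnitude saved = log10(baseline cost / DARTS cost).
orders = {
    (name, days): math.log10(cost / days)
    for name, cost in baselines.items()
    for days in darts_gpu_days
}
smallest, largest = min(orders.values()), max(orders.values())
```

The ratios span roughly 2.7 to 3.3 orders of magnitude, so "three orders of magnitude" is a fair rounding, with the 4 GPU-day setting falling slightly below it.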

Results in Table 3 show that the cell learned on CIFAR-10 is indeed transferable to ImageNet.

  • The weaker transferability between PTB and WT2 (as compared to that between CIFAR-10 and ImageNet) could be explained by the relatively small size of the source dataset (PTB) for architecture search.

  • The issue of transferability could potentially be circumvented by directly optimizing the architecture on the task of interest.

Conclusion

  • We presented DARTS, a simple yet efficient NAS algorithm for both CNNs and RNNs.
  • Results competitive with the state of the art on CIFAR-10 and PTB.
  • Efficiency improvement by several orders of magnitude.

Possible Improvements

  • The current method may suffer from discrepancies between the continuous architecture encoding and the derived discrete architecture. (softmax…)
  • It would also be interesting to investigate performance-aware architecture derivation schemes based on the shared parameters learned during the search process.
