DARTS

2019-ICLR-DARTS Differentiable Architecture Search

Hanxiao Liu、Karen Simonyan、Yiming Yang
GitHub：2.8k stars
Citation：557

Motivation

Current NAS method:

Computationally expensive: 2000/3000 GPU days
Discrete search space, leads to a large number of architecture evaluations required.

Contribution

Differentiable NAS method based on gradient decent.
Both CNN(CV) and RNN(NLP).
SOTA results on CIFAR-10 and PTB.
Efficiency: (2000 GPU days VS 4 GPU days)
Transferable: cifar10 to ImageNet, (PTB to WikiText-2).

Method

Search Space

Search for a cell as the building block of the final architecture.

The learned cell could either be stacked to form a CNN or recursively connected to form a RNN.

A cell is a DAG consisting of an ordered sequence of N nodes.

\(\bar{o}^{(i, j)}(x)=\sum_{o \in \mathcal{O}} \frac{\exp \left(\alpha_{o}^{(i, j)}\right)}{\sum_{o^{\prime} \in \mathcal{O}} \exp \left(\alpha_{o^{\prime}}^{(i, j)}\right)} o(x)\)

\(x^{(j)}=\sum_{i<j} o^{(i, j)}\left(x^{(i)}\right)\)

Optimization Target

Our goal is to jointly learn the architecture α and the weights w within all the mixed operations (e.g. weights of the convolution filters).

\(\min _{\alpha} \mathcal{L}_{v a l}\left(w^{*}(\alpha), \alpha\right)\) ......(3)

s.t. \(\quad w^{*}(\alpha)=\operatorname{argmin}_{w} \mathcal{L}_{\text {train}}(w, \alpha)\) .......(4)

The idea is to approximate w∗(α) by adapting w using only a single training step, without solving the inner optimization (equation 4) completely by training until convergence.

\(\nabla_{\alpha} \mathcal{L}_{v a l}\left(w^{*}(\alpha), \alpha\right)\) ......(5)

\(\approx \nabla_{\alpha} \mathcal{L}_{v a l}\left(w-\xi \nabla_{w} \mathcal{L}_{t r a i n}(w, \alpha), \alpha\right)\) ......(6)

When ξ = 0, the second-order derivative in equation 7 will disappear.

ξ = 0 as the first-order approximation,

ξ > 0 as the second-order approximation.

Discrete Arch

To form each node in the discrete architecture, we retain the top-k strongest operations (from distinct nodes) among all non-zero candidate operations collected from all the previous nodes.

we use k = 2 for convolutional cells and k = 1 for recurrent cells

The strength of an operation is defined as \(\frac{\exp \left(\alpha_{o}^{(i, j)}\right)}{\sum_{o^{\prime} \in \mathcal{O}} \exp \left(\alpha_{o^{\prime}}^{(i, j)}\right)}\)

Experiments

We include the following operations in O:

3 × 3 and 5 × 5 separable convolutions,

3 × 3 and 5 × 5 dilated separable convolutions,

3 × 3 max pooling,

3 × 3 average pooling,

identity (skip connection?)

zero.

All operations are of

stride one (if applicable)

the feature maps are padded to preserve their spatial resolution.

We use the

ReLU-Conv-BN order for convolutional operations,

Each separable convolution is always applied twice

Our convolutional cell consists of N = 7 nodes, the output node is defined as the depthwise concatenation of all the intermediate nodes (input nodes excluded).

The first and second nodes of cell k are set equal to the outputs of cell k−2 and cell k−1

Cells located at the 1/3 and 2/3 of the total depth of the network are reduction cells, in which all the operations adjacent to the input nodes are of stride two.

The architecture encoding therefore is (αnormal, αreduce),

where αnormal is shared by all the normal cells

and αreduce is shared by all the reduction cells.

To determine the architecture for final evaluation, we run DARTS four times with different random seeds and pick the best cell based on its validation performance obtained by training from scratch for a short period (100 epochs on CIFAR-10 and 300 epochs on PTB).

This is particularly important for recurrent cells, as the optimization outcomes can be initialization-sensitive (Fig. 3)

Arch Evaluation

To evaluate the selected architecture, we randomly initialize its weights (weights learned during the search process are discarded), train it from scratch, and report its performance on the test set.

Result Analysis

DARTS achieved comparable results with the state of the art while using three orders of magnitude less computation resources.

(i.e. 1.5 or 4 GPU days vs 2000 GPU days for NASNet and 3150 GPU days for AmoebaNet)

The longer search time is due to the fact that we have repeated the search process four times for cell selection. This practice is less important for convolutional cells however, because the performance of discovered architectures does not strongly depend on initialization (Fig. 3).

It is also interesting to note that random search is competitive for both convolutional and recurrent models, which reflects the importance of the search space design.

Results in Table 3 show that the cell learned on CIFAR-10 is indeed transferable to ImageNet.

The weaker transferability between PTB and WT2 (as compared to that between CIFAR-10 and ImageNet) could be explained by the relatively small size of the source dataset (PTB) for architecture search.

The issue of transferability could potentially be circumvented by directly optimizing the architecture on the task of interest.

Conclusion

We presented DARTS, a simple yet efficient NAS algorithm for both CNN and RNN.
SOTA
efficiency improvement by several orders of magnitude.

Improve

discrepancies between the continuous architecture encoding and the derived discrete architecture. (softmax…)
It would also be interesting to investigate performance-aware architecture derivation schemes based on the shared parameters learned during the search process.

Appendix

2019-ICLR-DARTS: Differentiable Architecture Search-论文阅读的更多相关文章

论文笔记：DARTS: Differentiable Architecture Search
DARTS: Differentiable Architecture Search 2019-03-19 10:04:26accepted by ICLR 2019 Paper:https://arx ...
论文笔记系列-DARTS: Differentiable Architecture Search
Summary 我的理解就是原本节点和节点之间操作是离散的,因为就是从若干个操作中选择某一个,而作者试图使用softmax和relaxation(松弛化)将操作连续化,所以模型结构搜索的任务就转变成了 ...
论文笔记：Progressive Differentiable Architecture Search:Bridging the Depth Gap between Search and Evaluation
Progressive Differentiable Architecture Search:Bridging the Depth Gap between Search and Evaluation ...
2019-ICCV-PDARTS-Progressive Differentiable Architecture Search Bridging the Depth Gap Between Search and Evaluation-论文阅读
P-DARTS 2019-ICCV-Progressive Differentiable Architecture Search Bridging the Depth Gap Between Sear ...
论文笔记系列-Auto-DeepLab:Hierarchical Neural Architecture Search for Semantic Image Segmentation
Pytorch实现代码:https://github.com/MenghaoGuo/AutoDeeplab 创新点 cell-level and network-level search 以往的NAS ...
Research Guide for Neural Architecture Search
Research Guide for Neural Architecture Search 2019-09-19 09:29:04 This blog is from: https://heartbe ...
小米造最强超分辨率算法 | Fast, Accurate and Lightweight Super-Resolution with Neural Architecture Search
本篇是基于 NAS 的图像超分辨率的文章,知名学术性自媒体 Paperweekly 在该文公布后迅速跟进,发表分析称「属于目前很火的 AutoML / Neural Architecture Sear ...
论文笔记系列-Neural Architecture Search With Reinforcement Learning
摘要神经网络在多个领域都取得了不错的成绩,但是神经网络的合理设计却是比较困难的.在本篇论文中,作者使用递归网络去省城神经网络的模型描述,并且使用增强学习训练RNN,以使得生成得到的模型在验证集上 ...
论文笔记：Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation2019-03-18 14:4 ...

随机推荐

YOLACT : 首个实时one-stage实例分割模型，29.8mAP/33.5fps | ICCV 2019
论文巧妙地基于one-stage目标检测算法提出实时实例分割算法YOLACT,整体的架构设计十分轻量,在速度和效果上面达到很好的trade-off. 来源:[晓飞的算法工程笔记] 公众号论文: ...
spring mvc 中使用session
举例:用户登录成功之后,把用户对象放置到session中第一步,用户登录成功之后把用户对象首先放到Model中第二步,要在控制器上加SessionAttributes注解,把放到model中的对象 ...
pycharm添加头注释
1.进入setting->Editor->File and Code Templates->Python Script 2.添加内容 # coding = 'utf-8'# @作者: ...
拒绝老土！暗黑风格半透平面化主题—InfinityFreedom正式发布
经常听到“路由器界面土点就土点吧,凑合能用就成.” 诚然,路由器重要的是功能,但为什么要辣眼睛呢? 拯救喜欢折腾的你,抢救干涩的眼球,原创OpenWrt主题Infinity Freedom正式发布! ...
STM32 OSAL操作系统抽象层的移植
文章目录什么是 OSAL? 源码安装 Linux 上OSAL的移植 STM32上OSAL的移植关键点测试代码结语附件什么是 OSAL? 今天同学忽然问我有没有搞过OSAL,忽然间一头雾水, ...
jbpm4.4 发送邮件
测了两天终于成功发送出邮件了,坑爹呢!原来一直用QQ邮箱发送,发现发送不了,提示要用ssl协议进行发送,后来换成了126邮箱,发送成功了!具体配置如下: jbpm定义文件 <?xml versi ...
关于tez-ui的"All DAGs"和"Hive Queries"页面信息为空的问题解决过程
近段时间发现公司的HDP大数据平台的tez-ui页面不能用了,页面显示为空,导致通过hive提交的sql不能方便地查找到Yarn上对应的applicationId,只能通过beeline的屏幕输出信息 ...
equals(), "== ",hashcode() 详细解释
Object 通用方法容易混淆的定义先搞清楚各自的定义 "==" 用来判断相等 equals() 用来判断等价 hashcode() 用来返回散列值 "==&quo ...
不吹牛X，我真的干掉了if-else
我们在web开发中,经常使用数据库表中的字段作为"标记"来表示多个"状态",比如: 我们就以某宝的在线购物流程为例进行分析.在订单表中,使用zt字段来表示定单的 ...
sqlite聚合函数
常见聚合函数 avg(X) 用于返回组中所有非空列的平均值.字符串(string)或二进制数据(BLOB)等非数字类型当作0来计算.结果是浮点型的数据,即便所有数据中只有一个整数(integer)的数 ...

2019-ICLR-DARTS: Differentiable Architecture Search-论文阅读