Video Architecture Search

2019-10-20 06:48:26

This blog is from: https://ai.googleblog.com/2019/10/video-architecture-search.html

Posted by Michael S. Ryoo, Research Scientist and AJ Piergiovanni, Student Researcher, Robotics at Google

Video understanding is a challenging problem. Because a video contains spatio-temporal data, its feature representation is required to abstract both appearance and motion information. This is not only essential for automated understanding of the semantic content of videos, such as web-video classification or sport activity recognition, but is also crucial for robot perception and learning. Just like humans, an input from a robot’s camera is seldom a static snapshot of the world, but takes the form of a continuous video.

The abilities of today’s deep learning models are greatly dependent on their neural architectures. Convolutional neural networks (CNNs) for videos are normally built by manually extending known 2D architectures such as Inception and ResNet to 3D or by carefully designing two-stream CNN architectures that fuse together both appearance and motion information. However, designing an optimal video architecture to best take advantage of spatio-temporal information in videos still remains an open problem. Although neural architecture search (e.g., Zoph et al, Real et al) to discover good architectures has been widely explored for images, machine-optimized neural architectures for videos have not yet been developed. Video CNNs are typically computation- and memory-intensive, and designing an approach to efficiently search for them while capturing their unique properties has been difficult.

In response to these challenges, we have conducted a series of studies into automatic searches for more optimal network architectures for video understanding. We showcase three different neural architecture evolution algorithms: learning layers and their module configuration (EvaNet); learning multi-stream connectivity (AssembleNet); and building computationally efficient and compact networks (TinyVideoNet). The video architectures we developed outperform existing hand-made models on multiple public datasets by a significant margin, and demonstrate a 10x~100x improvement in network runtime.

EvaNet: The first evolved video architectures
EvaNet, which we introduce in “Evolving Space-Time Neural Architectures for Videos” at ICCV 2019, is the very first attempt to design neural architecture search for video architectures. EvaNet is a module-level architecture search that focuses on finding types of spatio-temporal convolutional layers as well as their optimal sequential or parallel configurations. An evolutionary algorithm with mutation operators is used for the search, iteratively updating a population of architectures. This allows for parallel and more efficient exploration of the search space, which is necessary for video architecture search to consider diverse spatio-temporal layers and their combinations. EvaNet evolves multiple modules (at different locations within the network) to generate different architectures.

Our experimental results confirm the benefits of such video CNN architectures obtained by evolving heterogeneous modules. The approach often finds that non-trivial modules composed of multiple parallel layers are most effective as they are faster and exhibit superior performance to hand-designed modules. Another interesting aspect is that we obtain a number of similarly well-performing, but diverse architectures as a result of the evolution, without extra computation. Forming an ensemble with them further improves performance. Due to their parallel nature, even an ensemble of models is computationally more efficient than the other standard video networks, such as (2+1)D ResNet. We have open sourced the code.

Examples of various EvaNet architectures. Each colored box (large or small) represents a layer with the color of the box indicating its type: 3D conv. (blue), (2+1)D conv. (orange), iTGM (green), max pooling (grey), averaging (purple), and 1x1 conv. (pink). Layers are often grouped to form modules (large boxes). Digits within each box indicate the filter size.

AssembleNet: Building stronger and better (multi-stream) models
In “AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures”, we look into a new method of fusing different sub-networks with different input modalities (e.g., RGB and optical flow) and temporal resolutions. AssembleNet is a “family” of learnable architectures that provide a generic approach to learn the “connectivity” among feature representations across input modalities, while being optimized for the target task. We introduce a general formulation that allows representation of various forms of multi-stream CNNs as directed graphs, coupled with an efficient evolutionary algorithm to explore the high-level network connectivity. The objective is to learn better feature representations across appearance and motion visual clues in videos. Unlike previous hand-designed two-stream models that use late fusion or fixed intermediate fusion, AssembleNet evolves a population of overly-connected, multi-stream, multi-resolution architectures while guiding their mutations by connection weight learning. We are looking at four-stream architectures with various intermediate connections for the first time — 2 streams per RGB and optical flow, each one at different temporal resolutions.

The figure below shows an example of an AssembleNet architecture, found by evolving a pool of random initial multi-stream architectures over 50~150 rounds. We tested AssembleNet on two very popular video recognition datasets: Charades and Moments-in-Time (MiT). Its performance on MiT is the first above 34%. The performances on Charades is even more impressive at 58.6% mean Average Precision (mAP), whereas previous best known results are 42.5 and 45.2.

The representative AssembleNet model evolved using the Moments-in-Time dataset. A node corresponds to a block of spatio-temporal convolutional layers, and each edge specifies their connectivity. Darker edges mean stronger connections. AssembleNet is a family of learnable multi-stream architectures, optimized for the target task.

A figure comparing AssembleNet with state-of-the-art, hand-designed models on Charades (left) and Moments-in-Time (right) datasets. AssembleNet-50 or AssembleNet-101 has an equivalent number of parameters to a two-stream ResNet-50 or ResNet-101.

Tiny Video Networks: The fastest video understanding networks
In order for a video CNN model to be useful for devices operating in a real-world environment, such as that needed by robots, real-time, efficient computation is necessary. However, achieving state-of-the-art results on video recognition tasks currently requires extremely large networks, often with tens to hundreds of convolutional layers, that are applied to many input frames. As a result, these networks often suffer from very slow runtimes, requiring at least 500+ ms per 1-second video snippet on a contemporary GPU and 2000+ ms on a CPU. In Tiny Video Networks, we address this by automatically designing networks that provide comparable performance at a fraction of the computational cost. Our Tiny Video Networks (TinyVideoNets) achieve competitive accuracy and run efficiently, at real-time or better speeds, within 37 to 100 ms on a CPU and 10 ms on a GPU per ~1 second video clip, achieving hundreds of times faster speeds than the other human-designed contemporary models.

These performance gains are achieved by explicitly considering the model run-time during the architecture evolution and forcing the algorithm to explore the search space while including spatial or temporal resolution and channel size to reduce computations. The below figure illustrates two simple, but very effective architectures, found by TinyVideoNet. Interestingly the learned model architectures have fewer convolutional layers than typical video architectures: Tiny Video Networks prefers lightweight elements, such as 2D pooling, gating layers, and squeeze-and-excitation layers. Further, TinyVideoNet is able to jointly optimize parameters and runtime to provide efficient networks that can be used by future network exploration.

TinyVideoNet (TVN) architectures evolved to maximize the recognition performance while keeping its computation time within the desired limit. For instance, TVN-1 (top) runs at 37 ms on a CPU and 10ms on a GPU. TVN-2 (bottom) runs at 65ms on a CPU and 13ms on a GPU.

CPU runtime of TinyVideoNet models compared to prior models (left) and runtime vs. model accuracy of TinyVideoNets compared to (2+1)D ResNet models (right). Note that TinyVideoNets take a part of this time-accuracy space where no other models exist, i.e., extremely fast but still accurate.

Conclusion
To our knowledge, this is the very first work on neural architecture search for video understanding. The video architectures we generate with our new evolutionary algorithms outperform the best known hand-designed CNN architectures on public datasets, by a significant margin. We also show that learning computationally efficient video models, TinyVideoNets, is possible with architecture evolution. This research opens new directions and demonstrates the promise of machine-evolved CNNs for video understanding.

Acknowledgements
This research was conducted by Michael S. Ryoo, AJ Piergiovanni, and Anelia Angelova. Alex Toshev and Mingxing Tan also contributed to this work. We thank Vincent Vanhoucke, Juhana Kangaspunta, Esteban Real, Ping Yu, Sarah Sirajuddin, and the Robotics at Google team for discussion and support.

Video Architecture Search的更多相关文章

论文笔记：Progressive Neural Architecture Search
Progressive Neural Architecture Search 2019-03-18 20:28:13 Paper:http://openaccess.thecvf.com/conten ...
论文笔记：Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation2019-03-18 14:4 ...
论文笔记：Fast Neural Architecture Search of Compact Semantic Segmentation Models via Auxiliary Cells
Fast Neural Architecture Search of Compact Semantic Segmentation Models via Auxiliary Cells 2019-04- ...
（转）Illustrated: Efficient Neural Architecture Search ---Guide on macro and micro search strategies in ENAS
Illustrated: Efficient Neural Architecture Search --- Guide on macro and micro search strategies in ...
（转）The Evolved Transformer - Enhancing Transformer with Neural Architecture Search
The Evolved Transformer - Enhancing Transformer with Neural Architecture Search 2019-03-26 19:14:33 ...
论文笔记：ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware 2019-03-19 16:13:18 Pape ...
论文笔记：DARTS: Differentiable Architecture Search
DARTS: Differentiable Architecture Search 2019-03-19 10:04:26accepted by ICLR 2019 Paper:https://arx ...
论文笔记系列-DARTS: Differentiable Architecture Search
Summary 我的理解就是原本节点和节点之间操作是离散的,因为就是从若干个操作中选择某一个,而作者试图使用softmax和relaxation(松弛化)将操作连续化,所以模型结构搜索的任务就转变成了 ...
论文笔记系列-Simple And Efficient Architecture Search For Neural Networks
摘要本文提出了一种新方法,可以基于简单的爬山过程自动搜索性能良好的CNN架构,该算法运算符应用网络态射,然后通过余弦退火进行短期优化运行. 令人惊讶的是,这种简单的方法产生了有竞争力的结果,尽管只需 ...

随机推荐

jQuery源码学习一：创建一个jquery实例
前言: jquery是每个前端都会的基础技能,众所周知,jquery返回的是jquery实例方法,但是我们似乎是直接使用$就可以获取到jquery的方法啦,可以在浏览器中判断一下 window.$ 和 ...
sqlserver一次性修改多条
修改客户表编号为 0101007002,0101007003的楼栋号 007-1-102,007-1-201 UPDATE gas_customerSET building= CASEWHEN g ...
swipe滑动操作
1.swipe() 滑动用法 swipe(self, start_x, start_y, end_x, end_y, duration=None) :Args: - start_x - 开始滑动的x坐 ...
Docker 修改容器内的时区
利用docker发布服务,发现程序用获取的系统时间比正常时间晚了8个小时进入容器 docker exec -it /bin/sh 查询时间 date -R 发现时区为0时区解决思路 1.复制相应 ...
Intel重大漏洞之Meltdown和Spectre
史上最大漏洞危机:影响所有 iPhone.Android.PC 设备,修复困难重重近日,英特尔的日子可并不好过. 作为全球知名芯片制造商,任何有关英特尔芯片漏洞的问题都会导致全球上百万设备遭受牵连. ...
Apache JMeter系列.1
最爱看统计 --01-- 简介 Apache JMeter Apache JMeter可用于测试静态和动态资源(文件,Servlet,Perl脚本,Java对象,数据库和查询,FTP服务器等)的性能. ...
什么是 MFA？
Multi-Factor Authentication (MFA) 是一种简单有效的最佳安全实践方法,它能够在用户名和密码之外再额外增加一层安全保护. 启用 MFA 后,用户登录阿里云网站时,系统将要 ...
ES6 的class类笔记
class Person{ // 构造 constructor(x,y){ this.x = x; this.y = y; } toString(){ return (this.x + "的 ...
Virtual DOM的渲染机制--猜测
一个node的状态发生变化: 会对当前结点和子节点的数据全部进行更新: 然后进行dom比较: 比较完毕后一次性提交: 相对于以前的渲染方式: 每一个node的数据发生变化,都会产生一次渲染提交: 以上 ...
SSH和SSM对比（一）
当下流行的两种企业开发MVC开源框架,是我们Java程序猿必备知识能力.MVC,即模型(model)-视图(view)-控制器(controller)的缩写,一种软件设计典范,用一种业务逻辑.数据.界 ...

Video Architecture Search

Video Architecture Search

Video Architecture Search的更多相关文章

随机推荐

热门专题