【ML】Two-Stream Convolutional Networks for Action Recognition in Videos
Two-Stream Convolutional Networks for Action Recognition in Videos & Towards Good Practices for Very Deep Two-Stream ConvNets
Note: this is a learning note on video representations. It covers two papers on the popular two-stream architecture.
Link: http://arxiv.org/pdf/1406.2199v2.pdf
http://arxiv.org/pdf/1507.02159v1.pdf
Motivation: CNNs have significantly boosted the performance of object recognition in still images. However, applying them to video recognition with stacked frames as input does not outperform using individual frames (see the work by Karpathy et al.), which indicates that this conventional way of adapting CNNs to video clips does not capture motion well.
Proposed Model:
To learn spatio-temporal features well, the paper proposes a two-stream architecture for video recognition. The spatial stream takes a single static RGB frame as input, while the temporal stream takes the optical flow computed over multiple frames; each stream is processed by its own ConvNet. The class scores of the two parallel streams are then fused to produce the final prediction.
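As a minimal sketch of the fusion step, assuming each stream outputs one vector of class logits and that the scores are combined by a weighted average of softmax probabilities (averaging is one simple option; the function names, the `temporal_weight` parameter, and the toy logits are hypothetical and not taken from the paper's code):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def fuse_scores(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Late fusion: weighted average of the two streams' softmax scores.

    temporal_weight is a hypothetical knob; 0.5 gives a plain average.
    """
    spatial = softmax(np.asarray(spatial_logits, dtype=float))
    temporal = softmax(np.asarray(temporal_logits, dtype=float))
    return (1.0 - temporal_weight) * spatial + temporal_weight * temporal

# Toy logits for a 3-class problem (made-up numbers).
spatial_logits = np.array([2.0, 1.0, 0.1])   # spatial stream prefers class 0
temporal_logits = np.array([0.5, 2.5, 0.3])  # temporal stream prefers class 1

fused = fuse_scores(spatial_logits, temporal_logits)
predicted_class = int(np.argmax(fused))
print(fused, predicted_class)
```

In this toy case the temporal stream is more confident than the spatial one, so the averaged scores follow the temporal prediction.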
The overall pipeline is shown below:

- ConvNet input configurations:
There are several options for the input of the temporal stream. The authors discuss two ways of encoding motion information: optical flow stacking and trajectory stacking. The former records the displacement at each pixel location between consecutive frames, while the latter follows every point of the initial frame and records its displacements along its trajectory through the whole sequence.
They also mention bi-directional optical flow, which enriches the video representation, and mean flow subtraction, which reduces the influence of camera motion.
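A rough sketch of optical flow stacking with mean flow subtraction, assuming the flow fields have already been computed by some optical flow method and are given as an array of shape `(L, H, W, 2)`; the function name, the array layout, and the random stand-in data are assumptions for illustration:

```python
import numpy as np

def stack_optical_flow(flows, subtract_mean=True):
    """Turn L flow fields of shape (L, H, W, 2) into one (H, W, 2L) input volume.

    Channels are interleaved as x_1, y_1, x_2, y_2, ..., x_L, y_L.
    """
    flows = np.asarray(flows, dtype=np.float32)
    if subtract_mean:
        # Mean flow subtraction: remove the mean displacement vector of each
        # field, a simple way to compensate for global (camera) motion.
        flows = flows - flows.mean(axis=(1, 2), keepdims=True)
    num_flows, height, width, _ = flows.shape
    # (L, H, W, 2) -> (H, W, L, 2) -> (H, W, 2L)
    return flows.transpose(1, 2, 0, 3).reshape(height, width, 2 * num_flows)

# Random stand-in for real flow fields, with a constant offset playing the
# role of camera motion.
rng = np.random.default_rng(0)
flows = rng.normal(size=(10, 32, 32, 2)).astype(np.float32) + 3.0

volume = stack_optical_flow(flows)
print(volume.shape)
```

With `L = 10` flow fields the temporal stream sees a 20-channel input, one horizontal and one vertical displacement channel per frame pair.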
Visualization:
The visualization of the first-layer filters of the temporal stream is shown below.
Each column corresponds to a filter, and each row corresponds to an input channel.
As the image shows, a single filter that is half black and half white computes a spatial derivative, while a column whose filters change gradually from black to white across the input channels computes a temporal derivative.
This gives an intuition for how the two-stream architecture captures spatio-temporal features.
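To make the derivative intuition concrete, here is a toy sketch with hand-written weights (not learned filters; all values are made up). A filter whose weights flip sign across space responds to spatial change, and a set of weights that goes from negative to positive across the stacked channels responds to change over time:

```python
import numpy as np

# Spatial derivative: negative ("black") left half, positive ("white") right half.
spatial_filter = np.array([[-1.0, -1.0, 1.0, 1.0]] * 4)

flat_patch = np.full((4, 4), 5.0)                   # uniform flow, no spatial change
edge_patch = np.array([[0.0, 0.0, 1.0, 1.0]] * 4)   # flow changes from left to right

flat_response = float((spatial_filter * flat_patch).sum())
edge_response = float((spatial_filter * edge_patch).sum())

# Temporal derivative: weights go gradually from negative to positive
# across 5 stacked input channels (one weight per time step).
temporal_weights = np.linspace(-1.0, 1.0, 5)

constant_motion = np.full(5, 2.0)   # same displacement at every time step
accelerating_motion = np.arange(5.0)  # displacement grows over time

constant_response = float(temporal_weights @ constant_motion)
accelerating_response = float(temporal_weights @ accelerating_motion)

print(flat_response, edge_response)
print(constant_response, accelerating_response)
```

Both filters give a zero response when nothing changes (across space or across time, respectively) and a positive response when the input changes in the direction the weights encode.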

Improvements:
A follow-up paper, Towards Good Practices for Very Deep Two-Stream ConvNets, improves the two-stream model in practice.
The authors argue that the previous two-stream model did not significantly outperform hand-crafted features for two main reasons: first, the networks were not as deep as VGGNet and GoogLeNet; second, the lack of sufficient training data limited performance.
They therefore propose several practices for learning a more powerful two-stream model:
- Pre-training for Two-stream ConvNets: pre-train both spatial and temporal nets on ImageNet.
- Smaller Learning Rate.
- More Data Augmentation Techniques.
- High Dropout Ratio: makes it easier to train a deep network on a small amount of data.
- Multi-GPU training.