【ML】Two-Stream Convolutional Networks for Action Recognition in Videos
Two-Stream Convolutional Networks for Action Recognition in Videos
&
Towards Good Practices for Very Deep Two-Stream ConvNets
Note here: it's a learning note on the topic of video representations. This note incorporates two papers about popular two-stream architecture.
Link: http://arxiv.org/pdf/1406.2199v2.pdf
http://arxiv.org/pdf/1507.02159v1.pdf
Motivation: CNN has significantly boosted the performance of object recognition in still images. However, the use of it for video recognition with stacked frames doesn’t outperform the one with individual frame (work by Karpathy), which indicates traditional way of adapting CNN to video clips doesn’t capture the motion well.
Proposed Model:
In order to learn the spatio-temporal features well, this paper proposed a two-stream architecture for video recognition. It passes the spatial information (single static RGB frame) and another temporal information (optical flow of multiple frames) through the ConvNet. Then fuse the parallel outputs of two streams to form the final class score fusion.
The overall pipeline is shown below:

- ConvNet input configurations:
There are some options for the input of temporal stream. The author discussed about utilizing optical flow stacking and trajectory stacking as motional information. The former one considers displacements of each point between consecutive frames, while the latter one focuses on the displacements of every point in the initial frame throughout the entire sequences.
They also mentioned bi-directional optical flow to enhance the capacity of video representations; and mean flow subtraction to avoid the influences of camera motion.
Visualization:
The visualization of filters in this architecture is shown below.
Each column corresponds to a filter, each row – to an input channel.
As we can draw from the image, one single filter composed with half black and half white means to compute spatial derivative; and the filters in a column with black turning into white gradually means to compute temporal derivative.
With the intuition above, we can see how the two-stream architecture captures the spatio-temporal features well.

Improvements:
There is another paper named Towards Good Practices for Very Deep Two-Stream ConvNets, which improves the efficiency of two-stream model in practice.
They argue that previous two-stream model didn’t significantly outperform other hand-crafted features for the mainly two reasons: first, the network is not deep enough as VGGNet&GoogLeNet; second, the lack of plenty training data limits its performance.
Thus, they proposed some suggestions to learn a more powerful two-stream model:
- Pre-training for Two-stream ConvNets: pre-train both spatial and temporal nets on ImageNet.
- Smaller Learning Rate.
- More Data Augmentation Techniques
- High Dropout Ratio: make the training of deep network with small amount of data easier.
- Multi-GPU training.
【ML】Two-Stream Convolutional Networks for Action Recognition in Videos的更多相关文章
- 【CV论文阅读】Two stream convolutional Networks for action recognition in Vedios
论文的三个贡献 (1)提出了two-stream结构的CNN,由空间和时间两个维度的网络组成. (2)使用多帧的密集光流场作为训练输入,可以提取动作的信息. (3)利用了多任务训练的方法把两个数据集联 ...
- 【ML】ICLR2016_Delving Deeper into Convolutional Networks
ICLR2016_DELVING DEEPER INTO CONVOLUTIONAL NETWORKS Note here: Ballas recently proposed a novel fram ...
- 目标检测--Spatial pyramid pooling in deep convolutional networks for visual recognition(PAMI, 2015)
Spatial pyramid pooling in deep convolutional networks for visual recognition 作者: Kaiming He, Xiangy ...
- Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Kaiming He, Xiangyu Zh ...
- SPPNet论文翻译-空间金字塔池化Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
http://www.dengfanxin.cn/?p=403 原文地址 我对物体检测的一篇重要著作SPPNet的论文的主要部分进行了翻译工作.SPPNet的初衷非常明晰,就是希望网络对输入的尺寸更加 ...
- 深度学习论文翻译解析(九):Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
论文标题:Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition 标题翻译:用于视觉识别的深度卷积神 ...
- 【Semantic Segmentation】 Instance-sensitive Fully Convolutional Networks论文解析(转)
这篇文章比较简单,但还是不想写overview,转自: https://blog.csdn.net/zimenglan_sysu/article/details/52451098 另外,读这篇pape ...
- 【注意力机制】Attention Augmented Convolutional Networks
注意力机制之Attention Augmented Convolutional Networks 原始链接:https://www.yuque.com/lart/papers/aaconv 核心内容 ...
- 【ML】Predict and Constrain: Modeling Cardinality in Deep Structured Prediction -预测和约束:在深度结构化预测中建模基数
[论文标题]Predict and Constrain: Modeling Cardinality in Deep Structured Prediction (35th-ICML,PMLR) [ ...
随机推荐
- CanalSharp.AspNetCore v0.0.4-支持输出到MongoDB
一.多样输出支持 CanalSharp.AspNetCore是一个基于CanalSharp的适用于ASP.NET Core的一个后台任务组件,它可以随着ASP.NET Core实例的启动而启动,目前采 ...
- Java中equals()和“==”区别
1.对于基础数据类型,使用“=="比较值是否相等: 2.对于复合数据类型(类),使用equals()和“==”效果是一样的,两者比较的都是对象在内存中的存放地址(确切的说,是堆内存地址). ...
- Django之ORM查询
ORM 映射关系: 表名 <-------> 类名 字段 <-------> 属性 表记录 <------->类实例对象图书管理系统的增删改查:代码如下:views ...
- python六十三课——高阶函数之sorted
演示sorted函数的使用,以及和sort的区别:我们将sorted和sort进行一番比较:相同点:它们都是来实现排序的操作(功能层面)不同点:列表中的sort函数,它执行完毕后会直接影响原本这个li ...
- eclipse+tomcat测试连接时候HTTP Status 404错误
想要在eclipse里部署tomcat,结果tomcat单独可以通过连接测试,用eclipse就404了 404肯定都是目录不对,试了半天在eclipse下改了一下配置和文件位置就行了 1.先在菜单栏 ...
- nodejs stream 手册学习
nodejs stream 手册 https://github.com/jabez128/stream-handbook 在node中,流可以帮助我们将事情的重点分为几份,因为使用流可以帮助我们将实现 ...
- linux连接iscsi存储方法
当前存储openfiler IP为 192.168.221.99 端口为3260 安装.启动iscsi rpm包 并改为开机自动运行 探测存储服务器 iscsiadm -m discovery ...
- shell编程之函数
一.函数定义和调用 函数是Shell脚本中自定义的一系列执行命令,一般来说函数应该设置有返回值(正确返回0,错误返回非0).对于错误返回,可以定义其他非0正值来细化错误.使用函数最大的好处是可避免出现 ...
- 升级pip后出现ImportError: cannot import name main
https://blog.csdn.net/accumulate_zhang/article/details/80269313
- 学习测试框架Mocha
学习测试框架Mocha 注意:是参考阮老师的文章来学的.虽然阮老师有讲解,但是觉得自己敲一遍,然后记录一遍效果会更好点.俗话说,好记性不如烂笔头. Mocha 是javascript测试框架之一,可以 ...