Two-Stream Convolutional Networks for Action Recognition in Videos

&

Towards Good Practices for Very Deep Two-Stream ConvNets

Note here: this is a learning note on the topic of video representations. It covers two papers on the popular two-stream architecture.

Link: http://arxiv.org/pdf/1406.2199v2.pdf

http://arxiv.org/pdf/1507.02159v1.pdf

Motivation: CNNs have significantly boosted the performance of object recognition in still images. However, applying them to video recognition by simply stacking frames does not outperform using individual frames (work by Karpathy et al.), which indicates that this straightforward way of adapting CNNs to video clips does not capture motion well.

Proposed Model:

In order to learn spatio-temporal features well, the paper proposes a two-stream architecture for video recognition. Spatial information (a single static RGB frame) and temporal information (optical flow over multiple frames) are each passed through a separate ConvNet, and the parallel outputs of the two streams are then fused to form the final class scores.
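To make the idea concrete, here is a minimal PyTorch sketch of the two-stream layout with late score fusion. The tiny backbone, class count, and flow-stack length are illustrative placeholders, not the paper's CNN-M-2048 configuration.

```python
import torch
import torch.nn as nn

def small_convnet(in_channels, num_classes):
    # Placeholder backbone; the papers use much deeper networks (CNN-M-2048, VGG, GoogLeNet).
    return nn.Sequential(
        nn.Conv2d(in_channels, 96, kernel_size=7, stride=2),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(96, num_classes),
    )

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes=101, flow_length=10):
        super().__init__()
        self.spatial = small_convnet(3, num_classes)                 # single RGB frame
        self.temporal = small_convnet(2 * flow_length, num_classes)  # x/y flow for L frames

    def forward(self, rgb, flow_stack):
        # Late fusion: average the softmax class scores of the two streams.
        s = torch.softmax(self.spatial(rgb), dim=1)
        t = torch.softmax(self.temporal(flow_stack), dim=1)
        return (s + t) / 2

# Example: one 224x224 RGB frame plus a 2L-channel optical flow stack (L=10).
net = TwoStreamNet()
scores = net(torch.randn(1, 3, 224, 224), torch.randn(1, 20, 224, 224))
```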

The overall pipeline is shown below:

-      ConvNet input configurations:

There are several options for the input of the temporal stream. The authors discuss optical flow stacking and trajectory stacking as ways of encoding motion. The former stacks the displacement of each point between consecutive frames, while the latter follows the displacement of every point in the initial frame along its trajectory throughout the entire sequence.
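A rough numpy sketch of the simpler option, optical flow stacking, is given below. The flow fields themselves are assumed to be computed elsewhere (e.g. with an off-the-shelf optical flow method), and the function name is illustrative.

```python
import numpy as np

def stack_optical_flow(flows_u, flows_v, t, L=10):
    """Build the temporal-stream input starting at frame t.

    flows_u / flows_v: lists of HxW arrays holding the horizontal and vertical
    displacement fields between consecutive frames (computed elsewhere).
    """
    channels = []
    for k in range(L):
        channels.append(flows_u[t + k])  # channel 2k:   horizontal displacement
        channels.append(flows_v[t + k])  # channel 2k+1: vertical displacement
    return np.stack(channels, axis=0)    # shape (2L, H, W)
```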

They also mention bi-directional optical flow to enrich the video representation, and mean flow subtraction to suppress the influence of global camera motion.
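Mean flow subtraction itself is a one-liner once the flow stack is built; a minimal sketch, assuming the same (2L, H, W) layout as above:

```python
import numpy as np

def subtract_mean_flow(flow_stack):
    # flow_stack: (2L, H, W); remove each displacement field's spatial mean so
    # that a global camera translation does not dominate the input.
    return flow_stack - flow_stack.mean(axis=(1, 2), keepdims=True)
```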

Visualization:

The visualization of filters in this architecture is shown below.

Each column corresponds to a filter, and each row to an input channel.

As the image shows, a single filter that is half black and half white computes a spatial derivative, while a column of filters that gradually shifts from black to white computes a temporal derivative.

With this intuition, we can see how the two-stream architecture captures spatio-temporal features well.

Improvements:

Another paper, Towards Good Practices for Very Deep Two-Stream ConvNets, presents practical techniques for training deeper and more effective two-stream models.

They argue that the original two-stream model did not significantly outperform hand-crafted features for two main reasons: first, its networks are not as deep as VGGNet or GoogLeNet; second, the limited amount of training data restricts its performance.

Thus, they propose several suggestions for training a more powerful two-stream model (a rough training sketch follows the list):

-      Pre-training for Two-stream ConvNets: initialize both the spatial and temporal nets from ImageNet-pretrained models.

-      Smaller Learning Rate.

-      More Data Augmentation Techniques.

-      High Dropout Ratio: makes it easier to train deep networks with a small amount of data.

-      Multi-GPU training.
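Below is a hedged sketch of how these suggestions might translate into a training setup for the spatial stream, using torchvision's ImageNet-pretrained VGG-16. The specific learning rate, decay schedule, dropout values, and augmentation parameters are illustrative placeholders rather than the paper's exact numbers.

```python
import torch
import torch.nn as nn
import torchvision

# Spatial stream: start from an ImageNet-pretrained VGG-16 (illustrative choice).
backbone = torchvision.models.vgg16(weights="IMAGENET1K_V1")
backbone.classifier[2] = nn.Dropout(p=0.9)       # high dropout after the fc layers
backbone.classifier[5] = nn.Dropout(p=0.8)
backbone.classifier[6] = nn.Linear(4096, 101)    # e.g. 101 action classes (UCF-101)

# Smaller learning rate than standard ImageNet training, decayed in steps.
optimizer = torch.optim.SGD(backbone.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4000, gamma=0.1)

# More aggressive data augmentation, e.g. aggressive cropping and flipping.
augment = torchvision.transforms.Compose([
    torchvision.transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
])
```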
