【CV】CVPR2015_A Discriminative CNN Video Representation for Event Detection

A Discriminative CNN Video Representation for Event Detection

Note here: it's a learning note on the topic of video representation, based on the paper below.

Link: http://arxiv.org/pdf/1411.4006v1.pdf

Motivation:

The use of improved Dense Trajectories (IDT) has led good performance on the task of event detection, while the performance of CNN based video representation is worse than that. The author argues the following three main reasons:

Lack of labeled video data to train good models.
Video level event labels are too coarse to finetune a pre-trained model for adapting the event detection task.
The use of average pooling to generate a discriminative video level representation from CNN frame level descriptors works worse than hand-crafted features like IDT.

Proposed Model:

This paper proposes a model mainly targetting at the third problem, namely how to build a cost-efficient and discrimintive video representation based on CNN.

1) Firstly, we should extract frame-level descriptor:

We adopt M filters in the last convolutional layers as M latent concept classifiers. Each convolutional filter is corresponding to one latent concept, and each of it will apply on different location of the frame. So we’ll get responses of discrimintive latent concepts on different locations of the frame.

After that, we apply max-pooling operation on all concepts descriptors and concatenate different responses at the same location to form vectors each of which containing various concepts descriptions at this location.

By now, we’ve extract frame-level features.

(Actually, they didn’t do anything special at this step, they just give a new illustration of responses in CNN and rearrange those responses for further process.)

2) Secondly, we need to encode a discrimintive video-level descriptor from all these frame-level descriptors:

They introduce and compare three different encoding methods in the paper.

However, as I’m not proficient in the mathematical meanings of them, I can just give a briefly look at them instead of going further.

Fisher Vector Encoding (refer to: http://blog.csdn.net/breeze5428/article/details/32706507 & http://www.cnblogs.com/CBDoctor/archive/2011/11/06/2236286.html)
VLAD Encoding (simplified version of Fisher Vector Encoding)
Average Pooling

Through experiment, they find out VLAD is better than other encoding methods. (You can refer to the paper for details about that experiment.)

"This is the ﬁrst work on the video pooling of CNN descriptors and we broaden the encoding methods from local descriptors to CNN descriptors in video analysis."

(That's the takeaway in their work. They're the first to apply these encoding methods on the CNN descriptors. Previously, most of the works utilize Fisher Vector Encoding to encode a general feature of an image from local descriptors like HOG, SIFT, HOF and so on.)

3) Lastly, we get a video-level descriptor and feed it into a SVM to do detection task.

Two Tricks:

1) Spatial Pyramid Pooling: they apply four different CNN max-pooling operations to give more spatial locations for a single frame, which makes the descriptor more discrimintive. And that’s also more cost-friendly than applying spatial pyramid on raw frame.

2) Representation Compression: they do Product Quntization to compress the final representation while still maintain or even slightly improve the original performance.

【CV】CVPR2015_A Discriminative CNN Video Representation for Event Detection的更多相关文章

论文阅读（Weilin Huang——【TIP2016】Text-Attentional Convolutional Neural Network for Scene Text Detection）
Weilin Huang--[TIP2015]Text-Attentional Convolutional Neural Network for Scene Text Detection) 目录作者 ...
【CV】ICCV2015_Unsupervised Learning of Visual Representations using Videos
Unsupervised Learning of Visual Representations using Videos Note here: it's a learning note on Prof ...
【PSMA】Progressive Sample Mining and Representation Learning for One-Shot Re-ID
目录主要挑战主要的贡献和创新点提出的方法总体框架与算法 Vanilla pseudo label sampling (PLS) PLS with adversarial learning Tr ...
【CV】ICCV2015_Learning Temporal Embeddings for Complex Video Analysis
Learning Temporal Embeddings for Complex Video Analysis Note here: it's a review note on novel work ...
【CV】ICCV2015_Unsupervised Visual Representation Learning by Context Prediction
Unsupervised Visual Representation Learning by Context Prediction Note here: it's a learning note on ...
【CV】ICCV2015_Unsupervised Learning of Spatiotemporally Coherent Metrics
Unsupervised Learning of Spatiotemporally Coherent Metrics Note here: it's a learning note on the to ...
【CV】ICCV2015_Describing Videos by Exploiting Temporal Structure
Describing Videos by Exploiting Temporal Structure Note here: it's a learning note on the topic of v ...
【ML】ICML2015_Unsupervised Learning of Video Representations using LSTMs
Unsupervised Learning of Video Representations using LSTMs Note here: it's a learning notes on new L ...
【实战问题】【3】iPhone无法播放video标签中的视频
问题:视频都是MP4格式,视频可以在手机上正常播放.video标签中的视频在安卓点击可以播放,但在iPhone无法播放解决方案: 1,视频编码格式问题,具体iPhone手机支持的是哪些格式可见官方的 ...

随机推荐

ccf--20150903--模板生成系统
本题思路:首先,使用一个map来存储所有需要替换的关键词,然后,再逐行的替换掉其中的关键词,记住,find每次的其实位置不一样,否则会出现递归生成没有出现关键词就清空掉.最后输出. 题目和代码如下: ...
用python写个简单的小程序，编译成exe跑在win10上
每天的工作其实很无聊,早知道应该去IT公司闯荡的.最近的工作内容是每逢一个整点,从早7点到晚11点,去查一次客流数据,整理到表格中,上交给素未蒙面的上线,由他呈交领导查阅. 人的精力毕竟是有限的,所以 ...
File类_构造函数
File类:用来将文件或者文件夹封装成对象方便对文件或或文件夹的属性信息进行操作File对象可以作为参数传递给流的构造函数 import java.io.File; public class File ...
关于对浏览器发送POST请求的一点研究
网上对与HTTP的Method,GET和POST的区别,说得毕竟详细.然后提到一点,说浏览器对两者的还有一个比较容易让人忽略的区别就是:POST会分2次发送,而GET只1次. GET发送1次,这个没什 ...
MyISAM to InnoDB: Why and How（MYSQL官方译文）
原文地址:https://www.mysql.com/why-mysql/presentations/myisam-2-innodb-why-and-how/ MySQL使用一个插拔式的存储引擎架构, ...
adb报错问题解决方法
1,报错信息:adb server version (31) doesn't match this client (40); killing 解决方法: 一: 主要是前面的31或者其他,比如32/31 ...
loback学习
博客链接 http://aub.iteye.com/blog/1101222
使用js切割URL的参数
对于一些开发场景,不使用Jsp或freemarker及其其他的模板引擎时,通常通过切割url获得对应的参数,然后通过AJAX与后台交互得到对应的数据下面是演示实例: test.html <!D ...
sd错误---2
一道水题绊了我,居然愿意很简单我多打了一遍int main 阴错阳差的自定义的函数上面出现了int main 所以以后想想好思路(是那种很全很全的思路) 再写代码一定要避免sd错误!!! ...
C#基础巩固(3)-Linq To XML 读取XML
记录下一些读取XML的方法,以免到用的时候忘记了,还得花时间去找. 一.传统写法读取XML 现在我有一个XML文件如下: 现在我要查找名字为"王五"的这个人的 Id 和sex(性别 ...

【CV】CVPR2015_A Discriminative CNN Video Representation for Event Detection

【CV】CVPR2015_A Discriminative CNN Video Representation for Event Detection的更多相关文章

随机推荐

热门专题