A Discriminative CNN Video Representation for Event Detection

Note here: it's a learning note on the topic of video representation, based on the paper below.

Link: http://arxiv.org/pdf/1411.4006v1.pdf

Motivation:

The use of improved Dense Trajectories (IDT) has led good performance on the task of event detection, while the performance of CNN based video representation is worse than that. The author argues the following three main reasons:

  • Lack of labeled video data to train good models.
  • Video level event labels are too coarse to finetune a pre-trained model for adapting the event detection task.
  • The use of average pooling to generate a discriminative video level representation from CNN frame level descriptors works worse than hand-crafted features like IDT.

Proposed Model:

This paper proposes a model mainly targetting at the third problem, namely how to build a cost-efficient and discrimintive video representation based on CNN.

1)     Firstly, we should extract frame-level descriptor:

We adopt M filters in the last convolutional layers as M latent concept classifiers. Each convolutional filter is corresponding to one latent concept, and each of it will apply on different location of the frame. So we’ll get responses of discrimintive latent concepts on different locations of the frame.

After that, we apply max-pooling operation on all concepts descriptors and concatenate different responses at the same location to form vectors each of which containing various concepts descriptions at this location.

By now, we’ve extract frame-level features.

(Actually, they didn’t do anything special at this step, they just give a new illustration of responses in CNN and rearrange those responses for further process.)

2)     Secondly, we need to encode a discrimintive video-level descriptor from all these frame-level descriptors:

They introduce and compare three different encoding methods in the paper.

However, as I’m not proficient in the mathematical meanings of them, I can just give a briefly look at them instead of going further.

Through experiment, they find out VLAD is better than other encoding methods. (You can refer to the paper for details about that experiment.)

"This is the first work on the video pooling of CNN descriptors and we broaden the encoding methods from local descriptors to CNN descriptors in video analysis."

(That's the takeaway in their work. They're the first to apply these encoding methods on the CNN descriptors. Previously, most of the works utilize Fisher Vector Encoding to encode a general feature of an image from local descriptors like HOG, SIFT, HOF and so on.)

3)     Lastly, we get a video-level descriptor and feed it into a SVM to do detection task.

Two Tricks:

1)     Spatial Pyramid Pooling: they apply four different CNN max-pooling operations to give more spatial locations for a single frame, which makes the descriptor more discrimintive. And that’s also more cost-friendly than applying spatial pyramid on raw frame.

2)     Representation Compression: they do Product Quntization to compress the final representation while still maintain or even slightly improve the original performance.

【CV】CVPR2015_A Discriminative CNN Video Representation for Event Detection的更多相关文章

  1. 论文阅读(Weilin Huang——【TIP2016】Text-Attentional Convolutional Neural Network for Scene Text Detection)

    Weilin Huang--[TIP2015]Text-Attentional Convolutional Neural Network for Scene Text Detection) 目录 作者 ...

  2. 【CV】ICCV2015_Unsupervised Learning of Visual Representations using Videos

    Unsupervised Learning of Visual Representations using Videos Note here: it's a learning note on Prof ...

  3. 【PSMA】Progressive Sample Mining and Representation Learning for One-Shot Re-ID

    目录 主要挑战 主要的贡献和创新点 提出的方法 总体框架与算法 Vanilla pseudo label sampling (PLS) PLS with adversarial learning Tr ...

  4. 【CV】ICCV2015_Learning Temporal Embeddings for Complex Video Analysis

    Learning Temporal Embeddings for Complex Video Analysis Note here: it's a review note on novel work ...

  5. 【CV】ICCV2015_Unsupervised Visual Representation Learning by Context Prediction

    Unsupervised Visual Representation Learning by Context Prediction Note here: it's a learning note on ...

  6. 【CV】ICCV2015_Unsupervised Learning of Spatiotemporally Coherent Metrics

    Unsupervised Learning of Spatiotemporally Coherent Metrics Note here: it's a learning note on the to ...

  7. 【CV】ICCV2015_Describing Videos by Exploiting Temporal Structure

    Describing Videos by Exploiting Temporal Structure Note here: it's a learning note on the topic of v ...

  8. 【ML】ICML2015_Unsupervised Learning of Video Representations using LSTMs

    Unsupervised Learning of Video Representations using LSTMs Note here: it's a learning notes on new L ...

  9. 【实战问题】【3】iPhone无法播放video标签中的视频

    问题:视频都是MP4格式,视频可以在手机上正常播放.video标签中的视频在安卓点击可以播放,但在iPhone无法播放 解决方案: 1,视频编码格式问题,具体iPhone手机支持的是哪些格式可见官方的 ...

随机推荐

  1. 4.5Python数据类型(5)之列表类型

    返回总目录 目录: 1.列表的定义 2.列表的常规操作 3.列表的额外操作 (一)列表的定义: 列表的定义 [var1, var2, --, var n ] # (1)列表的定义 [var1, var ...

  2. Linux 小知识翻译 - 「克隆」

    最近比较流行的Linux发行版,得是连新闻都报道的,刚刚发布新版的「CentOS」了. 「CentOS」一般被称为Red Hat EnterpriseLinux的克隆版本,这是什么意思呢? Linux ...

  3. File类_构造函数

    File类:用来将文件或者文件夹封装成对象方便对文件或或文件夹的属性信息进行操作File对象可以作为参数传递给流的构造函数 import java.io.File; public class File ...

  4. Python向上取整,向下取整以及四舍五入函数

    import math f = 11.2 print math.ceil(f) #向上取整 print math.floor(f) #向下取整 print round(f) #四舍五入 #这三个函数的 ...

  5. mysql如何修改开启允许远程连接 (windows)

    每天学习一点点 编程PDF电子书免费下载: http://www.shitanlife.com/code 关于mysql远程连接的问题,大家在公司工作中,经常会遇到mysql数据库存储于某个人的电脑上 ...

  6. kafka监控kafka-eagle 容器化配置

    由于kafka.zk 集群已经部署在k8s中,  kafka的服务名 kafka-hs, zk的服务名为:zk-cs ,对kafka进行监控,所以需要把监控部署到k8s中,选择使用kafka-eagl ...

  7. 对Promise的理解?

    ES6原生提供了promise对象 所谓Promise,就是一个对象,用来传递异步操作的消息.它代表了某个未来才会知道结果的事件(通过是一个异步操作),并且这个事件提供统一的API,可供进一步处理 P ...

  8. k8s部署spring-boot项目失败

    现象:spring-boot项目启动到某个地方停止,然后容器重启 解决:扩大内存和核心数

  9. 深入浅出的webpack4构建工具--webpack4+vue+route+vuex项目构建(十七)

    阅读目录 一:vue传值方式有哪些? 二:理解使用Vuex 三:webpack4+vue+route+vuex 项目架构 回到顶部 一:vue传值方式有哪些? 在vue项目开发过程中,经常会使用组件来 ...

  10. HTML5中的execCommand命令

    HTML5中的execCommand命令 在html5中,可以通过execCommand方法来运行一条命令,每一条命令都将对用户通过鼠标所选取的内容执行一些操作. 1. execCommand方法 浏 ...