Paper Reading: End-to-End Learning of Action Detection from Frame Glimpses in Videos

End-to-End Learning of Action Detection from Frame Glimpses in Videos

CVPR 2016

　　Motivation:

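　　This paper uses an attention model over time (deciding which frames to look at) to help with action detection. The authors argue that detecting actions in long, real-world videos is a very challenging vision problem: an algorithm must infer not only whether an action occurs, but also when in time it occurs. Most existing work builds frame-level classifiers, runs them at multiple temporal scales, and applies post-processing such as duration priors and non-maximum suppression. Such indirect action localization leaves room for improvement in both time complexity and accuracy.

　　This paper proposes an end-to-end approach to action detection that directly infers the temporal bounds of actions. The key intuition is that detecting an action is a process of continuous, iterative observation and refinement. The model sequentially decides where to look and how to refine its hypotheses, so that actions are localized accurately with as little search as possible.

　　Based on this idea, the authors propose a single coherent model that takes a long video as input and outputs the temporal bounds of the detected action instances. The model can be viewed as an agent that learns a policy for sequentially forming and refining hypotheses about action instances. The approach builds on the paper Recurrent Models of Visual Attention, but action detection poses a new challenge: how to handle a variable-sized set of structured detection outputs?

　　To address this, the proposed model decides both which frame to observe next and when to emit a prediction. On top of that, a reward mechanism is designed so that this policy can be learned. The authors state that this is the first end-to-end approach for learning to detect actions in videos.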

　　Methods:

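　　The network consists of two main parts: an observation network and a recurrent network. The observation network encodes visual representations of video frames, while the recurrent network (RNN) processes these observations and decides which frame to observe next and when to emit a prediction about an action.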

　　1. Observation Network:

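　　As shown in Figure 2 of the paper, at each timestep the observation network observes a single video frame, encodes it into a feature vector $o_n$, and passes it to the RNN as input.

　　More importantly, $o_n$ encodes both where and what: where in the video an observation was taken, and what was seen. The inputs to the observation network are the normalized temporal location of the observation, $l_n \in [0, 1]$, and the corresponding video frame $v_{l_n}$.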

　　2. Recurrent Network:

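　　The recurrent network $f_h$ forms the core of the agent. As shown in Figure 2, its input at each timestep $n$ is the observation feature vector $o_n$. The hidden state $h_n$ is computed from $o_n$ and the previous hidden state $h_{n-1}$, and models temporal hypotheses about action instances.

　　As the agent reasons over the video, it produces three outputs at each timestep: a candidate detection $d_n$; a binary indicator $p_n$ signaling whether $d_n$ should be emitted as a prediction; and a temporal location $l_{n+1}$ indicating which video frame to observe next.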

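　　Candidate detection:

　　A candidate detection $d_n$ is obtained as $d_n = f_d(h_n; \theta_d)$, where $f_d$ is a fully connected layer. $d_n$ is a tuple $(s_n, e_n, c_n) \in [0, 1]^3$, where $s_n$ and $e_n$ are the normalized start and end locations of the detection and $c_n$ is its confidence. The candidate detection represents the agent's current hypothesis about the action instance around its present location. However, a candidate is not emitted as a prediction at every timestep, since that would produce a lot of noise and many false positives. Instead, the agent uses a separate prediction indicator output to signal when a candidate detection should be emitted as a prediction.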

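　　Prediction indicator:

　　The binary prediction indicator $p_n$ signals whether the corresponding candidate detection $d_n$ should be emitted as a prediction. $p_n = f_p(h_n; \theta_p)$, where $f_p$ is a fully connected layer followed by a sigmoid nonlinearity. During training, $f_p$ is used to parameterize a Bernoulli distribution from which $p_n$ is sampled; at test time, the maximum a posteriori estimate is used.

　　The combination of candidate detection and prediction indicator is crucial for the detection task, because positive instances may occur anywhere in the video or not at all. This ensures that the network can ...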

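　　Location of next observation:

　　The temporal location $l_{n+1} \in [0, 1]$ is the video frame the agent chooses to observe next. The location is unconstrained, so the agent can skip around the video freely.

　　The location is computed as $l_{n+1} = f_l(h_n; \theta_l)$, where $f_l$ is a fully connected layer, so the agent's decision is a function of its history of past observations and their temporal locations. During training, $l_{n+1}$ is sampled from a Gaussian distribution; at test time, the maximum a posteriori estimate is used.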
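　　The three outputs described above can be sketched as a single agent step. This is only an illustration under assumptions that are not in the notes: a plain tanh recurrence stands in for $f_h$, a sigmoid keeps $d_n$ and the location mean inside $[0, 1]$, the Gaussian uses a fixed standard deviation `loc_std`, sampled locations are clipped to $[0, 1]$, and all sizes and weights (`obs_dim`, `hidden_dim`, `W_h`, ...) are made up.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GlimpseAgentSketch:
    """Toy stand-in for the recurrent agent; sizes and weights are illustrative only."""

    def __init__(self, obs_dim=8, hidden_dim=16, loc_std=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.loc_std = loc_std  # assumed fixed std of the location Gaussian
        scale = 0.1
        self.W_h = self.rng.normal(0.0, scale, (hidden_dim, hidden_dim))
        self.W_o = self.rng.normal(0.0, scale, (hidden_dim, obs_dim))
        self.W_d = self.rng.normal(0.0, scale, (3, hidden_dim))  # f_d
        self.w_p = self.rng.normal(0.0, scale, hidden_dim)       # f_p
        self.w_l = self.rng.normal(0.0, scale, hidden_dim)       # f_l

    def step(self, o_n, h_prev, train=True):
        # Hidden state from the current observation and the previous state.
        h_n = np.tanh(self.W_h @ h_prev + self.W_o @ o_n)
        # Candidate detection (s_n, e_n, c_n), squashed into [0, 1]^3.
        d_n = sigmoid(self.W_d @ h_n)
        emit_prob = sigmoid(self.w_p @ h_n)
        loc_mean = sigmoid(self.w_l @ h_n)
        if train:
            # Training: sample the two non-differentiable decisions.
            p_n = int(self.rng.random() < emit_prob)
            l_next = float(np.clip(self.rng.normal(loc_mean, self.loc_std), 0.0, 1.0))
        else:
            # Test: take the most likely value (MAP estimate).
            p_n = int(emit_prob >= 0.5)
            l_next = float(loc_mean)
        return h_n, d_n, p_n, l_next
```

　　Calling `step` repeatedly, feeding each returned `l_next` back into whatever produces the next observation vector, gives the observe-then-decide loop; only candidates with `p_n == 1` would be kept as predictions.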

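　　Training:

　　The final goal is to output a set of detected actions. To achieve this, the three outputs at each timestep must be trained: candidate detection, prediction indicator, and next observation location. Given supervision from temporal action annotations in long videos, training them involves the following challenges: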

　　1. suitable loss

　　2. reward function

　　3. handling non-differentiable model components.

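　　REINFORCE is used to train $p_n$ and $l_{n+1}$, and standard supervised learning is used to train $d_n$.

　　1. Candidate detections:

　　Candidate detections are trained with backpropagation to maximize the correctness of each candidate. The goal is to maximize correctness regardless of whether a candidate is ultimately emitted, because the candidates encode the agent's hypotheses about actions. This requires matching each candidate to a ground-truth (gt) instance during training. The matching relies on one observation: at each timestep, the agent should form a hypothesis around the action instance (if any) nearest its current location in the video. This allows a simple and effective matching procedure.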

　　Matching to ground truth.

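　　Given a set of candidate detections $D = \{d_n \mid n = 1, \dots, N\}$ produced by running the RNN for $N$ timesteps, and ground-truth action instances $g_1, \dots, g_M$, each candidate is matched to one ground-truth instance, or to none if $M = 0$.

　　The matching function is defined as follows:

　　In other words, a candidate $d_n$ is matched to the ground-truth instance nearest the agent's location $l_n$ at that timestep.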
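　　A minimal sketch of this nearest-instance matching follows. One way to write the rule is $y_{mn} = 1$ if $m = \arg\min_{i} \mathrm{dist}(l_n, g_i)$ and $0$ otherwise; the symbols $y_{mn}$ and $\mathrm{dist}$ are assumed notation here, and the particular distance used below (zero when $l_n$ lies inside the instance, otherwise the distance to its nearer boundary) is an assumption, as are the function names.

```python
def distance_to_instance(l_n, instance):
    """Assumed distance: 0 inside the instance, else distance to the nearer boundary."""
    start, end = instance
    if start <= l_n <= end:
        return 0.0
    return min(abs(l_n - start), abs(l_n - end))


def match_candidates(locations, gt_instances):
    """For each timestep location l_n, return the index of the matched
    ground-truth instance, or None when the video has no instances (M = 0)."""
    if not gt_instances:
        return [None] * len(locations)
    matches = []
    for l_n in locations:
        dists = [distance_to_instance(l_n, g) for g in gt_instances]
        matches.append(dists.index(min(dists)))  # nearest instance wins
    return matches
```

　　For example, with instances `(0.0, 0.2)` and `(0.7, 0.8)`, an agent located at `0.1` is matched to the first instance and an agent at `0.9` to the second.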

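　　Loss function:

　　Once the candidate detections have been matched to ground-truth instances, a multi-task classification and localization loss is optimized over the set $D$:

　　Here the classification term $L_{cls}(d_n)$ is the standard cross-entropy loss on the detection confidence $c_n$: when $d_n$ is matched to a ground-truth instance, the confidence is pushed toward 1, and otherwise toward 0.

　　The matching function here is not that easy to understand; pay attention to every detail, otherwise it is very easy to get lost.

　　This loss is optimized with backpropagation.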
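　　To make the two terms concrete, here is a rough sketch of such a multi-task loss. The cross-entropy term follows the description above; the localization term is assumed to be a squared error between $(s_n, e_n)$ and the matched instance's bounds, weighted by a hypothetical factor `gamma`, since the notes do not spell it out. `matches` holds, per candidate, the index of its matched instance or `None`.

```python
import math


def detection_loss(candidates, matches, gt_instances, gamma=1.0, eps=1e-7):
    """Sum of a classification and an (assumed squared-error) localization term.

    candidates:   list of (s_n, e_n, c_n) tuples in [0, 1]^3
    matches:      per-candidate index into gt_instances, or None if unmatched
    gt_instances: list of (start, end) tuples
    """
    total = 0.0
    for (s_n, e_n, c_n), m in zip(candidates, matches):
        target = 0.0 if m is None else 1.0
        c = min(max(c_n, eps), 1.0 - eps)  # avoid log(0)
        # Cross-entropy on the confidence c_n.
        total += -(target * math.log(c) + (1.0 - target) * math.log(1.0 - c))
        if m is not None:
            # Localization only applies to matched candidates.
            g_start, g_end = gt_instances[m]
            total += gamma * ((s_n - g_start) ** 2 + (e_n - g_end) ** 2)
    return total
```

　　A matched candidate with perfect bounds and confidence 0.5 contributes only the classification term, $-\log 0.5$.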

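　　2. Observation and emission sequences:

　　The observation location and prediction indicator outputs are non-differentiable, so they cannot be optimized with backpropagation.

　　The authors then give a brief introduction to reinforcement learning and point out why the objective is hard to optimize directly: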

　　　　this leads to a non-trivial optimization problem due to the high-dimensional space of possible action sequences.

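　　REINFORCE addresses this by learning the network parameters from Monte Carlo samples, which give an approximation of the gradient:

　　In practice, a baseline is usually subtracted to reduce the variance: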
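　　For reference, the standard Monte Carlo form of the REINFORCE estimator can be written as below. The notation is assumed rather than taken from these notes: $K$ sampled episodes, $a_n^k$ the action taken at timestep $n$ of episode $k$, $R_n^k$ the cumulative future reward from that timestep, and $b_n^k$ the baseline.

$$
\nabla_\theta J(\theta) \approx \frac{1}{K} \sum_{k=1}^{K} \sum_{n=1}^{N} \nabla_\theta \log \pi_\theta\!\left(a_n^k \mid h_{1:n}^k, a_{1:n-1}^k\right) R_n^k
$$

$$
\nabla_\theta J(\theta) \approx \frac{1}{K} \sum_{k=1}^{K} \sum_{n=1}^{N} \nabla_\theta \log \pi_\theta\!\left(a_n^k \mid h_{1:n}^k, a_{1:n-1}^k\right) \left(R_n^k - b_n^k\right)
$$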

　　REINFORCE learns model parameters according to this approximate gradient. The log-probability $\log \pi_\theta$ of actions leading to high future reward is increased, and that of actions leading to low reward is decreased. Model parameters can then be updated using backpropagation.
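　　As a small numeric illustration of this update, the sketch below estimates the gradient for a Bernoulli policy such as the prediction indicator. It is a toy under stated assumptions: one scalar logit per sampled decision, a constant baseline, and a hypothetical function name; for $\pi = \mathrm{Bernoulli}(\sigma(z))$ the gradient of $\log \pi(a)$ with respect to $z$ is $a - \sigma(z)$.

```python
import numpy as np


def reinforce_grad_bernoulli(logits, actions, returns, baseline=0.0):
    """Monte Carlo estimate of dJ/dlogit for a Bernoulli policy,
    averaged over the sampled decisions."""
    logits = np.asarray(logits, dtype=float)
    actions = np.asarray(actions, dtype=float)
    returns = np.asarray(returns, dtype=float)
    probs = 1.0 / (1.0 + np.exp(-logits))
    # d/dlogit of log Bernoulli(a; sigmoid(logit)) is (a - sigmoid(logit)).
    grad_log_prob = actions - probs
    # Weight each log-probability gradient by its baseline-corrected return.
    return float(np.mean(grad_log_prob * (returns - baseline)))
```

　　With two samples at logit 0, where emitting earned return 1 and not emitting earned 0, the estimate is positive, so a gradient-ascent step raises the probability of emitting.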

　　Reward Function:

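　　Training with REINFORCE requires a suitable reward function. The goal is to learn a policy for the location and prediction indicator outputs that yields action detections with both high recall and high precision. The reward function is therefore designed to favor policies that maximize true positive detections while minimizing false positives:

　　All of the reward is provided at the $N$-th (final) timestep and is 0 for $n < N$, because the policy should be learned to jointly give high overall detection performance. $M$ is the number of ground-truth action instances and $N_p$ is the number of predictions emitted by the agent. $N_+$ is the number of true positive predictions, $N_-$ is the number of false positive predictions, and $R_+$ and $R_-$ are the positive and negative rewards contributed by each such prediction.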

　　A prediction is considered correct if its overlap with a ground truth is both greater than a threshold and higher than that of any other prediction.
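　　A hedged sketch of such an end-of-episode reward is given below. It assumes the reward has the form $N_+ R_+ + N_- R_-$, plus a separate penalty (called `r_miss` here, a hypothetical name) when the video contains instances but nothing was emitted; the temporal IoU as the overlap measure and the default values are also assumptions. The true-positive rule follows the sentence above: the best-overlapping prediction for an instance counts only if its overlap exceeds the threshold.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def final_reward(predictions, gt_instances, iou_threshold=0.5,
                 r_pos=1.0, r_neg=-1.0, r_miss=-1.0):
    """Reward handed out at the last timestep N (0 at all earlier steps)."""
    if gt_instances and not predictions:
        return r_miss  # M > 0 but N_p = 0: nothing was emitted
    true_positive_ids = set()
    for g in gt_instances:
        overlaps = [temporal_iou(p, g) for p in predictions]
        if overlaps:
            best = max(range(len(overlaps)), key=overlaps.__getitem__)
            # Correct only if it beats the threshold and every other prediction.
            if overlaps[best] > iou_threshold:
                true_positive_ids.add(best)
    n_pos = len(true_positive_ids)
    n_neg = len(predictions) - n_pos
    return n_pos * r_pos + n_neg * r_neg
```

　　Because duplicate detections of the same instance count as false positives, this kind of reward discourages the agent from emitting at every timestep.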
