Hierarchical Question-Image Co-Attention for Visual Question Answering

NIPS 2016

Paper: https://arxiv.org/pdf/1606.00061.pdf

Code: https://github.com/jiasenlu/HieCoAttenVQA

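Introduction:

This paper proposes a new notion of co-attention that reasons jointly over image and text features, so that the features of the two modalities can guide each other.

In addition, the authors weight the input question at several levels, building image-question co-attention maps at three levels of a hierarchy: word level, phrase level, and question level.

Finally, at the phrase level, we propose a novel convolution-pooling strategy that adaptively selects the phrase size.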

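Methods:

1. Notation:

The question is Q = {q₁, ... , q_T}, where q_t is the feature vector of the t-th word. We use q_t^w, q_t^p and q_t^s to denote the word embedding, phrase embedding and question embedding at position t, respectively.

The image features are denoted V = {v₁, ... , v_N}, where v_n is the feature vector at spatial location n.

The co-attention features of the image and the question at each level of the hierarchy are denoted v̂ and q̂.

The weights of the different modules and layers are denoted W.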

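2. Question Hierarchy:

Given the 1-hot encoding of the question words Q, we first embed the words into a vector space to obtain Q^w. To compute the phrase features, we apply 1-D convolution to the word embedding vectors. Specifically, at each word location we compute the inner product of the word vectors with filters of three window sizes: unigram, bigram and trigram. For the t-th word, the convolution output with window size s is:

q̂^p_{s,t} = tanh(W_c^s q^w_{t:t+s-1}),  s ∈ {1, 2, 3}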

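where W_c^s is the weight parameter. The word-level features Q^w are appropriately 0-padded before being fed into the bigram and trigram convolutions, so that the sequence length is preserved after convolution. Given the convolution results, we then apply max-pooling across the different n-grams at each word location to obtain the phrase-level features:

q_t^p = max(q̂^p_{1,t}, q̂^p_{2,t}, q̂^p_{3,t}),  t ∈ {1, 2, ... , T}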
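
The convolution-pooling step can be sketched in NumPy as below. This is a minimal illustration, not the authors' implementation: the function name `phrase_features`, the filter shapes, and the choice to zero-pad on the right (so that window t covers words t to t+s-1) are assumptions made for the example.

```python
import numpy as np

def phrase_features(Qw, W1, W2, W3):
    """Qw: (T, d) word embeddings.
    W1, W2, W3: hypothetical filters of shape (s*d, d) for s = 1, 2, 3.
    Returns (T, d) phrase-level features."""
    T, d = Qw.shape
    outputs = []
    for s, W in ((1, W1), (2, W2), (3, W3)):
        # Assumption: right-pad with s-1 zero vectors so each position t
        # has a full window and the output keeps length T.
        padded = np.vstack([Qw, np.zeros((s - 1, d))])
        # Window t is the concatenation of embeddings t .. t+s-1.
        windows = np.stack([padded[t:t + s].reshape(-1) for t in range(T)])
        outputs.append(np.tanh(windows @ W))  # (T, d)
    # Element-wise max over the unigram, bigram and trigram outputs.
    return np.max(np.stack(outputs), axis=0)

# Toy usage with random weights.
rng = np.random.default_rng(0)
T, d = 5, 4
Qw = rng.standard_normal((T, d))
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((2 * d, d))
W3 = rng.standard_normal((3 * d, d))
Qp = phrase_features(Qw, W1, W2, W3)
print(Qp.shape)  # (5, 4)
```

Because the max is taken per position and per feature dimension, different dimensions at the same word can come from different n-gram sizes, which is what lets the pooling select gram features adaptively.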

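Unlike the pooling methods used in previous work, ours adaptively selects different gram features at each time step while preserving the length and order of the original sequence. We use an LSTM to encode the sequence obtained after max-pooling. The corresponding question-level feature q_t^s is the LSTM hidden vector at time step t.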

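3. Co-Attention:

We propose two co-attention mechanisms. The first, parallel co-attention, generates the image and question attention simultaneously. The second, alternating co-attention, generates the image and question attention sequentially. As shown in Figure 2, these co-attention mechanisms can be applied at every level of the question hierarchy.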

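[Parallel Co-Attention] This mechanism attends to the image and the question simultaneously. We compute the similarity between the image and question features at all pairs of image locations and question locations. Specifically, given an image feature map V ∈ R^{d×N} and the question representation Q ∈ R^{d×T}, the affinity matrix C ∈ R^{T×N} is computed as:

C = tanh(Q^T W_b V)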

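where W_b ∈ R^{d×d} contains the weights. After computing the affinity matrix, one possible way to compute the image (or question) attention is to simply maximize out the affinity over the locations of the other modality, i.e.

a^v[n] = max_i(C_{i,n}),  a^q[t] = max_j(C_{t,j})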

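Instead of choosing the max activation, we find that the final results improve if we treat the affinity matrix as a feature and learn to predict the image and question attention maps:

H^v = tanh(W_v V + (W_q Q) C),  H^q = tanh(W_q Q + (W_v V) C^T)
a^v = softmax(w_hv^T H^v),  a^q = softmax(w_hq^T H^q)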

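where W_v, W_q, w_hv and w_hq are weight parameters, and a^v and a^q are the attention probabilities of each image region v_n and each word q_t. The affinity matrix C transforms the question attention space into the image attention space (and C^T does the reverse). Based on these attention weights, the image and question attention vectors are computed as weighted sums of the image features and the question features:

v̂ = Σ_{n=1}^{N} a^v_n v_n,  q̂ = Σ_{t=1}^{T} a^q_t q_t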
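
Parallel co-attention can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the released code: it uses the paper's formulation (C = tanh(Q^T W_b V), then learned attention maps through H^v and H^q), and the function names, the column-per-location layout, and the toy sizes are chosen for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def parallel_coattention(V, Q, Wb, Wv, Wq, w_hv, w_hq):
    """V: (d, N) image features, one column per location.
    Q: (d, T) question features, one column per word.
    Wb: (d, d); Wv, Wq: (k, d); w_hv, w_hq: (k,)."""
    C = np.tanh(Q.T @ Wb @ V)                # (T, N) affinity matrix
    Hv = np.tanh(Wv @ V + (Wq @ Q) @ C)      # (k, N)
    Hq = np.tanh(Wq @ Q + (Wv @ V) @ C.T)    # (k, T)
    a_v = softmax(w_hv @ Hv)                 # (N,) image attention
    a_q = softmax(w_hq @ Hq)                 # (T,) question attention
    v_hat = V @ a_v                          # (d,) attended image vector
    q_hat = Q @ a_q                          # (d,) attended question vector
    return v_hat, q_hat, a_v, a_q

# Toy usage with random features and weights.
rng = np.random.default_rng(0)
d, k, N, T = 6, 3, 7, 4
V = rng.standard_normal((d, N))
Q = rng.standard_normal((d, T))
Wb = rng.standard_normal((d, d))
Wv = rng.standard_normal((k, d))
Wq = rng.standard_normal((k, d))
w_hv = rng.standard_normal(k)
w_hq = rng.standard_normal(k)
v_hat, q_hat, a_v, a_q = parallel_coattention(V, Q, Wb, Wv, Wq, w_hv, w_hq)
print(a_v.shape, a_q.shape)  # (7,) (4,)
```

The simpler max-based alternative mentioned in the text would correspond to taking `C.max(axis=0)` for the image and `C.max(axis=1)` for the question instead of the learned maps.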

[Alternating Co-Attention] This mechanism generates the two attentions step by step. Briefly, it consists of three steps:

1) summarize the question into a single vector q;

2) attend to the image based on the question summary q;

3) attend to the question based on the attended image feature.

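We define an attention operation x̂ = A(X; g), which takes the image (or question) features X and an attention guidance g derived from the question (or image) as inputs, and outputs the attended image (or question) vector. The operation can be expressed as:

H = tanh(W_x X + (W_g g) 𝟙^T)
a^x = softmax(w_hx^T H)
x̂ = Σ_i a^x_i x_i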

where 𝟙 is a vector whose elements are all 1.
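
The three alternating steps can be sketched with a single reusable attention function. This is a hedged illustration that follows the paper's description, where the first step uses a zero guidance vector. The names `attend` and `alternating_coattention`, the `params` dictionary, and the use of a separate weight set per step are assumptions of this example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(X, g, Wx, Wg, w_hx):
    """Attention operation x_hat = A(X; g).
    X: (d, M) features, g: (d,) guidance, Wx, Wg: (k, d), w_hx: (k,)."""
    # Broadcasting the (k, 1) column plays the role of (W_g g) 1^T.
    H = np.tanh(Wx @ X + (Wg @ g)[:, None])  # (k, M)
    a = softmax(w_hx @ H)                    # (M,) attention weights
    return X @ a, a                          # attended vector, weights

def alternating_coattention(V, Q, params):
    """params: hypothetical dict mapping 'q1', 'v', 'q2' to (Wx, Wg, w_hx)."""
    d = Q.shape[0]
    # Step 1: summarize the question with no guidance (g = 0).
    s_hat, _ = attend(Q, np.zeros(d), *params["q1"])
    # Step 2: attend to the image, guided by the question summary.
    v_hat, a_v = attend(V, s_hat, *params["v"])
    # Step 3: attend to the question, guided by the attended image vector.
    q_hat, a_q = attend(Q, v_hat, *params["q2"])
    return v_hat, q_hat, a_v, a_q

# Toy usage with random features and weights.
rng = np.random.default_rng(0)
d, k, N, T = 6, 3, 7, 4
V = rng.standard_normal((d, N))
Q = rng.standard_normal((d, T))
params = {
    name: (rng.standard_normal((k, d)),
           rng.standard_normal((k, d)),
           rng.standard_normal(k))
    for name in ("q1", "v", "q2")
}
v_hat, q_hat, a_v, a_q = alternating_coattention(V, Q, params)
print(v_hat.shape, q_hat.shape)  # (6,) (6,)
```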

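4. Encoding for Predicting Answers:

We treat VQA as a classification task and predict the answer from the attended image and question features at all three levels. We use a multi-layer perceptron (MLP) to recursively encode the attention features:

h^w = tanh(W_w (q̂^w + v̂^w))
h^p = tanh(W_p [(q̂^p + v̂^p), h^w])
h^s = tanh(W_s [(q̂^s + v̂^s), h^p])
p = softmax(W_h h^s)

where [·] denotes concatenation and p is the probability of the final answer.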
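
The recursive encoding can be sketched as below. This is a minimal sketch assuming the paper's form, in which each level sums its attended question and image vectors and concatenates the result with the previous level's hidden state. The function name `predict_answer`, the dictionary layout of `feats`, and the layer sizes are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict_answer(feats, Ww, Wp, Ws, Wh):
    """feats: hypothetical dict mapping 'w', 'p', 's' to (q_hat, v_hat),
    each of shape (d,). Ww: (h, d); Wp, Ws: (h, d + h); Wh: (A, h).
    Returns a probability vector over A candidate answers."""
    qw, vw = feats["w"]
    qp, vp = feats["p"]
    qs, vs = feats["s"]
    h_w = np.tanh(Ww @ (qw + vw))                       # word level
    h_p = np.tanh(Wp @ np.concatenate([qp + vp, h_w]))  # phrase level
    h_s = np.tanh(Ws @ np.concatenate([qs + vs, h_p]))  # question level
    return softmax(Wh @ h_s)

# Toy usage with random attended features and weights.
rng = np.random.default_rng(0)
d, h, A = 6, 5, 10
feats = {lvl: (rng.standard_normal(d), rng.standard_normal(d))
         for lvl in ("w", "p", "s")}
Ww = rng.standard_normal((h, d))
Wp = rng.standard_normal((h, d + h))
Ws = rng.standard_normal((h, d + h))
Wh = rng.standard_normal((A, h))
p = predict_answer(feats, Ww, Wp, Ws, Wh)
print(p.shape)  # (10,)
```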

Experiments:
