The key mechanism of transformer-based models is cross-attention, which implicitly forms a graph over tokens and acts as a diffusion operator that propagates information through that graph, supporting question answering that requires some reasoning over the scene.
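
To make the "attention as a graph" reading concrete, here is a minimal sketch in plain Python. It makes several simplifying assumptions that are not in the source: a single head, no learned query/key/value projections, and the same small token set used as both queries and keys. The names `attention_graph` and `diffuse` are hypothetical. The softmax-normalized score matrix is treated as a dense weighted adjacency matrix, and one attention step is a weighted average over neighbours.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_graph(queries, keys):
    """Return a row-stochastic matrix; entry [i][j] is the weight of edge j -> i."""
    d = len(queries[0])
    scores = [
        [sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in keys]
        for qi in queries
    ]
    return [softmax(row) for row in scores]


def diffuse(adj, values):
    """One propagation step: each token becomes a weighted average of all tokens."""
    dim = len(values[0])
    return [
        [sum(a * v[c] for a, v in zip(row, values)) for c in range(dim)]
        for row in adj
    ]


# Three toy token embeddings (illustrative values, not from the source).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = attention_graph(tokens, tokens)   # implicit graph over tokens
updated = diffuse(adj, tokens)          # information propagated along its edges
```

Because every row of `adj` sums to one, stacking layers repeatedly averages token features over the graph, which is the sense in which attention behaves like a diffusion operator.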

We reinterpret and reformulate the transformer-based model to explicitly construct latent graphs over tokens, thereby supporting improved performance on visual questions about relations between objects.

Coincidentally, transformer-based language encoders not only take advantage of the tokenization trend but are also intrinsically built for information fusion and alignment, owing to their core self-attention mechanism.

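The success of transformer-based VQA systems supports the validity of two insights: image tokenization, and pairwise token interactions between text tokens and image tokens.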

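We observe that the pairwise token interactions collectively form a graph, and that traversing this graph constitutes a form of reasoning, which may explain the reasoning abilities claimed for these transformer-based models.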

We reinterpret transformer-based VQA systems as graph convolutions.

We show that our model benefits from its latent graph representations.

To the best of our knowledge, current transformer-based models cannot benefit from graph information, and there has not been work on taking advantage of scene graphs, or graph representations in general, for VQA.

In our model, the goal is to learn to generate a latent graph representation and then perform node classification on the resulting heterogeneous graph.

A typical task for a GCN is node classification, since a GCN can learn node representations from a given static homogeneous graph.
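
As a rough illustration of node classification with a GCN, the sketch below applies one graph-convolution step in the commonly used form `D^{-1/2} (A + I) D^{-1/2} H W` to a four-node path graph. The graph, the features, and the hand-set identity weight matrix are all assumptions chosen for illustration; nothing here is trained, and it is not the model from the paper.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Add self-loops, then apply symmetric degree normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj_norm, features, weight, relu=True):
    """Aggregate neighbour features, then apply a linear map and optional ReLU."""
    out = matmul(matmul(adj_norm, features), weight)
    if relu:
        out = [[max(0.0, x) for x in row] for row in out]
    return out


def predict(logits):
    """Assign each node the class with the largest logit."""
    return [row.index(max(row)) for row in logits]


# Static homogeneous graph: an undirected path 0 - 1 - 2 - 3.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # hand-set for illustration, not learned

adj_norm = normalize_adjacency(adj)
logits = gcn_layer(adj_norm, features, weight, relu=False)
labels = predict(logits)
```

The adjacency matrix is fixed input here, which is the "given static homogeneous graph" setting; a latent-graph model would have to produce that matrix itself.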

Graph Transformer Networks (GTN) are a model for handling heterogeneous graphs (graphs with various types of edges) and for generating new graphs.
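
One way to picture how a GTN-style layer can generate a new graph from a heterogeneous one is sketched below, under stated assumptions. Each edge type gets its own adjacency matrix, a softmax over per-type scores softly selects a mixture of them, and multiplying two such mixtures yields a graph whose edges correspond to two-hop paths across edge types. The helper names `soft_select` and `compose` and the fixed scores are hypothetical; in a real model the scores would be learned, and details such as the identity matrix that GTN also includes among its candidates are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def soft_select(adjacencies, scores):
    """Convex combination of per-edge-type adjacency matrices."""
    weights = softmax(scores)
    n = len(adjacencies[0])
    return [
        [sum(w * a[i][j] for w, a in zip(weights, adjacencies)) for j in range(n)]
        for i in range(n)
    ]


def compose(a, b):
    """Matrix product: entry [i][j] accumulates weight of paths i -> k -> j."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Two edge types over three nodes (illustrative values).
type_a = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # edge 0 -> 1
type_b = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]  # edge 1 -> 2

first_hop = soft_select([type_a, type_b], [10.0, -10.0])   # mostly type_a
second_hop = soft_select([type_a, type_b], [-10.0, 10.0])  # mostly type_b
meta_path_graph = compose(first_hop, second_hop)           # new edge 0 -> 2
```

The resulting matrix contains an edge from node 0 to node 2 that exists in neither input graph, which is the sense in which new graph structure is generated rather than given.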

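The question addressed is how to exploit scene graphs and graph representations, together with a graph-convolution view of the transformer mechanism, for VQA.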

