Paper Reading - Deep Visual-Semantic Alignments for Generating Image Descriptions ( CVPR 2015 )

Link to the Paper: https://arxiv.org/abs/1412.2306

Main Points:

  1. An Alignment Model: Convolutional Neural Networks over image regions ( an image -> RCNN -> the top 19 detected locations plus the whole image -> representations computed from the pixels I_b inside each bounding box -> a set of h-dimensional vectors {v_i | i = 1 ... 20} ), Bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding ( CNN - Structured Objective - BiRNN ).
  2. A Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. Its building blocks: on the image side, Convolutional Neural Networks ( CNNs ) have recently emerged as a powerful class of models for image classification and object detection; on the sentence side, the work takes advantage of pretrained word vectors to obtain low-dimensional representations of words; finally, Recurrent Neural Networks have previously been used for language modeling, and the authors additionally condition these models on images.
  3. The authors use a bidirectional recurrent neural network to compute word representations in the sentence, dispensing with the need to compute dependency trees and allowing unbounded interactions between words and their context in the sentence.
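The structured objective in point 1 can be sketched in a few lines of pure Python. This is a toy illustration, not the authors' code: it assumes the region vectors v_i and word vectors s_t already live in the shared h-dimensional embedding, scores an image k against a sentence l as S_kl = sum_t max_i v_i . s_t (each word matched to its best region), and applies a max-margin ranking loss in both directions with margin 1, which is how I read the paper's objective. The function names are hypothetical.

```python
# Hypothetical toy sketch of the image-sentence score and the bidirectional
# ranking objective (plain lists as vectors, no training loop).


def dot(a, b):
    # Inner product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))


def image_sentence_score(regions, words):
    # S_kl = sum over words t of max over regions i of v_i . s_t:
    # each word is scored only against its best-matching region.
    return sum(max(dot(v, s) for v in regions) for s in words)


def ranking_loss(images, sentences, margin=1.0):
    # images[k] is a list of region vectors v_i, sentences[k] a list of word
    # vectors s_t, and (images[k], sentences[k]) is a ground-truth pair.
    n = len(images)
    S = [[image_sentence_score(images[k], sentences[l]) for l in range(n)]
         for k in range(n)]
    loss = 0.0
    for k in range(n):
        for l in range(n):
            if l == k:
                continue  # l == k terms only add a constant 2 * margin per k
            # Rank sentences given image k ...
            loss += max(0.0, S[k][l] - S[k][k] + margin)
            # ... and images given sentence k.
            loss += max(0.0, S[l][k] - S[k][k] + margin)
    return loss, S
```

When every true pair outscores every mismatched pair by at least the margin, the loss is zero; shuffling the sentences against the images makes it positive.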
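Point 3 can be illustrated with a small bidirectional RNN forward pass. This is a sketch under stated assumptions, not the authors' implementation: the word vectors x_t are taken as already embedded, all activations are ReLU, and the forward and backward hidden states are summed before a final projection, which is my reading of the paper's BRNN equations. The weight names (W_e, W_f, W_b, W_d and their biases) are illustrative.

```python
# Hypothetical toy sketch of a bidirectional RNN over embedded word vectors.


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(M, v):
    # Matrix (list of rows) times vector.
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add(*vs):
    # Element-wise sum of equal-length vectors.
    return [sum(xs) for xs in zip(*vs)]


def brnn(xs, W_e, b_e, W_f, b_f, W_b, b_b, W_d, b_d):
    n, hidden = len(xs), len(b_f)
    # e_t = relu(W_e x_t + b_e): per-word input projection
    e = [relu(add(matvec(W_e, x), b_e)) for x in xs]
    # Forward pass: h^f_t = relu(e_t + W_f h^f_{t-1} + b_f)
    h_f, prev = [], [0.0] * hidden
    for t in range(n):
        prev = relu(add(e[t], matvec(W_f, prev), b_f))
        h_f.append(prev)
    # Backward pass: h^b_t = relu(e_t + W_b h^b_{t+1} + b_b)
    h_b, nxt = [None] * n, [0.0] * hidden
    for t in reversed(range(n)):
        nxt = relu(add(e[t], matvec(W_b, nxt), b_b))
        h_b[t] = nxt
    # s_t = relu(W_d (h^f_t + h^b_t) + b_d): context-aware word vector
    return [relu(add(matvec(W_d, add(h_f[t], h_b[t])), b_d)) for t in range(n)]
```

No parse tree is involved: each s_t can depend on every other word through the two recurrences, which is the point made above about unbounded word-context interactions.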
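For point 2, the idea of conditioning an RNN language model on an image can be sketched as follows. Assumptions, loudly: the image enters as a precomputed bias vector (standing in for a projection of CNN features, b_v = W_hi . CNN(I)) that is added to the hidden pre-activation only at the first time step, and words are then decoded greedily until an END token. The names greedy_caption, START and END are hypothetical, and the paper's training and decoding details are not reproduced.

```python
import math

# Hypothetical toy sketch: an RNN language model conditioned on an image
# through a bias vector injected at the first step only.


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(M, v):
    # Matrix (list of rows) times vector.
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add(*vs):
    # Element-wise sum of equal-length vectors.
    return [sum(xs) for xs in zip(*vs)]


def softmax(z):
    # Numerically stable softmax.
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def mrnn_step(x, h_prev, image_bias, p, first):
    # h_t = relu(W_hx x_t + W_hh h_{t-1} + b_h [+ b_v at the first step])
    pre = add(matvec(p["W_hx"], x), matvec(p["W_hh"], h_prev), p["b_h"])
    if first:
        pre = add(pre, image_bias)  # image context enters only once
    h = relu(pre)
    # y_t = softmax(W_oh h_t + b_o): distribution over the vocabulary
    y = softmax(add(matvec(p["W_oh"], h), p["b_o"]))
    return h, y


def greedy_caption(image_bias, embed, vocab, p, max_len=10):
    # Feed START, then repeatedly feed back the most probable word.
    h = [0.0] * len(p["b_h"])
    token = "START"
    words = []
    for t in range(max_len):
        # t == 0 here corresponds to the first time step of the sequence.
        h, y = mrnn_step(embed[token], h, image_bias, p, first=(t == 0))
        token = vocab[max(range(len(y)), key=lambda i: y[i])]
        if token == "END":
            break
        words.append(token)
    return words
```

With identical text weights, two different image bias vectors can lead to different captions, which is the sense in which the language model is "conditioned on images".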

Other Key Points:

  1. The primary challenge in generating descriptions of images is designing a model rich enough to reason simultaneously about the contents of images and their representation in the domain of natural language. Additionally, the model should be free of assumptions about specific hard-coded templates, rules or categories, and should instead rely on learning from the training data. The second, practical challenge is that although datasets of image captions are available in large quantities on the internet, these descriptions multiplex mentions of several entities whose locations in the images are unknown.
