Link of the Paper: https://arxiv.org/pdf/1412.6632.pdf

Main Points:

  1. The authors propose a multimodal Recurrent Neural Networks ( AlexNet/VGGNet + a multimodal layer + RNNs ). Their work has two major differences from these methods. Firstly, they incorporate a two-layer word embedding system in the m-RNN network structure which learns the word representation more efficiently than the single-layer word embedding. Secondly, they do not use the recurrent layer to store the visual information. The image representation is inputted to the m-RNN model along with every word in the sentence description.
  2. Most of the sentence-image multimodal models use pre-computed word embedding vectors as the initialization of their models. In contrast, the authors randomly initialize their word embedding layers and learn them from the training data.
  3. The m-RNN model is trained using a log-likelihood cost function. The errors can be backpropagated to the three parts ( the vision part, the language part, and the  ) of the m-RNN model to update the model parameters simultaneously.
  4. The hyperparameters, such as layer dimensions and the choice of the non-linear activation functions, are tuned via cross-validation on Flickr8K dataset and are then fixed across all the experiments.

Other Key Points:

  1. Applications for Image Captioning: early childhood education, image retrieval, and navigation for the blind.
  2. There are generally three categories of methods for generating novel sentence descriptions for images. The first category assumes a specific rule of the language grammer. They parse the sentence and divide it into several parts. This kind of method generates sentences that are syntactically correct. The second category retrieves similar captioned images, and generates new descriptions by generalizing and re-composing the retrieved captions. The third category of methods, which is more related to our method, learns a probability density over the space of multimodal inputs, using for example, Deep Boltzmann Machines, and topic models. They generate sentences with richer and more flexible structure than the first group. The probability of generating sentences using the model can serve as the affinity metric for retrieval.
  3. Many previous methods treat the task of describing images as a retrieval task and formulate the problem as a ranking or embedding learning problem. They first extract the word and sentence features ( e.g. Socher et al.(2014) uses dependency tree Recursive Neural Network to extract sentence features ) as well as the image features. Then they optimize a ranking cost to learn an embedding model that maps both the sentence feature and the image feature to a common semantic feature space ( the same semantic space ). In this way, they can directly calculate the distance between images and sentences. These methods genarate image captions by retrieving them from a sentence database. Thus, they lack the ability of generating novel sentences or describing images that contain novel combinations of objects and scenes.
  4. Benchmark datasets for Image Captioning: IAPR TC-12 ( Grubinger et al.(2006) ), Flickr8K ( Rashtchian et al.(2010) ), Flickr30K ( Young et al.(2014) ) and MS COCO ( Lin et al.(2014) ).
  5. Evaluation Metrics for Sentence Generation: Sentence perplexity and BLUE scores.
  6. Tasks related to Image Captioning: Generating Novel Sentences, Retrieving Images Given a Sentence, Retrieving Sentences Given an Image.
  7. The m-RNN model is trained using Baidu's internal deep learning platform PADDLE.

Paper Reading - Deep Captioning with Multimodal Recurrent Neural Networks ( m-RNN ) ( ICLR 2015 ) ★的更多相关文章

  1. Paper Reading - Sequence to Sequence Learning with Neural Networks ( NIPS 2014 )

    Link of the Paper: https://arxiv.org/pdf/1409.3215.pdf Main Points: Encoder-Decoder Model: Input seq ...

  2. 递归神经网络(Recurrent Neural Networks,RNN)

    在深度学习领域,传统的多层感知机(MLP)具有出色的表现,取得了许多成功,它曾在许多不同的任务上——包括手写数字识别和目标分类上创造了记录.甚至到了今天,MLP在解决分类任务上始终都比其他方法要略胜一 ...

  3. Paper Reading - Deep Visual-Semantic Alignments for Generating Image Descriptions ( CVPR 2015 )

    Link of the Paper: https://arxiv.org/abs/1412.2306 Main Points: An Alignment Model: Convolutional Ne ...

  4. Paper Reading - Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation ( CVPR 2015 )

    Link of the Paper: https://ieeexplore.ieee.org/document/7298856/ A Correlative Paper: Learning a Rec ...

  5. Hyperspectral Image Classification Using Similarity Measurements-Based Deep Recurrent Neural Networks

    用RNN来做像素分类,输入是一系列相近的像素,长度人为指定为l,相近是利用像素相似度或是范围相似度得到的,计算个欧氏距离或是SAM. 数据是两个高光谱数据 1.Pavia University,Ref ...

  6. Attention and Augmented Recurrent Neural Networks

    Attention and Augmented Recurrent Neural Networks CHRIS OLAHGoogle Brain SHAN CARTERGoogle Brain Sep ...

  7. The Unreasonable Effectiveness of Recurrent Neural Networks (RNN)

    http://karpathy.github.io/2015/05/21/rnn-effectiveness/ There’s something magical about Recurrent Ne ...

  8. 课程五(Sequence Models),第一 周(Recurrent Neural Networks) —— 1.Programming assignments:Building a recurrent neural network - step by step

    Building your Recurrent Neural Network - Step by Step Welcome to Course 5's first assignment! In thi ...

  9. (zhuan) Attention in Long Short-Term Memory Recurrent Neural Networks

    Attention in Long Short-Term Memory Recurrent Neural Networks by Jason Brownlee on June 30, 2017 in  ...

随机推荐

  1. docker 安装 MySQL 8.0

    1.查找镜像 查找Docker Hub上的mysql镜像 X230:~$ docker search mysql NAME DESCRIPTION STARS OFFICIAL AUTOMATED m ...

  2. com.alibaba.druid检测排查数据库连接数不释放定位代码

    1.可能标题说的很不明白,其实就是这样一个情况,一个工程项目错误日志出现GetConnectionTimeoutException: wait millis 90000, active 22000的异 ...

  3. Spring MVC中如何解决POST请求中文乱码问题,GET的又如何处理呢

    在web.xml中配置过滤器 GET请求乱码解决: 在Tomcat中service.xml中

  4. Mysql完全卸载(Windows版本)

    (1)控制面板 ---> 程序和功能 ---> 卸载MySQL Installer: (2)删除MySQL软件安装路径下的MySQL目录,默认目录为 C:\Program Files (x ...

  5. php向页面输出中文时出现乱码的解决方法

    今天,刚刚学习PHP发现用echo输出中文时,页面会出现乱码,然后查了一下资料说是浏览器编码格式有问题,要改成utf-8.但是每个人的浏览器编码可能会有所不同,所以找到了一个很好的解决方法, 就是在p ...

  6. ETL项目1:大数据采集,清洗,处理:使用MapReduce进行离线数据分析完整项目

    ETL项目1:大数据采集,清洗,处理:使用MapReduce进行离线数据分析完整项目 思路分析: 1.1 log日志生成 用curl模拟请求,nginx反向代理80端口来生成日志. #! /bin/b ...

  7. 大数据学习:Spark是什么,如何用Spark进行数据分析

    给大家分享一下Spark是什么?如何用Spark进行数据分析,对大数据感兴趣的小伙伴就随着小编一起来了解一下吧.     大数据在线学习 什么是Apache Spark? Apache Spark是一 ...

  8. python学习笔记:第9天 函数初步

    1. 函数的定义及调用 函数:所谓的函数可以看作是对一段代码的封装,也是对一个功能模块的封装,这样方便在下次想用这个功能的时候直接调用这个功能模块,而不用重新去写. 函数的定义:我们使用def关键字来 ...

  9. Laravel框架定时任务2种实现方式示例

    本文实例讲述了Laravel框架定时任务2种实现方式.分享给大家供大家参考,具体如下: 第一种 1.生成一个commands文件 > php artisan make:command test ...

  10. c语言实验报告(四) 从键盘输入字符串a和字符串b,并在a串中的最小元素(不含结束符)后面插入字符串b.

    a串中最小元素后的字符被舍弃了. #include<stdio.h>#include<string.h>void main(){  int i,min=0;  char a[2 ...