[Paper Reading] Show and Tell: A Neural Image Caption Generator

Paper link: https://arxiv.org/pdf/1411.4555.pdf

Code links: https://github.com/karpathy/neuraltalk & https://github.com/karpathy/neuraltalk2 & https://github.com/zsdonghao/Image-Captioning

Main Contributions

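In this paper, the authors borrow from neural machine translation (NMT) and bring the encoder-decoder model to neural image captioning, proposing an end-to-end model for the image captioning problem. The paper illustrates the approach with two figures: the first gives an overview of the NIC model, and the second shows the details of the network. NIC uses a convolutional neural network (CNN) as the encoder and a long short-term memory network (LSTM) as the decoder.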

Experimental Details

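In the paper, the authors use Inception v2, pre-trained on the image classification task, as the encoder, and use the features from its last hidden layer to initialize the decoder: the image feature is fed to the LSTM once, as its first input, before any word. karpathy's neuraltalk code, however, uses a pre-trained VGG16 as the encoder and uses the features from the FC-4096 layer to initialize the LSTM (see neuraltalk/py_caffe_feat_extract.py line160). neuraltalk2 likewise uses VGG16 as the encoder to extract image features (see neuraltalk2/train.lua line27). zsdonghao's TensorFlow implementation of the method uses Inception v3 as the encoder (see zsdonghao/Image-Captioning/inception_v3(for TF 0.10).py).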

Hence, it is natural to use a CNN as an image “encoder”, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences.

An “encoder” RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a “decoder” RNN that generates the target sentence.
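
The encoder-decoder wiring described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code of any of the linked repositories: TinyLSTM, greedy_caption, and all weight names are hypothetical, the weights are random rather than trained, biases are omitted for brevity, a random vector stands in for the CNN feature, and greedy decoding replaces any more elaborate search.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyLSTM:
    """Standard LSTM cell without peepholes or biases; weights are untrained."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix per gate, applied to the concatenation [x; h].
        self.weights = {k: rand_matrix(hidden_dim, input_dim + hidden_dim, rng)
                        for k in ("input", "forget", "output", "cell")}

    def step(self, x, h, c):
        xh = x + h  # list concatenation
        pre = {k: matvec(w, xh) for k, w in self.weights.items()}
        i = [sigmoid(v) for v in pre["input"]]
        f = [sigmoid(v) for v in pre["forget"]]
        o = [sigmoid(v) for v in pre["output"]]
        g = [math.tanh(v) for v in pre["cell"]]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


def greedy_caption(image_feature, lstm, w_image, embeddings, w_out,
                   start_id, end_id, max_len):
    h = [0.0] * lstm.hidden_dim
    c = [0.0] * lstm.hidden_dim
    # First step: project the CNN feature into the word-embedding space and
    # feed it to the LSTM once. Only the resulting state is kept.
    h, c = lstm.step(matvec(w_image, image_feature), h, c)
    word, caption = start_id, []
    for _ in range(max_len):
        h, c = lstm.step(embeddings[word], h, c)
        logits = matvec(w_out, h)
        word = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if word == end_id:
            break
        caption.append(word)
    return caption


rng = random.Random(0)
feat_dim, embed_dim, hidden_dim, vocab_size = 8, 4, 4, 6  # toy sizes
lstm = TinyLSTM(embed_dim, hidden_dim, rng)
w_image = rand_matrix(embed_dim, feat_dim, rng)       # image -> embedding space
embeddings = rand_matrix(vocab_size, embed_dim, rng)  # one row per word id
w_out = rand_matrix(vocab_size, hidden_dim, rng)      # hidden state -> logits
image_feature = [rng.random() for _ in range(feat_dim)]  # stand-in CNN feature

caption = greedy_caption(image_feature, lstm, w_image, embeddings, w_out,
                         start_id=0, end_id=1, max_len=5)
print(caption)  # word ids; meaningless here because the weights are untrained
```

The point of the sketch is the data flow: the image enters the decoder exactly once, before the first word, and every later step sees only the previous word and the LSTM state.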

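In the paper, the authors train the network with stochastic gradient descent (SGD). neuraltalk2 provides several optimizers and their parameters (rmsprop, adagrad, sgd, and others; see neuraltalk2/misc/optim_updates.lua). zsdonghao/Image-Captioning trains the network with SGD, using an initial learning rate of 2.0, a learning rate decay factor of 0.5, and 8.0 epochs per learning rate decay.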

It is a neural net which is fully trainable using stochastic gradient descent.
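
To make the three learning rate settings concrete, here is a minimal sketch of the schedule they imply. It assumes a staircase exponential decay, in which the rate is multiplied by the decay factor once every epochs_per_decay epochs; whether the repository decays in steps or continuously is an assumption, and staircase_learning_rate is a hypothetical helper, not a function from the repository.

```python
def staircase_learning_rate(epoch, initial_lr=2.0, decay_factor=0.5,
                            epochs_per_decay=8.0):
    """Learning rate in effect at a given (possibly fractional) epoch."""
    # Number of completed decay periods so far.
    num_decays = int(epoch // epochs_per_decay)
    return initial_lr * decay_factor ** num_decays


for epoch in (0, 7, 8, 16, 24):
    print(epoch, staircase_learning_rate(epoch))
# 0 2.0
# 7 2.0
# 8 1.0
# 16 0.5
# 24 0.25
```

Under this assumption the rate halves every 8 epochs.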

In the paper, the authors train the model parameters by maximum likelihood. zsdonghao/Image-Captioning uses tensorlayer.cost.cross_entropy_seq_with_mask() for this (see zsdonghao/Image-Captioning/buildmodel.py line665).

The model is trained to maximize the likelihood of the target description sentence given the training image.
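
Maximizing the likelihood of a caption is the same as minimizing the sum of the negative log-probabilities of its words, and a mask keeps padded positions out of that sum. The sketch below shows what a masked sequence cross-entropy computes. It is an illustration, not the TensorLayer implementation: masked_sequence_nll is a hypothetical helper, and averaging over the unmasked steps is an assumption about the normalization.

```python
import math


def masked_sequence_nll(step_probs, target_ids, mask):
    """Average negative log-likelihood over the unmasked time steps.

    step_probs[t][w] is the model's probability of word w at step t,
    target_ids[t] is the ground-truth word id, and mask[t] is 1 for a
    real word and 0 for padding.
    """
    total, count = 0.0, 0
    for probs, target, keep in zip(step_probs, target_ids, mask):
        if keep:
            total += -math.log(probs[target])
            count += 1
    if count == 0:
        raise ValueError("mask selects no time steps")
    return total / count


# Toy caption of two real words followed by one padded position.
step_probs = [
    [0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50],
    [0.90, 0.05, 0.05],  # padding step, ignored by the mask
]
target_ids = [0, 1, 0]
mask = [1, 1, 0]

loss = masked_sequence_nll(step_probs, target_ids, mask)
print(loss)  # equals -(log 0.5 + log 0.25) / 2
```

Lowering this loss raises the probability the model assigns to the reference caption, which is the maximum likelihood objective stated in the paper.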

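In neuraltalk2, the dimension of the LSTM input vectors (the output of the embedding layer) and the dimension of the LSTM hidden state are both set to 512. zsdonghao/Image-Captioning uses the same settings.
In zsdonghao/Image-Captioning, vocabulary_size is set to 12000.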
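
As a rough sense of scale, the settings above fix the size of the decoder. The arithmetic below assumes a standard LSTM with four gates, one bias vector per gate, and no peephole connections, plus a softmax output layer with a bias; these details, and the helper names, are assumptions rather than facts taken from the repositories.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-dimensional vector per vocabulary entry.
    return vocab_size * embed_dim


def lstm_params(input_dim, hidden_dim):
    # Four gates, each with a weight matrix over [x; h] and a bias vector.
    return 4 * ((input_dim + hidden_dim) * hidden_dim + hidden_dim)


def softmax_params(hidden_dim, vocab_size):
    # Projection from the hidden state to vocabulary logits, plus a bias.
    return hidden_dim * vocab_size + vocab_size


vocab_size, embed_dim, hidden_dim = 12000, 512, 512

print(embedding_params(vocab_size, embed_dim))  # 6144000
print(lstm_params(embed_dim, hidden_dim))       # 2099200
print(softmax_params(hidden_dim, vocab_size))   # 6156000
```

Under these assumptions the word embedding and the output layer, both proportional to the vocabulary size, account for most of the decoder's parameters.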
