Key points:

The text-processing pipeline is largely standard and does not differ much from the usual setup; the paper's main contribution is the similarity matrix it introduces.

The computation pipeline:

The query and the document are first passed through word embeddings to obtain their respective representation (sentence) matrices.

A convolutional network then produces a feature map for each, and pooling yields the vector representation Xq of the query and the vector Xd of the document.

Unlike a traditional Siamese network, which at this point would compute the similarity of Xq and Xd directly with a Euclidean or cosine distance and predict from that score, this network uses a similarity matrix to compute sim(Xq, Xd). It then concatenates Xq, sim(Xq, Xd) and Xd, appends word-overlap and IDF-weighted word-overlap features, and feeds the resulting feature vector into a hidden neural network layer.  -- word overlap is a classic way of computing sentence similarity

The output of that hidden layer goes through a fully connected layer, and a softmax produces the final prediction.
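A minimal PyTorch sketch of this pipeline (my own reconstruction, not the authors' code; the class names, hidden-layer size, and initialization of M are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    def __init__(self, emb_dim=50, n_filters=100, filter_width=5):
        super().__init__()
        # wide convolution: pad by filter_width - 1 on each side
        self.conv = nn.Conv1d(emb_dim, n_filters, filter_width,
                              padding=filter_width - 1)

    def forward(self, x):              # x: (batch, emb_dim, sentence_length)
        c = F.relu(self.conv(x))       # (batch, n_filters, |s| + m - 1)
        return c.max(dim=2).values     # max pooling over time -> (batch, n_filters)

class QueryDocumentModel(nn.Module):
    def __init__(self, n_filters=100, n_extra_feats=4, hidden_dim=100):
        super().__init__()
        self.query_cnn = SentenceCNN()
        self.doc_cnn = SentenceCNN()
        # similarity matrix M, learned jointly with the rest of the network
        self.M = nn.Parameter(torch.randn(n_filters, n_filters) * 0.01)
        join_dim = 2 * n_filters + 1 + n_extra_feats
        self.hidden = nn.Linear(join_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, 2)

    def forward(self, query, doc, extra_feats):
        x_q = self.query_cnn(query)
        x_d = self.doc_cnn(doc)
        # bilinear similarity score x_q^T M x_d, one scalar per pair
        sim = torch.einsum('bi,ij,bj->b', x_q, self.M, x_d).unsqueeze(1)
        joint = torch.cat([x_q, sim, x_d, extra_feats], dim=1)
        return F.log_softmax(self.out(torch.relu(self.hidden(joint))), dim=1)
```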

In this task, questions and documents are limited to a single sentence.

The main building blocks of our architecture are two distributional sentence models based on convolutional neural networks. These underlying sentence models work in parallel, mapping queries and documents to their distributional vectors, which are then used to learn the semantic similarity between them.

Our model encodes query-document pairs in a rich representation using not only their similarity score but also their intermediate representations; (iii) the architecture of our network makes it straightforward to include any additional similarity features in the model.

However, their model operates only on unigrams or bigrams, while our architecture learns to extract and compose n-grams of higher degrees, thus allowing for capturing longer-range dependencies. Additionally, our architecture uses not only the intermediate representations of questions and answers to compute their similarity but also includes them in the final representation, which constitutes a much richer representation of the question-answer pairs.

pointwise:
it is enough to train a binary classifier
pairwise:
the model is explicitly trained to score correct pairs higher than
incorrect pairs with a certain margin
Comparison:
it requires considering a larger number of training instances (potentially quadratic in the size of the candidate document set) than the pointwise method, which may lead to slower training. Still, both pointwise and pairwise approaches ignore the fact that ranking is a prediction task on a list of objects.
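A hedged sketch contrasting the two training regimes (the tensors stand in for the model's outputs; this is illustrative, not the paper's code). Pointwise treats each query-document pair as an independent binary example; pairwise enforces a margin between a relevant and a non-relevant document of the same query:

```python
import torch
import torch.nn.functional as F

def pointwise_loss(scores, labels):
    # scores: (batch, 2) class logits for each (query, document) pair;
    # labels: 1 for relevant, 0 for non-relevant
    return F.cross_entropy(scores, labels)

def pairwise_loss(pos_scores, neg_scores, margin=1.0):
    # pos_scores / neg_scores: (batch,) relevance scores for a relevant and a
    # non-relevant document of the same query; hinge loss on the margin
    return torch.clamp(margin - (pos_scores - neg_scores), min=0).mean()

# usage sketch with random values
scores = torch.randn(4, 2)
labels = torch.tensor([1, 0, 0, 1])
print(pointwise_loss(scores, labels))
print(pairwise_loss(torch.randn(4), torch.randn(4)))
```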

Most often, producing a better representation ψ() that encodes various aspects of similarity between the input query-document pairs plays a far more important role in training an accurate reranker than choosing between different ranking approaches. Hence, in this paper we adopt a simple pointwise method for reranking and focus on modelling a rich representation of query-document pairs using deep learning approaches, as described next.

Our network is composed of a single wide convolutional layer followed by a non-linearity and simple max pooling.  -- i.e., a wide convolution

The range of allowed values for i defines two types of convolution: narrow and wide. The narrow type restricts i to be in the range [1, |s| − m + 1], which in turn restricts the filter width to be ≤ |s|. To compute the wide type of convolution, i ranges from 1 to |s|, and no restrictions are placed on the sizes of m and s. The benefits of one type of convolution over the other when dealing with text are discussed in detail in [18]. In short, the wide convolution is able to better handle words at boundaries, giving equal attention to all words in the sentence, unlike the narrow convolution, where words close to the boundaries are seen fewer times. More importantly, wide convolution also guarantees to always yield valid values even when s is shorter than the filter size m.
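A small numpy check of the two output lengths (illustrative toy values): narrow convolution gives |s| − m + 1 values, wide gives |s| + m − 1 and still works when the filter is longer than the sentence.

```python
import numpy as np

s = np.array([1.0, 2.0, 3.0, 4.0])   # a "sentence" of length |s| = 4
f = np.array([1.0, 0.5, 0.25])       # a filter of width m = 3

narrow = np.convolve(s, f, mode='valid')   # length |s| - m + 1 = 2
wide = np.convolve(s, f, mode='full')      # length |s| + m - 1 = 6
print(len(narrow), len(wide))              # 2 6

# With a filter wider than the sentence, only the wide convolution is
# meaningful in the paper's sense:
f_long = np.ones(6)
print(len(np.convolve(s, f_long, mode='full')))   # |s| + m - 1 = 9
```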

It should be noted that an alternative way of computing a convolution was explored in [18], where a series of convolutions is computed between each row of a sentence matrix and the corresponding row of the filter matrix. Essentially, it is a vectorized form of 1d convolution applied between corresponding rows of S and F. As a result, the output feature map is a matrix C ∈ R^{d×(|s|+m−1)}.

Among the most common choices of activation functions are the following: sigmoid (or logistic), hyperbolic tangent tanh, and a rectified linear (ReLU) function defined as simply max(0, x) to ensure that feature maps are always positive.

Both average and max pooling methods exhibit certain disadvantages: in average pooling, all elements of the input are considered, which may weaken strong activation values. This is especially critical with the tanh non-linearity, where strong positive and negative activations can cancel each other out. Max pooling is used more widely and does not suffer from the drawbacks of average pooling. However, as shown in [40], it can lead to strong overfitting on the training set and, hence, poor generalization on the test data.

Recently, max pooling has been generalized to k-max pooling [18], where instead of a single max value, k values are extracted in their original order. This allows for extracting several largest activation values from the input sentence.
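A hedged sketch of k-max pooling over one feature-map row: keep the k largest activations but preserve their original left-to-right order.

```python
import numpy as np

def k_max_pooling(row, k):
    # indices of the k largest values, then re-sorted by position
    idx = np.argsort(row)[-k:]
    return row[np.sort(idx)]

row = np.array([0.1, 0.9, 0.3, 0.8, 0.2])
print(k_max_pooling(row, 2))   # [0.9 0.8] -- original order kept
```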

Our architecture for matching text pairs:
Our sentence models based on ConvNets learn to map input sentences to vectors, which can then be used to compute their similarity. These are then used to compute a query-document similarity score, which together with the query and document vectors are joined in a single representation.

Measuring the similarity between query and document:
In this model, we seek a transformation of the candidate document, x'_d = M x_d, that is the closest to the input query x_q. The similarity matrix M is a parameter of the network and is optimized during training.
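A short sketch of this similarity-matrix layer: M transforms the document vector, and the score is the dot product of x_q with that transformation, i.e. sim(x_q, x_d) = x_q^T M x_d (the dimensionality and initialization are placeholders).

```python
import torch

n = 100                                      # dimensionality of x_q and x_d
M = torch.randn(n, n, requires_grad=True)    # trainable similarity matrix

x_q = torch.randn(n)
x_d = torch.randn(n)

x_d_prime = M @ x_d                # transformed candidate document x'_d
sim = torch.dot(x_q, x_d_prime)    # scalar similarity score x_q^T M x_d
```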

Adagrad scales the learning rate of SGD on each dimension based on the l2 norm of the history of the error gradient. Adadelta uses both the error gradient history like Adagrad and the weight update history. It has the advantage of not having to set a learning rate at all.
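An illustrative Adadelta step for a single parameter vector, following the update rule of [39] (a sketch of the standard formulation, not code from the paper): running averages of squared gradients and of squared updates replace a hand-tuned learning rate.

```python
import numpy as np

def adadelta_step(param, grad, state, rho=0.95, eps=1e-6):
    # accumulate squared gradients
    state['Eg2'] = rho * state['Eg2'] + (1 - rho) * grad ** 2
    # scale the step by the ratio of the two running RMS values
    update = -np.sqrt(state['Edx2'] + eps) / np.sqrt(state['Eg2'] + eps) * grad
    # accumulate squared updates
    state['Edx2'] = rho * state['Edx2'] + (1 - rho) * update ** 2
    return param + update

state = {'Eg2': 0.0, 'Edx2': 0.0}
w = np.array([1.0, -2.0])
w = adadelta_step(w, grad=np.array([0.3, -0.1]), state=state)
```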

Hyperparameter settings:
the width m of the convolution filters is set to 5 and the number of convolutional feature maps is 100. We use the ReLU activation function and simple max-pooling.
--
To train the network we use stochastic gradient descent with shuffled mini-batches. We eliminate the need to tune the learning rate by using the Adadelta update rule [39]. The batch size is set to 50 examples. The network is trained for 25 epochs with early stopping, i.e., we stop the training if no update to the best accuracy on the dev set has been made for the last 5 epochs. The accuracy computed on the dev set is the MAP score. At test time we use the parameters of the network that were obtained with the best MAP score on the development (dev) set, i.e., we compute the MAP score after every 10 mini-batch updates and save the network parameters if a new best dev MAP score is obtained. In practice, the training converges after a few epochs. We set the L2 regularization term to 1e-5 for the parameters of the convolutional layers and 1e-4 for all the others. The dropout rate is set to p = 0.5.
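A hedged sketch of this training configuration in PyTorch, reusing the QueryDocumentModel sketch from earlier in these notes (the exact dropout placement and parameter grouping are assumptions): Adadelta with no manual learning rate, batch size 50, dropout 0.5, and separate L2 strengths for the convolutional layers and everything else.

```python
import torch

model = QueryDocumentModel()   # sketch defined above
conv_params = list(model.query_cnn.parameters()) + list(model.doc_cnn.parameters())
other_params = [p for p in model.parameters()
                if not any(p is q for q in conv_params)]

optimizer = torch.optim.Adadelta([
    {'params': conv_params, 'weight_decay': 1e-5},   # L2 for convolutional layers
    {'params': other_params, 'weight_decay': 1e-4},  # L2 for all other parameters
])
dropout = torch.nn.Dropout(p=0.5)   # assumed to sit before the output layer
batch_size = 50
```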
--
we keep the word embeddings fixed and initialize the word matrix W from an unsupervised neural language model.
--
We choose the dimensionality of our word embeddings to be 50, in line with the deep learning model of [38].
--
Word embeddings. We initialize the word embeddings by running the word2vec tool [20] on the English Wikipedia dump and the AQUAINT corpus, containing roughly 375 million words. To train the embeddings we use the skip-gram model with window size 5, filtering out words with frequency less than 5. The resulting model contains 50-dimensional vectors for about 3.5 million words. Embeddings for words not present in the word2vec model are randomly initialized, with each component sampled from the uniform distribution U[−0.25, 0.25]. We minimally preprocess the data, performing only tokenization and lowercasing of all words. To reduce the size of the resulting vocabulary V, we also replace all digits with 0. The size of the word vocabulary V for experiments using the TRAIN set is 17,023, with approximately 95% of words initialized using word2vec embeddings and the remaining 5% initialized at random as described in Sec.
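A hedged sketch of this initialization with gensim (gensim >= 4.0 assumed; the toy corpus and vocabulary are placeholders for Wikipedia + AQUAINT and the task vocabulary V): skip-gram, window 5, min frequency 5, 50 dimensions, with unseen words sampled from U[−0.25, 0.25].

```python
import numpy as np
from gensim.models import Word2Vec

# toy corpus; every word repeats often enough to survive min_count=5
corpus = [["what", "is", "the", "capital", "of", "france"]] * 10

w2v = Word2Vec(corpus, vector_size=50, window=5, min_count=5, sg=1)

vocab = ["what", "is", "capital", "france", "unseenword"]
W = np.empty((len(vocab), 50), dtype=np.float32)
for i, word in enumerate(vocab):
    if word in w2v.wv:
        W[i] = w2v.wv[word]                              # pre-trained vector
    else:
        W[i] = np.random.uniform(-0.25, 0.25, size=50)   # random fallback
```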
--
Additional features. Given that a certain percentage of the words in our word embedding matrix are initialized at random (about 15% for TRAIN-ALL) and that a relatively small number of QA pairs prevents the network from learning them directly from the training data, the similarity matching performed by the network will be suboptimal for many question-answer pairs.

In particular, we compute word overlap measures between each question-answer pair and include them as an additional feature vector xfeat in our model. This feature vector contains only four features: word overlap and IDF-weighted word overlap, each computed between all words and between non-stopwords only. Computing these features is straightforward and does not require additional pre-processing or external resources.
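A hedged sketch of these four features (the stopword list, the IDF dictionary, and the exact normalization are assumptions; the paper does not spell them out here):

```python
STOPWORDS = {"the", "a", "an", "of", "is", "what"}   # illustrative subset

def overlap_features(question, answer, idf):
    q, a = set(question), set(answer)
    feats = []
    for filter_stop in (False, True):
        qf = {w for w in q if not (filter_stop and w in STOPWORDS)}
        af = {w for w in a if not (filter_stop and w in STOPWORDS)}
        common = qf & af
        denom = len(qf) + len(af)
        feats.append(len(common) / denom if denom else 0.0)          # word overlap
        idf_common = sum(idf.get(w, 0.0) for w in common)
        idf_denom = sum(idf.get(w, 0.0) for w in qf | af)
        feats.append(idf_common / idf_denom if idf_denom else 0.0)   # IDF-weighted overlap
    return feats   # [overlap, idf_overlap, overlap_no_stop, idf_overlap_no_stop]

idf = {"capital": 2.3, "france": 2.0, "paris": 2.5}
print(overlap_features(["what", "is", "the", "capital", "of", "france"],
                       ["paris", "is", "the", "capital", "of", "france"], idf))
```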

Evaluation:
MRR: looks only at the rank of the first correct answer; hence it is more suitable when each question has only a single correct answer.
MAP: examines the ranks of all the correct answers. It is computed as the mean over the average precision scores of all queries.
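A short sketch of the two metrics over ranked candidate lists (each item in `rankings` is a list of 0/1 relevance labels in ranked order for one query):

```python
def mrr(rankings):
    total = 0.0
    for labels in rankings:
        for rank, rel in enumerate(labels, start=1):
            if rel:
                total += 1.0 / rank   # reciprocal rank of the first correct answer
                break
    return total / len(rankings)

def mean_average_precision(rankings):
    ap_sum = 0.0
    for labels in rankings:
        hits, precisions = 0, []
        for rank, rel in enumerate(labels, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)   # precision at each correct answer
        ap_sum += sum(precisions) / hits if hits else 0.0
    return ap_sum / len(rankings)

rankings = [[0, 1, 0, 1], [1, 0, 0]]
print(mrr(rankings), mean_average_precision(rankings))   # 0.75 0.75
```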
