Key points:

The text-processing pipeline is largely standard and does not differ much from the usual setup; the paper's main contribution is the similarity matrix it introduces.

The computation pipeline:

The query and the document are first passed through word embeddings to obtain their respective representation (sentence) matrices.

A convolutional network then produces a feature map for each, and pooling yields the vector representation Xq of the query and the vector Xd of the document.

Unlike a traditional Siamese network, which at this point would compute the similarity of Xq and Xd directly with a Euclidean or cosine distance and predict from that score, this network uses a similarity matrix to compute sim(Xq, Xd). It then concatenates Xq, sim(Xq, Xd) and Xd, appends word-overlap and IDF-weighted word-overlap features, and feeds the resulting feature vector into a hidden neural network layer.  -- word overlap is a classic way of computing sentence similarity

The output of that hidden layer goes through a fully connected layer, and a softmax produces the final prediction.
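A minimal PyTorch sketch of this pipeline (my own reconstruction, not the authors' code; the class names, hidden-layer size, and initialization of M are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    def __init__(self, emb_dim=50, n_filters=100, filter_width=5):
        super().__init__()
        # wide convolution: pad by filter_width - 1 on each side
        self.conv = nn.Conv1d(emb_dim, n_filters, filter_width,
                              padding=filter_width - 1)

    def forward(self, x):              # x: (batch, emb_dim, sentence_length)
        c = F.relu(self.conv(x))       # (batch, n_filters, |s| + m - 1)
        return c.max(dim=2).values     # max pooling over time -> (batch, n_filters)

class QueryDocumentModel(nn.Module):
    def __init__(self, n_filters=100, n_extra_feats=4, hidden_dim=100):
        super().__init__()
        self.query_cnn = SentenceCNN()
        self.doc_cnn = SentenceCNN()
        # similarity matrix M, learned jointly with the rest of the network
        self.M = nn.Parameter(torch.randn(n_filters, n_filters) * 0.01)
        join_dim = 2 * n_filters + 1 + n_extra_feats
        self.hidden = nn.Linear(join_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, 2)

    def forward(self, query, doc, extra_feats):
        x_q = self.query_cnn(query)
        x_d = self.doc_cnn(doc)
        # bilinear similarity score x_q^T M x_d, one scalar per pair
        sim = torch.einsum('bi,ij,bj->b', x_q, self.M, x_d).unsqueeze(1)
        joint = torch.cat([x_q, sim, x_d, extra_feats], dim=1)
        return F.log_softmax(self.out(torch.relu(self.hidden(joint))), dim=1)
```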

In this task, questions and documents are limited to a single sentence.

The main building blocks of our architecture are two distributional sentence models based on convolutional neural networks. These underlying sentence models work in parallel, mapping queries and documents to their distributional vectors, which are then used to learn the semantic similarity between them.

Our model encodes query-document pairs in a rich representation using not only their similarity score but also their intermediate representations; (iii) the architecture of our network makes it straightforward to include any additional similarity features in the model.

However, their model operates only on unigrams or bigrams, while our architecture learns to extract and compose n-grams of higher degrees, thus allowing for capturing longer-range dependencies. Additionally, our architecture uses not only the intermediate representations of questions and answers to compute their similarity but also includes them in the final representation, which constitutes a much richer representation of the question-answer pairs.

pointwise:
it is enough to train a binary classifier
pairwise:
the model is explicitly trained to score correct pairs higher than
incorrect pairs with a certain margin
Comparison:
it requires considering a larger number of training instances (potentially quadratic in the size of the candidate document set) than the pointwise method, which may lead to slower training. Still, both pointwise and pairwise approaches ignore the fact that ranking is a prediction task on a list of objects.
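A hedged sketch contrasting the two training regimes (the tensors stand in for the model's outputs; this is illustrative, not the paper's code). Pointwise treats each query-document pair as an independent binary example; pairwise enforces a margin between a relevant and a non-relevant document of the same query:

```python
import torch
import torch.nn.functional as F

def pointwise_loss(scores, labels):
    # scores: (batch, 2) class logits for each (query, document) pair;
    # labels: 1 for relevant, 0 for non-relevant
    return F.cross_entropy(scores, labels)

def pairwise_loss(pos_scores, neg_scores, margin=1.0):
    # pos_scores / neg_scores: (batch,) relevance scores for a relevant and a
    # non-relevant document of the same query; hinge loss on the margin
    return torch.clamp(margin - (pos_scores - neg_scores), min=0).mean()

# usage sketch with random values
scores = torch.randn(4, 2)
labels = torch.tensor([1, 0, 0, 1])
print(pointwise_loss(scores, labels))
print(pairwise_loss(torch.randn(4), torch.randn(4)))
```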

Most often, producing a better representation ψ() that encodes various aspects of similarity between the input query-document pairs plays a far more important role in training an accurate reranker than choosing between different ranking approaches. Hence, in this paper we adopt a simple pointwise method for reranking and focus on modelling a rich representation of query-document pairs using deep learning approaches, as described next.

Our network is composed of a single wide convolutional layer followed by a non-linearity and simple max pooling.  -- i.e., a wide convolution

The range of allowed values for i defines two types of convolution: narrow and wide. The narrow type restricts i to be in the range [1, |s| − m + 1], which in turn restricts the filter width to be ≤ |s|. To compute the wide type of convolution, i ranges from 1 to |s|, and no restrictions are placed on the sizes of m and s. The benefits of one type of convolution over the other when dealing with text are discussed in detail in [18]. In short, the wide convolution is able to better handle words at boundaries, giving equal attention to all words in the sentence, unlike the narrow convolution, where words close to the boundaries are seen fewer times. More importantly, wide convolution also guarantees to always yield valid values even when s is shorter than the filter size m.
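A small numpy check of the two output lengths (illustrative toy values): narrow convolution gives |s| − m + 1 values, wide gives |s| + m − 1 and still works when the filter is longer than the sentence.

```python
import numpy as np

s = np.array([1.0, 2.0, 3.0, 4.0])   # a "sentence" of length |s| = 4
f = np.array([1.0, 0.5, 0.25])       # a filter of width m = 3

narrow = np.convolve(s, f, mode='valid')   # length |s| - m + 1 = 2
wide = np.convolve(s, f, mode='full')      # length |s| + m - 1 = 6
print(len(narrow), len(wide))              # 2 6

# With a filter wider than the sentence, only the wide convolution is
# meaningful in the paper's sense:
f_long = np.ones(6)
print(len(np.convolve(s, f_long, mode='full')))   # |s| + m - 1 = 9
```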

It should be noted that an alternative way of computing a convolution was explored in [18], where a series of convolutions is computed between each row of a sentence matrix and the corresponding row of the filter matrix. Essentially, it is a vectorized form of 1d convolution applied between corresponding rows of S and F. As a result, the output feature map is a matrix C ∈ R^{d×(|s|+m−1)}.

Among the most common choices of activation functions are the following: sigmoid (or logistic), hyperbolic tangent tanh, and a rectified linear (ReLU) function defined as simply max(0, x) to ensure that feature maps are always positive.

Both average and max pooling methods exhibit certain disadvantages: in average pooling, all elements of the input are considered, which may weaken strong activation values. This is especially critical with the tanh non-linearity, where strong positive and negative activations can cancel each other out. Max pooling is used more widely and does not suffer from the drawbacks of average pooling. However, as shown in [40], it can lead to strong overfitting on the training set and, hence, poor generalization on the test data.

Recently, max pooling has been generalized to k-max pooling [18], where instead of a single max value, k values are extracted in their original order. This allows for extracting several largest activation values from the input sentence.
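A hedged sketch of k-max pooling over one feature-map row: keep the k largest activations but preserve their original left-to-right order.

```python
import numpy as np

def k_max_pooling(row, k):
    # indices of the k largest values, then re-sorted by position
    idx = np.argsort(row)[-k:]
    return row[np.sort(idx)]

row = np.array([0.1, 0.9, 0.3, 0.8, 0.2])
print(k_max_pooling(row, 2))   # [0.9 0.8] -- original order kept
```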

Our architecture for matching text pairs:
Our sentence models based on ConvNets learn to map input sentences to vectors, which can then be used to compute their similarity. These are then used to compute a query-document similarity score, which together with the query and document vectors are joined in a single representation.

Measuring the similarity between query and document:
In this model, we seek a transformation of the candidate document, x'_d = M x_d, that is the closest to the input query x_q. The similarity matrix M is a parameter of the network and is optimized during training.
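A short sketch of this similarity-matrix layer: M transforms the document vector, and the score is the dot product of x_q with that transformation, i.e. sim(x_q, x_d) = x_q^T M x_d (the dimensionality and initialization are placeholders).

```python
import torch

n = 100                                      # dimensionality of x_q and x_d
M = torch.randn(n, n, requires_grad=True)    # trainable similarity matrix

x_q = torch.randn(n)
x_d = torch.randn(n)

x_d_prime = M @ x_d                # transformed candidate document x'_d
sim = torch.dot(x_q, x_d_prime)    # scalar similarity score x_q^T M x_d
```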

Adagrad scales the learning rate of SGD on each dimension based on the l2 norm of the history of the error gradient. Adadelta uses both the error gradient history like Adagrad and the weight update history. It has the advantage of not having to set a learning rate at all.
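An illustrative Adadelta step for a single parameter vector, following the update rule of [39] (a sketch of the standard formulation, not code from the paper): running averages of squared gradients and of squared updates replace a hand-tuned learning rate.

```python
import numpy as np

def adadelta_step(param, grad, state, rho=0.95, eps=1e-6):
    # accumulate squared gradients
    state['Eg2'] = rho * state['Eg2'] + (1 - rho) * grad ** 2
    # scale the step by the ratio of the two running RMS values
    update = -np.sqrt(state['Edx2'] + eps) / np.sqrt(state['Eg2'] + eps) * grad
    # accumulate squared updates
    state['Edx2'] = rho * state['Edx2'] + (1 - rho) * update ** 2
    return param + update

state = {'Eg2': 0.0, 'Edx2': 0.0}
w = np.array([1.0, -2.0])
w = adadelta_step(w, grad=np.array([0.3, -0.1]), state=state)
```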

Hyperparameter settings:
the width m of the convolution filters is set to 5 and the number of convolutional feature maps is 100. We use the ReLU activation function and simple max-pooling.
--
To train the network we use stochastic gradient descent with shuffled mini-batches. We eliminate the need to tune the learning rate by using the Adadelta update rule [39]. The batch size is set to 50 examples. The network is trained for 25 epochs with early stopping, i.e., we stop the training if no update to the best accuracy on the dev set has been made for the last 5 epochs. The accuracy computed on the dev set is the MAP score. At test time we use the parameters of the network that were obtained with the best MAP score on the development (dev) set, i.e., we compute the MAP score after every 10 mini-batch updates and save the network parameters if a new best dev MAP score is obtained. In practice, the training converges after a few epochs. We set the L2 regularization term to 1e-5 for the parameters of the convolutional layers and 1e-4 for all the others. The dropout rate is set to p = 0.5.
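A hedged sketch of this training configuration in PyTorch, reusing the QueryDocumentModel sketch from earlier in these notes (the exact dropout placement and parameter grouping are assumptions): Adadelta with no manual learning rate, batch size 50, dropout 0.5, and separate L2 strengths for the convolutional layers and everything else.

```python
import torch

model = QueryDocumentModel()   # sketch defined above
conv_params = list(model.query_cnn.parameters()) + list(model.doc_cnn.parameters())
other_params = [p for p in model.parameters()
                if not any(p is q for q in conv_params)]

optimizer = torch.optim.Adadelta([
    {'params': conv_params, 'weight_decay': 1e-5},   # L2 for convolutional layers
    {'params': other_params, 'weight_decay': 1e-4},  # L2 for all other parameters
])
dropout = torch.nn.Dropout(p=0.5)   # assumed to sit before the output layer
batch_size = 50
```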
--
we keep the word embeddings fixed and initialize the word matrix W from an unsupervised neural language model.
--
We choose the dimensionality of our word embeddings to be 50, in line with the deep learning model of [38].
--
Word embeddings. We initialize the word embeddings by running the word2vec tool [20] on the English Wikipedia dump and the AQUAINT corpus, containing roughly 375 million words. To train the embeddings we use the skip-gram model with window size 5, filtering out words with frequency less than 5. The resulting model contains 50-dimensional vectors for about 3.5 million words. Embeddings for words not present in the word2vec model are randomly initialized, with each component sampled from the uniform distribution U[−0.25, 0.25]. We minimally preprocess the data, performing only tokenization and lowercasing of all words. To reduce the size of the resulting vocabulary V, we also replace all digits with 0. The size of the word vocabulary V for experiments using the TRAIN set is 17,023, with approximately 95% of words initialized using word2vec embeddings and the remaining 5% initialized at random as described in Sec.
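A hedged sketch of this initialization with gensim (gensim >= 4.0 assumed; the toy corpus and vocabulary are placeholders for Wikipedia + AQUAINT and the task vocabulary V): skip-gram, window 5, min frequency 5, 50 dimensions, with unseen words sampled from U[−0.25, 0.25].

```python
import numpy as np
from gensim.models import Word2Vec

# toy corpus; every word repeats often enough to survive min_count=5
corpus = [["what", "is", "the", "capital", "of", "france"]] * 10

w2v = Word2Vec(corpus, vector_size=50, window=5, min_count=5, sg=1)

vocab = ["what", "is", "capital", "france", "unseenword"]
W = np.empty((len(vocab), 50), dtype=np.float32)
for i, word in enumerate(vocab):
    if word in w2v.wv:
        W[i] = w2v.wv[word]                              # pre-trained vector
    else:
        W[i] = np.random.uniform(-0.25, 0.25, size=50)   # random fallback
```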
--
Additional features. Given that a certain percentage of the words in our word embedding matrix are initialized at random (about 15% for TRAIN-ALL) and that a relatively small number of QA pairs prevents the network from learning them directly from the training data, the similarity matching performed by the network will be suboptimal for many question-answer pairs.

In particular, we compute word overlap measures between each question-answer pair and include them as an additional feature vector xfeat in our model. This feature vector contains only four features: word overlap and IDF-weighted word overlap, each computed between all words and between non-stopwords only. Computing these features is straightforward and does not require additional pre-processing or external resources.
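A hedged sketch of these four features (the stopword list, the IDF dictionary, and the exact normalization are assumptions; the paper does not spell them out here):

```python
STOPWORDS = {"the", "a", "an", "of", "is", "what"}   # illustrative subset

def overlap_features(question, answer, idf):
    q, a = set(question), set(answer)
    feats = []
    for filter_stop in (False, True):
        qf = {w for w in q if not (filter_stop and w in STOPWORDS)}
        af = {w for w in a if not (filter_stop and w in STOPWORDS)}
        common = qf & af
        denom = len(qf) + len(af)
        feats.append(len(common) / denom if denom else 0.0)          # word overlap
        idf_common = sum(idf.get(w, 0.0) for w in common)
        idf_denom = sum(idf.get(w, 0.0) for w in qf | af)
        feats.append(idf_common / idf_denom if idf_denom else 0.0)   # IDF-weighted overlap
    return feats   # [overlap, idf_overlap, overlap_no_stop, idf_overlap_no_stop]

idf = {"capital": 2.3, "france": 2.0, "paris": 2.5}
print(overlap_features(["what", "is", "the", "capital", "of", "france"],
                       ["paris", "is", "the", "capital", "of", "france"], idf))
```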

Evaluation:
MRR: looks only at the rank of the first correct answer; hence it is more suitable when each question has only a single correct answer.
MAP: examines the ranks of all the correct answers. It is computed as the mean over the average precision scores of all queries.
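A short sketch of the two metrics over ranked candidate lists (each item in `rankings` is a list of 0/1 relevance labels in ranked order for one query):

```python
def mrr(rankings):
    total = 0.0
    for labels in rankings:
        for rank, rel in enumerate(labels, start=1):
            if rel:
                total += 1.0 / rank   # reciprocal rank of the first correct answer
                break
    return total / len(rankings)

def mean_average_precision(rankings):
    ap_sum = 0.0
    for labels in rankings:
        hits, precisions = 0, []
        for rank, rel in enumerate(labels, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)   # precision at each correct answer
        ap_sum += sum(precisions) / hits if hits else 0.0
    return ap_sum / len(rankings)

rankings = [[0, 1, 0, 1], [1, 0, 0]]
print(mrr(rankings), mean_average_precision(rankings))   # 0.75 0.75
```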
