Attention in Recommender Systems: Alibaba's Deep Interest Network (DIN)

Reference:

https://zhuanlan.zhihu.com/p/51623339

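As the name suggests, an attention mechanism lets the model pay different amounts of attention to different user behaviors at prediction time: behaviors that are relevant to the candidate item get more weight, and irrelevant ones can be ignored altogether. This idea maps onto the model quite directly.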

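Under the earlier approach, every behavior record is treated equally. In model terms, an average pooling layer averages the embedding vectors of all the items the user has interacted with to form the user vector. At most, a clever engineer adds a time decay so that recent behaviors have more influence, which amounts to adjusting the weights by time during average pooling.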
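The two baselines described above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the paper: the exponential half-life form of the decay and the names `ages` and `half_life` are assumptions made for this example.

```python
def average_pooling(behavior_embs):
    """Plain mean of the behavior embeddings: every behavior counts equally."""
    n = len(behavior_embs)
    dim = len(behavior_embs[0])
    return [sum(e[d] for e in behavior_embs) / n for d in range(dim)]


def time_decay_pooling(behavior_embs, ages, half_life=7.0):
    """Weighted mean in which a behavior's weight halves every `half_life` units.

    `ages` (time since each behavior) and `half_life` are hypothetical
    parameters chosen for this sketch.
    """
    weights = [0.5 ** (age / half_life) for age in ages]
    total = sum(weights)
    dim = len(behavior_embs[0])
    return [
        sum(w * e[d] for w, e in zip(weights, behavior_embs)) / total
        for d in range(dim)
    ]


embs = [[1.0, 0.0], [0.0, 1.0]]
user_avg = average_pooling(embs)
# The second behavior is one half-life old, so it gets half the weight.
user_decay = time_decay_pooling(embs, ages=[0.0, 7.0])
```

In both cases the weights depend only on the history itself, never on the candidate item, which is the limitation attention removes.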

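With attention, the user vector becomes:

$V_u = f(V_a) = \sum_{i=1}^{N} w_i \cdot V_i = \sum_{i=1}^{N} g(V_i, V_a) \cdot V_i$

In this formula, $V_u$ is the user's embedding vector, $V_a$ is the embedding vector of the candidate ad item, $V_i$ is the embedding vector of user u's i-th behavior, and $N$ is the number of behaviors. Since a behavior here is a visit to an item or a shop, the behavior embedding is simply the embedding of the item or shop that was visited.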

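With attention added, $V_u$ changes from a plain sum of the $V_i$ into a weighted sum of the $V_i$. The weight $w_i$ of $V_i$ is determined by the relationship between $V_i$ and $V_a$, which is the $g(V_i,V_a)$ term in the formula above. Speaking loosely, introducing $g(V_i,V_a)$ accounts for about 70% of the paper's value.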
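As a rough sketch of the weighted sum, the function below computes $V_u = \sum_i g(V_i, V_a) \cdot V_i$ with the scoring function `g` passed in as an argument. The dot product used as `g` here is only a placeholder assumption for the example, and the weights are left unnormalized.

```python
def attention_pooling(behavior_embs, candidate_emb, g):
    """Compute V_u as the sum over behaviors of g(V_i, V_a) * V_i."""
    dim = len(candidate_emb)
    user_vec = [0.0] * dim
    for v_i in behavior_embs:
        w_i = g(v_i, candidate_emb)  # weight depends on the candidate ad
        for d in range(dim):
            user_vec[d] += w_i * v_i[d]
    return user_vec


def dot(u, v):
    """Placeholder scoring function: plain dot product."""
    return sum(a * b for a, b in zip(u, v))


behaviors = [[1.0, 0.0], [0.0, 1.0]]
candidate = [1.0, 0.0]
user_vec = attention_pooling(behaviors, candidate, g=dot)
```

The same history produces a different `user_vec` for each candidate, which is the point of the mechanism: the behavior orthogonal to the candidate contributes nothing here.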

So what should the function $g(V_i,V_a)$ look like? The paper's architecture diagram makes this clear.

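Compared with the standard deep recommendation network (the Base model), DIN adds an activation unit layer when building the user embedding vector. This layer produces the weight for each user behavior $V_i$. Let's look closely at how that weight is generated, that is, how $g(V_i,V_a)$ is defined.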

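In a conventional attention mechanism, given two item embeddings u and v, the score is usually the dot product $u^T v$ or the bilinear form $u^T W v$, where W is a $|u| \times |v|$ weight matrix. The paper goes a step further. In the activation unit (top right of the architecture diagram), u, v, and the element-wise difference of u and v are concatenated as the input, fed to fully connected layers, and reduced to a single weight. This loses less information than a bare dot product. If you just want an easy way to add attention to your own model, though, the dot product is a good first try, because it introduces no extra parameters to train.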
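The activation unit described above might look roughly like the following. This is a sketch under stated assumptions: the hidden size, the random untrained weights, and the plain ReLU in the hidden layer are all choices made for illustration, and `make_activation_unit` is a hypothetical helper name.

```python
import random


def make_activation_unit(dim, hidden=4, seed=0):
    """Build a tiny feed-forward scorer g(u, v) with random, untrained weights."""
    rng = random.Random(seed)
    in_dim = 3 * dim  # input is the concatenation [u, v, u - v]
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def g(u, v):
        diff = [a - b for a, b in zip(u, v)]
        x = list(u) + list(v) + diff
        # Hidden fully connected layer; ReLU is a simplification for this sketch.
        h = [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + bias)
            for row, bias in zip(w1, b1)
        ]
        # Linear output layer reduces the hidden vector to one scalar weight.
        return sum(w * hi for w, hi in zip(w2, h)) + b2

    return g


g = make_activation_unit(dim=2)
weight = g([1.0, 0.0], [0.5, 0.5])
```

Unlike the dot product, this scorer has parameters (`w1`, `b1`, `w2`, `b2`) that would be learned jointly with the rest of the network.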

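Also note the red lines in the architecture diagram: each ad has two attributes, good_id and shop_id. The ad's shop_id interacts only with the shop_id sequence in the user's history, and its good_id only with the user's good_id sequence. The reason is straightforward: an attention weight compares embeddings, and that comparison is only meaningful between features of the same kind.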

The activation unit as described in the paper:

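Activation units are applied on the user behavior features, which perform as a weighted sum pooling to adaptively calculate the user representation $v_U$ given a candidate ad A:

$v_U(A) = f(v_A, e_1, e_2, ..., e_H) = \sum_{j=1}^{H} a(e_j, v_A) e_j = \sum_{j=1}^{H} w_j e_j$

where $\{e_1, e_2, ..., e_H\}$ is the list of embedding vectors of behaviors of user $U$ with length $H$, $v_A$ is the embedding vector of ad A, and $a(\cdot)$ is a feed-forward network whose output is the activation weight.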

If the part above is 70% of the paper's value, the remaining 30% comes from the following points:

Replacing AUC with GAUC as the offline metric
Replacing the classic PReLU activation function with Dice
An adaptive regularization method
Alibaba's X-Deep Learning platform

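PReLU activation function:

$f(s) = p(s) \cdot s + (1 - p(s)) \cdot \alpha s$

where $p(s) = I(s > 0)$ is the indicator function and $\alpha$ is a learned parameter.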
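As a small illustration, PReLU passes positive inputs through unchanged and scales non-positive inputs by a slope $\alpha$. In a real network $\alpha$ is learned; the fixed default of 0.25 below is an assumption for this sketch.

```python
def prelu(s, alpha=0.25):
    """PReLU written with the indicator p(s) = I(s > 0)."""
    p = 1.0 if s > 0 else 0.0
    # p selects between the identity channel and the alpha-scaled channel.
    return p * s + (1.0 - p) * alpha * s


positive_out = prelu(2.0)
negative_out = prelu(-2.0)
```

The switch between the two channels is a hard step at 0 regardless of how the inputs are distributed, which is what Dice relaxes.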

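Dice:

$f(s) = p(s) \cdot s + (1 - p(s)) \cdot \alpha s, \quad p(s) = \frac{1}{1 + e^{-\frac{s - E[s]}{\sqrt{Var[s] + \epsilon}}}}$

where $E[s]$ and $Var[s]$ are the mean and variance of the input within each mini-batch and $\epsilon$ is a small constant.

Dice can be viewed as a generalization of PReLU. The key idea of Dice is to adaptively adjust the rectified point according to the distribution of the input data, whose value is set to be the mean of the input. Besides, Dice switches smoothly between the two channels. When $E[s] = 0$ and $Var[s] = 0$, Dice degenerates into PReLU.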
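A minimal sketch of Dice over one mini-batch of scalar inputs follows. It keeps the two-channel form of PReLU but replaces the hard indicator with a sigmoid of the standardized input, so the switch point sits at the batch mean. The fixed `alpha` and the use of batch statistics only (no running averages for inference) are simplifications assumed here.

```python
import math


def dice(batch, alpha=0.25, eps=1e-8):
    """Apply Dice to a list of scalars using the batch mean and variance."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((s - mean) ** 2 for s in batch) / n
    out = []
    for s in batch:
        z = (s - mean) / math.sqrt(var + eps)  # standardize the input
        p = 1.0 / (1.0 + math.exp(-z))         # smooth gate in (0, 1)
        out.append(p * s + (1.0 - p) * alpha * s)
    return out


values = dice([-2.0, 0.0, 2.0])
```

Inputs well above the batch mean are passed almost unchanged, inputs well below it are scaled by roughly `alpha`, and values near the mean get a blend of the two.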

GAUC:

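AUC measures ranking quality over all samples pooled together, so it also compares samples that belong to different users. In computational advertising, what matters more is how well the model ranks different ads for the same user. Group AUC (GAUC) computes an AUC for each user separately and then takes a weighted average, which reduces the effect of scores not being comparable across users.

In practice the weight is usually each user's number of views (impressions) or clicks, and users whose samples are all positive or all negative are usually filtered out.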
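The computation can be sketched as $\mathrm{GAUC} = \sum_u w_u \cdot \mathrm{AUC}_u / \sum_u w_u$. The code below is an illustrative implementation, not the linked repository's: it uses a simple pairwise AUC (fine for small groups, quadratic in group size) and assumes the weight is the user's impression count.

```python
from collections import defaultdict


def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when the group contains a single class
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gauc(user_ids, labels, scores):
    """Impression-weighted mean of per-user AUC, skipping single-class users."""
    groups = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        groups[u][0].append(y)
        groups[u][1].append(s)
    weighted_sum, weight_total = 0.0, 0.0
    for ys, ss in groups.values():
        user_auc = auc(ys, ss)
        if user_auc is None:
            continue  # all-positive or all-negative user is filtered out
        weight = len(ys)  # assumption: weight by number of impressions
        weighted_sum += weight * user_auc
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


users = ["a", "a", "b", "b", "b", "c"]
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.1, 0.2, 0.3, 0.1, 0.5]
result = gauc(users, labels, scores)
```

In this toy data user `a` has AUC 1.0 over 2 impressions, user `b` has AUC 0.5 over 3 impressions, and user `c` is skipped because all of their samples are positive.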

Implementation: https://github.com/qiaoguan/deep-ctr-prediction/blob/master/DeepCross/metric.py

Paper notes:

Base model: Embedding & MLP

Embedding layer:

For the $i$-th feature group $t_i$ ($t_i$ is a $K_i$-dimensional binary vector in which one or more entries may be 1), let $W_i = [w^i_1, ..., w^i_j, ..., w^i_{K_i}] \in \mathbb{R}^{D \times K_i}$ represent the $i$-th embedding dictionary, where $w^i_j \in \mathbb{R}^D$ is an embedding vector with dimensionality $D$. The embedding operation follows a table lookup mechanism.

Embedding mechanism:

1. If $t_i$ is a one-hot vector with $j$-th element $t_i[j] = 1$, the embedded representation of $t_i$ is a single embedding vector $e_i = w^i_j$.

2. If $t_i$ is a multi-hot vector with $t_i[j] = 1$ for $j \in \{i_1, i_2, ..., i_k\}$, the embedded representation of $t_i$ is a list of embedding vectors: $\{e_{i_1}, e_{i_2}, ..., e_{i_k}\} = \{w^i_{i_1}, w^i_{i_2}, ..., w^i_{i_k}\}$.
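The lookup in both cases can be sketched with one small function. Note the layout assumption: the table below stores one embedding per row (so it is $K_i \times D$), whereas the text writes $W_i$ with embeddings as columns. The name `embed` and the toy values are illustrative only.

```python
def embed(t, table):
    """Return the embedding vectors for every index j where t[j] == 1.

    `table[j]` plays the role of w^i_j. A one-hot input yields a list with a
    single vector; a multi-hot input yields one vector per active index.
    """
    return [table[j] for j, bit in enumerate(t) if bit == 1]


# Toy dictionary with K_i = 3 ids and embedding dimension D = 2.
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
one_hot = embed([0, 1, 0], table)
multi_hot = embed([1, 0, 1], table)
```

Because the multi-hot result has a variable number of vectors, it cannot be fed to a fully connected layer directly, which motivates the pooling step.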

Pooling layer and Concat layer:

The number of non-zero values for multi-hot behavioral feature vector $t_i$ varies across instances, causing the lengths of the corresponding list of embedding vectors to be variable. As fully connected networks can only handle fixed-length inputs, it is a common practice to transform the list of embedding vectors via a pooling layer to get a fixed-length vector:

$e_i = \mathrm{pooling}(e_{i_1}, e_{i_2}, ..., e_{i_k})$

Both embedding and pooling layers operate in a group-wise manner, mapping the original sparse features into multiple fixed-length representation vectors. Then all the vectors are concatenated together to obtain the overall representation vector for the instance.
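A minimal sketch of the group-wise pooling and concatenation, assuming sum pooling and two toy feature groups (a variable-length behavior history and a fixed-length profile vector); the group names and values are made up for illustration.

```python
def sum_pooling(vectors):
    """Collapse a variable-length list of D-dim vectors into one D-dim vector."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) for d in range(dim)]


def concat(groups):
    """Concatenate per-group fixed-length vectors into one instance vector."""
    return [x for group in groups for x in group]


goods_history = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three behaviors, D = 2
user_profile = [0.5, 0.5]                              # already fixed-length
pooled = sum_pooling(goods_history)
instance = concat([user_profile, pooled])
```

However many behaviors the history holds, `pooled` always has length D, so `instance` has a fixed length that the MLP can consume.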

MLP:

Given the concatenated dense representation vector, fully connected layers are used to learn the combination of features automatically. Recently developed methods focus on designing structures of MLP for better information extraction.
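For completeness, a forward pass through a tiny MLP on top of the concatenated vector might look like this. The single hidden layer, the ReLU activation, the sigmoid output for click probability, and the hand-picked weights are all assumptions made to keep the sketch short and deterministic.

```python
import math


def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mlp_ctr(x, hidden_w, hidden_b, out_w, out_b):
    """One hidden layer followed by a scalar logit and a sigmoid."""
    h = relu(dense(x, hidden_w, hidden_b))
    logit = dense(h, [out_w], [out_b])[0]
    return sigmoid(logit)


x = [1.0, -1.0]
p = mlp_ctr(
    x,
    hidden_w=[[1.0, 0.0], [0.0, 1.0]],
    hidden_b=[0.0, 0.0],
    out_w=[1.0, 1.0],
    out_b=0.0,
)
```

With these weights the hidden layer keeps the first input and zeroes the negative one, so the logit is 1.0 and `p` is `sigmoid(1.0)`.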
