[DM Paper Reading Notes] Attention Mechanisms in Recommender Systems

Paper Title

Real-time Attention Based Look-alike Model for Recommender System

Basic algorithm and main steps

Basic ideas

RALM is a similarity-based look-alike model that consists of user representation learning and look-alike learning. Novel points: an attention merge layer, local and global attention, and online asynchronous seed clustering.

1. Offline Training

1. User Representation Learning

Treat it as multi-class classification that chooses an interest item from millions of candidates.

(1) Calculate the probability of picking the $ i$-th item as a negative example

$ p(x_i) = \frac{\log(k+2)-\log(k+1)}{\log(D+1)} $

$ D $: the max rank of all the items (ranked by their frequency of appearance).

$ k $: the rank of the $ i$-th item.

(2) Negative sampling: sample at a positive/negative ratio of 1/10
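Steps (1) and (2) can be sketched in a few lines of NumPy. This is only an illustration, not the paper's code: it assumes ranks are 0-indexed ($k = 0, \dots, D-1$, most frequent item first), in which case the probabilities telescope and sum to 1. The function names, the `seed` argument, and the choice to draw negatives without replacement are my own assumptions.

```python
import numpy as np

def negative_sampling_probs(num_items: int) -> np.ndarray:
    """p(x_i) = (log(k+2) - log(k+1)) / log(D+1) for ranks k = 0..D-1."""
    k = np.arange(num_items)
    return (np.log(k + 2) - np.log(k + 1)) / np.log(num_items + 1)

def sample_negatives(positive_rank: int, num_items: int,
                     num_negatives: int = 10, seed: int = 0) -> np.ndarray:
    """Draw `num_negatives` item ranks for one positive (a 1/10 ratio by default)."""
    rng = np.random.default_rng(seed)
    probs = negative_sampling_probs(num_items)
    probs[positive_rank] = 0.0   # assumption: never draw the positive as a negative
    probs /= probs.sum()         # renormalize after masking the positive
    return rng.choice(num_items, size=num_negatives, replace=False, p=probs)

negatives = sample_negatives(positive_rank=3, num_items=1000)
```

Frequent items (small $k$) are picked as negatives more often, which keeps popular items from dominating the positives.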

(3) Embedding layer

$ P(c=i|U,X_i) = \frac{e^{x_i u}}{\sum \limits_{j \in X}e^{x_j u}} $

the cross entropy loss: $ L = -\sum \limits_{i \in X} y_i \log P(c=i|U,X_i) $

$ u $: a high-dimensional embedding of the user

$ x_j $: embeddings of item $ j $

$ y_i \in \{0, 1\} $: the label

When training converges, the output is the representation of user interests.
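The softmax and the cross entropy loss of step (3) can be checked numerically with a small sketch. It assumes the candidate set $X$ is one positive item plus its sampled negatives, stored as rows of a matrix; the function names and shapes are illustrative, not from the paper.

```python
import numpy as np

def softmax_probs(user: np.ndarray, items: np.ndarray) -> np.ndarray:
    """P(c=i | U, X) for every candidate; `items` has shape (candidates, dim)."""
    logits = items @ user             # x_i . u for each candidate
    logits = logits - logits.max()    # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

def cross_entropy(user: np.ndarray, items: np.ndarray, labels: np.ndarray) -> float:
    """L = -sum_i y_i log P(c=i | U, X) with labels y_i in {0, 1}."""
    return float(-np.sum(labels * np.log(softmax_probs(user, items))))

rng = np.random.default_rng(0)
user = rng.normal(size=8)             # user embedding u
items = rng.normal(size=(11, 8))      # 1 positive + 10 sampled negatives
labels = np.zeros(11)
labels[0] = 1.0                       # row 0 is the positive item
loss = cross_entropy(user, items, labels)
```

With a single positive, the loss reduces to the negative log-probability assigned to that item.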

(4) Attention merge layer

Learn user-related weights for multiple fields.

$n$ fields are each embedded as a vector $h \in R^m$ of the same length $m$ and then concatenated along dimension 2, resulting in a matrix $H \in R^{n×m}$. Next, compute the weights:

$ u = \tanh(W_1H) $

$ a_i = \frac{e^{W_2u_i^T}}{\sum_j^n e^{W_2u_j^T}} $

$W_1 \in R^{k×n}$ and $W_2 \in R^k$: weight matrices; $k$: size of the attention unit.

$ u \in R^n$: the activation unit for fields; $a \in R^n$: weights of fields.

Merge vector $ M \in R^m $: $ M = aH $

$M$ is then fed into the MLP layer to obtain the universal user embedding.
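A minimal NumPy sketch of the attention merge layer follows. The shapes written in these notes do not compose directly, so the sketch assumes $W_1$ has shape $(k, m)$ and is applied to each field embedding, giving one $k$-dimensional activation per field. That shape choice and the function name are my assumptions, not something the paper confirms.

```python
import numpy as np

def attention_merge(H: np.ndarray, W1: np.ndarray, W2: np.ndarray):
    """H: (n, m) field embeddings; W1: (k, m) (assumed shape); W2: (k,).

    Returns the merged vector M of shape (m,) and the field weights a of shape (n,).
    """
    U = np.tanh(H @ W1.T)              # (n, k): activation per field
    scores = U @ W2                    # (n,): one scalar score per field
    scores = scores - scores.max()     # numerical stability
    a = np.exp(scores) / np.exp(scores).sum()   # softmax over the n fields
    M = a @ H                          # (m,): weighted sum of field embeddings
    return M, a

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))            # n = 4 fields, m = 8 dimensions
M, a = attention_merge(H, rng.normal(size=(5, 8)), rng.normal(size=5))
```

Unlike plain concatenation, the weights `a` depend on the user's own field embeddings, so each user can emphasize different fields.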

2. Look-alike Learning

(1) Transforming matrix

Maps the embeddings from $ n \times m $ to $ n \times h $.

(2) Local attention

To activate local interest / mine personalized info.

$ E_{local_s} = E_s softmax(tanh(E_s^T W_l E_u)) $

$W_l \in R^{h \times h}$ : the attention matrix,

$E_s$: seed users; $ E_u $: target user

Note: First, cluster the seed users into $k$ clusters with the K-means algorithm, and for each cluster compute the mean of its seed vectors.
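The local attention formula can be sketched as below. The sketch assumes $E_s$ stores the $k$ cluster centroids as columns (shape $(h, k)$) and $E_u$ is a single $h$-dimensional vector, so that the softmax yields one weight per cluster. The K-means step is taken as already done, and the names are illustrative.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

def local_attention(E_s: np.ndarray, E_u: np.ndarray, W_l: np.ndarray) -> np.ndarray:
    """E_s: (h, k) centroids as columns (assumed layout); E_u: (h,); W_l: (h, h).

    Returns E_local_s = E_s softmax(tanh(E_s^T W_l E_u)), shape (h,).
    """
    weights = softmax(np.tanh(E_s.T @ W_l @ E_u))   # (k,): one weight per cluster
    return E_s @ weights               # weighted sum of the cluster centroids

rng = np.random.default_rng(0)
E_s = rng.normal(size=(6, 3))          # h = 6, k = 3 clusters
E_u = rng.normal(size=6)
E_local = local_attention(E_s, E_u, rng.normal(size=(6, 6)))
```

Clusters that are closer to the target user under $W_l$ get larger weights, so the seed representation is personalized per target user.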

(3) Global attention

$ E_{global_s} = E_s softmax(E_s^T tanh(W_g E_s)) $

(4) Calculate the similarity between seeds and target user

$ score_{u,s} = \alpha \cdot cosine(E_u,E_{global_s}) + \beta \cdot cosine(E_u, E_{local_s}) $
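The score is a weighted sum of two cosine similarities, which the sketch below spells out. The default values of `alpha` and `beta` are placeholders chosen for illustration; these notes do not state the values used in the paper.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def look_alike_score(E_u: np.ndarray, E_global: np.ndarray, E_local: np.ndarray,
                     alpha: float = 0.3, beta: float = 0.7) -> float:
    """score = alpha * cos(E_u, E_global) + beta * cos(E_u, E_local).

    alpha and beta are placeholder weights (assumption).
    """
    return alpha * cosine(E_u, E_global) + beta * cosine(E_u, E_local)

score = look_alike_score(np.array([1.0, 0.0]),
                         np.array([1.0, 0.0]),
                         np.array([0.0, 1.0]))
```

In this toy call the target user matches the global seed vector exactly and is orthogonal to the local one, so only the `alpha` term contributes.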

(5) Iterative training

2. Online Asynchronous Processing

Update the seed embedding database in real time. It includes a user feedback monitor and seed clustering.

3. Online Serving

$ score_{u,s} = \alpha \cdot cosine(E_u,E_{global_s}) + \beta \cdot cosine(E_u, E_{local_s}) $

Motivation

The "Matthew effect" has become increasingly evident in recent recommender systems. Many competitive long-tail contents struggle to get timely exposure because they lack behavior features.
Traditional look-alike models, which are widely used in online advertising, are not suitable for recommender systems because of the strict requirements on both real-time performance and effectiveness.

Contribution

Improve the effectiveness of user representation learning by using attention to capture various fields of interest.
Improve the robustness and adaptivity of seed representation learning by using local and global attention.
Realize a real-time, high-performance look-alike model.

My own idea

Relation to what I have read

Method of concatenating feature fields. In other CTR papers I have read, different feature fields are concatenated directly. This causes overfitting on strongly relevant fields (such as interest tags) and underfitting on weakly relevant fields (such as shopping interests). As a result, the recommendations are determined by a few strongly relevant fields. Such models cannot learn comprehensively from multi-field features, and their recommendations lack diversity. This paper instead uses an attention merge layer to learn effective relations among the different fields of user features.
Besides, it uses high-order continuous features instead of categorical features. In my opinion, if we use low-order categorical features to describe a user group, we can only construct the features with statistical methods, which loses most of the group's information. The higher-order continuous features produced by representation learning, however, actually contain the intersections of various lower-order user features and so express user information more comprehensively. Moreover, the higher-order features generalize, which keeps the representation from merely memorizing historical data.

Shortcomings and potential change I assume

In this paper, it seems that only a few features are used to learn the representation, which may limit its effectiveness to some extent.