Personalized Ranking Algorithms in Practice (1): The FM Algorithm

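The Factorization Machine (FM) algorithm addresses the feature-combination (feature-crossing) problem on large-scale sparse data. FM can be viewed as logistic regression (LR) extended with pairwise feature-cross terms.

For the theory, see the FM series. By simplifying FM's second-order term, the cost of computing a prediction drops from $O(kn^2)$ to $O(kn)$, where $n$ is the number of features and $k$ is the dimension of the latent vectors. That is: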

\[\begin{aligned}
\hat y(\mathbf{x}) &= w_0+\sum_{i=1}^n w_i x_i +\sum_{i=1}^n \sum_{j=i+1}^n \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j \\
&= w_0+\sum_{i=1}^n w_i x_i + \frac{1}{2} \sum_{f=1}^{k} \left( \left(\sum_{i=1}^{n} v_{i,f} x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right)
\end{aligned}
\]
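The simplification can be checked numerically. The sketch below is not part of the original code: it uses plain Python lists and two hypothetical helpers, `fm_pairwise` (the naive double sum over $i<j$) and `fm_fast` (the $O(kn)$ form), and compares them on random inputs.

```python
import random


def fm_pairwise(x, V):
    """Naive second-order term: sum over i < j of <v_i, v_j> * x_i * x_j."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total


def fm_fast(x, V):
    """Simplified term: 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]."""
    n, k = len(x), len(V[0])
    total = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        total += s * s - s_sq
    return 0.5 * total


rng = random.Random(0)  # fixed seed so the run is reproducible
n, k = 6, 3
x = [rng.uniform(-1, 1) for _ in range(n)]
V = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]

diff = abs(fm_pairwise(x, V) - fm_fast(x, V))
print("difference between the two forms:", diff)
```

The two values agree up to floating-point rounding, while `fm_fast` touches each entry of `V` only a constant number of times.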

We learn the model parameters with stochastic gradient descent (SGD). The gradient of the model output with respect to each parameter is:

\[\frac{\partial}{\partial \theta} \hat y(\mathbf{x}) =
\begin{cases}
1, & \text{if}\; \theta\; \text{is}\; w_0 \text{ (bias term)} \\
x_i, & \text{if}\; \theta\; \text{is}\; w_i \text{ (linear term)} \\
x_i \sum_{j=1}^{n} v_{j,f} x_j - v_{i,f} x_i^2, & \text{if}\; \theta\; \text{is}\; v_{i,f} \text{ (interaction term)}
\end{cases}
\]
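The interaction-term gradient is the least obvious of the three, so a finite-difference check may help. This is a minimal sketch in plain Python, not the author's code; `fm_predict` and `grad_v` are hypothetical helper names, and the parameter values are random.

```python
import random


def fm_predict(x, w0, w, V):
    """FM prediction using the O(kn) form of the second-order term."""
    n, k = len(x), len(V[0])
    out = w0 + sum(w[i] * x[i] for i in range(n))
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        out += 0.5 * (s * s - sum((V[i][f] * x[i]) ** 2 for i in range(n)))
    return out


def grad_v(x, V, i, f):
    """Analytic gradient w.r.t. v_{i,f}: x_i * sum_j v_{j,f} x_j - v_{i,f} x_i^2."""
    s = sum(V[j][f] * x[j] for j in range(len(x)))
    return x[i] * s - V[i][f] * x[i] ** 2


rng = random.Random(1)
n, k = 5, 2
x = [rng.uniform(-1, 1) for _ in range(n)]
w0 = 0.3
w = [rng.uniform(-1, 1) for _ in range(n)]
V = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]

# Central difference on a single entry v_{i,f}.
i, f, eps = 2, 1, 1e-6
V_plus = [row[:] for row in V]
V_minus = [row[:] for row in V]
V_plus[i][f] += eps
V_minus[i][f] -= eps
numeric = (fm_predict(x, w0, w, V_plus) - fm_predict(x, w0, w, V_minus)) / (2 * eps)
analytic = grad_v(x, V, i, f)
print("numeric:", numeric, "analytic:", analytic)
```

Because $\hat y$ is quadratic in each $v_{i,f}$, the central difference matches the analytic value up to rounding error.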

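Here we implement the whole algorithm in TensorFlow. The basic steps are:

1. Build the dataset. Each MovieLens rating is one sample (row), the user ID and item ID are the features, and the rating is the label. The result is stored as a sparse matrix; for details, see the article on representing sparse matrices in Python.

The user ID and item ID are one-hot encoded, and a latent vector is learned for every resulting feature, so the latent-vector matrix has shape (feat_num, vec_dim). Note that the feature dimension feat_num is no longer 2; it is the dimension after one-hot encoding. These latent vectors can therefore be viewed as an embedding for each one-hot dimension, and FM ultimately predicts the label from the inner products of these embedding vectors.

2. Build the graph in TensorFlow, paying particular attention to how the prediction (y_hat in the code) and the loss are constructed. In addition, a batcher() method is implemented as a generator that yields mini-batches.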
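To make step 1 concrete, here is a small sketch of the one-hot layout on three toy ratings. The IDs and parameter values are invented for illustration, and `build_index`, `one_hot_row`, and `fm_predict` are hypothetical helpers rather than functions from the original code. It shows that, with exactly one user column and one item column active, the FM prediction reduces to the bias, the two first-order weights, and one inner product of embeddings.

```python
def build_index(users, items):
    """Assign one column of the one-hot space to each distinct user and item ID."""
    index = {}
    for u in users:
        index.setdefault(("user", u), len(index))
    for it in items:
        index.setdefault(("item", it), len(index))
    return index


def fm_predict(x, w0, w, V):
    """FM prediction using the O(kn) form of the second-order term."""
    n, k = len(x), len(V[0])
    out = w0 + sum(w[i] * x[i] for i in range(n))
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        out += 0.5 * (s * s - sum((V[i][f] * x[i]) ** 2 for i in range(n)))
    return out


# Toy data (made up): three ratings by two users on two items.
users = [1, 1, 2]
items = [10, 20, 10]
index = build_index(users, items)
feat_num = len(index)  # 2 users + 2 items = 4 columns, not 2


def one_hot_row(u, it):
    """Dense one-hot row with one active user column and one active item column."""
    row = [0.0] * feat_num
    row[index[("user", u)]] = 1.0
    row[index[("item", it)]] = 1.0
    return row


# Toy parameters (made up), with vec_dim = 2.
w0 = 0.5
w = [0.1, 0.2, 0.3, 0.4]
V = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

x = one_hot_row(2, 10)
u_col, i_col = index[("user", 2)], index[("item", 10)]
pred = fm_predict(x, w0, w, V)
shortcut = w0 + w[u_col] + w[i_col] + sum(a * b for a, b in zip(V[u_col], V[i_col]))
print(x, pred, shortcut)
```

In this setting the full FM formula and the embedding inner-product shortcut give the same number, which is why the rows of V can be read as user and item embeddings.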

The core code is as follows:

# Inputs: one-hot feature rows and the rating labels
x = tf.placeholder(tf.float32, shape=[None, feat_num], name="input_x")
y = tf.placeholder(tf.float32, shape=[None, 1], name="ground_truth")
# Parameters: bias w0, first-order weights W, latent vectors V
w0 = tf.get_variable(name="bias", shape=[1], dtype=tf.float32)
W = tf.get_variable(name="linear_w", shape=[feat_num], dtype=tf.float32)
V = tf.get_variable(name="interaction_w", shape=[feat_num, vec_dim], dtype=tf.float32)
# keep_dims is the TF 1.x argument name; later 1.x releases deprecate it in favor of keepdims
linear_part = w0 + tf.reduce_sum(tf.multiply(x, W), axis=1, keep_dims=True)
# O(kn) second-order term: 0.5 * sum_f [(xV)_f^2 - (x^2 V^2)_f]
interaction_part = 0.5 * tf.reduce_sum(tf.square(tf.matmul(x, V)) - tf.matmul(tf.square(x), tf.square(V)), axis=1, keep_dims=True)
y_hat = linear_part + interaction_part
loss = tf.reduce_mean(tf.square(y - y_hat))
train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)

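As shown, three variables are defined: $w_0$, $W$, and $V$, representing the bias, the first-order weights, and the embedding vectors respectively. The loss is the mean squared error (MSE), and it is minimized with the Adam optimizer, an adaptive variant of stochastic gradient descent.

The complete code is as follows: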
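To illustrate what one optimization step does to these three variables, here is a minimal sketch of a single update on one sample. For simplicity it uses plain SGD rather than Adam, so it is not what `tf.train.AdamOptimizer` computes; the helper names, toy values, and learning rate are assumptions made for the example.

```python
def fm_predict(x, w0, w, V):
    """FM prediction using the O(kn) form of the second-order term."""
    n, k = len(x), len(V[0])
    out = w0 + sum(w[i] * x[i] for i in range(n))
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        out += 0.5 * (s * s - sum((V[i][f] * x[i]) ** 2 for i in range(n)))
    return out


def sgd_step(x, y, w0, w, V, lr):
    """One plain-SGD update on the squared error (y - y_hat)^2 of a single sample."""
    n, k = len(x), len(V[0])
    err = fm_predict(x, w0, w, V) - y  # d(loss)/d(y_hat) = 2 * err
    sums = [sum(V[j][f] * x[j] for j in range(n)) for f in range(k)]
    new_w0 = w0 - lr * 2 * err
    new_w = [w[i] - lr * 2 * err * x[i] for i in range(n)]
    new_V = [[V[i][f] - lr * 2 * err * (x[i] * sums[f] - V[i][f] * x[i] ** 2)
              for f in range(k)] for i in range(n)]
    return new_w0, new_w, new_V


# Toy sample (made up): one-hot row with two active features and a rating of 4.
x = [0.0, 1.0, 1.0, 0.0]
y = 4.0
w0, w = 0.5, [0.1, 0.2, 0.3, 0.4]
V = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

before = (y - fm_predict(x, w0, w, V)) ** 2
w0, w, V = sgd_step(x, y, w0, w, V, lr=0.01)
after = (y - fm_predict(x, w0, w, V)) ** 2
print("squared error before:", before, "after:", after)
```

Only the parameters attached to active features change: the weights and latent vectors of columns where x is zero receive a zero gradient, which is what keeps updates cheap on sparse one-hot data.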

# -*- coding: utf-8 -*-
"""
author: jamest
date: 20191029
FM function
"""
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from itertools import count
from collections import defaultdict
import tensorflow as tf  # written against the TensorFlow 1.x API

def vectorize_dic(dic, label2index=None, hold_num=None):
    """One-hot encode a dict of feature columns into a CSR matrix."""
    if label2index is None:
        d = count(0)
        label2index = defaultdict(lambda: next(d))  # feature value -> column index
    sample_num = len(list(dic.values())[0])  # number of samples
    feat_num = len(list(dic.keys()))  # number of feature fields
    total_value_num = sample_num * feat_num
    col_ix = np.empty(total_value_num, dtype=int)  # column indices
    i = 0
    for k, lis in dic.items():
        # map each 'users' / 'items' value to its column
        col_ix[i::feat_num] = [label2index[str(k) + str(el)] for el in lis]
        i += 1
    row_ix = np.repeat(np.arange(sample_num), feat_num)
    data = np.ones(total_value_num)
    if hold_num is None:
        hold_num = len(label2index)
    # drop test-set features that never appear in the train set
    left_data_index = np.where(col_ix < hold_num)
    return csr_matrix(
        (data[left_data_index], (row_ix[left_data_index], col_ix[left_data_index])),
        shape=(sample_num, hold_num)), label2index

def batcher(X_, y_, batch_size=-1):
    """Yield (features, labels) mini-batches; batch_size=-1 yields everything at once."""
    assert X_.shape[0] == len(y_)
    n_samples = X_.shape[0]
    if batch_size == -1:
        batch_size = n_samples
    if batch_size < 1:
        raise ValueError('Parameter batch_size={} is unsupported'.format(batch_size))
    for i in range(0, n_samples, batch_size):
        upper_bound = min(i + batch_size, n_samples)
        ret_x = X_[i:upper_bound]
        ret_y = y_[i:upper_bound]
        yield (ret_x, ret_y)

def load_dataset():
    """Load the first 10000 MovieLens-1M ratings and split them 70/30 into train/test."""
    cols = ['user', 'item', 'rating', 'timestamp']
    ratingsPath = '../data/ml-1m/ratings.dat'
    # a multi-character separator requires the Python parsing engine
    ratingsDF = pd.read_csv(ratingsPath, index_col=None, sep='::', header=None,
                            names=cols, engine='python')[:10000]
    ratingsDF = ratingsDF.sample(frac=1.0)  # shuffle all rows
    cut_idx = int(round(0.7 * ratingsDF.shape[0]))
    train, test = ratingsDF.iloc[:cut_idx], ratingsDF.iloc[cut_idx:]
    x_train, label2index = vectorize_dic({'users': train.user.values, 'items': train.item.values})
    # reuse the train mapping and width so unseen test features are dropped
    x_test, label2index = vectorize_dic({'users': test.user.values, 'items': test.item.values}, label2index,
                                        x_train.shape[1])
    y_train = train.rating.values
    y_test = test.rating.values
    # dense ndarrays, so they can be fed to the placeholders directly
    x_train = x_train.toarray()
    x_test = x_test.toarray()
    return x_train, x_test, y_train, y_test

if __name__ == '__main__':
    x_train, x_test, y_train, y_test = load_dataset()
    print("x_train shape: ", x_train.shape)
    print("x_test shape: ", x_test.shape)
    print("y_train shape: ", y_train.shape)
    print("y_test shape: ", y_test.shape)
    # Hyperparameters
    vec_dim = 10  # latent-vector (embedding) dimension
    batch_size = 64
    epochs = 50
    learning_rate = 0.001
    sample_num, feat_num = x_train.shape
    # Graph: inputs, parameters, prediction, loss, optimizer
    x = tf.placeholder(tf.float32, shape=[None, feat_num], name="input_x")
    y = tf.placeholder(tf.float32, shape=[None, 1], name="ground_truth")
    w0 = tf.get_variable(name="bias", shape=[1], dtype=tf.float32)
    W = tf.get_variable(name="linear_w", shape=[feat_num], dtype=tf.float32)
    V = tf.get_variable(name="interaction_w", shape=[feat_num, vec_dim], dtype=tf.float32)
    linear_part = w0 + tf.reduce_sum(tf.multiply(x, W), axis=1, keep_dims=True)
    interaction_part = 0.5 * tf.reduce_sum(tf.square(tf.matmul(x, V)) - tf.matmul(tf.square(x), tf.square(V)), axis=1, keep_dims=True)
    y_hat = linear_part + interaction_part
    loss = tf.reduce_mean(tf.square(y - y_hat))
    train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for e in range(epochs):
            step = 0
            print("epoch:{}".format(e))
            for batch_x, batch_y in batcher(x_train, y_train, batch_size):
                sess.run(train_op, feed_dict={x: batch_x, y: batch_y.reshape(-1, 1)})
                step += 1
                # every 10 steps, report the loss on the current batch and on the full test set
                if step % 10 == 0:
                    for val_x, val_y in batcher(x_test, y_test):
                        train_loss = sess.run(loss, feed_dict={x: batch_x, y: batch_y.reshape(-1, 1)})
                        val_loss = sess.run(loss, feed_dict={x: val_x, y: val_y.reshape(-1, 1)})
                        print("batch train_mse={}, val_mse={}".format(train_loss, val_loss))
        # final evaluation: RMSE on the whole test set
        for val_x, val_y in batcher(x_test, y_test):
            val_loss = sess.run(loss, feed_dict={x: val_x, y: val_y.reshape(-1, 1)})
            print("test set rmse = {}".format(np.sqrt(val_loss)))
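One detail of the code that is easy to miss is how `hold_num` removes test-set features that never occurred in training. The sketch below isolates that mechanism using only the standard library; the feature strings are made up, in the same `str(k) + str(el)` form that `vectorize_dic` builds.

```python
from collections import defaultdict
from itertools import count

# Same trick as vectorize_dic: unseen keys get the next free column index.
counter = count(0)
label2index = defaultdict(lambda: next(counter))

# Training features define the column space.
train_feats = ["users1", "users2", "items10"]
train_cols = [label2index[f] for f in train_feats]
hold_num = len(label2index)  # width of the training matrix

# "users3" never appeared in training, so it is assigned a column >= hold_num.
test_feats = ["users2", "users3", "items10"]
test_cols = [label2index[f] for f in test_feats]

# Keeping only columns below hold_num drops the unseen feature.
kept = [c for c in test_cols if c < hold_num]
print(train_cols, hold_num, test_cols, kept)
# prints: [0, 1, 2] 3 [1, 3, 2] [1, 2]
```

A test row whose user or item was filtered this way simply has fewer active columns, so its prediction falls back on the remaining terms.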

References:

FM series

GitHub
