Recommender Systems: A Spark Implementation of Alternating Least Squares (ALS)

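1. ALS workflow:

Initialize the dataset and the Spark environment ---->

Split the data into a training set and a validation set ---->

Train the ALS model ---->

Validate the results ---->

If the validation results are acceptable, recommend products directly; otherwise, continue training the ALS model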
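The workflow above can be sketched in plain Scala, without Spark, to show the control flow. This is only an illustration under stated assumptions: `Obs`, `split`, `rmse`, and `trainUntilGoodEnough` are hypothetical names, root-mean-square error is assumed as the validation metric, and the `train` parameter stands in for a real training step such as `ALS.train`. In Spark the split itself would typically be done with `RDD.randomSplit`.

```scala
import scala.util.Random

object AlsWorkflowSketch {
  // One observed rating; a local stand-in for MLlib's Rating.
  final case class Obs(user: Int, product: Int, rating: Double)

  // A trained model is reduced here to a function (user, product) => predicted rating.
  type Predictor = (Int, Int) => Double

  /** Shuffles with a fixed seed and cuts the data into (training, validation). */
  def split(data: Seq[Obs], trainFraction: Double, seed: Long): (Seq[Obs], Seq[Obs]) = {
    val shuffled = new Random(seed).shuffle(data)
    shuffled.splitAt((shuffled.size * trainFraction).toInt)
  }

  /** Root-mean-square error on the validation set (assumed to be non-empty). */
  def rmse(predict: Predictor, validation: Seq[Obs]): Double = {
    val squaredErrors = validation.map(o => math.pow(predict(o.user, o.product) - o.rating, 2))
    math.sqrt(squaredErrors.sum / validation.size)
  }

  /** Retrains with a growing iteration count until the error is acceptable or maxRounds is hit. */
  def trainUntilGoodEnough(
      train: Int => Predictor, // hypothetical: iteration count => trained model
      validation: Seq[Obs],
      targetRmse: Double,
      maxRounds: Int): (Predictor, Double, Int) = {
    var round = 1
    var model = train(round)
    var error = rmse(model, validation)
    while (error > targetRmse && round < maxRounds) {
      round += 1
      model = train(round)
      error = rmse(model, validation)
    }
    (model, error, round)
  }
}
```

The returned triple holds the last model, its validation error, and the number of rounds used, so the caller can decide whether to go on to the recommendation step.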

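2. What the dataset contains

Rating is the fixed input format for ALS. It is equivalent to a tuple of type [Int, Int, Double], so when building the dataset, user names and item names must be replaced with integer IDs.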

 /**
  * A more compact class to represent a rating than Tuple3[Int, Int, Double].
  */
 @Since("0.8.0")
 case class Rating @Since("0.8.0") (
     @Since("0.8.0") user: Int,
     @Since("0.8.0") product: Int,
     @Since("0.8.0") rating: Double)

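In the data file, the first column is the user ID, the second column is the product ID, and the third column is the rating, which is a Double.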
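Because Rating only accepts Int IDs, raw data keyed by names has to be encoded first. The following is a minimal sketch of one way to do that in plain Scala; `IdEncoding`, `buildIndex`, and `encode` are hypothetical names, and the nested `Rating` is a local stand-in with the same fields as MLlib's class so the example runs without Spark.

```scala
object IdEncoding {
  // Local stand-in with the same fields as MLlib's Rating.
  final case class Rating(user: Int, product: Int, rating: Double)

  /** Assigns consecutive Int IDs (0, 1, 2, ...) to names in order of first appearance. */
  def buildIndex(names: Seq[String]): Map[String, Int] =
    names.distinct.zipWithIndex.toMap

  /** Replaces user and item names with Int IDs and returns the lookup tables as well. */
  def encode(raw: Seq[(String, String, Double)]): (Seq[Rating], Map[String, Int], Map[String, Int]) = {
    val userIds = buildIndex(raw.map(_._1))
    val itemIds = buildIndex(raw.map(_._2))
    val ratings = raw.map { case (user, item, score) => Rating(userIds(user), itemIds(item), score) }
    (ratings, userIds, itemIds)
  }
}
```

Keeping the two maps around makes it possible to translate recommended product IDs back into item names afterwards.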

3. Walking through the ALS source code with a test dataset

3.1 All fields of the ALS class

 @Since("0.8.0")
 class ALS private (
     private var numUserBlocks: Int,
     private var numProductBlocks: Int,
     private var rank: Int,
     private var iterations: Int,
     private var lambda: Double,
     private var implicitPrefs: Boolean,  // true: use the implicit-feedback ALS variant; false: use the explicit-feedback variant
     private var alpha: Double,           // implicit-feedback variant only: governs the baseline confidence in observed preferences
     private var seed: Long = System.nanoTime()
   ) extends Serializable with Logging {
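To make the roles of implicitPrefs and alpha more concrete, here is a minimal sketch of the confidence weighting used in implicit-feedback ALS, following the formulation of Hu, Koren, and Volinsky that the MLlib guide cites. It is an illustration of the formula, not MLlib's internal code; `ImplicitFeedbackSketch`, `preference`, and `confidence` are hypothetical names.

```scala
object ImplicitFeedbackSketch {
  /** Binary preference: 1.0 if a positive interaction was observed, otherwise 0.0. */
  def preference(observed: Double): Double =
    if (observed > 0) 1.0 else 0.0

  /** Confidence in that preference: starts at 1.0 and grows linearly with the observed strength, scaled by alpha. */
  def confidence(observed: Double, alpha: Double): Double =
    1.0 + alpha * math.abs(observed)
}
```

With explicit feedback (implicitPrefs = false, as in the train method in 3.2) the ratings are fitted directly and alpha has no effect.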

3.2 The ALS.train method

 /**
    * Train a matrix factorization model given an RDD of ratings given by users to some products,
    * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the
    * product of two lower-rank matrices of a given rank (number of features). To solve for these
    * features, we run a given number of iterations of ALS. This is done using a level of
    * parallelism given by `blocks`.
    *
    * @param ratings    RDD of (userID, productID, rating) pairs
    * @param rank       number of features to use
    * @param iterations number of iterations of ALS (recommended: 10-20)
    * @param lambda     regularization factor (recommended: 0.01)
    * @param blocks     level of parallelism to split computation into
    * @param seed       random seed
    */
   @Since("0.9.1")
   def train(
       ratings: RDD[Rating], // RDD of ratings, each holding a user ID, a product ID, and a rating
       rank: Int,            // number of latent factors in the model
       iterations: Int,      // number of ALS iterations
       lambda: Double,       // ALS regularization parameter
       blocks: Int,          // number of blocks used to parallelize the computation
       seed: Long
     ): MatrixFactorizationModel = {
     new ALS(blocks, blocks, rank, iterations, lambda, false, 1.0, seed).run(ratings)
   }
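The alternating idea behind train can be illustrated with a deliberately simplified plain-Scala sketch for rank 1, where each least-squares step has a closed form: with the product factors fixed, each user factor is sum(rating * productFactor) / (lambda + sum(productFactor^2)), and vice versa. This is not MLlib's implementation, which solves rank-k systems block by block in parallel and may scale the regularization differently; `RankOneAls`, `solve`, and the initial factor value 1.0 are assumptions made for this example.

```scala
object RankOneAls {
  // Local stand-in with the same fields as MLlib's Rating.
  final case class Rating(user: Int, product: Int, rating: Double)

  /** Half-step: with the "column" factors fixed, solve the regularized least-squares problem for each "row". */
  private def solve(
      ratings: Seq[Rating],
      rowOf: Rating => Int,
      colOf: Rating => Int,
      cols: Map[Int, Double],
      lambda: Double): Map[Int, Double] =
    ratings.groupBy(rowOf).map { case (row, rs) =>
      val numerator = rs.map(r => r.rating * cols(colOf(r))).sum
      val denominator = lambda + rs.map(r => cols(colOf(r)) * cols(colOf(r))).sum
      row -> numerator / denominator
    }

  /** Alternates between solving for user factors and product factors; returns (userFactors, productFactors). */
  def train(ratings: Seq[Rating], iterations: Int, lambda: Double): (Map[Int, Double], Map[Int, Double]) = {
    // Assumption: start every product factor at 1.0 (MLlib initializes randomly from `seed`).
    var products = ratings.map(_.product).distinct.map(_ -> 1.0).toMap
    var users = Map.empty[Int, Double]
    for (_ <- 1 to iterations) {
      users = solve(ratings, _.user, _.product, products, lambda)
      products = solve(ratings, _.product, _.user, users, lambda)
    }
    (users, products)
  }
}
```

The predicted rating for a (user, product) pair is then the product of the two factors; for rank k it becomes the dot product of two k-dimensional vectors.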

3.3 Collaborative-filtering recommendation based on the ALS algorithm

 package com.bigdata.demo

 import org.apache.spark.{SparkContext, SparkConf}
 import org.apache.spark.mllib.recommendation.ALS
 import org.apache.spark.mllib.recommendation.Rating

 /**
   * Created by SimonsZhao on 3/30/2017.
   * Alternating least squares (ALS)
   */
 object CollaborativeFilter {
   def main(args: Array[String]): Unit = {
     // Configure Spark: run locally and set the application name
     val conf = new SparkConf().setMaster("local").setAppName("CollaborativeFilter")
     // Create the Spark context
     val sc = new SparkContext(conf)
     // Load the dataset: one "user item rating" triple per line, separated by single spaces
     val data = sc.textFile("E:/scala/spark/testdata/ALSTest.txt")
     // Parse each line into MLlib's Rating type
     val ratings = data.map(_.split(' ') match {
       case Array(user, item, rate) =>
         Rating(user.toInt, item.toInt, rate.toDouble)
     })
     // Number of latent factors
     val rank = 2
     // Number of iterations
     val numIterations = 2
     // Train the model; 0.01 is the regularization parameter lambda
     val model = ALS.train(ratings, rank, numIterations, 0.01)
     // Recommend one product to user 2
     val rs = model.recommendProducts(2, 1)
     // Print the result
     rs.foreach(println)
     // Release Spark resources
     sc.stop()
   }
 }

4. Test and analysis

The output shows that product 15 was recommended to user 2, with a predicted rating of 3.99.

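5. User-based recommendation: source code (mllib)

What the doc comment says:

Recommends products to a user.

num is how many products to return. The number returned may be less than this.

The method returns Rating objects, each containing the given user ID, a product ID, and a "score" in the rating field. Each represents one recommended product, and they are sorted by score in descending order. The first one returned is the product predicted to be most strongly recommended to the user. The score is an opaque value that indicates how strongly the product is recommended.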

   /**
    * Recommends products to a user.
    *
    * @param user the user to recommend products to
    * @param num how many products to return. The number returned may be less than this.
    * @return [[Rating]] objects, each of which contains the given user ID, a product ID, and a
    *  "score" in the rating field. Each represents one recommended product, and they are sorted
    *  by score, decreasing. The first returned is the one predicted to be most strongly
    *  recommended to the user. The score is an opaque value that indicates how strongly
    *  recommended the product is.
    */
   @Since("1.1.0")
   def recommendProducts(user: Int, num: Int): Array[Rating] =
     MatrixFactorizationModel.recommend(userFeatures.lookup(user).head, productFeatures, num)
       .map(t => Rating(user, t._1, t._2))
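What the recommend helper does with the user's feature vector can be sketched in plain Scala: score every candidate by the dot product of the two feature vectors, then keep the num highest scores. This is an illustration on in-memory collections rather than the RDD-based library code; `TopKSketch`, `dot`, and the sample feature values in the usage note are hypothetical.

```scala
object TopKSketch {
  /** Dot product of two feature vectors of equal length. */
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  /** Scores every candidate against the query vector and returns the `num` best (id, score) pairs, highest first. */
  def recommend(query: Array[Double], candidates: Map[Int, Array[Double]], num: Int): Seq[(Int, Double)] =
    candidates.toSeq
      .map { case (id, features) => (id, dot(query, features)) }
      .sortBy { case (_, score) => -score }
      .take(num)
}
```

For example, with a user vector Array(1.0, 2.0) and product vectors 10 -> Array(1.0, 0.0), 15 -> Array(1.0, 1.5), and 20 -> Array(0.0, 1.0), the scores are 1.0, 4.0, and 2.0, so product 15 is ranked first. Recommending users to a product is the same computation with the roles of the two feature sets swapped.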

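6. Item-based recommendation: source code (mllib)

What the doc comment says:

Recommends users to a product. That is, it returns the users who are most likely to be interested in the product.

Each returned Rating contains a user ID, the given product ID, and a "score" in the rating field.

Each represents one recommended user, and they are sorted by score in descending order.

The first one returned is the user predicted to be most strongly recommended to the product.

The score is an opaque value that indicates how strongly the user is recommended.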

   /**
    * Recommends users to a product. That is, this returns users who are most likely to be
    * interested in a product.
    *
    * @param product the product to recommend users to
    * @param num how many users to return. The number returned may be less than this.
    * @return [[Rating]] objects, each of which contains a user ID, the given product ID, and a
    *  "score" in the rating field. Each represents one recommended user, and they are sorted
    *  by score, decreasing. The first returned is the one predicted to be most strongly
    *  recommended to the product. The score is an opaque value that indicates how strongly
    *  recommended the user is.
    */
   @Since("1.1.0")
   def recommendUsers(product: Int, num: Int): Array[Rating] =
     MatrixFactorizationModel.recommend(productFeatures.lookup(product).head, userFeatures, num)
       .map(t => Rating(t._1, product, t._2))
