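Contents

Software
Steps
Controlling relevance
Summary
References

This article shows how to build a recommendation engine with very little code, using the MapR Sandbox for Hadoop with Apache Mahout and Elasticsearch.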

This tutorial will give step-by-step instructions on how to:

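Use the movie rating data available at http://grouplens.org/datasets/movielens/
Build and train a machine learning model with Apache Mahout's collaborative filtering
Use Elasticsearch's search technology to simplify the development of the recommender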

This article has moved to: http://www.bdata-cap.com/newsinfo/1712675.html

Software

This tutorial runs on the MapR Sandbox. It also requires Elasticsearch and Mahout to be installed on the Sandbox.

Download the 10M MovieLens data from http://grouplens.org/datasets/movielens/
Install Mahout
Install Elasticsearch

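Steps

Step 1: Index the movie metadata into Elasticsearch

In Elasticsearch, all fields of a document are indexed by default. The simplest documents have a single level of JSON structure. Documents are contained in an index, and the type of a document tells Elasticsearch how to interpret its fields.

You can think of an Elasticsearch index as a database instance in a relational database, a type as a table, a field as a column definition (although fields are a broader concept in Elasticsearch), and a document as one row of the table.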

In this example the document type is film, with the following fields: the movie ID (id), the title (title), the release year (year), the genre tags (genre), the indicators (indicators), and the number of entries in the indicators array (numFields):

{
  "id": "65006",
  "title": "Impulse",
  "year": "2008",
  "genre": ["Mystery","Thriller"],
  "indicators": ["154","272","154","308", "535", "583", "593", "668", "670", "680", "702", "745"],
  "numFields": 12
}

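You communicate with Elasticsearch through its RESTful API on port 9200, for example from the command line with curl. See the Elasticsearch REST interface and the Elasticsearch 101 tutorial. A curl request has the following general form: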

curl -X<VERB> 'http://<HOST>/<PATH>?<QUERY_STRING>' -d '<BODY>'

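The put mapping command of Elasticsearch's REST API defines a document type. The request below creates a mapping named film in the bigmovie index. The mapping defines a numFields field of type integer. By default all fields are stored and indexed, integers included.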

curl -XPUT 'http://localhost:9200/bigmovie' -d '
{
  "mappings": {
    "film" : {
      "properties" : {
        "numFields" : { "type" : "integer" }
      }
    }
  }
}'
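The same request can be prepared from Python. The following is a minimal sketch, assuming Elasticsearch listens on localhost:9200 as in the curl command above. The helper name build_mapping_request is hypothetical, the Content-Type header is an addition that is not part of the curl command, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_mapping_request(host="localhost:9200", index="bigmovie"):
    """Build (but do not send) the PUT request that creates the index with the film mapping."""
    body = {
        "mappings": {
            "film": {
                "properties": {
                    "numFields": {"type": "integer"}
                }
            }
        }
    }
    return urllib.request.Request(
        url="http://%s/%s" % (host, index),
        data=json.dumps(body).encode("utf-8"),
        # Assumption: not in the original curl command, added as a precaution.
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_mapping_request()
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it against a running instance:
# urllib.request.urlopen(request)
```

Building the body as a dictionary and serializing it with json.dumps avoids the unbalanced-brace mistakes that are easy to make when typing JSON inside a shell string.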

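The movie information is contained in the movies.dat file. Each line of the file represents one movie, with the fields laid out as follows:

MovieID::Title::Genres

For example: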

65006::Impulse (2008)::Mystery|Thriller

Figure 1: The film Impulse (2008), genres Mystery and Thriller

The Python script below converts the data in movies.dat to JSON so that it can be imported into Elasticsearch:

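import json
import re

# Convert movies.dat (MovieID::Title::Genres) into Elasticsearch bulk-API lines.
with open('movies.dat', encoding='utf-8') as movie_file:
    for line in movie_file:
        fixed = line.rstrip().split('::')
        if len(fixed) == 3:
            # Drop double quotes and the trailing " (year)" from the title.
            title = re.sub(r' \(.*\)$', '', fixed[1].replace('"', ''))
            genre = fixed[2].split('|')
            # Action line: create the document in bigmovie/film under the movie ID.
            print('{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "%s" } }' % fixed[0])
            # Source line: the year is the four characters inside the final parentheses.
            print('{ "id": "%s", "title" : "%s", "year":"%s" , "genre":%s }'
                  % (fixed[0], title, fixed[1][-5:-1], json.dumps(genre)))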

Run the Python file and redirect the converted output to index.json:

$ python index.py > index.json

This produces the following format, which is what Elasticsearch expects:

{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "1" } }

{ "id": "1", "title" : "Toy Story", "year":"1995" , "genre":["Adventure", "Animation", "Children", "Comedy", "Fantasy"] }

{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "2" } }

{ "id": "2", "title" : "Jumanji", "year":"1995" , "genre":["Adventure", "Children", "Fantasy"] }

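Each pair of lines in the file names the index and type and then adds one movie's information. This is the format Elasticsearch uses for bulk loading.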
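To make the pairing explicit, here is a minimal sketch that reads such a file as (action, source) pairs. It assumes every action is a create followed by exactly one source line, as in index.json; the function name read_bulk_pairs is hypothetical.

```python
import json


def read_bulk_pairs(lines):
    """Yield (action, source) dicts from bulk-format lines, skipping blank lines."""
    payload = [json.loads(line) for line in lines if line.strip()]
    if len(payload) % 2 != 0:
        raise ValueError("expected an action line followed by a source line")
    for action, source in zip(payload[0::2], payload[1::2]):
        yield action, source


sample = [
    '{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "1" } }',
    '{ "id": "1", "title" : "Toy Story", "year":"1995" , "genre":["Adventure", "Animation", "Children", "Comedy", "Fantasy"] }',
    '{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "2" } }',
    '{ "id": "2", "title" : "Jumanji", "year":"1995" , "genre":["Adventure", "Children", "Fantasy"] }',
]

for action, source in read_bulk_pairs(sample):
    # Prints the document ID from the action line and the title from the source line.
    print(action["create"]["_id"], source["title"])
```

A check like this is a cheap way to catch a malformed line before posting the whole file.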

The Elasticsearch bulk API performs many operations on an index (such as create, index, update and delete) in a single request. The command below has Elasticsearch bulk load the contents of index.json:

curl -s -XPOST localhost:9200/_bulk --data-binary @index.json; echo

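Once the movie information is loaded, you can query it through the REST API. You can also use Sense, an Elasticsearch plugin for Chrome (also provided as a Kibana 4 plugin). An example is shown below:

The following retrieves the movie with id 1237: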
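A minimal sketch of such a lookup from Python, assuming the document is reachable at /bigmovie/film/<id> on localhost:9200. The function names movie_url and fetch_movie are hypothetical, and fetch_movie needs a running Elasticsearch instance, so only the URL is printed here.

```python
import json
import urllib.request


def movie_url(movie_id, host="localhost:9200"):
    """Return the document URL for a film in the bigmovie index."""
    return "http://%s/bigmovie/film/%s" % (host, movie_id)


def fetch_movie(movie_id, host="localhost:9200"):
    """Fetch one film document; requires a running Elasticsearch instance."""
    with urllib.request.urlopen(movie_url(movie_id, host)) as response:
        return json.loads(response.read().decode("utf-8"))


print(movie_url("1237"))
```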

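Step 2: Use Mahout to create movie indicators from the user rating data

The ratings are contained in the ratings.dat file. Each line of the file represents one user's rating of one movie, in the following format: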

UserID::MovieID::Rating::Timestamp

For example:

71567::2294::5::912577968

71567::2338::2::912578016

The ratings.dat file uses "::" as its delimiter, which has to be converted to tabs before Mahout can use it. The sed command below replaces every :: with a tab:

sed -i 's/::/\t/g' ratings.dat

This command opens the file, replaces each "::" with "\t", and saves it again in place. In-place updates are only supported with MapR NFS, so this command probably won't work on other NFS-on-Hadoop implementations. MapR Direct Access NFS allows files to be modified (it supports random reads and writes) and accessed by mounting the Hadoop cluster over NFS.

The sed command produces content in the following format, which can be used as input to Mahout:

71567    2294    5    912577968

71567    2338    2    912578016

The general format is: user item rating timestamp. This example does not use the timestamp.
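Where sed is not available, the same conversion can be sketched in Python. This is a minimal illustration working on single lines rather than on the file in place; the function names are hypothetical. The rating is parsed as a float on the assumption that the data set may contain fractional ratings.

```python
def ratings_line_to_tsv(line):
    """Replace the '::' separators of one ratings.dat line with tabs."""
    return line.rstrip("\n").replace("::", "\t")


def parse_rating(tsv_line):
    """Split a converted line into (user, item, rating); the timestamp is dropped."""
    user, item, rating, _timestamp = tsv_line.split("\t")
    return user, item, float(rating)


converted = ratings_line_to_tsv("71567::2294::5::912577968\n")
print(converted)
print(parse_rating(converted))
```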

Start the Mahout item similarity (itemsimilarity) job with the following command:

mahout itemsimilarity \
  --input /user/user01/mlinput/ratings.dat \
  --output /user/user01/mloutput \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --booleanData TRUE \
  --tempDir /user/user01/temp

The argument "-s SIMILARITY_LOGLIKELIHOOD" (the short form of --similarityClassname) tells the recommender to use the Log Likelihood Ratio (LLR) method to determine which items co-occur anomalously often, and therefore which co-occurrences can be used as indicators of preference. The similarity threshold defaults to 0.9; it can be adjusted for the use case with the --threshold parameter, which discards pairs with lower similarity (the default is a fine choice). Mahout computes the recommendations by launching a number of Hadoop MapReduce jobs and finally produces an output file in the /user/user01/mloutput directory. The output file has the following format:

64957   64997   0.9604835425701245

64957   65126   0.919355104432831

64957   65133   0.9580439772229588

The general format is: item1id item2id similarity.
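To give a feel for where such similarity values come from, the following is a minimal sketch of the log-likelihood ratio for a 2x2 co-occurrence table, written in the entropy form commonly used for this statistic. It is an illustration, not Mahout's code: the function names are hypothetical, and the final mapping of the LLR to a value between 0 and 1 is an assumption chosen to resemble the range of the output above.

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts: N*ln(N) - sum(c*ln(c))."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for a 2x2 table of user counts.

    k11: rated both items      k12: rated item A only
    k21: rated item B only     k22: rated neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def similarity(llr):
    """Assumed mapping of an LLR in [0, inf) to a score in [0, 1)."""
    return 1.0 - 1.0 / (1.0 + llr)


# Independent items give an LLR near 0; strong co-occurrence gives a large one.
print(log_likelihood_ratio(10, 10, 10, 10))
print(similarity(log_likelihood_ratio(100, 10, 10, 1000)))
```

The point of LLR over raw co-occurrence counts is that it discounts pairs that co-occur often only because both items are popular.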

Step 3: Add the movie indicators to the movie documents in Elasticsearch

Next, we add the indicators from the output file above to the film documents in Elasticsearch. For example, a movie's indicators go into its indicators field:

{
  "id": "65006",
  "title": "Impulse",
  "year": "2008",
  "genre": ["Mystery","Thriller"],
  "indicators": ["1076", "1936", "2057", "2204"],
  "numFields": 4
}

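The table on the left shows which indicators each document contains, and the table on the right shows which documents contain a given indicator:

Figure 2: Documents and indicators

If you search for movies whose indicators include both 1237 and 551, this example returns the document (movie) with id 8298. If you search for 1237 or 551, it returns the movies with ids 8298, 3 and 64418.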
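The two tables correspond to a forward view (document to indicators) and an inverted index (indicator to documents). The sketch below rebuilds that idea on toy data. The text only states the query results, so which of films 3 and 64418 carries which indicator is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy forward view: film id -> indicators.
# Assumption: the split of 1237 and 551 between films 3 and 64418 is invented.
films = {
    "8298": ["1237", "551"],
    "3": ["551"],
    "64418": ["1237"],
}


def build_inverted_index(docs):
    """Map each indicator to the set of film ids that list it."""
    index = defaultdict(set)
    for doc_id, indicators in docs.items():
        for indicator in indicators:
            index[indicator].add(doc_id)
    return index


def search_all(index, terms):
    """AND search: films that list every term."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def search_any(index, terms):
    """OR search: films that list at least one term."""
    return set().union(*(index.get(term, set()) for term in terms))


index = build_inverted_index(films)
print(sorted(search_all(index, ["1237", "551"])))  # ['8298']
print(sorted(search_any(index, ["1237", "551"])))  # ['3', '64418', '8298']
```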

The script below reads the Mahout output file part-r-00000, builds an indicators array for each movie, and writes a JSON file that is used to update the indicators field of the film type in the Elasticsearch bigmovie index.

import csv
import json

### read the output from MAHOUT and emit one bulk update per movie ###

def emit(movie_id, indicators):
    # Action line, then the partial document carrying the indicators.
    print(json.dumps({"update": {"_id": movie_id}}))
    print(json.dumps({"doc": {"indicators": indicators, "numFields": len(indicators)}}))

with open('/user/user01/mloutput/part-r-00000', newline='') as csv_file:
    old_id = ""
    indicators = []
    # Rows are item1 <tab> item2 <tab> similarity, grouped by item1.
    for row in csv.reader(csv_file, delimiter='\t'):
        movie_id = row[0]
        if movie_id != old_id and old_id != "":
            emit(old_id, indicators)
            indicators = [row[1]]
        else:
            indicators.append(row[1])
        old_id = movie_id
    # Emit the last movie, which the loop itself never flushes.
    if old_id != "":
        emit(old_id, indicators)

The command below runs the update.py script and writes its output to update.json:

$ python update.py > update.json

The Python script above creates a file with the following content:

{"update": {"_id": "1"}}

{"doc": {"indicators": ["75", "118", "494", "512", "609", "626", "631", "634", "648", "711", "761", "810", "837", "881", "910", "1022", "1030", "1064", "1301", "1373", "1390", "1588", "1806", "2053", "2083", "2090", "2096", "2102", "2286", "2375", "2378", "2641", "2857", "2947", "3147", "3429", "3438", "3440", "3471", "3483", "3712", "3799", "3836", "4016", "4149", "4544", "4545", "4720", "4732", "4901", "5004", "5159", "5309", "5313", "5323", "5419", "5574", "5803", "5841", "5902", "5940", "6156", "6208", "6250", "6383", "6618", "6713", "6889", "6890", "6909", "6944", "7046", "7099", "7281", "7367", "7374", "7439", "7451", "7980", "8387", "8666", "8780", "8819", "8875", "8974", "9009", "25947", "27721", "31660", "32300", "33646", "40339", "42725", "45517", "46322", "46559", "46972", "47384", "48150", "49272", "55668", "63808"], "numFields": 102}}

{"update": {"_id": "2"}}

{"doc": {"indicators": ["15", "62", "153", "163", "181", "231", "239", "280", "333", "355", "374", "436", "473", "485", "489", "502", "505", "544", "546", "742", "829", "1021", "1474", "1562", "1588", "1590", "1713", "1920", "1967", "2002", "2012", "2045", "2115", "2116", "2139", "2143", "2162", "2296", "2338", "2399", "2408", "2447", "2616", "2793", "2798", "2822", "3157", "3243", "3327", "3438", "3440", "3477", "3591", "3614", "3668", "3802", "3869", "3968", "3972", "4090", "4103", "4247", "4370", "4467", "4677", "4686", "4846", "4967", "4980", "5283", "5313", "5810", "5843", "5970", "6095", "6383", "6385", "6550", "6764", "6863", "6881", "6888", "6952", "7317", "8424", "8536", "8633", "8641", "26870", "27772", "31658", "32954", "33004", "34334", "34437", "39419", "40278", "42011", "45210", "45447", "45720", "48142", "50347", "53464", "55553", "57528"], "numFields": 106}}

On the command line, call the Elasticsearch REST bulk request with curl, passing update.json as input, to update the indicators field:

$ curl -s -XPOST localhost:9200/bigmovie/film/_bulk --data-binary @update.json; echo
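The grouping performed by update.py can also be tried on a few in-memory rows before running it against the full Mahout output. The following is a minimal sketch using itertools.groupby, fed with the three sample rows from Step 2. The function name bulk_update_lines is hypothetical, and the rows are assumed to be sorted by the first column.

```python
import itertools
import json


def bulk_update_lines(rows):
    """Turn (item1, item2, similarity) rows into bulk update lines.

    groupby only merges adjacent rows, so rows must be sorted by item1.
    """
    lines = []
    for movie_id, group in itertools.groupby(rows, key=lambda row: row[0]):
        indicators = [row[1] for row in group]
        lines.append(json.dumps({"update": {"_id": movie_id}}))
        lines.append(json.dumps({"doc": {"indicators": indicators,
                                         "numFields": len(indicators)}}))
    return lines


rows = [
    ("64957", "64997", "0.9604835425701245"),
    ("64957", "65126", "0.919355104432831"),
    ("64957", "65133", "0.9580439772229588"),
]
for line in bulk_update_lines(rows):
    print(line)
```

For these rows the sketch prints one update action for movie 64957 followed by a document with three indicators and numFields set to 3.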

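Step 4: Query the indicators field of the film index to make recommendations

You can now query the indicators field of the film documents to make recommendations. For example, if someone likes movies 1237 and 551 and you want to recommend similar movies, the Elasticsearch query below returns movies whose indicators array contains 1237 or 551, where 1237 = Seventh Seal and 551 = Nightmare Before Christmas: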

curl 'http://localhost:9200/bigmovie/film/_search?pretty' -d '
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [ { "match": { "indicators":"1237 551"} } ],
          "must_not": [ { "ids": { "values": ["1237", "551"] } } ]
        }
      },
      "functions":[ {"random_score": {"seed":"48" } } ],
      "score_mode":"sum"
    }
  },
  "fields":["_id","title","genre"],
  "size":"8"
}'

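The query above finds movies whose indicators include 1237 or 551, excluding movies 1237 and 551 themselves. The example below runs the query in the Sense plugin, with the results on the right; the recommendations are "A Man Named Pearl" (a documentary) and "Used People".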
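For an arbitrary list of liked movies, the query body can be generated rather than typed by hand. The following is a minimal sketch that mirrors the structure of the query in this step; the function name build_recommendation_query and its parameters are hypothetical, and the result still has to be posted to the _search endpoint.

```python
import json


def build_recommendation_query(liked_ids, seed=48, size=8):
    """Build the search body: match the liked ids as indicators, exclude the liked films."""
    liked = [str(movie_id) for movie_id in liked_ids]
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        # Films whose indicators mention any liked film...
                        "must": [{"match": {"indicators": " ".join(liked)}}],
                        # ...but not the liked films themselves.
                        "must_not": [{"ids": {"values": liked}}],
                    }
                },
                "functions": [{"random_score": {"seed": str(seed)}}],
                "score_mode": "sum",
            }
        },
        "fields": ["_id", "title", "genre"],
        "size": str(size),
    }


print(json.dumps(build_recommendation_query(["1237", "551"]), indent=2))
```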

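Controlling relevance

Full-text search engines rank results by relevance, and Elasticsearch reports each document's relevance score in the _score field. function_score lets you modify that score at query time. random_score generates a score by hashing with a seed value. In the Elasticsearch query below, the random_score function adds variation to the search results in order to achieve dithering:

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [ { "match": { "indicators":"1237 551"} } ],
          "must_not": [ { "ids": { "values": ["1237", "551"] } } ]
        }
      },
      "functions":[ {"random_score": {"seed":"48" } } ],
      "score_mode":"sum"
    }
  }
}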

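Relevance dithering deliberately places some less relevant results among the top-ranked ones, in order to broaden the training data that is fed back to the recommendation engine. Without dithering, tomorrow's training data only teaches the model what it already knows today. Adding dithering helps the recommendation model explore: if the model's answers are close to excellent, dithering can help it find the right answer. Effective dithering reduces today's accuracy slightly and improves tomorrow's training data (and with it future performance, where the accuracy of the algorithm counts as performance). In other words, to keep future recommendations accurate, the influence of the past on the future has to be reduced.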
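Dithering can also be applied on the client side after the results come back. The sketch below uses a rank-plus-noise scheme: each item gets the score log(rank) plus Gaussian noise, and the list is re-sorted by that score. This is a different technique from the random_score query above and is shown only as an illustration; the function name dither and the epsilon parameter are assumptions.

```python
import math
import random


def dither(ranked_items, epsilon=1.5, seed=None):
    """Re-rank a best-first list by log(rank) plus Gaussian noise.

    epsilon must be >= 1. With epsilon = 1 the noise is zero and the order is
    unchanged; larger values let lower-ranked items climb more often.
    """
    rng = random.Random(seed)
    sigma = math.log(epsilon)
    noisy = [
        (math.log(rank) + rng.gauss(0.0, sigma), item)
        for rank, item in enumerate(ranked_items, start=1)
    ]
    noisy.sort(key=lambda pair: pair[0])
    return [item for _score, item in noisy]


results = ["film-a", "film-b", "film-c", "film-d", "film-e"]
print(dither(results, epsilon=1.0))
print(dither(results, epsilon=2.0, seed=48))
```

Because the noise is added to the logarithm of the rank, the top few positions are shuffled only mildly while items further down the list move around more freely.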

Summary

We showed in this tutorial how to use Apache Mahout and Elasticsearch with the MapR Sandbox to build a basic recommendation engine. You can go beyond a basic recommender and get even better results with a few simple additions to the design to add cross recommendation of items, which leverages a variety of interactions and items for making recommendations. You can find more information about these technologies here:

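References

To learn more about the components and logic of a recommendation engine, see "An Inside Look at the Components of a Recommendation Engine", which describes in detail the architecture of a recommendation engine, Mahout collaborative filtering, and the Elasticsearch search engine.

More resources on recommendation engines, machine learning and Elasticsearch are listed below: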

Tutorial Category Reference:
