[Original post: http://engineering.richrelevance.com/bandits-recommendation-systems/]

[This repost: http://www.cnblogs.com/breezedeus/p/3775316.html. Please credit the source when reposting.]

Bandits for Recommendation Systems

06/02/2014 • Topics: Bayesian, Big data, Data Science

by Sergey Feldman

This is the first in a series of three blog posts on bandits for recommendation systems.

In this blog post, we will discuss the bandit problem and how it relates to online recommender systems.  Then, we'll cover some classic algorithms and see how well they do in simulation.

A common problem for internet-based companies is: which piece of content should we display?  Google has this problem (which ad to show), Facebook has this problem (which friend's post to show), and RichRelevance has this problem (which product recommendation to show).  Many of the promising solutions come from the study of the multi-armed bandit problem.  A one-armed "bandit" is another way to say slot machine (probably because both will leave you with empty pockets).  Here is a description that I hijacked from Wikipedia:

"The multi-armed bandit problem is the problem a gambler faces at a row of slot machines when deciding which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls."

Let's rewrite this in retail language.  Each time a shopper looks at a webpage, we have to show them one of K product recommendations.  They either click on it or do not, and we log this (binary) reward.  Next, we proceed to either the next shopper or the next page view of this shopper and have to choose one of K product recommendations again. (Actually, we have to choose multiple recommendations per page, and our 'reward' could instead be sales revenue, but let's ignore these aspects for now.)

Multi-armed bandits come in two flavors: stochastic and adversarial.  The stochastic case is where each bandit doesn't change in response to your actions, while in the adversarial case the bandits learn from your actions and adjust their behavior to minimize your rewards.  We care about the stochastic case, and our goal is to find the arm which has the largest expected reward.  I will index the arms by a, and the probability distribution over possible rewards r for each arm a can be written as \( p_a(r) \).  We have to find the arm with the largest mean reward

\( \mu_a = E_a[r] \)

as quickly as possible while accumulating the most rewards along the way.  One important point is that in practice \( p_a(r)\) are non-stationary, that is, rewards change over time, and we have to take that into account when we design our algorithms.
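To make the setup concrete, here is a minimal sketch (in Python with NumPy) of a stochastic Bernoulli bandit environment of the kind described above; the class name and the click probabilities are made up for illustration.

```python
import numpy as np

class BernoulliBandit:
    """A stochastic K-armed bandit with binary (click / no-click) rewards."""

    def __init__(self, click_probabilities):
        # One unknown mean reward mu_a per arm; these are what the algorithms
        # below have to estimate from pulls alone.
        self.probs = np.asarray(click_probabilities, dtype=float)
        self.n_arms = len(self.probs)

    def pull(self, arm):
        # Reward is 1 with probability mu_a and 0 otherwise.
        return float(np.random.rand() < self.probs[arm])

# Hypothetical click-through rates for K=5 recommendations.
bandit = BernoulliBandit([0.02, 0.03, 0.05, 0.04, 0.01])
reward = bandit.pull(2)
```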

Approach #1: A Naive Algorithm

We need to figure out the mean reward (expected value) of each arm.  So, let's just try each arm 100 times, take the sample mean of the rewards we get back, and then pull the arm with the best sample mean forever more.  Problem solved?
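In code, against the toy BernoulliBandit sketched above, the naive policy might look roughly like this (a sketch only; the 100-sample budget is exactly the arbitrary choice questioned below):

```python
import numpy as np

def run_naive(bandit, n_arms, samples_per_arm=100, horizon=10_000):
    """Try every arm a fixed number of times, then commit to the best sample mean."""
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    rewards = []
    # Exploration phase: pull each arm samples_per_arm times.
    for arm in range(n_arms):
        for _ in range(samples_per_arm):
            r = bandit.pull(arm)
            counts[arm] += 1
            sums[arm] += r
            rewards.append(r)
    # Exploitation phase: pull the best-looking arm forever more.
    best = int(np.argmax(sums / counts))
    for _ in range(horizon - int(counts.sum())):
        rewards.append(bandit.pull(best))
    return rewards
```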

Not exactly.  This approach will get you in trouble in a few key ways:

  1. If K is of even moderate size (10-100), you'll spend a long time gathering data before you can actually benefit from feedback.
  2. Is 100 samples for each arm enough?  How many should we use?  This is an arbitrary parameter that will require experimentation to determine.
  3. If after 100 samples (or however many), the arm you settle on is not actually the optimal one, you can never recover.
  4. In practice, the reward distribution is likely to change over time, and we should use an algorithm that can take that into account.

OK, so maybe the naive approach won't work.  Let's move on to a few that are actually used in practice.

Approach #2: The ϵ-Greedy Algorithm

What we'd like to do is start using what we think is the best arm as soon as possible, and adjust course when information to the contrary becomes available.  And we don't want to get stuck in a sub-optimal state forever.  The "ϵ-greedy" algorithm addresses both of these concerns.  Here is how it works: with probability 1−ϵ pull the arm with the best current sample mean reward, and otherwise pull a random other arm (uniformly); a short code sketch of this rule follows the list of advantages below.  The advantages over the naive approach are:

  1. Guaranteed to not get stuck in a suboptimal state forever.
  2. Will use the current best performing arm a large proportion of the time.
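Here is the promised sketch of a single ϵ-greedy decision; arm_counts and arm_rewards are assumed to be running per-arm totals kept by the recommender, not names from the original post.

```python
import numpy as np

def epsilon_greedy_choice(arm_counts, arm_rewards, epsilon=0.1):
    """With probability 1 - epsilon exploit the best sample mean, else explore."""
    sample_means = arm_rewards / np.maximum(arm_counts, 1)  # avoid division by zero
    best = int(np.argmax(sample_means))
    if np.random.rand() < epsilon:
        # Explore: pull one of the *other* arms uniformly at random.
        others = [a for a in range(len(arm_counts)) if a != best]
        return int(np.random.choice(others))
    return best
```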

But setting ϵ is hard.  If it's too small, learning is slow at the start, and you will be slow to react to changes.  If we happen to sample, say, the second-best arm the first few times, it may take a long time to discover that another arm is actually better.  If ϵ is too big, you'll waste many trials pulling random arms without gaining much.  After a while, we'll have enough samples to be pretty sure which arm is best, but we will still be wasting an ϵ fraction of our traffic exploring other options.  In short, ϵ is a parameter that gives poor performance at the extremes, and we have little guidance as to how to set it.

Approach #3: Upper Confidence Bound Algorithms

In the world of statistics, whenever you estimate some unknown parameter (such as the mean of a distribution) using random samples, there is a way to quantify the uncertainty inherent in your estimate.  For example, the true mean of a fair six-sided die is 3.5.  But if you only roll it once and get a 2, your best estimate of the mean is just 2.  Obviously that estimate is not very good, and we can quantify just how variable it is.  There are confidence bounds which can be written, for example, as: "The mean of this die is 2, with a 95-th percentile lower bound of 1.4 and a 95-th percentile upper bound of 5.2."

The upper confidence bound (UCB) family of algorithms, as its name suggests, simply selects the arm with the largest upper confidence bound at each round.  The intuition is this: the more times you roll the die, the tighter the confidence bounds.  If you roll the die an infinite number of times then the width of the confidence bound is zero, and before you ever roll it the width of the confidence bound is the largest it will ever be.  So, as the number of rolls increases, the uncertainty decreases, and so does the width of the confidence bound.

In the bandit case, imagine that you have to introduce a brand new choice to the set of K choices a week into your experiment.  The ϵ-greedy algorithm would keep chugging along, showing this new choice rarely (if the initial mean is defined to be 0).  But the upper confidence bound of this new choice will be very large because of the uncertainty that results from us never having pulled it.  So UCB will choose this new arm until its upper bound is below the upper bound of the more established choices.
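For concreteness, here is a sketch of that selection rule using the classic UCB1 bound (one common choice from the literature; the simulation below uses a heuristic modification of it):

```python
import numpy as np

def ucb1_choice(arm_counts, arm_rewards, t):
    """Pick the arm with the largest UCB1-style upper confidence bound at round t."""
    # An arm that has never been pulled gets priority; after that, its large
    # exploration bonus keeps it in play until the bound drops below the others'.
    if np.any(arm_counts == 0):
        return int(np.argmin(arm_counts))
    sample_means = arm_rewards / arm_counts
    exploration = np.sqrt(2.0 * np.log(t) / arm_counts)
    return int(np.argmax(sample_means + exploration))
```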

So, the advantages are:

  1. Takes the uncertainty of the sample mean estimate into account in a smart way.
  2. No parameters to validate.

And the major disadvantage is that the confidence bounds designed in the machine learning literature require heuristic adjustments.  One way to get around having to wade through heuristics is to recall the central limit theorem.  I'll skip the math, but it says that the distribution of the sample mean computed from samples from any distribution converges to a Normal (Gaussian) as the number of samples increases (and fairly quickly).  Why does that matter here?  Because we are estimating the true expected reward for each arm with a sample mean.  Ideally we want a posterior distribution over where the true mean is, but that's hard in non-Bernoulli and non-Gaussian cases.  So we will instead content ourselves with an approximation and use a Gaussian distribution centered at the sample mean.  We can thus always use a, say, 95% upper confidence bound, and be secure in the knowledge that it will become more and more accurate the more samples we get.  I will discuss this in more detail in the next blog post.
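And here is a sketch of that Gaussian approximation with a one-sided 95% upper bound; the z value of 1.645 (the 95th percentile of a standard Normal) and the small variance floor are assumptions of this sketch, not prescriptions from the post:

```python
import numpy as np

def gaussian_ucb_choice(arm_counts, arm_rewards, z=1.645):
    """Pick the arm with the largest Gaussian ~95% upper confidence bound."""
    # Untried arms get priority, as in the UCB1 sketch above.
    if np.any(arm_counts == 0):
        return int(np.argmin(arm_counts))
    means = arm_rewards / arm_counts
    # Standard error of the sample mean; for binary rewards the per-pull
    # variance is p(1 - p).  The floor keeps the bound from collapsing to
    # zero width after only a few pulls.
    variances = np.maximum(means * (1.0 - means), 1e-4)
    std_err = np.sqrt(variances / arm_counts)
    return int(np.argmax(means + z * std_err))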

Simulation Comparison

So how do these three approaches perform?  To find out, I ran a simple simulation 100 times with K=5 and binary rewards (a.k.a. a Bernoulli bandit).  Here are the 5 algorithms compared:

  1. Random - just pick a random arm each time without learning anything.
  2. Naive - with 100 samples of each arm before committing to the best one
  3. ϵ-greedy - with ϵ=0.01
  4. UCB - with (1 - 1/t) bounds (heuristic modification of UCB1)
  5. UCB - with 95% bounds

The metric used to compare these algorithms is average (over all the trials) expected regret (lower is better), which quantifies how much reward we missed out on by pulling the suboptimal arm at each time step.  The Python code is here and the results are in the plot below.
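As a reference for reading the results, here is a minimal sketch of the metric itself; chosen_arms is an assumed log of which arm was pulled at each step, and true_means are the arm means known only to the simulation (this is not the linked code, just an illustration):

```python
import numpy as np

def cumulative_expected_regret(chosen_arms, true_means):
    """Expected reward of the best arm minus that of the arms actually pulled,
    accumulated over time (lower is better)."""
    true_means = np.asarray(true_means, dtype=float)
    per_step = true_means.max() - true_means[np.asarray(chosen_arms)]
    return np.cumsum(per_step)
```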

What can we conclude from this plot?

  1. Naive is as bad as random for the first 100·K rounds (100 pulls of each of the K arms), but then has effectively flat performance.  In the real world, the arms have shifting rewards, so this algorithm is impractical because it over-commits.
  2. ϵ-greedy is OK, but without a decaying ϵ we're still wasting 1% of traffic on exploration when it may no longer be necessary.
  3. The UCB algorithms are great.  It's not clear which one is the winner in this limited horizon, but both handily beat all of the other algorithms.

Now you know all about bandits, and have a good idea of how they might be relevant to online recommender systems.  But there's more to do before we have a system that is really up to the job.

Coming up next: let's get Bayesian with Thompson Sampling!

About Sergey Feldman:

Sergey Feldman is a data scientist & machine learning cowboy with the RichRelevance Analytics team. He was born in Ukraine, moved with his family to Skokie, Illinois at age 10, and now lives in Seattle. In 2012 he obtained his machine learning PhD from the University of Washington. Sergey loves random forests and thinks the Fourier transform is pure magic.
