[Original article: http://engineering.richrelevance.com/bandits-recommendation-systems/]

[This repost: http://www.cnblogs.com/breezedeus/p/3775316.html. Please credit the source when reposting.]

Bandits for Recommendation Systems

06/02/2014 • Topics: Bayesian, Big data, Data Science

by Sergey Feldman

This is the first in a series of three blog posts on bandits for recommendation systems.

In this blog post, we will discuss the bandit problem and how it relates to online recommender systems.  Then, we'll cover some classic algorithms and see how well they do in simulation.

A common problem for internet-based companies is: which piece of content should we display?  Google has this problem (which ad to show), Facebook has this problem (which friend's post to show), and RichRelevance has this problem (which product recommendation to show).  Many of the promising solutions come from the study of the multi-armed bandit problem.  A one-armed "bandit" is another way to say slot machine (probably because both will leave you with empty pockets).  Here is a description that I hijacked from Wikipedia:

"The multi-armed bandit problem is the problem a gambler faces at a row of slot machines when deciding which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls."

Let's rewrite this in retail language.  Each time a shopper looks at a webpage, we have to show them one of K product recommendations.  They either click on it or do not, and we log this (binary) reward.  Next, we proceed to either the next shopper or the next page view of this shopper and have to choose one of K product recommendations again. (Actually, we have to choose multiple recommendations per page, and our 'reward' could instead be sales revenue, but let's ignore these aspects for now.)

Multi-armed bandits come in two flavors: stochastic and adversarial.  The stochastic case is where each bandit doesn't change in response to your actions, while in the adversarial case the bandits learn from your actions and adjust their behavior to minimize your rewards.  We care about the stochastic case, and our goal is to find the arm which has the largest expected reward.  I will index the arms by \( a \), and the probability distribution over possible rewards \( r \) for each arm \( a \) can be written as \( p_a(r) \).  We have to find the arm with the largest mean reward

\( \mu_a = E_a[r] \)

as quickly as possible while accumulating the most rewards along the way.  One important point is that in practice \( p_a(r)\) are non-stationary, that is, rewards change over time, and we have to take that into account when we design our algorithms.
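To make the setup concrete, here is a minimal sketch of a stationary Bernoulli bandit in Python. The class name `BernoulliBandit`, its methods, and the click probabilities are all hypothetical choices for illustration; they are not taken from the simulation code behind this post, and the sketch ignores non-stationarity.

```python
import random


class BernoulliBandit:
    """K arms; arm a pays 1 with probability means[a], else 0."""

    def __init__(self, means, seed=None):
        self.means = list(means)        # true mu_a, hidden from the learner
        self.rng = random.Random(seed)  # private RNG for reproducibility

    @property
    def n_arms(self):
        return len(self.means)

    def pull(self, arm):
        """Draw one binary reward r ~ p_a(r) for the chosen arm."""
        return 1 if self.rng.random() < self.means[arm] else 0

    def best_arm(self):
        """Index of the arm with the largest true mean reward."""
        return max(range(self.n_arms), key=lambda a: self.means[a])


# Hypothetical click-through rates for three recommendations.
bandit = BernoulliBandit([0.02, 0.05, 0.03], seed=0)
rewards = [bandit.pull(1) for _ in range(10_000)]
print(sum(rewards) / len(rewards))  # sample mean, close to the true 0.05
```

The learner only ever sees the output of `pull`; everything that follows is about choosing which arm to pass to it.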

Approach #1: A Naive Algorithm

We need to figure out the mean reward (expected value) of each arm.  So, let's just try each arm 100 times, take the sample mean of the rewards we get back, and then pull the arm with the best sample mean forever more.  Problem solved?

Not exactly.  This approach will get you in trouble in a few key ways:

  1. If K is of even moderate size (10-100), you'll spend a long time gathering data before you can actually benefit from feedback.
  2. Is 100 samples for each arm enough?  How many should we use?  This is an arbitrary parameter that will require experimentation to determine.
  3. If after 100 samples (or however many), the arm you settle on is not actually the optimal one, you can never recover.
  4. In practice, the reward distribution is likely to change over time, and we should use an algorithm that can take that into account.
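For reference, the naive strategy can be sketched in a few lines of Python. The function name `naive_bandit`, its parameters, and the reward probabilities below are assumptions made for this illustration. The point to notice is that nothing after the exploration loop ever revisits the choice of `best`.

```python
import random


def naive_bandit(pull, n_arms, n_explore=100, n_rounds=10_000):
    """Try every arm n_explore times, then commit to the best sample mean."""
    totals = [0.0] * n_arms
    history = []
    # Exploration phase: n_explore pulls per arm, one arm after another.
    for arm in range(n_arms):
        for _ in range(n_explore):
            totals[arm] += pull(arm)
            history.append(arm)
    # Commit phase: pick the best sample mean and never look back.
    best = max(range(n_arms), key=lambda a: totals[a] / n_explore)
    for _ in range(n_rounds - len(history)):
        pull(best)
        history.append(best)
    return best, history


rng = random.Random(0)
true_means = [0.1, 0.5, 0.9]  # hypothetical click probabilities
pull = lambda arm: 1 if rng.random() < true_means[arm] else 0
best, history = naive_bandit(pull, n_arms=3, n_explore=100, n_rounds=1000)
print(best, len(history))
```

With three arms, the first 300 rounds are spent exploring regardless of how obvious the winner is, which is problem 1 above in miniature.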

OK, so maybe the naive approach won't work.  Let's move on to a few that are actually used in practice.

Approach #2: The ϵ-Greedy Algorithm

What we'd like to do is start using what we think is the best arm as soon as possible, and adjust course when information to the contrary becomes available.  And we don't want to get stuck in a sub-optimal state forever.  The "ϵ-greedy" algorithm addresses both of these concerns.  Here is how it works: with probability 1−ϵ pull the arm with the best current sample mean reward, and otherwise pull one of the other arms uniformly at random.  The advantages over the naive approach are:

  1. Guaranteed to not get stuck in a suboptimal state forever.
  2. Will use the current best performing arm a large proportion of the time.
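The rule described above can be sketched as follows. This is a minimal illustration rather than the code used for the simulation later in the post: the function name `epsilon_greedy`, the tie-breaking rule (lowest index wins), the initial sample mean of 0 for unpulled arms, and the reward probabilities are all assumptions.

```python
import random


def epsilon_greedy(pull, n_arms, epsilon=0.1, n_rounds=10_000, seed=0):
    """Pull the best-looking arm w.p. 1 - epsilon, else a random other arm."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    means = [0.0] * n_arms  # sample means; unpulled arms start at 0
    for _ in range(n_rounds):
        greedy = max(range(n_arms), key=lambda a: means[a])
        if n_arms > 1 and rng.random() < epsilon:
            # Explore: choose uniformly among the arms other than the greedy one.
            arm = rng.choice([a for a in range(n_arms) if a != greedy])
        else:
            arm = greedy
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
    return counts, means


env_rng = random.Random(1)
true_means = [0.1, 0.5, 0.9]  # hypothetical click probabilities
pull = lambda arm: 1 if env_rng.random() < true_means[arm] else 0
counts, means = epsilon_greedy(pull, n_arms=3, epsilon=0.1, n_rounds=5000)
print(counts, [round(m, 2) for m in means])
```

Note that the exploration branch fires at the same rate in round 5000 as in round 1, which is the cost discussed next.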

But setting ϵ is hard.  If it’s too small, learning is slow at the start, and you will be slow to react to changes.  If we happen to sample, say, the second-best arm the first few times, it may take a long time to discover that another arm is actually better.  If ϵ is too big, you’ll waste many trials pulling random arms without gaining much.  After a while, we'll have enough samples to be pretty sure which is best, but we will still be wasting an ϵ of our traffic exploring other options.  In short, ϵ is a parameter that gives poor performance at the extremes, and we have little guidance as to how to set it.

Approach #3: Upper Confidence Bound Algorithms

In the world of statistics, whenever you estimate some unknown parameter (such as the mean of a distribution) using random samples, there is a way to quantify the uncertainty inherent in your estimate.  For example, the true mean of a fair six-sided die is 3.5.  But if you only roll it once and get a 2, your best estimate of the mean is just 2.  Obviously that estimate is not very good, and we can quantify just how variable it is.  There are confidence bounds which can be written, for example, as: "The mean of this die is 2, with a 95-th percentile lower bound of 1.4 and a 95-th percentile upper bound of 5.2."

The upper confidence bound (UCB) family of algorithms, as its name suggests, simply selects the arm with the largest upper confidence bound at each round.  The intuition is this: the more times you roll the die, the tighter the confidence bounds.  If you roll the die an infinite number of times then the width of the confidence bound is zero, and before you ever roll it the width of the confidence bound is the largest it will ever be.  So, as the number of rolls increases, the uncertainty decreases, and so does the width of the confidence bound.
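To see the bounds tighten, here is a small sketch that rolls a simulated fair die and reports the sample mean with approximate bounds. It assumes a normal approximation with the sample standard deviation and a two-sided z value of 1.96; the helper name `mean_with_bounds` is hypothetical, and the approximation is rough for small numbers of rolls.

```python
import math
import random
import statistics


def mean_with_bounds(samples, z=1.96):
    """Sample mean with a normal-approximation confidence interval."""
    n = len(samples)  # needs n >= 2 so the sample std dev is defined
    mean = statistics.fmean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


rng = random.Random(0)
for n in (10, 100, 1000, 10_000):
    rolls = [rng.randint(1, 6) for _ in range(n)]
    mean, lo, hi = mean_with_bounds(rolls)
    print(f"n={n:>6}  mean={mean:.3f}  bounds=({lo:.3f}, {hi:.3f})  width={hi - lo:.3f}")
```

The width shrinks roughly like 1/sqrt(n): a hundred times more rolls buys bounds about ten times tighter.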

In the bandit case, imagine that you have to introduce a brand new choice to the set of K choices a week into your experiment.  The ϵ-greedy algorithm would keep chugging along, showing this new choice rarely (if the initial mean is defined to be 0).  But the upper confidence bound of this new choice will be very large because of the uncertainty that results from us never having pulled it.  So UCB will choose this new arm until its upper bound is below the upper bound of the more established choices.

So, the advantages are:

  1. Takes the uncertainty of the sample mean estimate into account in a smart way.
  2. No parameters to validate.

And the major disadvantage is that the confidence bounds designed in the machine learning literature require heuristic adjustments.  One way to get around having to wade through heuristics is to recall the central limit theorem.  I'll skip the math but it says that the distribution of the sample mean computed from samples from any distribution converges to a Normal (Gaussian) as the number of samples increases (and fairly quickly).  Why does that matter here?  Because we are estimating the true expected reward for each arm with a sample mean.  Ideally we want a posterior of where the true mean is, but that's hard in non-Bernoulli and non-Gaussian cases.  So we will instead content ourselves with an approximation and use a Gaussian distribution centered at the sample mean instead.  We can thus always use a, say, 95% upper confidence bound, and be secure in the knowledge that it will become more and more accurate the more samples we get.  I will discuss this in more detail in the next blog post.
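A rough sketch of what a Gaussian-approximation UCB rule could look like for binary rewards is given below. Several choices here are assumptions and not the author's: the function name `ucb_gaussian`, the one-sided 95% value z = 1.645, giving never-pulled arms an infinite bound, and using the worst-case Bernoulli variance of 1/4 in place of the sample variance (so the bound cannot collapse to zero width after a few identical rewards).

```python
import math
import random


def ucb_gaussian(pull, n_arms, n_rounds=10_000, z=1.645):
    """Each round, pull the arm with the largest approximate upper bound."""
    counts = [0] * n_arms
    means = [0.0] * n_arms

    def upper_bound(arm):
        if counts[arm] == 0:
            return math.inf  # never pulled: maximal uncertainty
        # Gaussian centered at the sample mean; the standard error uses the
        # worst-case variance of a binary reward (1/4), an assumption.
        return means[arm] + z * 0.5 / math.sqrt(counts[arm])

    for _ in range(n_rounds):
        arm = max(range(n_arms), key=upper_bound)
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
    return counts, means


env_rng = random.Random(0)
true_means = [0.1, 0.3, 0.9]  # hypothetical click probabilities
pull = lambda arm: 1 if env_rng.random() < true_means[arm] else 0
counts, means = ucb_gaussian(pull, n_arms=3, n_rounds=5000)
print(counts, [round(m, 2) for m in means])
```

There is no ϵ here: exploration happens only because a rarely pulled arm keeps a wide bound, and it fades on its own as the counts grow.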

Simulation Comparison

So how do these 3 approaches perform?  To find out, I ran a simple simulation 100 times with K=5 and binary rewards (aka a Bernoulli bandit).  Here are the 5 algorithm variants compared:

  1. Random - just pick a random arm each time without learning anything.
  2. Naive - with 100 samples of each arm before committing to the best one
  3. ϵ-Greedy - with ϵ=0.01
  4. UCB - with (1 - 1/t) bounds (heuristic modification of UCB1)
  5. UCB - with 95% bounds

The metric used to compare these algorithms is average (over all the trials) expected regret (lower is better), which quantifies how much reward we missed out on by pulling the suboptimal arm at each time step.  The Python code is here and the results are in the plot below.
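As a sketch of the metric (not the linked simulation code), expected regret can be computed from the true arm means and the sequence of arms an algorithm chose. The function name `expected_regret` and the example means are hypothetical. Averaging over trials would then mean averaging these curves across independent simulation runs.

```python
def expected_regret(true_means, chosen_arms):
    """Cumulative expected regret after each round.

    Each round adds (best true mean) - (true mean of the arm pulled),
    so pulling the optimal arm adds nothing.
    """
    best = max(true_means)
    total = 0.0
    curve = []
    for arm in chosen_arms:
        total += best - true_means[arm]
        curve.append(total)
    return curve


true_means = [0.25, 0.5, 0.75]  # hypothetical arm means
print(expected_regret(true_means, [0, 1, 2, 2]))  # [0.5, 0.75, 0.75, 0.75]
```

A curve that goes flat means the algorithm has stopped pulling suboptimal arms; a curve with constant positive slope means it is still paying for exploration or for a wrong commitment.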

What can we conclude from this plot?

  1. Naive is as bad as random for the first 100·K = 500 rounds (100 pulls for each of the K=5 arms), but then has effectively flat performance.  In the real world, the arms have shifting rewards, so this algorithm is impractical because it over-commits.
  2. ϵ-greedy is OK, but without a decaying ϵ we're still wasting 1% of traffic on exploration when it may no longer be necessary.
  3. The UCB algorithms are great.  It's not clear which one is the winner in this limited horizon, but both handily beat all of the other algorithms.

Now you know all about bandits, and have a good idea of how they might be relevant to online recommender systems.  But there's more to do before we have a system that is really up to the job.

Coming up next: let's get Bayesian with Thompson Sampling!

About Sergey Feldman:

Sergey Feldman is a data scientist & machine learning cowboy with the RichRelevance Analytics team. He was born in Ukraine, moved with his family to Skokie, Illinois at age 10, and now lives in Seattle. In 2012 he obtained his machine learning PhD from the University of Washington. Sergey loves random forests and thinks the Fourier transform is pure magic.
