Thompson Sampling Revisited

Related:

English tutorial:

https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf

Chinese tutorial:

Theoretical Foundations of Reinforcement Learning, 4.8 Thompson Sampling

=====================================

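I have already shared several posts on this topic, but I recently came across something new in this area, so I want to discuss it again. Thompson Sampling is a frequently discussed method for balancing exploitation and exploration in the multi-armed bandit (MAB) problem. The versions of Thompson Sampling I shared before all used the Beta distribution as both prior and posterior, but I recently found that the normal distribution can also be used as prior and posterior.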

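According to the definition of Thompson sampling on Wikipedia:

As you can see, Thompson Sampling does not have to use the Beta distribution. At its core, Thompson Sampling uses Bayes' rule to assess, at each draw, which arm is more likely to be the optimal one. To use it we first choose a prior distribution and a likelihood. For the simple iterative update used here we also need the prior to be conjugate to the likelihood, so that the posterior stays in the same family as the prior and can be updated again after every draw. The most commonly used form is Bernoulli Thompson Sampling, which uses the Beta distribution as prior and posterior and the Bernoulli distribution (a single-trial binomial) as the likelihood. Its general form is outlined in this section.

Bernoulli Thompson Sampling

The core code is: (reference: https://zhuanlan.zhihu.com/p/36199435)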

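import numpy as np

# wins and loses are both N-dimensional vectors, where N is the number of arms
# wins[i] counts the rewards of 1 and loses[i] the rewards of 0 observed for arm i;
# with a Beta(1, 1) prior, arm i has posterior Beta(1 + wins[i], 1 + loses[i])
# (np.random.beta replaces pymc.rbeta, which only exists in the old PyMC 2 API)
choice = np.argmax(np.random.beta(1 + wins, 1 + loses))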
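
To put the one-line selection rule above in context, here is a minimal sketch of a full Beta-Bernoulli loop. It is an illustration rather than the referenced author's code: the function name `bernoulli_thompson_sampling`, the `true_probs` values, and the use of NumPy's `default_rng` are all assumptions made for this example.

```python
import numpy as np

def bernoulli_thompson_sampling(true_probs, n_rounds=2000, seed=0):
    """Run Beta-Bernoulli Thompson Sampling on simulated arms.

    true_probs is a hypothetical list of success probabilities, one per arm.
    It is only used to simulate rewards and is hidden from the agent.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    wins = np.zeros(n_arms)   # number of 1-rewards observed per arm
    loses = np.zeros(n_arms)  # number of 0-rewards observed per arm
    for _ in range(n_rounds):
        # Draw one sample per arm from its Beta(1 + wins, 1 + loses) posterior.
        samples = rng.beta(1 + wins, 1 + loses)
        arm = int(np.argmax(samples))
        # Simulate a Bernoulli reward for the chosen arm.
        if rng.random() < true_probs[arm]:
            wins[arm] += 1
        else:
            loses[arm] += 1
    return wins, loses

wins, loses = bernoulli_thompson_sampling([0.2, 0.5, 0.8])
pulls = wins + loses
print(pulls)  # pulls per arm; most of them should go to the 0.8 arm
```

Because each reward only increments one of two counters, the posterior update is just bookkeeping, which is what makes the conjugate Beta prior convenient here.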

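As you can see, a major limitation of Bernoulli Thompson Sampling is its Bernoulli likelihood: every observed outcome can only be 0 or 1, i.e. the event either happened or did not. In MAB problems the reward is sometimes 0 or 1, but it can also take several discrete values or even continuous values. Bernoulli Thompson Sampling does not apply directly in those cases, and we can use Gaussian Thompson Sampling instead.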

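Gaussian Thompson Sampling

I will not give the mathematical derivation or proof here, for two reasons: 1. I do not have the energy to write up these side topics; 2. I honestly do not know how to do this derivation and proof myself.

Here is the Python code directly: (https://github.com/mimoralea/gdrl)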

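import numpy as np
from tqdm import tqdm

def thompson_sampling(env,
                      alpha=1,
                      beta=0,
                      n_episodes=1000):
    # Q: running mean reward per arm; N: number of pulls per arm
    Q = np.zeros((env.action_space.n), dtype=np.float64)
    # np.int was removed in NumPy 1.24; the builtin int is the equivalent dtype
    N = np.zeros((env.action_space.n), dtype=int)
    # per-episode logs of the estimates, rewards and chosen arms
    Qe = np.empty((n_episodes, env.action_space.n), dtype=np.float64)
    returns = np.empty(n_episodes, dtype=np.float64)
    actions = np.empty(n_episodes, dtype=int)
    name = 'Thompson Sampling {}, {}'.format(alpha, beta)
    for e in tqdm(range(n_episodes),
                  desc='Episodes for: ' + name,
                  leave=False):
        # one Gaussian sample per arm: mean Q, standard deviation shrinking as N grows
        # (with beta=0 an unvisited arm has N=0, so its scale is infinite)
        samples = np.random.normal(
            loc=Q, scale=alpha/(np.sqrt(N) + beta))
        action = np.argmax(samples)
        # assumes the classic Gym 4-tuple: (observation, reward, done, info)
        _, reward, _, _ = env.step(action)
        N[action] += 1
        # incremental update of the running mean
        Q[action] = Q[action] + (reward - Q[action])/N[action]
        Qe[e] = Q
        returns[e] = reward
        actions[e] = action
    return name, returns, Qe, actions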

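The core of the code is the sampling step, in which a value for each arm is drawn from a normal distribution with mean Q and standard deviation alpha/(np.sqrt(N) + beta).

In other words, Gaussian Thompson Sampling uses the normal distribution as both prior and posterior, and after each draw I only need to update the mean and variance of the normal distribution for the chosen arm. Note that the normal distribution is also a conjugate prior: for a normal likelihood with known variance, a normal prior on the mean yields a normal posterior.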
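
Since the derivation is skipped above, the following is a minimal sketch of the textbook normal-normal conjugate update, assuming the reward noise variance is known. It is not the code from the gdrl repository: the class `GaussianArm`, the `true_means` values, and the unit prior and noise variances are assumptions chosen for illustration.

```python
import numpy as np

class GaussianArm:
    """Posterior over one arm's mean reward, assuming known noise variance."""

    def __init__(self, prior_mean=0.0, prior_var=1.0, noise_var=1.0):
        self.mean = prior_mean                 # posterior mean
        self.precision = 1.0 / prior_var       # posterior precision (1 / variance)
        self.noise_precision = 1.0 / noise_var

    def sample(self, rng):
        # Draw a plausible mean reward from the current posterior.
        return rng.normal(self.mean, np.sqrt(1.0 / self.precision))

    def update(self, reward):
        # Precisions add; the new mean is a precision-weighted average.
        new_precision = self.precision + self.noise_precision
        self.mean = (self.precision * self.mean
                     + self.noise_precision * reward) / new_precision
        self.precision = new_precision

rng = np.random.default_rng(0)
true_means = [0.0, 1.0, 2.0]  # hypothetical arms, hidden from the agent
arms = [GaussianArm() for _ in true_means]
pulls = np.zeros(len(arms), dtype=int)
for _ in range(2000):
    a = int(np.argmax([arm.sample(rng) for arm in arms]))
    reward = rng.normal(true_means[a], 1.0)
    arms[a].update(reward)
    pulls[a] += 1
print(pulls)  # pulls per arm; most of them should go to the last arm
```

With unit prior and noise variance, the posterior standard deviation after N pulls is 1/sqrt(1 + N), which has the same shape as the alpha/(np.sqrt(N) + beta) scale in the function above.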

--------------------------------------------------

I also found some good English-language material on Thompson Sampling:

https://towardsdatascience.com/thompson-sampling-fc28817eacb8

https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf

====================================

For comparison, here are some other sampling methods as well:

Softmax sampling:

import sys

import numpy as np
from tqdm import tqdm

def softmax(env,
            init_temp=float('inf'),
            min_temp=0.0,
            decay_ratio=0.04,
            n_episodes=1000):
    Q = np.zeros((env.action_space.n), dtype=np.float64)
    # np.int was removed in NumPy 1.24; the builtin int is the equivalent dtype
    N = np.zeros((env.action_space.n), dtype=int)
    Qe = np.empty((n_episodes, env.action_space.n), dtype=np.float64)
    returns = np.empty(n_episodes, dtype=np.float64)
    actions = np.empty(n_episodes, dtype=int)
    name = 'Lin SoftMax {}, {}, {}'.format(init_temp,
                                           min_temp,
                                           decay_ratio)
    # can't really use infinity
    init_temp = min(init_temp,
                    sys.float_info.max)
    # can't really use zero
    min_temp = max(min_temp,
                   np.nextafter(np.float32(0),
                                np.float32(1)))
    for e in tqdm(range(n_episodes),
                  desc='Episodes for: ' + name,
                  leave=False):
        # linearly decay the temperature from init_temp to min_temp
        decay_episodes = n_episodes * decay_ratio
        temp = 1 - e / decay_episodes
        temp *= init_temp - min_temp
        temp += min_temp
        temp = np.clip(temp, min_temp, init_temp)
        # softmax over Q / temp, shifted by the max for numerical stability
        scaled_Q = Q / temp
        norm_Q = scaled_Q - np.max(scaled_Q)
        exp_Q = np.exp(norm_Q)
        probs = exp_Q / np.sum(exp_Q)
        assert np.isclose(probs.sum(), 1.0)
        action = np.random.choice(np.arange(len(probs)),
                                  size=1,
                                  p=probs)[0]
        # assumes the classic Gym 4-tuple: (observation, reward, done, info)
        _, reward, _, _ = env.step(action)
        N[action] += 1
        Q[action] = Q[action] + (reward - Q[action])/N[action]
        Qe[e] = Q
        returns[e] = reward
        actions[e] = action
    return name, returns, Qe, actions
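
To see what the temperature does in the function above, here is a small sketch that isolates the probability computation. The helper name `softmax_probs` and the example `Q` values are assumptions made for illustration; the arithmetic mirrors the `scaled_Q`, `norm_Q`, `exp_Q` steps.

```python
import numpy as np

def softmax_probs(Q, temp):
    """Action probabilities for value estimates Q at a given temperature."""
    scaled_Q = np.asarray(Q, dtype=np.float64) / temp
    norm_Q = scaled_Q - np.max(scaled_Q)  # shift by the max for numerical stability
    exp_Q = np.exp(norm_Q)
    return exp_Q / np.sum(exp_Q)

Q = [0.1, 0.5, 0.9]  # hypothetical value estimates for three arms
for temp in (100.0, 1.0, 0.01):
    # High temperature: close to uniform. Low temperature: close to greedy.
    print(temp, np.round(softmax_probs(Q, temp), 3))
```

A high temperature makes the choice nearly uniform (exploration), while a temperature near zero concentrates almost all probability on the arm with the largest estimate (exploitation).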

Upper confidence bound (UCB) sampling:

import numpy as np
from tqdm import tqdm

def upper_confidence_bound(env,
                           c=2,
                           n_episodes=1000):
    Q = np.zeros((env.action_space.n), dtype=np.float64)
    # np.int was removed in NumPy 1.24; the builtin int is the equivalent dtype
    N = np.zeros((env.action_space.n), dtype=int)
    Qe = np.empty((n_episodes, env.action_space.n), dtype=np.float64)
    returns = np.empty(n_episodes, dtype=np.float64)
    actions = np.empty(n_episodes, dtype=int)
    name = 'UCB {}'.format(c)
    for e in tqdm(range(n_episodes),
                  desc='Episodes for: ' + name,
                  leave=False):
        # during the first len(Q) episodes, pull each arm once in order
        action = e
        if e >= len(Q):
            # exploration bonus: larger for arms that have been pulled less often
            U = c * np.sqrt(np.log(e)/N)
            action = np.argmax(Q + U)
        # assumes the classic Gym 4-tuple: (observation, reward, done, info)
        _, reward, _, _ = env.step(action)
        N[action] += 1
        Q[action] = Q[action] + (reward - Q[action])/N[action]
        Qe[e] = Q
        returns[e] = reward
        actions[e] = action
    return name, returns, Qe, actions
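
The three functions above only touch `env.action_space.n` and unpack `env.step(action)` into four values. The following is a minimal sketch of an environment with that interface, for trying the functions without installing the environments used in the gdrl repository. The classes `ActionSpace` and `GaussianBanditEnv` and the arm means are hypothetical and are not part of that repository.

```python
import numpy as np

class ActionSpace:
    """Tiny stand-in exposing only the `n` attribute the functions read."""

    def __init__(self, n):
        self.n = n

class GaussianBanditEnv:
    """Hypothetical bandit: arm i pays a reward drawn from N(means[i], noise_std^2)."""

    def __init__(self, means, noise_std=1.0, seed=0):
        self.means = np.asarray(means, dtype=np.float64)
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)
        self.action_space = ActionSpace(len(self.means))

    def reset(self):
        # A bandit has a single dummy state.
        return 0

    def step(self, action):
        reward = float(self.rng.normal(self.means[action], self.noise_std))
        # Classic Gym-style 4-tuple: (observation, reward, done, info).
        return 0, reward, True, {}

env = GaussianBanditEnv([0.0, 0.5, 1.0])
obs, reward, done, info = env.step(2)
print(env.action_space.n, reward)
```

An instance like this can be passed as `env` to `thompson_sampling`, `softmax` or `upper_confidence_bound`. Newer Gymnasium environments return five values from `step`, so the unpacking in those functions would need adjusting for them.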

Note that the code inside the Python functions defined in this post is subject to the BSD 3-Clause License, the text of which follows:

BSD 3-Clause License

Copyright (c) 2018, Miguel Morales

All rights reserved.

Redistribution and use in source and binary forms, with or without

modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this

  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,

  this list of conditions and the following disclaimer in the documentation

  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its

  contributors may be used to endorse or promote products derived from

  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"

AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE

IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE

DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE

FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL

DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR

SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER

CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,

OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE

OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
