Thompson Sampling Revisited

Related:

English-language tutorial:

https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf

Chinese-language tutorial:

Theoretical Foundations of Reinforcement Learning, 4.8 Thompson Sampling

=====================================

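I have already shared several posts on this topic, but I recently came across something new and wanted to revisit it. Thompson sampling is a frequently discussed method for balancing exploitation and exploration in the multi-armed bandit (MAB) problem. The versions covered in my earlier posts all used the beta distribution as both prior and posterior; I recently found that the normal distribution can be used as prior and posterior as well.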

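As the definition of Thompson sampling on Wikipedia makes clear, Thompson sampling does not have to use the beta distribution. Its core idea is to use Bayes' rule to maintain a posterior for each arm and, at every step, to pull the arm whose posterior sample is the largest, so that an arm is chosen with the probability that it is the optimal one. To use Thompson sampling we first choose a prior distribution and a likelihood. It is also very convenient, though not strictly required, for the prior to be conjugate to the likelihood: the posterior then stays in the same family as the prior and can be updated in closed form after every pull. The most common variant is Bernoulli Thompson sampling, which uses the beta distribution as prior and posterior and the Bernoulli distribution (a binomial with a single trial) as the likelihood.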

Bernoulli Thompson Sampling
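The conjugate update behind this variant can be written out step by step. This is a sketch that assumes a uniform Beta(1, 1) prior on every arm; the symbols \theta_k (the unknown success probability of arm k), \alpha_k and \beta_k are notation introduced here, not taken from the original post.

```latex
\begin{align*}
\text{prior:}\quad & \theta_k \sim \mathrm{Beta}(\alpha_k, \beta_k), \qquad \alpha_k = \beta_k = 1 \\
\text{sample:}\quad & \hat\theta_k \sim \mathrm{Beta}(\alpha_k, \beta_k) \ \text{for every arm } k, \qquad a = \arg\max_k \hat\theta_k \\
\text{observe:}\quad & r \sim \mathrm{Bernoulli}(\theta_a), \qquad r \in \{0, 1\} \\
\text{update:}\quad & (\alpha_a, \beta_a) \leftarrow (\alpha_a + r,\ \beta_a + 1 - r)
\end{align*}
```

Under this assumption, \alpha_k equals 1 plus the number of 1-rewards seen on arm k and \beta_k equals 1 plus the number of 0-rewards, which is exactly the `1 + wins` and `1 + loses` pair used in the code.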

The core code is (reference: https://zhuanlan.zhihu.com/p/36199435):

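import numpy as np

# wins and loses are both N-dimensional vectors, where N is the number of arms
# wins[i] counts the 1-rewards of arm i and loses[i] counts its 0-rewards,
# so 1 + wins and 1 + loses are the alpha and beta parameters of each arm's Beta posterior
# np.random.beta draws one sample per arm (the referenced post used pymc.rbeta from PyMC 2)
choice = np.argmax(np.random.beta(1 + wins, 1 + loses))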
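To see the one-liner inside a complete loop, here is a minimal simulation sketch. The function name `bernoulli_thompson_sampling` and the `true_probs` values are made up for illustration, and the rewards are simulated rather than coming from a real environment.

```python
import numpy as np

def bernoulli_thompson_sampling(true_probs, n_rounds=2000, seed=0):
    """Beta-Bernoulli Thompson sampling on simulated arms.

    true_probs is a hypothetical list of per-arm success probabilities,
    used only to simulate 0/1 rewards.
    """
    rng = np.random.default_rng(seed)
    true_probs = np.asarray(true_probs, dtype=np.float64)
    n_arms = len(true_probs)
    wins = np.zeros(n_arms, dtype=np.int64)   # number of 1-rewards per arm
    loses = np.zeros(n_arms, dtype=np.int64)  # number of 0-rewards per arm
    for _ in range(n_rounds):
        # one draw per arm from its Beta(1 + wins, 1 + loses) posterior
        samples = rng.beta(1 + wins, 1 + loses)
        choice = int(np.argmax(samples))
        # simulate a 0/1 reward from the chosen arm
        if rng.random() < true_probs[choice]:
            wins[choice] += 1
        else:
            loses[choice] += 1
    return wins, loses

wins, loses = bernoulli_thompson_sampling([0.2, 0.5, 0.8])
pulls = wins + loses
print(pulls)  # pull count per arm
```

As the posteriors sharpen, most pulls should concentrate on the arm with the highest success probability, while the other arms are still tried occasionally.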

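As we can see, a major limitation of Bernoulli Thompson sampling is its Bernoulli likelihood: every reward must be 0 or 1, that is, the event either happened or it did not. In MAB problems the reward is sometimes 0 or 1, but it can also take several discrete values or even be continuous, and Bernoulli Thompson sampling does not apply in those cases. There we can use Gaussian Thompson sampling instead.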

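Gaussian Thompson Sampling

I will not give the mathematical derivation or proof here, for two reasons: 1. I do not have the energy to write up this side material; 2. I honestly do not know how to do the derivation and proof myself.

Here is the Python code directly (https://github.com/mimoralea/gdrl):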

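import numpy as np
from tqdm import tqdm

def thompson_sampling(env,
                      alpha=1,
                      beta=0,
                      n_episodes=1000):
    # Q: running mean reward per arm; N: pull count per arm
    Q = np.zeros((env.action_space.n), dtype=np.float64)
    N = np.zeros((env.action_space.n), dtype=np.int64)
    # per-episode logs of the estimates, rewards and chosen actions
    Qe = np.empty((n_episodes, env.action_space.n), dtype=np.float64)
    returns = np.empty(n_episodes, dtype=np.float64)
    actions = np.empty(n_episodes, dtype=np.int64)
    name = 'Thompson Sampling {}, {}'.format(alpha, beta)
    for e in tqdm(range(n_episodes),
                  desc='Episodes for: ' + name,
                  leave=False):
        # draw one sample per arm from a normal centered at Q;
        # with beta=0 an unvisited arm (N == 0) gets an infinite scale
        samples = np.random.normal(
            loc=Q, scale=alpha/(np.sqrt(N) + beta))
        action = np.argmax(samples)
        # env.step follows the older Gym API that returns a 4-tuple
        _, reward, _, _ = env.step(action)
        N[action] += 1
        # incremental update of the running mean
        Q[action] = Q[action] + (reward - Q[action])/N[action]
        Qe[e] = Q
        returns[e] = reward
        actions[e] = action
    return name, returns, Qe, actions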
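The function above touches only two things on `env`: `env.action_space.n` and `env.step(action)`, which it unpacks as a 4-tuple. A minimal stand-in that satisfies this interface could look like the sketch below. The class `GaussianBanditEnv` and its arm means are hypothetical and are not the environments used in the gdrl repository.

```python
import numpy as np
from types import SimpleNamespace

class GaussianBanditEnv:
    """Hypothetical bandit environment with normally distributed rewards."""

    def __init__(self, means, std=1.0, seed=0):
        self.means = np.asarray(means, dtype=np.float64)
        self.std = std
        self.rng = np.random.default_rng(seed)
        # only the attribute `n` (number of arms) is read by the sampling functions
        self.action_space = SimpleNamespace(n=len(self.means))

    def step(self, action):
        # reward of the pulled arm: its mean plus Gaussian noise
        reward = float(self.rng.normal(self.means[action], self.std))
        # older Gym convention: (observation, reward, done, info)
        return 0, reward, True, {}

env = GaussianBanditEnv([0.0, 1.0, 2.0])
_, reward, _, _ = env.step(2)
print(env.action_space.n, reward)
```

An object like this can be passed as the `env` argument when experimenting with the sampling functions outside the original notebook.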

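The core of the code is the sampling step and the incremental update that follows it.

In other words, Gaussian Thompson sampling uses a normal distribution as both prior and posterior; after each pull I only need to update the mean and variance of the normal distribution for the corresponding arm. In this code the mean is the running average Q[action], and the standard deviation alpha/(np.sqrt(N) + beta) shrinks as the pull count N of the arm grows. Note that the Gaussian distribution is also conjugate: a normal prior on the mean of a normal likelihood with known variance yields a normal posterior.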
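For comparison, the textbook normal-normal conjugate update (normal prior on the arm mean, normal rewards with known noise) can be sketched as follows. This is not the gdrl code: the function name, the prior parameters `prior_mean` and `prior_std`, the known `noise_std` and the simulated `true_means` are all assumptions made for illustration.

```python
import numpy as np

def gaussian_thompson_sampling(true_means, noise_std=1.0, prior_mean=0.0,
                               prior_std=10.0, n_rounds=2000, seed=0):
    """Thompson sampling with a normal prior and a known-variance normal likelihood."""
    rng = np.random.default_rng(seed)
    true_means = np.asarray(true_means, dtype=np.float64)
    n_arms = len(true_means)
    mean = np.full(n_arms, prior_mean, dtype=np.float64)                # posterior means
    precision = np.full(n_arms, 1.0 / prior_std**2, dtype=np.float64)  # 1 / posterior variance
    counts = np.zeros(n_arms, dtype=np.int64)
    obs_precision = 1.0 / noise_std**2
    for _ in range(n_rounds):
        # one draw per arm from its current normal posterior
        samples = rng.normal(loc=mean, scale=1.0 / np.sqrt(precision))
        action = int(np.argmax(samples))
        # simulated continuous reward
        reward = rng.normal(true_means[action], noise_std)
        # conjugate update: precisions add, the mean is a precision-weighted average
        new_precision = precision[action] + obs_precision
        mean[action] = (precision[action] * mean[action]
                        + obs_precision * reward) / new_precision
        precision[action] = new_precision
        counts[action] += 1
    return mean, precision, counts

mean, precision, counts = gaussian_thompson_sampling([-1.0, 0.0, 2.0])
print(counts)
```

After N pulls the posterior standard deviation here is 1/sqrt(1/prior_std^2 + N/noise_std^2), which is close to noise_std/sqrt(N) for large N. That is the same 1/sqrt(N) shrinkage as the alpha/(np.sqrt(N) + beta) scale in the function above.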

--------------------------------------------------

Some good English-language resources on Thompson sampling:

https://towardsdatascience.com/thompson-sampling-fc28817eacb8

https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf

====================================

For reference, here are some other sampling methods as well:

Softmax sampling:

import sys

import numpy as np
from tqdm import tqdm

def softmax(env,
            init_temp=float('inf'),
            min_temp=0.0,
            decay_ratio=0.04,
            n_episodes=1000):
    # Q: running mean reward per arm; N: pull count per arm
    Q = np.zeros((env.action_space.n), dtype=np.float64)
    N = np.zeros((env.action_space.n), dtype=np.int64)
    Qe = np.empty((n_episodes, env.action_space.n), dtype=np.float64)
    returns = np.empty(n_episodes, dtype=np.float64)
    actions = np.empty(n_episodes, dtype=np.int64)
    name = 'Lin SoftMax {}, {}, {}'.format(init_temp,
                                           min_temp,
                                           decay_ratio)
    # can't really use infinity
    init_temp = min(init_temp,
                    sys.float_info.max)
    # can't really use zero
    min_temp = max(min_temp,
                   np.nextafter(np.float32(0),
                                np.float32(1)))
    for e in tqdm(range(n_episodes),
                  desc='Episodes for: ' + name,
                  leave=False):
        # linearly decay the temperature from init_temp to min_temp
        decay_episodes = n_episodes * decay_ratio
        temp = 1 - e / decay_episodes
        temp *= init_temp - min_temp
        temp += min_temp
        temp = np.clip(temp, min_temp, init_temp)

        # softmax over Q / temp, shifted by the max for numerical stability
        scaled_Q = Q / temp
        norm_Q = scaled_Q - np.max(scaled_Q)
        exp_Q = np.exp(norm_Q)
        probs = exp_Q / np.sum(exp_Q)
        assert np.isclose(probs.sum(), 1.0)

        action = np.random.choice(np.arange(len(probs)),
                                  size=1,
                                  p=probs)[0]
        # env.step follows the older Gym API that returns a 4-tuple
        _, reward, _, _ = env.step(action)
        N[action] += 1
        Q[action] = Q[action] + (reward - Q[action])/N[action]
        Qe[e] = Q
        returns[e] = reward
        actions[e] = action
    return name, returns, Qe, actions
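The effect of the temperature is easiest to see in isolation. The helper `softmax_probs` below is a hypothetical extraction of the probability computation from the function above, and the Q values are made up for the example.

```python
import numpy as np

def softmax_probs(Q, temp):
    """Softmax action probabilities for estimates Q at temperature temp."""
    scaled_Q = np.asarray(Q, dtype=np.float64) / temp
    norm_Q = scaled_Q - np.max(scaled_Q)  # shift by the max for numerical stability
    exp_Q = np.exp(norm_Q)
    return exp_Q / np.sum(exp_Q)

Q = [1.0, 2.0, 3.0]
hot = softmax_probs(Q, temp=1000.0)  # high temperature: close to uniform
cold = softmax_probs(Q, temp=0.01)   # low temperature: close to greedy
print(hot, cold)
```

A high temperature therefore means almost pure exploration, and a temperature near zero means almost pure exploitation; the decay schedule moves from the first regime to the second.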

Upper confidence bound (UCB) sampling:

import numpy as np
from tqdm import tqdm

def upper_confidence_bound(env,
                           c=2,
                           n_episodes=1000):
    # Q: running mean reward per arm; N: pull count per arm
    Q = np.zeros((env.action_space.n), dtype=np.float64)
    N = np.zeros((env.action_space.n), dtype=np.int64)
    Qe = np.empty((n_episodes, env.action_space.n), dtype=np.float64)
    returns = np.empty(n_episodes, dtype=np.float64)
    actions = np.empty(n_episodes, dtype=np.int64)
    name = 'UCB {}'.format(c)
    for e in tqdm(range(n_episodes),
                  desc='Episodes for: ' + name,
                  leave=False):
        # pull every arm once first, so that N > 0 in the bonus below
        action = e
        if e >= len(Q):
            # exploration bonus: larger for arms that were pulled less often
            U = c * np.sqrt(np.log(e)/N)
            action = np.argmax(Q + U)
        # env.step follows the older Gym API that returns a 4-tuple
        _, reward, _, _ = env.step(action)
        N[action] += 1
        Q[action] = Q[action] + (reward - Q[action])/N[action]
        Qe[e] = Q
        returns[e] = reward
        actions[e] = action
    return name, returns, Qe, actions
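A small numeric example shows how the bonus term steers the choice. The helper `ucb_scores` is a hypothetical extraction of the `Q + U` computation above, and the estimates and pull counts are made up.

```python
import numpy as np

def ucb_scores(Q, N, t, c=2.0):
    """Mean estimate plus exploration bonus c * sqrt(log(t) / N) for every arm."""
    Q = np.asarray(Q, dtype=np.float64)
    N = np.asarray(N, dtype=np.float64)
    U = c * np.sqrt(np.log(t) / N)
    return Q + U

Q = [0.5, 0.6, 0.4]  # current mean estimates
N = [100, 100, 1]    # the third arm has been pulled only once
scores = ucb_scores(Q, N, t=201)
action = int(np.argmax(scores))
print(scores, action)
```

Although the third arm has the lowest estimate, its large bonus makes it the chosen action, which is how UCB forces rarely pulled arms to be re-examined.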

Note that the Python function blocks defined in this post are covered by the BSD 3-Clause License, reproduced below:

BSD 3-Clause License

Copyright (c) 2018, Miguel Morales
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
