Prioritized Experience Replay

JAN 26, 2016


Schaul, Quan, Antonoglou, Silver, 2016

This blog is from: http://pemami4911.github.io/paper-summaries/2016/01/26/prioritizing-experience-replay.html

Summary

Uniform sampling from replay memories is not an efficient way to learn. Rather, by using a clever prioritization scheme to rank the experiences in replay memory, learning can be carried out much faster and more effectively. However, this non-uniform sampling introduces bias; hence, weighted importance sampling must be employed to correct for it. Experiments on the Arcade Learning Environment show that Double DQN with prioritized sampling significantly outperforms the previous state-of-the-art Atari results.

Evidence

  • Implemented Double DQN, the main changes being the addition of prioritized experience replay sampling and importance-sampling corrections
  • Tested on the Arcade Learning Environment

Strengths

  • Lots of insight about the repercussions of this research and plenty of discussion on extensions

Notes

  • The magnitude of the TD-error indicates how unexpected a certain transition was
  • The TD-error can be a poor estimate of how much an agent can learn from a transition when rewards are noisy
  • Problems with greedily selecting experiences:
    • High-error transitions are replayed too frequently
    • Low-error transitions are almost entirely ignored
    • Expensive to update entire replay memory, so errors are only updated for transitions that are replayed
    • Lack of diversity leads to over-fitting
  • A stochastic sampling method is introduced that strikes a balance between greedy prioritization and uniform random sampling (the standard approach)
  • Two variants of $P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$ were studied, where $P(i)$ is the probability of sampling transition $i$, $p_i > 0$ is the priority of transition $i$, and the exponent $\alpha$ determines how much prioritization is used, with $\alpha = 0$ recovering the uniform case (both variants are sketched in code after this list)
    • Variant 1: proportional prioritization, where $p_i = |\delta_i| + \epsilon$ and $\epsilon$ is a small positive constant that prevents the edge case of a transition never being revisited once its error reaches zero; $\delta_i$ is the TD-error
    • Variant 2: rank-based prioritization, with $p_i = \frac{1}{\text{rank}(i)}$, where $\text{rank}(i)$ is the rank of transition $i$ when the replay memory is sorted according to $|\delta_i|$
  • Key insight: estimating the expected value of the total discounted reward with stochastic updates requires that the updates correspond to the same distribution as the expectation. Prioritized replay introduces a bias that changes this distribution in an uncontrolled fashion. This can be corrected with importance-sampling (IS) weights $w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta$, which fully compensate for the non-uniform probabilities $P(i)$ when $\beta = 1$. The weights are folded into the Q-learning update by using $w_i \delta_i$ in place of $\delta_i$, and are normalized by $\frac{1}{\max_i w_i}$ (see the second sketch below)
  • $\beta$ is annealed from $\beta_0$ to 1, so the IS correction is felt most strongly toward the end of training; this is because the unbiased nature of the updates matters most near convergence in RL
  • IS also reduces the gradient magnitudes, which is good for optimization: it allows the algorithm to follow the curvature of highly non-linear optimization landscapes, because the Taylor expansion on which gradient descent relies is constantly re-approximated
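
To make the two priority variants concrete, here is a minimal NumPy sketch. The function and variable names are my own, not from the paper's implementation; $\alpha = 0.6$ is the exponent the paper pairs with the proportional variant.

```python
import numpy as np

def proportional_priorities(td_errors, eps=1e-6):
    """Variant 1: p_i = |delta_i| + eps (eps keeps zero-error transitions visitable)."""
    return np.abs(td_errors) + eps

def rank_based_priorities(td_errors):
    """Variant 2: p_i = 1 / rank(i), with rank 1 for the largest |delta_i|."""
    order = np.argsort(-np.abs(td_errors))            # indices, largest error first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(td_errors) + 1)   # rank(i) for each transition i
    return 1.0 / ranks

def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform sampling."""
    scaled = priorities ** alpha
    return scaled / scaled.sum()

# Draw a minibatch of replay indices according to P(i)
rng = np.random.default_rng(0)
td_errors = rng.normal(size=1000)                     # stand-in for stored TD-errors
probs = sampling_probabilities(proportional_priorities(td_errors))
batch_idx = rng.choice(len(td_errors), size=32, p=probs)
```

A full implementation does not recompute the whole distribution on every step: the paper uses a sum-tree for the proportional variant, and a precomputed piecewise-linear approximation of the cumulative density for the rank-based variant, so that sampling and priority updates stay cheap.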
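A companion sketch for the IS correction and the annealing schedule; the helper names are again illustrative, and $\beta_0 = 0.4$ is the starting value the paper reports for the proportional variant.

```python
import numpy as np

def annealed_beta(step, total_steps, beta0=0.4):
    """Linearly anneal beta from beta0 up to 1 over the whole training run."""
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)

def is_weights(probs, batch_idx, beta):
    """w_i = (1/N * 1/P(i))^beta, normalized by 1/max_i w_i."""
    N = len(probs)
    w = (N * probs[batch_idx]) ** (-beta)   # equals (1/(N * P(i)))^beta
    return w / w.max()                      # max over the batch (the whole memory also works)

# Toy example over a 5-transition memory: frequently sampled transitions
# receive smaller weights, so their updates are scaled down.
probs = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
batch_idx = np.array([0, 2, 4])
beta = annealed_beta(step=50_000, total_steps=1_000_000)
weights = is_weights(probs, batch_idx, beta)

# The weighted errors w_i * delta_i then replace delta_i in the Q-learning
# update, e.g. by passing `weights` as per-sample loss weights.
```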
