(Repost) Prioritized Experience Replay
JAN 26, 2016
Schaul, Quan, Antonoglou, Silver, 2016
Reposted from: http://pemami4911.github.io/paper-summaries/2016/01/26/prioritizing-experience-replay.html
Summary
Uniform sampling from replay memory is not an efficient way to learn. Instead, by labeling the experiences in replay memory with a clever prioritization scheme, learning can be carried out much faster and more effectively. However, this non-uniform sampling introduces bias, so weighted importance sampling must be employed to correct for it. Experiments on the Arcade Learning Environment show that prioritized sampling with Double DQN significantly outperforms the previous state-of-the-art Atari results.
Evidence
- Implemented Double DQN, with the main changes being the addition of prioritized experience replay sampling and importance-sampling corrections
- Tested on the Arcade Learning Environment
Strengths
- Offers plenty of insight into the implications of this research, along with ample discussion of extensions
Notes
- The magnitude of the TD-error indicates how unexpected a transition was; for Q-learning, $\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$
- The TD-error can be a poor estimate of how much an agent can learn from a transition when rewards are noisy
- Problems with greedily selecting experiences:
- High-error transitions are replayed too frequently
- Low-error transitions are almost entirely ignored
- Expensive to update entire replay memory, so errors are only updated for transitions that are replayed
- Lack of diversity leads to over-fitting
- A stochastic sampling method is introduced that strikes a balance between greedy prioritization and uniform random sampling (the existing approach)
- Two variants of $P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$ were studied, where $P(i)$ is the probability of sampling transition $i$, $p_i > 0$ is the priority of transition $i$, and the exponent $\alpha$ determines how much prioritization is used, with $\alpha = 0$ recovering the uniform case (see the first sketch after this list)
- Variant 1: proportional prioritization, where $p_i = |\delta_i| + \epsilon$ and $\epsilon$ is a small positive constant that prevents the edge case of transitions never being revisited once their error is zero; $\delta_i$ is the TD-error
- Variant 2: rank-based prioritization, with $p_i = \frac{1}{\text{rank}(i)}$, where $\text{rank}(i)$ is the rank of transition $i$ when the replay memory is sorted according to $|\delta_i|$
- Key insight: estimating the expected value of the total discounted reward with stochastic updates requires that the updates correspond to the same distribution as the expectation. Prioritized replay introduces a bias that changes this distribution uncontrollably. This can be corrected by using importance-sampling (IS) weights $w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta$, which fully compensate for the non-uniform probabilities $P(i)$ when $\beta = 1$. These weights are folded into the Q-learning update by using $w_i \delta_i$, normalized by $\frac{1}{\max_i w_i}$ (see the second sketch below)
- $\beta$ is annealed from $\beta_0$ to 1, which means the IS correction's effect is felt more strongly at the end of the stochastic process; this is because the unbiased nature of the updates in RL is most important near convergence
- IS also reduces the gradient magnitudes, which is good for optimization; it allows the algorithm to follow the curvature of highly non-linear optimization landscapes, because the Taylor expansion underlying gradient descent is constantly re-approximated
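
To make the two prioritization variants concrete, here is a minimal NumPy sketch (not from the paper's code; the function names and the value $\alpha = 0.6$ are illustrative assumptions) that converts TD-errors into sampling probabilities and draws a minibatch:

```python
import numpy as np

def proportional_priorities(td_errors, epsilon=1e-4):
    # Variant 1: p_i = |delta_i| + epsilon, so zero-error transitions
    # still have a nonzero chance of being revisited
    return np.abs(td_errors) + epsilon

def rank_based_priorities(td_errors):
    # Variant 2: p_i = 1 / rank(i), where rank 1 is the transition
    # with the largest |delta_i|
    order = np.argsort(-np.abs(td_errors))
    ranks = np.empty(len(td_errors))
    ranks[order] = np.arange(1, len(td_errors) + 1)
    return 1.0 / ranks

def sampling_probabilities(priorities, alpha=0.6):
    # P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform
    # sampling, and larger alpha means more aggressive prioritization
    scaled = priorities ** alpha
    return scaled / scaled.sum()

# Example: draw a minibatch of 4 transitions from a toy memory of 10
rng = np.random.default_rng(0)
td_errors = rng.normal(size=10)
probs = sampling_probabilities(proportional_priorities(td_errors))
batch_idx = rng.choice(len(td_errors), size=4, p=probs, replace=False)
```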
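And a hedged sketch of the IS correction, again with illustrative names, an illustrative $\beta_0 = 0.4$, and the common implementation choice (not spelled out in the summary above) of taking the normalizing max over the sampled minibatch:

```python
import numpy as np

def is_weights(probs, batch_idx, beta):
    # w_i = (1/N * 1/P(i))^beta; beta = 1 fully compensates for
    # the non-uniform sampling probabilities P(i)
    N = len(probs)
    w = (1.0 / (N * probs[batch_idx])) ** beta
    # normalize by 1/max_i w_i so the weights only scale updates down
    return w / w.max()

def anneal_beta(step, total_steps, beta0=0.4):
    # linearly anneal beta from beta0 to 1, so the bias correction is
    # fully in effect near convergence, where it matters most
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)

# Usage: the weighted update replaces delta_i with w_i * delta_i,
# e.g. loss = np.mean(weights * td_errors ** 2) for a squared TD loss.
```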