(Repost) Prioritized Experience Replay
JAN 26, 2016
Schaul, Quan, Antonoglou, Silver, 2016
Reposted from: http://pemami4911.github.io/paper-summaries/2016/01/26/prioritizing-experience-replay.html
Summary
Uniform sampling from replay memory is not an efficient way to learn. Instead, by labeling the experiences in replay memory with a clever prioritization scheme, learning can be carried out much faster and more effectively. However, this non-uniform sampling introduces bias, so weighted importance sampling must be employed to correct for it. Experiments on the Arcade Learning Environment show that prioritized sampling with Double DQN significantly outperforms the previous state-of-the-art Atari results.
Evidence
- Implemented Double DQN, with the main changes being the addition of prioritized experience replay sampling and importance-sampling corrections
- Tested on the Arcade Learning Environment
Strengths
- Offers plenty of insight into the implications of this research, along with ample discussion of extensions
Notes
- The magnitude of the TD-error indicates how unexpected a transition was; for Q-learning, $\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$
- The TD-error can be a poor estimate of how much an agent can learn from a transition when rewards are noisy
- Problems with greedily selecting experiences:
- High-error transitions are replayed too frequently
- Low-error transitions are almost entirely ignored
- Expensive to update entire replay memory, so errors are only updated for transitions that are replayed
- Lack of diversity leads to over-fitting
- A stochastic sampling method is introduced that strikes a balance between greedy prioritization and uniform random sampling (the existing approach)
- Two variants of $P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$ were studied, where $P(i)$ is the probability of sampling transition $i$, $p_i > 0$ is the priority of transition $i$, and the exponent $\alpha$ determines how much prioritization is used, with $\alpha = 0$ recovering the uniform case (see the first sketch after this list)
- Variant 1: proportional prioritization, where $p_i = |\delta_i| + \epsilon$ and $\epsilon$ is a small positive constant that prevents the edge case of transitions never being revisited once their error is zero; $\delta_i$ is the TD-error
- Variant 2: rank-based prioritization, with $p_i = \frac{1}{\text{rank}(i)}$, where $\text{rank}(i)$ is the rank of transition $i$ when the replay memory is sorted according to $|\delta_i|$
- Key insight: estimating the expected value of the total discounted reward with stochastic updates requires that the updates correspond to the same distribution as the expectation. Prioritized replay introduces a bias that changes this distribution uncontrollably. This can be corrected by using importance-sampling (IS) weights $w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta$, which fully compensate for the non-uniform probabilities $P(i)$ when $\beta = 1$. These weights are folded into the Q-learning update by using $w_i \delta_i$, normalized by $\frac{1}{\max_i w_i}$ (see the second sketch below)
- $\beta$ is annealed from $\beta_0$ to 1, which means the IS correction's effect is felt more strongly at the end of the stochastic process; this is because the unbiased nature of the updates in RL is most important near convergence
- IS also reduces the gradient magnitudes, which is good for optimization; it allows the algorithm to follow the curvature of highly non-linear optimization landscapes, because the Taylor expansion underlying gradient descent is constantly re-approximated
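
To make the two prioritization variants concrete, here is a minimal NumPy sketch (not from the paper's code; the function names and the value $\alpha = 0.6$ are illustrative assumptions) that converts TD-errors into sampling probabilities and draws a minibatch:

```python
import numpy as np

def proportional_priorities(td_errors, epsilon=1e-4):
    # Variant 1: p_i = |delta_i| + epsilon, so zero-error transitions
    # still have a nonzero chance of being revisited
    return np.abs(td_errors) + epsilon

def rank_based_priorities(td_errors):
    # Variant 2: p_i = 1 / rank(i), where rank 1 is the transition
    # with the largest |delta_i|
    order = np.argsort(-np.abs(td_errors))
    ranks = np.empty(len(td_errors))
    ranks[order] = np.arange(1, len(td_errors) + 1)
    return 1.0 / ranks

def sampling_probabilities(priorities, alpha=0.6):
    # P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform
    # sampling, and larger alpha means more aggressive prioritization
    scaled = priorities ** alpha
    return scaled / scaled.sum()

# Example: draw a minibatch of 4 transitions from a toy memory of 10
rng = np.random.default_rng(0)
td_errors = rng.normal(size=10)
probs = sampling_probabilities(proportional_priorities(td_errors))
batch_idx = rng.choice(len(td_errors), size=4, p=probs, replace=False)
```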
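And a hedged sketch of the IS correction, again with illustrative names, an illustrative $\beta_0 = 0.4$, and the common implementation choice (not spelled out in the summary above) of taking the normalizing max over the sampled minibatch:

```python
import numpy as np

def is_weights(probs, batch_idx, beta):
    # w_i = (1/N * 1/P(i))^beta; beta = 1 fully compensates for
    # the non-uniform sampling probabilities P(i)
    N = len(probs)
    w = (1.0 / (N * probs[batch_idx])) ** beta
    # normalize by 1/max_i w_i so the weights only scale updates down
    return w / w.max()

def anneal_beta(step, total_steps, beta0=0.4):
    # linearly anneal beta from beta0 to 1, so the bias correction is
    # fully in effect near convergence, where it matters most
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)

# Usage: the weighted update replaces delta_i with w_i * delta_i,
# e.g. loss = np.mean(weights * td_errors ** 2) for a squared TD loss.
```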