Reinforcement Learning Reading Notes - 06~07 - Temporal-Difference Learning
Reading notes on:
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016
If the mathematical notation is hard to follow, first see the notes on terminology and notation (note 00 in the references).
Temporal-Difference Learning in Brief
Temporal-difference (TD) learning combines ideas from dynamic programming and Monte Carlo methods, and is a central idea of reinforcement learning.
The term "temporal difference" is not very intuitive; it may help to read it as learning from the current difference, i.e., learning from the difference (error) computed at the present step.
The Monte Carlo approach simulates (or experiences) an episode and, only after the episode ends, uses the returns observed along the episode to estimate the state values.
Temporal-difference learning also simulates (or experiences) an episode, but after every step (or every few steps) it uses the value of the new state to update the estimate of the preceding state.
Monte Carlo can therefore be viewed as temporal-difference learning with the maximum possible number of steps.
Chapter 6 of the book considers only one-step TD learning; multi-step TD learning is the subject of Chapter 7. These notes cover both.
Mathematical Formulation
From what we already know: if we can compute the value of a policy \(\pi\) (the state value \(v_{\pi}(s)\) or the action value \(q_{\pi}(s, a)\)), we can then improve the policy.
In the Monte Carlo method, evaluating a policy requires completing an episode and using the episode's return \(G_t\) as the target for the state value:
Monte Carlo update
\[
V(S_t) \gets V(S_t) + \alpha \delta_t \\
\delta_t = G_t - V(S_t) \\
\text{where} \\
\delta_t \text{ - Monte Carlo error} \\
\alpha \text{ - learning step size}
\]
The idea of temporal difference is to estimate a state's value from the value of the next state, giving an iterative update:
TD(0) update
\[
V(S_t) \gets V(S_t) + \alpha \delta_t \\
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\
\text{where} \\
\delta_t \text{ - TD error} \\
\alpha \text{ - learning step size} \\
\gamma \text{ - reward discount rate}
\]
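To make the difference concrete, here is a minimal sketch (not the book's code) of the two updates as Python functions; `V` is assumed to be a dict-like table of state values.

```python
def mc_update(V, s, G, alpha):
    """Monte Carlo: move V(s) toward the observed return G of a completed episode."""
    V[s] += alpha * (G - V[s])                        # delta_t = G - V(s)

def td0_update(V, s, r, s_next, alpha, gamma):
    """TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])    # delta_t = TD error
```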
Note: the book points out that the TD error is not exact (it bootstraps from the current value estimate), whereas the Monte Carlo error is exact. This is worth being aware of, but it is not elaborated further here.
Temporal-Difference Learning Methods
These notes cover both the one-step methods of Chapter 6 and the multi-step (n-step) methods of Chapter 7:
- TD learning of the state value \(v_{\pi}\) of a policy (one-step / n-step)
- On-policy TD learning of the action value \(q_{\pi}\): Sarsa (one-step / n-step)
- Off-policy TD learning of the action value \(q_{\pi}\): Q-learning (one-step)
- Double Q-learning (one-step)
- Off-policy TD learning of the action value \(q_{\pi}\) with importance sampling: Sarsa (n-step)
- Off-policy TD learning of the action value \(q_{\pi}\) without importance sampling: the Tree Backup algorithm (n-step)
- Off-policy TD learning of the action value \(q_{\pi}\): \(Q(\sigma)\) (n-step)
TD Learning of the State Value \(v_{\pi}\)
One-step algorithm: TD(0)
- Flowchart (figure omitted)
Algorithm
Initialize \(V(s)\) arbitrarily \(\forall s \in \mathcal{S}^+\)
Repeat (for each episode):
Initialize \(S\)
Repeat (for each step of episode):
\(A \gets\) action given by \(\pi\) for \(S\)
Take action \(A\), observe \(R, S'\)
\(V(S) \gets V(S) + \alpha [R + \gamma V(S') - V(S)]\)
\(S \gets S'\)
Until S is terminal
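A minimal tabular TD(0) prediction sketch in Python, assuming a hypothetical Gym-style environment whose `reset()` returns a state and whose `step(a)` returns `(next_state, reward, done)`; `policy(s)` is any callable returning an action.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation (assumed Gym-style env interface)."""
    V = defaultdict(float)                     # V(s) = 0 for unseen states
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])   # V(terminal) = 0
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```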
n-step algorithm
- Flowchart (figure omitted)
Algorithm
Input: the policy \(\pi\) to be evaluated
Initialize \(V(s)\) arbitrarily \(\forall s \in \mathcal{S}\)
Parameters: step size \(\alpha \in (0, 1]\), a positive integer \(n\)
All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)
Repeat (for each episode):
Initialize and store \(S_0 \ne terminal\)
\(T \gets \infty\)
For \(t = 0,1,2,\cdots\):
If \(t < T\), then:
Take an action according to \(\pi(\cdot | S_t)\)
Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
If \(S_{t+1}\) is terminal, then \(T \gets t+1\)
\(\tau \gets t - n + 1 \ \) (\(\tau\) is the time whose state's estimate is being updated)
If \(\tau \ge 0\):
\(G \gets \sum_{i = \tau + 1}^{min(\tau + n, T)} \gamma^{i-\tau-1}R_i\)
if \(\tau + n \le T\) then: \(G \gets G + \gamma^{n}V(S_{\tau + n}) \qquad \qquad (G_{\tau}^{(n)})\)
\(V(S_{\tau}) \gets V(S_{\tau}) + \alpha [G - V(S_{\tau})]\)
Until \(\tau = T - 1\)
Note that the update of \(V(S_0)\) uses \(V(S_0)\) and \(V(S_n)\) (together with the rewards \(R_1, \dots, R_n\)); the update of \(V(S_1)\) uses \(V(S_1)\) and \(V(S_{n+1})\), and so on.
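The same n-step prediction loop as a Python sketch. For readability the whole trajectory is kept in plain lists instead of the mod-\(n\) circular buffers of the pseudocode; the environment and policy interface is the same assumed one as above.

```python
from collections import defaultdict

def n_step_td_prediction(env, policy, num_episodes, n, alpha=0.1, gamma=1.0):
    """n-step TD policy evaluation (assumed Gym-style env interface)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        states = [env.reset()]
        rewards = [0.0]                  # dummy R_0 so that rewards[t] == R_t
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(states[t]))
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1              # time whose estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:          # bootstrap from V(S_{tau+n})
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```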
On-policy TD Learning of the Action Value \(q_{\pi}\): Sarsa
One-step algorithm
- Flowchart (figure omitted)
Algorithm
Initialize \(Q(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily, and \(Q(\text{terminal-state}, \cdot) = 0\)
Repeat (for each episode):
Initialize \(S\)
Choose \(A\) from \(S\) using policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
Repeat (for each step of episode):
Take action \(A\), observe \(R, S'\)
Choose \(A'\) from \(S'\) using policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
\(Q(S, A) \gets Q(S, A) + \alpha [R + \gamma Q(S', A') - Q(S, A)]\)
\(S \gets S'; A \gets A';\)
Until S is terminal
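A tabular one-step Sarsa sketch under the same assumed environment interface; `actions` is the finite action list and `epsilon_greedy` is a small helper, both introduced here for illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    """With probability eps pick a random action, otherwise a greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, eps=0.1):
    """Tabular one-step Sarsa (assumed Gym-style env interface)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, eps)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```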
n-step algorithm
- Flowchart (figure omitted)
Algorithm
Initialize \(Q(s, a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to Q, or to a fixed given policy
Parameters: step size \(\alpha \in (0, 1]\),
small \(\epsilon > 0\)
a positive integer \(n\)
All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)
Repeat (for each episode):
Initialize and store \(S_0 \ne terminal\)
Select and store an action \(A_0 \sim \pi(\cdot | S_0)\)
\(T \gets \infty\)
For \(t = 0,1,2,\cdots\):
If \(t < T\), then:
Take an action \(A_t\)
Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
If \(S_{t+1}\) is terminal, then:
\(T \gets t+1\)
Else:
Select and store an action \(A_{t+1} \sim \pi(\cdot | S_{t+1})\)
\(\tau \gets t - n + 1 \ \) (\(\tau\) is the time whose state's estimate is being updated)
If \(\tau \ge 0\):
\(G \gets \sum_{i = \tau + 1}^{min(\tau + n, T)} \gamma^{i-\tau-1}R_i\)
if \(\tau + n \le T\) then: \(G \gets G + \gamma^{n} Q(S_{\tau + n}, A_{\tau + n}) \qquad \qquad (G_{\tau}^{(n)})\)
\(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha [G - Q(S_{\tau}, A_{\tau})]\)
If \(\pi\) is being learned, then ensure that \(\pi(\cdot | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q\)
Until \(\tau = T - 1\)
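An n-step Sarsa sketch along the same lines, reusing the `epsilon_greedy` helper from the one-step sketch above and plain lists instead of the mod-\(n\) buffers.

```python
from collections import defaultdict

def n_step_sarsa(env, actions, num_episodes, n, alpha=0.1, gamma=1.0, eps=0.1):
    """On-policy n-step Sarsa (assumed Gym-style env interface)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        states = [env.reset()]
        acts = [epsilon_greedy(Q, states[0], actions, eps)]
        rewards = [0.0]                  # dummy R_0 so that rewards[t] == R_t
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(acts[t])
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    acts.append(epsilon_greedy(Q, s_next, actions, eps))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:          # bootstrap from Q(S_{tau+n}, A_{tau+n})
                    G += gamma ** n * Q[(states[tau + n], acts[tau + n])]
                sa = (states[tau], acts[tau])
                Q[sa] += alpha * (G - Q[sa])
            if tau == T - 1:
                break
            t += 1
    return Q
```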
Off-policy TD Learning of the Action Value \(q_{\pi}\): Q-learning
Q-learning (Watkins, 1989) was a breakthrough algorithm. It performs off-policy learning with the following update:
\[
Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]
\]
One-step algorithm
Algorithm
Initialize \(Q(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily, and \(Q(\text{terminal-state}, \cdot) = 0\)
Repeat (for each episode):
Initialize \(S\)
Repeat (for each step of episode):
Choose \(A\) from \(S\) using policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
Take action \(A\), observe \(R, S'\)
\(Q(S, A) \gets Q(S, A) + \alpha [R + \gamma \max_a Q(S', a) - Q(S, A)]\)
\(S \gets S'\)
Until S is terminal
Because Q-learning takes a max over actions in its target, it suffers from a maximization bias.
See Example 6.7 in the book for a concrete illustration.
Double Q-learning eliminates this bias.
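Before moving on to Double Q-learning, here is a minimal tabular Q-learning sketch under the same assumed environment interface; the behavior policy is \(\epsilon\)-greedy, while the update bootstraps from \(\max_a Q(S', a)\), which is what makes it off-policy.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, eps=0.1):
    """Tabular one-step Q-learning (assumed Gym-style env interface)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < eps:                    # eps-greedy behavior
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```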
Double Q-learning
One-step algorithm
Initialize \(Q_1(s, a)\) and \(Q_2(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily
Initialize \(Q_1(\text{terminal-state}, \cdot) = Q_2(\text{terminal-state}, \cdot) = 0\)
Repeat (for each episode):
Initialize \(S\)
Repeat (for each step of episode):
Choose \(A\) from \(S\) using policy derived from \(Q_1 + Q_2\) (e.g. \(\epsilon\)-greedy)
Take action \(A\), observe \(R, S'\)
With probability 0.5:
\(Q_1(S, A) \gets Q_1(S, A) + \alpha [R + \gamma Q_2(S', \underset{a}{\arg\max} \ Q_1(S', a)) - Q_1(S, A)]\)
Else:
\(Q_2(S, A) \gets Q_2(S, A) + \alpha [R + \gamma Q_1(S', \underset{a}{\arg\max} \ Q_2(S', a)) - Q_2(S, A)]\)
\(S \gets S';\)
Until S is terminal
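A Double Q-learning sketch under the same assumptions: actions are chosen \(\epsilon\)-greedily with respect to \(Q_1 + Q_2\), and each step updates one table using the other table's value of the action that the updated table prefers.

```python
import random
from collections import defaultdict

def double_q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, eps=0.1):
    """Tabular Double Q-learning (assumed Gym-style env interface)."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < eps:                    # eps-greedy on Q1 + Q2
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q1[(s, x)] + Q2[(s, x)])
            s_next, r, done = env.step(a)
            if random.random() < 0.5:                    # update Q1 using Q2
                a_star = max(actions, key=lambda x: Q1[(s_next, x)])
                target = r + (0.0 if done else gamma * Q2[(s_next, a_star)])
                Q1[(s, a)] += alpha * (target - Q1[(s, a)])
            else:                                        # update Q2 using Q1
                a_star = max(actions, key=lambda x: Q2[(s_next, x)])
                target = r + (0.0 if done else gamma * Q1[(s_next, a_star)])
                Q2[(s, a)] += alpha * (target - Q2[(s, a)])
            s = s_next
    return Q1, Q2
```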
Off-policy TD Learning of the Action Value \(q_{\pi}\) (with importance sampling): Sarsa
Introducing the importance sampling ratio \(\rho\) into the Sarsa update turns it into an off-policy method.
\(\rho\) - the importance sampling ratio
\[
\rho \gets \prod_{i = \tau + 1}^{\min(\tau + n - 1, T - 1)} \frac{\pi(A_i|S_i)}{\mu(A_i|S_i)} \qquad \qquad (\rho_{\tau+n}^{(\tau+1)})
\]
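A small sketch of how \(\rho\) can be computed, assuming `pi(a, s)` and `mu(a, s)` are callables returning action probabilities and that the trajectory is stored in plain lists:

```python
def importance_ratio(pi, mu, states, acts, tau, n, T):
    """Product of pi(A_i|S_i) / mu(A_i|S_i) for i = tau+1 .. min(tau+n-1, T-1)."""
    rho = 1.0
    for i in range(tau + 1, min(tau + n - 1, T - 1) + 1):
        rho *= pi(acts[i], states[i]) / mu(acts[i], states[i])
    return rho
```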
n-step algorithm
Algorithm
Input: behavior policy \(\mu\) such that \(\mu(a|s) > 0, \forall s \in \mathcal{S}, a \in \mathcal{A}\)
Initialize \(Q(s,a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to Q, or to a fixed given policy
Parameters: step size \(\alpha \in (0, 1]\),
small \(\epsilon > 0\)
a positive integer \(n\)
All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)
Repeat (for each episode):
Initialize and store \(S_0 \ne terminal\)
Select and store an action \(A_0 \sim \mu(\cdot | S_0)\)
\(T \gets \infty\)
For \(t = 0,1,2,\cdots\):
If \(t < T\), then:
Take an action \(A_t\)
Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
If \(S_{t+1}\) is terminal, then:
\(T \gets t+1\)
Else:
Select and store an action \(A_{t+1} \sim \pi(\cdot | S_{t+1})\)
\(\tau \gets t - n + 1 \ \) (\(\tau\) is the time whose state's estimate is being updated)
If \(\tau \ge 0\):
\(\rho \gets \prod_{i = \tau + 1}^{\min(\tau + n - 1, T - 1)} \frac{\pi(A_i|S_i)}{\mu(A_i|S_i)} \qquad \qquad (\rho_{\tau+n}^{(\tau+1)})\)
\(G \gets \sum_{i = \tau + 1}^{min(\tau + n, T)} \gamma^{i-\tau-1}R_i\)
if \(\tau + n \le T\) then: \(G \gets G + \gamma^{n} Q(S_{\tau + n}, A_{\tau + n}) \qquad \qquad (G_{\tau}^{(n)})\)
\(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha \rho [G - Q(S_{\tau}, A_{\tau})]\)
If \(\pi\) is being learned, then ensure that \(\pi(\cdot | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q\)
Until \(\tau = T - 1\)
Expected Sarsa
- Flowchart (figure omitted)
- Algorithm
Omitted.
Off-policy TD Learning of the Action Value \(q_{\pi}\) (without importance sampling): the Tree Backup Algorithm
The idea of the Tree Backup algorithm is to back up the expected action value at every step.
Taking this expectation means evaluating every possible action \(a\) once, as in the small sketch below.
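A one-line sketch of that expectation, assuming `pi(a, s)` returns the target policy's probability of action `a` in state `s` and `Q` is a dict keyed by `(s, a)`:

```python
def expected_action_value(Q, s, pi, actions):
    """sum_a pi(a|s) * Q(s, a), the expectation used by Expected Sarsa and Tree Backup."""
    return sum(pi(a, s) * Q[(s, a)] for a in actions)
```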
n-step algorithm
- Flowchart (figure omitted)
Algorithm
Initialize \(Q(s,a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to Q, or to a fixed given policy
Parameters: step size \(\alpha \in (0, 1]\),
small \(\epsilon > 0\)
a positive integer \(n\)
All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)
Repeat (for each episode):
Initialize and store \(S_0 \ne terminal\)
Select and store an action \(A_0 \sim \pi(\cdot | S_0)\)
\(Q_0 \gets Q(S_0, A_0)\)
\(T \gets \infty\)
For \(t = 0,1,2,\cdots\):
If \(t < T\), then:
Take an action \(A_t\)
Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
If \(S_{t+1}\) is terminal, then:
\(T \gets t+1\)
\(\delta_t \gets R_{t+1} - Q_t\)
Else:
\(\delta_t \gets R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})Q(S_{t+1},a) - Q_t\)
Select arbitrarily and store an action as \(A_{t+1}\)
\(Q_{t+1} \gets Q(S_{t+1},A_{t+1})\)
\(\pi_{t+1} \gets \pi(A_{t+1}|S_{t+1})\)
\(\tau \gets t - n + 1 \ \) (\(\tau\) is the time whose state's estimate is being updated)
If \(\tau \ge 0\):
\(E \gets 1\)
\(G \gets Q_{\tau}\)
For \(k=\tau, \dots, \min(\tau + n - 1, T - 1)\):
\(G \gets\ G + E \delta_k\)
\(E \gets\ \gamma E \pi_{k+1}\)
\(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha [G - Q(S_{\tau}, A_{\tau})]\)
If \(\pi\) is being learned, then ensure that \(\pi(a | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q(S_{\tau},\cdot)\)
Until \(\tau = T - 1\)
Off-policy TD Learning of the Action Value \(q_{\pi}\): \(Q(\sigma)\)
\(Q(\sigma)\) unifies Sarsa (with importance sampling), Expected Sarsa, and the Tree Backup algorithm.
With \(\sigma = 1\) it reduces to Sarsa with importance sampling.
With \(\sigma = 0\) it reduces to the Tree Backup expectation-based update; the per-step TD error is sketched below.
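A small sketch of the per-step TD error used by \(Q(\sigma)\), showing how \(\sigma\) interpolates between the sampled (Sarsa) target and the expected (Tree Backup / Expected Sarsa) target; the arguments correspond to \(R_{t+1}\), \(Q_{t+1}\), \(\sum_a \pi(a|S_{t+1})Q(S_{t+1},a)\), and \(Q_t\).

```python
def q_sigma_delta(r, q_next, expected_next, q_current, sigma, gamma):
    """TD error of Q(sigma): sigma = 1 gives the sampled (Sarsa) target,
    sigma = 0 gives the expected (Tree Backup / Expected Sarsa) target."""
    return r + gamma * (sigma * q_next + (1 - sigma) * expected_next) - q_current
```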
n-step algorithm
- Flowchart (figure omitted)
Algorithm
Input: behavior policy \(\mu\) such that \(\mu(a|s) > 0, \forall s \in \mathcal{S}, a \in \mathcal{A}\)
Initialize \(Q(s,a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to Q, or to a fixed given policy
Parameters: step size \(\alpha \in (0, 1]\),
small \(\epsilon > 0\)
a positive integer \(n\)
All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)
Repeat (for each episode):
Initialize and store \(S_0 \ne terminal\)
Select and store an action \(A_0 \sim \mu(\cdot | S_0)\)
\(Q_0 \gets Q(S_0, A_0)\)
\(T \gets \infty\)
For \(t = 0,1,2,\cdots\):
If \(t < T\), then:
Take an action \(A_t\)
Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
If \(S_{t+1}\) is terminal, then:
\(T \gets t+1\)
\(\delta_t \gets R_{t+1} - Q_t\)
Else:
Select and store an action \(A_{t+1} \sim \mu(\cdot|S_{t+1})\)
Select and store \(\sigma_{t+1}\)
\(Q_{t+1} \gets Q(S_{t+1},A_{t+1})\)
\(\delta_t \gets R_{t+1} + \gamma \sigma_{t+1} Q_{t+1} + \gamma (1 - \sigma_{t+1})\sum_a \pi(a|S_{t+1})Q(S_{t+1},a) - Q_t\)
\(\pi_{t+1} \gets \pi(A_{t+1}|S_{t+1})\)
\(\rho_{t+1} \gets \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})}\)
\(\tau \gets t - n + 1 \ \) (\(\tau\) is the time whose state's estimate is being updated)
If \(\tau \ge 0\):
\(\rho \gets 1\)
\(E \gets 1\)
\(G \gets Q_{\tau}\)
For \(k=\tau, \dots, \min(\tau + n - 1, T - 1)\):
\(G \gets\ G + E \delta_k\)
\(E \gets\ \gamma E [(1 - \sigma_{k+1})\pi_{k+1} + \sigma_{k+1}]\)
\(\rho \gets \rho(1 - \sigma_{k} + \sigma_{k}\rho_{k})\)
\(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha \rho [G - Q(S_{\tau}, A_{\tau})]\)
If \(\pi\) is being learned, then ensure that \(\pi(a | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q(S_{\tau},\cdot)\)
Until \(\tau = T - 1\)
Summary
A limitation of temporal-difference methods: reward information must be obtainable within the few steps of each update.
For example, can a reward signal be computed after every single move in chess? With the Monte Carlo method, simulating to the end of the game is guaranteed to yield a reward outcome.
References
- Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016
- Reinforcement Learning Reading Notes - 00 - Terminology and Notation
- Reinforcement Learning Reading Notes - 01 - The Reinforcement Learning Problem
- Reinforcement Learning Reading Notes - 02 - Multi-armed Bandits
- Reinforcement Learning Reading Notes - 03 - Finite Markov Decision Processes
- Reinforcement Learning Reading Notes - 04 - Dynamic Programming
- Reinforcement Learning Reading Notes - 05 - Monte Carlo Methods