SARSA

The SARSA algorithm also estimates Action-Value functions rather than the State-Value function. The difference between SARSA and Monte Carlo is that SARSA does not need to wait for the actual return until the end of the episode; instead, it learns at each time step using an estimate of the return.

At every step, the agent takes an action A from state S, receives a reward R, and arrives at a new state S'. Following the policy π (typically ε-greedy with respect to Q), it then picks the next action A'. So now we have S, A, R, S', A', and the task is to estimate the Q-function of the (S, A) pair.

We borrow the idea used for estimating State-Value functions and apply it to Action-Value function estimation, which gives the update:
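Q(S, A) ← Q(S, A) + α [ R + γ Q(S', A') − Q(S, A) ]

where α is the learning rate and γ is the discount factor.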

Here is the pseudocode for SARSA:
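A minimal tabular sketch of this loop in Python (the environment interface `env.reset()` / `env.step()`, the `actions` list, and the ε-greedy helper are assumptions made for illustration, not part of the original pseudocode):

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                  # tabular Q(S, A), initialized to 0
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)   # take A, observe R and S'
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)  # pick A' with the same policy
            # TD update toward R + gamma * Q(S', A')
            target = r + (0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```

Using `defaultdict(float)` simply means every unvisited (S, A) pair starts with Q = 0, so no explicit initialization loop is needed.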

On-Policy vs Off-Policy

If we look into the learning process, there are actually two policies at work. The first is the policy used to act: taking action A from state S, getting the reward R, and arriving at the next state S'; this is the Behavior Policy, the one that generates the experience. The second is the policy used to form the update target, i.e. to pick the action A' whose Q-value we bootstrap from; this is the Target Policy, the one we are actually trying to learn. In SARSA both roles are played by the same policy π, but they can be different. Q-Learning uses different policies for the two roles.
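To make the contrast concrete, here is how the two bootstrap targets differ for a single transition (a sketch; `Q`, `actions`, `s_next`, `a_next`, `r`, and `gamma` are assumed to be defined as in the SARSA sketch above):

```python
# SARSA (on-policy): bootstrap from the action A' that the behavior policy actually selects next
sarsa_target = r + gamma * Q[(s_next, a_next)]

# Q-Learning (off-policy): bootstrap from the greedy target policy at S',
# regardless of which action the behavior policy will execute next
q_learning_target = r + gamma * max(Q[(s_next, a)] for a in actions)
```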

Q-Learning

From state S', the Q-Learning algorithm picks the action that maximizes the Q-function: it stands at state S', looks at all possible actions, and uses the best one in the update.
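The resulting update is:

Q(S, A) ← Q(S, A) + α [ R + γ max_a Q(S', a) − Q(S, A) ]

Note that the action actually executed from S' can still come from the ε-greedy behavior policy; only the update target uses the greedy choice, which is what makes Q-Learning off-policy.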
