Temporal-Difference Control: SARSA and Q-Learning

SARSA

The SARSA algorithm also estimates the action-value function rather than the state-value function. The difference between SARSA and Monte Carlo is that SARSA does not need to wait until the end of the episode for the actual return; instead, it learns at each time step from an estimate of the return.

In every step, the agent takes an action A from state S, receives a reward R, and arrives at a new state S'. The policy π (typically ε-greedy with respect to the current Q estimates) then selects the next action A'. So now we have S, A, R, S', A', the tuple that gives the algorithm its name, and the task is to estimate the Q-function for the pair (S, A).

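We borrow the idea used for estimating state-value functions and apply it to action-value estimation. With step size α and discount factor γ, the standard SARSA update is:

Q(S, A) ← Q(S, A) + α [R + γ Q(S', A') − Q(S, A)]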

Here is the pseudocode for SARSA:
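As a runnable stand-in for that pseudocode, the following is a minimal Python sketch of tabular SARSA. The corridor environment, the `step` and `epsilon_greedy` helpers, and the values of `alpha`, `gamma`, and `epsilon` are all assumptions made for illustration; none of them come from the original post.

```python
import random
from collections import defaultdict

# Hypothetical toy environment: a corridor of states 0..4.
# Action 0 moves left, action 1 moves right; reaching state 4 ends the episode.
N_STATES = 5
ACTIONS = (0, 1)
GOAL = N_STATES - 1


def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    # Break ties randomly so the untrained agent does not get stuck.
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])


def sarsa(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        state = 0
        action = epsilon_greedy(Q, state, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done = step(state, action)
            # A' is chosen by the same policy that chose A (on-policy).
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            # A terminal state has value 0, so the bootstrap term is dropped.
            bootstrap = 0.0 if done else gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (reward + bootstrap - Q[(state, action)])
            state, action = next_state, next_action
    return Q


if __name__ == "__main__":
    Q = sarsa()
    for s in range(GOAL):
        print(s, round(Q[(s, 0)], 3), round(Q[(s, 1)], 3))
```

Note that the action A' chosen for the update is also the action the agent takes on the next step, which is what ties the learned values to the policy being followed.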

On-Policy vs Off-Policy

If we look into the learning process, each step has two parts. First, the agent takes an action A from state S according to policy π, receives the reward R, and arrives at the next state S'. Second, the update uses the Q-value of an action A' at S'. In SARSA both parts follow the same policy π, which makes it an on-policy method, but the two policies can be different. The policy that selects the actions the agent actually takes is called the behavior policy. The policy that selects A' for the update, and whose value function is being learned, is called the target policy. When the two differ, the method is off-policy. Q-Learning is off-policy: it uses different policies for the two parts.
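To make the distinction concrete, the sketch below computes both update targets from a single transition. The numbers in `q_next`, the action names, and the value of `gamma` are made up for illustration.

```python
gamma = 0.9   # assumed discount factor
reward = 0.0  # assumed reward for the transition S -> S'

# Hypothetical current action-value estimates at the next state S'.
q_next = {"left": 0.2, "right": 0.8}

# Suppose the behavior policy explores and picks the worse action at S'.
next_action = "left"

# On-policy (SARSA): bootstrap from the action that will actually be taken.
sarsa_target = reward + gamma * q_next[next_action]

# Off-policy (Q-Learning): bootstrap from the greedy action, whatever is taken.
q_learning_target = reward + gamma * max(q_next.values())

print(sarsa_target, q_learning_target)  # roughly 0.18 and 0.72
```

The two targets coincide whenever the behavior policy happens to pick the greedy action, so the methods differ only on exploratory steps.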

Q-Learning

From state S', the Q-Learning algorithm uses the action that maximizes the Q-function in its update. It stands at state S', looks over all possible actions, and bootstraps from the best one, regardless of which action the behavior policy actually takes next.
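A minimal Python sketch of tabular Q-Learning follows, using the standard update Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S', a) − Q(S, A)]. The corridor environment, the `step` helper, and the values of `alpha`, `gamma`, and `epsilon` are assumptions made for illustration, not part of the original post.

```python
import random
from collections import defaultdict

# Hypothetical toy environment: a corridor of states 0..4.
# Action 0 moves left, action 1 moves right; reaching state 4 ends the episode.
N_STATES = 5
ACTIONS = (0, 1)
GOAL = N_STATES - 1


def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behavior policy: epsilon-greedy over the current estimates.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(Q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if Q[(state, a)] == best])
            next_state, reward, done = step(state, action)
            # Target policy: greedy. Bootstrap from the best action at S',
            # even if the next action actually taken turns out to be different.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q


if __name__ == "__main__":
    Q = q_learning()
    for s in range(GOAL):
        print(s, round(Q[(s, 0)], 3), round(Q[(s, 1)], 3))
```

Compared with SARSA, the only change in the update is that the bootstrap term takes a maximum over actions instead of using the action selected by the behavior policy.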
