Temporal-Difference Learning for Prediction
In Monte Carlo Learning, we've got the estimation of value function:
Gt is the episode return from time t, which can be calculated by:
Please recall, Gt can be only calculated at the end of a given episode. This reveals a disadvantage of Monte Carlo Learning: have to wait until the end of episodes.
TD(0) algorithm replace Gt of the equation to the immediate reward and estimated value function of the next state:
The algorithm updates the Estimated State-Value Function at time t+1, because everything in the equation is determined. This means we will wait until the agent reaching the next state, so that the agent can get the immediate reward Rt+1 and know which state the system will transition to at time t+1.
The equations below are State-Value Function for Dynamic Programming, in which the whole environment is known. Compare to these equations:
TD algorithm is quite like 6.4 Bellman Equation, but it does not take expectation. Instead, it uses the knowledge till now to estimate how much reward I am going to get from this state. The whole algorithm can be demonstrated as:
TD Target, TD Error
Bias/ Viriance trade-off
Bootstraping
Temporal-Difference Learning for Prediction的更多相关文章
- 【PPT】 Least squares temporal difference learning
最小二次方时序差分学习 原文地址: https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd= ...
- PP: Multi-Horizon Time Series Forecasting with Temporal Attention Learning
Problem: multi-horizon probabilistic forecasting tasks; Propose an end-to-end framework for multi-ho ...
- [Reinforcement Learning] Model-Free Prediction
上篇文章介绍了 Model-based 的通用方法--动态规划,本文内容介绍 Model-Free 情况下 Prediction 问题,即 "Estimate the value funct ...
- [Machine Learning] 机器学习常见算法分类汇总
声明:本篇博文根据http://www.ctocio.com/hotnews/15919.html整理,原作者张萌,尊重原创. 机器学习无疑是当前数据分析领域的一个热点内容.很多人在平时的工作中都或多 ...
- (转) Deep Learning Research Review Week 2: Reinforcement Learning
Deep Learning Research Review Week 2: Reinforcement Learning 转载自: https://adeshpande3.github.io/ad ...
- Awesome Reinforcement Learning
Awesome Reinforcement Learning A curated list of resources dedicated to reinforcement learning. We h ...
- Machine Learning 学习笔记1 - 基本概念以及各分类
What is machine learning? 并没有广泛认可的定义来准确定义机器学习.以下定义均为译文,若以后有时间,将补充原英文...... 定义1.来自Arthur Samuel(上世纪50 ...
- Distributional Reinforcement Learning with Quantile Regression
郑重声明:原文参见标题,如有侵权,请联系作者,将会撤销发布! arXiv:1710.10044v1 [cs.AI] 27 Oct 2017 In AAAI Conference on Artifici ...
- 3. Distributional Reinforcement Learning with Quantile Regression
C51算法理论上用Wasserstein度量衡量两个累积分布函数间的距离证明了价值分布的可行性,但在实际算法中用KL散度对离散支持的概率进行拟合,不能作用于累积分布函数,不能保证Bellman更新收敛 ...
随机推荐
- Python核心技术与实战——十一|程序的模块化
我们现在已经总结了Python的基本招式和套路,现在可以写一些不那么简单的系统性工程或代码量较大的应用程序.这时候,一个简单的.py文件就会显得过于臃肿,无法承担一个重量级软件开发的重任.这就需要这一 ...
- luoguP1445 [Violet]樱花
链接P1445 [Violet]樱花 求方程 \(\frac {1}{X}+\frac {1}{Y}=\frac {1}{N!}\) 的正整数解的组数,其中\(N≤10^6\),模\(10^9+7\) ...
- bzoj3011 [Usaco2012 Dec]Running Away From the Barn 左偏树
题目传送门 https://lydsy.com/JudgeOnline/problem.php?id=3011 题解 复习一下左偏树板子. 看完题目就知道是左偏树了. 结果这个板子还调了好久. 大概已 ...
- synchronized和lock的使用分析(优缺点对比详解)
1.synchronized加同步格式: synchronized(需要一个任意的对象(锁)){ 代码块中放操作共享数据的代码. } synchromized缺陷synchronized是java中的 ...
- CSS3——制作带动画效果的小图片
下了一个软件:ScreenToGif用来截取动态图片,终于可以展示我的小动图啦,嘻嘻,敲开心! main.html <!DOCTYPE html> <html lang=" ...
- jetSonNano darknet ubdefined reference to 'pow',undefined reference to 'sqrtf'....
我在用CMakelist编译工程时,遇到了这个一连串基础数学函数找不到的问题,如下图所示: 我当时在工程中明明引用了 #include "math.h"头函数,这是因为你的工程在预 ...
- mybatis中延迟加载Lazy策略
延迟加载: lazy策略原理:只有在使用查询sql返回的数据是才真正发出sql语句到数据库,否则不发出(主要用在多表的联合查询) 1.一对一延迟加载: 假设数据库中有person表和card表:其中p ...
- HDU 1298 T9 ( 字典树 )
题意 : 给你 w 个单词以及他们的频率,现在给出模拟 9 键打字的一串数字,要你在其模拟打字的过程中给出不同长度的提示词,出现的提示词应当是之前频率最高的,当然提示词不需要完整的,也可以是 w 个单 ...
- 小波神经网络(WNN)
人工神经网络(ANN) 是对人脑若干基本特性通过数学方法进行的抽象和模拟,是一种模仿人脑结构及其功能的非线性信息处理系统. 具有较强的非线性逼近功能和自学习.自适应.并行处理的特点,具有良好的容错能力 ...
- Leetcode 15. Sum(二分或者暴力或者哈希都可以)
15. 3Sum Medium Given an array nums of n integers, are there elements a, b, c in nums such that a + ...