Optimal Value Functions and Optimal Policy

Optimal Value Function is how much reward the best policy can get from a state s, which is the best senario given state s. It can be defined as:

Value Function and Optimal State-Value Function

Let's see firstly compare Value Function with Optimal Value Function. For example, in the student study case, the value function for the blue circle state under 50:50 policy is 7.4.

However, when we consider the Optimal State-Value function, 'branches' that may prevent us from getting the best scores are proned. For instance, the optimal senario for the blue circle state is having 100% probability to continue his study rather than going to pub.

Optimal Action-Value Function

Then we move to Action-Value Function, and the following equation also reveals the Optimal Action-Value Function is from the policy who gives the best Action Returns.

The Optimal Action-Value Function is strongly related to Optimal State-Value Function by:

The equation means when action a is taken at state s, what the best return is. At this condition, the probability of reaching each state and the immediate reward is determined, so the only variable is the State-Value function . Therefore it is obvious that obtaining the Optimal State-Value function is equivalent to holding the Optimal Action-Value Function.

Conversely, the Optimal State-Value function is the best combination of Action and the following states with Optimal State-value Functions:

Still in the student example, when we know the Optimal State-Value Function, the Optimal Action-Value Function can be calculated as:

Finally we can derive the best policy from the Optimal Action-Value Function:

This means the policy only picks up the best action at every state rather than having a probability distribution. This deterministic policy is the goal of Reinforcement Learning, as it will guide the action to complete the task.

Optimal Value Functions and Optimal Policy的更多相关文章

Reinforcement Learning: An Introduction读书笔记(3)--finite MDPs
> 目录 < Agent–Environment Interface Goals and Rewards Returns and Episodes Policies and Val ...
Machine Learning——吴恩达机器学习笔记（酷
[1] ML Introduction a. supervised learning & unsupervised learning 监督学习:从给定的训练数据集中学习出一个函数(模型参数), ...
RL_Learning
Key Concepts in RL 标签(空格分隔): RL_learning OpenAI Spinning Up原址 states and observations (状态和观测) action ...
Massively parallel supercomputer
A novel massively parallel supercomputer of hundreds of teraOPS-scale includes node architectures ba ...
Factoextra R Package: Easy Multivariate Data Analyses and Elegant Visualization
factoextra is an R package making easy to extract and visualize the output of exploratory multivaria ...
深度学习课程笔记（七）：模仿学习（imitation learning）
深度学习课程笔记(七):模仿学习(imitation learning) 2017.12.10 本文所涉及到的模仿学习,则是从给定的展示中进行学习.机器在这个过程中,也和环境进行交互,但是,并没有显 ...
DP Intro - OBST
http://radford.edu/~nokie/classes/360/dp-opt-bst.html Overview Optimal Binary Search Trees - Problem ...
[C5] Andrew Ng - Structuring Machine Learning Projects
About this Course You will learn how to build a successful machine learning project. If you aspire t ...
Reinforcement Learning Index Page
Reinforcement Learning Posts Step-by-step from Markov Property to Markov Decision Process Markov Dec ...

随机推荐

在学习linux磁盘管理期间学习的逻辑卷管理笔记
LVM(逻辑分区)的创建顺序:物理分区-物理卷-卷组-逻辑卷-挂载. 物理卷(Physical Volume,PV):就是指硬盘分区,也可以是整个硬盘或已创建的软RAID,是LVM的基本存储设备. 卷 ...
HNUSTOJ-1696 简单验证码识别(模拟)
1696: 简单验证码识别时间限制: 2 Sec 内存限制: 128 MB 提交: 148 解决: 44 [提交][状态][讨论版] 题目描述验证码是Web系统中一种防止暴力破解的重要手段.其 ...
Django中用 form 实现登录注册
1.forms模块将Models和Forms结合到一起使用,将Forms中的类和Models中的类关联到一起,实现属性的共享 1.在forms.py中创建class,继承自forms.ModelFo ...
python的cls，self，classmethod，staticmethod
python类里会出现这三个单词,self和cls都可以用别的单词代替,类的方法有三种, 一是通过def定义的普通的一般的,需要至少传递一个参数,一般用self,这样的方法必须通过一个类的实例去访问 ...
LED音乐频谱之点阵
转载请注明出处:http://blog.csdn.net/ruoyunliufeng/article/details/37967455 一.硬件 watermark/2/text/aHR0cDovL2 ...
NIO的缓冲区、通道、选择器关系理解
Buffer的数据存取一个用于特定基本数据类行的容器.有java.nio包定义的,所有缓冲区都是抽象类Buffer的子类. Java NIO中的Buffer主要用于与NIO通道进行交互,数 ...
Dubbo学习源码总结系列五--集群负载均衡
Dubbo提供了哪些负载均衡机制?如何实现的? LoadBalance接口:可以看出,通过SPI机制默认为RandomLoadBalance,生成的适配器类执行sel ...
CentOS下性能监测工具 dstat
原文链接:http://www.bkjia.com/Linuxjc/935113.html 参考链接:https://linux.cn/article-3215-1.html,http://lhfli ...
JuniorCTF - Web - blind
题目链接 https://ctftime.org/task/7450 参考链接 https://github.com/Dvd848/CTFs/blob/master/2018_35C3_Junior/ ...
Windows系统中，循环运行.bat/.exe等文件
一.创建循环运行的run-everySecond.vbs文件[双击次文件即可启动运行] dim a set a=CreateObject("Wscript.Shell") Do # ...

Optimal Value Functions and Optimal Policy

Optimal Value Functions and Optimal Policy的更多相关文章

随机推荐

热门专题