Optimal Value Functions and Optimal Policy
Optimal Value Function is how much reward the best policy can get from a state s, which is the best senario given state s. It can be defined as:

Value Function and Optimal State-Value Function
Let's see firstly compare Value Function with Optimal Value Function. For example, in the student study case, the value function for the blue circle state under 50:50 policy is 7.4.

However, when we consider the Optimal State-Value function, 'branches' that may prevent us from getting the best scores are proned. For instance, the optimal senario for the blue circle state is having 100% probability to continue his study rather than going to pub.

Optimal Action-Value Function
Then we move to Action-Value Function, and the following equation also reveals the Optimal Action-Value Function is from the policy who gives the best Action Returns.

The Optimal Action-Value Function is strongly related to Optimal State-Value Function by:

The equation means when action a is taken at state s, what the best return is. At this condition, the probability of reaching each state and the immediate reward is determined, so the only variable is the State-Value function . Therefore it is obvious that obtaining the Optimal State-Value function is equivalent to holding the Optimal Action-Value Function.
Conversely, the Optimal State-Value function is the best combination of Action and the following states with Optimal State-value Functions:

Still in the student example, when we know the Optimal State-Value Function, the Optimal Action-Value Function can be calculated as:

Finally we can derive the best policy from the Optimal Action-Value Function:


This means the policy only picks up the best action at every state rather than having a probability distribution. This deterministic policy is the goal of Reinforcement Learning, as it will guide the action to complete the task.
Optimal Value Functions and Optimal Policy的更多相关文章
- Reinforcement Learning: An Introduction读书笔记(3)--finite MDPs
> 目 录 < Agent–Environment Interface Goals and Rewards Returns and Episodes Policies and Val ...
- Machine Learning——吴恩达机器学习笔记(酷
[1] ML Introduction a. supervised learning & unsupervised learning 监督学习:从给定的训练数据集中学习出一个函数(模型参数), ...
- RL_Learning
Key Concepts in RL 标签(空格分隔): RL_learning OpenAI Spinning Up原址 states and observations (状态和观测) action ...
- Massively parallel supercomputer
A novel massively parallel supercomputer of hundreds of teraOPS-scale includes node architectures ba ...
- Factoextra R Package: Easy Multivariate Data Analyses and Elegant Visualization
factoextra is an R package making easy to extract and visualize the output of exploratory multivaria ...
- 深度学习课程笔记(七):模仿学习(imitation learning)
深度学习课程笔记(七):模仿学习(imitation learning) 2017.12.10 本文所涉及到的 模仿学习,则是从给定的展示中进行学习.机器在这个过程中,也和环境进行交互,但是,并没有显 ...
- DP Intro - OBST
http://radford.edu/~nokie/classes/360/dp-opt-bst.html Overview Optimal Binary Search Trees - Problem ...
- [C5] Andrew Ng - Structuring Machine Learning Projects
About this Course You will learn how to build a successful machine learning project. If you aspire t ...
- Reinforcement Learning Index Page
Reinforcement Learning Posts Step-by-step from Markov Property to Markov Decision Process Markov Dec ...
随机推荐
- UVA 10003 Cutting Sticks 区间DP+记忆化搜索
UVA 10003 Cutting Sticks+区间DP 纵有疾风起 题目大意 有一个长为L的木棍,木棍中间有n个切点.每次切割的费用为当前木棍的长度.求切割木棍的最小费用 输入输出 第一行是木棍的 ...
- C#根据出生日期和当前日期计算精确年龄
C#根据出生日期和当前日期计算精确年龄更多 0c#.net基础 public static string GetAge(DateTime dtBirthday, DateTime dtNow){ st ...
- python学习笔记(12):高级面向对象
一.__slots__和property 1.__slots__魔术函数动态的添加方法和属性 2.直接暴露属性的局限性 3.使用get/set方法 4.利用@property简化get/set方法 5 ...
- Python之路-Python中的线程与进程
一.发展背景 任务调度 大部分操作系统(如Windows.Linux)的任务调度是采用时间片轮转的抢占式调度方式,也就是说一个任务执行一小段时间后强制暂停去执行下一个任务,每个任务轮流执行.任务执行的 ...
- jQuery学习总结02-筛选
一.筛选 1.eq(index|-index) 说明:获取当前链式操作中第N个jQuery对象,返回jQuery对象,类似的有get(index),不过get(index)返回的是DOM对象 示例: ...
- SCRUM REPORT DIRECTORY
Alpha sprint scrum 1 scrum 2 scrum 3 scrum 4 scrum 5 scrum 6 scrum 7 scrum 8 scrum 9 scrum 10 Beta s ...
- OCTAVE-CONFIG
SYNOPSIS 总览 octave-config [--m-site-dir] [--oct-site-dir] [-v|--version] [-h|-?|--help] DESCRIPTION ...
- linux 服务器与客户端异常断开连接问题
服务器与客户端连接,客户端异常断掉之后服务器端口仍然被占用, 到最后是不是服务器端达到最大连接数就没法连接了?领导让我测试这种情况,我用自己的电脑当TCP Client,虚拟机当服务器,连接之后能正常 ...
- Sql中使用Case when
SELECT CASE WHEN Column IS NOT NULL THEN '情况1' ELSE '情况2' END AS '列名' , FROM dbo.Table
- Oracle单引号转义符
作用:Increase readability and usability (增加可读性和可用性) 用法:select q'[ select * from ]'||table_name|| ';' ...