The Optimal Value Function is the maximum reward the best policy can collect starting from a state s, i.e. the best-case scenario given state s. It can be defined as:
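$$v_*(s) = \max_\pi v_\pi(s)$$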

Value Function and Optimal State-Value Function

Let's first compare the Value Function with the Optimal Value Function. For example, in the student study case, the value function of the blue-circle state under the 50:50 random policy is 7.4.
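As a sanity check (assuming the blue circle marks the third class state of David Silver's student MDP, with $\gamma = 1$ and random-policy values of $-1.3$ and $2.7$ for the first two class states), this value satisfies the Bellman expectation equation:

$$7.4 \approx 0.5 \times 10 + 0.5 \times \big(1 + 0.2 \times (-1.3) + 0.4 \times 2.7 + 0.4 \times 7.4\big)$$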

However, when we consider the Optimal State-Value Function, the 'branches' that may prevent us from getting the best score are pruned. For instance, the optimal scenario for the blue-circle state is to continue studying with 100% probability rather than going to the pub.

Optimal Action-Value Function

Next we move to the Action-Value Function. The following equation likewise shows that the Optimal Action-Value Function comes from the policy that gives the best action return:
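$$q_*(s, a) = \max_\pi q_\pi(s, a)$$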

The Optimal Action-Value Function is strongly related to the Optimal State-Value Function by:
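$$q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_*(s')$$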

The equation gives the best return when action a is taken at state s. Once the action is fixed, the probability of reaching each successor state and the immediate reward are both determined, so the only variable left is the state-value function $v_*(s')$. It follows that obtaining the Optimal State-Value Function is equivalent to holding the Optimal Action-Value Function.

Conversely, the Optimal State-Value Function takes the best combination of an action and the Optimal State-Value Functions of the successor states:
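$$v_*(s) = \max_a q_*(s, a) = \max_a \Big( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_*(s') \Big)$$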

Still in the student example, when we know the Optimal State-Value Function, the Optimal Action-Value Function can be calculated as:
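For the blue-circle state (again assuming the standard student MDP with $\gamma = 1$, where the optimal values of the three class states are 6, 8 and 10):

$$q_*(s, \text{Pub}) = 1 + 0.2 \times 6 + 0.4 \times 8 + 0.4 \times 10 = 9.4, \qquad q_*(s, \text{Study}) = 10$$

so Study is the better action.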

Finally we can derive the best policy from the Optimal Action-Value Function:
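$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} q_*(s, a) \\ 0 & \text{otherwise} \end{cases}$$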

This means the policy simply picks the best action at every state rather than keeping a probability distribution over actions. This deterministic policy is the goal of Reinforcement Learning, as it guides the agent to complete the task.
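To tie the pieces together, here is a minimal sketch (not from the original post) that runs value iteration on a hand-coded version of the student MDP and then reads off $q_*$ and the greedy policy. The state names, rewards and transition probabilities are assumptions based on David Silver's lecture example:

```python
# Value iteration on an assumed encoding of the "student" MDP, gamma = 1.
# mdp[state][action] = (immediate_reward, {next_state: probability})
mdp = {
    "Facebook": {"facebook": (-1, {"Facebook": 1.0}),
                 "quit":     (0,  {"Class1": 1.0})},
    "Class1":   {"facebook": (-1, {"Facebook": 1.0}),
                 "study":    (-2, {"Class2": 1.0})},
    "Class2":   {"study":    (-2, {"Class3": 1.0}),
                 "sleep":    (0,  {"Sleep": 1.0})},
    "Class3":   {"study":    (10, {"Sleep": 1.0}),
                 "pub":      (1,  {"Class1": 0.2, "Class2": 0.4, "Class3": 0.4})},
    "Sleep":    {},  # terminal state, no actions
}
gamma = 1.0

# Bellman optimality backup: v(s) <- max_a [ R + gamma * sum_s' P * v(s') ]
v = {s: 0.0 for s in mdp}
for _ in range(100):
    for s, actions in mdp.items():
        if actions:
            v[s] = max(r + gamma * sum(p * v[s2] for s2, p in nxt.items())
                       for r, nxt in actions.values())

# q*(s,a) = R + gamma * sum_s' P * v*(s'); the greedy action gives pi*
for s, actions in mdp.items():
    q = {a: r + gamma * sum(p * v[s2] for s2, p in nxt.items())
         for a, (r, nxt) in actions.items()}
    if q:
        print(f"{s}: v*={v[s]:.1f}, q*={q}, pi*={max(q, key=q.get)}")
```

Under these assumptions, the third class state prints $q_*$ of 10.0 for study and 9.4 for pub, matching the hand calculation above, and the greedy policy always chooses study.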
