Optimal Value Functions and Optimal Policy
The Optimal State-Value Function tells us how much return the best possible policy can collect starting from state s, i.e. the best-case scenario for that state. It is defined as the maximum over all policies:

\[ v_*(s) = \max_\pi v_\pi(s) \]
Value Function and Optimal State-Value Function
Let's first compare the Value Function with the Optimal State-Value Function. In the student study example, the value function of the blue-circle state under the 50:50 (uniform random) policy is 7.4.
When we consider the Optimal State-Value Function instead, the 'branches' that would keep us from the best return are pruned away. For instance, the optimal scenario for the blue-circle state is to continue studying with 100% probability rather than going to the pub.
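To make the comparison concrete in code, here is a minimal sketch of evaluating a 50:50 policy. The MDP below (P, R, gamma) is a made-up toy example for illustration only, not the actual student MDP from the figures, so its numbers will not reproduce the 7.4 above.

```python
import numpy as np

# A tiny hypothetical MDP used only for illustration (NOT the student MDP).
# P[a, s, s'] : probability of moving to state s' after taking action a in state s.
P = np.array([
    [[0.8, 0.2, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.0, 1.0]],   # action 0
    [[0.0, 0.9, 0.1],
     [0.0, 0.1, 0.9],
     [0.0, 0.0, 1.0]],   # action 1
])
# R[s, a] : immediate reward for taking action a in state s.
R = np.array([[-1.0,  0.0],
              [-2.0,  1.0],
              [ 0.0,  0.0]])
gamma = 0.9
num_states, num_actions = R.shape

# A 50:50 policy: pi[s, a] = probability of taking action a in state s.
pi = np.full((num_states, num_actions), 0.5)

# Policy evaluation in closed form: v_pi = (I - gamma * P_pi)^{-1} R_pi
P_pi = np.einsum('sa,ast->st', pi, P)   # state-to-state transitions under pi
R_pi = np.sum(pi * R, axis=1)           # expected immediate reward under pi
v_pi = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)
print("v_pi under the 50:50 policy:", v_pi)
```

The optimal values computed later in this post are at least as large as these, state by state, because the best policy never has to average over a bad branch.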
Optimal Action-Value Function
Next we move to the Action-Value Function. The Optimal Action-Value Function is the one obtained under the policy that yields the best action return:

\[ q_*(s, a) = \max_\pi q_\pi(s, a) \]
The Optimal Action-Value Function is tightly linked to the Optimal State-Value Function by:

\[ q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_*(s') \]
This equation gives the best return achievable when action a is taken in state s. Once the action is fixed, the transition probabilities and the immediate reward are determined, so the only unknown is the State-Value Function of the successor states. Knowing the Optimal State-Value Function is therefore equivalent to knowing the Optimal Action-Value Function.
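As a quick sketch of this relation in code, reusing the toy P, R and gamma from the snippet above (and a placeholder v_star rather than values from the real student MDP):

```python
# Reuses P, R, gamma from the policy-evaluation sketch above.
# Placeholder optimal state values (illustration only, not computed yet).
v_star = np.array([2.0, 5.0, 0.0])

# q_*(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * v_*(s')
q_star = R + gamma * (P @ v_star).T   # (P @ v_star) has shape (actions, states)
print(q_star)                         # q_star[s, a], shape (states, actions)
```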
Conversely, the Optimal State-Value Function takes the best action and then follows the successor states' Optimal State-Value Functions:

\[ v_*(s) = \max_a q_*(s, a) = \max_a \left( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_*(s') \right) \]
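Putting the two relations together gives the Bellman optimality equation, which can be solved by repeated backups. Here is a minimal value-iteration sketch on the same toy MDP as above (not the student MDP); it overwrites the placeholder v_star with properly computed values:

```python
def value_iteration(P, R, gamma, tol=1e-8):
    """Repeat v(s) <- max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) v(s') ]."""
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * (P @ v).T        # q[s, a] from the current v
        v_new = q.max(axis=1)            # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q
        v = v_new

v_star, q_star = value_iteration(P, R, gamma)
print("v_star:", v_star)
```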
Returning to the student example: once the Optimal State-Value Function is known, the Optimal Action-Value Function of every (state, action) pair follows directly by plugging those values into the relation above.
Finally, the optimal policy can be read off from the Optimal Action-Value Function:

\[ \pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a' \in \mathcal{A}} q_*(s, a') \\ 0 & \text{otherwise} \end{cases} \]
In other words, the policy picks the single best action in every state instead of spreading probability over several actions. This deterministic greedy policy is the goal of Reinforcement Learning, since following it guides the agent's actions to complete the task.
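Continuing the toy sketch, extracting this deterministic policy from the q_star computed above is just an argmax per state:

```python
# Greedy (deterministic) policy: pick the best action in every state.
best_actions = q_star.argmax(axis=1)             # one action index per state

# Written as a probability table, each row is a one-hot distribution.
pi_star = np.eye(q_star.shape[1])[best_actions]  # pi_star[s, a] in {0, 1}
print("best action per state:", best_actions)
print("pi_star:\n", pi_star)
```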