Policy Improvement and Policy Iteration
From the last post, we know how to evaluate a policy. But that's not enough, because the purpose of policy evaluation is to improve policies until we finally reach the optimal policy. So in this post, we will discuss how to improve a given policy, and how to get from a given policy to the optimal policy.
Firstly, once a policy has been evaluated, the Action-Value function can be worked out for every state. That is, at a certain state s, we know which action gives the system the largest expected return.

In the puzzle wandering example, we evaluated the random policy. The resulting State-Value function can also be used for policy improvement. After one step of calculation, we can conclude that at the circled location, moving left is better than randomly picking a direction, because the left side has more reward.

After three steps, we have a much better intuition about the map. We can change the random policy to a new, better one.

The way to improve the current policy is to greedily pick actions for every state. It is worth noting that greedily picking actions does not mean only one step is considered (too greedy to look further ahead). Instead, when k=3, the value estimates already foresee three steps, so the greedy pick selects the best action over k steps.
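To make the greedy step concrete, here is a minimal sketch in Python. The grid is an assumption, not necessarily the exact puzzle from this series: a hypothetical 4x4 map with terminal cells in two opposite corners, a reward of -1 per move, and no discounting. The names `step`, `evaluate_random_policy`, and `greedy_actions` are made up for illustration.

```python
# Hypothetical 4x4 grid (an assumption): states 0..15, row-major order,
# terminal cells at 0 and 15, reward -1 for every move, no discounting.
N = 4
TERMINALS = {0, N * N - 1}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Deterministic move; bumping into a wall leaves the state unchanged."""
    row, col = divmod(state, N)
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N and 0 <= new_col < N:
        return new_row * N + new_col
    return state


def evaluate_random_policy(k):
    """Run k synchronous sweeps of policy evaluation for the uniform random policy."""
    values = [0.0] * (N * N)
    for _ in range(k):
        new_values = values[:]
        for s in range(N * N):
            if s in TERMINALS:
                continue
            new_values[s] = sum(0.25 * (-1.0 + values[step(s, a)]) for a in ACTIONS)
        values = new_values
    return values


def greedy_actions(values, state):
    """Return every action whose one-step lookahead on `values` is maximal."""
    lookahead = {a: -1.0 + values[step(state, a)] for a in ACTIONS}
    best = max(lookahead.values())
    return [a for a, q in lookahead.items() if abs(q - best) < 1e-9]


v3 = evaluate_random_policy(3)       # values after k = 3 sweeps
print(greedy_actions(v3, 1))         # ['left']
```

The greedy pick itself looks only one move ahead, but it reads values that were built from three sweeps, which is why the choice reflects more than a single step. Under these assumptions, state 1 sits right next to a terminal cell and prefers `left`.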

The Policy Iteration algorithm keeps alternating between the evaluation and improvement tasks until the policy becomes stable.


This process means that the improved policy picks, in every state, the action with the best Action-Value, i.e. the best return from a single action:
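In the usual textbook notation, the greedy step can be written as below. The symbols are assumptions, since this post does not define them: $\pi$ is the current policy, $\pi'$ the improved one, $q_\pi$ the Action-Value function and $v_\pi$ the State-Value function of $\pi$.

```latex
\pi'(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} q_\pi(s, a)

q_\pi\bigl(s, \pi'(s)\bigr) = \max_{a \in \mathcal{A}} q_\pi(s, a)
  \;\ge\; q_\pi\bigl(s, \pi(s)\bigr) = v_\pi(s)
```

The inequality holds because the maximum over all actions can never be worse than the one action the old policy would have picked, so acting greedily for one step and then following $\pi$ is at least as good as following $\pi$ from the start.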

The algorithm is:
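A minimal sketch of the evaluate-then-improve loop in Python is given here, under stated assumptions rather than as the definitive version: a hypothetical 4x4 grid with terminal cells at two opposite corners, a reward of -1 per move, a deterministic policy, and a discount factor `GAMMA = 0.9` chosen so that evaluation stays finite even for a policy that walks into a wall forever. All function names are illustrative.

```python
# Hypothetical 4x4 grid (an assumption): states 0..15, terminals at 0 and 15.
N = 4
TERMINALS = {0, N * N - 1}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA = 0.9      # assumed discount factor
THETA = 1e-10    # stop evaluating once a sweep changes values by less than this


def step(state, action):
    """Deterministic move; bumping into a wall leaves the state unchanged."""
    row, col = divmod(state, N)
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N and 0 <= new_col < N:
        return new_row * N + new_col
    return state


def q_value(values, state, action):
    """One-step lookahead: reward of -1 plus the discounted value of the next state."""
    return -1.0 + GAMMA * values[step(state, action)]


def evaluate(policy):
    """Iterative policy evaluation (in-place sweeps) for a deterministic policy."""
    values = [0.0] * (N * N)
    while True:
        delta = 0.0
        for s in policy:
            new_v = q_value(values, s, policy[s])
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < THETA:
            return values


def improve(policy, values):
    """Greedy improvement; the old action is kept unless another is clearly better."""
    new_policy = dict(policy)
    for s in policy:
        best = max(ACTIONS, key=lambda a: q_value(values, s, a))
        if q_value(values, s, best) > q_value(values, s, policy[s]) + 1e-6:
            new_policy[s] = best
    return new_policy


def policy_iteration():
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = {s: "up" for s in range(N * N) if s not in TERMINALS}
    while True:
        values = evaluate(policy)
        new_policy = improve(policy, values)
        if new_policy == policy:
            return policy, values
        policy = new_policy


policy, values = policy_iteration()
print(policy[1], policy[14])
```

Keeping the old action on ties is a design choice made here so that the "policy is stable" check is a plain equality test. Without it, two equally good actions could swap back and forth and the loop would never stop.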
