From the last post, we know how to evaluate a policy. But that's not enough, because the purpose of policy evaluation is to improve policies so that finally get the optimal policy. So in this post, we will discuss about how to improve a given policy, and how to from a given policy get to the optimal policy.

Firstly, when you have an evaluated policy, the Action-Value function is known for every state. That is, at a certain state s, we known which action can give the system the largest reward.

In the puzzle wandering example, we evaluate the random policy. However,the State-Value functions can be used for policy improvement. After 1 step calculating,we can conclude at the circled location, moving left is better than randomly picking a direction because left side has more reward.

After three steps, we've got a much better intuition about the map. We can change the random policy to a new better one.

The way to improve the current policy is to greedyly pick actions for every state. It is worth noting that greedily picking actions does not means it only consider one step (too greedy to consider multiple steps). Instead, when k=3, the algorithm can foresee three steps, and the greedy picking algorithm will select the best action for k steps.

The Policy Iteration Algorithm is keep doing evaluation and improvement tasks untill the policy becomes stable,

This process means Action-Value function of the improved policy picking the best return from a single action:

The algorithm is:

Policy Improvement and Policy Iteration的更多相关文章

  1. Provider Policy与Consumer Policy在bnd中的区别

    首先需要了解的是bnd的相关知识: 1. API(也就是接口), 2. API Provider(接口的实现) 3. API Consumer( 接口的使用者) OSGi中的一个版本有4个部分:    ...

  2. Reinforcement Learning Index Page

    Reinforcement Learning Posts Step-by-step from Markov Property to Markov Decision Process Markov Dec ...

  3. Policy Gradient Algorithms

    Policy Gradient Algorithms 2019-10-02 17:37:47 This blog is from: https://lilianweng.github.io/lil-l ...

  4. Deep Learning专栏--强化学习之从 Policy Gradient 到 A3C(3)

    在之前的强化学习文章里,我们讲到了经典的MDP模型来描述强化学习,其解法包括value iteration和policy iteration,这类经典解法基于已知的转移概率矩阵P,而在实际应用中,我们 ...

  5. 使用 SecurityManager 和 Policy File 管理 Java 程序的权限

    参考资料 该文中的内容来源于 Oracle 的官方文档.Oracle 在 Java 方面的文档是非常完善的.对 Java 8 感兴趣的朋友,可以从这个总入口 Java SE 8 Documentati ...

  6. Utility2:Appropriate Evaluation Policy

    UCP收集所有Managed Instance的数据的机制,是通过启用各个Managed Instances上的Collection Set:Utility information(位于Managem ...

  7. trait与policy模板应用简单示例

    trait与policy模板应用简单示例 accumtraits.hpp // 累加算法模板的trait // 累加算法模板的trait #ifndef ACCUMTRAITS_HPP #define ...

  8. trait与policy模板技术

    trait与policy模板技术 我们知道,类有属性(即数据)和操作两个方面.同样模板也有自己的属性(特别是模板参数类型的一些具体特征,即trait)和算法策略(policy,即模板内部的操作逻辑). ...

  9. Network Policy - 每天5分钟玩转 Docker 容器技术(171)

    Network Policy 是 Kubernetes 的一种资源.Network Policy 通过 Label 选择 Pod,并指定其他 Pod 或外界如何与这些 Pod 通信. 默认情况下,所有 ...

随机推荐

  1. ln -在文件之间建立连接

    总览 ln [options] source [dest] ln [options] source...directory POSIX 选项: [-f] GNU 选项(缩写): [-bdfinsvF] ...

  2. 关于session的记录

    在做DRP项目中的修改密码功能时,在JSP中先获取了之前登陆时设置的session中的用户账号,在调试的时候一直只是刷新页面,而没有重启页面,导致AJAX一直传输到相应的servlet失败,出现404 ...

  3. 002-printf 命令用法

    printf 命令的用法,大部分结合awk命令使用 是格式化的输出的命令 %s 输入字符串 \n 换行 \t \r 回车键 [root@zabbix lianxi]# printf %s [root@ ...

  4. Python 序列化 pickle/cPickle模块

    Python 序列化 pickle/cPickle模块 2013-10-17 Posted by yeho Python序列化的概念很简单.内存里面有一个数据结构,你希望将它保存下来,重用,或者发送给 ...

  5. [USACO2009 OPEN] 滑雪课 Ski Lessons

    洛谷P2948 看到题目就觉得这是动规但一直没想到如何状态转移……看了别人的题解之后才有一些想法 f[i][j]:前i单位时间能力值为j可以滑的最多次数 lessons[i][j]:结束时间为i,获得 ...

  6. springboot+mybatis 配置sql打印日志

    第一种: 配置类型 # 配置slq打印日志 logging.level.com.lawt.repository.mapper=debug重点: #其中   com.lawt.repository.ma ...

  7. 【bzoj 4554】【Tjoi2016&Heoi2016】【NOIP2016模拟7.12】游戏

    题目 分析 当没有石头的时候,就用二分图匹配来做. 但现在加入了石头, 所以,求出每行和每列联通快的个数,如果有一块平地,包括在某个行联通块以及某个列联通块中,连边. //无聊打了网络流,匈牙利也可以 ...

  8. 【leetcode】1163. Last Substring in Lexicographical Order

    题目如下: Given a string s, return the last substring of s in lexicographical order. Example 1: Input: & ...

  9. mybatis学习$与#号取值区别

    1,多个参数传递用map或实体封装后再传给myBatis, mybatis学习$与#号取值区别 #{} 1.加了单引号,  2.#号写是可以防止sql注入,比较安全 select * from use ...

  10. sass、less中的scoped属性

    1.scoped 的实现原理 Vue中的Less 中的 scoped 属性的效果主要是通过 PostCss 实现的.代码示例: //编译前 <template> <div class ...