Problem of State-Value Function

Similar as Policy Iteration in Model-Based Learning, Generalized Policy Iteration will be used in Monte Carlo Control. In Policy Iteration, we keep doing Policy Evaluation and Policy Improvement untill our policy converging to Optimal Policy.

Every time when we improve the policy, the action that gives the best return(reward+value function of the next state) will be picked.

The problem of this algorithm if we directly transfering to Monte Carlo is: it is based on the Transition Matrix.

Monte Carlo Control based on Q function

The idea of Policy Iteration can be used to Estimite Action-Value Function, and it is very useful for Model-Free problem. The process of choosing actions does not depend on State-Value function, because the return from a specific action is given by Monte Carlo estimation.

Q function can be updated by:

When we improve the policy, we just pick the action that produce the maximum Q value.

Exploration-exploitation Dilemma and ε-Greedy Exploration:

In Model-Based Policy Iteration algorithm, we update all State-Value function within a single policy evaluation process, so that we can choose the best actions from the whole action space  whiled improving policies. Nevertheless, Monte Carlo Learning only updates the Action-Value functions whose actions were taken on the previous episode. So there are probabily some actions having better returns than the actions we have tried. Sometimes we need to give them a trial. We call that problem the Exploration-Exploitation Delemma.

It is necessary to try some new opened restaurant, rather than going to the usual place every day.

ε-Greedy Exploration is the algorithm that gives the agent probability=ε to choose randomly actions and 1-ε to stay on the optimal action.

Monte Carlo Control的更多相关文章

  1. 增强学习(四) ----- 蒙特卡罗方法(Monte Carlo Methods)

    1. 蒙特卡罗方法的基本思想 蒙特卡罗方法又叫统计模拟方法,它使用随机数(或伪随机数)来解决计算的问题,是一类重要的数值计算方法.该方法的名字来源于世界著名的赌城蒙特卡罗,而蒙特卡罗方法正是以概率为基 ...

  2. Monte Carlo Policy Evaluation

    Model-Based and Model-Free In the previous several posts, we mainly talked about Model-Based Reinfor ...

  3. Monte Carlo方法简介(转载)

    Monte Carlo方法简介(转载)       今天向大家介绍一下我现在主要做的这个东东. Monte Carlo方法又称为随机抽样技巧或统计实验方法,属于计算数学的一个分支,它是在上世纪四十年代 ...

  4. PRML读书会第十一章 Sampling Methods(MCMC, Markov Chain Monte Carlo,细致平稳条件,Metropolis-Hastings,Gibbs Sampling,Slice Sampling,Hamiltonian MCMC)

    主讲人 网络上的尼采 (新浪微博: @Nietzsche_复杂网络机器学习) 网络上的尼采(813394698) 9:05:00  今天的主要内容:Markov Chain Monte Carlo,M ...

  5. Monte Carlo Approximations

    准备总结几篇关于 Markov Chain Monte Carlo 的笔记. 本系列笔记主要译自A Gentle Introduction to Markov Chain Monte Carlo (M ...

  6. (转)Markov Chain Monte Carlo

    Nice R Code Punning code better since 2013 RSS Blog Archives Guides Modules About Markov Chain Monte ...

  7. [其他] 蒙特卡洛(Monte Carlo)模拟手把手教基于EXCEL与Crystal Ball的蒙特卡洛成本模拟过程实例:

    http://www.cqt8.com/soft/html/723.html下载,官网下载 (转帖)1.定义: 蒙特卡洛(Monte Carlo)模拟是一种通过设定随机过程,反复生成时间序列,计算参数 ...

  8. Introduction to Monte Carlo Tree Search (蒙特卡罗搜索树简介)

    Introduction to Monte Carlo Tree Search (蒙特卡罗搜索树简介)  部分翻译自“Monte Carlo Tree Search and Its Applicati ...

  9. (转)Monte Carlo method 蒙特卡洛方法

    转载自:维基百科  蒙特卡洛方法 https://zh.wikipedia.org/wiki/%E8%92%99%E5%9C%B0%E5%8D%A1%E7%BE%85%E6%96%B9%E6%B3%9 ...

随机推荐

  1. Nginx 的全局和虚拟主机配置

    Httpd.conf nginx.conf my-heavy-innode-4G.cnf php.ini  用中文注释 # user:指定 Nginx Worker 进程运行用户和用户组,默认 nob ...

  2. 瞎JB逆

    P为质 ; long long quickpow(long long a, long long b) { ) ; ; a %= mod; while(b) { ) ret = (ret * a) % ...

  3. 复试笔试复习 & bd面试总结

    计算机网络: 1.OSI模型中提供端到端服务的是传输层 2.波特率的含义是每秒钟信号变化的次数 3.非屏蔽双绞线中5类网线的数据速率为100Mbps,连接器是RJ-45 4.虚电路在数据链路层实现,电 ...

  4. CPU指令重排序与MESI缓存一致性

    一.重排序场景 class ResortDemo { int a = 0; boolean flag = false; public void writer() { a = 1; //1 flag = ...

  5. Ubuntu安装SFTP服务,及启动失败处理

    安装openssh-server sudo apt-get install openssh-server 查看是否安装成功 dpkg --get-selections | grep ssh 新建用户组 ...

  6. 1Mbps代表每秒传输1,000,000位(bit

    1Mbps代表每秒传输1,000,000位(bit

  7. 查看PL/SQL编译时的错误信息

    编译无效对象是DBA与数据库开发人员常见的工作之一.对于编译过程中的错误该如何去捕获,下面给出两种捕获错误的方法. 一.当前数据库版本信息及无效对象 1.查看当前数据库版本 [sql] view pl ...

  8. hashcode 和 equals

    https://www.cnblogs.com/Qian123/p/5703507.html#_label0 hashCode是jdk根据对象的地址或者字符串或者数字算出来的int类型的数值 详细了解 ...

  9. 【LuoguP3241】[HNOI2015] 开店

    题目链接 题意 给出一棵边带权的树,多次在线询问一个点到一个区间内的点的距离和. Sol 分块过不了的 一个 trick ,都知道要算两点之间距离可以拆成到根的距离和他们的 LCA 到根的距离 ,其实 ...

  10. Django模板自定义标签和过滤器,模板继承(extend),Django的模型层

    上回精彩回顾 视图函数: request对象 request.path 请求路径 request.GET GET请求数据 QueryDict {} request.POST POST请求数据 Quer ...