Monte Carlo Policy Evaluation

Model-Based and Model-Free

In the previous several posts, we mainly talked about Model-Based Reinforcement Learning. The biggest assumption for Model-Based learning is the whole knowledge of the environment is given, but it is unrealistic in real life or even video games. I do believe unknown makes life (game) interesting. Personally, everytime I know what will happen in a game, I get rid of it! So, people like playing games with some uncertainties. That is what Model-Free Reinforcement Learning does: lean the unknown environment, and then come up with the best policy.

Two Task for Model-Free Learning

For Model-Based Learning,we have two tasks: Evaluation and Control. We use Dynamic Programming to evaluate a policy, while two algorithms called Policy Iteration and Value Iteration are used to extract the optimal policy. In Model-Free Learning, we have the same two tasks:

Evaluation: we need to estimate State-Value functions for every state, although Transition Matrices and Reward Function are not given.

Control: we need to find the best policy to solve the interesting game.

Monte Carlo Method in Daily Life

One of the algorithms for Evaluation is Monte Carlo Method. Probably the superior name 'Monte Carlo' is scaring, but you would feel comfortable while following my example below.

Assume our task is traveling from London to Toronto, and the policy is 'random walk'. we define travel distance and time as rewards. There are cities, towns, even villages on the map on which agents would randomly draw diverse trajectories from London to Toronto. Even it is possible that an agent arrives and departs some places more than once. But finally every agent will complete an episode when it arrives at Toronto, and then we can review paths, learning experience from trials.

Finally, we will know goodnesses of being in every place on the map. Mississauga is a very good state for our task, but the state Owen Sound is terrible. The reason why we can get this conclusion is: states closed to the final state usually have great State-Value functions, because they tend to win the game soon. Episodes in our example that get to Mississauga have great probabilities getting to Toronto soon, so the Expectation of their State-Value functions is also high. On the other hand, the expectation of Owen Sound's State-Value function pretty low. That's why most people choose 401 Highway passing Mississauga to Toronto, but nobody go Owen Sound first. It's from daily life experience.

Definition of Monte Carlo Method

1. First-Visit Monte Carlo Evaluation Algorithm

Initialize:

a. π is the policy that needs to be evaluated;

b. V(S_t) is an arbitary State-Value function;

c. Counter matrix N(S_t), to record the appearance time of each state;

Repeat in loops:

a. Generate an episode from π

b. For each state s appearing in the episode for the first time, calculate new State-Value function V(S_t)

To avoid storing all returns or the sum of returns for all episodes, we transform the equation a little bit when we calculate the new State-Value function:

And we get the equation for updating State-Value function:

it can also be rewritten to:

2. Every-Visit Monte Carlo Evaluation Algorithm

Initialize:

a. π is the policy that needs to be evaluated;

b. V(S_t) is an arbitary State-Value function;

c. Counter matrix N(S_t), to record the appearance time of each state;

Repeat in loops:

a. Generate an episode from π

b. For each state s appearing in the episode for every time, calculate new State-Value function V(S_t)

Population vs. Sample

A online video from MIT(Here we go) reminds me the idea of Population and Sample from Statistics. Population is quite like the whole knowledge on Model-Based learning, but getting samples is easier and more realistic in real life. Monte Carlo is similar to using the distribution of samples to do inference of the population.

Monte Carlo Policy Evaluation的更多相关文章

增强学习（四） ----- 蒙特卡罗方法(Monte Carlo Methods)
1. 蒙特卡罗方法的基本思想蒙特卡罗方法又叫统计模拟方法,它使用随机数(或伪随机数)来解决计算的问题,是一类重要的数值计算方法.该方法的名字来源于世界著名的赌城蒙特卡罗,而蒙特卡罗方法正是以概率为基 ...
蒙特卡罗方法、蒙特卡洛树搜索（Monte Carlo Tree Search，MCTS）初探
1. 蒙特卡罗方法(Monte Carlo method) 0x1:从布丰投针实验说起 - 只要实验次数够多,我就能直到上帝的意图 18世纪,布丰提出以下问题:设我们有一个以平行且等距木纹铺成的地板( ...
Monte Carlo Control
Problem of State-Value Function Similar as Policy Iteration in Model-Based Learning, Generalized Pol ...
Introduction to Monte Carlo Tree Search （蒙特卡罗搜索树简介）
Introduction to Monte Carlo Tree Search (蒙特卡罗搜索树简介) 部分翻译自“Monte Carlo Tree Search and Its Applicati ...
强化学习读书笔记 - 05 - 蒙特卡洛方法(Monte Carlo Methods)
强化学习读书笔记 - 05 - 蒙特卡洛方法(Monte Carlo Methods) 学习笔记: Reinforcement Learning: An Introduction, Richard S ...
Programming a Hearthstone agent using Monte Carlo Tree Search(chapter one)
Markus Heikki AnderssonHåkon HelgesenHesselberg Master of Science in Computer Science Submission dat ...
Ⅳ Monte Carlo Methods
Dictum: Nutrition books in the world. There is no book in life, there is no sunlight; wisdom withou ...
Monte Carlo方法简介(转载)
Monte Carlo方法简介(转载) 今天向大家介绍一下我现在主要做的这个东东. Monte Carlo方法又称为随机抽样技巧或统计实验方法,属于计算数学的一个分支,它是在上世纪四十年代 ...
PRML读书会第十一章 Sampling Methods（MCMC， Markov Chain Monte Carlo，细致平稳条件，Metropolis-Hastings，Gibbs Sampling，Slice Sampling，Hamiltonian MCMC）
主讲人网络上的尼采 (新浪微博: @Nietzsche_复杂网络机器学习) 网络上的尼采(813394698) 9:05:00 今天的主要内容:Markov Chain Monte Carlo,M ...

随机推荐

Matlab 2019a 下载和安装步骤
目录 1. 安装包下载 + 多套数学建模视频教程 2. 安装步骤 1. 安装包下载 + 多套数学建模视频教程参考:https://blog.csdn.net/COCO56/article/detai ...
P4315 月下“毛景树” （树链剖分+边剖分+区间覆盖+区间加+区间最大值）
题目链接:https://www.luogu.org/problem/P4315 题目大意: 有N个节点和N-1条树枝,但节点上是没有毛毛果的,毛毛果都是长在树枝上的.但是这棵“毛景树”有着神奇的魔力 ...
zookeeper之一安装和配置(单机+集群)
这里我以zookeeper3.4.10.tar.gz来演示安装,安装到/usr/local/soft目录下. 一.单机版配置 1.安装和配置 #.下载 wget http://apache.fayea ...
MySQL--limit使用注意
limit m,n 的意义是在选择.查询得到的结果中,从第m条开始,拿连续的n条作为结果返回.根据它的原理可以知道,select ....limit m,n时要扫描得到的数据条数是m+n条.这就导致m ...
Linux发行版和内核版本
1./etc/issue 和 /etc/redhat-release都是系统安装时默认的发行版本信息,通常安装好系统后文件内容不会发生变化. 2.lsb_release -a :FSG(Free St ...
B/S大文件下载+断点续传
1.先将 webuploader-0.1.5.zip 这个文件下载下来:https://github.com/fex-team/webuploader/releases 根据个人的需求放置自己需要的 ...
asp.net实现大视频上传
IE的自带下载功能中没有断点续传功能,要实现断点续传功能,需要用到HTTP协议中鲜为人知的几个响应头和请求头. 一. 两个必要响应头Accept-Ranges.ETag 客户端每次提交下载请求时,服务 ...
【bzoj3195】【 [Jxoi2012]奇怪的道路】另类压缩的状压dp好题
(上不了p站我要死了) 啊啊,其实想清楚了还是挺简单的. Description 小宇从历史书上了解到一个古老的文明.这个文明在各个方面高度发达,交通方面也不例外.考古学家已经知道,这个文明在全盛时期 ...
[CF342C]Cupboard and Balloons 题解
前言博主太弱了题解这道题目是一个简单的贪心. 首先毋庸置疑,柜子的下半部分是要放满的. 于是我们很容易想到,分以下三种情况考虑: \[\small\text{请不要盗图,如需使用联系博主}\] ...
[算法]Min_25筛
前言本篇文章中使用的字母\(p\),指\(\text{任意的} p \in \text{素数集合}\) 应用场景若函数\(f(x)\)满足, \(f(x)\)是积性函数 \(f(p)\)可以使用多 ...

Monte Carlo Policy Evaluation

Monte Carlo Policy Evaluation的更多相关文章

随机推荐

热门专题