Markov Decision Process in Detail
From the last post about MDP, we know that the environment consists of 5 basic elements:
S: the state space of the environment;
A: the action space, i.e. the actions that the environment allows;
{P^a_ss'}: the transition matrices, i.e. the probabilities that the environment moves from one state to another when an action is taken. There is one matrix per action.
R: the reward, i.e. how much reward the agent receives from the environment when the system transitions from state s to s' due to action a. Some texts define the reward differently.
γ: the discount factor, i.e. how quickly the reward is discounted over time.
How MDP Differs from MRP:
Keyword: Action
The five elements of an MDP can be illustrated by the chart below, in which the green circles are states, the orange circles are actions, and there are two rewards. In an MRP or a Markov Process, we know the transition matrix directly. In an MDP, however, the path from one state to another is interrupted by actions. It is worth noting that the environment itself assigns no probabilities to the actions available in a given state. The reason is intuitive: we all live in the same world (environment), but different people behave differently.

Agent and Policy
The agent is the person or robot that interacts with the environment in Reinforcement Learning. As with human beings, different agents may behave differently under the same conditions. A policy is the probability distribution over actions in each state. An environment may allow many actions, but a specific agent (person) may take only one or a few of them in a given state. Given a state, the policy is defined by:
$$\pi(a \mid s) = \mathbb{P}\left[A_t = a \mid S_t = s\right]$$
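A policy over finite states and actions can be stored as a table with one row per state. The sketch below is only an illustration: the two states, two actions, and the probabilities in `pi` are assumed values, not taken from the post's figures.

```python
import numpy as np

# Hypothetical policy table: pi[s, a] = probability of taking action a in state s.
# Rows are states (s0, s1), columns are actions (a0, a1); values are assumptions.
pi = np.array([
    [0.4, 0.6],  # in s0: a0 with 0.4, a1 with 0.6
    [0.5, 0.5],  # in s1: a0 and a1 equally likely
])

# Each row is a probability distribution over actions, so it must sum to 1.
rows_ok = np.allclose(pi.sum(axis=1), 1.0)

# Sample one action for the agent in state s0 according to the policy.
rng = np.random.default_rng(0)
state = 0
action = rng.choice(pi.shape[1], p=pi[state])
print(rows_ok, action)
```

A deterministic policy is the special case where every row puts probability 1 on a single action.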

An example of a policy is shown below:

From MDP Back to MRP: MDP + Policy
Transition Matrix: Without a policy we do not know the exact probability of transitioning from state s to s', because different agents may take actions with different probabilities. Once we have π, we can calculate the state transition matrix by weighting each action's transition probability by the probability that the policy chooses that action:
$$\mathcal{P}^{\pi}_{ss'} = \sum_{a \in A} \pi(a \mid s)\, \mathcal{P}^{a}_{ss'}$$
In the chart above, for example, if an agent takes actions a0 and a1 with probabilities 0.4 and 0.6, the transition probability from s0 to s1 is: 0.4*0.5+0.6*1=0.8
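The same weighted sum can be computed for every pair of states at once. In this minimal sketch only the s0 row follows the numbers quoted above (0.5 and 1 for reaching s1, 0.4 and 0.6 for the policy); the two-state layout and all other entries are assumptions made up for illustration.

```python
import numpy as np

# P[a, s, s'] = probability of moving from s to s' when action a is taken.
# Only P[0, 0, 1] = 0.5 and P[1, 0, 1] = 1.0 come from the text; the rest is assumed.
P = np.array([
    [[0.5, 0.5],   # a0 taken in s0
     [0.0, 1.0]],  # a0 taken in s1
    [[0.0, 1.0],   # a1 taken in s0
     [1.0, 0.0]],  # a1 taken in s1
])

# pi[s, a] = probability that the agent takes action a in state s.
pi = np.array([
    [0.4, 0.6],
    [0.5, 0.5],
])

# P_pi[s, s'] = sum over a of pi[s, a] * P[a, s, s']
P_pi = np.einsum("sa,ast->st", pi, P)
print(P_pi)
```

The entry `P_pi[0, 1]` reproduces the hand calculation 0.4*0.5+0.6*1, and every row of `P_pi` still sums to 1, as a transition matrix must.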
Reward:
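In an MDP the reward function depends on the action as well as the state. It averages over the uncertainty in the outcome of taking an action:
$$\mathcal{R}^{a}_{s} = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a\right]$$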

Once we have the policy π, we know the action distribution of a specific agent, so we can average over the uncertainty in actions and measure how much immediate reward the agent can expect to receive in state s under policy π:
$$\mathcal{R}^{\pi}_{s} = \sum_{a \in A} \pi(a \mid s)\, \mathcal{R}^{a}_{s}$$
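As a small numerical sketch of this averaging, assume a hypothetical reward table `R[s, a]` and policy table `pi[s, a]` for two states and two actions (all numbers are invented for illustration):

```python
import numpy as np

# R[s, a] = expected immediate reward for taking action a in state s (assumed values).
R = np.array([
    [1.0, 0.0],
    [-1.0, 2.0],
])

# pi[s, a] = probability of taking action a in state s (assumed values).
pi = np.array([
    [0.4, 0.6],
    [0.5, 0.5],
])

# R_pi[s] = sum over a of pi[s, a] * R[s, a]
R_pi = (pi * R).sum(axis=1)
print(R_pi)
```

For s0 this gives 0.4*1.0 + 0.6*0.0 = 0.4, and for s1 it gives 0.5*(-1.0) + 0.5*2.0 = 0.5.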

So now we are back from MDP to MRP, and the Markov Reward Process is defined by the tuple
$$\langle S, \mathcal{P}^{\pi}, \mathcal{R}^{\pi}, \gamma \rangle$$

Two Value Functions:
State Value Function:
The State Value Function is the same as the value function in an MRP. It evaluates how good it is to be in a state s (in terms of immediate and future reward); the only difference is that it averages over the uncertainty in actions under policy π. It has the form:
$$v_{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right]$$
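Because a fixed policy reduces the MDP to an MRP, v can be computed the way an MRP value function is. The sketch below assumes the standard matrix form of the MRP Bellman equation, v = R_pi + γ P_pi v, which this post does not state. The values of `P_pi`, `R_pi`, and `gamma` are hypothetical.

```python
import numpy as np

gamma = 0.9  # assumed discount factor

# Policy-averaged transition matrix and reward vector (assumed values).
P_pi = np.array([
    [0.2, 0.8],
    [0.5, 0.5],
])
R_pi = np.array([0.4, 0.5])

# Solve (I - gamma * P_pi) v = R_pi for v instead of inverting the matrix.
n_states = P_pi.shape[0]
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print(v_pi)
```

Solving the linear system directly is practical only for small state spaces; it is used here because it makes the link to the MRP explicit.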

Action Value Function:
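To average over the uncertainty in actions, we need to know the expected return (immediate and future reward) from each possible action. So an MDP also has the Action Value Function, which tells us how good or bad it is for the agent to take a particular action a in state s and follow policy π afterwards:
$$q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a\right]$$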

If we take the expectation of the Action Value Function over the actions available in the same state s, we end up with the State Value Function v:
$$v_{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, q_{\pi}(s, a)$$

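Similarly, when an action is taken, the system may end up in different states. Averaging over the uncertainty in the state transition takes us back from the State Value Function to the Action Value Function:
$$q_{\pi}(s, a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in S} \mathcal{P}^{a}_{ss'}\, v_{\pi}(s')$$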

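If we put them together, the State Value Function can be written in terms of itself:
$$v_{\pi}(s) = \sum_{a \in A} \pi(a \mid s) \left( \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in S} \mathcal{P}^{a}_{ss'}\, v_{\pi}(s') \right)$$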

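Alternatively, the Action Value Function can be written in terms of itself:
$$q_{\pi}(s, a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in S} \mathcal{P}^{a}_{ss'} \sum_{a' \in A} \pi(a' \mid s')\, q_{\pi}(s', a')$$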
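The relations between the two value functions can be checked numerically. This is a minimal sketch on a hypothetical MDP with two states and two actions: `P`, `R`, `pi`, and `gamma` are assumed values, and v is obtained by solving the linear system of the policy-induced MRP.

```python
import numpy as np

gamma = 0.9  # assumed discount factor

# P[a, s, s'], R[s, a] and pi[s, a] are hypothetical illustration values.
P = np.array([
    [[0.5, 0.5], [0.0, 1.0]],  # transitions under a0
    [[0.0, 1.0], [1.0, 0.0]],  # transitions under a1
])
R = np.array([[1.0, 0.0], [-1.0, 2.0]])
pi = np.array([[0.4, 0.6], [0.5, 0.5]])

# Collapse the MDP into the policy-induced MRP and solve for v.
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = (pi * R).sum(axis=1)
v = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# q[s, a] = R[s, a] + gamma * sum over s' of P[a, s, s'] * v[s']
q = R + gamma * np.einsum("ast,t->sa", P, v)

# v[s] = sum over a of pi[s, a] * q[s, a] should recover the same v.
v_from_q = (pi * q).sum(axis=1)

# Substituting v_from_q back in gives the q-only form of the equation.
q_from_q = R + gamma * np.einsum("ast,t->sa", P, v_from_q)

print(np.allclose(v_from_q, v), np.allclose(q_from_q, q))
```

Both checks hold because averaging q over the policy is exactly the right-hand side of the equation for v.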
