In this post, I will illustrate Markov Property, Markov Reward Process and finally Markov Decision Process, which are fundamental concepts in Reinforcement Learning.

Markov Property

'The state is independent of the past given the present'

Markov Process (Markov Chain)

Keywords: state, transition matrix

A Markov Process is defined by a Tuple(S,P), in which S is the state space, and P is the transition matrix. The following chart is an example.

A transition matrix demonstrates the probabilities of transitioning from one state to another.

In the example above, the transition matrix is:

Markov Reward Process: Markov Process with Value Judgement

Keywords: Reward, Return, Discount Factor, Value Function

MRP add two additional properties into Markov Chain: one is Reward, who represents the immediate feedback an agent can receive at time t+1 if he is in state s at time t; another property is Discount Factor γ∈[0,1]. So the representation tuple is [S,P,R,γ].

Formally, Reward is the immediate feedback, which means when agent gets to state s at time t, it can definetly receive this reward at time t+1. It is defined by:

Given reward and discount factor, we can calculate the Return for a given senario by this equation:

Example for Return calculation:

Senario: Class1->Class2->Class3->Pass->Sleep, and the agent is at state=Class1.

Case 1: when gamma=0, g=-2+(-2)*0+(-2)*0+10*0=-2

Case 2: when gamma=1, g=-2+(-2)*1+(-2)*1+10*1=4

Case 3: when gamma=0.8, g=-2+(-2)*0.8+(-2)*0.64+10*=-4-1.6+5.12=-0.48

From different γ, we know our agent can be exetremely short-sighted (far-sighted) only for immediate reward, or trying to seek balance between short and long term reward.

When an agent is in a certain state, the way to measure the total reward from this state over time is calculating expected Returns for all possible senarios. The function to calculate it is called Value Function:

Ex. If the agent is at Class3 state, it has 0.6 and 0.4 probabilities to transite to Pass and Pub respectively. Because there are loops inside the graph, it's difficult to directly derive expected return from value function. (Forget the red labeled value, they are result...)

Bellman Equation helps to solve this complexity:

It breaks the value function into two parts: Immediate Reward and Future Reward:The future reward is discounted by γ, and it has probabilities on different states, so actually the future reward is an expectation.

Now we can use Bellman Equation to solve value function:

Markov Decision Process: MRP with Actions

Keywords: Action

Markov Decision Process adds more complexity onto MRP, it is defined by a tuple(S,A,P,R,γ), in which:

S is state space, and γ is discount factor, they are same as MRP.

A is a finite set of Actions, which is new. Then because of the existense of Action, Transition Matrix and Reward Function are all conditional on both State and Action.

P is State Transition Matrix: it is conditional on state and action at time t, which means different actions would result in different distribution of state at time t+1.

R is Reward Function conditional upon state and action: also, different actions lead to different reward, despite of the same state s.

A graph(from Wikipedia) helps understanding the role of actions:

So by now, we have already had the model of the environment: all states, all possible actions and transition matrix conditional on state and actions.

Step-by-step from Markov Process to Markov Decision Process的更多相关文章

  1. Step by step Process of creating APD

    Step by step Process of creating APD: Business Scenario: Here we are going to create an APD on top o ...

  2. Step by Step Process of Migrating non-CDBs and PDBs Using ASM for File Storage (Doc ID 1576755.1)

    Step by Step Process of Migrating non-CDBs and PDBs Using ASM for File Storage (Doc ID 1576755.1) AP ...

  3. Tomcat Clustering - A Step By Step Guide --转载

    Tomcat Clustering - A Step By Step Guide Apache Tomcat is a great performer on its own, but if you'r ...

  4. [ZZ] Understanding 3D rendering step by step with 3DMark11 - BeHardware >> Graphics cards

    http://www.behardware.com/art/lire/845/ --> Understanding 3D rendering step by step with 3DMark11 ...

  5. e2e 自动化集成测试 架构 实例 WebStorm Node.js Mocha WebDriverIO Selenium Step by step (二) 图片验证码的识别

    上一篇文章讲了“e2e 自动化集成测试 架构 京东 商品搜索 实例 WebStorm Node.js Mocha WebDriverIO Selenium Step by step 一 京东 商品搜索 ...

  6. Code Understanding Step by Step - We Need a Task

      Code understanding is a task we are always doing, though we are not even aware that we're doing it ...

  7. enode框架step by step之saga的思想与实现

    enode框架step by step之saga的思想与实现 enode框架系列step by step文章系列索引: 分享一个基于DDD以及事件驱动架构(EDA)的应用开发框架enode enode ...

  8. 课程五(Sequence Models),第一 周(Recurrent Neural Networks) —— 1.Programming assignments:Building a recurrent neural network - step by step

    Building your Recurrent Neural Network - Step by Step Welcome to Course 5's first assignment! In thi ...

  9. 精通initramfs构建step by step

    (一)hello world  一.initramfs是什么  在2.6版本的linux内核中,都包含一个压缩过的cpio格式 的打包文件.当内核启动时,会从这个打包文件中导出文件到内核的rootfs ...

随机推荐

  1. 20191202IIS

    IIS和.netfw4.0安装顺序是从前到后,如果不小心颠倒了,无所谓. 打开程序-运行-cmd:输入一下命令重新注册IIS C:\WINDOWS\Microsoft.NET\Framework\v4 ...

  2. 如何设置移动端的tab栏

    这是添加tab栏的代码: {                     "id": "tabBar1",                     "st ...

  3. 下载-MS SQL Server 2005(大全版)含开发人员版、企业版、标准版【转】

    中文名称:微软SQL Server 2005 英文名称:MS SQL Server 2005资源类型:ISO版本:开发人员版.企业版.标准版发行时间:2006年制作发行:微软公司地区:大陆语言:普通话 ...

  4. Java使用多线程发送消息

    在后台管理用户信息的时候,经常会用到批量发送提醒消息,首先想到的有: (1).循环发送列表,逐条发送.优点是:简单,如果发送列表很少,而且没有什么耗时的操作,是比较好的一种选择,缺点是:针对大批量的发 ...

  5. 【CF】38E Let's Go Rolling! (dp)

    前言 这题还是有点意思的. 题意: 给你 \(n\) (\(n<=3000\)) 个弹珠,它们位于数轴上.给你弹珠的坐标 \(x_i\) 在弹珠 \(i\) 上面花费 \(C_i\) 的钱 可以 ...

  6. html 头部设置

    https://juejin.im/post/5a4ae29b6fb9a04504083cac <head> <meta charset="UTF-8"> ...

  7. vertica copy

    copy huimei.ken_copy  from '/home/dbadmin/file.txt' delimiter ';'

  8. SpringBoot自定义Jackson配置

    为了在SpringBoot工程中集中解决long类型转成json时JS丢失精度问题和统一设置常见日期类型序列化格式,我们可以自定义Jackson配置类,具体如下: import com.fasterx ...

  9. Selenium-Switch与SelectApi介绍

    Switch 我们在UI自动化测试时,总会出现新建一个tab页面,弹出一个浏览器级别的弹框或者是出现一个iframe标签,这时我们用WebDriver提供的Api接口就无法处理这些情况了.需要用到Se ...

  10. 关于Reporting Services网站

    1.http://www.c-sharpcorner.com/search/sql%20server%20reporting%20services 2.https://msdn.microsoft.c ...