In this post, I will illustrate the Markov Property, the Markov Reward Process, and finally the Markov Decision Process, which are fundamental concepts in Reinforcement Learning.

Markov Property

'The future is independent of the past given the present.'
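Formally, the Markov property states that the next state depends only on the current state, not on the full history:

P(S_{t+1} | S_t) = P(S_{t+1} | S_1, S_2, …, S_t)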

Markov Process (Markov Chain)

Keywords: state, transition matrix

A Markov Process is defined by a tuple (S, P), in which S is the state space and P is the transition matrix. The following chart is an example.

A transition matrix gives the probability of moving from one state to another: each entry is P_{ss'} = P(S_{t+1} = s' | S_t = s), and every row sums to 1.

In the example above, the transition matrix is:
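The figure's matrix isn't reproduced here, but as a sketch it can be written out in code. The state names and probabilities below are assumed from the classic "student" Markov chain that this example follows:

```python
import numpy as np

# Transition matrix of the example chain. The states and probabilities
# are assumed from the classic "student" Markov chain this post follows.
states = ["Class1", "Class2", "Class3", "Pass", "Pub", "Facebook", "Sleep"]

# P[i][j] = probability of moving from states[i] to states[j];
# every row is a probability distribution and sums to 1.
P = np.array([
    # C1   C2   C3   Pass Pub  FB   Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # Class1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # Class2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # Class3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # Facebook
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (absorbing)
])

assert np.allclose(P.sum(axis=1), 1.0)  # every row is a distribution
```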

Markov Reward Process: Markov Process with Value Judgement

Keywords: Reward, Return, Discount Factor, Value Function

An MRP adds two properties to the Markov chain: one is the Reward, which represents the immediate feedback the agent receives at time t+1 when it is in state s at time t; the other is the Discount Factor γ∈[0,1]. So the representing tuple is (S, P, R, γ).

Formally, the Reward is the immediate feedback: when the agent reaches state s at time t, it will definitely receive this reward at time t+1. It is defined by:

R_s = E[ R_{t+1} | S_t = s ]

Given the reward and the discount factor, we can calculate the Return for a given scenario by this equation:

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … = Σ_{k=0..∞} γ^k · R_{t+k+1}

An example of the Return calculation:

Scenario: Class1 -> Class2 -> Class3 -> Pass -> Sleep, and the agent starts at state Class1.

Case 1: when γ=0, G = -2 + (-2)·0 + (-2)·0 + 10·0 = -2

Case 2: when γ=1, G = -2 + (-2)·1 + (-2)·1 + 10·1 = 4

Case 3: when γ=0.8, G = -2 + (-2)·0.8 + (-2)·0.64 + 10·0.512 = -2 - 1.6 - 1.28 + 5.12 = 0.24
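These three cases can be checked with a short sketch (the discounted_return helper below is hypothetical, just illustrating the return formula):

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one scenario."""
    return sum(r * gamma ** k for k, r in enumerate(rewards))

# Rewards collected along Class1 -> Class2 -> Class3 -> Pass -> Sleep.
rewards = [-2, -2, -2, 10]

for gamma in (0.0, 1.0, 0.8):
    print(gamma, discounted_return(rewards, gamma))
# prints -2.0, 4.0, and ~0.24 respectively
```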

From the different values of γ, we see that the agent can be extremely short-sighted (caring only about the immediate reward), extremely far-sighted, or can seek a balance between short- and long-term rewards.

When an agent is in a certain state, the way to measure the total reward from this state over time is to calculate the expected Return over all possible scenarios. The function that calculates it is called the Value Function:

v(s) = E[ G_t | S_t = s ]

Ex. If the agent is in the Class3 state, it has probabilities 0.6 and 0.4 of transitioning to Pass and Pub respectively. Because there are loops inside the graph, it is difficult to derive the expected return directly from the value function. (Ignore the red-labeled values in the figure; they are the resulting state values.)

The Bellman Equation helps resolve this complexity:

It breaks the value function into two parts: the immediate reward and the discounted future reward. The future reward is discounted by γ, and it is spread over the successor states according to their transition probabilities, so it is actually an expectation:

v(s) = E[ R_{t+1} + γ·v(S_{t+1}) | S_t = s ] = R_s + γ · Σ_{s'∈S} P_{ss'} · v(s')

Now we can use the Bellman Equation to solve the value function:
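In matrix form the Bellman equation reads v = R + γPv, so for a small MRP the value function can be solved directly as v = (I − γP)⁻¹R. A minimal sketch, reusing the transition matrix P from above and assuming the per-state rewards from the example figure (γ=1 would make I − P singular because Sleep is absorbing, so γ=0.9 is used here):

```python
# Bellman equation in matrix form: v = R + gamma * P @ v,
# rearranged to (I - gamma * P) v = R and solved as a linear system.
# Per-state rewards are assumed from the example figure.
R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])
gamma = 0.9

v = np.linalg.solve(np.eye(len(states)) - gamma * P, R)
for s, value in zip(states, v):
    print(f"{s:8} v = {value:6.2f}")
```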

Markov Decision Process: MRP with Actions

Keywords: Action

A Markov Decision Process adds one more layer of complexity to the MRP; it is defined by a tuple (S, A, P, R, γ), in which:

S is the state space and γ is the discount factor; they are the same as in the MRP.

A is a finite set of Actions, which is new. Because of the existence of actions, the Transition Matrix and the Reward Function are both conditional on the State and the Action.

P is the State Transition Matrix, conditional on the state and action at time t: different actions lead to different distributions over the state at time t+1:

P^a_{ss'} = P(S_{t+1} = s' | S_t = s, A_t = a)

R is the Reward Function, conditional on state and action: different actions lead to different rewards, even from the same state s:

R^a_s = E[ R_{t+1} | S_t = s, A_t = a ]

A graph (from Wikipedia) helps in understanding the role of actions:

So by now we have the complete model of the environment: all states, all possible actions, and the transition matrix conditional on states and actions.
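As a sketch, such a model can be stored as a nested mapping from state and action to a distribution over (next state, reward) pairs. The states, actions, and numbers below are hypothetical, chosen only to show the shape of the model:

```python
# A tiny hypothetical MDP model: model[state][action] is a list of
# (probability, next_state, reward) triples, so both the transition
# probabilities and the rewards are conditional on (state, action).
model = {
    "Class": {
        "study": [(0.8, "Exam", -2.0), (0.2, "Class", -2.0)],
        "relax": [(1.0, "Class", -1.0)],
    },
    "Exam": {
        "study": [(0.6, "Pass", 10.0), (0.4, "Pub", 1.0)],
    },
    "Pub": {
        "study": [(1.0, "Class", -2.0)],
    },
    "Pass": {},  # terminal state: no actions available
}

def expected_reward(state, action):
    """R(s, a): expected immediate reward of taking action a in state s."""
    return sum(p * r for p, _, r in model[state][action])

print(expected_reward("Exam", "study"))  # 0.6*10 + 0.4*1 = 6.4
```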
