From the last post about MDP, we know that the environment consists of five basic elements:

S: the state space of the environment;

A: the action space that the environment allows;

{P^a_{s,s'}}: the transition matrices, i.e. the probabilities of the environment moving from one state to another when actions are taken. There is one matrix per action.

R: the reward, i.e. how much reward an agent receives from the environment when the system transitions from state s to s' due to action a. Note that the reward is sometimes defined slightly differently.

γ: the discount factor, which determines how rewards are discounted over time.
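Putting these together, an MDP is commonly written as the tuple:

\[
\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle
\]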

How MDP Differs from MRP:

Keyword: Action

The five elements of MDP can be illustrated by the chart below, in which the green circles are states, the orange circles are actions, and there are two rewards. In an MRP or a Markov Process we directly know the transition matrix. In an MDP, however, the transition path from one state to another is interrupted by actions. It is also worth noting that when the environment is in a certain state, there are no fixed probabilities for the actions. The reason is quite understandable: we live in the same world (environment), but different people behave differently.

Agent and Policy

The agent is the person or robot that interacts with the environment in Reinforcement Learning. Like human beings, every agent may behave differently under the same conditions. The probability distribution over actions in each state is the policy. There are many possible actions in an environment, but a specific agent (person) may take only one or a few of them in a certain state. Given the state, the policy is defined by:
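In the usual notation, this is the conditional distribution over actions given the current state:

\[
\pi(a \mid s) = P\left[A_t = a \mid S_t = s\right]
\]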

An example of a policy is shown below:
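For instance (a hypothetical illustration, reusing the numbers from the transition example further below), an agent in state s0 might choose a0 with probability 0.4 and a1 with probability 0.6:

\[
\pi(a_0 \mid s_0) = 0.4, \qquad \pi(a_1 \mid s_0) = 0.6
\]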

From MDP to MRP: MDP + Policy

Transition Matrix: Without a policy we do not know exactly the probability of transitioning from state s to s', because different agents may take actions with different probabilities. As soon as we have π, we can calculate the state transition matrix.
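In the standard notation, the policy-induced transition matrix is the policy-weighted average of the per-action matrices:

\[
P^{\pi}_{s,s'} = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, P^{a}_{s,s'}
\]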

In the chart above, for example, if an agent takes actions a0 and a1 with probabilities 0.4 and 0.6, the transition probability from s0 to s1 is 0.4 × 0.5 + 0.6 × 1 = 0.8.
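A minimal sketch of this calculation in Python (the matrices and the policy below are made up for illustration; only the entries reproducing the 0.4/0.6 and 0.5/1 numbers come from the example above):

```python
import numpy as np

# Hypothetical per-action transition matrices over three states [s0, s1, s2].
# Most entries are invented; only P["a0"][s0, s1] = 0.5 and P["a1"][s0, s1] = 1.0
# match the example above.
P = {
    "a0": np.array([[0.0, 0.5, 0.5],
                    [0.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0]]),
    "a1": np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0]]),
}

# Hypothetical policy pi(a | s): one row per state, one column per action.
actions = ["a0", "a1"]
pi = np.array([[0.4, 0.6],   # in s0: 0.4 for a0, 0.6 for a1 (as in the example)
               [0.5, 0.5],   # in s1 (made up)
               [0.5, 0.5]])  # in s2 (made up)

# Policy-induced transition matrix: P_pi[s, s'] = sum_a pi(a | s) * P^a[s, s']
P_pi = sum(pi[:, [i]] * P[a] for i, a in enumerate(actions))

print(P_pi[0, 1])  # 0.4 * 0.5 + 0.6 * 1.0 = 0.8
```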

Reward:

In an MDP the reward function depends on the action taken; it averages over the uncertainty in the outcomes of that action.
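In the usual notation, the action-dependent reward function is:

\[
R^{a}_{s} = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a\right]
\]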

Once we have the policy π, we know the action distribution of a specific agent, so we can average over the uncertainty of actions and measure how much immediate reward can be received from state s under policy π.
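Averaging the action-dependent rewards with the policy gives the policy-induced reward function:

\[
R^{\pi}_{s} = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, R^{a}_{s}
\]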

So now we go back from MDP to MRP, and the Markov Reward Process is defined by the tuple
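In the usual notation this tuple is:

\[
\langle \mathcal{S}, P^{\pi}, R^{\pi}, \gamma \rangle
\]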

 

Two Value Functions:

State Value Function:

The State Value Function is the same as the value function in an MRP. It is used to evaluate how good it is to be in a state s (in terms of immediate and future rewards); the only difference is that it averages over the uncertainty of actions under policy π. It has the form:
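In the usual notation, with the return defined as \(G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots\):

\[
v_{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right]
\]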

Action Value Function:

To average over the uncertainty of actions, it is necessary to know the expected return of each possible action. So in MDP we have the Action Value Function, which reveals how good or bad it is for the agent to take a particular action a in state s.
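In the usual notation:

\[
q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a\right]
\]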

If we take the expectation of the Action Value Function over the actions available in state s (weighted by the policy π), we end up with the State Value Function v:
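In standard form:

\[
v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, q_{\pi}(s, a)
\]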

Similarly, when an action is taken, the system may end up in different states. By averaging over the uncertainty of the state transition, we can go back from the State Value Function to the Action Value Function:
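In standard form:

\[
q_{\pi}(s, a) = R^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} P^{a}_{s,s'} \, v_{\pi}(s')
\]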

If we put them together:
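Substituting the second relation into the first gives the Bellman expectation equation for the State Value Function:

\[
v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( R^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} P^{a}_{s,s'} \, v_{\pi}(s') \right)
\]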

Another way:
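Substituting the first relation into the second instead gives the corresponding equation for the Action Value Function:

\[
q_{\pi}(s, a) = R^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} P^{a}_{s,s'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s') \, q_{\pi}(s', a')
\]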
