• A finite set of states $S_t$ summarizing the information the agent senses from the environment at every time step $t \in \{1, \dots, T\}$.

• A set of actions $A_t$ which the agent can perform at each time step $t \in \{1, \dots, T\}$ to interact with the environment.

• A set of transition probabilities between subsequent states which render the environment stochastic. Note: these probabilities are usually not modeled explicitly but arise from the stochastic nature of the financial asset’s price process.

• A reward (or return) function $R_t$ which provides a numerical feedback value $r_t$ to the agent in response to its action $A_{t-1} = a_{t-1}$ in state $S_{t-1} = s_{t-1}$.

• A policy π which maps states to concrete actions to be carried out by the agent. The policy can hence be understood as the agent’s rules for how to choose actions.

• A value function V which maps states to the total (discounted) reward the agent can expect from a given state until the end of the episode (trading period) under policy π.
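The components listed above can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the cited papers: the two-state "last price move" state space, the three positions used as actions, the persistence probability of 0.6, and all function names are hypothetical.

```python
import random

# Hypothetical toy setup (not from the text): the state is the direction of
# the last price move, the action is the position held over the next step.
STATES = ("down", "up")
ACTIONS = ("short", "flat", "long")
POSITION = {"short": -1.0, "flat": 0.0, "long": 1.0}


def step(state, action, rng, persist=0.6):
    """Sample (next_state, reward) from an assumed toy price process.

    Transition probabilities are never written down as a table; they are
    implied by how the price move is sampled.
    """
    if rng.random() < persist:
        next_state = state
    else:
        next_state = "up" if state == "down" else "down"
    move = 1.0 if next_state == "up" else -1.0
    reward = POSITION[action] * move  # profit or loss of the held position
    return next_state, reward


def momentum_policy(state):
    """A fixed policy: follow the direction of the last move."""
    return "long" if state == "up" else "short"


def discounted_return(policy, T, gamma, seed=0):
    """Roll out one episode of T steps and sum the discounted rewards."""
    rng = random.Random(seed)
    state, total = "up", 0.0
    for t in range(T):
        action = policy(state)
        state, reward = step(state, action, rng)
        total += (gamma ** t) * reward
    return total
```

Averaging `discounted_return` over many seeds would give a Monte Carlo estimate of the value of the start state under the chosen policy.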

Given the above framework, the decision problem is formalized as finding the optimal policy $\pi = \pi^*$, i.e., the mapping from states to actions that corresponds to the optimal value function $V^*$; see also Dempster et al. (2001) and Dempster and Romahi (2002):

  $$ V^*(s_t) = \max_{a_t} \mathbb{E}\left[ R_{t+1} + \gamma V^*(S_{t+1}) \mid S_t = s_t \right] \tag{1} $$

Here, $\mathbb{E}$ denotes the expectation operator, $\gamma$ the discount factor, and $R_{t+1}$ the immediate reward for carrying out action $A_t = a_t$ in state $S_t = s_t$. Further, $S_{t+1}$ denotes the next state of the agent. The value function can hence be understood as a mapping from states to the expected discounted future rewards, which the agent seeks to maximize through its actions.
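When the transition probabilities are known explicitly, which the text notes is usually not the case in trading, Equation (1) can be solved by repeatedly applying its right-hand side (value iteration). The sketch below assumes an infinite-horizon, stationary toy model with two states and a persistence probability of 0.6; the model and the names `build_toy_mdp` and `value_iteration` are hypothetical.

```python
def build_toy_mdp(persist=0.6):
    """Return P[state][action] = list of (probability, next_state, reward).

    Assumed toy model: the last price move repeats with probability
    `persist`, and the reward is the held position times the next move.
    """
    position = {"short": -1.0, "flat": 0.0, "long": 1.0}
    move = {"down": -1.0, "up": 1.0}
    other = {"down": "up", "up": "down"}
    P = {}
    for s in move:
        P[s] = {}
        for a, pos in position.items():
            P[s][a] = [
                (persist, s, pos * move[s]),
                (1.0 - persist, other[s], pos * move[other[s]]),
            ]
    return P


def value_iteration(P, gamma, tol=1e-10, max_iters=10_000):
    """Iterate V(s) <- max_a E[R + gamma * V(S')] until the change is below tol."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_V = {
            s: max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            for s in P
        }
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    return V


V = value_iteration(build_toy_mdp(), gamma=0.9)
```

In this symmetric toy model the best action earns an expected reward of 0.2 per step in either state, so both values should approach 0.2 / (1 − 0.9) = 2.0.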

To solve this optimization problem, the Q-Learning algorithm (Watkins, 1989) can be applied, extending the above equation to the level of state-action tuples:

  $$ Q^*(s_t, a_t) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a_{t+1}} Q^*(S_{t+1}, a_{t+1}) \mid S_t = s_t, A_t = a_t \right] \tag{2} $$

Here, the Q-value $Q^*(s_t, a_t)$ equals the expected sum of the immediate reward for carrying out action $A_t = a_t$ in state $S_t = s_t$ and the discounted future reward from acting optimally thereafter.
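The step from Equation (1) to Equation (2) rests on a standard identity between the two optimal functions, restated here in the notation used above as a brief aside:

$$ V^*(s_t) = \max_{a_t} Q^*(s_t, a_t), \qquad Q^*(s_t, a_t) = \mathbb{E}\left[ R_{t+1} + \gamma V^*(S_{t+1}) \mid S_t = s_t, A_t = a_t \right] $$

Substituting the first relation, evaluated at $S_{t+1}$, into the second yields Equation (2); taking the maximum over $a_t$ on both sides of the second recovers Equation (1).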

The optimal policy $\pi^*$ (the mapping from states to actions) then simply becomes:

  $$ \pi^*(s_t) = \arg\max_{a_t} Q^*(s_t, a_t) \tag{3} $$

i.e., in every state $S_t = s_t$, choose the action $A_t = a_t$ that yields the highest Q-value. To approximate the Q-function during (online) learning, an iterative update is carried out, with $\alpha$ denoting the learning rate; see also Sutton and Barto (1998) for further details:

  $$ Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right) \tag{4} $$
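A minimal tabular sketch of the update (4) and the greedy policy (3) might look as follows. The epsilon-greedy exploration scheme, the toy environment `toy_step`, and all function names are assumptions added for illustration; the text itself does not prescribe how actions are chosen during learning.

```python
import random


def greedy_action(Q, state):
    """Policy (3): pick the action with the highest Q-value in `state`."""
    return max(Q[state], key=Q[state].get)


def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Update (4): blend the old estimate with the bootstrapped target."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target


def train(env_step, states, actions, steps, alpha, gamma, epsilon, seed=0):
    """Online Q-learning; epsilon-greedy exploration is an assumption here."""
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in actions} for s in states}
    s = states[0]
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.choice(actions)  # explore: random action
        else:
            a = greedy_action(Q, s)  # exploit: current best action
        s_next, r = env_step(s, a, rng)
        q_update(Q, s, a, r, s_next, alpha, gamma)
        s = s_next
    return Q


def toy_step(state, action, rng, persist=0.6):
    """Hypothetical environment: the last price move tends to repeat."""
    flipped = "up" if state == "down" else "down"
    next_state = state if rng.random() < persist else flipped
    move = 1.0 if next_state == "up" else -1.0
    position = {"short": -1.0, "flat": 0.0, "long": 1.0}[action]
    return next_state, position * move


Q = train(toy_step, ("down", "up"), ("short", "flat", "long"),
          steps=5000, alpha=0.1, gamma=0.9, epsilon=0.1)
policy = {s: greedy_action(Q, s) for s in Q}
```

With a constant learning rate the Q-values keep fluctuating around their targets, so the greedy policy read off at the end can vary with the seed; a decaying $\alpha$ is a common remedy.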
