• A finite set of states $S_t$ summarizing the information the agent senses from the environment at every time step $t \in \{1, \ldots, T\}$.

• A set of actions $A_t$ which the agent can perform at each time step $t \in \{1, \ldots, T\}$ to interact with the environment.

• A set of transition probabilities between subsequent states, which render the environment stochastic. Note: these probabilities are usually not modeled explicitly but emerge from the stochastic nature of the financial asset's price process.

• A reward (or return) function $R_t$ which provides a numerical feedback value $r_t$ to the agent in response to its action $A_{t-1} = a_{t-1}$ in state $S_{t-1} = s_{t-1}$.

• A policy $\pi$ which maps states to concrete actions to be carried out by the agent. The policy can hence be understood as the agent's rules for how to choose actions.

• A value function $V$ which maps states to the total (discounted) reward the agent can expect from a given state until the end of the episode (trading period) under policy $\pi$. (A minimal code sketch of these components follows below.)
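To make the framework concrete, here is a minimal Python sketch collecting these components in one container. All names (the class `TradingMDP`, the discretized state/action sets, the reward signature) are illustrative assumptions, not definitions from the survey; transition probabilities are deliberately absent because, as noted above, they arise implicitly from the asset's price process.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class TradingMDP:
    """Illustrative container for the MDP components listed above."""
    states: Sequence[int]                # discretized market states S_t
    actions: Sequence[int]               # e.g., (-1, 0, 1) for short/flat/long
    reward: Callable[[int, int], float]  # numerical feedback r_{t+1} for (s_t, a_t)
    gamma: float = 0.99                  # discount factor used by the value function
```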

Given the above framework, the decision problem is formalized as finding the optimal policy $\pi = \pi^*$, i.e., the mapping from states to actions corresponding to the optimal value function $V^*$; see also Dempster et al. (2001) and Dempster and Romahi (2002):

$$V^*(s_t) = \max_{a_t} \mathbb{E}\left[R_{t+1} + \gamma V^*(S_{t+1}) \mid S_t = s_t\right]. \tag{1}$$

Here, $\mathbb{E}$ denotes the expectation operator, $\gamma$ the discount factor, and $R_{t+1}$ the expected immediate reward for carrying out action $A_t = a_t$ in state $S_t = s_t$. Further, $S_{t+1}$ denotes the next state of the agent. The value function can hence be understood as a mapping from states to discounted future rewards, which the agent seeks to maximize with its actions.
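If the transition model were known, equation (1) could be solved by repeatedly applying its right-hand side as a backup operator. The sketch below is a standard value-iteration routine under that assumption (the model `P` and reward table `R` are hypothetical inputs, not objects from the survey); in financial markets the model is typically unavailable, which motivates the model-free approach that follows.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Solve (1) by repeated backups, assuming a known model:
    P[s, a, s2] = transition probability, R[s, a] = expected reward."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V)   # E[R_{t+1} + gamma * V(S_{t+1})], shape (S, A)
        V_new = Q.max(axis=1)     # maximize over actions a_t, as in (1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```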

To solve this optimization problem, the Q-Learning algorithm (Watkins, 1989) can be applied, extending the above equation to the level of state-action tuples:

$$Q^*(s_t, a_t) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a_{t+1}} Q^*(S_{t+1}, a_{t+1}) \mid S_t = s_t, A_t = a_t\right]. \tag{2}$$

Here, the Q-value $Q^*(s_t, a_t)$ equals the expected immediate reward for carrying out action $A_t = a_t$ in state $S_t = s_t$ plus the discounted future reward from continuing in the best way possible.

The optimal policy $\pi^*$ (the mapping from states to actions) then simply becomes:

$$\pi^*(s_t) = \arg\max_{a_t} Q^*(s_t, a_t), \tag{3}$$

i.e., in every state $S_t = s_t$, choose the action $A_t = a_t$ that yields the highest Q-value. To approximate the Q-function during (online) learning, an iterative optimization is carried out, with $\alpha$ denoting the learning rate; see also Sutton and Barto (1998) for further details:

$$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left(r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})\right). \tag{4}$$
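Update rule (4) translates directly into a tabular implementation. The sketch below is a generic illustration, not code from the survey: the environment interface (`env.reset()` returning a state, `env.step(a)` returning `(s_next, r, done)`), the hyperparameter values, and the epsilon-greedy exploration are all assumptions layered on top of equations (2) to (4).

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning implementing update rule (4)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration around the greedy policy (3)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # rule (4); do not bootstrap past a terminal state
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next
    return Q
```

Each transition updates a single table entry, blending the old estimate with the bootstrapped target at rate $\alpha$; under standard conditions (sufficient exploration, suitably decaying $\alpha$) the table converges to $Q^*$.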
