a survey for RL

• A finite set of states, where the state S_t summarizes the information the agent senses from the environment at time step t ∈ {1, ..., T}.

• A set of actions, where A_t is the action the agent performs at time step t ∈ {1, ..., T} to interact with the environment.

• A set of transition probabilities between subsequent states which render the environment stochastic. Note: the probabilities are usually not modeled explicitly; they result from the stochastic nature of the financial asset’s price process.

• A reward (or return) function R_t which provides a numerical feedback value r_t to the agent in response to its action A_{t-1} = a_{t-1} in state S_{t-1} = s_{t-1}.

• A policy π which maps states to concrete actions to be carried out by the agent. The policy can hence be understood as the agent’s rules for how to choose actions.

• A value function V which maps states to the total (discounted) reward the agent can expect from a given state until the end of the episode (trading period) under policy π.
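To make the "total (discounted) reward" in the last item concrete, here is a minimal Python sketch that sums the rewards of one finite episode, each weighted by a power of a discount factor. The reward sequence and the discount value 0.9 are hypothetical and chosen only for illustration; the value function is the expectation of this quantity under policy π.

```python
def discounted_return(rewards, gamma):
    """Return sum over k of gamma**k * rewards[k] for a finite episode."""
    total = 0.0
    # Walk backwards so each step applies G = r + gamma * G_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Hypothetical per-step rewards of one short trading episode.
episode_rewards = [1.0, 0.0, -0.5, 2.0]
print(discounted_return(episode_rewards, gamma=0.9))
```

With a discount factor below 1, later rewards count less than earlier ones, which is why the final reward of 2.0 contributes only 0.9^3 · 2.0 to the total.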

Given the above framework, the decision problem is formalized as finding the optimal policy π = π^*, i.e., the mapping from states to actions, corresponding to the optimal value function V^* (see also Dempster et al. (2001); Dempster and Romahi (2002)):

$$V^*(s_t) = \max_{a_t} \mathbb{E}\left[ R_{t+1} + \gamma V^*(S_{t+1}) \mid S_t = s_t, A_t = a_t \right] \tag{1}$$

Here, E denotes the expectation operator, γ the discount factor, and R_{t+1} the immediate reward for carrying out action A_t = a_t in state S_t = s_t. Further, S_{t+1} denotes the next state of the agent. The value function can hence be understood as a mapping from states to expected discounted future rewards, which the agent seeks to maximize with its actions.
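When the transition probabilities are known, equation (1) can be solved by repeatedly applying its right-hand side until the values stop changing (value iteration). The sketch below assumes a hypothetical two-state toy problem with explicit transitions; this differs from the trading setting described above, where the probabilities are usually not modeled. The names `P`, `R` and `value_iteration` are illustrative only.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Iterate the right-hand side of equation (1) to a fixed point.

    P[s][a] is a list of (probability, next_state) pairs and
    R[s][a] is the expected immediate reward; both are assumed known.
    """
    V = [0.0] * len(P)
    for _ in range(max_iter):
        V_new = [
            max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(len(P))
        ]
        # Stop once no state value changed by more than the tolerance.
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical toy problem: action 0 stays, action 1 switches state;
# only staying in state 1 pays a reward of 1.
P = [
    [[(1.0, 0)], [(1.0, 1)]],  # transitions from state 0
    [[(1.0, 1)], [(1.0, 0)]],  # transitions from state 1
]
R = [
    [0.0, 0.0],
    [1.0, 0.0],
]
V_star = value_iteration(P, R, gamma=0.5)
print(V_star)
```

For this toy problem the fixed point can be checked by hand: staying in state 1 forever is worth 1 / (1 − 0.5) = 2, and the best move from state 0 is to switch, which is worth 0.5 · 2 = 1.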

To solve this optimization problem, the Q-Learning algorithm (Watkins, 1989) can be applied, extending the above equation to the level of state-action tuples:

$$Q^*(s_t, a_t) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a_{t+1}} Q^*(S_{t+1}, a_{t+1}) \mid S_t = s_t, A_t = a_t \right] \tag{2}$$

Here, the Q-value Q^*(s_t, a_t) equals the expected immediate reward for carrying out action A_t = a_t in state S_t = s_t plus the discounted future reward from carrying on in the best way possible.
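Comparing equations (1) and (2) gives the standard link between the two optimal functions. It is stated here as a well-known identity, not as a result taken from the cited sources:

$$V^*(s_t) = \max_{a_t} Q^*(s_t, a_t)$$

In words, the optimal value of a state is the Q-value of the best action available in it.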

The optimal policy π^* (the mapping from states to actions) then simply becomes:

$$\pi^*(s_t) = \arg\max_{a_t} Q^*(s_t, a_t) \tag{3}$$

i.e., in every state S_t = s_t, choose the action A_t = a_t that yields the highest Q-value. To approximate the Q-function during (online) learning, an estimate Q of Q^* is updated iteratively after each observed transition, with α denoting the learning rate (see also Sutton and Barto (1998) for further details):

$$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right) \tag{4}$$
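The update in (4) and the greedy action choice in (3) can be sketched for the tabular case as follows. This is a minimal illustration, not the implementation from the cited works: the table `Q` holds the current estimate as a list of per-state action values, and the table contents, the observed transition, and the values of α and γ are all hypothetical.

```python
def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply update (4) in place to a table Q[state][action]."""
    # Bootstrapped target: observed reward plus discounted best next Q-value.
    target = r + gamma * max(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target

def greedy_action(Q, s):
    """Equation (3): index of the action with the highest Q-value in state s."""
    return max(range(len(Q[s])), key=lambda a: Q[s][a])

# Hypothetical table with 2 states and 2 actions, and one observed transition:
# taking action 1 in state 0 yields reward 1.0 and leads to state 1.
Q = [[0.0, 0.0], [2.0, 0.5]]
q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q[0])
print(greedy_action(Q, 0))
```

After this single update, the entry for action 1 in state 0 moves halfway from 0 toward the target 1.0 + 0.9 · 2.0 = 2.8, so the greedy action in state 0 becomes action 1. In practice the greedy choice is usually mixed with some exploration so that all state-action pairs keep being visited.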
