Reinforcement Learning Solutions Ed2 Chapter 1 - 2

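From Chapter 3 onward, the RL book has an unbelievable number of exercises.

The first two chapters are fairly easy, so I am just jotting the answers down on the blog. Later chapters will be updated as a PDF.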

1.1: Self-play will produce different moves even from the first step, because action selection is randomized. The method should then learn two sets of value functions, one for moving first and one for moving second. In general, I believe self-play improves the ability to win over the long run, but converges more slowly than playing against an opponent who already knows the game. Moreover, self-play sets no learning objective tied to future opponents, which may make exploiting a particular opponent's weaknesses harder.

1.2: Mirrored positions should be bound to the same state in the value function. Either create the symmetric images of each state (4 rotations, 8 if reflections are included) or perform the rotation during play. However, if the opponent does not take advantage of symmetry and has some strange belief in certain patterns, the value function should treat each state separately in order to exploit that difference. If the agent is trained against a strong opponent, such adjustments are no longer necessary. In any case, the agent has no a priori information about its opponent.
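
As a minimal sketch of binding symmetric positions to one value entry, the snippet below maps a tic-tac-toe board to a canonical representative. The board encoding (a row-major 9-tuple of 'X', 'O' and '.') and the helper names `rotate`, `reflect`, `symmetries` and `canonical` are assumptions made for illustration; they do not come from the book.

```python
def rotate(board):
    """Rotate a 3x3 board (row-major 9-tuple) by 90 degrees clockwise."""
    return tuple(board[6 - 3 * (i % 3) + i // 3] for i in range(9))


def reflect(board):
    """Mirror a 3x3 board left to right."""
    return tuple(board[3 * (i // 3) + 2 - i % 3] for i in range(9))


def symmetries(board):
    """Return all 8 images: 4 rotations, each with and without a reflection."""
    images = []
    for _ in range(4):
        images.append(board)
        images.append(reflect(board))
        board = rotate(board)
    return images


def canonical(board):
    """Pick one fixed representative so symmetric states share a value entry."""
    return min(symmetries(board))


values = {}  # value table keyed by the canonical board
corner_a = tuple("X........")  # X in the top-left corner
corner_b = tuple("........X")  # X in the bottom-right corner
values[canonical(corner_a)] = 0.5
print(canonical(corner_b) in values)  # both corners hit the same entry
```

Keying the table on `canonical(board)` is the "rotate during play" option; to exploit an opponent who ignores symmetry, one would key on the raw board instead.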

1.3: Provided the agent still explores in some way, greedy play may be good. For example, in the 10-armed bandit problem, the traditional solution is indeed greedy. However, less greedy methods such as softmax tend to perform better and converge faster, since they quickly learn the outcome of all actions rather than only the seemingly great ones. Of course, if the opponent changes its policy, greedy will be very slow to react, and a different action-selection method has to be considered.

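1.4: Skipped; the question is unclear. In short, if we explore and also back up values from the exploratory moves, the earlier actions are wrongly credited with an outcome that only exploration could trigger. This lowers or raises the evaluation of all those actions, yet the action sequence may not be repeatable, because it was nothing more than random exploration.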

1.5: Skipped. Too open-ended. Many of the improvements are only covered later, for example combining with DL.

2.1: 75% (ε is the probability of exploring over the entire action space, greedy action included, not just over the actions other than the greedy one)
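
The 75% can be written out as below, assuming the exercise's setting of two actions and ε = 0.5 (the exercise text is not reproduced in this post), with k the number of actions:

```latex
\Pr(\text{greedy action}) = (1 - \varepsilon) + \frac{\varepsilon}{k}
                          = 0.5 + \frac{0.5}{2} = 0.75
```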

2.2: A4, and possibly all of them. A4 is not the greedy action at that step; see 2.1 for why any action can be the result of exploration.

2.3: The one with a 0.01 probability of exploring. Its limiting probability of choosing the optimal action is just above 0.99, which is higher than that of the method with a 0.1 probability of exploring. Given enough time steps, the method that explores less will indeed always end up with the higher cumulative reward.
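
A small sketch of the limiting probabilities, assuming the 10-armed testbed and assuming the value estimates have converged so that the greedy arm is the optimal one. The function name `p_optimal` is hypothetical.

```python
def p_optimal(epsilon, k):
    """Long-run probability that epsilon-greedy picks the best of k arms.

    With probability 1 - epsilon the greedy (here: optimal) arm is taken,
    and with probability epsilon an arm is drawn uniformly from all k arms.
    """
    return (1 - epsilon) + epsilon / k


for eps in (0.1, 0.01):
    print(eps, round(p_optimal(eps, 10), 3))
```

Under these assumptions ε = 0.01 approaches 0.991 and ε = 0.1 approaches 0.91, which is the "just above 0.99" in the answer.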

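2.4: I will skip the math. The weights now depend on n: a division-style step size puts more emphasis on earlier rewards, a multiplication-style one on later rewards.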
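
For reference, unrolling the update with a step size α_n that varies with n gives the weight on each past reward. This is a sketch of the standard derivation, not the author's worked answer:

```latex
\begin{aligned}
Q_{n+1} &= Q_n + \alpha_n (R_n - Q_n) \\
        &= \alpha_n R_n + (1 - \alpha_n) Q_n \\
        &= Q_1 \prod_{i=1}^{n} (1 - \alpha_i)
           + \sum_{i=1}^{n} \alpha_i R_i \prod_{j=i+1}^{n} (1 - \alpha_j)
\end{aligned}
```

With α_n = 1/n every reward gets weight 1/n and Q_1 drops out, while a constant α gives weight α(1 - α)^(n-i), which favors recent rewards.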

2.5: TODO

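2.6: This one is really nasty. My guess: initializing to 5 is far too optimistic, so whichever action is chosen, its value drops sharply, and each full round of selections lowers the values by another round, until they are close to the true values. At that point t is growing but is not yet large enough, so the algorithm very briefly exploits a few actions that might be optimal (because the sample is so small, picking the optimal action by value is right only about 40% of the time), which forms the spike in the figure. Right after that, as t increases, the second term of the algorithm grows and triggers a new round of exploration, so the reward rate drops again. Eventually the algorithm enters a stable phase where n is also large enough, the second term has all but vanished, and the algorithm gets arbitrarily close to optimal. Hence the curve rises, falls, then rises again.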
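
The "each round of selections lowers the values by another round" behaviour can be sketched with a deterministic toy run. The function name `optimistic_greedy`, the callback `reward_fn` and the reward values are assumptions for illustration; ties are broken toward the lowest index so the run is reproducible.

```python
def optimistic_greedy(reward_fn, k, steps, q_init=5.0, alpha=0.1):
    """Greedy selection with optimistic initial estimates.

    reward_fn(action) -> float is a hypothetical callback supplying rewards.
    """
    q = [q_init] * k
    actions = []
    for _ in range(steps):
        a = max(range(k), key=lambda i: q[i])  # greedy; first max wins ties
        q[a] += alpha * (reward_fn(a) - q[a])  # constant-step-size update
        actions.append(a)
    return actions, q


# Deterministic toy rewards (assumed values), all far below q_init = 5.
true_means = [0.2, 1.5, -0.3, 0.8]
actions, q = optimistic_greedy(lambda a: true_means[a], k=4, steps=12)
print(actions)
```

Because every reward is below the initial estimate, each pull lowers the chosen arm's value, so the greedy rule is pushed through all arms once per round before any arm is repeated.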

2.7: Math skipped, but it is roughly an exponential average. To exaggerate a little, is it similar to a kernel method?
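
Assuming this exercise refers to the book's step size β_n = α / ō_n, with ō_n = ō_{n-1} + α(1 - ō_{n-1}) and ō_0 = 0, the sketch below shows why the result is an exponential recency-weighted average without initial bias. The class name `UnbiasedConstantStep` is hypothetical.

```python
class UnbiasedConstantStep:
    """Value estimate using step size beta_n = alpha / o_bar_n."""

    def __init__(self, alpha, q_init=0.0):
        self.alpha = alpha
        self.q = q_init
        self.o_bar = 0.0  # trace that starts at 0 and approaches 1

    def update(self, reward):
        self.o_bar += self.alpha * (1.0 - self.o_bar)
        beta = self.alpha / self.o_bar  # equals 1 on the first update
        self.q += beta * (reward - self.q)
        return self.q


est = UnbiasedConstantStep(alpha=0.1, q_init=100.0)
first = est.update(3.0)   # beta_1 = 1, so the initial 100.0 is overwritten
second = est.update(5.0)  # later steps blend in new rewards
print(first, second)
```

As ō_n approaches 1, β_n approaches α, so the estimate behaves like the ordinary constant-α exponential average in the long run.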

2.8: Much the same as 2.6; I actually explained the two together.
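
A deterministic toy run of UCB selection illustrates the brief spike: once every arm has been tried, all exploration bonuses are equal and the best-looking arm is chosen, after which its bonus shrinks and another arm is explored. The function name `ucb_actions`, the callback `reward_fn` and the reward values are assumptions for illustration.

```python
import math


def ucb_actions(reward_fn, k, steps, c=2.0):
    """Run UCB action selection and return the chosen actions.

    reward_fn(action) -> float is a hypothetical reward callback.
    Untried arms are selected first, in index order.
    """
    q = [0.0] * k
    n = [0] * k
    actions = []
    for t in range(1, steps + 1):
        untried = [a for a in range(k) if n[a] == 0]
        if untried:
            a = untried[0]
        else:
            a = max(range(k),
                    key=lambda i: q[i] + c * math.sqrt(math.log(t) / n[i]))
        n[a] += 1
        q[a] += (reward_fn(a) - q[a]) / n[a]  # sample-average update
        actions.append(a)
    return actions


means = [0.1, 0.5, 0.9]  # assumed deterministic rewards, for illustration
print(ucb_actions(lambda a: means[a], k=3, steps=5))
```

In this run the fourth step picks the best arm (index 2) and the fifth step moves away from it again, which mirrors the rise and fall described above.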

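2.9: This question is a bit cheeky, since the book never actually covered the sigmoid. Yes, the two are clearly the same, because the sigmoid can be written as e^(2z)/(e^z + e^(2z)), which is exactly a softmax choice between two actions.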
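
Written with the book's action preferences H_t(a) for two actions a and b, the equivalence is a one-line rearrangement (a sketch; σ denotes the logistic sigmoid):

```latex
\Pr(A_t = a) = \frac{e^{H_t(a)}}{e^{H_t(a)} + e^{H_t(b)}}
             = \frac{1}{1 + e^{-(H_t(a) - H_t(b))}}
             = \sigma\big(H_t(a) - H_t(b)\big)
```

The form e^(2z)/(e^z + e^(2z)) is the special case H_t(a) = 2z, H_t(b) = z, which gives σ(z).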

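2.10: Under this condition, any algorithm that claims an expected reward above 0.5 is cheating. The traditional constant-alpha method might work, but by my estimate the best result actually comes from pure greedy, even when the label is given.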

2.11: TODO
