Reinforcement Learning Solutions Ed2 Chapters 1-2: Exercise Answers
The number of exercises explodes once RL reaches Chapter 3.
The first two chapters are fairly simple, so I am just jotting the answers down here on the blog; later chapters will be posted as a PDF.
1.1: Self-play will produce different moves even from the very first step, because action selection is randomized. The method should then learn two sets of value functions, one for moving first and one for moving second. In general, I believe self-play improves the ability to win over the long run, but it converges more slowly than playing against an opponent who already plays knowledgeably. Indeed, self-play sets no particular opponent as the learning target, which may make exploiting a future opponent's weaknesses harder.
1.2: Mirror positions should be bound to the same state in the value function: either generate the symmetric images of each state or perform the rotation at play time. However, if the opponent does not take advantage of the symmetry and holds odd beliefs about particular patterns, the value function should treat each state separately in order to exploit that difference. If trained against a well-playing opponent, though, such amendments are no longer necessary. In any case the agent has no prior information about its opponent.
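For the first option, here is a minimal sketch (my own illustration, not code from the book) that maps every rotation and reflection of a tic-tac-toe board to one canonical key, so all symmetric positions share a single value-table entry; the 0/1/2 board encoding and the `canonical` helper are assumptions.

```python
import numpy as np

def canonical(board):
    """Return a canonical key for a 3x3 board (0=empty, 1=X, 2=O) so that
    all of its rotations and reflections map to the same value-table entry."""
    variants = []
    b = board
    for _ in range(4):
        b = np.rot90(b)                # the four rotations
        variants.append(b)
        variants.append(np.fliplr(b))  # and their mirror images
    # the lexicographically smallest flattened variant acts as the shared key
    return min(tuple(v.flatten()) for v in variants)

# usage: values[canonical(board)] looks up one shared estimate per symmetry class
```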
1.3: Under the assumption that the agent still explores, greedy may be good. For example, in the 10-armed bandit problem the traditional solution is essentially greedy. However, less greedy algorithms such as softmax show better performance and convergence speed, since they quickly learn the outcomes of all actions rather than only the seemingly great ones. Of course, if the opponent changes its policy, a greedy player will be very slow to react, and a different action-selection method has to be considered.
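To make the comparison concrete, a small sketch of ε-greedy versus softmax selection over value estimates Q; the function names and the temperature parameter are illustrative assumptions, not the book's code.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, eps):
    """With probability eps pick uniformly at random; otherwise pick the
    greedy action, breaking ties at random."""
    if rng.random() < eps:
        return rng.integers(len(Q))
    return rng.choice(np.flatnonzero(Q == Q.max()))

def softmax_action(Q, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature),
    so clearly bad arms are quickly de-emphasized without being cut off."""
    prefs = (Q - Q.max()) / temperature   # shift by the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(len(Q), p=probs)
```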
1.4: Skipped; the problem statement is unclear to me. Briefly: if the agent explores but the exploratory moves also update the preceding values, the earlier moves get wrongly credited with an outcome that only exploration could have caused. That lowers or raises the evaluation of all those moves, even though the action sequence may never be repeatable, because it came from random exploration.
1.5: Skipped; too open-ended. Many of the improvements are only covered later in the book, e.g. combining with deep learning.
2.1: 75% (ε is the probability of exploring over the entire action space, including the greedy action, rather than only the non-greedy ones).
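As a quick check, assuming the convention that the exploratory draw is uniform over all actions (greedy one included):

$$P(\text{greedy}) = (1-\varepsilon) + \frac{\varepsilon}{|\mathcal{A}|} = 0.5 + \frac{0.5}{2} = 0.75$$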
2.2: A4 definitely, and possibly all of them. A4 is not the greedy action at that point, and see 2.1 for why every selection could also have been exploratory.
2.3: the method with ε = 0.01. In the limit it selects the optimal action with probability just above 0.99, which is higher than for ε = 0.1. Given enough time steps, the method that explores less will indeed always end up with the higher cumulative reward.
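Assuming the 10-armed testbed and the same uniform-exploration convention as in 2.1, the asymptotic probabilities of picking the optimal action are:

$$\varepsilon = 0.01:\ 0.99 + \tfrac{0.01}{10} = 0.991, \qquad \varepsilon = 0.1:\ 0.9 + \tfrac{0.1}{10} = 0.91$$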
2.4: Skipping the full derivation; the weights are tied to n. Roughly, division-style step sizes such as 1/n keep weight on the earlier rewards, while repeated multiplication by (1 - α) shifts the weight toward the later ones.
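For reference, unrolling the incremental update Q_{n+1} = Q_n + α_n (R_n - Q_n) gives the general weighting (a standard derivation, sketched here rather than worked in full):

$$Q_{n+1} = \prod_{i=1}^{n}(1-\alpha_i)\,Q_1 + \sum_{i=1}^{n}\Big(\alpha_i \prod_{j=i+1}^{n}(1-\alpha_j)\Big) R_i$$

With α_i = 1/i this collapses to the plain sample average (every reward weighted 1/n), while a constant α makes the weights decay geometrically for older rewards.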
2.5: TODO
2.6: This one is genuinely nasty. My guess: initializing the values to 5 is far too optimistic, so whichever action is selected, its value immediately drops; after each round of selections the values fall further, until they get close to the true values. By then t has been growing but is still not large, so the algorithm very briefly exploits the few actions that look optimal (with so few samples, judging optimality from the value estimates is only about 40% accurate), which produces the spike in the figure. But right after that, as t grows, the second (exploration) term grows too, pushing a new round of exploration that drives the reward back down, until the algorithm reaches a stable phase where n is large enough that the second term has effectively vanished and the algorithm gets arbitrarily close to optimal. Hence the curve rises, falls, then rises again.
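To see the "values collapse after each pick" part of this story, here is a small sketch of the optimistic-greedy setup behind that curve (Q1 = 5, pure greedy, constant alpha = 0.1); the testbed construction and seed are my own assumptions about the experiment, not the book's code.

```python
import numpy as np

rng = np.random.default_rng(0)
q_true = rng.normal(0.0, 1.0, 10)        # one 10-armed testbed task
Q = np.full(10, 5.0)                     # wildly optimistic initial estimates
alpha = 0.1
picked_optimal = []

for t in range(1000):
    a = rng.choice(np.flatnonzero(Q == Q.max()))   # pure greedy, random tie-break
    r = rng.normal(q_true[a], 1.0)
    Q[a] += alpha * (r - Q[a])           # each selection drags the inflated estimate down
    picked_optimal.append(a == q_true.argmax())

# early on every arm gets tried because whichever estimate is picked collapses,
# which is what produces the early oscillation/spike in the averaged curve
```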
2.7: Math omitted, but it works out to roughly an exponential average. To put it a bit boldly, it is something like a kernel method?
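For reference, the recursion from the exercise, with ō_0 = 0:

$$\bar{o}_n = \bar{o}_{n-1} + \alpha\,(1-\bar{o}_{n-1}), \qquad \beta_n = \frac{\alpha}{\bar{o}_n} \;\Rightarrow\; \beta_1 = 1, \quad \beta_n \to \alpha$$

Since β_1 = 1, the initial estimate Q_1 receives zero weight (no initial bias), and as β_n approaches α the weighting approaches the usual exponential recency-weighted average.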
2.8: More or less the same as 2.6; I actually covered them together above.
2.9: This question is a bit cheeky, since the book never actually covers the sigmoid. Yes, they are obviously equivalent, because the sigmoid can be written as e^{2z}/(e^z + e^{2z}), which is exactly a two-action softmax selection.
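Spelled out for two actions with preferences H_1 and H_2:

$$\pi(1) = \frac{e^{H_1}}{e^{H_1}+e^{H_2}} = \frac{1}{1+e^{-(H_1-H_2)}} = \sigma(H_1 - H_2)$$

i.e. the softmax probability is the logistic sigmoid of the preference difference.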
2.10: Under these conditions, any algorithm claiming an average reward above 0.5 is cheating. A traditional constant-α method might work, but by my estimate the best result comes from pure greedy, even when the case label is given.
2.11: TODO