Some RL Literature (and Notes)

Copied from: https://zhuanlan.zhihu.com/p/25770890

Introductions

Introduction to reinforcement learning
Index of /rowan/files/rl

ICML Tutorials:
http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

NIPS Tutorials:
https://drive.google.com/file/d/0B_wzP_JlVFcKS2dDWUZqTTZGalU/view

CS 294 Deep Reinforcement Learning, Spring 2017

Deep Q-Learning


DQN:
[1312.5602] Playing Atari with Deep Reinforcement Learning (and its nature version)

Double DQN
[1509.06461] Deep Reinforcement Learning with Double Q-learning

Bootstrapped DQN
[1602.04621] Deep Exploration via Bootstrapped DQN

Prioritized Experience Replay
http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Applications_files/prioritized-replay.pdf

Dueling DQN
[1511.06581] Dueling Network Architectures for Deep Reinforcement Learning

Classic Literature

Sutton & Barto's book (Reinforcement Learning: An Introduction)
http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf

David Silver's thesis
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Publications_files/thesis.pdf

Policy Gradient Methods for Reinforcement Learning with Function Approximation
https://webdocs.cs.ualberta.ca/~sutton/papers/SMSM-NIPS99.pdf
(Policy gradient theorem)

1. A policy-based approach can be preferable to a value-based one: the parameterized policy changes smoothly with its parameters, whereas a policy derived greedily from a value function can change discontinuously.

2. Policy gradient method.
The objective function is averaged over the stationary state distribution (starting from s0).
For the average-reward formulation, the distribution needs to be truly stationary.
For the start-state formulation (with discount), if all experience starts from s0, the objective is averaged over a discounted state-visitation distribution (not necessarily fully stationary); if experience may start from arbitrary states, the objective is averaged over the (discounted) stationary distribution.
Policy gradient theorem: the gradient operator can “pass through” the state distribution, even though that distribution depends on the policy parameters (and at first glance would have to be differentiated as well). See the equations after this list.

3. Q^\pi(s, a) can be replaced by a function approximator f_w(s, a); the gradient remains exact only when f satisfies the compatibility condition df/dw = \nabla_\theta \pi / \pi = \nabla_\theta log \pi(a|s).
If \pi(a|s) is log-linear in some features, then f has to be linear in those same features with \sum_a \pi(a|s) f(s, a) = 0 (so f is an advantage function).

4. This is the first proof that an RL algorithm with a relatively free-form function approximator converges to a local optimum.
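
For reference on items 2-3, the theorem and the compatibility condition in display form (standard statements, restated here rather than quoted verbatim from the paper):

\nabla_\theta J(\theta) = \sum_s d^\pi(s) \sum_a \nabla_\theta \pi_\theta(a|s) \, Q^\pi(s, a)

\frac{\partial f_w(s, a)}{\partial w} = \nabla_\theta \log \pi_\theta(a|s) \quad\Longrightarrow\quad \nabla_\theta J(\theta) = \sum_s d^\pi(s) \sum_a \nabla_\theta \pi_\theta(a|s) \, f_w(s, a)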

DAgger
https://www.cs.cmu.edu/~sross1/publications/Ross-AIStats10-paper.pdf

Actor-Critic Models

Asynchronous Advantage Actor-Critic Model
[1602.01783] Asynchronous Methods for Deep Reinforcement Learning

Tensorpack's BatchA3C (ppwwyyxx/tensorpack) and GA3C ([1611.06256] Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU)
Instead of keeping a separate model copy in each actor (in separate CPU threads), they batch all the states generated by the actors through a single shared model, which is updated regularly via optimization. A minimal sketch of this batching idea follows.
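
A minimal sketch of the single-shared-model idea, assuming only the Python standard library and numpy; the "model" is a stand-in that returns a uniform policy, and the queue names (prediction_queue, result_queues) are my own, not taken from either codebase:

```python
import queue
import threading
import numpy as np

STATE_DIM, N_ACTIONS, BATCH, N_ACTORS, STEPS = 4, 4, 8, 4, 50

prediction_queue = queue.Queue()                        # (actor_id, state) requests
result_queues = [queue.Queue() for _ in range(N_ACTORS)]

def predictor():
    """The only thread that owns the model: batch requests, return action probabilities."""
    served = 0
    while served < N_ACTORS * STEPS:
        ids, states = [], []
        i, s = prediction_queue.get()                   # block for at least one request
        ids.append(i); states.append(s)
        while len(ids) < BATCH and not prediction_queue.empty():
            i, s = prediction_queue.get()
            ids.append(i); states.append(s)
        batch = np.stack(states)                        # one forward pass for the whole batch
        probs = np.full((len(batch), N_ACTIONS), 1.0 / N_ACTIONS)   # dummy shared model
        for i, p in zip(ids, probs):
            result_queues[i].put(p)
        served += len(ids)

def actor(actor_id):
    """Actor thread: produce states, ask the shared predictor for an action."""
    rng = np.random.default_rng(actor_id)
    for _ in range(STEPS):
        state = rng.normal(size=STATE_DIM)              # stands in for an environment step
        prediction_queue.put((actor_id, state))
        probs = result_queues[actor_id].get()
        action = rng.choice(N_ACTIONS, p=probs)         # experience would go to a training queue

threads = [threading.Thread(target=predictor)] + [
    threading.Thread(target=actor, args=(i,)) for i in range(N_ACTORS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```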

On actor-critic algorithms.
http://www.mit.edu/~jnt/Papers/J094-03-kon-actors.pdf
Only the first part of the paper is covered here. It proves that actor-critic converges to a local optimum when the feature space used to linearly represent Q(s, a) covers the space spanned by \nabla log \pi(a|s) (the compatibility condition) and the actor learns on a slower timescale than the critic.

https://dev.spline.de/trac/dbsprojekt_51_ss09/export/74/ki_seminar/referenzen/peters-ECML2005.pdf
Natural Actor-Critic
The natural gradient is applied to the actor-critic method. When the compatibility condition from the policy gradient paper is satisfied (i.e., the approximate Q(s, a) is linear in \nabla log \pi(a|s), so that the gradient estimate built from this approximate Q equals the true gradient built from the unknown exact Q of the current policy), the natural gradient of the policy's parameters is simply the linear coefficient w of that Q approximation.
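
In display form (the standard natural actor-critic identity, restated here for reference): with Fisher matrix F(\theta) = E[\nabla_\theta \log \pi_\theta(a|s) \, \nabla_\theta \log \pi_\theta(a|s)^\top] and compatible critic Q_w(s, a) \approx w^\top \nabla_\theta \log \pi_\theta(a|s),

\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1} \nabla_\theta J(\theta) = w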

A Survey of Actor-Critic Reinforcement Learning Standard and Natural Policy Gradients
https://hal.archives-ouvertes.fr/hal-00756747/document
Covers the above two papers.

Continuous State/Action

Reinforcement Learning with Deep Energy-Based Policies 
Uses the soft-Q formulation proposed by https://arxiv.org/pdf/1702.08892.pdf (in its math section) to naturally incorporate the entropy term into the Q-learning paradigm. For continuous action spaces, both training (the soft Bellman update) and sampling from the resulting policy (expressed in terms of Q) are intractable. For the former, they use a surrogate action distribution and compute the gradient with importance sampling; for the latter, they use the Stein variational method to fit a deterministic sampler a = f(e, s) to the learned Q-distribution. Performance is comparable to DDPG, but since the learned policy can remain diverse (multimodal) under the maximum-entropy principle, it can be used as a common initialization for many specific tasks (example: pretrain = learn to run in an arbitrary direction, task = run in a maze).
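
The soft-Q quantities referred to above take the standard maximum-entropy form (with temperature \alpha; restated here from the general soft Q-learning literature rather than quoted from the paper):

\pi(a|s) \propto \exp\!\left(\tfrac{1}{\alpha} Q_{soft}(s, a)\right), \qquad V_{soft}(s) = \alpha \log \int_{\mathcal{A}} \exp\!\left(\tfrac{1}{\alpha} Q_{soft}(s, a)\right) da

Q_{soft}(s, a) = r(s, a) + \gamma \, E_{s'}\!\left[V_{soft}(s')\right]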

Deterministic Policy Gradient Algorithms
http://jmlr.org/proceedings/papers/v32/silver14.pdf
Silver's paper. Learns an actor that predicts a deterministic action a = \mu(s) (rather than a conditional probability distribution \pi(a|s)) within Q-learning; during training, gradients propagate through Q into \mu. Analogous to the policy gradient theorem (the gradient operator can “pass through” the parameter-dependent state distribution), there is a deterministic version of the theorem. There is also an interesting comparison with a stochastic off-policy actor-critic model (stochastic = \pi(a|s)).
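
The deterministic policy gradient theorem in display form (standard statement, restated here for reference):

\nabla_\theta J(\mu_\theta) = E_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q^\mu(s, a) \big|_{a = \mu_\theta(s)} \right]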

Continuous control with deep reinforcement learning (DDPG)
Deep version of DPG (with the DQN tricks). A neural-network critic trained on minibatches alone is not stable, so they also add a target network and a replay buffer; a minimal sketch of these two stabilizers follows.
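
A minimal numpy sketch of those two stabilizers, with plain arrays standing in for network weights (the class and function names below are my own, not the paper's):

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Uniform experience replay: store transitions, sample i.i.d. minibatches."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)
    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))
    def sample(self, batch_size):
        batch = random.sample(self.buf, batch_size)
        return [np.array(x) for x in zip(*batch)]

def soft_update(target, online, tau=0.001):
    """Polyak averaging: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for name in target:
        target[name] = tau * online[name] + (1 - tau) * target[name]

# usage: dicts of arrays stand in for actor/critic parameters
online = {"w": np.random.randn(4, 2)}
target = {"w": online["w"].copy()}
buffer = ReplayBuffer()
for _ in range(64):
    buffer.push(np.zeros(3), np.zeros(1), 0.0, np.zeros(3), False)
s, a, r, s_next, done = buffer.sample(32)   # minibatch for the Bellman update
soft_update(target, online)                 # target network tracks the online network slowly
```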

Reward Shaping

Policy invariance under reward transformations: theory and application to reward shaping.
http://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf
Andrew Ng's reward shaping paper. It proves that the optimal policy is invariant under reward shaping if and only if the shaping term added to the reward is a difference of a potential function.
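
The potential-based shaping term from the paper, in display form:

F(s, a, s') = \gamma \, \Phi(s') - \Phi(s), \qquad r'(s, a, s') = r(s, a, s') + F(s, a, s')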

Theoretical considerations of potential-based reward shaping for multi-agent systems
Potential-based reward shaping helps a single agent reach an optimal solution without changing the optimal policy; this paper extends the result to the multi-agent case, showing that the Nash equilibria are preserved.

Reinforcement Learning with Unsupervised Auxiliary Tasks
[1611.05397] Reinforcement Learning with Unsupervised Auxiliary Tasks
ICLR17 oral. Adds auxiliary tasks to improve performance on Atari games and navigation; the auxiliary tasks include maximizing pixel changes and maximizing the activation of individual neurons.

Navigation

Learning to Navigate in Complex Environments
https://openreview.net/forum?id=SJMGPrcle&noteId=SJMGPrcle
Raia Hadsell's group at DeepMind. ICLR17 poster: adds depth prediction as an auxiliary task and improves navigation performance (also uses SLAM results as network input).

[1611.05397] Reinforcement Learning with Unsupervised Auxiliary Tasks (in reward shaping)

Deep Reinforcement Learning with Successor Features for Navigation across Similar Environments
Goal: navigation without SLAM.
Learn successor features (the representation just before the last layer of Q or V; these features satisfy a similar Bellman equation) for transfer learning: learn k task-specific top-layer weights simultaneously while sharing the successor features, with DQN acting on those features. In addition to the successor features, the network also tries to reconstruct the input frame. The defining identities are sketched below.
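
The generic successor-feature identities (standard formulation, not copied from this particular paper): with reward r(s, a) = \phi(s, a)^\top w,

\psi^\pi(s, a) = E\!\left[\sum_{t \ge 0} \gamma^t \phi(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a, \pi\right], \qquad Q^\pi(s, a) = \psi^\pi(s, a)^\top w

\psi^\pi(s, a) = \phi(s, a) + \gamma \, E_{s'}\!\left[\psi^\pi(s', \pi(s'))\right]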

Experiments are in simulation.
State: 96x96 images, four most recent frames.
Action: four discrete actions (stand still, turn left, turn right, go straight 1 m).
Baseline: a CNN trained to directly predict the action chosen by A*.

Deep Recurrent Q-Learning for Partially Observable MDPs
There is not much performance difference between a stacked-frame DQN and DRQN. DRQN may be more robust when the game screen flickers (some frames are zeroed out). A minimal recurrent Q-network sketch follows.
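
A minimal PyTorch sketch of the recurrent idea, with a flat linear encoder standing in for the paper's convolutional stack (class name and dimension choices here are my own):

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Q-network that consumes one frame per step and carries an LSTM state,
    instead of stacking the last four frames as in the standard DQN."""
    def __init__(self, frame_dim=84 * 84, hidden=128, n_actions=4):
        super().__init__()
        self.encoder = nn.Linear(frame_dim, hidden)      # stands in for the conv layers
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, frames, hidden_state=None):
        # frames: (batch, time, frame_dim) -> Q-values of shape (batch, time, n_actions)
        z = torch.relu(self.encoder(frames))
        z, hidden_state = self.lstm(z, hidden_state)
        return self.q_head(z), hidden_state

# usage: a batch of 2 episodes, 10 single (flattened) frames each, no frame stacking
q_values, h = DRQN()(torch.randn(2, 10, 84 * 84))
print(q_values.shape)   # torch.Size([2, 10, 4])
```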

Counterfactual Regret Minimization

Dynamic Thresholding
http://www.cs.cmu.edu/~sandholm/dynamicThresholding.aaai17.pdf
With proofs:
http://www.cs.cmu.edu/~ckroer/papers/pruning_agt_at_ijcai16.pdf

Studies game-state abstraction and its effect on Leduc poker.
https://webdocs.cs.ualberta.ca/~bowling/papers/09aamas-abstraction.pdf

https://arxiv.org/pdf/1603.01121v2.pdf
http://anytime.cs.umass.edu/aimath06/proceedings/P47.pdf

Decomposition:
Solving Imperfect Information Games Using Decomposition
http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/viewFile/8407/8476

Safe and Nested Endgame Solving for Imperfect-Information Games
https://www.cs.cmu.edu/~noamb/papers/17-AAAI-Refinement.pdf

Game-specific RL

Atari Game
http://www.readcube.com/articles/10.1038/nature14236

Go
AlphaGo https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf

DarkForest [1511.06410] Better Computer Go Player with Neural Network and Long-term Prediction

Super Smash Bros
https://arxiv.org/pdf/1702.06230.pdf

Doom
Arnold: [1609.05521] Playing FPS Games with Deep Reinforcement Learning
Intel: [1611.01779] Learning to Act by Predicting the Future
F1: https://openreview.net/forum?id=Hk3mPK5gg&noteId=Hk3mPK5gg

Poker
Limit Texas hold 'em
http://ai.cs.unibas.ch/_files/teaching/fs15/ki/material/ki02-poker.pdf

No-limit Texas hold 'em
DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker
