offline RL | BCQ：学习 offline dataset 的 π(a|s)，直接使用 (s, π(s)) 作为 Q learning 训练数据

题目： Off-Policy Deep Reinforcement Learning without Exploration，ICLR 2019
pdf 版本：https://arxiv.org/pdf/1812.02900.pdf
html 版本：https://ar5iv.labs.arxiv.org/html/1812.02900
GitHub：https://github.com/sfujim/BCQ
参考博客：
- https://zhuanlan.zhihu.com/p/493039753，感觉讲得更好一些
- https://zhuanlan.zhihu.com/p/136844574

0 abstract

Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning without data correlated to the distribution under the current policy, making them ineffective for this fixed batch setting. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement learning algorithm which can learn effectively from arbitrary, fixed batch data, and empirically demonstrate the quality of its behavior in several tasks.

背景：RL 的许多实际应用，限制 agent 只能从已收集的固定数据中学习。
讨论：在本文中，我们证明了 extrapolation 引入的误差，会导致标准的 off-policy RL（DQN、DDPG）对固定 dataset 不 work，在没有与 distribution under the current policy 相关的数据的情况下。
method：我们引入了一类新颖的 off-policy 算法（其实就是 offline RL）batch-constrained reinforcement learning，去限制 action space，迫使 agent 的 policy 的 distribution 接近给定数据的子集（？）。
contribution：我们提出了第一个连续控制 DRL 算法，可以从任意 fixed batch data（其实就是 offline dataset）中有效学习，并且做了很多实验。

1 intro

好像 offline RL 之前叫做 batch reinforcement learning（？）
imitation learning 不适用于数据质量不佳的情况。
直接在 offline RL setting 里面用 off-policy RL 不太 work，被称为 extrapolation error。
BCQ：① 最大化 discounted return，② 尽量让 policy 的 s-a distribution 与 dataset 的 s-a distribution 相匹配。
实验： MuJoCo 的 offline setting（？）

2 background

RL 基础，没有新鲜的东西。

3 extrapolation error - 外推误差

extrapolation error 的定义： offline RL 学到的 value function 的误差（？）
extrapolation error 的归因：
- Absent Data：有一些 state-action pair 在 dataset 里是没有的。
- Model Bias：基于 offline dataset 构建的 MDP 跟真实 MDP 不一样，因此，利用 offline dataset 的 MDP 来计算 value function，也会有所偏差。
- Training Mismatch：听起来像是，如果强行拿 offline dataset 的 off-policy 数据跑 on-policy RL，效果会不好。（根据 3.1 节，即使跑 off-policy RL 算法，效果也会不好）
可以参考（https://zhuanlan.zhihu.com/p/493039753）博客的第二节。

4 method: batch-constrained RL

简单的 idea：为避免 extrapolation error，应该限制 policy 的 state-action distribution 与 offline data 相似。
描述了 3 个 objectives：
- 最小化当前 state 下所选 actions 与 offline data 的距离。
- 选择 action，使得下一 state 尽量是熟悉的 state（或 (s,a,s') ？）
- 最大化 value function。
section 4.1 貌似是在 finite MDP 下的理论分析，要精确量化 extrapolation error。
- 使用 offline data（batch）B 定义了 MDP M_B ，在 offline data 上做 Q-learning，会学到 M_B 上的最优 policy。
- 建议去看（https://zhuanlan.zhihu.com/p/136844574）博客的第二节。
- 今日困倦，改日再看……
section 4.2 貌似是 BCQ 的算法。
- BCQ 有 4 个 networks：一个 generative model \(G_ω(s)\)、一个扰动模型 \(ξ_\phi(s,a)\)、两个 Q networks \(Q_{\{θ_1,θ_2\}}(s,a)\) 。
- 生成模型 G，拟合了 offline dataset 的 s-a 分布，使用变分自编码器 VAE，用来提高生成动作的多样性（？）
- 扰动模型 ξ，对 action a 的扰动范围是 [-Φ, Φ]。
- 调节参数 n 与 Φ：当 Φ=0 n=1 时，BCQ 类似于 imitation learning；当 Φ = a max - a min 且 n→∞ 时，扰动模型 ξ 的训练，跟 DDPG 算法的训练目标类似。
算法流程：
- 从 generative model 里生成 n 个 actions，然后对这些 actions 使用 ξ 进行扰动。
- 训练 Q function 的参数 θ，最小化 Q 的 TD error。
- 训练扰动模型 ξ 的参数 Φ，最大化 Q(s, a+ξ(s,a)) ，这里的 a 由 generative model 生成。
- 更新 target networks 的参数（trick）
- 感觉有点抽象……

5 experiment - 实验

baselines： off-policy RL（DQN DDPG）、behavior cloning（和 VAE-BC）。
实验 setting 仍旧是 1. 使用 DQN 的训练数据 final buffer，2. 与 DQN 同期（？）训练的 concurrent，3. 直接使用 expert dataset 的 imitation。增加 4. 添加噪音与 behavior policy 随机探索 action 的 imperfect。

6 related work

batch RL，imitation learning，RL 中的 uncertainty。
感觉 batch RL 主要在说 distribution 要 match，imitation learning 主要在说大家都依赖于 expert data。