RLHF · PBRL | PEBBLE: learning a reward model from human preferences

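Paper: PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training, published at ICML 2021.
This post is a set of reading notes and is NOT a substitute for reading the paper itself. The paper is well written, in the usual style of a top AI conference, and relatively easy to follow.
Reading materials:
- PDF: https://arxiv.org/pdf/2106.05091.pdf (includes the supplementary materials)
- HTML: https://ar5iv.labs.arxiv.org/html/2106.05091
- PEBBLE project site: https://sites.google.com/view/icml21pebble
- Code: https://github.com/rll-research/BPref
(The algorithm name is properly written in capitals, PEBBLE; any lowercase "pebble" below refers to the same algorithm.)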

0 abstract
1 intro
2 Related Work
3 Preliminaries
4 PEBBLE
5 Experiments
6 Discussion
Appendix

0 abstract

Conveying complex objectives to reinforcement learning (RL) agents can often be difficult, involving meticulous design of reward functions that are sufficiently informative yet easy enough to provide. Human-in-the-loop RL methods allow practitioners to instead interactively teach agents through tailored feedback; however, such approaches have been challenging to scale since human feedback is very expensive. In this work, we aim to make this process more sample- and feedback-efficient. We present an off-policy, interactive RL algorithm that capitalizes on the strengths of both feedback and off-policy learning. Specifically, we learn a reward model by actively querying a teacher's preferences between two clips of behavior and use it to train an agent. To enable off-policy learning, we relabel all the agent's past experience when its reward model changes. We additionally show that pre-training our agents with unsupervised exploration substantially increases the mileage of its queries. We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods, including a variety of locomotion and robotic manipulation skills. We also show that our method is able to utilize real-time human feedback to effectively prevent reward exploitation and learn new behaviors that are difficult to specify with standard reward functions.

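Summary:

Reward functions in RL: for complex objectives, it is hard to design a reward function that is both informative enough and easy to provide. Human-in-the-loop RL lets a human teach the agent interactively, but human feedback is too expensive to scale.
Method:
- Proposes an off-policy, interactive RL algorithm that actively queries a teacher's preference between two clips of behavior, learns a reward model from those preferences, and uses it to train the RL agent.
- To enable off-policy learning, all of the agent's past experience is relabeled whenever the reward model changes.
- Pre-training the agent with unsupervised exploration substantially increases "the mileage of its queries", i.e. how much is learned from each query to the teacher.
Experimental results:
- Handles locomotion and robotic manipulation tasks that are more complex than what earlier human-in-the-loop methods could deal with.
- Can use real-time human feedback to prevent reward exploitation.
- Can learn new behaviors that are difficult to specify with standard reward functions.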

1 intro

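Reward shaping is a major obstacle when applying RL. Sparse rewards are hard to train with, while dense rewards require dense observation of the agent's state.
Even when the agent's state can be observed densely, reward exploitation can still make it hard to construct a suitable reward function.
- Reward exploitation means the agent reaches high reward in unintended ways (it hacks the reward function). One remedy is imitation learning, but expert trajectories suitable for imitation are also expensive to obtain.
Human-in-the-loop (HiL) RL may be a good way to avoid reward exploitation.
PEBBLE: unsupervised PrE-training and preference-Based learning via relaBeLing Experience.
- It consists of two modules (see Figure 1): unsupervised pre-training and off-policy learning.
- First, the agent explores using only intrinsic motivation, collecting experience and producing coherent behaviors. This unsupervised pre-training makes the teacher's initial feedback more efficient.
- Then a supervisor gives a preference over a pair of clips of agent behavior, which serves as training data for the reward model; finally, RL is used to maximize the return under the learned reward function.
RL needs a large amount of supervised data, so reusing data improves its sample efficiency.
- Earlier HiL RL methods usually rely on on-policy algorithms in order to mitigate the reward non-stationarity caused by online learning.
- PEBBLE is off-policy: every time the reward model is updated, it relabels all of the agent's past experience.
Main contributions:
- Shows for the first time that unsupervised pre-training and off-policy learning can significantly improve the sample-efficiency and feedback-efficiency of HiL RL.
- Outperforms preference-based RL (PBRL) baselines on locomotion and robotic manipulation tasks from DeepMind Control Suite and Meta-world.
- Shows that PEBBLE can learn behaviors for which a typical reward function is hard to design.
- PEBBLE can avoid reward exploitation, producing more desirable behavior than an agent trained with an engineered reward function.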

2 Related Work

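Learning from human feedback:
- The first paragraph covers some early work.
- Learning a reward model from human feedback: ① learn a classifier of task success and build a reward function from it; ② regress directly on real-valued human feedback, which is not very practical because such feedback is unreliable.
Human feedback in the form of binary (0/1) judgments:
- Comparing which behavior is better or worse is called preference-based learning.
- The third paragraph lists a lot of related work. Christiano et al. (2017) proposed online PBRL, but its sample efficiency is low; later work adding demonstrations (Ibarz et al., 2018) and non-binary rankings (Cao et al., 2020) improved on it somewhat.
- PEBBLE improves sample efficiency with off-policy experience replay plus unsupervised pre-training.
Unsupervised pre-training for RL:
- Unsupervised pre-training is used to extract strong behavioral priors.
- Concretely, the agent is encouraged to expand the boundary of seen states by maximizing some intrinsic reward, such as the prediction error of the environment dynamics, count-based state novelty, mutual information, or state entropy.
- This sounds a lot like RL exploration (see this site's survey post on RL exploration).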

3 Preliminaries

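A brief introduction to RL.
Soft Actor-Critic (SAC): an off-policy algorithm that maximizes a weighted sum of reward and policy entropy, alternating between ① soft policy evaluation (Equation 1) and ② soft policy improvement (Equation 2).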
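For reference, the two alternating SAC steps are usually written as the losses below. This is the standard SAC form as I remember it, so the notation may differ slightly from the paper's Equations 1 and 2. The symbols $B$ (replay buffer), $\gamma$ (discount), $\alpha$ (entropy temperature) and $\bar\theta$ (target-network parameters) are the usual SAC conventions and are not defined elsewhere in these notes.

\[L_{critic} = E_{(s_t,a_t,r_t,s_{t+1})\sim B}\Big[\big(Q_\theta(s_t,a_t) - r_t - \gamma \bar V(s_{t+1})\big)^2\Big],\qquad \bar V(s_t)=E_{a_t\sim\pi_\phi}\big[Q_{\bar\theta}(s_t,a_t)-\alpha\log\pi_\phi(a_t|s_t)\big]
\]
\[L_{actor} = E_{s_t\sim B,\,a_t\sim\pi_\phi}\big[\alpha\log\pi_\phi(a_t|s_t) - Q_\theta(s_t,a_t)\big]
\]

The first loss is the soft policy evaluation step (fit $Q_\theta$ to the entropy-regularized target), the second is the soft policy improvement step.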
Reward learning from preferences:
- Segment σ: a piece of trajectory $\{s_k,a_k,\cdots,s_{k+H},a_{k+H}\}$.
- For segments σ0 and σ1, there is a preference y ∈ { (0,1), (1,0), (0.5,0.5) }.
- A judgment is a triple (σ0, σ1, y).
- The reward function $\hat r_{\psi}$ defines the preference predictor below, where $σ^1\succ σ^0$ means that σ1 is preferred over σ0.
- \[P_\psi[σ^1\succ σ^0] = \frac{\exp\sum_t \hat r_{\psi}(s_t^1,a_t^1)}{\sum_{i\in\{0,1\}}\exp\sum_t \hat r_{\psi}(s_t^i,a_t^i)}
  \]
- This can be read as an assumption: the probability that a human prefers a segment depends exponentially on the sum of an underlying reward function over the $(s_t,a_t)$ pairs of that segment.
- The loss function used to update $\hat r_{\psi}$:
- \[L^{reward}=-E_{(σ^0,σ^1,y)\sim D}\bigg[ y(0)\log P_\psi[σ^0\succ σ^1] + y(1)\log P_\psi[σ^1\succ σ^0] \bigg]
  \]
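A minimal NumPy sketch of the preference predictor and the cross-entropy loss above may help make the two formulas concrete. It assumes the per-step predicted rewards of each segment are already available as arrays; the function names `preference_prob` and `reward_loss` are made up for this note and are not taken from the PEBBLE code.

```python
import numpy as np

def preference_prob(r0, r1):
    """P[sigma1 > sigma0], given per-step predicted rewards of both segments."""
    s0, s1 = np.sum(r0), np.sum(r1)
    m = max(s0, s1)                      # subtract the max for numerical stability
    e0, e1 = np.exp(s0 - m), np.exp(s1 - m)
    return e1 / (e0 + e1)

def reward_loss(batch):
    """Mean cross-entropy over (r0, r1, y) triples, with y = (y(0), y(1))."""
    eps = 1e-12                          # avoids log(0)
    total = 0.0
    for r0, r1, y in batch:
        p1 = preference_prob(r0, r1)     # P[sigma1 > sigma0]
        p0 = 1.0 - p1                    # P[sigma0 > sigma1]
        total -= y[0] * np.log(p0 + eps) + y[1] * np.log(p1 + eps)
    return total / len(batch)
```

With y = (1, 0) the loss pushes the summed reward of σ0 above that of σ1, and with y = (0.5, 0.5) it pulls the two sums together.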

4 PEBBLE

The method maintains a policy $\pi_\phi$, a Q-function $Q_\theta$, and a reward function $\hat r_\psi$, which are updated as follows:

Step 0 (unsupervised pre-training): pre-train the policy $\pi_\phi$ using only intrinsic motivation, so that it explores and collects diverse experiences (Section 4.1).
Step 1 (reward learning): learn the reward function $\hat r_\psi$ from the teacher's feedback.
Step 2 (agent learning): update the policy $\pi_\phi$ and the Q-function $Q_\theta$ with an off-policy RL algorithm, relabeling experience to mitigate the effect of the non-stationary reward function $\hat r_\psi$ (see Section 4.3).
Repeat Step 1 and Step 2.
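The order of the steps can be sketched as a small driver function. This only shows the call order; every argument is a user-supplied callable with a hypothetical name, and nothing here is taken from the actual PEBBLE implementation.

```python
def pebble_outline(pretrain, query_teacher, fit_reward, relabel, rl_update,
                   n_rounds, rl_steps_per_round):
    """Call order of the three steps; all callables are hypothetical placeholders."""
    policy, buffer = pretrain()                       # Step 0: intrinsic-reward exploration
    reward_model, prefs = None, []
    for _ in range(n_rounds):
        prefs += query_teacher(buffer)                # Step 1: collect (sigma0, sigma1, y)
        reward_model = fit_reward(reward_model, prefs)
        buffer = relabel(buffer, reward_model)        # refresh the stored rewards
        for _ in range(rl_steps_per_round):           # Step 2: off-policy RL updates
            policy, buffer = rl_update(policy, buffer)
    return policy, reward_model
```

Keeping the relabel call between reward fitting and the RL updates is the point of the design: the RL updates never see rewards produced by an outdated reward model.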

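4.1 Accelerating Learning via Unsupervised Pre-training

About the intrinsic reward:

Original form: the state entropy $H(s)=-E_{s\sim p(s)}[\log p(s)]$, which encourages visiting a wider range of states (the larger the state entropy, the better).
Simplified version, a particle-based entropy estimate: $\hat H(s)\propto\sum_i\log(\|s_i-s_i^k\|)$, where $s_i^k$ is the k-th nearest neighbor of $s_i$. This means that maximizing the distance between a state and its nearest neighbor increases the overall state entropy.
Reward design: $r^{int}(s_t)=\log(\|s_t-s_t^k\|)$, i.e. the intrinsic reward of the current state is the log of the distance to its k-th nearest neighbor. (The idea follows a 2021 paper and treats each transition as a particle.)
For each sample $(s_t,a_t)$, compute its k-NN distance with respect to all samples in the replay buffer, and normalize the intrinsic reward by dividing it by a running estimate of the standard deviation; the resulting $r^{int}$ is used as the RL reward during pre-training.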
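A rough NumPy sketch of the k-NN intrinsic reward and its running normalization follows. It is written under two assumptions of mine: the query states are not themselves stored in the buffer (otherwise the nearest neighbor is the state itself at distance 0, and k + 1 should be used), and the running standard deviation is tracked with Welford's algorithm. The names `knn_intrinsic_reward` and `RunningStd` are hypothetical.

```python
import numpy as np

def knn_intrinsic_reward(states, buffer_states, k=5):
    """r_int(s) = log of the distance from s to its k-th nearest buffer state."""
    # Pairwise Euclidean distances, shape (len(states), len(buffer_states)).
    diff = states[:, None, :] - buffer_states[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # After a partial sort, column k-1 holds the k-th smallest distance per row.
    kth = np.partition(dist, k - 1, axis=1)[:, k - 1]
    return np.log(kth + 1e-8)            # small constant avoids log(0)

class RunningStd:
    """Running (population) standard deviation via Welford's algorithm."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, xs):
        for x in np.asarray(xs, dtype=float).ravel():
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return np.sqrt(self.m2 / self.n) if self.n > 1 else 1.0
```

In use, one would call `update` on each new batch of intrinsic rewards and then divide the batch by `std` (plus a small constant) before handing it to SAC as the reward.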

[Algorithm 1 (EXPLORE: Unsupervised exploration): built on the SAC framework, with the reward replaced by $r^{int}$.]

At the end, this yields an initialized replay buffer B and an initial policy $\pi_\phi$.

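4.2 Selecting Informative Queries

How should two segments (a segment pair) be chosen, so that asking the human for a preference over them yields the most useful information?
In other words, how do we select informative queries? (A query is the act of asking the human for a preference over a segment pair.)
Ideally one would maximize the EVOI (expected value of information), i.e. the improvement of the agent obtained from resolving the uncertainty. This is hard to compute, because it requires an expectation over all possible trajectories of the updated policy, so approximations are used instead.
PEBBLE considers the sampling schemes from the 2017 work (Christiano et al., 2017):
- ① uniform sampling;
- ② ensemble-based sampling, which selects the segment pairs with high variance across an ensemble of reward models.
- The paper also explores a third scheme, ③ entropy-based sampling, which tries to disambiguate the segment pairs closest to the decision boundary. That is, a large batch of segment pairs is sampled and the pairs that maximize $H(P_\psi)$ are selected, where $H(P_\psi)$ is the entropy of the reward-model-based preference probability $P_\psi$ from Section 3.
The effectiveness of these sampling methods is evaluated in Section 5.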
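The three schemes can be compared side by side in a short NumPy sketch. It assumes the ensemble's per-step reward predictions for a batch of candidate pairs are given as arrays of shape (n_models, n_pairs, H). Averaging the preference probability over the ensemble before taking the entropy is my own simplification, and the names `pair_probs` and `select_queries` are hypothetical.

```python
import numpy as np

def pair_probs(ens_r0, ens_r1):
    """P[sigma1 > sigma0] for every ensemble member and candidate pair."""
    s0, s1 = ens_r0.sum(-1), ens_r1.sum(-1)        # summed reward per segment
    return 1.0 / (1.0 + np.exp(s0 - s1))           # two-way softmax = sigmoid(s1 - s0)

def select_queries(ens_r0, ens_r1, n_queries, scheme="entropy", rng=None):
    """Return indices of the candidate pairs to show to the teacher."""
    p = pair_probs(ens_r0, ens_r1)                 # shape (n_models, n_pairs)
    if scheme == "uniform":
        if rng is None:
            rng = np.random.default_rng()
        return rng.choice(p.shape[1], size=n_queries, replace=False)
    if scheme == "disagreement":
        score = p.std(axis=0)                      # spread across ensemble members
    elif scheme == "entropy":
        pm = np.clip(p.mean(axis=0), 1e-8, 1 - 1e-8)
        score = -(pm * np.log(pm) + (1 - pm) * np.log(1 - pm))
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return np.argsort(-score)[:n_queries]          # highest score first
```

Disagreement favors pairs on which the ensemble members contradict each other, while entropy favors pairs whose predicted preference is close to 0.5.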

4.3 Using Off-policy RL with Non-Stationary Reward

Note that the reward function $\hat r_\psi$ may be non-stationary, because it is updated during training.
Prior work handles this with on-policy RL, but the sample efficiency is too low.
PEBBLE uses an off-policy RL framework and, every time the reward model is updated, relabels the reward of every stored transition with the new reward model.
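Relabeling itself is a one-line operation. The sketch below assumes a replay buffer stored as a dict of NumPy arrays and a reward model exposed as a plain callable on (obs, action) batches; both the layout and the name `relabel` are assumptions for illustration.

```python
import numpy as np

def relabel(buffer, reward_model):
    """Overwrite every stored reward with the current reward model's prediction."""
    buffer["reward"] = reward_model(buffer["obs"], buffer["action"])
    return buffer
```

Because the stored rewards are overwritten in place, any off-policy learner that samples from the buffer afterwards trains only on rewards from the latest reward model.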

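[Algorithm 2 (PEBBLE)]

Lines 7-19 learn the reward model.
Lines 20-23 collect new data and relabel all transitions in the replay buffer with the updated reward model.
Lines 24-27 optimize the SAC losses, i.e. the RL part.
The three phases (lines 7-19, 20-23, 24-27) are repeated.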

5 Experiments

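Four questions:

How does PEBBLE compare with existing methods in terms of sample efficiency and feedback efficiency?
What is the contribution of each technique proposed in PEBBLE?
Can PEBBLE learn novel behaviors for which a typical reward function is hard to design?
Can PEBBLE mitigate the effects of reward exploitation?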

5.1 Setups

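Benchmarks: DeepMind Control Suite and Meta-world.
A scripted teacher based on the true reward function provides the preferences over segment pairs.
- Because the scripted teacher is fast, experiments can be repeated; the results report the mean and standard deviation over ten runs.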
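A scripted teacher of this kind can be sketched in a few lines: it compares the true returns of the two segments and answers in the y format from Section 3. Returning (0.5, 0.5) on exact ties is my assumption, and the name `scripted_teacher` is hypothetical.

```python
import numpy as np

def scripted_teacher(true_r0, true_r1):
    """Return y = (y(0), y(1)), preferring the segment with the larger true return."""
    g0, g1 = np.sum(true_r0), np.sum(true_r1)
    if g0 > g1:
        return (1.0, 0.0)    # sigma0 preferred
    if g1 > g0:
        return (0.0, 1.0)    # sigma1 preferred
    return (0.5, 0.5)        # tie: both equally preferable (assumption)
```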
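Human teacher:
- Can teach novel behaviors (e.g. waving a leg) that are not defined in the original benchmarks.
- Agents trained with an engineered reward function may exploit the reward, whereas agents trained with human feedback do not.
- In all experiments each segment is shown to the human as a 1-second video clip, and at most 1 hour of human time is needed.
Baselines:
- Christiano et al. (2017), the most recent work using the same kind of feedback (preferences over segment pairs); their method is referred to as Preference PPO.
- SAC and PPO trained with the ground-truth reward, which serve as upper bounds.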
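- The reward model is an ensemble of three reward models. My reading is that this is the ensemble whose disagreement the ensemble-based sampling in Section 4.2 measures, rather than one model per sampling method, but I am not fully sure.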

5.2 Benchmark Tasks with Unobserved Rewards

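Figure 3 - Locomotion tasks from DMControl (DeepMind Control):
- (As marked in the legend) Preference PPO is given more feedback. PEBBLE needs far less feedback to match the performance of the other baselines.
- The green, yellow and brown curves are PEBBLE, and they look very strong.
Figure 4 - Robotic manipulation tasks from Meta-world:
- (Those nice curves with shaded bands turn out to be averages over multiple runs.)
- Again the performance looks very good: PEBBLE with 5000 / 10000 feedback queries gets close to the upper-bound oracles.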

5.3 Ablation Study

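Figure 5-a studies the effect of relabeling and pre-training.
Figure 5-b compares the sampling schemes from Section 4.2.
Figure 5-c studies the effect of segment length on performance; segment length = 1 corresponds to step-wise feedback. Length 50 performs better than length 1, presumably because longer segments provide more context.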

5.4 Human Experiments

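Figure 6 - novel behaviors: demonstrates ① the Cart agent swinging its pole (using 50 queries), ② the Quadruped agent waving a front leg (using 200 queries), and ③ the Hopper performing a backflip (using 50 queries). The supplementary material provides videos and examples of queries.
Figure 7 - reward exploitation: the Walker agent learns to walk using only one leg, which hacks the reward function. With 200 human queries, however, the Walker can be trained to walk in a more natural, human-like way (using both legs).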

6 Discussion

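The paper proposes PEBBLE, a feedback-efficient algorithm for HiL RL (Human-in-the-Loop RL).
Performance:
- By combining ① unsupervised pre-training and ② off-policy RL, the sample-efficiency and feedback-efficiency of HiL RL can be improved significantly.
- PEBBLE can be applied to more complex tasks than prior work, including a variety of locomotion and robotic manipulation tasks.
- The experiments also show that PEBBLE can learn novel behaviors and avoid reward exploitation; compared with agents trained on engineered reward functions, PEBBLE produces more desirable behavior.
The authors believe that, by making PBRL more practical, PEBBLE helps broaden the impact of RL: beyond settings where experts carefully design reward functions, to settings where laypeople can also benefit from RL's advances through simple preference comparisons.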

Appendix

A. State Entropy Estimator
B. Experimental Details
C. Effects of Sampling Schemes
D. Examples of Selected Queries