off-policy RL | Advantage-Weighted Regression (AWR): combining previous policies into a new base policy

Paper: Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning. Rejected from ICLR 2020 and ICLR 2021 in a row…
PDF: https://arxiv.org/pdf/1910.00177.pdf
HTML: https://ar5iv.labs.arxiv.org/html/1910.00177
OpenReview: https://openreview.net/forum?id=ToWi1RjuEr8

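TL;DR

A method for training RL agents that can make use of off-policy data.
For details, see this blog post, which is better written than mine (embarrassing).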

0 abstract

In this work, we aim to develop a simple and scalable reinforcement learning algorithm that uses standard supervised learning methods as subroutines, while also being able to leverage off-policy data. Our proposed approach, which we refer to as advantage-weighted regression (AWR), consists of two standard supervised learning steps: one to regress onto target values for a value function, and another to regress onto weighted target actions for the policy. The method is simple and general, can accommodate continuous and discrete actions, and can be implemented in just a few lines of code on top of standard supervised learning methods. We provide a theoretical motivation for AWR and analyze its properties when incorporating off-policy data from experience replay. We evaluate AWR on a suite of standard OpenAI Gym benchmark tasks, and show that it achieves competitive performance compared to a number of well-established state-of-the-art RL algorithms. AWR is also able to acquire more effective policies than most off-policy algorithms when learning from purely static datasets with no additional environmental interactions. Furthermore, we demonstrate our algorithm on challenging continuous control tasks with highly complex simulated characters.

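method:
- Develop a simple, scalable RL algorithm that uses standard supervised learning methods as subroutines while still being able to leverage off-policy data.
- The proposed algorithm, advantage-weighted regression (AWR), consists of two standard supervised learning steps: ① for the value function, regress onto target values; ② for the policy, regress onto weighted target actions.
- The method is simple and general, handles both continuous and discrete actions, and can be implemented in a few lines of code on top of standard supervised learning methods.
- The authors give a theoretical motivation for AWR and analyze its properties when off-policy data from experience replay is incorporated.
experiments:
- AWR is evaluated on a suite of standard OpenAI Gym benchmark tasks, where it is competitive with a number of well-established state-of-the-art RL algorithms.
- When learning from purely static datasets, with no additional environment interaction, AWR obtains more effective policies than most off-policy algorithms.
- The algorithm is also demonstrated on challenging continuous control tasks with highly complex simulated characters. (My blind guess: something like kitchen (?))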
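To make the "two supervised learning steps" concrete, here is a minimal tabular sketch. It is not the authors' implementation: the function names, the Monte Carlo return targets, the temperature `beta`, and the weight cap `max_weight` are assumptions chosen for illustration. For a tabular V, least-squares regression onto return targets reduces to a per-state mean; for a tabular categorical policy, weighted maximum likelihood reduces to normalizing the summed weights.

```python
import math
from collections import defaultdict


def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo return R_t = r_t + gamma * R_{t+1} for one trajectory."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def awr_tabular_step(samples, beta=1.0, max_weight=20.0):
    """One AWR-style update on (state, action, return) tuples.

    Hypothetical helper: states and actions must be hashable.
    """
    # Step 1: value regression. The least-squares fit of a tabular V
    # to the return targets is the mean return observed in each state.
    by_state = defaultdict(list)
    for s, _, ret in samples:
        by_state[s].append(ret)
    value = {s: sum(rets) / len(rets) for s, rets in by_state.items()}

    # Step 2: policy regression. Maximizing sum_i w_i * log pi(a_i | s_i)
    # with w_i = exp(A_i / beta) gives pi(a|s) proportional to summed weights.
    log_cap = math.log(max_weight)  # cap the exponent to avoid overflow
    mass = defaultdict(lambda: defaultdict(float))
    for s, a, ret in samples:
        advantage = ret - value[s]
        mass[s][a] += math.exp(min(advantage / beta, log_cap))

    policy = {}
    for s, acts in mass.items():
        total = sum(acts.values())
        policy[s] = {a: m / total for a, m in acts.items()}
    return value, policy
```

With two samples in the same state, `awr_tabular_step([("s", "left", 0.0), ("s", "right", 1.0)], beta=0.5)` shifts probability toward `"right"`, the action with the higher return; a smaller `beta` makes that shift sharper.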

open review

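(The authors seem to have given up on the rebuttal.)
Weaknesses:
- novelty: the proposed method seems to be a minor modification of existing off-policy solvers.
- experiments: the results do not sufficiently support the paper's claims; AWR looks clearly worse than SAC on several tasks.
- theory: the theoretical analysis does not seem to match the algorithm.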

3 method: AWR

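I couldn't follow the algorithm, so here is a Zhihu blog post instead:
- https://zhuanlan.zhihu.com/p/500001362
- (I think that post explains it very well; stop reading mine and go read that one (bless))
It looks like Section 3.1 is the basic algorithm and Section 3.2 is the off-policy extension.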
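off-policy:
- Data collected by the latest policy \(\pi_k\) is stored in a buffer D. When fitting the V function and improving the policy, the sampling policy is therefore a composite of the previous policies (or of other, different policies).
- Each policy \(\pi_i\) gets a weight \(w_i\) (with \(\sum_{i=1}^k w_i=1\)), and together these define a composite policy \(\mu\) with \(J(\mu)=\sum_{i=1}^k w_iJ(\pi_i)\).
- This directly gives how much better \(\pi\) is than \(\mu\): \(\eta(\pi)=J(\pi)-J(\mu)=J(\pi)-\sum_{i=1}^k w_iJ(\pi_i)\) \(=\sum_{i=1}^kw_i\big(J(\pi)-J(\pi_i)\big)=\sum_{i=1}^kw_i\big(E_{s\sim d_\pi(s),a\sim\pi(a|s)}[A^{\pi_i}(s,a)]\big)\). The second-to-last step uses \(\sum_i w_i=1\).
- Then take the arg max over \(\pi\), with β multiplying a KL divergence term, which seems to be there to keep \(\pi\) and \(\mu\) as close as possible. There is also a Lagrange multiplier α whose role I didn't work out; in the paper's derivation it appears to be the multiplier enforcing that \(\pi(\cdot|s)\) integrates to 1.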
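The composite policy \(\mu\) can be pictured with a small replay buffer. The sketch below is only an illustration: the `ReplayBuffer` class and its methods are hypothetical names, and taking \(w_i\) to be the fraction of buffer entries collected by \(\pi_i\) is an assumption about how the weights arise, not something these notes confirm. Sampling uniformly from such a buffer then amounts to sampling from the mixture of past policies.

```python
import random
from collections import Counter, deque


class ReplayBuffer:
    """FIFO buffer that remembers which policy produced each sample."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.data = deque(maxlen=capacity)

    def add(self, policy_index, state, action, ret):
        self.data.append((policy_index, state, action, ret))

    def mixture_weights(self):
        """Empirical w_i: share of the buffer collected by policy i."""
        counts = Counter(item[0] for item in self.data)
        total = len(self.data)
        return {i: n / total for i, n in counts.items()}

    def sample(self, batch_size, rng=random):
        """Uniform sampling with replacement, i.e. sampling from the mixture."""
        return rng.choices(list(self.data), k=batch_size)
```

For example, with `capacity=4`, adding two samples from policy 0 and then three from policy 1 evicts the oldest policy-0 sample, leaving weights `{0: 0.25, 1: 0.75}`. Older policies fade out of \(\mu\) as new data arrives.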
Theory: see Eqs. 7 - 10. (Honestly, I can't follow any of the theory (crying))
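For orientation, here is roughly how I read the arg max step, written for a single sampling policy \(\mu\). This is a sketch from my reading of the derivation, not a transcription of the paper's numbered equations: the symbols \(\epsilon\), \(Z(s)\), and \(\alpha_s\), and the replacement of \(d_\pi\) by \(d_\mu\), are assumptions that should be checked against the paper.

The constrained problem:

\[
\arg\max_\pi \; E_{s\sim d_\mu(s),\,a\sim\pi(a|s)}\big[A^{\mu}(s,a)\big]
\quad \text{s.t.} \quad
E_{s\sim d_\mu(s)}\big[D_{KL}\big(\pi(\cdot|s)\,\|\,\mu(\cdot|s)\big)\big]\le\epsilon
\]

Its Lagrangian, with β for the KL constraint and \(\alpha_s\) for normalization:

\[
L(\pi,\beta,\alpha)=E_{s\sim d_\mu(s),\,a\sim\pi(a|s)}\big[A^{\mu}(s,a)\big]
+\beta\Big(\epsilon-E_{s\sim d_\mu(s)}\big[D_{KL}\big(\pi(\cdot|s)\,\|\,\mu(\cdot|s)\big)\big]\Big)
+\int_s\alpha_s\Big(1-\int_a\pi(a|s)\,da\Big)ds
\]

Setting the derivative with respect to \(\pi(a|s)\) to zero gives an exponentially reweighted version of \(\mu\):

\[
\pi^*(a|s)=\frac{1}{Z(s)}\,\mu(a|s)\exp\Big(\frac{1}{\beta}A^{\mu}(s,a)\Big)
\]

Projecting \(\pi^*\) back onto the parametric policy class turns into a weighted regression on the sampled actions (up to the normalizer \(Z(s)\), which as far as I can tell is dropped in practice):

\[
\arg\max_\pi \; E_{s\sim d_\mu(s),\,a\sim\mu(a|s)}\Big[\log\pi(a|s)\,\exp\Big(\frac{1}{\beta}A^{\mu}(s,a)\Big)\Big]
\]

Read this way, β acts as a temperature: a small β concentrates the weight on the highest-advantage actions, while a large β keeps \(\pi\) close to \(\mu\).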