(This is actually a course assignment; it feels like something an algorithm-engineer interview might ask, so I am archiving it here.)

Related post: RL | Value Iteration 的收敛性证明 (the convergence proof of Value Iteration)


Problem: prove the convergence of Policy Iteration

Please prove that the Policy Iteration algorithm terminates within finitely many steps in the discrete-state, discrete-action, discounted-reward setting.

Answer:

0 Background - 背景

First of all, let's review what Policy Iteration is. It consists of two steps:

  • 1 - Policy Evaluation:

    • For an initial policy \(\pi_1\) and an initial value function \(V_0\), we use the Bellman Operator \(B_{\pi_1}V(s)=E_{a\sim\pi_1(a|s)}[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}V(s')]\) to obtain the exact value function \(V_1\) of policy \(\pi_1\).
    • In practice, we repeatedly apply the Bellman Operator \(B_{\pi_1}\) to update the value function from its initial value \(V_0\), until it satisfies \(B_{\pi_1}V(s)=V(s)\) for all \(s\). We denote the \(V\) satisfying this equation as \(V_1\), so \(V_1\) is the value function of policy \(\pi_1\).
  • 2 - Policy Improvement:
    • We can get a policy \(\pi_2\) that is better than the previous policy \(\pi_1\) by using its value function \(V_1\): for every state \(s\), \(\pi_2(s)=\arg\max_a[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}V_1(s')]\).
  • Iteration:
    • After getting \(\pi_2\), we compute its value function \(V_2\) using the Bellman Operator \(B_{\pi_2}\); then we get an even better policy \(\pi_3\) from the value function \(V_2\), and so on. Eventually, the new policy \(\pi_{k+1}\) will be identical to the previous policy \(\pi_k\), and at that point \(\pi_k\) is the optimal policy.

To start, let's recall the definition of Policy Iteration:

  • It consists of two parts: 1. Policy Evaluation and 2. Policy Improvement.
  • The first step solves for the value function of the given policy; the second step takes a = argmax [r+γV(s')] based on that value function to obtain a better new policy.
  • Iterating in this way keeps producing better policies; once an iteration returns a new policy identical to the previous one, the policy has converged to the optimal policy. (A minimal code sketch of the whole loop is given right below.)
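To make the loop above concrete, here is a minimal tabular sketch in Python/NumPy. It runs on a hypothetical, randomly generated MDP; all names (`n_states`, `P`, `r`, `policy_evaluation`, `policy_improvement`) are assumptions of this sketch, not part of any particular library, and the code is an illustration rather than a reference implementation.

```python
import numpy as np

# Toy MDP (hypothetical, randomly generated): transitions P[s, a, s'] and rewards r[s, a].
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # each P[s, a, :] is a distribution p(s' | s, a)
r = rng.random((n_states, n_actions))

def policy_evaluation(pi, V, tol=1e-10):
    """Policy Evaluation: apply B_pi repeatedly until V is (numerically) its fixed point."""
    while True:
        # B_pi V(s) = r(s, pi(s)) + gamma * E_{s' ~ p(.|s, pi(s))} V(s')
        V_new = r[np.arange(n_states), pi] + gamma * P[np.arange(n_states), pi] @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_improvement(V):
    """Policy Improvement: pi'(s) = argmax_a [ r(s, a) + gamma * E_{s'} V(s') ]."""
    Q = r + gamma * P @ V                  # Q[s, a]
    return Q.argmax(axis=1)

pi = np.zeros(n_states, dtype=int)         # arbitrary initial deterministic policy
V = np.zeros(n_states)                     # arbitrary initial value function
while True:
    V = policy_evaluation(pi, V)           # step 1: evaluate the current policy
    pi_new = policy_improvement(V)         # step 2: greedy improvement
    if np.array_equal(pi_new, pi):         # policy unchanged -> it is optimal, stop
        break
    pi = pi_new
print("optimal policy:", pi)
print("its value function:", np.round(V, 4))
```

The outer `while` loop is exactly the iteration whose termination this post proves: finitely many deterministic policies plus monotone improvement imply the `break` is eventually reached.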

In the following sections, we will establish two key points behind the convergence guarantee of Policy Iteration:

  • First, during Policy Evaluation, the value function converges to the value function of the given policy.
  • Second, finitely many Policy Iteration steps suffice to reach the optimal policy.

In what follows, we prove two things: 1. the value function in the Policy Evaluation step really converges, and 2. the Policy Improvement step reaches the optimal policy within finitely many iterations.

1 Policy Evaluation converges to the value function of the given policy - 策略评估的值函数会收敛到给定策略的值函数

What we want to prove is that, through Policy Evaluation, we can always obtain the corresponding value function \(V_i\) for a given policy \(\pi_i\). To prove this, we show that the Bellman Operator \(B_\pi\) is a Contraction Mapping under the sup norm. For any two value functions \(V_1\) and \(V_2\) (arbitrary functions here, not necessarily the value functions of \(\pi_1,\pi_2\) above) and any state \(s\), we have (similar to homework 4):

\[\begin{aligned}
& |B_\pi V_1(s)-B_\pi V_2(s)| \\
&= \bigg|E_{a\sim\pi(a|s)}\big[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}[V_1(s')]
-r(s,a)-\gamma E_{s'\sim p(s'|s,a)}[V_2(s')] \big]\bigg| \\
&= \gamma\bigg|E_{a\sim\pi(a|s)} E_{s'\sim p(s'|s,a)}[V_1(s')-V_2(s')]\bigg| \\
&\le \gamma\, E_{a\sim\pi(a|s)} E_{s'\sim p(s'|s,a)}\big[|V_1(s')-V_2(s')|\big] \\
&\le \gamma\max_{s'}|V_1(s')-V_2(s')| \\
&= \gamma\|V_1-V_2\|_\infty.
\end{aligned}
\tag1
\]

Taking the maximum over \(s\) on the left-hand side gives \(\|B_\pi V_1-B_\pi V_2\|_\infty\le\gamma\|V_1-V_2\|_\infty\), i.e., \(B_\pi\) is a \(\gamma\)-contraction in the sup norm.

Since the Bellman Operator \(B_\pi\) is a Contraction Mapping, the Banach fixed-point theorem gives the convergence guarantee: repeatedly applying \(B_\pi\) from any initial value function converges to its unique fixed point, which is exactly the value function of \(\pi\). The argument is the same as the proof in homework 4, so we won't go into detail here.

In short: we want to show that Policy Evaluation really recovers the value function of the given policy. For that, we show the Bellman Operator \(B_\pi\) is a Contraction Mapping (a few lines of bounding, as above), and then the Banach fixed-point theorem yields the convergence guarantee for Policy Evaluation (see the previous homework).
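As a quick numerical illustration of inequality (1) (not part of the proof), the following Python/NumPy sketch builds a hypothetical random MDP and a random stochastic policy, applies \(B_\pi\) to pairs of arbitrary value functions, and checks the sup-norm contraction; all names (`bellman_pi`, `P`, `r`, `pi`) are assumptions of this sketch.

```python
import numpy as np

# Sanity check of inequality (1): ||B_pi V1 - B_pi V2||_inf <= gamma * ||V1 - V2||_inf.
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 6, 4, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # p(s' | s, a)
r = rng.random((n_states, n_actions))
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)        # a random stochastic policy pi(a | s)

def bellman_pi(V):
    # B_pi V(s) = E_{a ~ pi(.|s)} [ r(s, a) + gamma * E_{s' ~ p(.|s,a)} V(s') ]
    return (pi * (r + gamma * P @ V)).sum(axis=1)

for _ in range(5):
    V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
    lhs = np.max(np.abs(bellman_pi(V1) - bellman_pi(V2)))
    rhs = gamma * np.max(np.abs(V1 - V2))
    assert lhs <= rhs + 1e-12              # the contraction inequality (1)
    print(f"||B V1 - B V2||_inf = {lhs:.4f} <= gamma * ||V1 - V2||_inf = {rhs:.4f}")
```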

2 Policy Improvement will converge to the optimal policy - 策略改进的策略会收敛到最优策略

Next, we need to prove the effectiveness of Policy Improvement. The proof has two parts: 1. each Policy Improvement step yields a policy that is better than (or at least as good as) the previous one; 2. the iteration converges to the optimal policy after finitely many steps.

Having shown that Policy Evaluation recovers the value function, we now prove the effectiveness of Policy Improvement, in two steps: 1. every Policy Improvement step produces a better (or at least no worse) policy; 2. after finitely many Policy Improvement steps, the iteration converges to the optimal policy.

2.1 Policy Improvement will push the policy better - 策略改进总能让策略性能提升

To show this, we compare \(V_{i+1}\) with \(V_i\) by unfolding the Policy Evaluation process for \(V_{i+1}\): it repeatedly applies the Bellman Operator \(B_{\pi_{i+1}}\), starting from the previous value function \(V_i\).

We want to show that Policy Improvement, which takes a = argmax Q(s,a), always yields a better policy (or at least one that is no worse): the value function \(V_{i+1}\) of the new policy \(\pi_{i+1}\) should be greater than or equal to the value function \(V_i\) of the old policy \(\pi_i\). To do so, we unfold the Policy Evaluation process that computes \(V_{i+1}\) by repeatedly applying the Bellman Operator \(B_{\pi_{i+1}}\).

Consider the following policy: we use the improved policy \(\pi_{i+1}\) for the current step only, and then follow the old policy \(\pi_i\) for the rest of the episode. Since \(\pi_{i+1}\) is greedy with respect to \(V_i\), the resulting value function is:

\[V_{i+1}^{(1)}(s)=\max_a\big[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}V_i(s')\big]\ge E_{a\sim\pi_i(a|s)}\big[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}V_i(s')\big]=V_i(s).
\tag2
\]

Then, at the next state \(s'\), we also act according to the improved policy \(\pi_{i+1}\) (and only switch back to \(\pi_i\) afterwards), which changes the value function into

\[\begin{aligned}
V_{i+1}^{(2)}(s)&=E_{a\sim\pi_{i+1}(a|s)}\Big[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}\,E_{a'\sim\pi_{i+1}(a'|s')}\big[r(s',a')+\gamma E_{s''\sim p(s''|s',a')}V_i(s'')\big]\Big]\\
&\ge V_{i+1}^{(1)}(s)\ge V_i(s).
\end{aligned}
\tag3
\]

The first inequality holds because \(V_{i+1}^{(2)}=B_{\pi_{i+1}}V_{i+1}^{(1)}\), \(V_{i+1}^{(1)}=B_{\pi_{i+1}}V_i\), and the Bellman Operator is monotone: applying \(B_{\pi_{i+1}}\) to both sides of \(V_{i+1}^{(1)}\ge V_i\) preserves the inequality.

Repeating this argument, if we keep using the improved policy \(\pi_{i+1}\) at every step (i.e., keep applying \(B_{\pi_{i+1}}\)), the sequence \(V_{i+1}^{(m)}\) converges to the value function \(V_{i+1}\) of \(\pi_{i+1}\) by the contraction argument in Section 1, and it satisfies the following chain of inequalities:

\[V_{i+1}(s)=\lim_{m\to\infty}V_{i+1}^{(m)}(s)\ge\cdots\ge V_{i+1}^{(2)}(s)\ge V_{i+1}^{(1)}(s)\ge V_{i}(s).
\tag4
\]

Thus, we conclude that the improved policy performs at least as well as, and possibly better than, the previous one: \(V_{i+1}(s)\ge V_i(s)\) for all \(s\).

  • First consider the following policy: at the current step we act with the new policy \(\pi_{i+1}\) to reach the next state \(s'\), and from there we keep following the old policy \(\pi_i\). The resulting value function \(V_{i+1}^{(1)}(s)\) is given by equation (2) above.
  • Next, we act with the new policy \(\pi_{i+1}\) for two steps, i.e., we also follow \(\pi_{i+1}\) at the next state \(s'\). The resulting value function \(V_{i+1}^{(2)}(s)\) is given by equation (3) above, and it satisfies \(V_{i+1}^{(2)}(s)\ge V_{i+1}^{(1)}(s)\).
  • Continuing this way and following the new policy \(\pi_{i+1}\) forever, we obtain exactly the value function \(V_{i+1}(s)\) of the new policy, together with the chain of inequalities in equation (4): the new value function is always greater than or equal to the old one, which is what we wanted to show. (A small numerical illustration of this monotone sequence is given below.)
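To make the unrolling argument concrete, here is a small Python/NumPy illustration on a hypothetical random MDP (the names `value_of`, `pi_i`, `pi_next` are introduced only for this sketch): starting from \(V_i\), each application of \(B_{\pi_{i+1}}\) is checked to be pointwise no smaller than the previous iterate, matching inequality (4).

```python
import numpy as np

# Illustration of Section 2.1: starting from V_i (the value of pi_i), each application of
# B_{pi_{i+1}} produces a value function that is pointwise no smaller than the previous one.
rng = np.random.default_rng(2)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def value_of(pi, tol=1e-12):
    """Value function of a deterministic policy pi, via repeated Bellman backups."""
    V = np.zeros(n_states)
    while True:
        V_new = r[np.arange(n_states), pi] + gamma * P[np.arange(n_states), pi] @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

pi_i = rng.integers(n_actions, size=n_states)       # some current policy pi_i
V_i = value_of(pi_i)
pi_next = (r + gamma * P @ V_i).argmax(axis=1)      # greedy improvement pi_{i+1}

V = V_i.copy()
for m in range(1, 21):
    # V^(m) = B_{pi_{i+1}} V^(m-1): follow pi_{i+1} for m steps, then behave like pi_i.
    V_new = r[np.arange(n_states), pi_next] + gamma * P[np.arange(n_states), pi_next] @ V
    assert np.all(V_new >= V - 1e-12)               # the chain of inequalities (4)
    V = V_new
print("V_i        :", np.round(V_i, 3))
print("V^(20)     :", np.round(V, 3))
print("V_{pi_i+1} :", np.round(value_of(pi_next), 3))
```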

2.2 Policy Improvement will converge to the optimal policy - 策略改进会收敛到最优策略

It is actually very simple: since the state space and action space are both discrete and finite, there are only finitely many (deterministic) policies. If Policy Iteration could not converge to the optimal policy, the only remaining possibility would be "policy oscillation", i.e., the sequence of policies eventually revisits a policy it has produced before: starting from policy \(\pi_a\) with value function \(V_a\), the improved policy is \(\pi_b\); then from \(\pi_b\) with value function \(V_b\), the improved policy turns back into \(\pi_a\) (or, more generally, the iteration cycles among several policies \(\pi_a,\pi_b,\pi_c,\cdots\)).

In short: proving that Policy Improvement converges to the optimal policy is actually very simple. Because the state space and action space are both discrete and finite, the number of policies is also finite, so the iteration must eventually converge to the optimal policy, unless we run into "policy oscillation": the policy keeps bouncing between, say, \(\pi_a\) and \(\pi_b\) (possibly among more policies), with Policy Improvement turning \(\pi_a\) into \(\pi_b\) and \(\pi_b\) back into \(\pi_a\), forever.

However, such oscillation contradicts the guarantee from Section 2.1 that Policy Improvement always yields a better (or no worse) policy. If each improvement step were strict, we would get \(V_a\lt V_b\lt V_a\), which is obviously impossible. If the improved policy is exactly as good as the previous one (\(V_b=V_a\)), then \(V_a\) already satisfies the Bellman optimality equation, so the oscillating policies are all optimal policies, and we have in fact obtained the optimal policy.

However, "policy oscillation" contradicts the result of Section 2.1 that Policy Improvement always yields a better (or at least no worse) policy. If each step gave a strictly better policy, the value functions of \(\pi_a,\pi_b\) would satisfy \(V_a\lt V_b\lt V_a\), which is clearly impossible. If the policies are equally good, then they are already optimal, and we have obtained the optimal policy; a short derivation of this last step is given below.
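To spell out the last step (why equally good oscillating policies must already be optimal), here is a short derivation using the notation of this section. Suppose \(\pi_b\) is the greedy policy with respect to \(V_a\) and their value functions coincide, \(V_b=V_a\). Then for every state \(s\),

\[V_a(s)=V_b(s)=B_{\pi_b}V_b(s)=B_{\pi_b}V_a(s)=\max_a\big[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}V_a(s')\big],
\]

where the last equality uses the fact that \(\pi_b\) is greedy with respect to \(V_a\). Hence \(V_a\) is a fixed point of the Bellman optimality operator; that operator is also a \(\gamma\)-contraction, so its fixed point is unique and equals \(V^*\). Therefore \(V_a=V_b=V^*\), and \(\pi_a,\pi_b\) are both optimal.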

Reference - 参考资料

  • RL | Value Iteration 的收敛性证明 (the companion post on the convergence of Value Iteration, linked at the top of this article).
