RL 基础 | Value Iteration 的收敛性证明

（其实是专业课作业感觉算法岗面试可能会问，来存一下档）

问题：证明 Value Iteration 收敛性
0 Definitions - 定义
1 Bellman operator is a Contraction Mapping - 贝尔曼算子是压缩映射
2 Contraction Mapping converges to the optimal value function - 压缩映射能收敛到最优值函数
Reference - 参考资料

问题：证明 Value Iteration 收敛性

Please prove the convergence of Value Iteration algorithm in discounted reward setting, and analyze the performance between resulting policy and optimal policy. 证明 Value Iteration 的收敛性。

背景：

Policy Iteration：先得到一个 policy，然后算出它的 value function，基于该 value function 取 max_a Q(s,a) 作为新 policy，再算新 policy 的 value function…
Value Iteration：不停对 value function 使用贝尔曼算子，新 V = max_a [r(s,a) + γV(s')]。

0 Definitions - 定义

First, we define

Discrete state space $S=\{s_1,\cdots,s_n\}$, discrete action space $A=\{a_1,\cdots,a_n\}$.
Value function: $V(s)=E_{a\sim\pi(a|s)}\big[r(s,a) + \gamma E_{s'\sim p(s'|s,a)}V(s')\big]$ .
Bellman operator: for all $s\in S$ ,

\[\begin{aligned}
B_\pi V(s) &\triangleq E_{a\sim\pi(a|s)}\big[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}V(s') \big]
\\
B_* V(s) &\triangleq \max_a \big[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}V(s') \big]
\end{aligned}
\]

The Value Iteration algorithm is to conduct Bellman operator $B_*$ on the value function $V(s)$. We want to prove that the sequence will converge to the optimal value function $V^*$, which should be the of the Bellman equation $B_*V=V$ .

我们定义了状态空间、动作空间、值函数（value function）和贝尔曼算子（Bellman operator）。VI 算法，就是对 value function 一直使用 Bellman operator，希望最后能收敛到最优的 value function。这样，该最优 value function 必然是 Bellman equation $B_*V=V$ 的不动点。

1 Bellman operator is a Contraction Mapping - 贝尔曼算子是压缩映射

Then, we need to prove that Bellman Operator is a Contraction Mapping. The definition of Contraction Mapping is as follows:

\[d(f(x_1),f(x_2))\le\gamma d(x_1,x_2),
\]

where $d$ is a measurement over a metric space, $\gamma\in(0,1)$, and $f$ is the Contraction Mapping. In our case, the definition of Contraction Mapping is as follows:

\[|f(x_1)-f(x_2)|_\infty \le \gamma|x_1-x_2|_\infty,
\]

where $||_\infty$ is the infinite norm (return the max value over all dimensions of a vector).

接下来，证明贝尔曼算子是压缩映射（Contraction Mapping）。在这里，压缩映射的定义是，两个变量的无穷范数（向量各维之差的最大绝对值）会被 $f$ 映射压缩 $\gamma$ 倍。

Let's prove that Bellman Operator $B_*$ is a Contraction Mapping. The proof is as follows.

\[\begin{aligned}
&|B_*V_1(s)-B_*V_2(s)| \\
&=\bigg|\max_a\big[r(s,a)+\gamma E_{s'~p(s'|s,a)}V_1(s')\big] - \max_a\big[r(s,a)+\gamma E_{s}V_2(s')\big] \bigg| \\
&\le \bigg| \max_a[r(s,a) + \gamma E_{s'}V_1(s') - r(s,a) - \gamma E_{s'}V_2(s')]\bigg| \\
&= \bigg|\gamma\max_a E_{s'}(V_1(s')-V_2(s'))\bigg| \\
&\le \gamma\max_a\bigg|E_{s'}(V_1(s')-V_2(s'))\bigg| \\
&= \gamma\max_a \bigg|\sum_{s'}p(s'|s,a)\big[V_1(s')-V_2(s')\big]\bigg| \\
&\le \gamma\max_a \sum_{s'}p(s'|s,a)\bigg|V_1(s')-V_2(s')\bigg| \\
&= \gamma \sum_{s'}p(s'|s,a_*(s))\bigg|V_1(s')-V_2(s')\bigg| \\
&\le \gamma \sum_{s'}p(s'|s,a_*(s))\max_s\bigg|V_1(s)-V_2(s)\bigg| \\
&= \gamma|V_1-V_2|_\infty
\end{aligned}
\]

贝尔曼算子是压缩映射的证明：一连串放缩，如上。

2 Contraction Mapping converges to the optimal value function - 压缩映射能收敛到最优值函数

If Bellman Operator is a Contraction Mapping, then we can use the Banach fixed point theorem to prove that, the optimal value function $V^*$ is the unique fixed point of the Bellman Operator $B_*$, and the sequence $B_*(B_*(\cdots B_*V))$ converges to $V^*$.

如果贝尔曼算子是压缩映射，则可以用巴拿赫不动点定理（Banach fixed point theorem），证明 $V^*$ 是 Bellman Equation $B_*V=V$ 的唯一不动点，并且一定会收敛到该不动点。

The first step is to prove that the fixed point $V^*$ is unique. Use proof by contradiction. Suppose there are two converged value $V_1^*,V_2^*$, then we have $|B_*V_1^*-B_*V_2^*|_\infty=|V_1^*-V_2^*|_\infty$. However, Bellman operator $B_*$ is a Contraction Mapping, so there is $|B_*V_1^*-B_*V_2^*|_\infty\le\gamma|V_1^*-V_2^*|_\infty$, which contradicts the previous one. Thus, there can be only one converged value, and it is obviously the fixed point $V^*$.

第一步，证明收敛到的不动点是唯一的。使用反证法，假设有两个不动点，那么它们俩的压缩映射都映射到原地（不动点嘛），距离没有 γ 倍压缩，矛盾，从而得证。

The second step is to prove that the fixed point $V^*$ really exists. If we can prove that $\{V,B_*V,B_*^2V,\cdots\}$ is a Cauchy sequence, the fixed point $V^*$ exists. Take two value $B_*^a,B_*^b$ from the sequence, where $a\gg b$. Use the triangle inequality of $||_\infty$ as a measurement, we can get

\[\begin{aligned}
|B_*^a-B_*^b|_\infty &\le |B_*^a-B_*^{a-1}|_\infty + |B_*^{a-1}-B_*^b|_\infty
\\
&\le |B_*^a-B_*^{a-1}|_\infty + |B_*^{a-1}-B_*^{a-2}|_\infty + |B_*^{a-2}-B_*^b|_\infty
\\
&\le |B_*^a-B_*^{a-1}|_\infty + \cdots + |B_*^{b+1}-B_*^b|_\infty
\\
&\le \gamma^b|B_*V-V|\sum_{i=0}^{a-b-1}\gamma^i
\le \frac{\gamma^b}{1-\gamma}|B_*V-V|
\end{aligned}.
\]

So, if we can choose a large enough $b$, the value of $|B_*^a-B_*^b|_\infty$ can be less than any $\epsilon\gt0$. Thus, there must exists an $N\in\mathbb N$, such that $|B_*^a-B_*^b|_\infty\lt\epsilon$, for all $a,b\gt N$ . The sequence is a Cauchy sequence, and the fixed point $V^*$ exists.

第二步，证明不动点 $V^*$ 存在。去证 $\{V,B_*V,B_*^2V,\cdots\}$ 是柯西序列（Cauchy sequence），如果是柯西序列，那么就存在收敛的不动点。柯西序列的定义是，对于序列里的 $x_a,x_b$，存在正整数 N 使得对于任意 $a,b\gt N$，度量 $d(x_a,x_b)\lt\epsilon$，其中 $\epsilon$ 是任意小的正实数。先使用三角不等式，将 $x_a-x_b$ 的形式拆成 $(x_a-x_{a-1})+(x_{a-1}-x_{a-2})+\cdots+(x_{b+1}-x_b)$，然后使用压缩映射的性质，放缩成 γ 的等比序列求和，即可得到，它小于任意的小实数。

Reference - 参考资料

A blog from Zhihu: https://zhuanlan.zhihu.com/p/419208786
非常感谢！很清晰的博客