Generalized Random Forests

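Before looking at the causal forest, it helps to understand the framework its forest implementation is built on: Generalized Random Forests [6] (GRF).

GRF is a generalization of the random forest. A classical random forest can only estimate the label Y and cannot target more complex quantities such as a causal effect, so the authors behind Causal Tree and Causal Forest extended it. Start from the moment condition that defines the target parameter: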

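\[\tag{1}
\mathbb E\left[\psi_{\theta(x), \nu(x)}(O_i)\mid X_i=x\right]=0
\]

Here $\psi$ is the score function, i.e. the measure used to judge a candidate parameter value. $\theta(x)$ is the parameter we want to estimate (for example the conditional mean of the outcome, or the treatment effect, at $x$), and $\nu(x)$ is an optional nuisance parameter. $O_i$ denotes the observed quantities used in the computation, such as the supervision signal. For a response model $O_i=\{Y_i\}$; for a causal model $O_i=\{Y_i, W_i\}$, where $W$ denotes the treatment.

In practice, estimating the parameters amounts to minimizing:

\[\tag{2} \left(\hat \theta(x), \hat \nu(x)\right)\in \operatorname{argmin}_{\theta, \nu}\left\|\sum_{i=1}^{n}\alpha_i(x)\psi_{\theta, \nu}(O_i)\right\|_2
\]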

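Here $\alpha_i(x)$ is the weight of training sample $i$, which can also be read as a weight induced by the trees. Assuming $B$ trees are learned in total:

\[\alpha_i(x)=\frac{1}{B}\sum_{b=1}^{B}\alpha_{bi}(x)
\]

\[\alpha_{bi}(x)=\frac{1(\{X_i\in L_b(x)\})}{|L_b(x)|}
\]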

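Here $L_b(x)$ is the set of training samples that fall into the same leaf as $x$ in tree $b$. In essence, the weight measures how similar a training sample is to the inference (test) point: if training sample $X_i$ lands in the leaf $L_b(x)$, and the samples within a leaf can be regarded as homogeneous, then $X_i$ can be regarded as similar to $x$.

By the same formula, a large $L_b(x)$ means many training samples reached that leaf, i.e. the partition around $x$ is still coarse, so each sample in it receives a smaller weight from that tree. Conversely, a small leaf gives each of its samples a larger weight.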
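The weights $\alpha_i(x)$ above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `forest_weights` and the toy leaf ids are made up here, and the final step assumes the plain regression score $\psi_\theta(O_i)=Y_i-\theta$, for which solving (2) reduces to the weighted average $\hat\theta(x)=\sum_i\alpha_i(x)Y_i$. With a fitted scikit-learn forest, the leaf ids could presumably be obtained from its `apply` method.

```python
import numpy as np

def forest_weights(train_leaves, query_leaves):
    """Compute alpha_i(x) from leaf memberships (hypothetical helper).

    train_leaves: (n, B) array, leaf id of training sample i in tree b.
    query_leaves: (B,) array, leaf id of the query point x in tree b.
    """
    same_leaf = train_leaves == query_leaves   # 1{X_i in L_b(x)}, shape (n, B)
    leaf_sizes = same_leaf.sum(axis=0)         # |L_b(x)| for every tree
    per_tree = same_leaf / leaf_sizes          # alpha_{bi}(x)
    return per_tree.mean(axis=1)               # average over the B trees

# Toy example: 4 training samples, 2 trees (leaf ids are invented).
train_leaves = np.array([[0, 1],
                         [0, 0],
                         [1, 1],
                         [1, 0]])
query_leaves = np.array([0, 1])
y = np.array([1.0, 2.0, 3.0, 4.0])

alpha = forest_weights(train_leaves, query_leaves)
# Assumed regression score psi = Y_i - theta: the estimate is a weighted mean.
theta_hat = alpha @ y
print(alpha, theta_hat)
```

Each tree spreads a total weight of 1 over the samples that share a leaf with $x$, so the averaged weights also sum to 1.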

Splitting criterion framework

For each tree, a parent node $P$ is split by optimizing:

\[\tag{3}\left(\hat{\theta}_P, \hat{\nu}_P\right)(\mathcal{J}) \in \operatorname{argmin}_{\theta, \nu}\left\{\left\|\sum_{\left\{i \in \mathcal{J}: X_i \in P\right\}} \psi_{\theta, \nu}\left(O_i\right)\right\|_2\right\} .
\]

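Here $\mathcal{J}$ denotes the training set. The two child nodes $C_1$ and $C_2$ produced by the split are chosen to minimize the squared error between the estimate and the true value: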

\[\tag{4}\operatorname{err}\left(C_1, C_2\right)=\sum_{j=1,2} \mathbb{P}\left[X \in C_j \mid X \in P\right] \mathbb{E}\left[\left(\hat{\theta}_{C_j}(\mathcal{J})-\theta(X)\right)^2 \mid X \in C_j\right]
\]

This is (approximately) equivalent to maximizing the heterogeneity between the child nodes:

\[\tag{5}\Delta\left(C_1, C_2\right):=n_{C_1} n_{C_2} / n_P^2\left(\hat{\theta}_{C_1}(\mathcal{J})-\hat{\theta}_{C_2}(\mathcal{J})\right)^2
\]
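As a rough illustration of criterion (5), the sketch below scans the candidate thresholds of one feature and keeps the split with the largest $\Delta$. It assumes the simplest regression case, where $\hat\theta_{C_j}$ is the mean outcome in child $C_j$; the function names and the toy data are invented for this example.

```python
import numpy as np

def delta(y_left, y_right):
    """Criterion (5) with theta_hat taken as the child mean (an assumption)."""
    n1, n2 = len(y_left), len(y_right)
    return n1 * n2 / (n1 + n2) ** 2 * (y_left.mean() - y_right.mean()) ** 2

def best_split(x, y):
    """Return (best score, threshold) over all midpoints between sorted x values."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_score, best_threshold = -np.inf, None
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue                      # no valid threshold between equal values
        score = delta(ys[:k], ys[k:])
        if score > best_score:
            best_score, best_threshold = score, (xs[k - 1] + xs[k]) / 2
    return best_score, best_threshold

x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([1.0, 1.1, 0.9, 3.0, 3.2, 2.8])
score, threshold = best_split(x, y)
print(score, threshold)
```

On this toy data the two groups have means of about 1 and 3, so the best split separates them and the factor $n_{C_1}n_{C_2}/n_P^2$ equals $1/4$.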

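However, re-optimizing $\hat\theta_{C_j}$ exactly for every candidate split is expensive, so GRF replaces it with a gradient-based approximation: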

\[\tag{6}\tilde{\theta}_C=\hat{\theta}_P-\frac{1}{\left|\left\{i: X_i \in C\right\}\right|} \sum_{\left\{i: X_i \in C\right\}} \xi^{\top} A_P^{-1} \psi_{\hat{\theta}_P, \hat{\nu}_P}\left(O_i\right)
\]

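Here $\hat \theta_P$ and $\hat \nu_P$ are obtained from (3), $\xi$ is a vector that picks out the $\theta$-coordinate from $(\theta, \nu)$, and $A_P$ is the gradient of the score function: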

\[\tag{7}A_P=\frac{1}{\left|\left\{i: X_i \in P\right\}\right|} \sum_{\left\{i: X_i \in P\right\}} \nabla \psi_{\hat{\theta}_P, \hat{\nu}_P}\left(O_i\right),
\]

The gradient-based computation has two steps:

Step 1: the labeling step, which computes the pseudo-outcomes

\[\tag{8}\rho_i=-\xi^{\top} A_P^{-1} \psi_{\hat{\theta}_P, \hat{\nu}_P}\left(O_i\right) \in \mathbb{R}.
\]

Step 2: the regression step, which passes these pseudo-outcomes as the signal to the split function. The split is finally chosen to maximize:

\[{\Delta}\left(C_1, C_2\right)=\sum_{j=1}^2 \frac{1}{\left|\left\{i: X_i \in C_j\right\}\right|}\left(\sum_{\left\{i: X_i \in C_j\right\}} \rho_i\right)^2
\]
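The two steps can be sketched as follows for a scalar treatment. The closed form used for the pseudo-outcomes is an assumption brought in for illustration (it is the usual specialization for a partially linear score, not something stated in the text above): $\rho_i=(W_i-\bar W_P)\left(Y_i-\bar Y_P-(W_i-\bar W_P)\hat\tau_P\right)/A_P$ with $A_P=\frac{1}{n_P}\sum_i(W_i-\bar W_P)^2$. All function names and the simulated data are hypothetical.

```python
import numpy as np

def pseudo_outcomes(y, w):
    """Labeling step for a scalar treatment (assumed partially linear score)."""
    w_c = w - w.mean()
    y_c = y - y.mean()
    a_p = np.mean(w_c ** 2)                  # A_P, a scalar here
    tau_p = np.mean(w_c * y_c) / a_p         # parent estimate theta_hat_P
    return w_c * (y_c - w_c * tau_p) / a_p

def best_split(x, rho, min_leaf=10):
    """Regression step: maximize sum_j (sum of rho in C_j)^2 / |C_j|."""
    order = np.argsort(x)
    xs, rs = x[order], rho[order]
    n = len(xs)
    left_n = np.arange(1, n)                 # k = 1 .. n-1 samples on the left
    left_sum = np.cumsum(rs)[:-1]
    right_n = n - left_n
    right_sum = rs.sum() - left_sum
    score = left_sum ** 2 / left_n + right_sum ** 2 / right_n
    score[:min_leaf - 1] = -np.inf           # forbid tiny left children
    score[n - min_leaf:] = -np.inf           # forbid tiny right children
    k = int(np.argmax(score)) + 1            # number of samples sent left
    return (xs[k - 1] + xs[k]) / 2, score[k - 1]

# Simulated data: the treatment effect jumps from 0 to 2 at x = 0.5.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(size=n)
w = rng.integers(0, 2, size=n).astype(float)
y = np.where(x >= 0.5, 2.0, 0.0) * w + 0.1 * rng.normal(size=n)

rho = pseudo_outcomes(y, w)
threshold, score = best_split(x, rho)
print(threshold, score)
```

Because the pseudo-outcomes sum to zero in the parent, the split only has to find where their running sum changes direction, and on this simulated data the chosen threshold should fall close to the true change point at 0.5.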

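Below are several applications of GRF:

Causal Forest

It takes the Causal Tree as the base learner and leaves the estimator unchanged.

Just as a single tree evolves into an ensemble, the causal forest [7] follows the classical bagging-style random forest and extends one causal tree to many:

\[\hat \tau(x)=\frac{1}{B}\sum_{b=1}^{B} \hat \tau_b(x)
\]

where each subtree $\hat \tau_b$ is a Causal Tree. One advantage of extending via a random forest is that the causal tree itself needs no modification, which also makes it cheaper than boosting-style GBM.

Note, however, that the forest used here is the generalized random forest described above, since a classical random forest can only estimate the label Y and not complex targets such as a causal effect.

As for the implementation, setting GRF aside, a single-machine version can reuse sklearn's forest subclasses and only override the fit method. A distributed version can reuse Spark ML's forest.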

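# Excerpt from inside a forest class built on sklearn's forest base class.
# Names not defined here (CausalTreeRegressor, _parallel_build_trees,
# _joblib_parallel_args, n_more_estimators, ...) come from the surrounding code.
from joblib import Parallel, delayed

# Base learner: every tree of the forest is a causal tree.
self._estimator = CausalTreeRegressor(
    control_name=control_name,
    criterion=criterion,
    groups_cnt=groups_cnt)

# Create the new, still unfitted trees.
trees = [self._make_estimator(append=False, random_state=random_state)
         for _ in range(n_more_estimators)]

# Fit the trees in parallel, then add them to the ensemble.
trees = Parallel(
    n_jobs=self.n_jobs,
    verbose=self.verbose,
    **_joblib_parallel_args,
)(
    delayed(_parallel_build_trees)(
        t, self, X, y, sample_weight, i, len(trees),
        verbose=self.verbose,
        class_weight=self.class_weight,
        n_samples_bootstrap=n_samples_bootstrap,
    )
    for i, t in enumerate(trees)
)
self.estimators_.extend(trees)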
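To make the averaging $\hat\tau(x)=\frac{1}{B}\sum_b\hat\tau_b(x)$ concrete, here is a self-contained toy sketch. It is not the implementation excerpted above: each "tree" is a hypothetical depth-1 stand-in that splits one feature at its median and estimates the effect on each side as a treated-minus-control mean difference, and all names and data are invented for illustration.

```python
import numpy as np

def fit_stump(x, y, w):
    """Depth-1 stand-in for a causal tree (an assumption, not a real causal tree)."""
    threshold = np.median(x)
    effects = []
    for side in (x < threshold, x >= threshold):
        treated = y[side & (w == 1)].mean()
        control = y[side & (w == 0)].mean()
        effects.append(treated - control)    # difference in means within the leaf
    return threshold, effects[0], effects[1]

def predict_stump(stump, x_new):
    threshold, left_effect, right_effect = stump
    return np.where(x_new < threshold, left_effect, right_effect)

def fit_forest(x, y, w, n_trees=50, seed=0):
    """Bagging: fit every stump on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(x)
    stumps = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)     # sample n rows with replacement
        stumps.append(fit_stump(x[idx], y[idx], w[idx]))
    return stumps

def predict_forest(stumps, x_new):
    """tau_hat(x) = average of the per-tree estimates tau_hat_b(x)."""
    return np.mean([predict_stump(s, x_new) for s in stumps], axis=0)

# Simulated data: effect 0 for x < 0.5 and 2 for x >= 0.5.
rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(size=n)
w = rng.integers(0, 2, size=n)
y = 2.0 * (x >= 0.5) * w + 0.5 * rng.normal(size=n)

forest = fit_forest(x, y, w)
tau_hat = predict_forest(forest, np.array([0.1, 0.9]))
print(tau_hat)
```

The base learner is never changed to form the ensemble; only the resampling and the final average are added, which is the point made above about bagging being cheap to apply.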

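CAPE: causal effect estimation for continuous treatments

Conditional Average Partial Effects (CAPE)

GRF provides a framework: given any score function, it keeps splitting subtrees in the direction that maximizes the heterogeneity between nodes. As with response models, we still need some estimated quantity (playing the role that the Gini index or entropy plays there) to compute how the score function changes before and after a split, and computing it requires an estimand. For a continuous treatment the estimand is defined as: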

\[\theta(x)=\xi^{\top} \operatorname{Var}\left[W_i \mid X_i=x\right]^{-1} \operatorname{Cov}\left[W_i, Y_i \mid X_i=x\right]
\]

The estimand guides the split computation, but in the end the leaf nodes still store the expectation of the outcome.
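A sample analogue of this estimand can be sketched as below, assuming a treatment vector $W_i\in\mathbb{R}^q$ and optional sample weights. The function name `cape_estimate`, its `alpha` argument, and the simulated data are hypothetical. With uniform weights the result should coincide with the corresponding coefficient of an ordinary least-squares fit with an intercept, which the example checks.

```python
import numpy as np

def cape_estimate(w, y, xi, alpha=None):
    """xi^T Var[W]^{-1} Cov[W, Y], estimated from (optionally weighted) samples.

    w: (n, q) treatments, y: (n,) outcomes, xi: (q,) selector vector,
    alpha: (n,) non-negative weights (hypothetical argument), uniform if None.
    """
    n = len(y)
    alpha = np.full(n, 1.0 / n) if alpha is None else alpha / alpha.sum()
    w_c = w - alpha @ w                        # center by the weighted mean
    y_c = y - alpha @ y
    var_w = (w_c * alpha[:, None]).T @ w_c     # (q, q) weighted variance of W
    cov_wy = (w_c * alpha[:, None]).T @ y_c    # (q,) weighted covariance with Y
    return xi @ np.linalg.solve(var_w, cov_wy)

rng = np.random.default_rng(0)
n = 300
w = rng.normal(size=(n, 2))
y = 1.5 * w[:, 0] - 0.5 * w[:, 1] + 2.0 + 0.3 * rng.normal(size=n)

xi = np.array([1.0, 0.0])                      # read off the effect of the first treatment
theta_hat = cape_estimate(w, y, xi)

# Cross-check against least squares with an intercept column.
design = np.column_stack([np.ones(n), w])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(theta_hat, coef[1])
```

Passing forest weights as `alpha` would give a localized version of the same quantity; that use is a suggestion rather than something the text prescribes.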

The motivation here comes from instrumental variables and linear regression:

\[y=f(x)=wx+b
\]

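Here we take $x$ to be the treatment and $y$ the outcome; $w$ is a single parameter that describes the direct effect of applying the treatment on the outcome. To find it we need a metric for how good a parameter value is, i.e. a loss. As with the causal tree, MSE is the usual choice: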

\[L(w, b) = \frac{1}{2}\sum(f(x)-y)^2
\]

To find $w$, take the partial derivatives of the loss and set them to zero:

\[\tag{9}\frac{\partial L}{\partial w}=\sum(f(x)-y)x=\sum(wx+b-y)x=0
\]

\[\tag{10}
\begin{aligned}
\frac{\partial L}{\partial b} & = \sum(f(x)-y)=\sum(wx+b-y)=0 \\
& \Rightarrow \sum b= \sum y-\sum wx \\
& \Rightarrow b = E(y)-wE(x) = \bar y - w\bar x
\end{aligned}
\]

Substituting (10) into (9) gives:

\[
\begin{aligned}
\frac{\partial L}{\partial w}=0 & \Rightarrow \sum(wx+\bar y-w\bar x-y)x =0 \\
&\Rightarrow w=\frac{\sum xy-\bar y\sum x}{\sum x^2-\bar x\sum x} \\
&\Rightarrow w=\frac{\sum(x-\bar x)(y-\bar y)}{\sum(x-\bar x)^2}\\
&\Rightarrow w=\frac{Cov(x,y)}{Var(x)}
\end{aligned}
\]

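In short, the parameter $w$ is the covariance of treatment and outcome divided by the variance of the treatment. As for $\xi$, it appears only to select which component of a multi-dimensional treatment the effect is read from, so for a scalar treatment it has little influence.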
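The result of this derivation is easy to check numerically. The sketch below uses simulated data and a hypothetical helper `slope_intercept`, and compares the covariance-over-variance formula with NumPy's least-squares line fit.

```python
import numpy as np

def slope_intercept(x, y):
    """Closed-form least squares: w = Cov(x, y) / Var(x), b = mean(y) - w * mean(x)."""
    w = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # both normalized by n
    b = y.mean() - w * x.mean()
    return w, b

rng = np.random.default_rng(0)
x = rng.normal(size=500)                             # treatment
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=500)  # outcome

w_cov, b_cov = slope_intercept(x, y)
w_ls, b_ls = np.polyfit(x, y, deg=1)                 # returns [slope, intercept]
print(w_cov, w_ls, b_cov, b_ls)
```

Both routes minimize the same squared loss, so the two pairs of numbers should agree up to floating-point error.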

References

[1] https://hwcoder.top/Uplift-1
[2] Tool: scikit-uplift
[3] Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning
[4] Athey, Susan, and Guido Imbens. "Recursive partitioning for heterogeneous causal effects." Proceedings of the National Academy of Sciences 113.27 (2016): 7353-7360.
[5] https://zhuanlan.zhihu.com/p/115223013
[6] Athey, Susan, Julie Tibshirani, and Stefan Wager. "Generalized random forests." The Annals of Statistics 47.2 (2019): 1148-1178.
[7] Wager, Stefan, and Susan Athey. "Estimation and inference of heterogeneous treatment effects using random forests." Journal of the American Statistical Association 113.523 (2018): 1228-1242.
[8] Rzepakowski, Piotr, and Szymon Jaroszewicz. "Decision trees for uplift modeling with single and multiple treatments." Knowledge and Information Systems 32 (2012): 303-327.
[9] Rößler, Jannik, Richard Guse, and Detlef Schoder. "The best of two worlds: using recent advances from uplift modeling and heterogeneous treatment effects to optimize targeting policies." International Conference on Information Systems, 2022.
