Variance Inflation Factor (VIF) 方差膨胀因子解释_附python脚本
python信用评分卡(附代码,博主录制)

https://etav.github.io/python/vif_factor_python.html
Colinearity is the state where two variables are highly correlated and contain similiar information about the variance within a given dataset. To detect colinearity among variables, simply create a correlation matrix and find variables with large absolute values. In R use the corr function and in python this can by accomplished by using numpy's corrcoeffunction.
Multicolinearity on the other hand is more troublesome to detect because it emerges when three or more variables, which are highly correlated, are included within a model. To make matters worst multicolinearity can emerge even when isolated pairs of variables are not colinear.
A common R function used for testing regression assumptions and specifically multicolinearity is "VIF()" and unlike many statistical concepts, its formula is straightforward:
$$ V.I.F. = 1 / (1 - R^2). $$
The Variance Inflation Factor (VIF) is a measure of colinearity among predictor variables within a multiple regression. It is calculated by taking the the ratio of the variance of all a given model's betas divide by the variane of a single beta if it were fit alone.
Steps for Implementing VIF
- Run a multiple regression.
- Calculate the VIF factors.
- Inspect the factors for each predictor variable, if the VIF is between 5-10, multicolinearity is likely present and you should consider dropping the variable.
#Imports
import pandas as pd
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor df = pd.read_csv('loan.csv')
df.dropna()
df = df._get_numeric_data() #drop non-numeric cols df.head()
| id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | int_rate | installment | annual_inc | dti | delinq_2yrs | ... | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599 | 5000.0 | 5000.0 | 4975.0 | 10.65 | 162.87 | 24000.0 | 27.65 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1077430 | 1314167 | 2500.0 | 2500.0 | 2500.0 | 15.27 | 59.83 | 30000.0 | 1.00 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1077175 | 1313524 | 2400.0 | 2400.0 | 2400.0 | 15.96 | 84.33 | 12252.0 | 8.72 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1076863 | 1277178 | 10000.0 | 10000.0 | 10000.0 | 13.49 | 339.31 | 49200.0 | 20.00 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1075358 | 1311748 | 3000.0 | 3000.0 | 3000.0 | 12.69 | 67.79 | 80000.0 | 17.94 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 51 columns
df = df[['annual_inc','loan_amnt', 'funded_amnt','annual_inc','dti']].dropna() #subset the dataframe
Step 1: Run a multiple regression
%%capture
#gather features
features = "+".join(df.columns - ["annual_inc"]) # get y and X dataframes based on this regression:
y, X = dmatrices('annual_inc ~' + features, df, return_type='dataframe')
Step 2: Calculate VIF Factors
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
Step 3: Inspect VIF Factors
vif.round(1)
| VIF Factor | features | |
|---|---|---|
| 0 | 5.1 | Intercept |
| 1 | 1.0 | dti |
| 2 | 678.4 | funded_amnt |
| 3 | 678.4 | loan_amnt |
As expected, the total funded amount for the loan and the amount of the loan have a high variance inflation factor because they "explain" the same variance within this dataset. We would need to discard one of these variables before moving on to model building or risk building a model with high multicolinearity.
https://study.163.com/course/courseMain.htm?courseId=1005988013&share=2&shareId=400000000398149


Variance Inflation Factor (VIF) 方差膨胀因子解释_附python脚本的更多相关文章
- 可决系数R^2和方差膨胀因子VIF
然而很多时候,被筛选的特征在模型上线的预测效果并不理想,究其原因可能是由于特征筛选的偏差. 但还有一个显著的因素,就是选取特征之间之间可能存在高度的多重共线性,导致模型对测试集预测能力不佳. 为了在筛 ...
- GWAS: 曼哈顿图,QQ plot 图,膨胀系数( manhattan、Genomic Inflation Factor)
画曼哈顿图和QQ plot 首推R包“qqman”,简约方便.下面具体介绍以下. 一.画曼哈顿图 install.packages("qqman") library(qqman) ...
- Java 序列化Serializable具体解释(附具体样例)
Java 序列化Serializable具体解释(附具体样例) 1.什么是序列化和反序列化 Serialization(序列化)是一种将对象以一连串的字节描写叙述的过程:反序列化deserializa ...
- Cocos2d-x手机游戏开发与项目实践具体解释_随书代码
Cocos2d-x手机游戏开发与项目实战具体解释_随书代码 作者:沈大海 因为原作者共享的资源为UTF-8字符编码.下载后解压在win下显示乱码或还出现文件不全问题,现完整整理,解决全部乱码问题,供 ...
- 杭电 2136 Largest prime factor(最大素数因子的位置)
Largest prime factor Time Limit: 5000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Oth ...
- 斯坦福大学公开课机器学习: advice for applying machine learning | regularization and bais/variance(机器学习中方差和偏差如何相互影响、以及和算法的正则化之间的相互关系)
算法正则化可以有效地防止过拟合, 但正则化跟算法的偏差和方差又有什么关系呢?下面主要讨论一下方差和偏差两者之间是如何相互影响的.以及和算法的正则化之间的相互关系 假如我们要对高阶的多项式进行拟合,为了 ...
- glibc中malloc的详细解释_转
glibc中的malloc实现: The main properties of the algorithms are:* For large (>= 512 bytes) requests, i ...
- cmd /c和cmd /k 解释,附★CMD命令★ 大全
cmd /c和cmd /k http://leaning.javaeye.com/blog/380810 java的Runtime.getRuntime().exec(commandStr)可以调用执 ...
- c语言_文件操作_FILE结构体解释_涉及对操作系统文件FCB操作的解释_
1. 文件和流的关系 C将每个文件简单地作为顺序字节流(如下图).每个文件用文件结束符结束,或者在特定字节数的地方结束,这个特定的字节数可以存储在系统维护的管理数据结构中.当打开文件时,就建立了和文件 ...
随机推荐
- 使用Junit测试框架学习Java
前言 在日常的开发中,离不开单元测试,而且在学习Java时,特别是在测试不同API使用时要不停的写main方法,显得很繁琐,所以这里介绍使用Junit学习Java的方法.此外,我使用log4j将结果输 ...
- 0024SpringMVC中几个常见注解的实验
对SpringMVC中的以下几个常用注解进行简单的实验测试: 1.@RequestParam 2.@PathVariable 3.@RequestBody 4.@RequestHeader 5.@Co ...
- LG5487 【模板】线性递推+BM算法
[模板]线性递推+BM算法 给出一个数列 \(P\) 从 \(0\) 开始的前 \(n\) 项,求序列 \(P\) 在\(\bmod~998244353\) 下的最短线性递推式,并在 \(\bmod~ ...
- Union-Find(并查集): Quick union算法
Quick union算法 Quick union: Java implementation Quick union 性能分析 在最坏的情况下,quick-union的find root操作cost( ...
- Zookeeper选举(fastleaderelection算法)
1.选举相关概念: 选票:(myid,zxid,当前节点选取轮次,被推举服务器选举轮次,状态(looking)). 选举发生情况:启动时选举,运行时选举. 外部投票:其他服务器发送来的投票. 内部投票 ...
- MySQL 是怎么保证数据一致性的(转载)
在<写数据库同时发mq消息事务一致性的一种解决方案>一文的方案中把分布式事务巧妙转成了数据库事务.我们都知道关系型数据库事务能保证数据一致性,那数据库到底是怎么设计事务这一特性的呢? 一. ...
- HiveQL 查询
一.select ...... from 语句 1.使用正则表达式来指定列 1)从表stocks中选择symbol列和列名以price作为前缀的列 select symbol,`price.*` f ...
- 019_Python3 输入和输出
在前面几个章节中,我们其实已经接触了 Python 的输入输出的功能.本章节我们将具体介绍 Python 的输入输出. ************************************ 1 ...
- webuploader+php如何实现分片+断点续传
这里只写后端的代码,基本的思想就是,前端将文件分片,然后每次访问上传接口的时候,向后端传入参数:当前为第几块文件,和分片总数 下面直接贴代码吧,一些难懂的我大部分都加上注释了: 上传文件实体类: 看得 ...
- .net文件夹上传下载组件
ASP.NET上传文件用FileUpLoad就可以,但是对文件夹的操作却不能用FileUpLoad来实现. 下面这个示例便是使用ASP.NET来实现上传文件夹并对文件夹进行压缩以及解压. ASP.NE ...