python信用评分卡(附代码,博主录制)

https://etav.github.io/python/vif_factor_python.html

Colinearity is the state where two variables are highly correlated and contain similiar information about the variance within a given dataset. To detect colinearity among variables, simply create a correlation matrix and find variables with large absolute values. In R use the corr function and in python this can by accomplished by using numpy's corrcoeffunction.

Multicolinearity on the other hand is more troublesome to detect because it emerges when three or more variables, which are highly correlated, are included within a model. To make matters worst multicolinearity can emerge even when isolated pairs of variables are not colinear.

A common R function used for testing regression assumptions and specifically multicolinearity is "VIF()" and unlike many statistical concepts, its formula is straightforward:

$$ V.I.F. = 1 / (1 - R^2). $$

The Variance Inflation Factor (VIF) is a measure of colinearity among predictor variables within a multiple regression. It is calculated by taking the the ratio of the variance of all a given model's betas divide by the variane of a single beta if it were fit alone.

Steps for Implementing VIF

  1. Run a multiple regression.
  2. Calculate the VIF factors.
  3. Inspect the factors for each predictor variable, if the VIF is between 5-10, multicolinearity is likely present and you should consider dropping the variable.
#Imports
import pandas as pd
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor df = pd.read_csv('loan.csv')
df.dropna()
df = df._get_numeric_data() #drop non-numeric cols df.head()
  id member_id loan_amnt funded_amnt funded_amnt_inv int_rate installment annual_inc dti delinq_2yrs ... total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m
0 1077501 1296599 5000.0 5000.0 4975.0 10.65 162.87 24000.0 27.65 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1077430 1314167 2500.0 2500.0 2500.0 15.27 59.83 30000.0 1.00 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1077175 1313524 2400.0 2400.0 2400.0 15.96 84.33 12252.0 8.72 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1076863 1277178 10000.0 10000.0 10000.0 13.49 339.31 49200.0 20.00 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1075358 1311748 3000.0 3000.0 3000.0 12.69 67.79 80000.0 17.94 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 51 columns

df = df[['annual_inc','loan_amnt', 'funded_amnt','annual_inc','dti']].dropna() #subset the dataframe

Step 1: Run a multiple regression

%%capture
#gather features
features = "+".join(df.columns - ["annual_inc"]) # get y and X dataframes based on this regression:
y, X = dmatrices('annual_inc ~' + features, df, return_type='dataframe')

Step 2: Calculate VIF Factors

# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

Step 3: Inspect VIF Factors

vif.round(1)
  VIF Factor features
0 5.1 Intercept
1 1.0 dti
2 678.4 funded_amnt
3 678.4 loan_amnt

As expected, the total funded amount for the loan and the amount of the loan have a high variance inflation factor because they "explain" the same variance within this dataset. We would need to discard one of these variables before moving on to model building or risk building a model with high multicolinearity.

https://study.163.com/course/courseMain.htm?courseId=1005988013&share=2&shareId=400000000398149

Variance Inflation Factor (VIF) 方差膨胀因子解释_附python脚本的更多相关文章

  1. 可决系数R^2和方差膨胀因子VIF

    然而很多时候,被筛选的特征在模型上线的预测效果并不理想,究其原因可能是由于特征筛选的偏差. 但还有一个显著的因素,就是选取特征之间之间可能存在高度的多重共线性,导致模型对测试集预测能力不佳. 为了在筛 ...

  2. GWAS: 曼哈顿图,QQ plot 图,膨胀系数( manhattan、Genomic Inflation Factor)

    画曼哈顿图和QQ plot 首推R包“qqman”,简约方便.下面具体介绍以下. 一.画曼哈顿图 install.packages("qqman") library(qqman) ...

  3. Java 序列化Serializable具体解释(附具体样例)

    Java 序列化Serializable具体解释(附具体样例) 1.什么是序列化和反序列化 Serialization(序列化)是一种将对象以一连串的字节描写叙述的过程:反序列化deserializa ...

  4. Cocos2d-x手机游戏开发与项目实践具体解释_随书代码

    Cocos2d-x手机游戏开发与项目实战具体解释_随书代码 作者:沈大海  因为原作者共享的资源为UTF-8字符编码.下载后解压在win下显示乱码或还出现文件不全问题,现完整整理,解决全部乱码问题,供 ...

  5. 杭电 2136 Largest prime factor(最大素数因子的位置)

    Largest prime factor Time Limit: 5000/1000 MS (Java/Others)    Memory Limit: 32768/32768 K (Java/Oth ...

  6. 斯坦福大学公开课机器学习: advice for applying machine learning | regularization and bais/variance(机器学习中方差和偏差如何相互影响、以及和算法的正则化之间的相互关系)

    算法正则化可以有效地防止过拟合, 但正则化跟算法的偏差和方差又有什么关系呢?下面主要讨论一下方差和偏差两者之间是如何相互影响的.以及和算法的正则化之间的相互关系 假如我们要对高阶的多项式进行拟合,为了 ...

  7. glibc中malloc的详细解释_转

    glibc中的malloc实现: The main properties of the algorithms are:* For large (>= 512 bytes) requests, i ...

  8. cmd /c和cmd /k 解释,附★CMD命令★ 大全

    cmd /c和cmd /k http://leaning.javaeye.com/blog/380810 java的Runtime.getRuntime().exec(commandStr)可以调用执 ...

  9. c语言_文件操作_FILE结构体解释_涉及对操作系统文件FCB操作的解释_

    1. 文件和流的关系 C将每个文件简单地作为顺序字节流(如下图).每个文件用文件结束符结束,或者在特定字节数的地方结束,这个特定的字节数可以存储在系统维护的管理数据结构中.当打开文件时,就建立了和文件 ...

随机推荐

  1. nginx的压缩、https加密实现、rewrite、常见盗链配置

    Nginx 压缩功能 ngx_http_gzip_module #ngx_http_gzip_module 用gzip方法压缩响应数据,节约带宽 #启用或禁用gzip压缩,默认关闭 gzip on | ...

  2. 使用Junit测试框架学习Java

    前言 在日常的开发中,离不开单元测试,而且在学习Java时,特别是在测试不同API使用时要不停的写main方法,显得很繁琐,所以这里介绍使用Junit学习Java的方法.此外,我使用log4j将结果输 ...

  3. 公用表表达式(CTE) with as

    在编写T-SQL代码时,往往需要临时存储某些结果集.前面我们已经广泛使用和介绍了两种临时存储结果集的方法:临时表和表变量.除此之外,还可以使用公用表表达式的方法.公用表表达式(Common Table ...

  4. 用js刷剑指offer(把数组排成最小的数)

    题目描述 输入一个正整数数组,把数组里所有数字拼接起来排成一个数,打印能拼接出的所有数字中最小的一个.例如输入数组{3,32,321},则打印出这三个数字能排成的最小数字为321323. 思路 对ve ...

  5. 介绍一个二次排序的小技巧(best coder27期1001jump jump jump)

    先来描述一下问题: 问题描述 有n小孩在比赛跳远,看谁跳的最远.每个小孩可以跳3次,这个小孩的成绩就是三次距离里面的最大值.例如,一个小孩跳3次的距离分别时10, 30和20,那么这个小孩的成绩就是3 ...

  6. 题解 洛谷P4872 【OIer们的东方梦】

    一道码量比较大的广搜题,但让我这个辣鸡小学生自闭了一天呜呜呜. 一开始看数据\(n,m \leq 1000\)也并不是特别大,于是用就开始用广搜乱水了. 由于这道题每走一步的代价不是\(1\),所以并 ...

  7. 《The One!团队》:BETA Scrum metting1

    项目 内容 作业所属课程 所属课程 作业要求 作业要求 团队名称 < The One !> 作业学习目标 (1)掌握软件黑盒测试技术:(2)学会编制软件项目总结PPT.项目验收报告:(3) ...

  8. CH6303 天天爱跑步

    6303 天天爱跑步 0x60「图论」例题 描述 小C同学认为跑步非常有趣,于是决定制作一款叫作<天天爱跑步>的游戏.<天天爱跑步>是一个养成类游戏,需要玩家每天按时上线,完成 ...

  9. C#向数据库中插入或更新null空值

    一.在SQL语句中直接插入null或空字符串“” int? item = null; item == null ? "null" : item.ToString(); item = ...

  10. HashMap的个别方法

    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // 16 默认初始容量 16 static final int MAXIMUM_C ...