lightgbm

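Histogram algorithm
- Discretizes continuous floating-point feature values into k bins and builds a histogram of width k; split points are then searched over the bins instead of over every distinct value
Leaf-wise growth strategy
- At each step, splits the leaf with the largest split gain among all current leaves, which is usually also the leaf holding the most data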
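
To make the histogram idea concrete, here is a minimal pure-Python sketch of histogram-based split finding for a single feature. It is an illustration only, not LightGBM's implementation: the function name `histogram_best_split`, the equal-width binning, the squared-error gain formula, and the regularization term `lam` are all assumptions made for this example.

```python
def histogram_best_split(values, grads, k=16, lam=1.0):
    """Bin one feature into k equal-width buckets and scan the k-1 bucket
    boundaries for the split with the largest gain (squared-error loss)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant feature
    g_hist = [0.0] * k  # sum of gradients per bucket
    n_hist = [0] * k    # sample count per bucket (the hessian is 1 for squared error)
    for v, g in zip(values, grads):
        b = min(int((v - lo) / width), k - 1)  # the maximum value goes into the last bucket
        g_hist[b] += g
        n_hist[b] += 1

    g_total, n_total = sum(g_hist), sum(n_hist)

    def score(g, n):
        return g * g / (n + lam)

    best_gain, best_threshold = 0.0, None
    g_left, n_left = 0.0, 0
    for b in range(k - 1):
        # Accumulate buckets 0..b on the left side; the rest go to the right
        g_left += g_hist[b]
        n_left += n_hist[b]
        gain = (score(g_left, n_left)
                + score(g_total - g_left, n_total - n_left)
                - score(g_total, n_total))
        if gain > best_gain:
            best_gain = gain
            best_threshold = lo + (b + 1) * width  # upper edge of bucket b
    return best_threshold, best_gain


# Toy data: the gradient changes sign at value 50
values = [float(i) for i in range(100)]
grads = [-1.0 if v < 50 else 1.0 for v in values]
threshold, gain = histogram_best_split(values, grads)
print(threshold, round(gain, 2))
```

Only k-1 candidate thresholds are evaluated per feature, regardless of how many distinct values the feature has, which is where the speed-up comes from.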

Parameters
- num_leaves: number of leaf nodes; should be <= 2^(max_depth)
- max_depth: maximum depth of a tree ^[1]
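
The interaction between leaf-wise growth, num_leaves and max_depth can be sketched with a priority queue. This is a toy model under stated assumptions, not LightGBM code: `grow_leaf_wise` and the `child_gains` callback are hypothetical names, the root is taken to be at depth 0, and the gains are made up.

```python
import heapq


def grow_leaf_wise(root_gain, child_gains, num_leaves, max_depth):
    """Always split the current leaf with the largest gain.

    child_gains(gain, depth) is a hypothetical callback that returns the
    split gains of the two children created by a split.
    """
    # heapq is a min-heap, so gains are stored negated
    heap = [(-root_gain, 0, 0)]  # (-gain, depth, leaf id)
    next_id, leaves, split_order = 1, 1, []
    while heap and leaves < num_leaves:
        neg_gain, depth, leaf_id = heapq.heappop(heap)
        if depth >= max_depth or -neg_gain <= 0:
            continue  # this leaf cannot be split any further
        split_order.append(leaf_id)
        leaves += 1  # one leaf is replaced by two
        for gain in child_gains(-neg_gain, depth):
            heapq.heappush(heap, (-gain, depth + 1, next_id))
            next_id += 1
    return leaves, split_order


def toy_child_gains(gain, depth):
    # Made-up rule: children have 50% and 40% of the parent's gain
    return gain * 0.5, gain * 0.4


# max_depth caps the tree at 2**3 = 8 leaves even though num_leaves allows 31
capped_leaves, _ = grow_leaf_wise(1.0, toy_child_gains, num_leaves=31, max_depth=3)
# num_leaves stops growth first when the depth limit is loose
few_leaves, order = grow_leaf_wise(1.0, toy_child_gains, num_leaves=5, max_depth=10)
print(capped_leaves, few_leaves, order)
```

A num_leaves value above 2^(max_depth) can never be reached, which is why the bound above is worth respecting when both parameters are set.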

Parameter tuning

hyperopt: searches for the best hyperparameters automatically.

	pip install hyperopt

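	import hyperopt
	import lightgbm as lgb

	def hyperopt_objective(params):
	    # train_matrix is the lgb.Dataset built in step 6; a plain dict replaces LGBMRegressor(...).get_params()
	    cv_params = {"objective": "binary", "metric": "auc", "num_leaves": 31,
	                 "learning_rate": 0.1, "max_depth": int(params["max_depth"])}
	    # hp.randint can return 0, so force at least one boosting round
	    n_rounds = max(1, int(params["n_estimators"]))
	    # lgb.early_stopping needs a recent LightGBM; older releases take early_stopping_rounds=10 instead
	    res = lgb.cv(cv_params, train_matrix, num_boost_round=n_rounds, nfold=5,
	                 callbacks=[lgb.early_stopping(10)])
	    # the result key is "auc-mean" in LightGBM 3.x and "valid auc-mean" in 4.x
	    auc_key = next(k for k in res if k.endswith("auc-mean"))
	    return -max(res[auc_key])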

This defines the objective function hyperopt_objective. Because fmin minimizes its objective, the function returns -AUC.
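
As a reminder of what is being negated, the sketch below computes AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive sample gets the higher score. The function name `auc` and the toy labels and scores are assumptions for illustration; this quadratic-time version is only meant for small examples, not as a replacement for a library metric.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
value = auc(labels, scores)
loss = -value  # a minimizer that lowers this loss raises the AUC
print(value, loss)
```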



	params_space={
	    "n_estimators":hyperopt.hp.randint("n_estimators",300),
	    "max_depth":hyperopt.hp.randint("max_depth",8)
	}

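Define the search space:

 hp.uniform(label,low,high): the parameter is uniformly distributed between low and high;

 hp.quniform(label,low,high,q): the parameter takes the value round(uniform(low,high)/q)*q, which suits discrete-valued parameters;

 hp.randint(label,upper): returns a random integer in the half-open interval [0,upper).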
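
The three definitions above can be emulated with the standard library to see what kind of values each one produces. This is a stand-in written for illustration, not hyperopt's own code: the helper names `uniform`, `quniform` and `randint` and the sample ranges are assumptions.

```python
import random


def uniform(rng, low, high):
    # Real number drawn uniformly between low and high
    return rng.uniform(low, high)


def quniform(rng, low, high, q):
    # round(uniform(low, high) / q) * q: values land on multiples of q
    return round(rng.uniform(low, high) / q) * q


def randint(rng, upper):
    # Random integer in the half-open interval [0, upper)
    return rng.randrange(upper)


rng = random.Random(0)
depths = [randint(rng, 8) for _ in range(1000)]
rounds = [quniform(rng, 50, 300, 10) for _ in range(1000)]
rates = [uniform(rng, 0.01, 0.3) for _ in range(1000)]
print(sorted(set(depths)), min(rounds), max(rounds))
```

Note that the randint emulation can return 0, so a parameter that must be at least 1 needs an offset or a guard in the objective.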



	trials=hyperopt.Trials()
	best=hyperopt.fmin(hyperopt_objective,space=params_space,algo=hyperopt.tpe.suggest,
	                   max_evals=10,trials=trials)

Run the search over the search space.
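
The shape of such a search loop can be sketched without hyperopt. The version below uses plain random search, which is a deliberately simpler technique than the TPE algorithm selected above; `random_search_min`, `toy_objective` and `sample_params` are hypothetical names, and the toy score is invented so the example runs without any data.

```python
import random


def random_search_min(objective, sample_params, max_evals, seed=0):
    """Evaluate max_evals random parameter sets and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    trials = []  # history of (params, loss), similar in spirit to a Trials object
    for _ in range(max_evals):
        params = sample_params(rng)
        loss = objective(params)
        trials.append((params, loss))
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss, trials


def toy_objective(params):
    # Invented score that peaks at max_depth=6, n_estimators=240
    score = (1.0
             - 0.01 * abs(params["max_depth"] - 6)
             - 0.001 * abs(params["n_estimators"] - 240))
    return -score  # negate: the search minimizes


def sample_params(rng):
    return {"n_estimators": rng.randrange(300), "max_depth": rng.randrange(8)}


best, best_loss, trials = random_search_min(toy_objective, sample_params, max_evals=10)
print(best, best_loss)
```

TPE differs in that it uses the losses of earlier trials to decide where to sample next, instead of sampling every trial independently.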

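Alibaba Tianchi competition: Financial Risk Control - Loan Default Prediction
1. Import packages

	import pandas as pd
	import numpy as np
	import matplotlib.pyplot as plt
	import seaborn as sns
	from IPython.core.interactiveshell import InteractiveShell
	InteractiveShell.ast_node_interactivity = "all"  # display the output of every expression in a cell, not only the last one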

2. Load the data

	train=pd.read_csv("/风控/train (1).csv")
	train.head()  # shows the first five rows by default
	train.shape
	train.columns  # inspect the column names

3. Separate discrete and continuous variables

	numerical_columns=['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership',
	    'annualIncome', 'verificationStatus',
	    'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
	    'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
	    'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
	    'initialListStatus', 'applicationType', 'title',
	    'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
	    'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
	def featype(data,feature):
	    # treat a numeric column as continuous when it has more than 10 distinct values
	    numerical_continus=[]
	    numerical_discrete=[]
	    for fea in feature:
	        count=data[fea].nunique()  # number of distinct values
	        if count>10:
	            numerical_continus.append(fea)
	        else:
	            numerical_discrete.append(fea)
	    return numerical_continus,numerical_discrete
	numerical_continus,numerical_discrete=featype(train,numerical_columns)
	numerical_continus
	numerical_discrete

4. Encode string variables with apply

	employmentlength_dict={"10+ years":10,"2 years":2,"< 1 year":0,"1 year":1,"5 years":5,"4 years":4,"6 years":6,"8 years":8,"7 years":7,"9 years":9,"3 years":3}
	def func4(m):
	    # map employment-length strings to integers; values missing from the dict (such as NaN) pass through unchanged
	    m["employmentlength_dict"]=m["employmentLength"].apply(lambda x:x if x not in employmentlength_dict else employmentlength_dict[x])
	    return m
	train=func4(train)



	def func(x):
	    # earliesCreditLine is formatted as "<Mon>-<year>"; convert it to a month count (year*12+month)
	    month,year=x.split("-")
	    month_dict={"Aug":8,"Nov":11,"Feb":2,"Jan":1,"Mar":3,"Jul":7,"Oct":10,"Jun":6,"Apr":4,"Sep":9,"May":5,"Dec":12}
	    earlistdate=int(year)*12+month_dict[month]
	    return earlistdate
	train["earlistdate"]=train["earliesCreditLine"].apply(func)
	def func2(x):
	    # issueDate is formatted as "<year>-<month>-<day>"; put it on the same month scale
	    year,month,day=x.split("-")
	    final_date=int(year)*12+int(month)
	    return final_date
	train["issueDate_dict"]=train["issueDate"].apply(func2)

	# subGrade and grade have dtype object, so they need numeric codes first (pandas >= 2.0 raises here unless numeric_only=True is passed)
	train[["subGrade","interestRate","grade"]].corr()
	def func3(x):
	    # encode subGrade and grade as integers that follow their sorted order
	    tmp=x[["subGrade"]].sort_values(["subGrade"]).drop_duplicates()
	    tmp["subgrade_dict"]=range(len(tmp))
	    x=x.merge(tmp,on="subGrade",how="left")
	    tmp1=x[["grade"]].sort_values(["grade"]).drop_duplicates()
	    tmp1["grade_dict"]=range(len(tmp1))
	    x=x.merge(tmp1,on="grade",how="left")
	    return x
	train=func3(train)
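
The encodings in this step can be checked by hand on single values. The sketch below repeats the same arithmetic with the standard library only; the helper names and the sample strings "Aug-2001" and "2014-07-01" are assumptions chosen to match the formats the code above splits on, and the final subtraction is just one illustration of why a shared month scale is convenient.

```python
MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}


def month_index_from_credit_line(text):
    """'Aug-2001' -> 2001 * 12 + 8"""
    month, year = text.split("-")
    return int(year) * 12 + MONTHS[month]


def month_index_from_issue_date(text):
    """'2014-07-01' -> 2014 * 12 + 7"""
    year, month, _day = text.split("-")
    return int(year) * 12 + int(month)


def ordinal_codes(values):
    """Sorted distinct values -> 0, 1, 2, ... (the effect of the merge-based encoding)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


earliest = month_index_from_credit_line("Aug-2001")
issued = month_index_from_issue_date("2014-07-01")
history_months = issued - earliest  # both values are on the same month scale
grade_codes = ordinal_codes(["B", "A", "C", "A"])
print(earliest, issued, history_months, grade_codes)
```

Because the codes come from the sorted distinct values of whichever table is passed in, encoding two tables separately only yields matching codes when both contain the same set of categories.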

5. Fill missing values: mode for discrete variables, median for continuous variables

	# share of missing values per column
	d=(train.isnull().sum()/train.shape[0]).to_dict()
	d
	(train.isnull().sum()/train.shape[0]).plot.bar()

	train[numerical_continus]=train[numerical_continus].fillna(value=train[numerical_continus].median())
	# mode() returns a DataFrame; take its first row so that fillna matches by column name
	train[numerical_discrete]=train[numerical_discrete].fillna(value=train[numerical_discrete].mode().iloc[0])
	category=["employmentlength_dict","subgrade_dict","grade_dict","issueDate_dict",'earlistdate']
	train[category]=train[category].fillna(train[category].mode().iloc[0])
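
The filling rule itself is easy to verify on a plain list. This is a standard-library sketch of the rule stated in the heading, with `None` standing in for a missing value; `fill_missing` is a hypothetical helper, not part of pandas.

```python
from statistics import median, mode


def fill_missing(column, is_discrete):
    """Fill None with the mode (discrete column) or the median (continuous column)."""
    observed = [v for v in column if v is not None]
    fill_value = mode(observed) if is_discrete else median(observed)
    return [fill_value if v is None else v for v in column]


discrete_filled = fill_missing([1, None, 1, 0], is_discrete=True)
continuous_filled = fill_missing([10.0, None, 30.0, 50.0], is_discrete=False)
print(discrete_filled, continuous_filled)
```

The median is preferred over the mean for continuous columns because it is less sensitive to extreme values such as very large incomes.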

6. Parameter optimization

	import lightgbm as lgb
	from sklearn.model_selection import train_test_split
	x=train.drop(["grade","subGrade","employmentLength","issueDate","earliesCreditLine","isDefault"],axis=1)
	y=train["isDefault"]
	# split into training and validation sets
	x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2)
	train_matrix = lgb.Dataset(x_train, label=y_train)
	valid_matrix = lgb.Dataset(x_val, label=y_val, reference=train_matrix)  # reuse the training set's bin boundaries



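	import hyperopt
	def hyperopt_objective(params):
	    # a plain dict replaces LGBMRegressor(...).get_params(); the remaining values are LightGBM defaults
	    cv_params = {"objective": "binary", "metric": "auc", "num_leaves": 31,
	                 "learning_rate": 0.1, "max_depth": int(params["max_depth"])}
	    # hp.randint can return 0, so force at least one boosting round
	    n_rounds = max(1, int(params["n_estimators"]))
	    # lgb.early_stopping needs a recent LightGBM; older releases take early_stopping_rounds=10 instead
	    res = lgb.cv(cv_params, train_matrix, num_boost_round=n_rounds, nfold=5,
	                 callbacks=[lgb.early_stopping(10)])
	    # the result key is "auc-mean" in LightGBM 3.x and "valid auc-mean" in 4.x
	    auc_key = next(k for k in res if k.endswith("auc-mean"))
	    return -max(res[auc_key])

	params_space={
	    "n_estimators":hyperopt.hp.randint("n_estimators",300),
	    "max_depth":hyperopt.hp.randint("max_depth",8)
	}
	trials=hyperopt.Trials()
	best=hyperopt.fmin(hyperopt_objective,space=params_space,algo=hyperopt.tpe.suggest,max_evals=10,trials=trials)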



	print(best)

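	{'max_depth': 6, 'n_estimators': 237}

7. Train the model

	params = {
	    'boosting_type': 'gbdt',
	    'objective': 'binary',
	    'learning_rate': 0.1,  # a smaller learning rate needs a larger number of trees
	    'metric': 'auc',
	    'num_leaves': 31,
	    'max_depth': 6,  # maximum tree depth, limits overfitting
	    'feature_fraction': 1,  # use all features for every tree
	    'bagging_fraction': 1,
	}
	# Train on the training set. The tuned n_estimators=237 is passed as num_boost_round:
	# when n_estimators sits inside params it overrides the num_boost_round argument.
	model = lgb.train(params, train_set=train_matrix, valid_sets=[valid_matrix], num_boost_round=237,
	                  # needs a recent LightGBM; older releases take verbose_eval=1000, early_stopping_rounds=200 instead
	                  callbacks=[lgb.log_evaluation(1000), lgb.early_stopping(200)])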
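
The early-stopping rule used during training can be sketched independently of LightGBM: keep the best validation score seen so far and stop once it has not improved for a given number of rounds. The function name `early_stopping_round` and the list of scores are invented for this example.

```python
def early_stopping_round(scores, stopping_rounds):
    """scores: validation metric after each boosting round (higher is better).
    Returns (best_round, best_score) with rounds counted from 1."""
    best_round, best_score = 0, float("-inf")
    for round_no, score in enumerate(scores, start=1):
        if score > best_score:
            best_round, best_score = round_no, score
        elif round_no - best_round >= stopping_rounds:
            break  # no improvement for stopping_rounds consecutive rounds
    return best_round, best_score


scores = [0.70, 0.72, 0.73, 0.729, 0.728, 0.731, 0.730, 0.729, 0.728, 0.74]
patient = early_stopping_round(scores, stopping_rounds=3)
impatient = early_stopping_round(scores, stopping_rounds=2)
print(patient, impatient)
```

In this toy run the later score of 0.74 is never reached with either setting, which shows the trade-off: a small patience saves time but can stop before a late improvement.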



8. Predict

	test=pd.read_csv("C:/Users/廖言/Desktop/新建文件夹/努力学习天天向上/风控/testA.csv")
	test.head()
	# preprocess the test set the same way as the training set
	test=func4(test)
	test["earlistdate"]=test["earliesCreditLine"].apply(func)
	test["issueDate_dict"]=test["issueDate"].apply(func2)
	test=func3(test)
	# fill missing values; mode() returns a DataFrame, so take its first row
	test[numerical_continus]=test[numerical_continus].fillna(value=test[numerical_continus].median())
	test[numerical_discrete]=test[numerical_discrete].fillna(value=test[numerical_discrete].mode().iloc[0])
	test[category]=test[category].fillna(test[category].mode().iloc[0])
	test2=test.drop(["grade","subGrade","employmentLength","issueDate","earliesCreditLine"],axis=1)
	# predict; with objective 'binary' the output is the predicted probability of default
	test["isDefault"]=model.predict(test2)
	# write the result
	test.to_csv("C:/Users/廖言/Desktop/新建文件夹/努力学习天天向上/风控/output3.csv")

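The depth of a tree is the number of nodes on the path from the root to a leaf; a leaf is a node with no left or right child.

- n_estimators: number of boosting iterations, equal to the number of decision trees

- bagging_fraction: fraction of the data used in each iteration; a smaller fraction speeds up training and reduces overfitting

- feature_fraction: fraction of the features used in each iteration

- min_data_in_leaf: minimum number of samples in each leaf; setting it to a larger value helps against overfitting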
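
Put together, these parameters could appear in a parameter dict like the one below. The numeric values for bagging_fraction, feature_fraction and min_data_in_leaf are illustrative assumptions, not tuned results, and `check_params` is a hypothetical helper added only to make the constraints explicit. One detail from the LightGBM documentation is worth encoding: bagging_fraction only takes effect when bagging_freq is set to a non-zero value.

```python
params = {
    "objective": "binary",
    "metric": "auc",
    "num_leaves": 31,
    "max_depth": 6,
    "bagging_fraction": 0.8,   # illustrative: use 80% of the rows per bagging round
    "bagging_freq": 1,         # re-sample every iteration; 0 would disable bagging
    "feature_fraction": 0.8,   # illustrative: use 80% of the features per tree
    "min_data_in_leaf": 50,    # illustrative: larger values make leaves more conservative
}


def check_params(p):
    """Return a list of human-readable problems found in a parameter dict."""
    problems = []
    for name in ("bagging_fraction", "feature_fraction"):
        if not 0 < p[name] <= 1:
            problems.append(f"{name} must be in (0, 1]")
    if p["max_depth"] > 0 and p["num_leaves"] > 2 ** p["max_depth"]:
        problems.append("num_leaves exceeds 2^max_depth")
    if p["bagging_fraction"] < 1 and p.get("bagging_freq", 0) == 0:
        problems.append("bagging_fraction has no effect while bagging_freq is 0")
    return problems


print(check_params(params))
```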
