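The data used in this example can be downloaded from this link (extraction code: 3y90); a description of the data is available on this web page. The "Model tuning" part of this example follows the steps from this blog post.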

Data preprocessing

  1. Load the data

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    ### Load data
    ### Split the data to train and test sets
    data = pd.read_csv('data/loan/Train.csv', encoding = "ISO-8859-1")
    train, test = train_test_split(data,train_size=0.7,random_state=123,stratify=data['Disbursed'])
    ### Check number of nulls in each feature column
    nulls_per_column = train.isnull().sum()
    print(nulls_per_column)
  2. Split the features into numerical and categorical
    ### Drop the useless columns
    train_1 = train.drop(['ID','Lead_Creation_Date','LoggedIn'],axis=1)
    ### Split the columns to numerical and categorical
    category_cols = train_1.columns[train_1.dtypes==object].tolist()
    category_cols.remove('DOB')
    category_cols.append('Var4')
    numeric_cols = list(set(train_1.columns)-set(category_cols))
  3. Explore and process the categorical features
    ### explore the categorical columns
    for v in category_cols:
        print('Ratio of missing value for variable {0}: {1}'.format(v,nulls_per_column[v]/train_1.shape[0]))
    print('-----------------------------------------------------------')
    counts = dict()
    for v in category_cols:
        print('\nFrequency count for variable %s'%v)
        counts[v] = train_1[v].value_counts()
        print(counts[v])
    ### merge the cities that counts<200
    merge_city = [c for c in counts['City'].index if counts['City'][c]<200]
    train_1['City'] = train_1['City'].apply(lambda x: 'others' if x in merge_city else x)
    ### merge the salary accounts that counts<100
    merge_sa = [c for c in counts['Salary_Account'].index if counts['Salary_Account'][c]<100]
    train_1['Salary_Account'] = train_1['Salary_Account'].apply(lambda x: 'others' if x in merge_sa else x)
    ### merge the sources that counts<100
    merge_sr = [c for c in counts['Source'].index if counts['Source'][c]<100]
    train_1['Source'] = train_1['Source'].apply(lambda x: 'others' if x in merge_sr else x)
    ### impute the missing value
    train_1['City'] = train_1['City'].fillna('Missing')
    train_1['Salary_Account'] = train_1['Salary_Account'].fillna('Missing')
    ### delete the column Employer_Name since too many categories
    train_2 = train_1.drop('Employer_Name',axis=1)
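The rare-category merge used above for City, Salary_Account and Source can be illustrated without pandas. The following is only a minimal sketch of the idea: the helper name `merge_rare` and the toy city list are assumptions made for illustration and are not taken from Train.csv. As in the step above, missing values (here `None`) are left untouched so that they can be filled with 'Missing' afterwards.

```python
from collections import Counter

def merge_rare(values, min_count, other="others"):
    """Replace categories seen fewer than min_count times with `other`.

    None stands in for a missing value and is left untouched.
    """
    counts = Counter(v for v in values if v is not None)
    return [other if v is not None and counts[v] < min_count else v
            for v in values]

# Hypothetical toy data, not taken from Train.csv
cities = ["Delhi", "Delhi", "Delhi", "Pune", "Agra", None]
merged = merge_rare(cities, min_count=2)
print(merged)  # ['Delhi', 'Delhi', 'Delhi', 'others', 'others', None]
```

Merging before imputing keeps the 'Missing' level separate from 'others', which is the same order the step above follows.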
  4. Explore and process the numerical features
    ### Explore the numerical columns
    for v in numeric_cols:
        print('Ratio of missing value for variable {0}: {1}'.format(v,nulls_per_column[v]/train_2.shape[0]))
    print('-----------------------------------------------------------')
    for v in numeric_cols:
        print('\nStatistical summary for variable %s'%v)
        print(train_2[v].describe())
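    ### Create Age column: 118 - yy equals 2018 - (1900 + yy), i.e. the age in 2018,
    ### assuming DOB ends with a two-digit birth year in the 1900s
    train_2['Age'] = train_2['DOB'].apply(lambda x: 118 - int(x[-2:]))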
    ### High proportion missing so create a new variable stating whether this is missing or not:
    train_2['Loan_Amount_Submitted_Missing'] = train_2['Loan_Amount_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
    train_2['Loan_Tenure_Submitted_Missing'] = train_2['Loan_Tenure_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
    train_2['EMI_Loan_Submitted_Missing'] = train_2['EMI_Loan_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
    train_2['Interest_Rate_Missing'] = train_2['Interest_Rate'].apply(lambda x: 1 if pd.isnull(x) else 0)
    train_2['Processing_Fee_Missing'] = train_2['Processing_Fee'].apply(lambda x: 1 if pd.isnull(x) else 0)
    ### Impute the missing value
    train_2['Existing_EMI'].fillna(train_2['Existing_EMI'].median(), inplace=True)
    train_2['Loan_Amount_Applied'].fillna(train_2['Loan_Amount_Applied'].median(),inplace=True)
    train_2['Loan_Tenure_Applied'].fillna(train_2['Loan_Tenure_Applied'].median(),inplace=True)
    ### Drop original columns
    train_3 = train_2.drop(['DOB','Loan_Amount_Submitted','Loan_Tenure_Submitted','EMI_Loan_Submitted', \
    'Interest_Rate','Processing_Fee'],axis=1)
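The two treatments of missing numerical values used above (a 0/1 "missing" indicator for columns with a high missing ratio, and median imputation for the rest) can be sketched in plain Python. This is an illustrative sketch only: the helper names `missing_flag` and `fill_with_median` and the toy values are assumptions, with `None` standing in for NaN.

```python
from statistics import median

def missing_flag(values):
    """Return 1 where the value is missing (None), otherwise 0."""
    return [1 if v is None else 0 for v in values]

def fill_with_median(values):
    """Replace missing values (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    m = median(observed)
    return [m if v is None else v for v in values]

# Hypothetical toy columns, not taken from Train.csv
interest_rate = [13.5, None, None, 18.0, None]
existing_emi = [0.0, 2500.0, None, 1200.0]

flags = missing_flag(interest_rate)
filled = fill_with_median(existing_emi)
print(flags)   # [0, 1, 1, 0, 1]
print(filled)  # [0.0, 2500.0, 1200.0, 1200.0]
```

The indicator keeps the information that a value was absent, which is why the original columns can be dropped afterwards.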
  5. One-Hot encoding
    from sklearn.preprocessing import LabelEncoder
    dropped_columns = ['ID','Lead_Creation_Date','LoggedIn','Employer_Name','DOB','Loan_Amount_Submitted', \
    'Loan_Tenure_Submitted','EMI_Loan_Submitted','Interest_Rate','Processing_Fee']
    le = LabelEncoder()
    var_to_encode = list(set(category_cols)-set(dropped_columns))
    for col in var_to_encode:
        train_3[col] = le.fit_transform(train_3[col])
    ### pd.get_dummies can also be used directly without LabelEncoder
    train_3 = pd.get_dummies(train_3, columns=var_to_encode)
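One-hot encoding turns a categorical column with k distinct values into k indicator columns, each marking one value. The sketch below shows that mapping in plain Python; the helper name `one_hot` and the toy column are assumptions for illustration, and the function is not a substitute for `pd.get_dummies`.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 indicator per category for each value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical toy column, not taken from Train.csv
gender = ["Male", "Female", "Male"]
categories, rows = one_hot(gender)
print(categories)  # ['Female', 'Male']
print(rows)        # [[0, 1], [1, 0], [0, 1]]
```

Each row contains exactly one 1, so no artificial ordering between categories is introduced, unlike integer label encoding on its own.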

Model tuning

  1. Build a baseline model and use early stopping to set the number of boosting rounds

    import xgboost as xgb
    import matplotlib.pyplot as plt
    from sklearn import metrics
    ### base model
    target = 'Disbursed'
    predictors = [x for x in train_3.columns if x!=target]
    xgb1 = xgb.XGBClassifier(learning_rate=0.1, n_estimators=1000, max_depth=5, min_child_weight=1, gamma=0, \
    subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, seed=27)
    ### use early_stop in xgb.cv
    def get_n_estimators(alg, dtrain, predictors, target, cv_folds=5, early_stopping_rounds=50):
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors], label=dtrain[target])
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                          metrics='auc', early_stopping_rounds=early_stopping_rounds, stratified=True)
        # With early stopping, cvresult holds one row per retained boosting round
        alg.set_params(n_estimators=cvresult.shape[0])
        # Print model report
        print("\nModel Report")
        print("Set n_estimators to {0}".format(cvresult.shape[0]))
        print(cvresult.tail(1)['test-auc-mean'])
        # Fit the algorithm on the data (eval_metric is omitted: it only matters when an eval_set is passed)
        alg.fit(dtrain[predictors], dtrain[target])
        # Feature importance
        feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
        feat_imp.plot(kind='bar', title='Feature Importances', figsize=(20,6))
        plt.ylabel('Feature Importance Score')
    ### get n_estimators
    get_n_estimators(xgb1, train_3, predictors, target)
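The early-stopping rule that `get_n_estimators` relies on can be sketched independently of XGBoost: keep adding rounds while the cross-validated metric improves, and stop once it has not improved for a given number of rounds. The function below is a simplified illustration, not XGBoost's implementation; the name `best_round`, the `patience` parameter and the score list are assumptions.

```python
def best_round(scores, patience):
    """Return how many rounds to keep (1-based) when maximizing `scores`.

    Scanning stops once `patience` consecutive rounds fail to beat the best score.
    """
    best_idx = 0
    for i, score in enumerate(scores):
        if score > scores[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break
    return best_idx + 1

# Hypothetical per-round mean test AUC values
auc_per_round = [0.80, 0.82, 0.83, 0.829, 0.828, 0.827, 0.84]
print(best_round(auc_per_round, patience=3))  # 3
print(best_round(auc_per_round, patience=5))  # 7
```

With a patience of 3 the late improvement at round 7 is never reached, which shows why a generous `early_stopping_rounds` (50 in the code above) is a reasonable default when the metric is noisy.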
  2. Tune max_depth and min_child_weight
    from sklearn.model_selection import GridSearchCV
    ### optimal: {'max_depth':5,'min_child_weight':5}
    param_test1 = {'max_depth':range(3,10,2),'min_child_weight':range(1,6,2)}
    alg = xgb.XGBClassifier(learning_rate=0.1, n_estimators=141, max_depth=5, min_child_weight=1, gamma=0, \
    subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, seed=27)
    gsearch1 = GridSearchCV(estimator = alg, param_grid = param_test1, scoring='roc_auc', n_jobs=4, cv=5)
    gsearch1.fit(train_3[predictors],train_3[target])
    print(gsearch1.best_params_)
    print(gsearch1.best_score_)
    ### optimal: {'max_depth':4,'min_child_weight':6}
    param_test2 = {'max_depth':[4,5,6],'min_child_weight':[4,5,6]}
    alg = xgb.XGBClassifier(learning_rate=0.1, n_estimators=141, max_depth=5, min_child_weight=5, gamma=0, \
    subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, seed=27)
    gsearch2 = GridSearchCV(estimator = alg, param_grid = param_test2, scoring='roc_auc', n_jobs=4, cv=5)
    gsearch2.fit(train_3[predictors],train_3[target])
    print(gsearch2.best_params_)
    print(gsearch2.best_score_)
    ### optimal: {'min_child_weight':6}
    param_test2b = {'min_child_weight':[6,8,10,12]}
    alg = xgb.XGBClassifier(learning_rate=0.1, n_estimators=141, max_depth=4, min_child_weight=6, gamma=0, \
    subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, seed=27)
    gsearch2b = GridSearchCV(estimator = alg, param_grid = param_test2b, scoring='roc_auc', n_jobs=4, cv=5)
    gsearch2b.fit(train_3[predictors],train_3[target])
    print(gsearch2b.best_params_)
    print(gsearch2b.best_score_)
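The coarse-to-fine pattern used in this step (a wide grid first, then a narrower grid around the best point) can be sketched with `itertools.product`. This is a simplified stand-in for `GridSearchCV`, not its implementation: `grid_search` is a hypothetical helper, and `toy_score` is an invented function that replaces the cross-validated AUC so that the example runs without data.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Evaluate score_fn on every parameter combination and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for the cross-validated AUC (assumption, not real results)
def toy_score(max_depth, min_child_weight):
    return 0.85 - 0.001 * (max_depth - 4.4) ** 2 - 0.0005 * (min_child_weight - 6) ** 2

# Coarse grid, same ranges as param_test1
coarse = {"max_depth": range(3, 10, 2), "min_child_weight": range(1, 6, 2)}
coarse_best, _ = grid_search(toy_score, coarse)
print(coarse_best)  # {'max_depth': 5, 'min_child_weight': 5}

# Fine grid around the coarse optimum, same values as param_test2
fine = {"max_depth": [4, 5, 6], "min_child_weight": [4, 5, 6]}
fine_best, _ = grid_search(toy_score, fine)
print(fine_best)  # {'max_depth': 4, 'min_child_weight': 6}
```

When the fine optimum lands on the edge of its grid, as min_child_weight does here, it is worth extending the range in that direction, which is what param_test2b does.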
  3. Tune gamma
    ### optimal: {'gamma':0.2}
    param_test3 = {'gamma':[i/10.0 for i in range(0,5)]}
    alg = xgb.XGBClassifier(learning_rate=0.1, n_estimators=141, max_depth=4, min_child_weight=6, gamma=0, \
    subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, seed=27)
    gsearch3 = GridSearchCV(estimator = alg, param_grid = param_test3, scoring='roc_auc', n_jobs=4, cv=5)
    gsearch3.fit(train_3[predictors],train_3[target])
    print(gsearch3.best_params_)
    print(gsearch3.best_score_)
    ### get n_estimators
    xgb2 = xgb.XGBClassifier(learning_rate=0.1, n_estimators=1000, max_depth=4, min_child_weight=6, gamma=0.2, \
    subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, seed=27)
    get_n_estimators(xgb2, train_3, predictors, target)
  4. Tune subsample and colsample_bytree
    ### optimal: {'colsample_bytree': 0.7, 'subsample': 0.7}
    param_test4 = {'subsample':[i/10.0 for i in range(6,11)], 'colsample_bytree':[i/10.0 for i in range(6,11)]}
    alg = xgb.XGBClassifier(learning_rate=0.1, n_estimators=142, max_depth=4, min_child_weight=6, gamma=0.2, \
    subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, seed=27)
    gsearch4 = GridSearchCV(estimator = alg, param_grid = param_test4, scoring='roc_auc', n_jobs=4, cv=5)
    gsearch4.fit(train_3[predictors],train_3[target])
    print(gsearch4.best_params_)
    print(gsearch4.best_score_)
    ### optimal: {'colsample_bytree': 0.75, 'subsample': 0.7}
    param_test5 = {'subsample':[i/100.0 for i in range(65,80,5)], 'colsample_bytree':[i/100.0 for i in range(65,80,5)]}
    alg = xgb.XGBClassifier(learning_rate=0.1, n_estimators=142, max_depth=4, min_child_weight=6, gamma=0.2, \
    subsample=0.7, colsample_bytree=0.7, objective= 'binary:logistic', nthread=4, seed=27)
    gsearch5 = GridSearchCV(estimator = alg, param_grid = param_test5, scoring='roc_auc', n_jobs=4, cv=5)
    gsearch5.fit(train_3[predictors],train_3[target])
    print(gsearch5.best_params_)
    print(gsearch5.best_score_)
  5. Tune reg_alpha
    ### optimal: {'reg_alpha': 0.01}
    param_test6 = {'reg_alpha':[0, 1e-5, 1e-2, 0.1, 1, 100]}
    alg = xgb.XGBClassifier(learning_rate=0.1, n_estimators=142, max_depth=4, min_child_weight=6, gamma=0.2, \
    subsample=0.7, colsample_bytree=0.75, objective= 'binary:logistic', nthread=4, seed=27)
    gsearch6 = GridSearchCV(estimator = alg, param_grid = param_test6, scoring='roc_auc', n_jobs=4, cv=5)
    gsearch6.fit(train_3[predictors],train_3[target])
    print(gsearch6.best_params_)
    print(gsearch6.best_score_)
    ### optimal: {'reg_alpha': 0.01}
    param_test7 = {'reg_alpha':[0.001, 0.005, 0.01, 0.05]}
    alg = xgb.XGBClassifier(learning_rate=0.1, n_estimators=142, max_depth=4, min_child_weight=6, gamma=0.2, reg_alpha=0.01, \
    subsample=0.7, colsample_bytree=0.75, objective= 'binary:logistic', nthread=4, seed=27)
    gsearch7 = GridSearchCV(estimator = alg, param_grid = param_test7, scoring='roc_auc', n_jobs=4, cv=5)
    gsearch7.fit(train_3[predictors],train_3[target])
    print(gsearch7.best_params_)
    print(gsearch7.best_score_)
  6. Tune reg_lambda
    ### optimal: {'reg_lambda': 1}
    param_test8 = {'reg_lambda':[0, 0.01, 0.1, 1, 10, 100]}
    alg = xgb.XGBClassifier(learning_rate=0.1, n_estimators=142, max_depth=4, min_child_weight=6, gamma=0.2, reg_alpha=0.01, \
    subsample=0.7, colsample_bytree=0.75, objective= 'binary:logistic', nthread=4, seed=27)
    gsearch8 = GridSearchCV(estimator = alg, param_grid = param_test8, scoring='roc_auc', n_jobs=4, cv=5)
    gsearch8.fit(train_3[predictors],train_3[target])
    print(gsearch8.best_params_)
    print(gsearch8.best_score_)
    ### optimal: {'reg_lambda': 1}
    param_test9 = {'reg_lambda':[0.5, 0.7, 1, 3, 5]}
    alg = xgb.XGBClassifier(learning_rate=0.1, n_estimators=142, max_depth=4, min_child_weight=6, gamma=0.2, reg_alpha=0.01, \
    subsample=0.7, colsample_bytree=0.75, objective= 'binary:logistic', nthread=4, seed=27)
    gsearch9 = GridSearchCV(estimator = alg, param_grid = param_test9, scoring='roc_auc', n_jobs=4, cv=5)
    gsearch9.fit(train_3[predictors],train_3[target])
    print(gsearch9.best_params_)
    print(gsearch9.best_score_)
    ### get n_estimators
    xgb3 = xgb.XGBClassifier(learning_rate=0.1, n_estimators=1000, max_depth=4, min_child_weight=6, gamma=0.2, \
    reg_alpha=0.01, reg_lambda=1, subsample=0.7, colsample_bytree=0.75, \
    objective= 'binary:logistic', nthread=4, seed=27)
    get_n_estimators(xgb3, train_3, predictors, target)
  7. Reduce learning rate
    xgb4 = xgb.XGBClassifier(learning_rate=0.01, n_estimators=5000, max_depth=4, min_child_weight=6, gamma=0.2, \
    reg_alpha=0.01, reg_lambda=1, subsample=0.7, colsample_bytree=0.75, \
    objective= 'binary:logistic', nthread=4, seed=27)
    get_n_estimators(xgb4, train_3, predictors, target)

Building a complete Pipeline from the steps above

    import pandas as pd
    import numpy as np
    import xgboost as xgb
    import matplotlib.pyplot as plt
    from sklearn import metrics
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import FunctionTransformer, LabelBinarizer
    # CategoricalImputer is provided by older sklearn_pandas releases
    from sklearn_pandas import DataFrameMapper, CategoricalImputer
    from sklearn.pipeline import Pipeline

    data = pd.read_csv('Train.csv', encoding = "ISO-8859-1")
    train, test = train_test_split(data,train_size=0.7,random_state=123,stratify=data['Disbursed'])

    target_raw = 'Disbursed'
    predictors_raw = [col for col in train.columns if col!=target_raw]
    train_X, train_y = train[predictors_raw], train[target_raw]

    category_cols = train_X.columns[train_X.dtypes==object].tolist()
    category_cols.remove('DOB')
    category_cols.append('Var4')
    numeric_cols = list(set(train_X.columns)-set(category_cols))
    numeric_cols = numeric_cols+['Age', 'Loan_Amount_Submitted_Missing', 'Loan_Tenure_Submitted_Missing',
                                 'EMI_Loan_Submitted_Missing', 'Interest_Rate_Missing', 'Processing_Fee_Missing']

    counts = dict()
    for v in category_cols:
        counts[v] = train_X[v].value_counts()
    non_merge_city = [c for c in counts['City'].index if counts['City'][c]>=200]
    non_merge_sa = [c for c in counts['Salary_Account'].index if counts['Salary_Account'][c]>=100]
    non_merge_sr = [c for c in counts['Source'].index if counts['Source'][c]>=100]

    dropped_columns = ['ID','Lead_Creation_Date','LoggedIn','Employer_Name','DOB','Loan_Amount_Submitted',
                       'Loan_Tenure_Submitted','EMI_Loan_Submitted','Interest_Rate','Processing_Fee']

    # Function Transform
    def preprocess(X):
        # Work on a copy so the caller's DataFrame is not modified
        X = X.copy()
        X['City'] = X['City'].apply(lambda x: 'others' if x not in non_merge_city and not pd.isnull(x) else x)
        X['Salary_Account'] = X['Salary_Account'].apply(lambda x: 'others' if x not in non_merge_sa and not pd.isnull(x) else x)
        X['Source'] = X['Source'].apply(lambda x: 'others' if x not in non_merge_sr and not pd.isnull(x) else x)
        X['Age'] = X['DOB'].apply(lambda x: 118 - int(x[-2:]))
        X['Loan_Amount_Submitted_Missing'] = X['Loan_Amount_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
        X['Loan_Tenure_Submitted_Missing'] = X['Loan_Tenure_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
        X['EMI_Loan_Submitted_Missing'] = X['EMI_Loan_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
        X['Interest_Rate_Missing'] = X['Interest_Rate'].apply(lambda x: 1 if pd.isnull(x) else 0)
        X['Processing_Fee_Missing'] = X['Processing_Fee'].apply(lambda x: 1 if pd.isnull(x) else 0)
        return X.drop(dropped_columns, axis=1)

    # Apply numeric imputer
    numeric_imputer = [([feature], SimpleImputer(strategy="median")) for feature in numeric_cols if feature not in dropped_columns]
    # Apply categorical imputer and one-hot encode
    category_imputer = [(feature, [CategoricalImputer(strategy='constant', fill_value='Missing'),LabelBinarizer()])
                        for feature in category_cols if feature not in dropped_columns]
    # Combine the numeric and categorical transformations
    numeric_categorical_union = DataFrameMapper(numeric_imputer+category_imputer,input_df=True,df_out=True)

    # Tuned Classifier
    tuned_xgb = xgb.XGBClassifier(learning_rate=0.01, n_estimators=1480, max_depth=4, min_child_weight=6, gamma=0.2,
                                  reg_alpha=0.01, reg_lambda=1, subsample=0.7, colsample_bytree=0.75,
                                  objective= 'binary:logistic', nthread=4, seed=27)

    # Create full pipeline
    pipeline = Pipeline([("preprocessor", FunctionTransformer(preprocess, validate=False)),
                         ("featureunion", numeric_categorical_union), ("classifier", tuned_xgb)])
    pipeline.fit(train_X, train_y)

    # Feature importance
    feat_imp = pd.Series(pipeline.named_steps['classifier'].get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances', figsize=(20,6))
    plt.ylabel('Feature Importance Score')

    # individual prediction
    print(pipeline.predict_proba(test.iloc[[1]][predictors_raw]))
    # test data predictions
    # AUC Score (Test): 0.8568
    predprob = pipeline.predict_proba(test[predictors_raw])[:,1]
    print("AUC Score (Test): %f" % metrics.roc_auc_score(test[target_raw], predprob))
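The AUC reported above can be read as the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one. The sketch below computes it directly from that definition; it is a quadratic-time illustration for small inputs, not how `metrics.roc_auc_score` is implemented, and the function name `roc_auc` and the toy labels and scores are assumptions.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy labels and predicted probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

Because only the ranking of the scores matters, AUC suits an imbalanced target such as Disbursed better than accuracy, which is presumably why it is the metric used throughout the tuning steps.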
