XGBoost API

参数

可以参考官方文档，比较清晰，多做几个demo，应该就记住了，

分类问题

这个demo是datacamp上面的使用的不是原生xgb而是sklearn接口的xgb，目前很多人是推荐使用接口的

我先来看一下原生的和使用接口的有什么区别

使用逻辑回归

# Import xgboost

import xgboost as xgb

# Create arrays for the features and the target: X, y

#备注一下这个X是从第一列到倒数第二列，y是最后一列，哈哈，因为原来对slice的概念没熟透

X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the training and test sets

X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl

#以逻辑回归定义损失函数

xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)

# Fit the classifier to the training set

xg_cl.fit(X_train,y_train)

# Predict the labels of the test set: preds

preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy

#做的是样本精度评估

accuracy = float(np.sum(preds==y_test))/y_test.shape[0]

print("accuracy: %f" % (accuracy))

<script.py> output:

    accuracy: 0.743300

使用交叉验证划分数据集

# Create arrays for the features and the target: X, y

X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the DMatrix from X and y: churn_dmatrix

churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params

params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results

cv_results = xgb.cv(dtrain=churn_dmatrix, params=params,

                  nfold=3, num_boost_round=5,

                  metrics="error", as_pandas=True, seed=123)

# Print cv_results

print(cv_results)

# Print the accuracy

print(((1-cv_results["test-error-mean"]).iloc[-1]))

<script.py> output:

       train-error-mean  train-error-std  test-error-mean  test-error-std

    0           0.28232         0.002366          0.28378        0.001932

    1           0.26951         0.001855          0.27190        0.001932

    2           0.25605         0.003213          0.25798        0.003963

    3           0.25090         0.001845          0.25434        0.003827

    4           0.24654         0.001981          0.24852        0.000934

    0.75148

使用AUC评价分类效果

# Perform cross_validation: cv_results

cv_results = xgb.cv(dtrain=churn_dmatrix, params=params,

                  nfold=3, num_boost_round=5,

                  metrics="auc", as_pandas=True, seed=123)

# Print cv_results

print(cv_results)

# Print the AUC

print((cv_results["test-auc-mean"]).iloc[-1])

<script.py> output:

       train-auc-mean  train-auc-std  test-auc-mean  test-auc-std

    0        0.768893       0.001544       0.767863      0.002820

    1        0.790864       0.006758       0.789157      0.006846

    2        0.815872       0.003900       0.814476      0.005997

    3        0.822959       0.002018       0.821682      0.003912

    4        0.827528       0.000769       0.826191      0.001937

    0.826191

分类问题一般都是测试结果的准确程度

回归问题一般是用mse，rmse测试结果的误差

回归问题

线性分类器liner

# Create the training and test sets

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=123)

# Instantiate the XGBRegressor: xg_reg

xg_reg = xgb.XGBRegressor(objective="reg:linear",n_estimators=10,booster="gbtree",seed=123)

# Fit the regressor to the training set

xg_reg.fit(X_train,y_train)

# Predict the labels of the test set: preds

preds = xg_reg.predict(X_test)

# Compute the rmse: rmse

rmse = np.sqrt(mean_squared_error(y_test, preds))

print("RMSE: %f" % (rmse))

<script.py> output:

    [08:39:03] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

    RMSE: 78847.401758

# Convert the training and testing sets into DMatrixes: DM_train, DM_test

DM_train = xgb.DMatrix(data=X_train, label=y_train)

DM_test =  xgb.DMatrix(data=X_test, label=y_test)

# Create the parameter dictionary: params

params = {"booster":"gblinear", "objective":"reg:linear"}

# Train the model: xg_reg

xg_reg = xgb.train(params = params, dtrain=DM_train, num_boost_round=5)

# Predict the labels of the test set: preds

preds = xg_reg.predict(DM_test)

# Compute and print the RMSE

rmse = np.sqrt(mean_squared_error(y_test,preds))

print("RMSE: %f" % (rmse))

<script.py> output:

    [08:43:38] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

    RMSE: 44267.430424

# Create the DMatrix: housing_dmatrix

housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary: params

params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation: cv_results

cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)

# Print cv_results

print(cv_results)

# Extract and print final round boosting round metric

print((cv_results["test-rmse-mean"]).tail(1))

<script.py> output:

    [08:46:00] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

    [08:46:00] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

    [08:46:00] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

    [08:46:00] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

       train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std

    0    141767.535156      429.442896   142980.433594    1193.789595

    1    102832.541015      322.467623   104891.392578    1223.157953

    2     75872.617188      266.473250    79478.937500    1601.344539

    3     57245.650391      273.625907    62411.924804    2220.148314

    4     44401.295899      316.422824    51348.281250    2963.379118

    4    51348.28125

    Name: test-rmse-mean, dtype: float64

均方误差评估模型

# Create the DMatrix: housing_dmatrix

housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary: params

params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation: cv_results

cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics="mae", as_pandas=True, seed=123)

# Print cv_results

print(cv_results)

# Extract and print final round boosting round metric

print((cv_results["test-mae-mean"]).tail(1))

<script.py> output:

    [08:49:31] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

    [08:49:31] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

    [08:49:31] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

    [08:49:31] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

       train-mae-mean  train-mae-std  test-mae-mean  test-mae-std

    0   127343.570313     668.341212  127633.990235   2403.991706

    1    89770.056640     456.949630   90122.496093   2107.910017

    2    63580.791016     263.405561   64278.561524   1887.563581

    3    45633.141601     151.886070   46819.167969   1459.813514

    4    33587.090821      87.001007   35670.651367   1140.608182

    4    35670.651367

    Name: test-mae-mean, dtype: float64

使用正则化

# Create the DMatrix: housing_dmatrix

housing_dmatrix = xgb.DMatrix(data=X, label=y)

reg_params = [1, 10, 100]

# Create the initial parameter dictionary for varying l2 strength: params

params = {"objective":"reg:linear","max_depth":3}

# Create an empty list for storing rmses as a function of l2 complexity

rmses_l2 = []

# Iterate over reg_params

for reg in reg_params:

    # Update l2 strength

    params["lambda"] = reg

    # Pass this updated param dictionary into cv

    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)

    # Append best rmse (final round) to rmses_l2

    rmses_l2.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])

# Look at best rmse per l2 param

print("Best rmse as a function of l2:")

print(pd.DataFrame(list(zip(reg_params, rmses_l2)), columns=["l2","rmse"]))

<script.py> output:

    [09:41:04] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

    [09:41:04] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

    [09:41:04] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

    [09:41:04] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

    [09:41:04] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

    [09:41:04] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

    Best rmse as a function of l2:

        l2          rmse

    0    1  52275.357421

    1   10  57746.064453

    2  100  76624.628907

可视化

https://campus.datacamp.com/courses/extreme-gradient-boosting-with-xgboost/regression-with-xgboost?ex=9

xgb.plot_tree

xgb.plot_importance

# Create the DMatrix: housing_dmatrix

housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params

params = {"objective":"reg:linear", "max_depth":4}

# Train the model: xg_reg

xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# Plot the feature importances

xgb.plot_importance(xg_reg)

plt.show()

XGBoost学习笔记2的更多相关文章

XGBoost学习笔记1
XGBoost XGBoost这个网红大杀器,似乎很好用,完事儿还是自己推导一遍吧,datacamp上面有辅助的课程,但是不太涉及原理它究竟有多好用呢?我还没用过,先搞清楚原理,hahaha~ 参考 ...
[ML学习笔记] XGBoost算法
[ML学习笔记] XGBoost算法回归树决策树可用于分类和回归,分类的结果是离散值(类别),回归的结果是连续值(数值),但本质都是特征(feature)到结果/标签(label)之间的映射. 这 ...
学习笔记之Data Science
Data science - Wikipedia https://en.wikipedia.org/wiki/Data_science Data science, also known as data ...
概率图模型学习笔记：HMM、MEMM、CRF
作者:Scofield链接:https://www.zhihu.com/question/35866596/answer/236886066来源:知乎著作权归作者所有.商业转载请联系作者获得授权,非商 ...
CTR预估模型演变及学习笔记
[说在前面]本人博客新手一枚,象牙塔的老白,职业场的小白.以下内容仅为个人见解,欢迎批评指正,不喜勿喷![握手][握手] [再啰嗦一下]如果你对智能推荐感兴趣,欢迎先浏览我的另一篇随笔:智能推荐算法演 ...
js学习笔记：webpack基础入门（一）
之前听说过webpack,今天想正式的接触一下,先跟着webpack的官方用户指南走: 在这里有: 如何安装webpack 如何使用webpack 如何使用loader 如何使用webpack的开发者 ...
PHP-自定义模板-学习笔记
1. 开始这几天,看了李炎恢老师的<PHP第二季度视频>中的“章节7:创建TPL自定义模板”,做一个学习笔记,通过绘制架构图.UML类图和思维导图,来对加深理解. 2. 整体架构图 ...
PHP-会员登录与注册例子解析-学习笔记
1.开始最近开始学习李炎恢老师的<PHP第二季度视频>中的“章节5:使用OOP注册会员”,做一个学习笔记,通过绘制基本页面流程和UML类图,来对加深理解. 2.基本页面流程 3.通过UM ...
2014年暑假c#学习笔记目录
2014年暑假c#学习笔记一.C#编程基础 1. c#编程基础之枚举 2. c#编程基础之函数可变参数 3. c#编程基础之字符串基础 4. c#编程基础之字符串函数 5.c#编程基础之ref.ou ...

随机推荐

实验17：NAT
实验14-1:静态NAT 配置 Ø 实验目的通过本实验可以掌握(1)静态NAT 的特征(2)静态NAT 基本配置和调试 Ø 拓扑结构实验步骤n 步骤1:配置路由器R1 提供NAT ...
在VMware中如何清理多余的空间
问题描述平时用的编程计算机只有250G空间,c盘和d盘,今天准备做实验,发现删除虚拟机中系统的内容不但没有减少空间,反而增加了,这时我意识到虚拟机内部可能与咱们想象的操作模式不一样. 解决办法我的 ...
有道词典 Andriod 版本数据格式分析
其实很简单无聊基于版本 5.3 分析. 其实也简单分析了有道词典iOS版本,必应词典的各个版本,以及金山词典的各个版本,还有那个一直逍遥法外的林格斯词典. 由于在各个平台上的限制,同一词典的不同版本 ...
Java.work6 stasic、this、包总结作业20194651
题目一: 编写一个类Computer,类中含有一个求n的阶乘的方法.将该类打包,并在另一包中德Java文件App.java中引入包,在主类中定义Computer类的对象,调用求n的阶乘的方法(n值由参 ...
ELK：收集k8s容器日志最佳实践
简介关于日志收集这个主题,这已经是第三篇了,为什么一再研究这个课题,因为这个课题实在太重要,而当今优秀的开源解决方案还不是很明朗: 就docker微服务化而言,研发有需求标准输出,也有需求文件输出, ...
幸存者偏差Survivorship Bias
"最不符合逻辑的地方,一定埋藏着最深刻的逻辑."——余秋雨<行者无疆> 为什么要说幸存者偏差? 因为2018年全国II卷的描述即为典型的“幸存者偏差”,且这一例子被引入 ...
Python3(十) 函数式编程：匿名函数、高阶函数、装饰器
一.匿名函数 1.定义:定义函数的时候不需要定义函数名 2.具体例子: #普通函数 def add(x,y): return x + y #匿名函数 lambda x,y: x + y 调用匿名函数: ...
使用github--stanfordnlp--glove训练自己的数据词向量
1.准备语料准备好自己的语料,保存为txt,每行一个句子或一段话,注意要分好词.将分好词的语料保存为×××.txt 2.准备源码下载地址:https://github.com/stanfordnl ...
Unable to update index for nexus-publish | http://ip:port/repository/maven-public/
问题描述:Unable to update index for nexus-publish | http://ip:port/repository/maven-public/ 解决方案:进入工作空间. ...
[MacOS-Memcached]安装
查看memcached信息 $ brew info memcached memcached: stable 1.5.22 (bottled), HEAD High performance, distr ...

XGBoost学习笔记2

XGBoost API

参数

分类问题

使用逻辑回归

使用交叉验证划分数据集

使用AUC评价分类效果

回归问题

线性分类器liner

均方误差评估模型

使用正则化

可视化

xgb.plot_tree

xgb.plot_importance

XGBoost学习笔记2的更多相关文章

随机推荐

热门专题