@drsimonj here to show you how to use xgboost (extreme gradient boosting) models in pipelearner.

Why a post on xgboost and pipelearner?

xgboost is one of the most powerful machine-learning libraries, so there’s a good reason to use it. pipelearner helps to create machine-learning pipelines that make it easy to do cross-fold validation, hyperparameter grid searching, and more. So bringing them together will make for an awesome combination!

The only problem - out of the box, xgboost doesn’t play nice with pipelearner. Let’s work out how to deal with this.

Setup

To follow this post you’ll need the following packages:

# Install (if necessary)
install.packages(c("xgboost", "tidyverse", "devtools", "lazyeval"))
devtools::install_github("drsimonj/pipelearner")

# Attach
library(tidyverse)
library(xgboost)
library(pipelearner)
library(lazyeval)

Our example will be to try and predict whether tumours are cancerous or not using the Breast Cancer Wisconsin (Diagnostic) Data Set. Set up as follows:

data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'

d <- read_csv(
  data_url,
  col_names = c('id', 'thinkness', 'size_uniformity',
                'shape_uniformity', 'adhesion', 'epith_size',
                'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'cancer')) %>%
  select(-id) %>%                   # Remove id; not useful here
  filter(nuclei != '?') %>%         # Remove records with missing data
  mutate(cancer = cancer == 4) %>%  # Binary-encode 'cancer': TRUE = malignant (class 4), FALSE = benign
  mutate_all(as.numeric)            # All to numeric (TRUE/FALSE become 1/0); needed for XGBoost

d
#> # A tibble: 683 × 10
#> thinkness size_uniformity shape_uniformity adhesion epith_size nuclei
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 1 1 1 2 1
#> 2 5 4 4 5 7 10
#> 3 3 1 1 1 2 2
#> 4 6 8 8 1 3 4
#> 5 4 1 1 3 2 1
#> 6 8 10 10 8 7 10
#> 7 1 1 1 1 2 10
#> 8 2 1 2 1 2 1
#> 9 2 1 1 1 2 1
#> 10 4 2 1 1 2 1
#> # ... with 673 more rows, and 4 more variables: chromatin <dbl>,
#> # nucleoli <dbl>, mitoses <dbl>, cancer <dbl>

pipelearner

pipelearner makes it easy to do lots of routine machine learning tasks, many of which you can check out in this post. For this example, we’ll use pipelearner to perform a grid search of some xgboost hyperparameters.

Grid searching is easy with pipelearner. For detailed instructions, check out my previous post: tidy grid search with pipelearner. As a quick reminder, we declare a data frame, machine learning function, formula, and hyperparameters as vectors. Here’s an example that would grid search multiple values of minsplit and maxdepth for an rpart decision tree:

pipelearner(d, rpart::rpart, cancer ~ .,
minsplit = c(2, 4, 6, 8, 10),
maxdepth = c(2, 3, 4, 5))
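To get a feel for what this grid implies, here is a minimal base-R sketch that enumerates the same hyperparameter combinations with expand.grid(). pipelearner does its own expansion internally; this only illustrates the idea, and the name grid is made up for the example.

```r
# Enumerate every combination of the two rpart hyperparameters
grid <- expand.grid(
  minsplit = c(2, 4, 6, 8, 10),
  maxdepth = c(2, 3, 4, 5)
)

nrow(grid)     # number of models a full grid search would fit (5 * 4)
head(grid, 3)  # first few combinations
```

Every added value multiplies the number of models to fit, so grids grow quickly.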

The challenge for xgboost:

pipelearner expects a model function that has two arguments: data and formula.
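The mismatch can be sketched in base R. The helper call_model() below is a hypothetical stand-in for how a pipeline might invoke a model function by naming data and formula; it is not pipelearner's actual code. formula_model() and matrix_model() are also made-up names, built on lm() and lm.fit() from the stats package.

```r
# A model function with the signature pipelearner expects: data and formula
formula_model <- function(data, formula, ...) {
  lm(formula = formula, data = data, ...)
}

# A model function in the xgboost style: feature matrix and label vector
matrix_model <- function(X, y) {
  lm.fit(x = cbind(1, X), y = y)
}

# Hypothetical stand-in for a pipeline calling a model by argument name
call_model <- function(model, data, formula) {
  do.call(model, list(data = data, formula = formula))
}

# Works: the argument names match
works <- call_model(formula_model, mtcars, mpg ~ wt)

# Fails: matrix_model has no `data` or `formula` arguments
fails <- tryCatch(
  call_model(matrix_model, mtcars, mpg ~ wt),
  error = function(e) conditionMessage(e)
)
```

The second call errors because the matrix-style function does not know what to do with data and formula, which is the situation xgboost() is in.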

xgboost

Here’s an xgboost model:

# Prep data (X) and labels (y)
X <- select(d, -cancer) %>% as.matrix()
y <- d$cancer

# Fit the model
fit <- xgboost(X, y, nrounds = 5, objective = "reg:logistic")
#> [1] train-rmse:0.372184
#> [2] train-rmse:0.288560
#> [3] train-rmse:0.230171
#> [4] train-rmse:0.188965
#> [5] train-rmse:0.158858

# Examine accuracy
predicted <- as.numeric(predict(fit, X) >= .5)
mean(predicted == y)
#> [1] 0.9838946

Looks like we have a model with 98.39% accuracy on the training data!

Regardless, notice that the first two arguments to xgboost() are a numeric data matrix and a numeric label vector. This is not what pipelearner wants!

Wrapper function to parse data and formula

To make xgboost compatible with pipelearner we need to write a wrapper function that accepts data and formula, and uses these to pass a feature matrix and label vector to xgboost:

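pl_xgboost <- function(data, formula, ...) {
  # pipelearner passes a resample object; convert it to a data frame
  data <- as.data.frame(data)

  # Variable names on the right-hand side (features) and left-hand side (target)
  X_names <- all.vars(f_rhs(formula))
  y_name  <- all.vars(f_lhs(formula))

  # Expand `.` to every column except the target
  if (identical(X_names, ".")) {
    X_names <- setdiff(names(data), y_name)
  }

  # Build the feature matrix and label vector that xgboost expects
  X <- data.matrix(data[, X_names, drop = FALSE])
  y <- data[[y_name]]

  xgboost(data = X, label = y, ...)
}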

Let’s try it out:

pl_fit <- pl_xgboost(d, cancer ~ ., nrounds = 5, objective = "reg:logistic")
#> [1] train-rmse:0.372184
#> [2] train-rmse:0.288560
#> [3] train-rmse:0.230171
#> [4] train-rmse:0.188965
#> [5] train-rmse:0.158858

# Examine accuracy
pl_predicted <- as.numeric(predict(pl_fit, as.matrix(select(d, -cancer))) >= .5)
mean(pl_predicted == y)
#> [1] 0.9838946

Perfect!
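The formula-parsing step of the wrapper can be tried on its own, without xgboost or lazyeval. The sketch below is a base-R variant that indexes the formula directly; split_by_formula() and the toy data frame are hypothetical names invented for this illustration, not part of pipelearner or the wrapper above.

```r
# Hypothetical helper: split a data frame into a feature matrix and a
# label vector according to a two-sided formula
split_by_formula <- function(data, formula) {
  y_name  <- all.vars(formula[[2]])  # left-hand side of the formula
  X_names <- all.vars(formula[[3]])  # right-hand side of the formula

  # Expand `.` to every column except the target
  if (identical(X_names, ".")) {
    X_names <- setdiff(names(data), y_name)
  }

  list(
    X = data.matrix(data[, X_names, drop = FALSE]),
    y = data[[y_name]]
  )
}

toy <- data.frame(a = 1:3, b = c(2, 4, 6), target = c(0, 1, 0))

parts_all <- split_by_formula(toy, target ~ .)  # every column but target
parts_one <- split_by_formula(toy, target ~ a)  # a single named feature

colnames(parts_all$X)
colnames(parts_one$X)
```

Using drop = FALSE keeps the features as a matrix even when the formula names only one predictor.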

Bringing it all together

We can now use pipelearner and pl_xgboost() for easy grid searching:

pl <- pipelearner(d, pl_xgboost, cancer ~ .,
                  nrounds = c(5, 10, 25),
                  eta = c(.1, .3),
                  max_depth = c(4, 6))

fits <- pl %>% learn()
#> [1] train-rmse:0.453832
#> [2] train-rmse:0.412548
#> ...

fits
#> # A tibble: 12 × 9
#> models.id cv_pairs.id train_p fit target model
#> <chr> <chr> <dbl> <list> <chr> <chr>
#> 1 1 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 2 10 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 3 11 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 4 12 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 5 2 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 6 3 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 7 4 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 8 5 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 9 6 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 10 7 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 11 8 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 12 9 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> # ... with 3 more variables: params <list>, train <list>, test <list>

Looks like all the models learned OK. Let’s write a custom function to extract model accuracy and examine the results:

accuracy <- function(fit, data, target_var) {
  # Convert resample object to data frame
  data <- as.data.frame(data)
  # Get feature matrix and labels
  X <- data %>%
    select(-matches(target_var)) %>%
    as.matrix()
  y <- data[[target_var]]
  # Obtain predicted class
  y_hat <- as.numeric(predict(fit, X) > .5)
  # Return accuracy
  mean(y_hat == y)
}

results <- fits %>%
  mutate(
    # hyperparameters
    nrounds   = map_dbl(params, "nrounds"),
    eta       = map_dbl(params, "eta"),
    max_depth = map_dbl(params, "max_depth"),
    # Accuracy
    accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
    accuracy_test  = pmap_dbl(list(fit, test, target), accuracy)
  ) %>%
  # Select columns and order rows
  select(nrounds, eta, max_depth, contains("accuracy")) %>%
  arrange(desc(accuracy_test), desc(accuracy_train))

results
#> # A tibble: 12 × 5
#> nrounds eta max_depth accuracy_train accuracy_test
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 25 0.3 6 1.0000000 0.9489051
#> 2 25 0.3 4 1.0000000 0.9489051
#> 3 10 0.3 6 0.9981685 0.9489051
#> 4 5 0.3 6 0.9945055 0.9489051
#> 5 10 0.1 6 0.9945055 0.9489051
#> 6 25 0.1 6 0.9945055 0.9489051
#> 7 5 0.1 6 0.9926740 0.9489051
#> 8 25 0.1 4 0.9890110 0.9489051
#> 9 10 0.3 4 0.9871795 0.9489051
#> 10 5 0.3 4 0.9853480 0.9489051
#> 11 10 0.1 4 0.9853480 0.9416058
#> 12 5 0.1 4 0.9835165 0.9416058

Ten of the twelve models tied at 94.89% accuracy on the test set. The one listed first, after breaking ties on training accuracy, had nrounds = 25, eta = 0.3, and max_depth = 6.
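A quick bit of arithmetic helps put the test accuracies in perspective. The sketch below assumes the test set has 137 rows (roughly 20% of the 683 records). That size is not stated in the post; it is inferred from the reported values, so treat it as an assumption.

```r
n_test <- 137  # assumed test-set size, inferred from the reported accuracies

# Accuracy when 130 or 129 of the test rows are classified correctly
acc <- round(c(130, 129) / n_test, 7)
acc
```

Under that assumption, the two distinct test accuracies in the table differ by a single test observation, so the ranking of these models should be read with caution.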

The key trick was the wrapper function pl_xgboost(), which let us bridge xgboost and pipelearner. Note that the same principle can be used for any other machine learning function that doesn’t play nice with pipelearner.
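To illustrate how the same principle might generalise, here is a base-R sketch of an adapter that turns any fitting function of the form fit_fn(X, y, ...) into one that takes data and formula. The names as_formula_model() and pl_lm_fit() are hypothetical, and stats::lm.fit() stands in for a matrix-based learner so the example runs without extra packages.

```r
# Hypothetical adapter: wrap a fit_fn(X, y, ...) so it accepts data and formula
as_formula_model <- function(fit_fn) {
  function(data, formula, ...) {
    data <- as.data.frame(data)
    y_name  <- all.vars(formula[[2]])  # target (left-hand side)
    X_names <- all.vars(formula[[3]])  # features (right-hand side)
    if (identical(X_names, ".")) {
      X_names <- setdiff(names(data), y_name)
    }
    X <- data.matrix(data[, X_names, drop = FALSE])
    fit_fn(X, data[[y_name]], ...)
  }
}

# Example: lm.fit() takes a design matrix and a response vector
pl_lm_fit <- as_formula_model(function(X, y, ...) {
  lm.fit(cbind(Intercept = 1, X), y, ...)
})

lm_fit_result <- pl_lm_fit(mtcars, mpg ~ wt + hp)
lm_fit_result$coefficients
```

The coefficients should agree with lm(mpg ~ wt + hp, data = mtcars), which is a handy sanity check that the adapter passes the right columns through.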

Bonus: bootstrapped cross validation

For those of you who are comfortable, below is a bonus example of using 100 bootstrapped cross validation samples to examine consistency in the accuracy. It doesn’t get much easier than using pipelearner!

results <- pipelearner(d, pl_xgboost, cancer ~ ., nrounds = 25) %>%
  learn_cvpairs(n = 100) %>%
  learn() %>%
  mutate(
    test_accuracy = pmap_dbl(list(fit, test, target), accuracy)
  )
#> [1] train-rmse:0.357471
#> [2] train-rmse:0.256735
#> ...

results %>%
  ggplot(aes(test_accuracy)) +
  geom_histogram(bins = 30) +
  scale_x_continuous(labels = scales::percent) +
  theme_minimal() +
  labs(x = "Accuracy", y = "Number of samples",
       title = "Test accuracy distribution for\n100 bootstrapped samples")

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

Reposted from: https://drsimonj.svbtle.com/with-our-powers-combined-xgboost-and-pipelearner
