@drsimonj here to show you how to use xgboost (extreme gradient boosting) models in pipelearner.

Why a post on xgboost and pipelearner?

xgboost is one of the most powerful machine-learning libraries, so there’s a good reason to use it. pipelearner helps to create machine-learning pipelines that make it easy to do cross-fold validation, hyperparameter grid searching, and more. So bringing them together will make for an awesome combination!

The only problem - out of the box, xgboost doesn’t play nice with pipelearner. Let’s work out how to deal with this.

Setup

To follow this post you’ll need the following packages:

# Install (if necessary)
install.packages(c("xgboost", "tidyverse", "devtools", "lazyeval"))
devtools::install_github("drsimonj/pipelearner")

# Attach
library(tidyverse)
library(xgboost)
library(pipelearner)
library(lazyeval)

Our example will be to try and predict whether tumours are cancerous or not using the Breast Cancer Wisconsin (Diagnostic) Data Set. Set up as follows:

data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'

d <- read_csv(
    data_url,
    col_names = c('id', 'thinkness', 'size_uniformity',
                  'shape_uniformity', 'adhesion', 'epith_size',
                  'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'cancer')) %>%
  select(-id) %>%                   # Remove id; not useful here
  filter(nuclei != '?') %>%         # Remove records with missing data
  mutate(cancer = cancer == 4) %>%  # Binary-encode 'cancer': TRUE = malignant (4); FALSE = benign (2)
  mutate_all(as.numeric)            # All to numeric (cancer becomes 1/0); needed for XGBoost

d
#> # A tibble: 683 × 10
#> thinkness size_uniformity shape_uniformity adhesion epith_size nuclei
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 1 1 1 2 1
#> 2 5 4 4 5 7 10
#> 3 3 1 1 1 2 2
#> 4 6 8 8 1 3 4
#> 5 4 1 1 3 2 1
#> 6 8 10 10 8 7 10
#> 7 1 1 1 1 2 10
#> 8 2 1 2 1 2 1
#> 9 2 1 1 1 2 1
#> 10 4 2 1 1 2 1
#> # ... with 673 more rows, and 4 more variables: chromatin <dbl>,
#> # nucleoli <dbl>, mitoses <dbl>, cancer <dbl>
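
The two cleaning steps that matter most for xgboost are dropping the '?' placeholders and turning the class code into a 0/1 label. Here is a minimal base-R sketch of both on toy vectors (the values are made up for illustration; only the 2 = benign, 4 = malignant coding comes from the data set):

```r
# Toy class codes in the raw file's convention: 2 = benign, 4 = malignant
raw_class <- c(2, 4, 4, 2, 4)

# Logical comparison, then coercion to numeric: TRUE -> 1, FALSE -> 0
cancer <- as.numeric(raw_class == 4)
cancer
#> [1] 0 1 1 0 1

# A column containing '?' is read as character, so drop those
# entries first and only then convert to numeric
nuclei <- c("1", "10", "?", "2")
nuclei_clean <- as.numeric(nuclei[nuclei != "?"])
nuclei_clean
#> [1]  1 10  2
```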

pipelearner

pipelearner makes it easy to do lots of routine machine learning tasks, many of which you can check out in this post. For this example, we’ll use pipelearner to perform a grid search of some xgboost hyperparameters.

Grid searching is easy with pipelearner. For detailed instructions, check out my previous post: tidy grid search with pipelearner. As a quick reminder, we declare a data frame, machine learning function, formula, and hyperparameters as vectors. Here’s an example that would grid search multiple values of minsplit and maxdepth for an rpart decision tree:

pipelearner(d, rpart::rpart, cancer ~ .,
minsplit = c(2, 4, 6, 8, 10),
maxdepth = c(2, 3, 4, 5))
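
To get a feel for how many models such a call asks for, the combinations can be listed with base R. This is only a sketch: expand.grid() is a stand-in for illustration, not necessarily how pipelearner builds its grid internally.

```r
# Every combination of the two hyperparameter vectors
grid <- expand.grid(minsplit = c(2, 4, 6, 8, 10),
                    maxdepth = c(2, 3, 4, 5))

# 5 values of minsplit x 4 values of maxdepth
nrow(grid)
#> [1] 20
```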

The challenge for xgboost:

pipelearner expects a model function that has two arguments: data and formula

xgboost

Here’s an xgboost model:

# Prep data (X) and labels (y)
X <- select(d, -cancer) %>% as.matrix()
y <- d$cancer

# Fit the model
fit <- xgboost(X, y, nrounds = 5, objective = "reg:logistic")
#> [1] train-rmse:0.372184
#> [2] train-rmse:0.288560
#> [3] train-rmse:0.230171
#> [4] train-rmse:0.188965
#> [5] train-rmse:0.158858

# Examine accuracy
predicted <- as.numeric(predict(fit, X) >= .5)
mean(predicted == y)
#> [1] 0.9838946

Looks like we have a model with 98.39% accuracy on the training data!

Regardless, notice that the first two arguments to xgboost() are a numeric data matrix and a numeric label vector. This is not what pipelearner wants!

Wrapper function to parse data and formula

To make xgboost compatible with pipelearner we need to write a wrapper function that accepts data and formula, and uses these to pass a feature matrix and label vector to xgboost:

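pl_xgboost <- function(data, formula, ...) {
  # pipelearner passes resample objects, so convert to a data frame first
  data <- as.data.frame(data)

  # Variable names on each side of the formula
  # (all.vars() also handles multi-term formulas such as y ~ a + b)
  X_names <- all.vars(f_rhs(formula))
  y_name  <- all.vars(f_lhs(formula))

  # Expand the '.' shorthand to every column except the label
  if (identical(X_names, '.')) {
    X_names <- names(data)[names(data) != y_name]
  }

  # drop = FALSE keeps a single predictor as a one-column matrix
  X <- data.matrix(data[, X_names, drop = FALSE])
  y <- data[[y_name]]

  xgboost(data = X, label = y, ...)
}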
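
The formula-parsing half of the wrapper can be tried on its own, without xgboost or lazyeval installed. Below is a minimal sketch using only base R; parse_xy() and the toy data frame are hypothetical names introduced here for illustration, and indexing the formula with [[2]] and [[3]] stands in for f_lhs() and f_rhs().

```r
# Hypothetical helper: split a data frame into a feature matrix and a
# label vector according to a two-sided formula, using base R only
parse_xy <- function(data, formula) {
  data    <- as.data.frame(data)
  y_name  <- all.vars(formula[[2]])  # left-hand side, e.g. "cancer"
  X_names <- all.vars(formula[[3]])  # right-hand side, e.g. "." or c("a", "b")

  # Expand the "." shorthand to every column except the label
  if (identical(X_names, ".")) {
    X_names <- setdiff(names(data), y_name)
  }

  list(X = data.matrix(data[, X_names, drop = FALSE]),
       y = data[[y_name]])
}

toy <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), cancer = c(0, 1, 0))

colnames(parse_xy(toy, cancer ~ .)$X)
#> [1] "a" "b"
colnames(parse_xy(toy, cancer ~ a)$X)
#> [1] "a"
parse_xy(toy, cancer ~ a + b)$y
#> [1] 0 1 0
```

The list returned here holds exactly the two pieces that xgboost() takes as its data and label arguments.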

Let’s try it out:

pl_fit <- pl_xgboost(d, cancer ~ ., nrounds = 5, objective = "reg:logistic")
#> [1] train-rmse:0.372184
#> [2] train-rmse:0.288560
#> [3] train-rmse:0.230171
#> [4] train-rmse:0.188965
#> [5] train-rmse:0.158858

# Examine accuracy
pl_predicted <- as.numeric(predict(pl_fit, as.matrix(select(d, -cancer))) >= .5)
mean(pl_predicted == y)
#> [1] 0.9838946

Perfect!

Bringing it all together

We can now use pipelearner and pl_xgboost() for easy grid searching:

pl <- pipelearner(d, pl_xgboost, cancer ~ .,
                  nrounds = c(5, 10, 25),
                  eta = c(.1, .3),
                  max_depth = c(4, 6))

fits <- pl %>% learn()
#> [1] train-rmse:0.453832
#> [2] train-rmse:0.412548
#> ...

fits
#> # A tibble: 12 × 9
#> models.id cv_pairs.id train_p fit target model
#> <chr> <chr> <dbl> <list> <chr> <chr>
#> 1 1 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 2 10 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 3 11 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 4 12 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 5 2 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 6 3 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 7 4 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 8 5 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 9 6 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 10 7 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 11 8 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 12 9 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> # ... with 3 more variables: params <list>, train <list>, test <list>

Looks like all the models learned OK. Let’s write a custom function to extract model accuracy and examine the results:

accuracy <- function(fit, data, target_var) {
  # Convert resample object to data frame
  data <- as.data.frame(data)
  # Get feature matrix and labels
  X <- data %>%
    select(-matches(target_var)) %>%
    as.matrix()
  y <- data[[target_var]]
  # Obtain predicted class
  y_hat <- as.numeric(predict(fit, X) > .5)
  # Return accuracy
  mean(y_hat == y)
}

results <- fits %>%
  mutate(
    # hyperparameters
    nrounds   = map_dbl(params, "nrounds"),
    eta       = map_dbl(params, "eta"),
    max_depth = map_dbl(params, "max_depth"),
    # Accuracy
    accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
    accuracy_test  = pmap_dbl(list(fit, test, target), accuracy)
  ) %>%
  # Select columns and order rows
  select(nrounds, eta, max_depth, contains("accuracy")) %>%
  arrange(desc(accuracy_test), desc(accuracy_train))

results
#> # A tibble: 12 × 5
#> nrounds eta max_depth accuracy_train accuracy_test
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 25 0.3 6 1.0000000 0.9489051
#> 2 25 0.3 4 1.0000000 0.9489051
#> 3 10 0.3 6 0.9981685 0.9489051
#> 4 5 0.3 6 0.9945055 0.9489051
#> 5 10 0.1 6 0.9945055 0.9489051
#> 6 25 0.1 6 0.9945055 0.9489051
#> 7 5 0.1 6 0.9926740 0.9489051
#> 8 25 0.1 4 0.9890110 0.9489051
#> 9 10 0.3 4 0.9871795 0.9489051
#> 10 5 0.3 4 0.9853480 0.9489051
#> 11 10 0.1 4 0.9853480 0.9416058
#> 12 5 0.1 4 0.9835165 0.9416058

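Our top-ranked model, which got 94.89% on the test set, had nrounds = 25, eta = 0.3, and max_depth = 6. Note that ten of the twelve configurations reached exactly the same test accuracy, so the ordering among them comes from training accuracy alone.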
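
A quick arithmetic check helps put the two distinct test accuracies in the table in perspective. The sketch below assumes a 137-row test set (about 20% of the 683 rows); that size is an inference from the printed numbers, not something the output states.

```r
n_test <- 137  # assumed test-set size, inferred from the accuracies above

# 130 correct predictions out of 137
round(130 / n_test, 7)
#> [1] 0.9489051

# 129 correct predictions out of 137
round(129 / n_test, 7)
#> [1] 0.9416058
```

Under that assumption, the gap between the best and worst configurations is a single test case, so this one split says little about which hyperparameters are actually better.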

Either way, the trick was the wrapper function pl_xgboost() that let us bridge xgboost and pipelearner. Note that this same principle can be used for any other machine learning functions that don’t play nice with pipelearner.

Bonus: bootstrapped cross validation

For those of you who are comfortable, below is a bonus example of using 100 bootstrapped cross validation samples to examine consistency in the accuracy. It doesn’t get much easier than using pipelearner!

results <- pipelearner(d, pl_xgboost, cancer ~ ., nrounds = 25) %>%
  learn_cvpairs(n = 100) %>%
  learn() %>%
  mutate(
    test_accuracy = pmap_dbl(list(fit, test, target), accuracy)
  )
#> [1] train-rmse:0.357471
#> [2] train-rmse:0.256735
#> ...

results %>%
  ggplot(aes(test_accuracy)) +
  geom_histogram(bins = 30) +
  scale_x_continuous(labels = scales::percent) +
  theme_minimal() +
  labs(x = "Accuracy", y = "Number of samples",
       title = "Test accuracy distribution for\n100 bootstrapped samples")
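
For readers who want to see the idea of repeated resampling without pipelearner or xgboost, here is a minimal base-R sketch. It makes several assumptions that are not from the post: it uses the built-in mtcars data, a logistic regression as a stand-in model, and plain random 80/20 hold-out splits, which may differ from the resampling pipelearner performs internally.

```r
set.seed(1)            # make the random splits reproducible
n_splits <- 100
acc <- numeric(n_splits)

for (i in seq_len(n_splits)) {
  # Hold out a random 20% of rows as the test set
  test_idx <- sample(nrow(mtcars), size = round(0.2 * nrow(mtcars)))
  train <- mtcars[-test_idx, ]
  test  <- mtcars[test_idx, ]

  # Stand-in classifier: logistic regression predicting `am` from `mpg`
  # (warnings from near-separable splits are suppressed for brevity)
  fit <- suppressWarnings(glm(am ~ mpg, data = train, family = binomial))
  p   <- predict(fit, newdata = test, type = "response")

  # Same accuracy rule as in the post: threshold at 0.5
  acc[i] <- mean(as.numeric(p > .5) == test$am)
}

# Spread of test accuracy across the 100 splits
summary(acc)
```

The spread reported by summary() plays the same role as the histogram: it shows how much a single test-set accuracy can move from split to split.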

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

Reposted from: https://drsimonj.svbtle.com/with-our-powers-combined-xgboost-and-pipelearner
