Auto-scaling scikit-learn with Apache Spark
Source: https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html
Data scientists often spend hours or days tuning models to get the highest accuracy. This tuning typically involves running a large number of independent machine learning (ML) tasks coded in Python or R. Following work presented at Spark Summit Europe 2015, we are excited to release the scikit-learn integration package for Apache Spark, which dramatically simplifies the life of data scientists using Python. This package automatically distributes the most repetitive tasks of model tuning over a Spark cluster, without impacting the data scientist's workflow:
- When used on a single machine, Spark can serve as a substitute for the default parallelism framework used by scikit-learn (joblib).
- If the work later needs to be spread across multiple machines, no code change is required between the single-machine case and the cluster case.
Scale data science effortlessly
Python is one of the most popular programming languages for data exploration and data science, in no small part thanks to high-quality libraries such as Pandas for data exploration and scikit-learn for machine learning. Scikit-learn provides fast and robust implementations of standard ML algorithms such as clustering, classification, and regression.
Scikit-learn’s strength has typically been in the realm of computing on a single node, though. For some common scenarios, such as parameter tuning, a large number of small tasks can be run in parallel. These scenarios are perfect use cases for Spark.
We explored how to integrate Spark with scikit-learn, and the result is the Scikit-learn integration package for Spark. It combines the strengths of Spark and scikit-learn with no changes to users’ code. It re-implements some components of scikit-learn that benefit the most from distributed computing. Users will find a Spark-based cross-validator class that is fully compatible with scikit-learn’s cross-validation tools. By swapping out a single class import, users can distribute cross-validation for their existing scikit-learn workflows.
Distribute tuning of Random Forests
Consider a classical example of identifying digits in images. Here are a few examples of images taken from the popular digits dataset, with their labels:

We are going to train a random forest classifier to recognize the digits. This classifier has a number of parameters to adjust, and there is no easy way to know which parameters work best, other than trying out many different combinations. Scikit-learn provides GridSearchCV, a search algorithm that explores many parameter settings automatically. GridSearchCV selects among the candidates by cross-validation: each parameter setting produces one model, and the best-performing model is selected.

The original code, using only scikit-learn, is as follows:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV  # moved to sklearn.model_selection in scikit-learn >= 0.18

digits = datasets.load_digits()
X, y = digits.data, digits.target

param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [1, 3, 10],  # recent scikit-learn versions require values >= 2
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": [10, 20, 40, 80]}

gs = GridSearchCV(RandomForestClassifier(), param_grid=param_grid)
gs.fit(X, y)
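The grid above is larger than it looks. A quick count with scikit-learn's ParameterGrid (a small sketch reusing the param_grid defined above) shows how many models the search has to fit:

from sklearn.grid_search import ParameterGrid  # sklearn.model_selection in >= 0.18

# 2 * 3 * 3 * 3 * 2 * 2 * 4 = 864 parameter settings, each of which is
# cross-validated (3-fold by default in scikit-learn versions of this era),
# so roughly 2,592 random forests are trained in total.
print(len(ParameterGrid(param_grid)))  # 864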
The dataset is small (in the hundreds of kilobytes), but exploring all 864 combinations takes about 5 minutes on a single core. The scikit-learn package for Spark provides an alternative implementation of the cross-validation algorithm that distributes the workload across a Spark cluster. Each node runs the training algorithm with a local copy of the scikit-learn library and reports the best model back to the master.
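Conceptually, the distribution strategy looks like the sketch below. This illustrates the idea only, not spark_sklearn's actual internals: each parameter setting becomes one Spark task, which fits and cross-validates a model with the worker's local scikit-learn and ships only its score back to the driver.

# Conceptual sketch only -- not spark_sklearn's implementation.
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in >= 0.18
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import ParameterGrid

def evaluate(params):
    # Runs on a worker with its local copy of scikit-learn; X and y travel
    # along in the task closure, which is fine for a dataset this small.
    model = RandomForestClassifier(**params)
    return params, cross_val_score(model, X, y).mean()

# One task per parameter setting; only parameters and scores return to the driver.
results = sc.parallelize(list(ParameterGrid(param_grid))).map(evaluate).collect()
best_params, best_score = max(results, key=lambda r: r[1])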

The code is almost identical to the version above; the grid search class is now imported from spark_sklearn, and its constructor additionally receives the SparkContext:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
# Use spark_sklearn's grid search instead:
from spark_sklearn import GridSearchCV

digits = datasets.load_digits()
X, y = digits.data, digits.target

param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [1, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": [10, 20, 40, 80]}

# spark_sklearn's GridSearchCV takes the SparkContext as its first argument;
# sc is created automatically in a PySpark shell or Databricks notebook.
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid=param_grid)
gs.fit(X, y)
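Because the class is designed as a drop-in replacement, the fitted search object can be inspected just as in scikit-learn (assuming the usual result attributes of GridSearchCV):

# Inspect the results exactly as with scikit-learn's grid search:
print(gs.best_params_)
print(gs.best_score_)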
This example runs in under 30 seconds on a 4-node cluster (16 CPUs in total). For larger datasets and more parameter settings, the difference is even more dramatic.

Get started
If you would like to try out this package yourself, it is available as a Spark package and as a PyPI library. To get started, check out this example notebook on Databricks.
In addition to distributing ML tasks in Python across a cluster, the scikit-learn integration package for Spark provides tools to export data from Spark to Python and vice versa: you will find methods to convert Spark DataFrames to Pandas DataFrames and NumPy arrays. More details can be found in this Spark Summit Europe presentation and in the API documentation.