Machine and statistical learning wizards are becoming more and more eager to perform analysis with Spark MLlib whenever possible. It’s trendy, posh, spicy and gives the feeling of doing state-of-the-art machine learning and being up to date with the newest computational trends. It is even more sexy and powerful when computations can be performed on an extraordinarily enormous computation cluster - let’s say 100 machines on a YARN Hadoop cluster makes you the real data cruncher! In this post I present the sparklyr package (by RStudio), the connector that will transform you from a regular R user into the supa! data scientist that can invoke Scala code to perform machine learning algorithms on a YARN cluster just from RStudio! Moreover, I present how I have extended the interface to the K-means procedure, so that it is now also possible to compute the cost for that model, which might be beneficial in determining the number of clusters in segmentation problems. Thought about learning Scala? Leave it - use sparklyr!

If you don’t know much about Spark yet, you can read my April post Answers to FAQ about SparkR for R users, where I explained how we can use the SparkR package that is distributed with Spark. Many things (code) might have changed since that time, due to the rapid development caused by the great popularity of Spark. Now we can use version 2.0.0 of Spark. If you are migrating from previous versions, I suggest you have a look at Migration Guide - Upgrading From SparkR 1.6.x to 2.0.

sparklyr basics

This package is based on the sparkapi package that enables you to run Spark applications locally or on a YARN cluster just from R. It translates R code into a bash invocation of spark-shell. Its biggest advantages are the dplyr interface for working with Spark DataFrames (that might be Hive tables) and the possibility to invoke algorithms from the Spark ML library.

The installation of sparklyr, then of Spark itself, and a simple application initiation are shown in the code below

# install sparklyr from GitHub
library(devtools)
install_github('rstudio/sparklyr')

# install Spark 2.0.0 and open a connection on YARN
library(sparklyr)
spark_install(version = "2.0.0")
sc <- spark_connect(
  master = "yarn",
  config = list(
    default = list(
      spark.submit.deployMode = "client",
      spark.executor.instances = 20,
      spark.executor.memory = "2G",
      spark.executor.cores = 4,
      spark.driver.memory = "4G")))

One doesn’t have to specify the config oneself, but if this is desired then remember that you can also specify the parameters for the Spark application with config.yml files, so that you can benefit from many profiles (development, production). In version 2.0.0 it is desired to name the master yarn instead of yarn-client and to pass the deployMode parameter, which is different from version 1.6.x. All available parameters can be found on the Running Spark on YARN documentation page.
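A minimal sketch of how this could look (the file name, profile names and values below are my assumptions, not part of the original setup); sparklyr’s spark_config() reads such a file in the config package format:

# config.yml (illustrative content):
# default:
#   spark.executor.instances: 20
#   spark.executor.memory: "2G"
# production:
#   spark.executor.instances: 50
#   spark.executor.memory: "4G"

# profile selection typically follows the config package's R_CONFIG_ACTIVE convention
Sys.setenv(R_CONFIG_ACTIVE = "production")
sc <- spark_connect(master = "yarn",
                    config = spark_config(file = "config.yml"))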

dplyr and DBI interface on Spark

When connecting to YARN, you will most probably want to use data tables that are stored in Hive. Remember that

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.

where conf/ is set as HADOOP_CONF_DIR. Read more about using Hive tables from Spark.
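For example, a minimal sketch (the paths below are placeholders for your cluster’s layout, not values from the post) of pointing the R session at the cluster configuration before calling spark_connect():

# hypothetical locations -- adjust to where your cluster keeps the *-site.xml files
Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf")
Sys.setenv(YARN_CONF_DIR = "/etc/hadoop/conf")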

If everything is set up and the application runs properly, you can use the dplyr interface, which provides lazy evaluation for data manipulations. Data are stored in Hive, the Spark application runs on the YARN cluster, and the code is invoked from R in the simple language of data transformations (dplyr) - everything thanks to the sparklyr team’s great job! An easy example is below

library(dplyr)
# give the list of tables
src_tbls(sc)
# copies iris from R to Hive
iris_tbl <- copy_to(sc, iris, "iris")
# create a hook for data stored on Hive
data_tbl <- tbl(sc, "table_name")
data_tbl2 <- tbl(sc, sql("SELECT * from table_name"))

You can also perform any dplyr operation on datasets used by Spark

iris_tbl %>%
  select(Petal_Length, Petal_Width) %>%
  top_n(40, Petal_Width) %>%
  arrange(Petal_Length)

Note that the original dots in the iris column names have been translated to _.
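Because evaluation is lazy, the pipeline above only builds a Spark SQL query; nothing is computed until you ask for the results. A minimal sketch (my addition, not from the original post) of pulling the result back into a local data frame with collect():

result <- iris_tbl %>%
  select(Petal_Length, Petal_Width) %>%
  top_n(40, Petal_Width) %>%
  collect()   # triggers execution on the cluster and returns a local tibble
head(result)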

This package also provides an interface for functions defined in the DBI package

library(DBI)
dbListTables(sc)                     # list available tables
dbGetQuery(sc, "use database_name")  # switch the Hive database
data_tbl3 <- dbGetQuery(sc, "SELECT * from table_name")  # result comes back as a local data.frame
dbListFields(sc, "table_name")       # list the columns of a Hive table

Running the Spark ML K-means Algorithm from R

A basic example of how sparklyr invokes Scala code from Spark ML will be presented with the K-means algorithm. If you check the code of the sparklyr::ml_kmeans function, you will see that for an input tbl_spark object named x and a character vector containing the features’ names (features)

envir <- new.env(parent = emptyenv())
df <- spark_dataframe(x)
sc <- spark_connection(df)
df <- ml_prepare_features(df, features)
tdf <- ml_prepare_dataframe(df, features, ml.options = ml.options, envir = envir)

sparklyr ensures that you have a proper connection to the Spark data frame and prepares the features in a convenient form and naming convention. At the end it prepares a Spark DataFrame for the Spark ML routines.

This is done in a new environment, so that we can store the arguments for the future ML algorithm and the model itself in its own environment. This is a safe and clean solution. You can construct a simple model by calling a Spark ML class like this

envir$model <- "org.apache.spark.ml.clustering.KMeans"
kmeans <- invoke_new(sc, envir$model)

which creates a new object of the KMeans class, on which we can invoke parameter setters to change the default parameters like this

model <- kmeans %>%
  invoke("setK", centers) %>%
  invoke("setMaxIter", iter.max) %>%
  invoke("setTol", tolerance) %>%
  invoke("setFeaturesCol", envir$features)
# features were set in ml_prepare_dataframe

For an existing object of the KMeans class we can invoke its method called fit, which is responsible for starting the K-means clustering algorithm

fit <- model %>%
  invoke("fit", tdf)

which returns a new object from which we can extract, e.g., the centers of the resulting clustering

kmmCenters <- invoke(fit, "clusterCenters")

or the Within Set Sum of Squared Errors (called Cost), which is my small contribution (#173)

kmmCost <- invoke(fit, "computeCost", tdf)

This sometimes helps to decide how many clusters we should specify for the clustering problem, and it is presented in the print method for the ml_model_kmeans object

iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_kmeans(centers = 3, compute.cost = TRUE) %>%
  print()

K-means clustering with 3 clusters

Cluster centers:
  Petal_Width Petal_Length
1    1.359259     4.292593
2    2.047826     5.626087
3    0.246000     1.462000

Within Set Sum of Squared Errors = 31.41289
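For instance, a rough sketch (my illustration, not code from the original post) of the elbow approach, assuming the fitted ml_model_kmeans object exposes the computed cost as its cost element:

# fit K-means for several k and compare the Within Set Sum of Squared Errors
ks <- 2:8
costs <- sapply(ks, function(k) {
  iris_tbl %>%
    select(Petal_Width, Petal_Length) %>%
    ml_kmeans(centers = k, compute.cost = TRUE) %>%
    .$cost   # assumes the cost slot is named `cost` (added in #173)
})
plot(ks, costs, type = "b",
     xlab = "number of clusters k",
     ylab = "Within Set Sum of Squared Errors")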

All of that can be better understood if we have a look at the Spark ML documentation for KMeans (be careful not to confuse it with Spark MLlib, where methods and parameters have different names than those in Spark ML). This enabled me to provide a simple update for ml_kmeans() (#179), so that we can specify the tol (tolerance) parameter in ml_kmeans() to support the tolerance of convergence.
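As a usage sketch (assuming the argument is exposed as tolerance, matching the setTol call in the internals shown earlier), this could look like:

iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_kmeans(centers = 3, iter.max = 50, tolerance = 1e-05, compute.cost = TRUE)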

 


Source: http://r-addict.com/2016/08/25/Extending-Sparklyr.html
