Machine learning and statistical learning wizards are increasingly eager to run their analyses with the Spark ML library whenever that is possible. It's trendy, posh, spicy, and gives the feeling of doing state-of-the-art machine learning and keeping up with the newest computational trends. It is even more sexy and powerful when the computations run on an extraordinarily enormous cluster: let's say 100 machines on a YARN Hadoop cluster make you the real data cruncher! In this post I present the sparklyr package (by RStudio), the connector that will transform you from a regular R user into the supa! data scientist who can invoke Scala code to run machine learning algorithms on a YARN cluster straight from RStudio! Moreover, I present how I have extended the interface to the K-means procedure, so that it is now also possible to compute the cost of that model, which can help in determining the number of clusters in segmentation problems. Thought about learning Scala? Leave it - use sparklyr!

If you don't know much about Spark yet, you can read my April post, Answers to FAQ about SparkR for R users, where I explained how to use the SparkR package that is distributed with Spark. Many things (including code) might have changed since then, due to the rapid development driven by Spark's great popularity. We can now use version 2.0.0 of Spark. If you are migrating from a previous version, I suggest looking at the Migration Guide - Upgrading From SparkR 1.6.x to 2.0.

sparklyr basics

This package is based on the sparkapi package, which makes it possible to run Spark applications locally or on a YARN cluster straight from R. It translates R code into a bash invocation of spark-shell. Its biggest advantage is the dplyr interface for working with Spark DataFrames (which might be Hive tables) and the possibility of invoking algorithms from the Spark ML library.

The code below installs sparklyr, then Spark itself, and starts a simple application:

library(devtools)
install_github('rstudio/sparklyr')
library(sparklyr)
spark_install(version = "2.0.0")

sc <- spark_connect(
  master = "yarn",
  config = list(
    default = list(
      spark.submit.deployMode = "client",
      spark.executor.instances = 20,
      spark.executor.memory = "2G",
      spark.executor.cores = 4,
      spark.driver.memory = "4G")))

You don't have to specify the config yourself, but if you want to, remember that you can also set the parameters of a Spark application in config.yml files, so that you can benefit from multiple profiles (development, production). In version 2.0.0 you should name the master yarn rather than yarn-client and pass the deployMode parameter separately, which differs from version 1.6.x. All available parameters can be found on the Running Spark on YARN documentation page.
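A minimal sketch of such a config.yml, assuming the profile layout of the config package that sparklyr's spark_config() reads. The production profile name and all values are illustrative assumptions; the keys are the ones passed to spark_connect() in this post.

```yaml
# config.yml (illustrative values; the `production` profile name is an assumption)
default:
  spark.submit.deployMode: client
  spark.executor.instances: 2
  spark.executor.memory: 1G
  spark.executor.cores: 2
  spark.driver.memory: 2G

production:
  spark.executor.instances: 20
  spark.executor.memory: 2G
  spark.executor.cores: 4
  spark.driver.memory: 4G
```

With this file in the working directory, spark_config() should pick up the default profile, and setting the R_CONFIG_ACTIVE environment variable to production before calling it should select the production one (values not overridden there are inherited from default).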

dplyr and DBI interface on Spark

When connecting to YARN, you will most probably want to use tables that are stored in Hive. Remember that

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.

where conf/ is the directory set as HADOOP_CONF_DIR. Read more about using Hive tables from Spark.

If everything is set up and the application runs properly, you can use the dplyr interface, which evaluates data manipulations lazily. The data are stored in Hive, the Spark application runs on the YARN cluster, and the code is invoked from R in the simple language of data transformations (dplyr) - all thanks to the great job done by the sparklyr team! An easy example is below:

library(dplyr)

# list the available tables
src_tbls(sc)

# copy iris from R into Spark (registered as a table named "iris")
iris_tbl <- copy_to(sc, iris, "iris")

# create a reference to data stored in Hive
data_tbl <- tbl(sc, "table_name")
data_tbl2 <- tbl(sc, sql("SELECT * from table_name"))

You can also perform any dplyr operation on the datasets used by Spark:

iris_tbl %>%
  select(Petal_Length, Petal_Width) %>%
  top_n(40, Petal_Width) %>%
  arrange(Petal_Length)

Note that the dots in the original iris column names (e.g. Petal.Length) have been translated to _.

The package also provides an interface to the functions defined in the DBI package:

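library(DBI)

dbListTables(sc)
dbGetQuery(sc, "use database_name")

# dbGetQuery() returns the query result as a local R data frame
data_tbl3 <- dbGetQuery(sc, "SELECT * from table_name")

# dbListFields() expects a table name, not the data frame fetched above
dbListFields(sc, "table_name")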

Running Spark ML Machine Learning K-means Algorithm from R

A basic example of how sparklyr invokes Scala code from Spark ML is the K-means algorithm. If you check the code of the sparklyr::ml_kmeans function, you will see that, for an input tbl_spark object named x and a character vector of feature names (features), it starts with

envir <- new.env(parent = emptyenv())
df <- spark_dataframe(x)
sc <- spark_connection(df)
df <- ml_prepare_features(df, features)
tdf <- ml_prepare_dataframe(df, features, ml.options = ml.options, envir = envir)

sparklyr ensures that you have a proper connection to the Spark data frame and prepares the features in a convenient form and naming convention. At the end it prepares a Spark DataFrame for the Spark ML routines.

This is done in a new environment, so that the arguments for the future ML algorithm and the model itself are stored in an environment of their own. This is a safe and clean solution. You can construct a simple model by calling a Spark ML class like this:

envir$model <- "org.apache.spark.ml.clustering.KMeans"
kmeans <- invoke_new(sc, envir$model)

which creates a new object of class KMeans, on which we can invoke parameter setters to change the default parameters, like this:

model <- kmeans %>%
  invoke("setK", centers) %>%
  invoke("setMaxIter", iter.max) %>%
  invoke("setTol", tolerance) %>%
  invoke("setFeaturesCol", envir$features)
# envir$features was set in ml_prepare_dataframe

On an existing object of the KMeans class we can invoke its fit method, which is responsible for starting the K-means clustering algorithm

fit <- model %>%
  invoke("fit", tdf)

which returns a new object on which we can compute, e.g., the centers of the resulting clustering

kmmCenters <- invoke(fit, "clusterCenters")

or the Within Set Sum of Squared Errors, called the cost (which is my small contribution, #173)

kmmCost <- invoke(fit, "computeCost", tdf)
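For reference, Spark documents this cost as the sum of squared distances of points to their nearest center. Written out in notation of my own choosing (not taken from the Spark documentation), with feature vectors x_1, ..., x_n and fitted centers c_1, ..., c_k, it reads:

```latex
\mathrm{WSSSE} = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - c_j \rVert^{2}
```

The value never increases when the best clustering for a larger k is considered, so it is the shape of the decrease, not the raw number, that carries information about a sensible k.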

The cost sometimes helps to decide how many clusters to specify in a clustering problem, and it is shown by the print method for the ml_model_kmeans object:

iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_kmeans(centers = 3, compute.cost = TRUE) %>%
  print()

K-means clustering with 3 clusters

Cluster centers:
  Petal_Width Petal_Length
1    1.359259     4.292593
2    2.047826     5.626087
3    0.246000     1.462000

Within Set Sum of Squared Errors =  31.41289
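To illustrate how such a cost can guide the choice of the number of clusters, here is a minimal local sketch of the "elbow" idea. It is an assumption-laden stand-in, not the Spark code path: it uses base R's stats::kmeans on the local iris data instead of Spark ML, and the helper name wssse is hypothetical.

```r
# Local stand-in for illustration: base R kmeans instead of Spark ML.
features <- as.matrix(iris[, c("Petal.Width", "Petal.Length")])

# Hypothetical helper: sum of squared distances of each point
# to the centre of the cluster it was assigned to.
wssse <- function(x, fit) {
  assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]
  sum((x - assigned_centers)^2)
}

set.seed(42)
ks <- 1:6
costs <- sapply(ks, function(k) {
  fit <- kmeans(features, centers = k, nstart = 20)
  wssse(features, fit)
})

# Look for the k after which the cost stops dropping sharply.
print(data.frame(k = ks, cost = costs))
```

The same loop could be run against Spark by calling ml_kmeans() with compute.cost = TRUE for each k and reading the printed cost, at the price of one Spark job per candidate k.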

All of this is easier to understand after a look at the Spark ML documentation for KMeans (be careful not to confuse it with Spark MLlib, where methods and parameters have different names than in Spark ML). Reading it enabled me to provide a simple update to ml_kmeans() (#179), so that we can specify the tol (tolerance) parameter in ml_kmeans() to control the convergence tolerance.



Reposted from: http://r-addict.com/2016/08/25/Extending-Sparklyr.html
