R formulas in Spark and un-nesting data in SparklyR: Nice and handy!

Intro
In an earlier post I talked about Spark and sparklyR and did some experiments. At my work here at RTL Nederland we have a Spark cluster on Amazon EMR to do some serious heavy lifting on click and video-on-demand data. For an R user it makes perfectly sense to use Spark through the sparklyRinterface. However, using Spark through the pySparkinterface certainly has its benefits. It exposes much more of the Spark functionality and I find the concept of ML Pipelinesin Spark very elegant.
In using Spark I like to share two little tricks described below with you.
The RFormula feature selector
As an R user you have to get used to using Spark through pySpark, moreover, I had to brush up some of my rusty Python knowledge. For training machine learning models there is some help though by using an RFormula
R users know the concept of model formulae in R, it can be handy way to formulate predictive models in a concise way. In Spark you can also use this concept, only a limited set of R operators are available (+, – , . and :) , but it is enough to be useful. The two figures below show a simple example.
|
1
2
3
4
|
from pyspark.ml.feature import RFormulaf1 = "Targetf ~ paidDuration + Gender "formula = RFormula(formula = f1)train2 = formula.fit(train).transform(train) |

A handy thing about an RFormula in Spark is (just like using a formula in R in lm and some other modeling functions) that string features used in an RFormula will be automatically onehot encoded, so that they can be used directly in the Spark machine learning algorithms.
Nested (hierarchical) data in sparklyR
Sometimes you may find your self with nested hierarchical data. In pySpark you can flatten this hierarchy if needed. A simple example, suppose you read in a parquet file and it has the following structure:
Then to flatten the data you could use:
In SparklyR however, reading the same parquet file results in something that isn’t useful to work with at first sight. If you open the table viewer to see the data, you will see rows with: <environment>.
Fortunately, the facilities used internally by sparklyR to call Spark are available to the end user. You can invoke more methods in Spark if needed. So we can invoke the select and col method our self to flatten the hierarchy.
After registering the output object, it is visible in the Spark interface and you can view the content.
Thanks for reading my two tricks. Cheers, Longhow.
转自:https://longhowlam.wordpress.com/2017/02/15/r-formulas-in-spark-and-un-nesting-data-in-sparklyr-nice-and-handy/
R formulas in Spark and un-nesting data in SparklyR: Nice and handy!的更多相关文章
- Introducing DataFrames in Apache Spark for Large Scale Data Science(中英双语)
文章标题 Introducing DataFrames in Apache Spark for Large Scale Data Science 一个用于大规模数据科学的API——DataFrame ...
- 大数据工具比较:R 语言和 Spark 谁更胜一筹?
本文有两重目的,一是在性能方面快速对比下R语言和Spark,二是想向大家介绍下Spark的机器学习库 背景介绍 由于R语言本身是单线程的,所以可能从性能方面对比Spark和R并不是很明智的做法.即使这 ...
- Using Apache Spark and MySQL for Data Analysis
What is Spark Apache Spark is a cluster computing framework, similar to Apache Hadoop. Wikipedia has ...
- 【译】Using .NET for Apache Spark to Analyze Log Data
.NET for Spark可用于处理成批数据.实时流.机器学习和ad-hoc查询.在这篇博客文章中,我们将探讨如何使用.NET for Spark执行一个非常流行的大数据任务,即日志分析. 1 什么 ...
- Spark性能优化之道——解决Spark数据倾斜(Data Skew)的N种姿势
原创文章,同步首发自作者个人博客转载请务必在文章开头处注明出处. 摘要 本文结合实例详细阐明了Spark数据倾斜的几种场景以及对应的解决方案,包括避免数据源倾斜,调整并行度,使用自定义Partitio ...
- R class of subset of matrix and data.frame
a = matrix( c(2, 4, 3, 1, 5, 7), # the data elements nrow=2, # number of rows ...
- Spark性能调优之道——解决Spark数据倾斜(Data Skew)的N种姿势
原文:http://blog.csdn.net/tanglizhe1105/article/details/51050974 背景 很多使用Spark的朋友很想知道rdd里的元素是怎么存储的,它们占用 ...
- Spark SQL is a Spark module for structured data processing.
http://spark.apache.org/docs/latest/sql-programming-guide.html
- Managing Spark data handles in R
When working with big data with R (say, using Spark and sparklyr) we have found it very convenient t ...
随机推荐
- ASP.NET自定义模块
要创建自定义模块,类需要实现IHttpModule接口.这个接口定义了Init和Dispose方法. Init方法在启动Web应用程序时调用,其参数的类型是HttpContext,可以添加应用程序处理 ...
- 跟着刚哥梳理java知识点——HelloWorld和常见问题(一)
1.按照国际惯例,写一段输出HelloWorld的java语句: public class HelloWorld { //这是main方法,程序的主入口 public static void main ...
- 什么是javascript的回调函数?
回调函数(callback) 基本上每本书里都会提一提实际上我们几乎每天都在用回调函数,那么如果问你到底什么是回调函数呢? 1. 回调函数是作为参数传递给另一个函数 2. 函数运行到某种程度时,执行回 ...
- php数组--2017-04-16
一.定义数组 (1)索引数组 $arr=array(1,2,3,3); (2)关联数组 类似于集合 $arr1=array("one"=>"111",& ...
- 关于sql、mysql语句的模糊查询分类与详解,包括基本用法和mapper.xml文件里插入写法
欢迎猿类加qq:2318645572,共同学习进步 实际例子: ssm框架:service业务层->dao层->mappers.xml->junit/test测试 1:service ...
- 第三章 Docker的镜像
3.1.获取镜像 获取镜像 docker pull name[:TAG] #默认是从网络下载镜像,不指定tag会人下载latest标签下的镜像. 1 2 docker search ubuntu do ...
- 2017-4-26 winform tab和无边框窗体制作
TabIndex-----------------------------------确定此控件将占用的Tab键顺序索引 Tabstop-------------------------------指 ...
- JDBC访问数据库
一.准备条件 外界条件 在数据库中首先创建表空间 在创建的表中添加数据 代码部分 导入数据库的驱动包(jar) 加载数据库驱动 获取数据库连接 编写sql语句 利用prepareStatement进行 ...
- macOS 中使用 phpize 动态添加 PHP 扩展的错误解决方法
使用 phpize 动态添加 PHP 扩展是开发中经常需要做的事情,但是在 macOS 中,首次使用该功能必然会碰到一些错误,本文列出了这些错误的解决方法. 问题一: 执行 phpize 报错如下: ...
- PHP的虚拟域名的配置
由于本人的自己搭建的php环境,Wamp环境.虚拟域名是写程序的一个最基本的配置,也对项目的调试有一定的真实感.我在学习虚拟域名的时候是受到了一些的问题,所以说写这个是为了帮助新人少走弯路, 也是为了 ...