R formulas in Spark and un-nesting data in SparklyR: Nice and handy!

Intro
In an earlier post I talked about Spark and sparklyR and did some experiments. At my work here at RTL Nederland we have a Spark cluster on Amazon EMR to do some serious heavy lifting on click and video-on-demand data. For an R user it makes perfectly sense to use Spark through the sparklyRinterface. However, using Spark through the pySparkinterface certainly has its benefits. It exposes much more of the Spark functionality and I find the concept of ML Pipelinesin Spark very elegant.
In using Spark I like to share two little tricks described below with you.
The RFormula feature selector
As an R user you have to get used to using Spark through pySpark, moreover, I had to brush up some of my rusty Python knowledge. For training machine learning models there is some help though by using an RFormula
R users know the concept of model formulae in R, it can be handy way to formulate predictive models in a concise way. In Spark you can also use this concept, only a limited set of R operators are available (+, – , . and :) , but it is enough to be useful. The two figures below show a simple example.
|
1
2
3
4
|
from pyspark.ml.feature import RFormulaf1 = "Targetf ~ paidDuration + Gender "formula = RFormula(formula = f1)train2 = formula.fit(train).transform(train) |

A handy thing about an RFormula in Spark is (just like using a formula in R in lm and some other modeling functions) that string features used in an RFormula will be automatically onehot encoded, so that they can be used directly in the Spark machine learning algorithms.
Nested (hierarchical) data in sparklyR
Sometimes you may find your self with nested hierarchical data. In pySpark you can flatten this hierarchy if needed. A simple example, suppose you read in a parquet file and it has the following structure:
Then to flatten the data you could use:
In SparklyR however, reading the same parquet file results in something that isn’t useful to work with at first sight. If you open the table viewer to see the data, you will see rows with: <environment>.
Fortunately, the facilities used internally by sparklyR to call Spark are available to the end user. You can invoke more methods in Spark if needed. So we can invoke the select and col method our self to flatten the hierarchy.
After registering the output object, it is visible in the Spark interface and you can view the content.
Thanks for reading my two tricks. Cheers, Longhow.
转自:https://longhowlam.wordpress.com/2017/02/15/r-formulas-in-spark-and-un-nesting-data-in-sparklyr-nice-and-handy/
R formulas in Spark and un-nesting data in SparklyR: Nice and handy!的更多相关文章
- Introducing DataFrames in Apache Spark for Large Scale Data Science(中英双语)
文章标题 Introducing DataFrames in Apache Spark for Large Scale Data Science 一个用于大规模数据科学的API——DataFrame ...
- 大数据工具比较:R 语言和 Spark 谁更胜一筹?
本文有两重目的,一是在性能方面快速对比下R语言和Spark,二是想向大家介绍下Spark的机器学习库 背景介绍 由于R语言本身是单线程的,所以可能从性能方面对比Spark和R并不是很明智的做法.即使这 ...
- Using Apache Spark and MySQL for Data Analysis
What is Spark Apache Spark is a cluster computing framework, similar to Apache Hadoop. Wikipedia has ...
- 【译】Using .NET for Apache Spark to Analyze Log Data
.NET for Spark可用于处理成批数据.实时流.机器学习和ad-hoc查询.在这篇博客文章中,我们将探讨如何使用.NET for Spark执行一个非常流行的大数据任务,即日志分析. 1 什么 ...
- Spark性能优化之道——解决Spark数据倾斜(Data Skew)的N种姿势
原创文章,同步首发自作者个人博客转载请务必在文章开头处注明出处. 摘要 本文结合实例详细阐明了Spark数据倾斜的几种场景以及对应的解决方案,包括避免数据源倾斜,调整并行度,使用自定义Partitio ...
- R class of subset of matrix and data.frame
a = matrix( c(2, 4, 3, 1, 5, 7), # the data elements nrow=2, # number of rows ...
- Spark性能调优之道——解决Spark数据倾斜(Data Skew)的N种姿势
原文:http://blog.csdn.net/tanglizhe1105/article/details/51050974 背景 很多使用Spark的朋友很想知道rdd里的元素是怎么存储的,它们占用 ...
- Spark SQL is a Spark module for structured data processing.
http://spark.apache.org/docs/latest/sql-programming-guide.html
- Managing Spark data handles in R
When working with big data with R (say, using Spark and sparklyr) we have found it very convenient t ...
随机推荐
- 淘宝内部分享:怎么跳出MySQL的10个大坑
编者按:淘宝自从2010开始规模使用MySQL,替换了之前商品.交易.用户等原基于IOE方案的核心数据库,目前已部署数千台规模.同时和Oracle, Percona, Mariadb等上游厂商有良好合 ...
- ubuntu中文字符集格式转换
- C#中的DateTime是值类型还是引用类型
近期遇到了DateTime到底是值类型还是引用类型的疑惑,顺势较深入地了解一下DateTime相关的内容 结论:DateTime是值类型,因为DateTime是结构体,而结构体继承自Syste.Val ...
- 图解函数重载以及arguments
- 利用 Forcing InnoDB Recovery 特性解决 MySQL 重启失败的问题
小明同学在本机上安装了 MySQL 5.7.17 配合项目进行开发,并且已经有了一部分重要数据.某天小明在开发的时候,需要出去一趟就直接把电脑关掉了,没有让 MySQL 正常关闭,重启 MySQL 的 ...
- 笔记整理:计算CPU使用率 ----linux 环境编程 从应用到内核
linux 提供time命令统计进程在用户态和内核态消耗的CPU时间: [root@localhost ~]# time sleep real 0m2.001s user 0m0.001s sys 0 ...
- OpenGL 的空间变换(下):空间变换
通过本文的上篇 OpenGL 的空间变换(上):矩阵在空间几何中的应用 ,我们了解到矩阵的基础概念.并且掌握了矩阵在空间几何中的应用.接下来,我们将结合矩阵来了解 OpenGL 的空间变换. 在使用 ...
- 使用validator-api来验证spring-boot的参数
作为服务端开发,验证前端传入的参数的合法性是一个必不可少的步骤,但是验证参数是一个基本上是一个体力活,而且冗余代码繁多,也影响代码的可阅读性,所以有没有一个比较优雅的方式来解决这个问题? 这么简单的问 ...
- Eclipse导入Android签名
本篇主要参照http://blog.csdn.net/wuxy_shenzhen/article/details/20946839 在安装安卓apk时经常会出现类似INSTALL_FAILED_SHA ...
- C语言学习第三章
写在课前,提醒自己写代码的时候一定要注意不能漏写符号!提醒自己写代码的时候一定要注意不能漏写符号!提醒自己写代码的时候一定要注意不能漏写符号! 今天主要学习掌握if...else条件结构,多重if条件 ...