R formulas in Spark and un-nesting data in SparklyR: Nice and handy!

Intro
In an earlier post I talked about Spark and sparklyR and did some experiments. At my work here at RTL Nederland we have a Spark cluster on Amazon EMR to do some serious heavy lifting on click and video-on-demand data. For an R user it makes perfectly sense to use Spark through the sparklyRinterface. However, using Spark through the pySparkinterface certainly has its benefits. It exposes much more of the Spark functionality and I find the concept of ML Pipelinesin Spark very elegant.
In using Spark I like to share two little tricks described below with you.
The RFormula feature selector
As an R user you have to get used to using Spark through pySpark, moreover, I had to brush up some of my rusty Python knowledge. For training machine learning models there is some help though by using an RFormula
R users know the concept of model formulae in R, it can be handy way to formulate predictive models in a concise way. In Spark you can also use this concept, only a limited set of R operators are available (+, – , . and :) , but it is enough to be useful. The two figures below show a simple example.
|
1
2
3
4
|
from pyspark.ml.feature import RFormulaf1 = "Targetf ~ paidDuration + Gender "formula = RFormula(formula = f1)train2 = formula.fit(train).transform(train) |

A handy thing about an RFormula in Spark is (just like using a formula in R in lm and some other modeling functions) that string features used in an RFormula will be automatically onehot encoded, so that they can be used directly in the Spark machine learning algorithms.
Nested (hierarchical) data in sparklyR
Sometimes you may find your self with nested hierarchical data. In pySpark you can flatten this hierarchy if needed. A simple example, suppose you read in a parquet file and it has the following structure:
Then to flatten the data you could use:
In SparklyR however, reading the same parquet file results in something that isn’t useful to work with at first sight. If you open the table viewer to see the data, you will see rows with: <environment>.
Fortunately, the facilities used internally by sparklyR to call Spark are available to the end user. You can invoke more methods in Spark if needed. So we can invoke the select and col method our self to flatten the hierarchy.
After registering the output object, it is visible in the Spark interface and you can view the content.
Thanks for reading my two tricks. Cheers, Longhow.
转自:https://longhowlam.wordpress.com/2017/02/15/r-formulas-in-spark-and-un-nesting-data-in-sparklyr-nice-and-handy/
R formulas in Spark and un-nesting data in SparklyR: Nice and handy!的更多相关文章
- Introducing DataFrames in Apache Spark for Large Scale Data Science(中英双语)
文章标题 Introducing DataFrames in Apache Spark for Large Scale Data Science 一个用于大规模数据科学的API——DataFrame ...
- 大数据工具比较:R 语言和 Spark 谁更胜一筹?
本文有两重目的,一是在性能方面快速对比下R语言和Spark,二是想向大家介绍下Spark的机器学习库 背景介绍 由于R语言本身是单线程的,所以可能从性能方面对比Spark和R并不是很明智的做法.即使这 ...
- Using Apache Spark and MySQL for Data Analysis
What is Spark Apache Spark is a cluster computing framework, similar to Apache Hadoop. Wikipedia has ...
- 【译】Using .NET for Apache Spark to Analyze Log Data
.NET for Spark可用于处理成批数据.实时流.机器学习和ad-hoc查询.在这篇博客文章中,我们将探讨如何使用.NET for Spark执行一个非常流行的大数据任务,即日志分析. 1 什么 ...
- Spark性能优化之道——解决Spark数据倾斜(Data Skew)的N种姿势
原创文章,同步首发自作者个人博客转载请务必在文章开头处注明出处. 摘要 本文结合实例详细阐明了Spark数据倾斜的几种场景以及对应的解决方案,包括避免数据源倾斜,调整并行度,使用自定义Partitio ...
- R class of subset of matrix and data.frame
a = matrix( c(2, 4, 3, 1, 5, 7), # the data elements nrow=2, # number of rows ...
- Spark性能调优之道——解决Spark数据倾斜(Data Skew)的N种姿势
原文:http://blog.csdn.net/tanglizhe1105/article/details/51050974 背景 很多使用Spark的朋友很想知道rdd里的元素是怎么存储的,它们占用 ...
- Spark SQL is a Spark module for structured data processing.
http://spark.apache.org/docs/latest/sql-programming-guide.html
- Managing Spark data handles in R
When working with big data with R (say, using Spark and sparklyr) we have found it very convenient t ...
随机推荐
- 在阿里云Linux服务器上安装MySQL
申请阿里云Linux服务器 昨天在阿里云申请了一个免费试用5天的Linux云服务器. 操作系统:Red Hat Enterprise Linux Server 5.4 64位. CPU:1核 内存:5 ...
- 重新认识JavaScript里的数据类型
一.序 数据类型,平时天天在用,今日闲暇便重新阅读了JavaScript数据类型这块,才发现平时用的时候有许些错误和不足,且对此深有感悟,便写下这篇文章加以巩固基础知识并有空翻出来温故而知新. 二.概 ...
- 在 redhat 6.4上安装Python 2.7.5
在工作环境中使用的是python 2.7.*,但是CentOS 6.4中默认使用的python版本是2.6.6,故需要升级版本. 安装步骤如下: 1,先安装GCC,用如下命令yum install g ...
- python 线程与进程
线程和进程简介 应用程序和进程以及线程的关系? 一个应用程序里可以有多个进程,一个进程里可以有多个线程 最原始的计算机是如何运行的? CPU是什么?为什么要使用多个CPU? 为什么要使用多线程? 为什 ...
- Java --- JSP2新特性
自从03年发布了jsp2.0之后,新增了一些额外的特性,这些特性使得动态网页设计变得更加容易.jsp2.0以后的版本统称jsp2.主要的新增特性有如下几个: 直接配置jsp属性 表达式语言(EL) 标 ...
- 如何选择合适的PHP开发框架
PHP作为一门成熟的WEB应用开发语言,已经深受广大开发者的青睐.与此同时,各式各样的PHP开发框架也从出不穷,面对如此多而且良莠不齐的开发框架,开发者们想必都会眼花缭乱,不知道该选择用哪个.其实并没 ...
- angular ng-bind
<body ng-app=""> <div ng-controller="firstController"> <input typ ...
- 使用canvas实现擦除效果
HTML5 的 canvas 元素使用 JavaScript 在网页上绘制图像.画布是一个矩形区域,您可以控制其每一像素.canvas 拥有多种绘制路径.矩形.圆形.字符以及添加图像的方法. html ...
- Eclipse 中 Java 项目中 .settings 文件夹作用
今天工作时,因对 .settings 文件夹误操作,耗时 6 个多小时,才了解到原因就出在 .settings 文件夹.经查阅资料,对 .settings 做如下整理: 就如setting这个名字,就 ...
- Java中的多线程Demo
一.关于Java多线程中的一些概念 1.1 线程基本概念 从JDK1.5开始,Java提供了3中方式来创建.启动多线程: 方式一(不推荐).通过继承Thread类来创建线程类,重写run()方法作为线 ...