Intro

In an earlier post I talked about Spark and sparklyr and did some experiments. At my work here at RTL Nederland we have a Spark cluster on Amazon EMR to do some serious heavy lifting on click and video-on-demand data. For an R user it makes perfect sense to use Spark through the sparklyr interface. However, using Spark through the pySpark interface certainly has its benefits: it exposes much more of the Spark functionality, and I find the concept of ML Pipelines in Spark very elegant.

Below I'd like to share with you two little tricks I use when working with Spark.

The RFormula feature selector

As an R user you have to get used to using Spark through pySpark; moreover, I had to brush up some of my rusty Python knowledge. For training machine learning models there is some help, though, by using an RFormula.

R users know the concept of model formulae in R; it can be a handy way to formulate predictive models concisely. In Spark you can also use this concept. Only a limited set of the R operators is available (+, . and :), but that is enough to be useful. The code below shows a simple example.

from pyspark.ml.feature import RFormula

# Specify the model R-style: target on the left, features on the right
f1 = "Targetf ~ paidDuration + Gender"
formula = RFormula(formula = f1)
train2 = formula.fit(train).transform(train)
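
The . and : operators mentioned above work the same way. A couple of hedged variations on the formula above:

# "." uses all columns other than the target as features
f2 = "Targetf ~ ."
# ":" adds an interaction between paidDuration and Gender
f3 = "Targetf ~ paidDuration + Gender + paidDuration:Gender"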

A handy thing about an RFormula in Spark (just like a formula used in R's lm and some other modeling functions) is that string features used in the formula are automatically one-hot encoded, so that they can be used directly in the Spark machine learning algorithms.
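
To see the encoding in action, here is a small self-contained sketch with made-up data; the string column Gender ends up as a dummy variable inside the features vector:

from pyspark.sql import SparkSession
from pyspark.ml.feature import RFormula

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [(120.0, "M", 1.0), (45.0, "F", 0.0), (80.0, "F", 1.0)],
    ["paidDuration", "Gender", "Targetf"],
)

# RFormula produces a `features` vector column and a `label` column;
# the string column Gender is one-hot encoded inside `features`
formula = RFormula(formula="Targetf ~ paidDuration + Gender")
formula.fit(train).transform(train).select("features", "label").show(truncate=False)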

Nested (hierarchical) data in sparklyr

Sometimes you may find yourself with nested (hierarchical) data. In pySpark you can flatten this hierarchy if needed. As a simple example, suppose you read in a parquet file with a nested structure; to flatten the data you select the nested fields directly, as sketched below.
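
A minimal sketch, assuming a parquet file at tmp/nested.parquet whose column address is a struct with fields street and city (path and column names are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
nested = spark.read.parquet("tmp/nested.parquet")
nested.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- address: struct (nullable = true)
#  |    |-- street: string (nullable = true)
#  |    |-- city: string (nullable = true)

# Flatten by selecting nested fields via their dotted names
flat = nested.select(
    col("name"),
    col("address.street").alias("street"),
    col("address.city").alias("city"),
)
flat.printSchema()  # name, street and city are now top-level columns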

In sparklyr, however, reading the same parquet file results in something that isn't useful to work with at first sight: if you open the table viewer to see the data, you will see rows with <environment>. Fortunately, the facilities used internally by sparklyr to call Spark are available to the end user, so you can invoke more Spark methods if needed. We can invoke the select and col methods ourselves to flatten the hierarchy, and after registering the output object it is visible in the Spark interface, where you can view its content. A sketch follows below.
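
A minimal sketch in R under the same assumptions as above (local connection, hypothetical path and column names): spark_dataframe() exposes the underlying Spark DataFrame, invoke() calls its methods, and sdf_register() registers the result.

library(sparklyr)

sc <- spark_connect(master = "local")
nested_tbl <- spark_read_parquet(sc, "nested", "tmp/nested.parquet")

# Get a handle on the underlying Spark DataFrame and call select directly;
# nested fields are referenced by their dotted names
sdf  <- spark_dataframe(nested_tbl)
flat <- invoke(sdf, "select", "name", list("address.street", "address.city"))

# Columns can also be built explicitly with the col method, e.g. to rename:
# street <- invoke(invoke(sdf, "col", "address.street"), "alias", "street")

# Register the flattened DataFrame; it now shows up in the Spark interface
flat_tbl <- sdf_register(flat, "flat")
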
Thanks for reading my two tricks. Cheers, Longhow.

Reposted from: https://longhowlam.wordpress.com/2017/02/15/r-formulas-in-spark-and-un-nesting-data-in-sparklyr-nice-and-handy/
