When working with big data with R (say, using Spark and sparklyr) we have found it very convenient to keep data handles in a neat list or data.frame.

Please read on for our handy hints on keeping your data handles neat.

When using R to work over a big data system (such as Spark), much of your work is over "data handles" and not actual data (data handles are objects that control access to remote data).

Data handles are a lot like sockets or file handles in that they cannot be safely serialized and restored (i.e., you cannot save them into a .RDS file and then restore them into another session). This means that when you are starting or re-starting a project you must "ready" all of your data references. Your projects will be much easier to manage and document if you load your references using the methods we show below.
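
For instance, the following sketch (hypothetical file name; not from the original article) illustrates why persisting a handle does not work, and what to do instead:

# Suppose 'handle' is a tbl_spark returned by spark_read_parquet() or dplyr::tbl().
# saveRDS(handle, "my_handle.RDS")     # writes only the local reference object
# handle2 <- readRDS("my_handle.RDS")  # restoring in a new session gives a reference
# dplyr::collect(handle2)              # ... to a Spark connection that no longer exists
# The remedy is to re-create the connection and all handles at the start of each
# session, which is what the functions below help organize.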

Let’s set up our example Spark cluster:

library("sparklyr")
#packageVersion('sparklyr')
suppressPackageStartupMessages(library("dplyr"))
#packageVersion('dplyr')
suppressPackageStartupMessages(library("tidyr")) # Please see the following video for installation help
# https://youtu.be/qnINvPqcRvE
# spark_install(version = "2.0.2") # set up a local "practice" Spark instance
sc <- spark_connect(master = "local",
version = "2.0.2")
#print(sc)

Data is much easier to manage than code, and much easier to compute over. So the more information you can keep as pure data, the better off you will be. In this case we are loading the chosen names and paths of parquet data we wish to work with from an external file that is easy for the user to edit.
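
For reference, tableCollection.csv is just a small two-column table mapping table names to data paths; its contents (matching the print() output below) look like this:

tableName,tablePath
data_01,data_01
data_02,data_02
data_03,data_03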

# Read user's specification of files and paths.
userSpecification <- read.csv('tableCollection.csv',
                              header = TRUE,
                              strip.white = TRUE,
                              stringsAsFactors = FALSE)
print(userSpecification)
##   tableName tablePath
## 1 data_01 data_01
## 2 data_02 data_02
## 3 data_03 data_03

We can now read these parquet files (usually stored in Hadoop) into our Spark environment as follows.

readParquets <- function(userSpecification) {
  userSpecification <- as_data_frame(userSpecification)
  userSpecification$handle <- lapply(
    seq_len(nrow(userSpecification)),
    function(i) {
      spark_read_parquet(sc,
                         name = userSpecification$tableName[[i]],
                         path = userSpecification$tablePath[[i]])
    }
  )
  userSpecification
}

tableCollection <- readParquets(userSpecification)
print(tableCollection)
## # A tibble: 3 x 3
## tableName tablePath handle
## <chr> <chr> <list>
## 1 data_01 data_01 <S3: tbl_spark>
## 2 data_02 data_02 <S3: tbl_spark>
## 3 data_03 data_03 <S3: tbl_spark>
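
Each entry in the handle column is a lazy reference to a remote Spark table (a tbl_spark), not local data. A quick, hypothetical check (not from the original article):

# confirm every stored handle is a remote Spark table reference
vapply(tableCollection$handle,
       function(h) inherits(h, "tbl_spark"),
       logical(1))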

A data.frame is a great place to keep what you know about your Spark handles in one place. Let’s add some details to our Spark handles.

addDetails <- function(tableCollection) {
  tableCollection <- as_data_frame(tableCollection)
  # get the references
  tableCollection$handle <-
    lapply(tableCollection$tableName,
           function(tableNamei) {
             dplyr::tbl(sc, tableNamei)
           })

  # add tableNames to handles for convenience
  # and printing
  names(tableCollection$handle) <-
    tableCollection$tableName

  # add in some details (note: nrow can be expensive)
  tableCollection$nrow <- vapply(tableCollection$handle,
                                 nrow,
                                 numeric(1))
  tableCollection$ncol <- vapply(tableCollection$handle,
                                 ncol,
                                 numeric(1))
  tableCollection
}

tableCollection <- addDetails(userSpecification)

# convenient printing
print(tableCollection)
## # A tibble: 3 x 5
## tableName tablePath handle nrow ncol
## <chr> <chr> <list> <dbl> <dbl>
## 1 data_01 data_01 <S3: tbl_spark> 10 1
## 2 data_02 data_02 <S3: tbl_spark> 10 2
## 3 data_03 data_03 <S3: tbl_spark> 10 3
# look at the top of each table (also forces
# evaluation!).
lapply(tableCollection$handle,
       head)
## $data_01
## Source: query [6 x 1]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 1
## a_01
## <dbl>
## 1 0.8274947
## 2 0.2876151
## 3 0.6638404
## 4 0.1918336
## 5 0.9111187
## 6 0.8802026
##
## $data_02
## Source: query [6 x 2]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 2
## a_02 b_02
## <dbl> <dbl>
## 1 0.3937457 0.34936496
## 2 0.0195079 0.74376380
## 3 0.9760512 0.00261368
## 4 0.4388773 0.70325800
## 5 0.9747534 0.40327283
## 6 0.6054003 0.53224218
##
## $data_03
## Source: query [6 x 3]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 3
## a_03 b_03 c_03
## <dbl> <dbl> <dbl>
## 1 0.59512263 0.2615939 0.592753768
## 2 0.72292799 0.7287428 0.003926143
## 3 0.51846687 0.3641869 0.874463146
## 4 0.01174093 0.9648346 0.177722575
## 5 0.86250126 0.3891915 0.857614579
## 6 0.33082723 0.2633013 0.233822140
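
Because the handles live in an ordinary named list-column, downstream code can look one up by table name and use it directly with dplyr. A small hypothetical example (the aggregate itself is not from the original article):

# compute a remote aggregate via one of the stored handles
tableCollection$handle[["data_02"]] %>%
  summarize(mean_a_02 = mean(a_02)) %>%
  collect()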

A particularly slick trick is to expand the columns column into a taller table that allows us to quickly identify which columns are in which tables.

columnDictionary <- function(tableCollection) {
  tableCollection$columns <-
    lapply(tableCollection$handle,
           colnames)
  columnMap <- tableCollection %>%
    select(tableName, columns) %>%
    unnest(columns)
  columnMap
}

columnMap <- columnDictionary(tableCollection)
print(columnMap)
## # A tibble: 6 x 2
## tableName columns
## <chr> <chr>
## 1 data_01 a_01
## 2 data_02 a_02
## 3 data_02 b_02
## 4 data_03 a_03
## 5 data_03 b_03
## 6 data_03 c_03
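
The column map also makes simple look-ups easy. For example (a hypothetical query, not from the original article), to find every table that carries a given column:

# which tables contain a column named "a_03"?
columnMap %>%
  filter(columns == "a_03")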

The idea is: place all of the above functions into a shared script or package, and then use them to organize loading your Spark data references. With this practice you will have much less "spaghetti code", better document your intent, and have a more versatile workflow.

The principles we are using include:

  • Keep configuration out of code (i.e., maintain the file list in a spreadsheet). This makes working with others much easier.
  • Treat configuration as data (i.e., make sure the configuration is a nice regular table so that you can use R tools such as tidyr::unnest() to work with it).
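
Because the configuration is plain data, it is also easy to sanity-check before any Spark work begins. A minimal, hypothetical example:

# basic checks on the user's configuration table
stopifnot(all(c("tableName", "tablePath") %in% colnames(userSpecification)),
          !any(duplicated(userSpecification$tableName)))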

Reposted from: http://www.win-vector.com/blog/2017/05/managing-spark-data-handles-in-r/
