When working with big data with R (say, using Spark and sparklyr) we have found it very convenient to keep data handles in a neat list ordata_frame.

Please read on for our handy hints on keeping your data handles neat.

When using R to work over a big data system (such as Spark) much of your work is over "data handles" and not actual data (data handles are objects that control access to remote data).

Data handles are a lot like sockets or file-handles in that they can not be safely serialized and restored (i.e., you can not save them into a .RDS file and then restore them into another session). This means when you are starting or re-starting a project you must "ready" all of your data references. Your projects will be much easier to manage and document if you load your references using the methods we show below.

Let’s set-up our example Spark cluster:

library("sparklyr")
#packageVersion('sparklyr')
suppressPackageStartupMessages(library("dplyr"))
#packageVersion('dplyr')
suppressPackageStartupMessages(library("tidyr")) # Please see the following video for installation help
# https://youtu.be/qnINvPqcRvE
# spark_install(version = "2.0.2") # set up a local "practice" Spark instance
sc <- spark_connect(master = "local",
version = "2.0.2")
#print(sc)

Data is much easier to manage than code, and much easier to compute over. So the more information you can keep as pure data the better off you will be. In this case we are loading the chosen names and paths ofparquet data we wish to work with from an external file that is easy for the user to edit.

# Read user's specification of files and paths.
userSpecification <- read.csv('tableCollection.csv',
header = TRUE,
strip.white = TRUE,
stringsAsFactors = FALSE)
print(userSpecification)
##   tableName tablePath
## 1 data_01 data_01
## 2 data_02 data_02
## 3 data_03 data_03

We can now read these parquet files (usually stored in Hadoop) into ourSpark environment as follows.

readParquets <- function(userSpecification) {
userSpecification <- as_data_frame(userSpecification)
userSpecification$handle <- lapply(
seq_len(nrow(userSpecification)),
function(i) {
spark_read_parquet(sc,
name = userSpecification$tableName[[i]],
path = userSpecification$tablePath[[i]])
}
)
userSpecification
} tableCollection <- readParquets(userSpecification)
print(tableCollection)
## # A tibble: 3 x 3
## tableName tablePath handle
## <chr> <chr> <list>
## 1 data_01 data_01 <S3: tbl_spark>
## 2 data_02 data_02 <S3: tbl_spark>
## 3 data_03 data_03 <S3: tbl_spark>

data.frame is a great place to keep what you know about your Sparkhandles in one place. Let’s add some details to our Spark handles.

addDetails <- function(tableCollection) {
tableCollection <- as_data_frame(tableCollection)
# get the references
tableCollection$handle <-
lapply(tableCollection$tableName,
function(tableNamei) {
dplyr::tbl(sc, tableNamei)
}) # and tableNames to handles for convenience
# and printing
names(tableCollection$handle) <-
tableCollection$tableName # add in some details (note: nrow can be expensive)
tableCollection$nrow <- vapply(tableCollection$handle,
nrow,
numeric(1))
tableCollection$ncol <- vapply(tableCollection$handle,
ncol,
numeric(1))
tableCollection
} tableCollection <- addDetails(userSpecification) # convenient printing
print(tableCollection)
## # A tibble: 3 x 5
## tableName tablePath handle nrow ncol
## <chr> <chr> <list> <dbl> <dbl>
## 1 data_01 data_01 <S3: tbl_spark> 10 1
## 2 data_02 data_02 <S3: tbl_spark> 10 2
## 3 data_03 data_03 <S3: tbl_spark> 10 3
# look at the top of each table (also forces
# evaluation!).
lapply(tableCollection$handle,
head)
## $data_01
## Source: query [6 x 1]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 1
## a_01
## <dbl>
## 1 0.8274947
## 2 0.2876151
## 3 0.6638404
## 4 0.1918336
## 5 0.9111187
## 6 0.8802026
##
## $data_02
## Source: query [6 x 2]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 2
## a_02 b_02
## <dbl> <dbl>
## 1 0.3937457 0.34936496
## 2 0.0195079 0.74376380
## 3 0.9760512 0.00261368
## 4 0.4388773 0.70325800
## 5 0.9747534 0.40327283
## 6 0.6054003 0.53224218
##
## $data_03
## Source: query [6 x 3]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 3
## a_03 b_03 c_03
## <dbl> <dbl> <dbl>
## 1 0.59512263 0.2615939 0.592753768
## 2 0.72292799 0.7287428 0.003926143
## 3 0.51846687 0.3641869 0.874463146
## 4 0.01174093 0.9648346 0.177722575
## 5 0.86250126 0.3891915 0.857614579
## 6 0.33082723 0.2633013 0.233822140

A particularly slick trick is to expand the columns column into a taller table that allows us to quickly identify which columns are in which tables.

columnDictionary <- function(tableCollection) {
tableCollection$columns <-
lapply(tableCollection$handle,
colnames)
columnMap <- tableCollection %>%
select(tableName, columns) %>%
unnest(columns)
columnMap
} columnMap <- columnDictionary(tableCollection)
print(columnMap)
## # A tibble: 6 x 2
## tableName columns
## <chr> <chr>
## 1 data_01 a_01
## 2 data_02 a_02
## 3 data_02 b_02
## 4 data_03 a_03
## 5 data_03 b_03
## 6 data_03 c_03

The idea is: place all of the above functions into a shared script or package, and then use them to organize loading your Spark data references. With this practice you will have much less "spaghetti code", better document intent, and have a versatile workflow.

The principles we are using include:

  • Keep configuration out of code (i.e., maintain the file list in a spreadsheet). This makes working with others much easier.
  • Treat configuration as data (i.e., make sure the configuration is a nice regular table so that you can use R tools such as tidyr::unnest() to work with it).

转自:http://www.win-vector.com/blog/2017/05/managing-spark-data-handles-in-r/

Managing Spark data handles in R的更多相关文章

  1. 7 Tools for Data Visualization in R, Python, and Julia

    7 Tools for Data Visualization in R, Python, and Julia Last week, some examples of creating visualiz ...

  2. Managing Hierarchical Data in MySQL

    Managing Hierarchical Data in MySQL Introduction Most users at one time or another have dealt with h ...

  3. An Introduction to Stock Market Data Analysis with R (Part 1)

    Around September of 2016 I wrote two articles on using Python for accessing, visualizing, and evalua ...

  4. Best packages for data manipulation in R

    dplyr and data.table are amazing packages that make data manipulation in R fun. Both packages have t ...

  5. sc.textFile("file:///home/spark/data.txt") Input path does not exist解决方法——submit 加参数 --master local 即可解决

    use this val data = sc.textFile("/home/spark/data.txt") this should work and set master as ...

  6. Data Science With R In Visual Studio

    R Projects Similar to Python, when we installed the data science tools we get an “R” section in our ...

  7. mysql 树形数据,层级数据Managing Hierarchical Data in MySQL

    原文:http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/ 引言 大多数用户都曾在数据库中处理过分层数据(hiera ...

  8. Managing Hierarchical Data in MySQL(邻接表模型)[转载]

    原文在:http://dev.mysql.com/tech-resources/articles/hierarchical-data.html 来源: http://www.cnblogs.com/p ...

  9. Java生成-zipf分布的数据集(自定义倾斜度,用作spark data skew测试)

    1.代码 import java.io.Serializable; import java.util.NavigableMap; import java.util.Random; import jav ...

随机推荐

  1. chrome插件推荐

    分享自己一直在用的chrome插件 1. Adblock Plus 广告屏蔽插件,能够屏蔽YouTube视频广告.Facebook广告.弹出窗口和其他显眼的广告,个人认为非常强大. 2.AutoPag ...

  2. python作业设计:多级菜单,并可依次进入各级子菜单

    '''作业三:多级菜单 三级菜单 可依次选择进入各子菜单 所需新知识点:列表.字典 ''' data = { "北京":{ "昌平":{ "沙河&qu ...

  3. Java 基础知识总结 2

    11.Java常用类: StringBuffer StringBuffer 是使用缓冲区的,本身也是操作字符串的,但是与String类不同,String类的内容一旦声明之后则不可以改变,改变的只是其内 ...

  4. 20+个很棒的Android开源项目

    20+个很棒的Android开源项目 本文摘自文章: 20+ Awesome Open-Source Android Apps To Boost Your Development Skills. 考虑 ...

  5. 关于css禁止文本复制属性

    最近在做DHTMLX框架替换,新框架dhx的grid是不能选中内容复制的 虽然相对来说是安全些的,但是客户体验度一定会大打折扣 网页上禁止复制主要靠JavaScript来实现.<BODY onc ...

  6. Ubuntu下安装Tomcat7

    第一部分:基本安装 1.打开http://tomcat.apache.org/download-70.cgi,下载apache-tomcat-7.0.68.zip. 2.拷贝至合适位置,如/usr/l ...

  7. CSAcademy Beta Round #4 Swap Pairing

    题目链接:https://csacademy.com/contest/arhiva/#task/swap_pairing/ 大意是给2*n个包含n种数字,每种数字出现恰好2次的数列,每一步操作可以交换 ...

  8. dotnetcore中的IOptionsSnapshot<>的自动更新原理

    1.首先讲讲ChangeToken.OnChange方法: 原理是给一个CancellationToken注册一个消费者委托,调用CancellationToken的Cancel的时候会调用这个Can ...

  9. 推送一个已有的代码到新的 gerrit 服务器

    1.指定项目代码库中迭代列出全部ProductList(.git)到pro.log文件中 repo forall -c 'echo $REPO_PROJECT' | tee pro.log pro.l ...

  10. 新建Android项目,会出现两个项目一个是自己创建的项目,另一个是“appcompat_v7”项目,这是怎么回事呢?该怎么解决呢?

    做Android开发的朋友最近会发现,更新ADT至22.6.0版本之后,创建新的安装项目,会出现appcompat_v7的内容.并且是创建一个新的内容就会出现.这到底是怎么回事呢?原来appcompa ...