When working with big data in R (say, using Spark and sparklyr) we have found it very convenient to keep data handles in a neat list or data.frame.

Please read on for our handy hints on keeping your data handles neat.

When using R to work over a big data system (such as Spark) much of your work is over "data handles" and not actual data (data handles are objects that control access to remote data).

Data handles are a lot like sockets or file handles in that they cannot be safely serialized and restored (i.e., you cannot save them into a .RDS file and then restore them into another session). This means that when you are starting or re-starting a project you must "ready" all of your data references. Your projects will be much easier to manage and document if you load your references using the methods we show below.
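As a quick illustration (a sketch, not from the original post; h stands for a hypothetical handle returned by spark_read_parquet()): saving a handle appears to succeed, but the restored object is useless in a new session.

# h is a tbl_spark handle, e.g. from spark_read_parquet()
saveRDS(h, "h.RDS")    # saves the R reference object, not the remote data
# ... later, in a fresh R session ...
h2 <- readRDS("h.RDS")
# head(h2)             # fails: the Spark connection h referenced is gone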

Let’s set up our example Spark cluster:

library("sparklyr")
#packageVersion('sparklyr')
suppressPackageStartupMessages(library("dplyr"))
#packageVersion('dplyr')
suppressPackageStartupMessages(library("tidyr")) # Please see the following video for installation help
# https://youtu.be/qnINvPqcRvE
# spark_install(version = "2.0.2") # set up a local "practice" Spark instance
sc <- spark_connect(master = "local",
version = "2.0.2")
#print(sc)
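(For completeness: the example parquet directories data_01, data_02, and data_03 used below were prepared ahead of time. The original setup code is not shown here, but a sketch such as the following would produce tables of the shapes seen in the printed results.)

# Hypothetical setup (not from the original post): write small random
# tables to parquet so the examples below have something to read.
for(i in 1:3) {
  name <- sprintf("data_%02d", i)
  d <- as.data.frame(matrix(runif(10*i), nrow = 10))
  colnames(d) <- sprintf("%s_%02d", letters[1:i], i)
  spark_write_parquet(copy_to(sc, d, name = name, overwrite = TRUE),
                      path = name)
}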

Data is much easier to manage than code, and much easier to compute over. So the more information you can keep as pure data the better off you will be. In this case we are loading the chosen names and paths of the parquet data we wish to work with from an external file that is easy for the user to edit.
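For reference, tableCollection.csv is an ordinary two-column table; given the printed result below, its contents would be:

tableName,tablePath
data_01,data_01
data_02,data_02
data_03,data_03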

# Read user's specification of files and paths.
userSpecification <- read.csv('tableCollection.csv',
                              header = TRUE,
                              strip.white = TRUE,
                              stringsAsFactors = FALSE)
print(userSpecification)
##   tableName tablePath
## 1 data_01 data_01
## 2 data_02 data_02
## 3 data_03 data_03

We can now read these parquet files (usually stored in Hadoop) into our Spark environment as follows.

readParquets <- function(userSpecification) {
  userSpecification <- as_data_frame(userSpecification)
  userSpecification$handle <- lapply(
    seq_len(nrow(userSpecification)),
    function(i) {
      spark_read_parquet(sc,
                         name = userSpecification$tableName[[i]],
                         path = userSpecification$tablePath[[i]])
    }
  )
  userSpecification
}

tableCollection <- readParquets(userSpecification)
print(tableCollection)
## # A tibble: 3 x 3
## tableName tablePath handle
## <chr> <chr> <list>
## 1 data_01 data_01 <S3: tbl_spark>
## 2 data_02 data_02 <S3: tbl_spark>
## 3 data_03 data_03 <S3: tbl_spark>

A data.frame is a great place to keep what you know about your Spark handles in one place. Let’s add some details to our Spark handles.

addDetails <- function(tableCollection) {
  tableCollection <- as_data_frame(tableCollection)
  # get the references
  tableCollection$handle <-
    lapply(tableCollection$tableName,
           function(tableNamei) {
             dplyr::tbl(sc, tableNamei)
           })
  # add tableNames to handles for convenience and printing
  names(tableCollection$handle) <-
    tableCollection$tableName
  # add in some details (note: nrow can be expensive)
  tableCollection$nrow <- vapply(tableCollection$handle,
                                 nrow,
                                 numeric(1))
  tableCollection$ncol <- vapply(tableCollection$handle,
                                 ncol,
                                 numeric(1))
  tableCollection
}

tableCollection <- addDetails(userSpecification)

# convenient printing
print(tableCollection)
## # A tibble: 3 x 5
## tableName tablePath handle nrow ncol
## <chr> <chr> <list> <dbl> <dbl>
## 1 data_01 data_01 <S3: tbl_spark> 10 1
## 2 data_02 data_02 <S3: tbl_spark> 10 2
## 3 data_03 data_03 <S3: tbl_spark> 10 3
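One caveat not in the original post: with newer dplyr/sparklyr versions, nrow() on a remote tbl_spark may return NA instead of forcing a count. If you run into that, sparklyr::sdf_nrow() is a drop-in replacement inside addDetails():

# alternative row count for remote tables (assumes sparklyr is attached)
tableCollection$nrow <- vapply(tableCollection$handle,
                               sparklyr::sdf_nrow,
                               numeric(1))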
# look at the top of each table
# (also forces evaluation!)
lapply(tableCollection$handle,
       head)
## $data_01
## Source: query [6 x 1]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 1
## a_01
## <dbl>
## 1 0.8274947
## 2 0.2876151
## 3 0.6638404
## 4 0.1918336
## 5 0.9111187
## 6 0.8802026
##
## $data_02
## Source: query [6 x 2]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 2
## a_02 b_02
## <dbl> <dbl>
## 1 0.3937457 0.34936496
## 2 0.0195079 0.74376380
## 3 0.9760512 0.00261368
## 4 0.4388773 0.70325800
## 5 0.9747534 0.40327283
## 6 0.6054003 0.53224218
##
## $data_03
## Source: query [6 x 3]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 3
## a_03 b_03 c_03
## <dbl> <dbl> <dbl>
## 1 0.59512263 0.2615939 0.592753768
## 2 0.72292799 0.7287428 0.003926143
## 3 0.51846687 0.3641869 0.874463146
## 4 0.01174093 0.9648346 0.177722575
## 5 0.86250126 0.3891915 0.857614579
## 6 0.33082723 0.2633013 0.233822140
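Because all the handles are named and collected in one place, ad-hoc work stays tidy. For example (a sketch; the column name a_02 comes from the printed output above), we can compute a remote aggregate and bring back only the small result:

tableCollection$handle$data_02 %>%
  summarize(mean_a_02 = mean(a_02)) %>%
  collect()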

A particularly slick trick is to expand the columns column into a taller table that allows us to quickly identify which columns are in which tables.

columnDictionary <- function(tableCollection) {
  tableCollection$columns <-
    lapply(tableCollection$handle,
           colnames)
  columnMap <- tableCollection %>%
    select(tableName, columns) %>%
    unnest(columns)
  columnMap
}

columnMap <- columnDictionary(tableCollection)
print(columnMap)
## # A tibble: 6 x 2
## tableName columns
## <chr> <chr>
## 1 data_01 a_01
## 2 data_02 a_02
## 3 data_02 b_02
## 4 data_03 a_03
## 5 data_03 b_03
## 6 data_03 c_03
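With the dictionary in hand, questions such as "which tables carry a b-column?" become one-line filters over an ordinary local tibble:

# from the data above this finds data_02/b_02 and data_03/b_03
columnMap %>%
  filter(grepl("^b_", columns))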

The idea is: place all of the above functions into a shared script or package, and then use them to organize loading your Spark data references. With this practice you will have much less "spaghetti code", better documented intent, and a more versatile workflow.

The principles we are using include:

  • Keep configuration out of code (i.e., maintain the file list in a spreadsheet). This makes working with others much easier.
  • Treat configuration as data (i.e., make sure the configuration is a nice regular table so that you can use R tools such as tidyr::unnest() to work with it).

Reposted from: http://www.win-vector.com/blog/2017/05/managing-spark-data-handles-in-r/
