When working with big data in R (say, using Spark and sparklyr) we have found it very convenient to keep data handles in a neat list or data_frame.

Please read on for our handy hints on keeping your data handles neat.

When using R to work over a big data system (such as Spark) much of your work is over "data handles" and not actual data (data handles are objects that control access to remote data).

Data handles are a lot like sockets or file handles in that they cannot be safely serialized and restored (i.e., you cannot save them into a .RDS file and then restore them into another session). This means that when you are starting or re-starting a project you must "ready" all of your data references. Your projects will be much easier to manage and document if you load your references using the methods we show below.
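
One workable pattern (a minimal sketch of the idea, assuming a live spark_connect() connection sc and a table already registered with Spark under the name "data_01") is to persist the table names, which are plain data, and re-create the handles on startup:

# persist the table *name* (plain data), not the handle itself
saveRDS("data_01", "tableName.RDS")

# later, in a fresh session with a new connection sc:
tableName <- readRDS("tableName.RDS")
handle <- dplyr::tbl(sc, tableName)  # re-attach to the remote table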

Let’s set up our example Spark cluster:

library("sparklyr")
#packageVersion('sparklyr')
suppressPackageStartupMessages(library("dplyr"))
#packageVersion('dplyr')
suppressPackageStartupMessages(library("tidyr")) # Please see the following video for installation help
# https://youtu.be/qnINvPqcRvE
# spark_install(version = "2.0.2") # set up a local "practice" Spark instance
sc <- spark_connect(master = "local",
version = "2.0.2")
#print(sc)

Data is much easier to manage than code, and much easier to compute over. So the more information you can keep as pure data the better off you will be. In this case we are loading the chosen names and paths of parquet data we wish to work with from an external file that is easy for the user to edit.
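
For reference, a tableCollection.csv consistent with the specification printed below is just a two-column spreadsheet:

tableName,tablePath
data_01,data_01
data_02,data_02
data_03,data_03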

# Read the user's specification of files and paths.
userSpecification <- read.csv('tableCollection.csv',
                              header = TRUE,
                              strip.white = TRUE,
                              stringsAsFactors = FALSE)
print(userSpecification)
##   tableName tablePath
## 1   data_01   data_01
## 2   data_02   data_02
## 3   data_03   data_03
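
The original data-preparation step is not shown in this post; if you want to run the example end to end, the following is our own hedged sketch of one way to fabricate practice parquet files matching the row and column counts printed later (all names here are illustrative):

# hypothetical set-up code (our addition, not part of the original workflow)
for (i in 1:3) {
  name <- sprintf("data_%02d", i)
  df <- as.data.frame(matrix(runif(10 * i), nrow = 10))
  colnames(df) <- sprintf("%s_%02d", letters[seq_len(i)], i)
  handle <- dplyr::copy_to(sc, df, name, overwrite = TRUE)
  sparklyr::spark_write_parquet(handle, path = name)
}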

We can now read these parquet files (usually stored in Hadoop) into our Spark environment as follows.

readParquets <- function(userSpecification) {
  userSpecification <- as_data_frame(userSpecification)
  userSpecification$handle <- lapply(
    seq_len(nrow(userSpecification)),
    function(i) {
      spark_read_parquet(sc,
                         name = userSpecification$tableName[[i]],
                         path = userSpecification$tablePath[[i]])
    }
  )
  userSpecification
}

tableCollection <- readParquets(userSpecification)
print(tableCollection)
## # A tibble: 3 x 3
##   tableName tablePath          handle
##       <chr>     <chr>          <list>
## 1   data_01   data_01 <S3: tbl_spark>
## 2   data_02   data_02 <S3: tbl_spark>
## 3   data_03   data_03 <S3: tbl_spark>
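
Note that the handle column is a list-column: each element is a live reference into Spark, not local data. A quick check (our illustration):

# each handle is a remote table reference, not a local data.frame
class(tableCollection$handle[[1]])
## e.g. "tbl_spark" "tbl_sql" "tbl_lazy" "tbl"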

A data.frame is a great place to keep what you know about your Spark handles in one place. Let’s add some details to our Spark handles.

addDetails <- function(tableCollection) {
  tableCollection <- as_data_frame(tableCollection)
  # get the references
  tableCollection$handle <-
    lapply(tableCollection$tableName,
           function(tableNamei) {
             dplyr::tbl(sc, tableNamei)
           })

  # add tableNames to handles for convenience and printing
  names(tableCollection$handle) <- tableCollection$tableName

  # add in some details (note: nrow can be expensive)
  tableCollection$nrow <- vapply(tableCollection$handle,
                                 nrow,
                                 numeric(1))
  tableCollection$ncol <- vapply(tableCollection$handle,
                                 ncol,
                                 numeric(1))
  tableCollection
}

tableCollection <- addDetails(userSpecification)

# convenient printing
print(tableCollection)
## # A tibble: 3 x 5
##   tableName tablePath          handle  nrow  ncol
##       <chr>     <chr>          <list> <dbl> <dbl>
## 1   data_01   data_01 <S3: tbl_spark>    10     1
## 2   data_02   data_02 <S3: tbl_spark>    10     2
## 3   data_03   data_03 <S3: tbl_spark>    10     3
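
A caveat from us (not from the original post): on newer dplyr versions, nrow() on a lazy remote table returns NA rather than forcing a count. If you see that, sparklyr::sdf_nrow() computes the row count explicitly:

# alternative that always forces the count in Spark
tableCollection$nrow <- vapply(tableCollection$handle,
                               sparklyr::sdf_nrow,
                               numeric(1))
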
# look at the top of each table (also forces evaluation!)
lapply(tableCollection$handle, head)
## $data_01
## Source: query [6 x 1]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 1
## a_01
## <dbl>
## 1 0.8274947
## 2 0.2876151
## 3 0.6638404
## 4 0.1918336
## 5 0.9111187
## 6 0.8802026
##
## $data_02
## Source: query [6 x 2]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 2
## a_02 b_02
## <dbl> <dbl>
## 1 0.3937457 0.34936496
## 2 0.0195079 0.74376380
## 3 0.9760512 0.00261368
## 4 0.4388773 0.70325800
## 5 0.9747534 0.40327283
## 6 0.6054003 0.53224218
##
## $data_03
## Source: query [6 x 3]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 3
## a_03 b_03 c_03
## <dbl> <dbl> <dbl>
## 1 0.59512263 0.2615939 0.592753768
## 2 0.72292799 0.7287428 0.003926143
## 3 0.51846687 0.3641869 0.874463146
## 4 0.01174093 0.9648346 0.177722575
## 5 0.86250126 0.3891915 0.857614579
## 6 0.33082723 0.2633013 0.233822140

A particularly slick trick is to collect the column names of each table into a list-column and then expand it into a taller table, which lets us quickly identify which columns are in which tables.

columnDictionary <- function(tableCollection) {
  tableCollection$columns <-
    lapply(tableCollection$handle, colnames)
  columnMap <- tableCollection %>%
    select(tableName, columns) %>%
    unnest(columns)
  columnMap
}

columnMap <- columnDictionary(tableCollection)
print(columnMap)
## # A tibble: 6 x 2
##   tableName columns
##       <chr>   <chr>
## 1   data_01    a_01
## 2   data_02    a_02
## 3   data_02    b_02
## 4   data_03    a_03
## 5   data_03    b_03
## 6   data_03    c_03
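
For example (our illustration), the column dictionary reduces questions such as "which tables carry a b_ column?" to a one-line filter:

# find every table that has a column whose name starts with "b_"
columnMap %>%
  filter(grepl("^b_", columns))
## from the dictionary above: data_02 (b_02) and data_03 (b_03)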

The idea is: place all of the above functions into a shared script or package, and then use them to organize the loading of your Spark data references. With this practice you will have much less "spaghetti code", you will better document your intent, and you will have a more versatile workflow.

The principles we are using include:

  • Keep configuration out of code (i.e., maintain the file list in a spreadsheet). This makes working with others much easier (see the sketch after this list).
  • Treat configuration as data (i.e., make sure the configuration is a nice regular table so that you can use R tools such as tidyr::unnest() to work with it).
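
Because the configuration is plain data, extending the workflow is just an edit plus a re-run; for instance (our sketch), after adding a row to the spreadsheet:

# re-read the edited spreadsheet and rebuild all handles and details
userSpecification <- read.csv('tableCollection.csv',
                              header = TRUE,
                              strip.white = TRUE,
                              stringsAsFactors = FALSE)
tableCollection <- addDetails(readParquets(userSpecification))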

Reposted from: http://www.win-vector.com/blog/2017/05/managing-spark-data-handles-in-r/
