Managing Spark data handles in R

When working with big data in R (say, using Spark and sparklyr) we have found it very convenient to keep data handles in a neat list or data_frame.

Please read on for our handy hints on keeping your data handles neat.

When using R to work over a big data system (such as Spark) much of your work is over "data handles" and not actual data (data handles are objects that control access to remote data).

Data handles are a lot like sockets or file-handles in that they can not be safely serialized and restored (i.e., you can not save them into a .RDS file and then restore them into another session). This means when you are starting or re-starting a project you must "ready" all of your data references. Your projects will be much easier to manage and document if you load your references using the methods we show below.

Let’s set-up our example Spark cluster:

library("sparklyr")
#packageVersion('sparklyr')
suppressPackageStartupMessages(library("dplyr"))
#packageVersion('dplyr')
suppressPackageStartupMessages(library("tidyr"))

# Please see the following video for installation help
#  https://youtu.be/qnINvPqcRvE
# spark_install(version = "2.0.2")

# set up a local "practice" Spark instance
sc <- spark_connect(master = "local",
                    version = "2.0.2")
#print(sc)

Data is much easier to manage than code, and much easier to compute over. So the more information you can keep as pure data the better off you will be. In this case we are loading the chosen names and paths of the parquet data we wish to work with from an external file that is easy for the user to edit.

# Read user's specification of files and paths.
userSpecification <- read.csv('tableCollection.csv',
                              header = TRUE,
                              strip.white = TRUE,
                              stringsAsFactors = FALSE)
print(userSpecification)

##   tableName tablePath
## 1   data_01   data_01
## 2   data_02   data_02
## 3   data_03   data_03
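Because the specification is plain data, it is cheap to sanity-check before asking Spark to read anything. The following is a minimal sketch in base R, not part of the original workflow: the helper name checkSpecification and the in-line stand-in for tableCollection.csv are assumptions made for illustration.

```r
# Hypothetical helper (an assumption, not from the original post):
# validate the specification table before any Spark work is attempted.
checkSpecification <- function(userSpecification) {
  required <- c("tableName", "tablePath")
  missingCols <- setdiff(required, colnames(userSpecification))
  if (length(missingCols) > 0) {
    stop("specification is missing column(s): ",
         paste(missingCols, collapse = ", "))
  }
  tableNames <- userSpecification$tableName
  if (anyNA(tableNames) || any(!nzchar(tableNames))) {
    stop("every row needs a non-empty tableName")
  }
  dups <- unique(tableNames[duplicated(tableNames)])
  if (length(dups) > 0) {
    stop("duplicate tableName(s): ", paste(dups, collapse = ", "))
  }
  # return the input unchanged so the check can sit in a pipeline
  invisible(userSpecification)
}

# Stand-in for the contents of tableCollection.csv shown above.
exampleSpecification <- read.csv(
  text = "tableName,tablePath
data_01,data_01
data_02,data_02
data_03,data_03",
  header = TRUE,
  strip.white = TRUE,
  stringsAsFactors = FALSE)

checkedSpecification <- checkSpecification(exampleSpecification)
```

Failing early on a malformed specification keeps the error close to the file the user edited, rather than surfacing later as a Spark read failure.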

We can now read these parquet files (usually stored in Hadoop) into our Spark environment as follows.

readParquets <- function(userSpecification) {
  userSpecification <- as_data_frame(userSpecification)
  userSpecification$handle <- lapply(
    seq_len(nrow(userSpecification)),
    function(i) {
      spark_read_parquet(sc,
                         name = userSpecification$tableName[[i]],
                         path = userSpecification$tablePath[[i]])
    }
  )
  userSpecification
}

tableCollection <- readParquets(userSpecification)
print(tableCollection)

## # A tibble: 3 x 3
##   tableName tablePath          handle
##       <chr>     <chr>          <list>
## 1   data_01   data_01 <S3: tbl_spark>
## 2   data_02   data_02 <S3: tbl_spark>
## 3   data_03   data_03 <S3: tbl_spark>

A data.frame is a great way to keep what you know about your Spark handles in one place. Let’s add some details to our Spark handles.

addDetails <- function(tableCollection) {
  tableCollection <- as_data_frame(tableCollection)
  # get the references (the tables are already registered in Spark,
  # so we look them up by name)
  tableCollection$handle <-
    lapply(tableCollection$tableName,
           function(tableNamei) {
             dplyr::tbl(sc, tableNamei)
           })
  # add tableNames to handles for convenience
  # and printing
  names(tableCollection$handle) <-
    tableCollection$tableName
  # add in some details (note: nrow can be expensive)
  tableCollection$nrow <- vapply(tableCollection$handle,
                                 nrow,
                                 numeric(1))
  tableCollection$ncol <- vapply(tableCollection$handle,
                                 ncol,
                                 numeric(1))
  tableCollection
}

tableCollection <- addDetails(userSpecification)

# convenient printing
print(tableCollection)

## # A tibble: 3 x 5
##   tableName tablePath          handle  nrow  ncol
##       <chr>     <chr>          <list> <dbl> <dbl>
## 1   data_01   data_01 <S3: tbl_spark>    10     1
## 2   data_02   data_02 <S3: tbl_spark>    10     2
## 3   data_03   data_03 <S3: tbl_spark>    10     3

# look at the top of each table (also forces
# evaluation!).
lapply(tableCollection$handle,
       head)

## $data_01
## Source:   query [6 x 1]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 1
##        a_01
##       <dbl>
## 1 0.8274947
## 2 0.2876151
## 3 0.6638404
## 4 0.1918336
## 5 0.9111187
## 6 0.8802026
##
## $data_02
## Source:   query [6 x 2]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 2
##        a_02       b_02
##       <dbl>      <dbl>
## 1 0.3937457 0.34936496
## 2 0.0195079 0.74376380
## 3 0.9760512 0.00261368
## 4 0.4388773 0.70325800
## 5 0.9747534 0.40327283
## 6 0.6054003 0.53224218
##
## $data_03
## Source:   query [6 x 3]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 3
##         a_03      b_03        c_03
##        <dbl>     <dbl>       <dbl>
## 1 0.59512263 0.2615939 0.592753768
## 2 0.72292799 0.7287428 0.003926143
## 3 0.51846687 0.3641869 0.874463146
## 4 0.01174093 0.9648346 0.177722575
## 5 0.86250126 0.3891915 0.857614579
## 6 0.33082723 0.2633013 0.233822140
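Since the handle list carries the table names, a single handle can be pulled out by name. The sketch below illustrates this in base R under stated assumptions: plain data.frames stand in for the Spark handles so it runs without a Spark connection, and the helper getHandle is a hypothetical name, not something from sparklyr or the original post.

```r
# Stand-in for the tableCollection built above: ordinary data.frames
# play the role of the tbl_spark handles (an assumption for illustration).
tableCollection <- data.frame(
  tableName = c("data_01", "data_02", "data_03"),
  stringsAsFactors = FALSE)
tableCollection$handle <- list(
  data.frame(a_01 = runif(10)),
  data.frame(a_02 = runif(10), b_02 = runif(10)),
  data.frame(a_03 = runif(10), b_03 = runif(10), c_03 = runif(10)))
# name the handles, as addDetails() does
names(tableCollection$handle) <- tableCollection$tableName

# Hypothetical helper: fetch one handle by table name, failing loudly
# on an unknown name instead of silently returning NULL.
getHandle <- function(tableCollection, tableName) {
  if (!(tableName %in% names(tableCollection$handle))) {
    stop("unknown table: ", tableName)
  }
  tableCollection$handle[[tableName]]
}

d2 <- getHandle(tableCollection, "data_02")
colnames(d2)
```

With real Spark handles the same lookup should return a tbl_spark that can be passed on to dplyr verbs.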

A particularly slick trick is to collect each table's column names into a "columns" list-column and then expand that column into a taller table, which lets us quickly identify which columns are in which tables.

columnDictionary <- function(tableCollection) {
  tableCollection$columns <-
    lapply(tableCollection$handle,
           colnames)
  columnMap <- tableCollection %>%
    select(tableName, columns) %>%
    unnest(columns)
  columnMap
}

columnMap <- columnDictionary(tableCollection)
print(columnMap)

## # A tibble: 6 x 2
##   tableName columns
##       <chr>   <chr>
## 1   data_01    a_01
## 2   data_02    a_02
## 3   data_02    b_02
## 4   data_03    a_03
## 5   data_03    b_03
## 6   data_03    c_03
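Once the column map is an ordinary table, questions such as "which tables contain this column?" become simple lookups. Here is a minimal base R sketch, assuming a hand-built copy of the columnMap printed above; the helper tablesWithColumn is a hypothetical name introduced for illustration.

```r
# Hand-built copy of the columnMap printed above, so the sketch
# runs without a Spark connection.
columnMap <- data.frame(
  tableName = c("data_01", "data_02", "data_02",
                "data_03", "data_03", "data_03"),
  columns = c("a_01", "a_02", "b_02", "a_03", "b_03", "c_03"),
  stringsAsFactors = FALSE)

# Hypothetical helper: which tables contain a given column?
tablesWithColumn <- function(columnMap, columnName) {
  unique(columnMap$tableName[columnMap$columns == columnName])
}

tablesWithColumn(columnMap, "b_02")

# Column names that appear in more than one table
# (none in this example, since every column name is unique).
columnCounts <- table(columnMap$columns)
sharedColumns <- names(columnCounts)[columnCounts > 1]
```

In a larger collection, column names shared between tables are a natural first place to look for possible join keys.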

The idea is: place all of the above functions into a shared script or package, and then use them to organize loading your Spark data references. With this practice you will have much less "spaghetti code", better-documented intent, and a more versatile workflow.

The principles we are using include:

Keep configuration out of code (i.e., maintain the file list in a spreadsheet). This makes working with others much easier.
Treat configuration as data (i.e., make sure the configuration is a nice regular table so that you can use R tools such as tidyr::unnest() to work with it).

Reposted from: http://www.win-vector.com/blog/2017/05/managing-spark-data-handles-in-r/
