Spark: Best practice for retrieving big data from RDD to local machine
've got big RDD(1gb) in yarn cluster. On local machine, which use this cluster I have only 512 mb. I'd like to iterate over values in RDD on my local machine. I can't use collect(), because it would create too big array locally which more then my heap. I need some iterative way. There is method iterator(), but it requires some additional information, I can't provide. UDP: commited toLocalIterator method |
|||||||||
|
Update: TL;DR And the original answer might give a rough idea how it works: First of all, get the array of partition indexes:
Then create smaller rdds filtering out everything but a single partition. Collect the data from smaller rdds and iterate over values of a single partition:
I didn't try this code, but it should work. Please write a comment if it won't compile. Of cause, it will work only if the partitions are small enough. If they aren't, you can always increase the number of partitions with |
|||||||||||||
|

Did you find this question interesting? Try our newsletter
Sign up for our newsletter and get our top new questions delivered to your inbox (see an example).
Wildfire answer seems semantically correct, but I'm sure you should be able to be vastly more efficient by using the API of Spark. If you want to process each partition in turn, I don't see why you can't using
Or this
|
|||||||||||||
|
Here is the same approach as suggested by @Wildlife but written in pyspark. The nice thing about this approach - it lets user access records in RDD in order. I'm using this code to feed data from RDD into STDIN of the machine learning tool's process.
Produces output:
|
||||
Map/filter/reduce using Spark and download the results later? I think usual Hadoop approach will work. Api says that there are map - filter - saveAsFile commands:https://spark.incubator.apache.org/docs/0.8.1/scala-programming-guide.html#transformations |
|||||||||||||
|
For Spark 1.3.1 , the format is as follows
|
Spark: Best practice for retrieving big data from RDD to local machine的更多相关文章
- Why Apache Spark is a Crossover Hit for Data Scientists [FWD]
Spark is a compelling multi-purpose platform for use cases that span investigative, as well as opera ...
- [Spark] 02 - Practice Spark
开发环境 教学视频:Spark的环境搭建,需安装配置环境:Java, Hadoop 环境配置:玩转大数据分析!Spark2.X+Python 精华实战课程(免费)[其实只是环境搭建] 进入pyspar ...
- spark SQL (四)数据源 Data Source----Parquet 文件的读取与加载
spark SQL Parquet 文件的读取与加载 是由许多其他数据处理系统支持的柱状格式.Spark SQL支持阅读和编写自动保留原始数据模式的Parquet文件.在编写Parquet文件时,出于 ...
- Spark菜鸟学习营Day1 从Java到RDD编程
Spark菜鸟学习营Day1 从Java到RDD编程 菜鸟训练营主要的目标是帮助大家从零开始,初步掌握Spark程序的开发. Spark的编程模型是一步一步发展过来的,今天主要带大家走一下这段路,让我 ...
- The ‘Microsoft.ACE.OLEDB.12.0′ provider is not registered on the local machine. (System.Data)
When you try to import Excel 2007 or later “.xlsx” files into an SQL Server 2008 database you may ge ...
- Microsoft SQL Server 17导出xlsx文件时报错:The 'Microsoft.ACE.OLEDB.12.0' provider is not registered on the local machine. (System.Data)
导出数据时报错: 如果你是导出office 2007格式 TITLE: SQL Server Import and Export Wizard ---------------------------- ...
- Spark学习之键值对(pair RDD)操作(3)
Spark学习之键值对(pair RDD)操作(3) 1. 我们通常从一个RDD中提取某些字段(如代表事件时间.用户ID或者其他标识符的字段),并使用这些字段为pair RDD操作中的键. 2. 创建 ...
- <Spark><Programming><Loading and Saving Your Data>
Motivation Spark是基于Hadoop可用的生态系统构建的,因此Spark可以通过Hadoop MapReduce的InputFormat和OutputFormat接口存取数据. Spar ...
- spark SQL (五)数据源 Data Source----json hive jdbc等数据的的读取与加载
1,JSON数据集 Spark SQL可以自动推断JSON数据集的模式,并将其作为一个Dataset[Row].这个转换可以SparkSession.read.json()在一个Dataset[Str ...
随机推荐
- 从最简单的实例学习ARM 指令集(三)
上一篇讲到赋值运算,这篇讲讲子函数调用.先看最简单范例:test4.c #include <stdio.h> void f1() { } void main() { int d = 4; ...
- html页面禁止选择复制剪切
在body加入 onselectstart="return false" oncopy="return false;" oncut="return f ...
- component和bean区别
@Component and @Bean do two quite different things, and shouldn't be confused. @Component (and @Serv ...
- nodejs实现拉钩网爬虫
概述 通过nodejs+mysql+cheerio+request实现拉钩网特定公司的所有招聘信息的抓取,并将抓取的信息保存到数据库中.抓取内容包括:薪酬福利,工作地,职位要求,工作性质等几乎所有的内 ...
- [转载]打造自己喜欢的Linux桌面----archlinux
原文地址:打造自己喜欢的Linux桌面----archlinux作者:三尺椴 打造自己的Linux桌面----Archlinux 2011-01-16 文/s_cd ( 常用桌面组合:Archlin ...
- iOS获取手机型号,Swift获取手机型号(类似iphone 7这种,检测机型具体型号)
获取手机设备信息,如name.model.version等, 但如果想获取具体的手机型号,如iphone5.5s这种,就需要如下这种(含Swift和OC两种写法) Swift建议添加到extensio ...
- 【LeetCode】130. Surrounded Regions (2 solutions)
Surrounded Regions Given a 2D board containing 'X' and 'O', capture all regions surrounded by 'X'. A ...
- WordPress网站搬家经验总结
http://cnzhx.net/blog/move-wordpress-site-step-by-step/也许很多人都有跟我类似的经历:因为某种原因需要将自己的WordPress站点从一个空间转移 ...
- Response.Flush() Response.End()的区别
//Response.Flush() 将缓存中的内容立即显示出来//Response.End() 缓冲的输出发送到客户端 停止页面执行//例://Response.Write("520& ...
- Linux内存初始化(四) 创建系统内存地址映射
一.前言 经过内存初始化代码分析(一)和内存初始化代码分析(二)的过渡,我们终于来到了内存初始化的核心部分:paging_init.当然本文不能全部解析完该函数(那需要的篇幅太长了),我们只关注创建系 ...