Spark: Best practice for retrieving big data from RDD to local machine
I've got a big RDD (1 GB) in a YARN cluster. On the local machine that uses this cluster I have only 512 MB. I'd like to iterate over the values in the RDD on my local machine. I can't use collect(), because it would create an array locally that is bigger than my heap. I need some iterative way. There is the iterator() method, but it requires some additional information that I can't provide. UPD: the toLocalIterator method has been committed.
Update, TL;DR: the original answer might give a rough idea of how it works. First of all, get the array of partition indexes:
Then create smaller RDDs, each filtering out everything but a single partition. Collect the data from each smaller RDD and iterate over the values of that single partition:
I didn't try this code, but it should work. Please write a comment if it doesn't compile. Of course, it will work only if the partitions are small enough. If they aren't, you can always increase the number of partitions.
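A minimal PySpark-style sketch of the partition-by-partition idea described in this answer. It is an illustration, not the answer's original code: the helper name `iter_by_partition` is hypothetical, and the function only assumes that `rdd` offers `getNumPartitions()`, `mapPartitionsWithIndex()` and `collect()`, as a PySpark RDD does.

```python
def iter_by_partition(rdd):
    """Yield the records of `rdd`, collecting one partition at a time.

    Only the records of a single partition are held in local memory at once.
    """
    num_parts = rdd.getNumPartitions()
    for idx in range(num_parts):
        # Bind idx as a default argument so the function keeps its own value.
        def keep_only(index, records, wanted=idx):
            # Pass through the wanted partition, empty out every other one.
            return records if index == wanted else iter(())

        part_rdd = rdd.mapPartitionsWithIndex(keep_only, True)
        for record in part_rdd.collect():
            yield record


# Usage against a real cluster (assumes an existing SparkContext `sc`):
#   rdd = sc.parallelize(range(10), 3)
#   for value in iter_by_partition(rdd):
#       print(value)
```

This runs one Spark job per partition, so it is usually slower than the built-in `rdd.toLocalIterator()`, which should be preferred where it is available. If a single partition is still too large for the local heap, `rdd.repartition(n)` with a larger `n` is one way to get smaller partitions.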
Wildfire's answer seems semantically correct, but I'm sure you should be able to be vastly more efficient by using the Spark API. If you want to process each partition in turn, I don't see why you can't do so using:
Or this:
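As a hedged illustration of processing each partition on the cluster instead of locally (not this answer's original snippet), the sketch below reduces every partition to a tiny summary with `mapPartitions` and collects only those summaries. The names `summarize_partition` and `count_and_sum` are hypothetical, and the records are assumed to be numbers.

```python
def summarize_partition(records):
    """Reduce one partition to a single small (count, total) pair."""
    count = 0
    total = 0
    for value in records:
        count += 1
        total += value
    yield (count, total)


def count_and_sum(rdd):
    """Summarize each partition on the executors, then combine locally."""
    # Only one small tuple per partition travels back to the local machine.
    partials = rdd.mapPartitions(summarize_partition).collect()
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total


# Usage (assumes an existing SparkContext `sc`):
#   count, total = count_and_sum(sc.parallelize(range(100), 4))
```

The design point is that the heavy iteration stays on the executors, so the local heap only ever sees the per-partition results.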
Here is the same approach as suggested by @Wildfire, but written in PySpark. The nice thing about this approach is that it lets the user access the records in the RDD in order. I'm using this code to feed data from the RDD into the STDIN of a machine learning tool's process.
Produces output:
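A rough sketch of the piping idea mentioned in this answer, not the answer's own code: stream records one by one into the STDIN of an external process. The helper name `feed_to_process` is hypothetical, and each record is assumed to fit on one line of text.

```python
import subprocess


def feed_to_process(records, command):
    """Write each record as one line to the STDIN of `command`.

    Returns the exit code of the process.
    """
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, text=True)
    try:
        for record in records:
            proc.stdin.write(f"{record}\n")
    finally:
        # Closing STDIN signals end of input to the child process.
        proc.stdin.close()
    return proc.wait()


# Usage (assumes an existing RDD `rdd`; the tool name is hypothetical):
#   exit_code = feed_to_process(rdd.toLocalIterator(), ["my_ml_tool", "--stdin"])
```

Because `records` can be any iterator, the local machine never needs to hold more than the records currently being fetched.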
Map/filter/reduce using Spark and download the results later? I think the usual Hadoop approach will work. The API docs say that there are map, filter, and save-to-file commands (for example saveAsTextFile): https://spark.incubator.apache.org/docs/0.8.1/scala-programming-guide.html#transformations
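A minimal sketch of the "save on the cluster, read later" idea, under stated assumptions: the RDD was written with `saveAsTextFile`, the output directory was then copied to the local disk, and the helper name `iter_saved_lines` and all paths are hypothetical.

```python
import glob
import os


def iter_saved_lines(output_dir):
    """Yield lines from the part-* files of a saved text output, one at a time."""
    # saveAsTextFile writes one part-NNNNN file per partition.
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")


# Usage (paths are placeholders):
#   On the cluster:  rdd.saveAsTextFile("hdfs:///tmp/result")
#   In a shell:      hdfs dfs -get /tmp/result ./result
#   Locally:
#   for line in iter_saved_lines("./result"):
#       print(line)
```

Reading the files lazily keeps local memory use independent of the total size of the saved data.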
For Spark 1.3.1, the format is as follows: