've got big RDD(1gb) in yarn cluster. On local machine, which use this cluster I have only 512 mb. I'd like to iterate over values in RDD on my local machine. I can't use collect(), because it would create too big array locally which more then my heap. I need some iterative way. There is method iterator(), but it requires some additional information, I can't provide.

UDP: commited toLocalIterator method

asked Feb 11 '14 at 9:55
epahomov

111117
 
    
toLocalIterator is not ideal if you want to iterate locally over a partition at a time – Landon Kuhn Oct 29 '14 at 2:25
2  
@LandonKuhn why not? – Tom Yubing Dong Aug 4 '15 at 23:02

5 Answers

Update: RDD.toLocalIterator method that appeared after the original answer has been written is a more efficient way to do the job. It uses runJob to evaluate only a single partition on each step.

TL;DR And the original answer might give a rough idea how it works:

First of all, get the array of partition indexes:

val parts = rdd.partitions

Then create smaller rdds filtering out everything but a single partition. Collect the data from smaller rdds and iterate over values of a single partition:

for (p <- parts) {
val idx = p.index
val partRdd = rdd.mapPartitionsWithIndex(a => if (a._1 == idx) a._2 else Iterator(), true)
//The second argument is true to avoid rdd reshuffling
val data = partRdd.collect //data contains all values from a single partition
//in the form of array
//Now you can do with the data whatever you want: iterate, save to a file, etc.
}

I didn't try this code, but it should work. Please write a comment if it won't compile. Of cause, it will work only if the partitions are small enough. If they aren't, you can always increase the number of partitions with rdd.coalesce(numParts, true).

answered Feb 15 '14 at 18:33
Wildfire

4,53811739
 
    
does this code cause each partition to be computed in serial when it loops through and call mapPartitionsWithIndex? What's the best way to remedy this? – foboi1122 Nov 18 '15 at 0:42
    
@foboi1122 Please see updated answer – Wildfire Nov 18 '15 at 8:36 
    
@Wildfire Will this approach resolve this. Else how to resolve using any or might be this approach. – ChikuMiku 2 days ago 

Did you find this question interesting? Try our newsletter

Sign up for our newsletter and get our top new questions delivered to your inbox (see an example).

Wildfire answer seems semantically correct, but I'm sure you should be able to be vastly more efficient by using the API of Spark. If you want to process each partition in turn, I don't see why you can't using map/filter/reduce/reduceByKey/mapPartitions operations. The only time you'd want to have everything in one place in one array is when your going to perform a non-monoidal operation - but that doesn't seem to be what you want. You should be able to do something like:

rdd.mapPartitions(recordsIterator => your code that processes a single chunk)

Or this

rdd.foreachPartition(partition => {
partition.toArray
// Your code
})
answered Mar 30 '14 at 11:05
samthebest

10.4k54369
 
    
Is't these operators execute on cluster? – epahomov Apr 3 '14 at 7:05
1  
Yes it will, but why are you avoiding that? If you can process each chunk in turn, you should be able to write the code in such a way so it can distribute - like using aggregate. – samthebest Apr 3 '14 at 15:54
    
Is not the iterator returned by forEachPartitition the data iterator for a single partition - and not an iterator of all partitions? – javadba May 20 at 8:23

Here is the same approach as suggested by @Wildlife but written in pyspark.

The nice thing about this approach - it lets user access records in RDD in order. I'm using this code to feed data from RDD into STDIN of the machine learning tool's process.

rdd = sc.parallelize(range(100), 10)
def make_part_filter(index):
def part_filter(split_index, iterator):
if split_index == index:
for el in iterator:
yield el
return part_filter for part_id in range(rdd.getNumPartitions()):
part_rdd = rdd.mapPartitionsWithIndex(make_part_filter(part_id), True)
data_from_part_rdd = part_rdd.collect()
print "partition id: %s elements: %s" % (part_id, data_from_part_rdd)

Produces output:

partition id: 0 elements: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
partition id: 1 elements: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
partition id: 2 elements: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
partition id: 3 elements: [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
partition id: 4 elements: [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
partition id: 5 elements: [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
partition id: 6 elements: [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
partition id: 7 elements: [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
partition id: 8 elements: [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
partition id: 9 elements: [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
answered Jun 5 '15 at 20:07
vvladymyrov

2,9781124
 

Map/filter/reduce using Spark and download the results later? I think usual Hadoop approach will work.

Api says that there are map - filter - saveAsFile commands:https://spark.incubator.apache.org/docs/0.8.1/scala-programming-guide.html#transformations

answered Feb 11 '14 at 10:09
ya_pulser

1,2601715
 
    
Bad option. I don't want to do serialization/deserialization. So I want this data retrieving from spark – epahomov Feb 11 '14 at 10:37
    
How do you intend to get 1gb without serde(i.e. storing on the disk.) ? on a node with 512mb ? – scrapcodesFeb 12 '14 at 9:13
1  
By iterating over the RDD. You should be able to get each partition in sequence to send each data item in sequence to the master, which can then pull them off the network and work on them. – interfect Feb 12 '14 at 18:07

For Spark 1.3.1 , the format is as follows

val parts = rdd.partitions
for (p <- parts) {
val idx = p.index
val partRdd = data.mapPartitionsWithIndex {
case(index:Int,value:Iterator[(String,String,Float)]) =>
if (index == idx) value else Iterator()}
val dataPartitioned = partRdd.collect
//Apply further processing on data
}

 

Spark: Best practice for retrieving big data from RDD to local machine的更多相关文章

  1. Why Apache Spark is a Crossover Hit for Data Scientists [FWD]

    Spark is a compelling multi-purpose platform for use cases that span investigative, as well as opera ...

  2. [Spark] 02 - Practice Spark

    开发环境 教学视频:Spark的环境搭建,需安装配置环境:Java, Hadoop 环境配置:玩转大数据分析!Spark2.X+Python 精华实战课程(免费)[其实只是环境搭建] 进入pyspar ...

  3. spark SQL (四)数据源 Data Source----Parquet 文件的读取与加载

    spark SQL Parquet 文件的读取与加载 是由许多其他数据处理系统支持的柱状格式.Spark SQL支持阅读和编写自动保留原始数据模式的Parquet文件.在编写Parquet文件时,出于 ...

  4. Spark菜鸟学习营Day1 从Java到RDD编程

    Spark菜鸟学习营Day1 从Java到RDD编程 菜鸟训练营主要的目标是帮助大家从零开始,初步掌握Spark程序的开发. Spark的编程模型是一步一步发展过来的,今天主要带大家走一下这段路,让我 ...

  5. The ‘Microsoft.ACE.OLEDB.12.0′ provider is not registered on the local machine. (System.Data)

    When you try to import Excel 2007 or later “.xlsx” files into an SQL Server 2008 database you may ge ...

  6. Microsoft SQL Server 17导出xlsx文件时报错:The 'Microsoft.ACE.OLEDB.12.0' provider is not registered on the local machine. (System.Data)

    导出数据时报错: 如果你是导出office 2007格式 TITLE: SQL Server Import and Export Wizard ---------------------------- ...

  7. Spark学习之键值对(pair RDD)操作(3)

    Spark学习之键值对(pair RDD)操作(3) 1. 我们通常从一个RDD中提取某些字段(如代表事件时间.用户ID或者其他标识符的字段),并使用这些字段为pair RDD操作中的键. 2. 创建 ...

  8. <Spark><Programming><Loading and Saving Your Data>

    Motivation Spark是基于Hadoop可用的生态系统构建的,因此Spark可以通过Hadoop MapReduce的InputFormat和OutputFormat接口存取数据. Spar ...

  9. spark SQL (五)数据源 Data Source----json hive jdbc等数据的的读取与加载

    1,JSON数据集 Spark SQL可以自动推断JSON数据集的模式,并将其作为一个Dataset[Row].这个转换可以SparkSession.read.json()在一个Dataset[Str ...

随机推荐

  1. MVC MVP MVVM 图解

    1.MVC (1)图解 解释: 视图(View):用户界面. 控制器(Controller):业务逻辑 模型(Model):数据保存 各部分之间的通信方式如下: View 传送指令到 Controll ...

  2. jdeveloper优化:

    D:\jdevstudio10133\jdev\bin\jdev.conf末尾加上下面的AddVMOption -Dsun.java2d.noddraw=true AddVMOption -Dsun. ...

  3. win7,8,10取得|取消管理员权限

    取得: Windows Registry Editor Version 5.00[HKEY_CLASSES_ROOT\*\shell\runas]@=”管理员取得所有权”“NoWorkingDirec ...

  4. 转义字符 HTML 字符实体 < >: &等

    在 HTML 中,某些字符是预留的. 在 HTML 中不能使用小于号(<)和大于号(>),这是因为浏览器会误认为它们是标签. 如果希望正确地显示预留字符,我们必须在 HTML 源代码中使用 ...

  5. JSON path

    https://github.com/itguang/gitbook-smile/blob/master/springboot-fastjson/fastjson%E4%B9%8BJSONPath%E ...

  6. oracle 建表时显示ORA-00984: 列在此处不允许

      oracle 建表时显示ORA-00984: 列在此处不允许 CreationTime--2018年7月19日16点10分 Author:Marydon 1.情景展示 使用plsql建表时,报错 ...

  7. 将Gradle项目公布到maven仓库

    将Gradle项目公布到maven仓库 1 Gradle简单介绍 1.1 Ant.Maven还是Gradle? 1.1.1 Ant和Maven介绍 全称为Apache Maven,是一个软件(特别是J ...

  8. jquery api 常见 事件操作

    change.html <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html ...

  9. PC上的番茄工作法软件 Pomodairo 1.9 详细攻略

    http://www.zhantuo.com/archives/673155 番茄钟软件 Pomodairo 1.9: 我觉得这款软件特别好,完全符合番茄工作法的要求. 你可以通过add new 来增 ...

  10. 设置当前Activity的屏幕亮度

    设置当前的Activity的屏幕亮度,而不是设置系统的屏幕亮度,退出当前的Activity后恢复系统的亮度. 直接看代码好了 Java代码 WindowManager.LayoutParams lp  ...