4. Common RDD Operators: Transformations

RDD Operations

transformations: create a new dataset from an existing one

RDDA --> RDDB

actions: return a value to the driver program after running a computation on the dataset

For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy, in that they do not compute their results right away.

Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.

This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
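
The lazy behaviour described above can be illustrated without a cluster. The following is a minimal plain-Python sketch, not Spark's implementation: the class LazyDataset and the helper double are hypothetical names invented for this illustration. It only shows the idea that a transformation records a step, while an action actually runs the recorded steps.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not part of Spark)."""

    def __init__(self, source, steps=()):
        self._source = source  # base data, e.g. a list standing in for a file
        self._steps = steps    # recorded transformations, not yet applied

    def map(self, func):
        # Transformation: return a new dataset that remembers one more step.
        return LazyDataset(self._source, self._steps + (func,))

    def reduce(self, func):
        # Action: only now are the recorded steps applied to the base data.
        def apply_steps(x):
            for step in self._steps:
                x = step(x)
            return x

        return reduce(func, (apply_steps(x) for x in self._source))


calls = []


def double(x):
    calls.append(x)  # record when the function is really invoked
    return x * 2


mapped = LazyDataset([1, 2, 3, 4, 5]).map(double)
calls_before_action = len(calls)           # still 0: map did no work
total = mapped.reduce(lambda a, b: a + b)  # the action triggers the work
print(calls_before_action, total, len(calls))  # 0 30 5
```

Only the single reduced value reaches the caller; the intermediate mapped values are never materialised as a list, which mirrors the efficiency argument made above.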

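def my_map():
    # Assumes `sc` is an existing SparkContext.
    data = [1, 2, 3, 4, 5]
    rdd1 = sc.parallelize(data)
    # map is a transformation: it applies the function to every element.
    rdd2 = rdd1.map(lambda x: x * 2)
    # collect is an action: it triggers the computation and returns a list.
    print(rdd2.collect())  # [2, 4, 6, 8, 10]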

def my_filter():
    # Assumes `sc` is an existing SparkContext.
    data = [1, 2, 3, 4, 5]
    # Step-by-step version of the chained call below:
    # rdd1 = sc.parallelize(data)
    # rdd2 = rdd1.map(lambda x: x * 2)
    # rdd3 = rdd2.filter(lambda x: x > 5)
    # print(rdd3.collect())
    print(sc.parallelize(data).map(lambda x: x * 2).filter(lambda x: x > 5).collect())  # [6, 8, 10]

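def my_flatMap():
    # Assumes `sc` is an existing SparkContext.
    data = ["hello spark", "hello ming", "hello clay"]
    # flatMap maps each line to a list of words, then flattens the lists into one RDD.
    print(sc.parallelize(data).flatMap(lambda line: line.split(" ")).collect())
    # ['hello', 'spark', 'hello', 'ming', 'hello', 'clay']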

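def my_reduceByKey():
    # Assumes `sc` is an existing SparkContext.
    data = ["hello spark", "hello ming", "hello clay"]
    rdd = sc.parallelize(data)
    # Split each line into words and pair every word with a count of 1.
    mapRdd = rdd.flatMap(lambda line: line.split(" ")).map(lambda x: (x, 1))
    # Sum the counts of each word.
    my_reduceByKeyRdd = mapRdd.reduceByKey(lambda a, b: a + b)
    # hello -> 3, every other word -> 1 (the order of the pairs is not guaranteed).
    print(my_reduceByKeyRdd.collect())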
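
To make clearer what reduceByKey computes in the word count above, here is a rough plain-Python model of the per-key merge. The helper reduce_by_key is a hypothetical name for this sketch and is not a Spark function; real Spark merges values within each partition first and then across partitions, which is why the function passed in should be associative and commutative.

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key with func (plain-Python model only)."""
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)
        else:
            merged[key] = value
    return merged


lines = ["hello spark", "hello ming", "hello clay"]
# Same preparation as flatMap + map in the Spark version.
pairs = [(word, 1) for line in lines for word in line.split(" ")]
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)  # {'hello': 3, 'spark': 1, 'ming': 1, 'clay': 1}
```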

union:

distinct:

join:
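
The three operators listed above have no examples in the original notes. As a hedged sketch of their semantics, the plain-Python functions below model what rdd.union(other), rdd.distinct() and rdd.join(other) return in PySpark. The function names and sample data are hypothetical and chosen for illustration; in Spark itself the order of elements returned by distinct and join is not guaranteed.

```python
def union(left, right):
    # Models rdd.union(other): concatenation, duplicates are kept.
    return list(left) + list(right)


def distinct(items):
    # Models rdd.distinct(): one copy of each element.
    return list(dict.fromkeys(items))


def join(left, right):
    # Models rdd.join(other) on (key, value) pairs: an inner join that
    # yields (key, (left_value, right_value)) for every matching pair.
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]


a = [1, 2, 3]
b = [3, 4, 5]
unioned = union(a, b)       # [1, 2, 3, 3, 4, 5]
unique = distinct(unioned)  # [1, 2, 3, 4, 5]

ages = [("A", 20), ("B", 30), ("C", 40)]
cities = [("A", "NY"), ("C", "LA"), ("C", "SF"), ("D", "TX")]
joined = join(ages, cities)
print(joined)  # [('A', (20, 'NY')), ('C', (40, 'LA')), ('C', (40, 'SF'))]
```

Keys that appear on only one side ("B" and "D" here) are dropped, and a key that matches several values on the other side produces one output pair per match.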
