1.map

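Processes the elements one at a time, applying the function to each.

// Assumes sc is an existing SparkContext.
def map(): Unit = {
  val list = List("张无忌", "赵敏", "周芷若")
  val listRDD = sc.parallelize(list)
  val nameRDD = listRDD.map(name => "Hello " + name)
  nameRDD.foreach(name => println(name))
}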

2.flatMap

Maps each element to zero or more elements and flattens the results.

// Assumes sc is an existing SparkContext.
def flatMap(): Unit = {
  val list = List("张无忌 赵敏", "宋青书 周芷若")
  val listRDD = sc.parallelize(list)
  val nameRDD = listRDD.flatMap(line => line.split(" ")).map(name => "Hello " + name)
  nameRDD.foreach(name => println(name))
}
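The difference between map and flatMap shows up when the same split function is passed to both. The following is a minimal sketch, assuming Spark is on the classpath; the object and method names (MapVsFlatMap, splitWithMap, splitWithFlatMap) are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapVsFlatMap {
  // map keeps one output element per input line: each line becomes an Array of names.
  def splitWithMap(sc: SparkContext, lines: Seq[String]): Array[Array[String]] =
    sc.parallelize(lines).map(_.split(" ")).collect()

  // flatMap flattens those arrays into a single collection of names.
  def splitWithFlatMap(sc: SparkContext, lines: Seq[String]): Array[String] =
    sc.parallelize(lines).flatMap(_.split(" ")).collect()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MapVsFlatMap").setMaster("local"))
    try {
      val lines = Seq("张无忌 赵敏", "宋青书 周芷若")
      println(splitWithMap(sc, lines).length)     // 2: one array per line
      println(splitWithFlatMap(sc, lines).length) // 4: one element per name
    } finally sc.stop()
  }
}
```

collect() is used here instead of foreach(println) so that the results come back to the driver, where they can be counted.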

3.mapPartitions

Processes one whole partition at a time.

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list = List(1, 2, 3, 4, 5, 6)
    val rdd = sc.parallelize(list, 2)
    rdd.foreach(println)
    // The function receives an iterator over one partition and returns a new iterator.
    val rdd2 = rdd.mapPartitions(iterator => {
      val newList = new ListBuffer[String]
      while (iterator.hasNext) {
        newList.append("hello" + iterator.next())
      }
      newList.iterator
    })
    rdd2.foreach(name => println(name))
  }
}
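The example above copies every element of a partition into a ListBuffer before returning. For a simple element-wise transformation this is not required: the partition iterator can be transformed lazily. The following is a sketch of that variant, assuming Spark is on the classpath; LazyMapPartitions and greet are illustrative names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyMapPartitions {
  def greet(sc: SparkContext, nums: Seq[Int], numPartitions: Int): Array[String] =
    sc.parallelize(nums, numPartitions)
      // Transform the partition's iterator lazily instead of buffering it in memory.
      .mapPartitions(iterator => iterator.map(n => "hello" + n))
      .collect()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyMapPartitions").setMaster("local"))
    try {
      greet(sc, List(1, 2, 3, 4, 5, 6), 2).foreach(println)
    } finally sc.stop()
  }
}
```

Buffering remains reasonable when the function really needs the whole partition at once; the lazy form simply avoids holding a full partition in memory when it does not.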

4.mapPartitionsWithIndex

Processes one partition at a time, with the index of that partition also passed to the function.

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list = List(1, 2, 3, 4, 5, 6)
    val rdd = sc.parallelize(list, 2)
    // index is the partition number; iterator walks that partition's elements.
    val rdd2 = rdd.mapPartitionsWithIndex((index, iterator) => {
      val newList = new ListBuffer[String]
      while (iterator.hasNext) {
        newList.append(s"${index}_${iterator.next()}")
      }
      newList.iterator
    })
    rdd2.foreach(name => println(name))
  }
}

5.reduce

import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list = List(1, 2, 3, 4, 5, 6)
    val rdd = sc.parallelize(list)
    // Combine all elements pairwise into a single value (here, their sum).
    val result = rdd.reduce((x, y) => x + y)
    println(result)
  }
}

6.reduceByKey

import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list = List(("武当", 99), ("少林", 97), ("武当", 89), ("少林", 77))
    val rdd = sc.parallelize(list)
    // Sum the values that share the same key.
    val rdd2 = rdd.reduceByKey(_ + _)
    rdd2.foreach(tuple => println(tuple._1 + ":" + tuple._2))
  }
}

7.union

Combines two RDDs without removing duplicates.

import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list1 = List(1, 2, 3, 4)
    val list2 = List(3, 4, 5, 6)
    val rdd1 = sc.parallelize(list1)
    val rdd2 = sc.parallelize(list2)
    rdd1.union(rdd2).foreach(println)
  }
}

8.join

import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list1 = List((1, "东方不败"), (2, "令狐冲"), (3, "林平之"))
    val list2 = List((1, 99), (2, 98), (3, 97))
    val rdd1 = sc.parallelize(list1)
    val rdd2 = sc.parallelize(list2)
    // Join on the key; each result has the form (id, (name, score)).
    val rdd3 = rdd1.join(rdd2)
    rdd3.foreach(tuple => {
      val id = tuple._1
      val new_tuple = tuple._2
      val name = new_tuple._1
      val score = new_tuple._2
      // Prints the student ID, name, and score.
      println("学号:" + id + " 姓名:" + name + " 成绩:" + score)
    })
  }
}

9.groupByKey

import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list = List(("武当", "张三丰"), ("峨眉", "灭绝师太"), ("武当", "宋青书"), ("峨眉", "周芷若"))
    val rdd1 = sc.parallelize(list)
    // Collect all values that share a key into one Iterable per key.
    val rdd2 = rdd1.groupByKey()
    rdd2.foreach(t => {
      val menpai = t._1
      val people = t._2.mkString(" ")
      // Prints the sect (门派) followed by its members (人员).
      println("门派:" + menpai + "人员:" + people)
    })
  }
}

10.cartesian

Computes the Cartesian product of two RDDs.

import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list1 = List("A", "B")
    val list2 = List(1, 2, 3)
    val list1RDD = sc.parallelize(list1)
    val list2RDD = sc.parallelize(list2)
    // Every element of list1RDD is paired with every element of list2RDD.
    list1RDD.cartesian(list2RDD).foreach(t => println(t._1 + "->" + t._2))
  }
}

11.filter

import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    val listRDD = sc.parallelize(list)
    // Keep only the even numbers; println puts each one on its own line.
    listRDD.filter(num => num % 2 == 0).foreach(println(_))
  }
}

12.distinct

import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list = List(1, 1, 2, 2, 3, 3, 4, 5)
    val rdd = sc.parallelize(list)
    // Remove duplicate elements.
    rdd.distinct().foreach(println)
  }
}

13.intersection

import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list1 = List(1, 2, 3, 4)
    val list2 = List(3, 4, 5, 6)
    val list1RDD = sc.parallelize(list1)
    val list2RDD = sc.parallelize(list2)
    // Keep only the elements present in both RDDs.
    list1RDD.intersection(list2RDD).foreach(println(_))
  }
}

14.coalesce

Reduces the number of partitions (many --> few).

import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list = List(1, 2, 3, 4, 5)
    // Start with 3 partitions and merge them into 1.
    sc.parallelize(list, 3).coalesce(1).foreach(println(_))
  }
}

15.repartition

Redistributes the data across a new number of partitions.

import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list = List(1, 2, 3, 4)
    val listRDD = sc.parallelize(list, 1)
    // Go from 1 partition to 2; this always shuffles the data.
    listRDD.repartition(2).foreach(println(_))
  }
}
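coalesce and repartition differ in whether they shuffle: repartition(n) is equivalent to coalesce(n, shuffle = true), while plain coalesce(n) avoids a shuffle and therefore cannot increase the partition count. A small sketch that checks this with getNumPartitions, assuming Spark is on the classpath (PartitionCounts and counts are illustrative names):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCounts {
  // Returns the partition counts after coalesce(1), coalesce(6) and repartition(6)
  // applied to an RDD that starts with 3 partitions.
  def counts(sc: SparkContext): (Int, Int, Int) = {
    val rdd = sc.parallelize(1 to 6, 3)
    val shrunk = rdd.coalesce(1).getNumPartitions    // merged down to 1
    val notGrown = rdd.coalesce(6).getNumPartitions  // no shuffle, so it stays at 3
    val grown = rdd.repartition(6).getNumPartitions  // shuffle allows growth to 6
    (shrunk, notGrown, grown)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionCounts").setMaster("local"))
    try {
      println(counts(sc)) // (1,3,6)
    } finally sc.stop()
  }
}
```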

16.repartitionAndSortWithinPartitions

Repartitions the RDD with the given partitioner and sorts the records by key within each resulting partition. This is more efficient than calling repartition and then sorting inside each partition, because the sort is pushed down into the shuffle.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list = List(1, 4, 55, 66, 33, 48, 23)
    val listRDD = sc.parallelize(list, 1)
    // The operator needs key-value pairs, so use each number as its own key.
    listRDD.map(num => (num, num))
      .repartitionAndSortWithinPartitions(new HashPartitioner(2))
      .mapPartitionsWithIndex((index, iterator) => {
        val listBuffer = new ListBuffer[String]
        while (iterator.hasNext) {
          listBuffer.append(s"${index}_${iterator.next()}")
        }
        listBuffer.iterator
      }, false)
      .foreach(println(_))
  }
}
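To see the per-partition ordering directly, the partitions can be gathered with glom(), which turns each partition into an array. This is a sketch under the same input as above, assuming Spark is on the classpath; SortedPartitions and sortedKeysPerPartition are illustrative names.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object SortedPartitions {
  // Returns the keys of each partition, in the order Spark stored them.
  def sortedKeysPerPartition(sc: SparkContext, nums: Seq[Int]): Array[Array[Int]] =
    sc.parallelize(nums, 1)
      .map(n => (n, n))
      .repartitionAndSortWithinPartitions(new HashPartitioner(2))
      .keys
      .glom()     // one array per partition
      .collect()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SortedPartitions").setMaster("local"))
    try {
      val parts = sortedKeysPerPartition(sc, List(1, 4, 55, 66, 33, 48, 23))
      // HashPartitioner(2) sends even Int keys to partition 0 and odd keys to partition 1.
      parts.zipWithIndex.foreach { case (keys, i) => println(s"$i: ${keys.mkString(",")}") }
    } finally sc.stop()
  }
}
```

Each partition is sorted on its own; there is no global ordering across partitions.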

17.cogroup

import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list1 = List((1, "www"), (2, "bbs"))
    val list2 = List((1, "cnblog"), (2, "cnblog"), (3, "very"))
    val list3 = List((1, "com"), (2, "com"), (3, "good"))
    val list1RDD = sc.parallelize(list1)
    val list2RDD = sc.parallelize(list2)
    val list3RDD = sc.parallelize(list3)
    // For each key, group the values from all three RDDs: (key, (values1, values2, values3)).
    list1RDD.cogroup(list2RDD, list3RDD).foreach(tuple =>
      println(s"${tuple._1} ${tuple._2._1} ${tuple._2._2} ${tuple._2._3}"))
  }
}

18.sortByKey

import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list = List((99, "张三丰"), (96, "东方不败"), (66, "林平之"), (98, "聂风"))
    // Sort by key in descending order (ascending = false).
    sc.parallelize(list).sortByKey(false).foreach(tuple => println(tuple._2 + "->" + tuple._1))
  }
}

19.aggregateByKey

import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  val conf = new SparkConf().setAppName("Demo").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val list = List("you,jump", "i,jump")
    // Word count: start each key at 0, add within partitions, then add across partitions.
    sc.parallelize(list)
      .flatMap(_.split(","))
      .map((_, 1))
      .aggregateByKey(0)(_ + _, _ + _)
      .foreach(tuple => println(tuple._1 + "->" + tuple._2))
  }
}
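The word count above uses the same type and the same function for both steps, so it could equally be written with reduceByKey. aggregateByKey becomes useful when the accumulated type differs from the value type. The following sketch computes a per-key average by accumulating a (sum, count) pair; it assumes Spark is on the classpath, and AverageByKey and averages are illustrative names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AverageByKey {
  def averages(sc: SparkContext, scores: Seq[(String, Int)]): Map[String, Double] =
    sc.parallelize(scores)
      .aggregateByKey((0, 0))(
        // seqOp: fold one score into the (sum, count) accumulator within a partition.
        (acc, score) => (acc._1 + score, acc._2 + 1),
        // combOp: merge the accumulators produced by different partitions.
        (a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum.toDouble / count }
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AverageByKey").setMaster("local"))
    try {
      val scores = List(("武当", 99), ("少林", 97), ("武当", 89), ("少林", 77))
      averages(sc, scores).foreach { case (k, avg) => println(s"$k->$avg") }
    } finally sc.stop()
  }
}
```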
