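A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row, that is, Dataset[Row].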
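
To make the typed and untyped views concrete, here is a minimal sketch that can be pasted into spark-shell. The Person case class, the local[*] master and the two sample records are assumptions made only for this illustration; they are not part of the sample data used later.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(gender: String, age: Double)

val spark = SparkSession.builder().master("local[*]").appName("typed-vs-untyped").getOrCreate()
import spark.implicits._

// Typed view: every element is a Person, so field access is checked at compile time.
val people: Dataset[Person] = Seq(Person("male", 37), Person("female", 27)).toDS()
val ages: Array[Double] = people.map(_.age).collect()

// Untyped view: DataFrame is a type alias for Dataset[Row], so this assignment compiles.
val df: DataFrame = people.toDF()
val rows: Dataset[Row] = df
val firstGender: String = rows.first().getAs[String]("gender")

// Returning to the typed view requires naming the element type again.
val typedAgain: Dataset[Person] = df.as[Person]
```

With the untyped view, a misspelled column name only fails at runtime, whereas a misspelled field on Dataset[Person] fails at compile time.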
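Datasets are "lazy": computation is triggered only when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates an efficient physical plan for parallel, distributed execution.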
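
The laziness described above can be observed with a small sketch like the following, again intended for spark-shell. The accumulator name, the local[*] master and the input numbers are assumptions for illustration. Note that Spark only guarantees exact accumulator counts for updates made inside actions; a retried task could increment the counter more than once.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("lazy-evaluation").getOrCreate()
import spark.implicits._

// An accumulator lets the driver observe how many elements were actually processed.
val processed = spark.sparkContext.longAccumulator("processed")

// Transformation only: this extends the logical plan but does not run anything yet.
val doubled = Seq(1, 2, 3, 4, 5).toDS().map { x => processed.add(1); x * 2 }
val beforeAction: Long = processed.value   // still zero, nothing has executed

// Action: collect() triggers optimization, physical planning and execution.
val result: Array[Int] = doubled.collect()
val afterAction: Long = processed.value    // now reflects the elements that were mapped
```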

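Field descriptions for the sample data

// affairs: frequency of extramarital affairs during the past year
// gender: gender
// age: age
// yearsmarried: number of years married
// children: whether the couple has children
// religiousness: degree of religiousness (5-point scale; 1 = anti-religious, 5 = very religious)
// education: education level
// occupation: occupation (7-category Gordon classification, reverse-numbered)
// rating: self-rating of the marriage (5-point scale; 1 = very unhappy, 5 = very happy)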
 

1. Import commonly used packages

import scala.math._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Column
import org.apache.spark.sql.DataFrameReader
import org.apache.spark.sql.functions._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.DataFrameStatFunctions

2. Create a SparkSession and load the sample data

val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

val dataList: List[(Double, String, Double, Double, String, Double, Double, Double, Double)] = List(
(0, "male", 37, 10, "no", 3, 18, 7, 4),
(0, "female", 27, 4, "no", 4, 14, 6, 4),
(0, "female", 32, 15, "yes", 1, 12, 1, 4),
(0, "male", 57, 15, "yes", 5, 18, 6, 5),
(0, "male", 22, 0.75, "no", 2, 17, 6, 3),
(0, "female", 32, 1.5, "no", 2, 17, 5, 5),
(0, "female", 22, 0.75, "no", 2, 12, 1, 3),
(0, "male", 57, 15, "yes", 2, 14, 4, 4),
(0, "female", 32, 15, "yes", 4, 16, 1, 2),
(0, "male", 22, 1.5, "no", 4, 14, 4, 5))

val data = dataList.toDF("affairs", "gender", "age", "yearsmarried", "children", "religiousness", "education", "occupation", "rating")

data.printSchema()
root
|-- affairs: double (nullable = false)
|-- gender: string (nullable = true)
|-- age: double (nullable = false)
|-- yearsmarried: double (nullable = false)
|-- children: string (nullable = true)
|-- religiousness: double (nullable = false)
|-- education: double (nullable = false)
|-- occupation: double (nullable = false)
|-- rating: double (nullable = false)

3. Working with specific columns and rows

// Show the first n records in spark-shell
data.show(7)
+-------+------+----+------------+--------+-------------+---------+----------+------+
|affairs|gender| age|yearsmarried|children|religiousness|education|occupation|rating|
+-------+------+----+------------+--------+-------------+---------+----------+------+
|    0.0|  male|37.0|        10.0|      no|          3.0|     18.0|       7.0|   4.0|
|    0.0|female|27.0|         4.0|      no|          4.0|     14.0|       6.0|   4.0|
|    0.0|female|32.0|        15.0|     yes|          1.0|     12.0|       1.0|   4.0|
|    0.0|  male|57.0|        15.0|     yes|          5.0|     18.0|       6.0|   5.0|
|    0.0|  male|22.0|        0.75|      no|          2.0|     17.0|       6.0|   3.0|
|    0.0|female|32.0|         1.5|      no|          2.0|     17.0|       5.0|   5.0|
|    0.0|female|22.0|        0.75|      no|          2.0|     12.0|       1.0|   3.0|
+-------+------+----+------------+--------+-------------+---------+----------+------+
only showing top 7 rows

// Take the first n records
val data3=data.limit(5)

// Filter rows
data.filter("age>50 and gender=='male' ").show
+-------+------+----+------------+--------+-------------+---------+----------+------+
|affairs|gender| age|yearsmarried|children|religiousness|education|occupation|rating|
+-------+------+----+------------+--------+-------------+---------+----------+------+
|    0.0|  male|57.0|        15.0|     yes|          5.0|     18.0|       6.0|   5.0|
|    0.0|  male|57.0|        15.0|     yes|          2.0|     14.0|       4.0|   4.0|
+-------+------+----+------------+--------+-------------+---------+----------+------+

// All columns of the DataFrame
val columnArray=data.columns
columnArray: Array[String] = Array(affairs, gender, age, yearsmarried, children, religiousness, education, occupation, rating)

// Select specific columns
data.select("gender", "age", "yearsmarried", "children").show(3)
+------+----+------------+--------+
|gender| age|yearsmarried|children|
+------+----+------------+--------+
|  male|37.0|        10.0|      no|
|female|27.0|         4.0|      no|
|female|32.0|        15.0|     yes|
+------+----+------------+--------+
only showing top 3 rows

val colArray=Array("gender", "age", "yearsmarried", "children")
colArray: Array[String] = Array(gender, age, yearsmarried, children)

data.selectExpr(colArray:_*).show(3)
+------+----+------------+--------+
|gender| age|yearsmarried|children|
+------+----+------------+--------+
|  male|37.0|        10.0|      no|
|female|27.0|         4.0|      no|
|female|32.0|        15.0|     yes|
+------+----+------------+--------+
only showing top 3 rows

// Operate on specific columns and sort the result
// data.selectExpr("gender", "age+1","cast(age as bigint)").orderBy($"gender".desc, $"age".asc).show
data.selectExpr("gender", "age+1 as age1","cast(age as bigint) as age2").sort($"gender".desc, $"age".asc).show
+------+----+----+
|gender|age1|age2|
+------+----+----+
|  male|23.0|  22|
|  male|23.0|  22|
|  male|38.0|  37|
|  male|58.0|  57|
|  male|58.0|  57|
|female|23.0|  22|
|female|28.0|  27|
|female|33.0|  32|
|female|33.0|  32|
|female|33.0|  32|
+------+----+----+
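
The string expressions used in this section can also be written with Column operators, which lets the compiler catch malformed expressions. The sketch below is one possible equivalent, not the only way to write it. It uses a small stand-in sample rather than the full data set above, and it sorts on the projected age2 column instead of the original age column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("column-api").getOrCreate()
import spark.implicits._

// Small stand-in sample (an assumption for this sketch, not the article's data).
val sample = Seq(("male", 37.0), ("female", 27.0), ("male", 57.0), ("female", 32.0))
  .toDF("gender", "age")

// Same predicate as filter("age>50 and gender=='male'"), written with Column operators.
val olderMen = sample.filter(col("age") > 50 && col("gender") === "male")

// Same projection as selectExpr("gender", "age+1 as age1", "cast(age as bigint) as age2").
val projected = sample
  .select(col("gender"), (col("age") + 1).as("age1"), col("age").cast("bigint").as("age2"))
  .sort(col("gender").desc, col("age2").asc)

projected.show()
```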

4. Viewing Spark SQL logical and physical execution plans

val data4=data.selectExpr("gender", "age+1 as age1","cast(age as bigint) as age2").sort($"gender".desc, $"age".asc)
data4: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [gender: string, age1: double ... 1 more field]

// Show the physical execution plan
data4.explain()
== Physical Plan ==
*Project [gender#20, age1#135, age2#136L]
+- *Sort [gender#20 DESC, age#21 ASC], true, 0
   +- Exchange rangepartitioning(gender#20 DESC, age#21 ASC, 200)
      +- LocalTableScan [gender#20, age1#135, age2#136L, age#21]

// Show the logical and physical execution plans
data4.explain(extended=true)
== Parsed Logical Plan ==
'Sort ['gender DESC, 'age ASC], true
+- Project [gender#20, (age#21 + cast(1 as double)) AS age1#135, cast(age#21 as bigint) AS age2#136L]
   +- Project [_1#9 AS affairs#19, _2#10 AS gender#20, _3#11 AS age#21, _4#12 AS yearsmarried#22, _5#13 AS children#23, _6#14 AS religiousness#24, _7#15 AS education#25, _8#16 AS occupation#26, _9#17 AS rating#27]
      +- LocalRelation [_1#9, _2#10, _3#11, _4#12, _5#13, _6#14, _7#15, _8#16, _9#17]

== Analyzed Logical Plan ==
gender: string, age1: double, age2: bigint
Project [gender#20, age1#135, age2#136L]
+- Sort [gender#20 DESC, age#21 ASC], true
   +- Project [gender#20, (age#21 + cast(1 as double)) AS age1#135, cast(age#21 as bigint) AS age2#136L, age#21]
      +- Project [_1#9 AS affairs#19, _2#10 AS gender#20, _3#11 AS age#21, _4#12 AS yearsmarried#22, _5#13 AS children#23, _6#14 AS religiousness#24, _7#15 AS education#25, _8#16 AS occupation#26, _9#17 AS rating#27]
         +- LocalRelation [_1#9, _2#10, _3#11, _4#12, _5#13, _6#14, _7#15, _8#16, _9#17]

== Optimized Logical Plan ==
Project [gender#20, age1#135, age2#136L]
+- Sort [gender#20 DESC, age#21 ASC], true
   +- LocalRelation [gender#20, age1#135, age2#136L, age#21]

== Physical Plan ==
*Project [gender#20, age1#135, age2#136L]
+- *Sort [gender#20 DESC, age#21 ASC], true, 0
   +- Exchange rangepartitioning(gender#20 DESC, age#21 ASC, 200)
      +- LocalTableScan [gender#20, age1#135, age2#136L, age#21]
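
Besides printing plans with explain, each planning stage can be inspected programmatically through queryExecution. The sketch below assumes a small stand-in DataFrame and a local[*] master. queryExecution is a developer-facing API, so its details may differ between Spark versions, and the printed plans will not match the output above exactly.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("plan-inspection").getOrCreate()
import spark.implicits._

// Small stand-in sample (an assumption for this sketch, not the article's data).
val sample = Seq(("male", 37.0), ("female", 27.0)).toDF("gender", "age")
val query = sample.selectExpr("gender", "age+1 as age1").sort(col("gender").desc)

val qe = query.queryExecution
val parsed = qe.logical           // plan as built from the API calls
val analyzed = qe.analyzed        // attributes and data types resolved
val optimized = qe.optimizedPlan  // after the optimizer rules have run
val physical = qe.executedPlan    // physical plan prepared for execution

// Print two of the stages as trees, similar to what explain(true) shows.
println(optimized.treeString)
println(physical.treeString)
```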
