RDD、DataFrame、Dataset三者三者之间转换

转化：

RDD、DataFrame、Dataset三者有许多共性，有各自适用的场景常常需要在三者之间转换

DataFrame/Dataset转RDD：

这个转换很简单

val rdd1=testDF.rdd

val rdd2=testDS.rdd


RDD转DataFrame：

import spark.implicits._

val testDF = rdd.map {line=>

      (line._1,line._2)

    }.toDF("col1","col2")

一般用元组把一行的数据写在一起，然后在toDF中指定字段名

RDD转Dataset：



import spark.implicits._

case class Coltest(col1:String,col2:Int)extends Serializable //定义字段名和类型

val testDS = rdd.map {line=>

      Coltest(line._1,line._2)

    }.toDS

可以注意到，定义每一行的类型（case class）时，已经给出了字段名和类型，后面只要往case class里面添加值即可


Dataset转DataFrame：

这个也很简单，因为只是把case class封装成Row

import spark.implicits._

val testDF = testDS.toDF


DataFrame转Dataset：

import spark.implicits._

case class Coltest(col1:String,col2:Int)extends Serializable //定义字段名和类型

val testDS = testDF.as[Coltest]

这种方法就是在给出每一列的类型后，使用as方法，转成Dataset，这在数据类型是DataFrame又需要针对各个字段处理时极为方便

特别注意：

在使用一些特殊的操作时，一定要加上 import spark.implicits._ 不然toDF、toDS无法使用

package dataframe

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

//
// Explore interoperability between DataFrame and Dataset. Note that Dataset
// is covered in much greater detail in the 'dataset' directory.
//
object DatasetConversion {

case class Cust(id: Integer, name: String, sales: Double, discount: Double, state: String)

case class StateSales(state: String, sales: Double)

def main(args: Array[String]) {
val spark =
SparkSession.builder()
.appName("DataFrame-DatasetConversion")
.master("local[4]")
.getOrCreate()

import spark.implicits._

// create a sequence of case class objects
// (we defined the case class above)
val custs = Seq(
Cust(1, "Widget Co", 120000.00, 0.00, "AZ"),
Cust(2, "Acme Widgets", 410500.00, 500.00, "CA"),
Cust(3, "Widgetry", 410500.00, 200.00, "CA"),
Cust(4, "Widgets R Us", 410500.00, 0.0, "CA"),
Cust(5, "Ye Olde Widgete", 500.00, 0.0, "MA")
)

// Create the DataFrame without passing through an RDD
val customerDF : DataFrame = spark.createDataFrame(custs)
//
// println("*** DataFrame schema")
//
// customerDF.printSchema()
//
// println("*** DataFrame contents")
//
// customerDF.show()

// +---+---------------+--------+--------+-----+
//| id| name| sales|discount|state|
//+---+---------------+--------+--------+-----+
//| 1| Widget Co|120000.0| 0.0| AZ|
//| 2| Acme Widgets|410500.0| 500.0| CA|
//| 3| Widgetry|410500.0| 200.0| CA|
//| 4| Widgets R Us|410500.0| 0.0| CA|
//| 5|Ye Olde Widgete| 500.0| 0.0| MA|
//+---+---------------+--------+--------+-----+

//
// println("*** Select and filter the DataFrame")
//
val smallerDF =
customerDF.select("sales", "state").filter($"state".equalTo("CA"))
//
// smallerDF.show()

//
// +--------+-----+
//| sales|state|
//+--------+-----+
//|410500.0| CA|
//|410500.0| CA|
//|410500.0| CA|
//+--------+-----+

///////////////////////////////////////////////////////////////////////////////////

// Convert it to a Dataset by specifying the type of the rows -- use a case
// class because we have one and it's most convenient to work with. Notice
// you have to choose a case class that matches the remaining columns.
// BUT also notice that the columns keep their order from the DataFrame --
// later you'll see a Dataset[StateSales] of the same type where the
// columns have the opposite order, because of the way it was created.

val customerDS : Dataset[StateSales] = smallerDF.as[StateSales]
//
// println("*** Dataset schema")
//
// customerDS.printSchema()
//
// println("*** Dataset contents")
//
// customerDS.show()

// Select and other operations can be performed directly on a Dataset too,
// but be careful to read the documentation for Dataset -- there are
// "typed transformations", which produce a Dataset, and
// "untyped transformations", which produce a DataFrame. In particular,
// you need to project using a TypedColumn to gate a Dataset.

// val verySmallDS : Dataset[Double] = customerDS.select($"sales".as[Double])
//
// println("*** Dataset after projecting one column")
//
// verySmallDS.show()

//
//+--------+
//| sales|
//+--------+
//|410500.0|
//|410500.0|
//|410500.0|
//+--------+

// If you select multiple columns on a Dataset you end up with a Dataset
// of tuple type, but the columns keep their names.
val tupleDS : Dataset[(String, Double)] =
customerDS.select($"state".as[String], $"sales".as[Double])
//
// println("*** Dataset after projecting two columns -- tuple version")
//
// tupleDS.show()

//
//+-----+--------+
//|state| sales|
//+-----+--------+
//| CA|410500.0|
//| CA|410500.0|
//| CA|410500.0|
//+-----+--------+

// You can also cast back to a Dataset of a case class. Notice this time
// the columns have the opposite order than the last Dataset[StateSales]
// val betterDS: Dataset[StateSales] = tupleDS.as[StateSales]
//
// println("*** Dataset after projecting two columns -- case class version")
//
// betterDS.show()

//
//+-----+--------+
//|state| sales|
//+-----+--------+
//| CA|410500.0|
//| CA|410500.0|
//| CA|410500.0|
//+-----+--------+

// Converting back to a DataFrame without making other changes is really easy
// val backToDataFrame : DataFrame = tupleDS.toDF()
//
// println("*** This time as a DataFrame")
//
// backToDataFrame.show()
//

//+-----+--------+
//|state| sales|
//+-----+--------+
//| CA|410500.0|
//| CA|410500.0|
//| CA|410500.0|
//+-----+--------+

//
// // While converting back to a DataFrame you can rename the columns
val renamedDataFrame : DataFrame = tupleDS.toDF("MyState", "MySales")

println("*** Again as a DataFrame but with renamed columns")

renamedDataFrame.show()

// +-------+--------+
//|MyState| MySales|
//+-------+--------+
//| CA|410500.0|
//| CA|410500.0|
//| CA|410500.0|
//+-------+--------+

}
}

RDD、DataFrame、Dataset三者三者之间转换的更多相关文章

APACHE SPARK 2.0 API IMPROVEMENTS: RDD, DATAFRAME, DATASET AND SQL
What’s New, What’s Changed and How to get Started. Are you ready for Apache Spark 2.0? If you are ju ...
spark的数据结构 RDD——DataFrame——DataSet区别
转载自:http://blog.csdn.net/wo334499/article/details/51689549 RDD 优点: 编译时类型安全编译时就能检查出类型错误面向对象的编程风格直接 ...
sparkSQL中RDD——DataFrame——DataSet的区别
spark中RDD.DataFrame.DataSet都是spark的数据集合抽象,RDD针对的是一个个对象,但是DF与DS中针对的是一个个Row RDD 优点: 编译时类型安全编译时就能检查出类型 ...
RDD, DataFrame or Dataset
总结: 1.RDD是一个Java对象的集合.RDD的优点是更面向对象,代码更容易理解.但在需要在集群中传输数据时需要为每个对象保留数据及结构信息,这会导致数据的冗余,同时这会导致大量的GC. 2.Da ...
spark rdd df dataset
RDD.DataFrame.DataSet的区别和联系共性: 1)都是spark中得弹性分布式数据集,轻量级 2)都是惰性机制,延迟计算 3)根据内存情况,自动缓存,加快计算速度 4)都有parti ...
byte[] 、Bitmap与Drawbale 三者直接的转换
经常遇到这种类似头疼的问题 byte[] .Bitmap与Drawbale 三者直接的转换 1.byte[] ->Bitmap Bitmap Bitmap = BitmapFactory.dec ...
Spark入门之DataFrame/DataSet
目录 Part I. Gentle Overview of Big Data and Spark Overview 1.基本架构 2.基本概念 3.例子(可跳过) Spark工具箱 1.Dataset ...
C#中对象，字符串，dataTable、DataReader、DataSet，对象集合转换成Json字符串方法。
C#中对象,字符串,dataTable.DataReader.DataSet,对象集合转换成Json字符串方法. public class ConvertJson { #region 私有方法 /// ...
关于不同进制数之间转换的数学推导【Written By KillerLegend】
关于不同进制数之间转换的数学推导涉及范围:正整数范围内二进制(Binary),八进制(Octonary),十进制(Decimal),十六进制(hexadecimal)之间的转换数的进制有多种,比如 ...

随机推荐

ACM：油田(Oil Deposits,UVa 572)
/* Oil Deposits Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 65536/32768 K (Java/Others) Tot ...
[性能优化] perf 高级用法：完整记录程序性能指标，并按照时间段对程序进行有针对性的性能分析
如题: 假设你已经熟悉了基本用法,知道perf是干嘛的,以及会用 perf top [性能优化] perf 背景:目标程序在运行的某时间段内会出现性能下降,需要了解这个时间内,程序发生了什么. 方法: ...
抽屉之Tornado实战（5）--点赞与评论树
点赞点赞的过程:数字增加,并在后台点赞表记录数据需要发过去的数据:用户id,新闻id 用户id从session里获得,那新闻id怎么获取呢?这想到分页是循环新闻列表来展示内容,循环的新闻id可以做 ...
Transparent HugePages（透明大页）
Transparent HugePages(透明大页) 1. 介绍从RedHat6, RedHat7, OL6, OL7 SLES11 and UEK2 kernels开始,透明大页默认是被开启的以 ...
重写toString()
重写Object的toString()之前,得到的结果是类型 @ 内存地址 demo: package cn.sasa.demo1; public class Test { public stat ...
PyCharm 常用习惯设置
1.pycharm改变选中行时改行的颜色和光标所在行的颜色 1.是光标所在行的背景颜色,写代码,每写到这一行,就会是这种颜色,所以尽量改成和你代码背景颜色相近的 2.应该是旁边行号所在背景颜色 3.是 ...
转： js实现全角半角检测的方法
//全角半角校验 function issbccase(strTmp) { for (var i=0; i<strTmp.length; i++) { if (strTmp.charCodeAt ...
tail命令输出文件后n行，默认查看文件的后10行
默认查看文件的后10行 -n 3 数字也可以忽略-n 直接加数字 tail 3 查看文件后3行 [root@localhost ~]# tail /etc/passwd // 默认查看文件的后十 ...
finecms如何批量替换文章中的关键词?
Finecms批量替换文章关键词要怎么操作呢,比如把关键词A换为B?Finecms是免费开源无商业限制的内容管理系统,个人在维护,但二次开发很灵活,我们可以通过开发插件或数据库sql语句来操作,下面就 ...
教你一步步composer安装Magento2.3
以前外贸建站一直用zencart,这段时间ytkah比较有时间,就决定用magento来创建一下站点.magento不像普通的程序一样下载就可以直接安装,需要借助composer安装,还没没compo ...

RDD、DataFrame、Dataset三者三者之间转换

RDD、DataFrame、Dataset三者三者之间转换的更多相关文章

随机推荐

热门专题