DataSet:面向对象的,从JVM进行构建,或从其它格式进行转化

DataFrame:面向SQL查询,从多种数据源进行构建,或从其它格式进行转化

RDD DataSet DataFrame互转

1.RDD -> Dataset
val ds = rdd.toDS() 2.RDD -> DataFrame
val df = spark.read.json(rdd) 3.Dataset -> RDD
val rdd = ds.rdd 4.Dataset -> DataFrame
val df = ds.toDF() 5.DataFrame -> RDD
val rdd = df.toJSON.rdd 6.DataFrame -> Dataset
val ds = df.toJSON

DataFrameTest1.scala

package com.spark.dataframe

import org.apache.spark.{SparkConf, SparkContext}

class DataFrameTest1 {
} object DataFrameTest1{ def main(args : Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin");
val logFile = "e://temp.txt"
val conf = new SparkConf().setAppName("test").setMaster("local[4]")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile,2).cache() val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count() println(s"Lines with a: $numAs , Line with b : $numBs") sc.stop()
}
}

DataFrameTest2.scala

package com.spark.dataframe

import org.apache.spark.sql.SparkSession

class DataFrameTest2 {
} object DataFrameTest2{ def main(args : Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.master("local[4]")
.getOrCreate() val df = spark.read.json("E:\\spark\\datatemp\\people.json")
df.show() // This import is needed to use the $-notation
import spark.implicits._
df.printSchema()
df.select("name").show()
df.filter("age>21").show()
df.select($"name",$"age"+1).show() df.groupBy("age").count().show() }
}

DataFrameTest3.scala

package com.spark.dataframe

import org.apache.spark.sql.SparkSession

class DataFrameTest3 {
} object DataFrameTest3{ def main(args : Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.master("local[4]")
.getOrCreate() val df = spark.read.json("E:\\spark\\datatemp\\people.json")
// 将DataFrame注册为sql temporary view
df.createOrReplaceTempView("people") val sqlDF = spark.sql("select * from people")
sqlDF.show()
//spark.sql("select * from global_temp.people").show() }
}

DataSetTest1.scala

package com.spark.dataframe

import org.apache.spark.sql.SparkSession

class DataSetTest1 {
} case class Person(name: String, age: Long) object DataSetTest1 {
def main(args : Array[String]): Unit ={ System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._ val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show() val ds = spark.read.json("E:\\spark\\datatemp\\people.json").as[Person]
ds.show() }
}

RDDToDataFrame.scala

package com.spark.dataframe

import org.apache.spark.sql.{Row, SparkSession}

class RDDToDataFrame {
} //介绍两种将RDD转换为DataFrame的方式
object RDDToDataFrame{
def main(args : Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Rdd to DataFrame")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._ // 数据读取类可以提前定义,Person
val peopleDF =spark.sparkContext
.textFile("E:\\spark\\datatemp\\people.txt")
.map(_.split(","))
.map(attribute => Person(attribute(0),attribute(1).trim.toInt))
.toDF() peopleDF.createOrReplaceTempView("people") val teenagerDF = spark.sql("select name, age from people where age between 13 and 19")
teenagerDF.map(teenager=> "name:"+teenager(0)).show()
teenagerDF.map(teenager => "Name: "+teenager.getAs[String]("name")).show() // No pre-defined encoders for Dataset[Map[K,V]], define explicitly //隐式参数,后面需要Encoder类型的参数时时候则自动调用
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String,Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder() // row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagerDF.map(teenager => teenager.getValuesMap[Any](List("name","age"))).collect().foreach(println(_))
// Array(Map("name" -> "Justin", "age" -> 19)) //////////////////////////////////////////
//case classes 不能提前定义
/*
* When case classes cannot be defined ahead of time
* (for example, the structure of records is encoded in a string,
* or a text dataset will be parsed and fields will be projected differently for different users),
* a DataFrame can be created programmatically with three steps.
* 1. Create an RDD of Rows from the original RDD;
* 2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
* 3. Apply the schema to the RDD of Rows via createDataFrame method provided by SparkSession.
* */
import org.apache.spark.sql.types._ //1. 创建RDD
val peopleRDD = spark.sparkContext.textFile("e:\\spark\\datatemp\\people.txt")
//2.1 创建和RDD相匹配的schema
val schemaString = "name age"
val fields = schemaString.split(" ")
.map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields) //2.2. 将RDD进行格式化
val rowRDD = peopleRDD
.map(_.split(","))
.map(attributes => Row(attributes(0),attributes(1).trim)) //3. 将RDD转换为DF
val peopleDF2 = spark.createDataFrame(rowRDD, schema)
peopleDF2.createOrReplaceTempView("people")
val results = spark.sql("select name from people") results.show() }
}

GenericLoadAndSave.scala

package com.spark.dataframe
import org.apache.spark.sql.{SaveMode, SparkSession}
class GenericLoadAndSave {
} object GenericLoadAndSave{
def main(args: Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Rdd to DataFrame")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._ //保存为parquet格式的数据
val userDF = spark.read.json("e:\\spark\\datatemp\\people.json")
//userDF.select("name","age").write.save("e:\\spark\\datasave\\nameAndAge.parquet")
//数据保存时的模式设置为append
userDF.select("name","age").write.mode(SaveMode.Overwrite).save("e:\\spark\\datasave\\nameAndAge.parquet") //数据源的格式可以指定为 (json, parquet, jdbc, orc, libsvm, csv, text)
val peopleDF = spark.read.format("json").load("e:\\spark\\datatemp\\people.json")
//peopleDF.select("name","age").write.format("json").save("e:\\spark\\datasave\\peopleNameAndAge.json")
//数据保存时的模式设置为overwrite
peopleDF.select("name","age").write.mode(SaveMode.Overwrite).format("json").save("e:\\spark\\datasave\\peopleNameAndAge.json") //从parquet格式的数据源中读取数据构建DataFrame
val peopleDF2 = spark.read.format("parquet").load("E:\\spark\\datasave\\nameAndAge.parquet\\")
//+"part-00000-*.snappy.parquet") //这行加上便于精准定位。事实上parquet可以根据文件路径自行发现和推断分区信息
System.out.println("------------------")
peopleDF2.select("name","age").show() //userDF.select("name","age").write.saveAsTable("e:\\spark\\datasave\\peopleSaveAsTable") //代码有错误,原因暂时未知
//val sqlDF = spark.sql("SELECT * FROM parquet.'E:\\spark\\datasave\\nameAndAge.parquet\\part-00000-c8740fc5-cba8-4ebe-a7a8-9cec3da7dfa2.snappy.parquet'")
//sqlDF.show()
}
}

ReadFromParquet.scala

package com.spark.dataframe
import org.apache.spark.sql.{SaveMode, SparkSession} class ReadFromParquet {
} object ReadFromParquet{
def main(args: Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Rdd to DataFrame")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._
//从parquet格式的数据源中读取数据构建DataFrame
val peopleDF2 = spark.read.format("parquet").load("E:\\spark\\datasave\\people") /*
* 目录结构为:
* people
* |- country=china
* |-data.parquet
* |- country=us
* |-data.parquet
*
* data.parquet内包含people的name和age。加上文件路径中的country信息,最终得到的表结构为:
* +-------+----+-------+
* | name| age|country|
* +-------+----+-------+
* */
peopleDF2.show()
}
}

SchemaMerge.scala

package com.spark.dataframe
import org.apache.spark.sql.{SaveMode, SparkSession} class SchemaMerge {
} object SchemaMerge{
def main(args: Array[String]) {
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Rdd to DataFrame")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._ val squaresDF = spark.sparkContext.makeRDD(1 to 5)
.map(i=>(i,i*i))
.toDF("value","square") squaresDF.write.mode(SaveMode.Overwrite).parquet("E:\\spark\\datasave\\schemamerge\\test_table\\key=1") val cubesDF = spark.sparkContext.makeRDD(1 to 5)
.map(i => (i,i*i*i))
.toDF("value","cube")
cubesDF.write.mode(SaveMode.Overwrite).parquet("E:\\spark\\datasave\\schemamerge\\test_table\\key=2") val mergedDF = spark.read.option("mergeSchema","true")
.parquet("E:\\spark\\datasave\\schemamerge\\test_table\\") mergedDF.printSchema()
mergedDF.show()
}
}

结果:

Spark笔记-DataSet,DataFrame的更多相关文章

  1. Spark提高篇——RDD/DataSet/DataFrame(一)

    该部分分为两篇,分别介绍RDD与Dataset/DataFrame: 一.RDD 二.DataSet/DataFrame 先来看下官网对RDD.DataSet.DataFrame的解释: 1.RDD ...

  2. Spark提高篇——RDD/DataSet/DataFrame(二)

    该部分分为两篇,分别介绍RDD与Dataset/DataFrame: 一.RDD 二.DataSet/DataFrame 该篇主要介绍DataSet与DataFrame. 一.生成DataFrame ...

  3. spark算子之DataFrame和DataSet

    前言 传统的RDD相对于mapreduce和storm提供了丰富强大的算子.在spark慢慢步入DataFrame到DataSet的今天,在算子的类型基本不变的情况下,这两个数据集提供了更为强大的的功 ...

  4. spark结构化数据处理:Spark SQL、DataFrame和Dataset

    本文讲解Spark的结构化数据处理,主要包括:Spark SQL.DataFrame.Dataset以及Spark SQL服务等相关内容.本文主要讲解Spark 1.6.x的结构化数据处理相关东东,但 ...

  5. spark RDD、DataFrame、DataSet之间的相互转化

    这三个数据集看似经常用,但是真正归纳总结的时候,很容易说不出来 三个之间的关系与区别参考我的另一篇blog  http://www.cnblogs.com/xjh713/p/7309507.html ...

  6. Spark Dataset DataFrame空值null,NaN判断和处理

    Spark Dataset DataFrame空值null,NaN判断和处理 import org.apache.spark.sql.SparkSession import org.apache.sp ...

  7. Spark Dataset DataFrame 操作

    Spark Dataset DataFrame 操作 相关博文参考 sparksql中dataframe的用法 一.Spark2 Dataset DataFrame空值null,NaN判断和处理 1. ...

  8. Spark SQL、DataFrame和Dataset——转载

    转载自:  Spark SQL.DataFrame和Datase

  9. RDD/Dataset/DataFrame互转

    1.RDD -> Dataset val ds = rdd.toDS() 2.RDD -> DataFrame val df = spark.read.json(rdd) 3.Datase ...

随机推荐

  1. deepin使用笔记-解决安装并解决gvim没有启动器的问题

    我的邮箱地址:zytrenren@163.com欢迎大家交流学习纠错! 1.安装gvim #apt-get install vim-gtk3 2.创建桌面启动器 创建/usr/share/applic ...

  2. 网络编程学习二(IP与端口)

    InetAddress类 封装计算机的ip地址,没有端口 // 使用getLocalHost方法创建InetAddress对象 InetAddress addr = InetAddress.getLo ...

  3. 【工具相关】Web-Sublime Text2-安装 Package Control

    一,打开Sublime text2---->Preferences--->若Package Settings,Package Control,没有的话,就需要安装Package Contr ...

  4. maven 技巧

    M2Eclipse Releases maven eclipse插件最新安装地址 Full Version Tag 1.0 2011-06-22 http://download.eclipse.org ...

  5. js,H5本地存储

    //存储本地存储----setItem(存储名称,数据名称) var c={name:"man",sex:"woman"}; localStorage.setI ...

  6. solr搜索引擎配置使用mongodb作为数据源

    环境说明: 操作系统:由于是使用的docker直接拉取的镜像部署的,系统是LINUX环境 mongodb: 4.0.3 solr: 7.5.0 python: 3.5 配置mongodb 1.拉取mo ...

  7. django静态文件

    django静态文件(js脚本.CSS.图片等) 默认统一放在每一个app的static文件夹下, 通过收集静态文件命令,自动将每一个app下static文件夹下的文件复制到根目录的static文件夹 ...

  8. Ext.net 3.1学习

    主页面前台代码: <%@ Page Language="C#" AutoEventWireup="true" CodeBehind="MainF ...

  9. C#语言————第四章 深入C#的String类

    *********类型转换**************** Convert与Parse的区别: Convert可以将任何内置类型转换为其他任何内置类型 XX.Parse:只能将字符串转换为XX类型例如 ...

  10. 基于tomcat插件的maven多模块工程热部署(附插件源码)

    内容属原创,转载请注明出处 写在前面的话 最近一直比较纠结,归根结底在于工程的模块化拆分.以前也干过这事,但是一直对以前的结果不满意,这会重操旧业,希望搞出个自己满意的结果. 之前有什么不满意的呢? ...