DataSet:面向对象的,从JVM进行构建,或从其它格式进行转化

DataFrame:面向SQL查询,从多种数据源进行构建,或从其它格式进行转化

RDD DataSet DataFrame互转

1.RDD -> Dataset
val ds = rdd.toDS() 2.RDD -> DataFrame
val df = spark.read.json(rdd) 3.Dataset -> RDD
val rdd = ds.rdd 4.Dataset -> DataFrame
val df = ds.toDF() 5.DataFrame -> RDD
val rdd = df.toJSON.rdd 6.DataFrame -> Dataset
val ds = df.toJSON

DataFrameTest1.scala

package com.spark.dataframe

import org.apache.spark.{SparkConf, SparkContext}

class DataFrameTest1 {
} object DataFrameTest1{ def main(args : Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin");
val logFile = "e://temp.txt"
val conf = new SparkConf().setAppName("test").setMaster("local[4]")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile,2).cache() val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count() println(s"Lines with a: $numAs , Line with b : $numBs") sc.stop()
}
}

DataFrameTest2.scala

package com.spark.dataframe

import org.apache.spark.sql.SparkSession

class DataFrameTest2 {
} object DataFrameTest2{ def main(args : Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.master("local[4]")
.getOrCreate() val df = spark.read.json("E:\\spark\\datatemp\\people.json")
df.show() // This import is needed to use the $-notation
import spark.implicits._
df.printSchema()
df.select("name").show()
df.filter("age>21").show()
df.select($"name",$"age"+1).show() df.groupBy("age").count().show() }
}

DataFrameTest3.scala

package com.spark.dataframe

import org.apache.spark.sql.SparkSession

class DataFrameTest3 {
} object DataFrameTest3{ def main(args : Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.master("local[4]")
.getOrCreate() val df = spark.read.json("E:\\spark\\datatemp\\people.json")
// 将DataFrame注册为sql temporary view
df.createOrReplaceTempView("people") val sqlDF = spark.sql("select * from people")
sqlDF.show()
//spark.sql("select * from global_temp.people").show() }
}

DataSetTest1.scala

package com.spark.dataframe

import org.apache.spark.sql.SparkSession

class DataSetTest1 {
} case class Person(name: String, age: Long) object DataSetTest1 {
def main(args : Array[String]): Unit ={ System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._ val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show() val ds = spark.read.json("E:\\spark\\datatemp\\people.json").as[Person]
ds.show() }
}

RDDToDataFrame.scala

package com.spark.dataframe

import org.apache.spark.sql.{Row, SparkSession}

class RDDToDataFrame {
} //介绍两种将RDD转换为DataFrame的方式
object RDDToDataFrame{
def main(args : Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Rdd to DataFrame")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._ // 数据读取类可以提前定义,Person
val peopleDF =spark.sparkContext
.textFile("E:\\spark\\datatemp\\people.txt")
.map(_.split(","))
.map(attribute => Person(attribute(0),attribute(1).trim.toInt))
.toDF() peopleDF.createOrReplaceTempView("people") val teenagerDF = spark.sql("select name, age from people where age between 13 and 19")
teenagerDF.map(teenager=> "name:"+teenager(0)).show()
teenagerDF.map(teenager => "Name: "+teenager.getAs[String]("name")).show() // No pre-defined encoders for Dataset[Map[K,V]], define explicitly //隐式参数,后面需要Encoder类型的参数时时候则自动调用
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String,Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder() // row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagerDF.map(teenager => teenager.getValuesMap[Any](List("name","age"))).collect().foreach(println(_))
// Array(Map("name" -> "Justin", "age" -> 19)) //////////////////////////////////////////
//case classes 不能提前定义
/*
* When case classes cannot be defined ahead of time
* (for example, the structure of records is encoded in a string,
* or a text dataset will be parsed and fields will be projected differently for different users),
* a DataFrame can be created programmatically with three steps.
* 1. Create an RDD of Rows from the original RDD;
* 2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
* 3. Apply the schema to the RDD of Rows via createDataFrame method provided by SparkSession.
* */
import org.apache.spark.sql.types._ //1. 创建RDD
val peopleRDD = spark.sparkContext.textFile("e:\\spark\\datatemp\\people.txt")
//2.1 创建和RDD相匹配的schema
val schemaString = "name age"
val fields = schemaString.split(" ")
.map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields) //2.2. 将RDD进行格式化
val rowRDD = peopleRDD
.map(_.split(","))
.map(attributes => Row(attributes(0),attributes(1).trim)) //3. 将RDD转换为DF
val peopleDF2 = spark.createDataFrame(rowRDD, schema)
peopleDF2.createOrReplaceTempView("people")
val results = spark.sql("select name from people") results.show() }
}

GenericLoadAndSave.scala

package com.spark.dataframe
import org.apache.spark.sql.{SaveMode, SparkSession}
class GenericLoadAndSave {
} object GenericLoadAndSave{
def main(args: Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Rdd to DataFrame")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._ //保存为parquet格式的数据
val userDF = spark.read.json("e:\\spark\\datatemp\\people.json")
//userDF.select("name","age").write.save("e:\\spark\\datasave\\nameAndAge.parquet")
//数据保存时的模式设置为append
userDF.select("name","age").write.mode(SaveMode.Overwrite).save("e:\\spark\\datasave\\nameAndAge.parquet") //数据源的格式可以指定为 (json, parquet, jdbc, orc, libsvm, csv, text)
val peopleDF = spark.read.format("json").load("e:\\spark\\datatemp\\people.json")
//peopleDF.select("name","age").write.format("json").save("e:\\spark\\datasave\\peopleNameAndAge.json")
//数据保存时的模式设置为overwrite
peopleDF.select("name","age").write.mode(SaveMode.Overwrite).format("json").save("e:\\spark\\datasave\\peopleNameAndAge.json") //从parquet格式的数据源中读取数据构建DataFrame
val peopleDF2 = spark.read.format("parquet").load("E:\\spark\\datasave\\nameAndAge.parquet\\")
//+"part-00000-*.snappy.parquet") //这行加上便于精准定位。事实上parquet可以根据文件路径自行发现和推断分区信息
System.out.println("------------------")
peopleDF2.select("name","age").show() //userDF.select("name","age").write.saveAsTable("e:\\spark\\datasave\\peopleSaveAsTable") //代码有错误,原因暂时未知
//val sqlDF = spark.sql("SELECT * FROM parquet.'E:\\spark\\datasave\\nameAndAge.parquet\\part-00000-c8740fc5-cba8-4ebe-a7a8-9cec3da7dfa2.snappy.parquet'")
//sqlDF.show()
}
}

ReadFromParquet.scala

package com.spark.dataframe
import org.apache.spark.sql.{SaveMode, SparkSession} class ReadFromParquet {
} object ReadFromParquet{
def main(args: Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Rdd to DataFrame")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._
//从parquet格式的数据源中读取数据构建DataFrame
val peopleDF2 = spark.read.format("parquet").load("E:\\spark\\datasave\\people") /*
* 目录结构为:
* people
* |- country=china
* |-data.parquet
* |- country=us
* |-data.parquet
*
* data.parquet内包含people的name和age。加上文件路径中的country信息,最终得到的表结构为:
* +-------+----+-------+
* | name| age|country|
* +-------+----+-------+
* */
peopleDF2.show()
}
}

SchemaMerge.scala

package com.spark.dataframe
import org.apache.spark.sql.{SaveMode, SparkSession} class SchemaMerge {
} object SchemaMerge{
def main(args: Array[String]) {
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Rdd to DataFrame")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._ val squaresDF = spark.sparkContext.makeRDD(1 to 5)
.map(i=>(i,i*i))
.toDF("value","square") squaresDF.write.mode(SaveMode.Overwrite).parquet("E:\\spark\\datasave\\schemamerge\\test_table\\key=1") val cubesDF = spark.sparkContext.makeRDD(1 to 5)
.map(i => (i,i*i*i))
.toDF("value","cube")
cubesDF.write.mode(SaveMode.Overwrite).parquet("E:\\spark\\datasave\\schemamerge\\test_table\\key=2") val mergedDF = spark.read.option("mergeSchema","true")
.parquet("E:\\spark\\datasave\\schemamerge\\test_table\\") mergedDF.printSchema()
mergedDF.show()
}
}

结果:

Spark笔记-DataSet,DataFrame的更多相关文章

  1. Spark提高篇——RDD/DataSet/DataFrame(一)

    该部分分为两篇,分别介绍RDD与Dataset/DataFrame: 一.RDD 二.DataSet/DataFrame 先来看下官网对RDD.DataSet.DataFrame的解释: 1.RDD ...

  2. Spark提高篇——RDD/DataSet/DataFrame(二)

    该部分分为两篇,分别介绍RDD与Dataset/DataFrame: 一.RDD 二.DataSet/DataFrame 该篇主要介绍DataSet与DataFrame. 一.生成DataFrame ...

  3. spark算子之DataFrame和DataSet

    前言 传统的RDD相对于mapreduce和storm提供了丰富强大的算子.在spark慢慢步入DataFrame到DataSet的今天,在算子的类型基本不变的情况下,这两个数据集提供了更为强大的的功 ...

  4. spark结构化数据处理:Spark SQL、DataFrame和Dataset

    本文讲解Spark的结构化数据处理,主要包括:Spark SQL.DataFrame.Dataset以及Spark SQL服务等相关内容.本文主要讲解Spark 1.6.x的结构化数据处理相关东东,但 ...

  5. spark RDD、DataFrame、DataSet之间的相互转化

    这三个数据集看似经常用,但是真正归纳总结的时候,很容易说不出来 三个之间的关系与区别参考我的另一篇blog  http://www.cnblogs.com/xjh713/p/7309507.html ...

  6. Spark Dataset DataFrame空值null,NaN判断和处理

    Spark Dataset DataFrame空值null,NaN判断和处理 import org.apache.spark.sql.SparkSession import org.apache.sp ...

  7. Spark Dataset DataFrame 操作

    Spark Dataset DataFrame 操作 相关博文参考 sparksql中dataframe的用法 一.Spark2 Dataset DataFrame空值null,NaN判断和处理 1. ...

  8. Spark SQL、DataFrame和Dataset——转载

    转载自:  Spark SQL.DataFrame和Datase

  9. RDD/Dataset/DataFrame互转

    1.RDD -> Dataset val ds = rdd.toDS() 2.RDD -> DataFrame val df = spark.read.json(rdd) 3.Datase ...

随机推荐

  1. Python 练习:使用 # 号输出长方形

    使用 # 号输出一个长方形,用户可以指定宽和高 height = int(input("please input height: "))width = int(input(&quo ...

  2. 微信小程序日历课表

    最近项目中使用到了日历,在网上找了一些参考,自己改改,先看效果图 wxml <view class="date"> <image class="dire ...

  3. Mobile first! Wijmo 5 + Ionic Framework之:费用跟踪 App

    费用跟踪应用采用了Wijmo5和Ionic Framework创建,目的是构建一个hybird app. 我们基于<Mobile first! Wijmo 5 + Ionic Framework ...

  4. 《数据库系统概念》10-ER模型

    通过建立实体到概念模型的映射,Entity-Relationship Model可以表达整个数据库的逻辑结构,很多数据库产品都采用E-R模型来表达数据库设计. 一.E-R模型采用了三个基本概念:实体集 ...

  5. Android Studio 点击两次返回键,退出APP

    该功能的实现没有特别复杂,主要在onKeyDown()事件中实现,直接上代码,如下: //第一次点击事件发生的时间 private long mExitTime; /** * 点击两次返回退出app ...

  6. WRT 下 C++ wstring, string, String^ 互转

    由于项目原因,需要引入C++. wstring 与 string 的互转研究了一段时间,坑主要在于使用下面这种方式进行转换,中文会乱码 wstring ws = L"这是一段测试文字&quo ...

  7. mac挂载分区包括EFI 或者任何隐藏分区

    1.mac终端下的diskutil命令是用来操作磁盘的 diskutil list #显示当前pc所有的磁盘 2.例如我们要挂载u盘中的efi分区 ,确定你的efi分区的 identified 我的是 ...

  8. mysql中的utf8mb4、utf8mb4_unicode_ci、utf8mb4_general_ci

    1.utf8与utf8mb4(utf8 most bytes 4) MySQL 5.5.3之后增加了utfmb4字符编码 支持BMP(Basic Multilingual Plane,基本多文种平面) ...

  9. [20171121]rman backup as copy 2.txt

    [20171121]rman backup as copy 2.txt --//昨天测试backup as copy ,备份时备份文件的文件头什么时候更新.是最后完成后还是顺序写入备份文件.--//我 ...

  10. scrapy简单分布式爬虫

    经过一段时间的折腾,终于整明白scrapy分布式是怎么个搞法了,特记录一点心得. 虽然scrapy能做的事情很多,但是要做到大规模的分布式应用则捉襟见肘.有能人改变了scrapy的队列调度,将起始的网 ...