Before version 2.0, using Spark required creating a SparkConf and a SparkContext first.

catalog: directory (the session's metadata catalog of databases, tables, and functions)

Spark 2.0 introduced SparkSession. SparkConf, SparkContext, and SQLContext are all wrapped inside SparkSession, which is created through a builder; Datasets and DataFrames can then be created and manipulated through the SparkSession.

SparkSession: The entry point to programming Spark with the Dataset and DataFrame API.
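As a minimal sketch of the builder approach described above, the snippet below creates a session and reaches the older entry points through it. The application name `session-demo` and the `local[*]` master are assumptions for illustration only; inside `spark-shell` a session already exists, and `getOrCreate()` simply returns it.

```scala
import org.apache.spark.sql.SparkSession

// Assumed settings: the app name and local master are placeholders.
val spsession = SparkSession.builder()
  .appName("session-demo")
  .master("local[*]")
  .getOrCreate()

// The pre-2.0 entry points are still reachable from the session.
val sc = spsession.sparkContext     // SparkContext
val sqlCtx = spsession.sqlContext   // SQLContext, kept for backward compatibility
val conf = sc.getConf               // a copy of the SparkConf in use

println(spsession.version)
println(conf.get("spark.app.name"))
```

Because `getOrCreate()` reuses an existing session, the builder settings may not take effect when one is already running.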

scala> import org.apache.spark.sql.SparkSession
SparkSession SparkSessionExtensions

scala> val spsession=SparkSession.builder().getOrCreate()
spsession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@577d07b

scala> spsession.
baseRelationToDataFrame conf emptyDataFrame implicits range sessionState sql streams udf
catalog createDataFrame emptyDataset listenerManager read sharedState sqlContext table version
close createDataset experimental newSession readStream sparkContext stop time

scala> spsession.read.
csv format jdbc json load option options orc parquet schema table text textFile
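
Because `/tmp/person.txt` is comma-separated, the `csv` reader listed above can also parse it directly when given a schema. The following is a minimal sketch, assuming the column names `id`, `name`, `age`, and `fv`; a temporary file with two sample rows stands in for the real file so the snippet runs on its own.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spsession = SparkSession.builder().master("local[*]").getOrCreate()

// Stand-in for /tmp/person.txt (assumption: same comma-separated layout).
val tmp = Files.createTempFile("person", ".txt")
Files.write(tmp, "2,zhangsan,50,866\n4,laoliu,522,30\n".getBytes("UTF-8"))

// Explicit schema, so no manual split/toInt step is needed.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("fv", IntegerType)
))

val personDF = spsession.read.schema(schema).csv(tmp.toUri.toString)
personDF.printSchema()
personDF.show()
```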

--------------------------------------------------------------------------------------------------------------------------------------

scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> val lines=spsession.read.textFile("/tmp/person.txt")
lines: org.apache.spark.sql.Dataset[String] = [value: string]

// import the session's implicit conversions

scala> import spsession.implicits._
import spsession.implicits._

scala> lines.show
+-----------------+
|            value|
+-----------------+
|2,zhangsan,50,866|
|  4,laoliu,522,30|
|5,zhangsan,20,565|
|    6,limi,522,65|
|  1,xiliu,50,6998|
|  7,llihmj,23,565|
+-----------------+

scala> val rowrdd=lines.map(x=>{val arr=x.split("[,]");(arr(0).toLong,arr(1),arr(2).toInt,arr(3).toInt)})
rowrdd: org.apache.spark.sql.Dataset[(Long, String, Int, Int)] = [_1: bigint, _2: string ... 2 more fields]

scala> val personDF=rowrdd.toDF("id","name","age","fv")
personDF: org.apache.spark.sql.DataFrame = [id: bigint, name: string ... 2 more fields]

scala> personDF.printSchema
root
|-- id: long (nullable = false)
|-- name: string (nullable = true)
|-- age: integer (nullable = false)
|-- fv: integer (nullable = false)

scala> personDF.show
+---+--------+---+----+
| id|    name|age|  fv|
+---+--------+---+----+
|  2|zhangsan| 50| 866|
|  4|  laoliu|522|  30|
|  5|zhangsan| 20| 565|
|  6|    limi|522|  65|
|  1|   xiliu| 50|6998|
|  7|  llihmj| 23| 565|
+---+--------+---+----+
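
An alternative to mapping into a tuple and renaming with `toDF` is to map into a case class, which names and types the columns in one step. This is a sketch only: the `Person` case class is hypothetical and not part of the original session, and inline sample rows stand in for `/tmp/person.txt`.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class mirroring the four comma-separated fields.
case class Person(id: Long, name: String, age: Int, fv: Int)

val spsession = SparkSession.builder().master("local[*]").getOrCreate()
import spsession.implicits._

// Sample rows copied from the file contents shown above.
val lines = Seq("2,zhangsan,50,866", "4,laoliu,522,30", "1,xiliu,50,6998").toDS()

// Each line becomes a typed Person; column names come from the field names.
val personDS = lines.map { line =>
  val arr = line.split(",")
  Person(arr(0).toLong, arr(1), arr(2).toInt, arr(3).toInt)
}

personDS.printSchema()
// Typed lambdas can use the fields directly.
personDS.filter(_.age > 100).show()
```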

-------------------------------------------------------------------------

scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> val spsession=SparkSession.builder().getOrCreate()
spsession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4c89c98a

scala> val lines=spsession.read.textFile("/tmp/person.txt")
lines: org.apache.spark.sql.Dataset[String] = [value: string]

scala> val rowDF=lines.map(x=>{val arr=x.split("[,]");(arr(0).toLong,arr(1),arr(2).toInt,arr(3).toInt)})
rowDF: org.apache.spark.sql.Dataset[(Long, String, Int, Int)] = [_1: bigint, _2: string ... 2 more fields]

scala> rowDF.printSchema
root
|-- _1: long (nullable = false)
|-- _2: string (nullable = true)
|-- _3: integer (nullable = false)
|-- _4: integer (nullable = false)

scala> rowDF.show
+---+--------+---+----+
| _1|      _2| _3|  _4|
+---+--------+---+----+
|  2|zhangsan| 50| 866|
|  4|  laoliu|522|  30|
|  5|zhangsan| 20| 565|
|  6|    limi|522|  65|
|  1|   xiliu| 50|6998|
|  7|  llihmj| 23| 565|
+---+--------+---+----+

scala> rowDF.createTempView("Aaa")

scala> spsession.sql("select * from Aaa").show
+---+--------+---+----+
| _1|      _2| _3|  _4|
+---+--------+---+----+
|  2|zhangsan| 50| 866|
|  4|  laoliu|522|  30|
|  5|zhangsan| 20| 565|
|  6|    limi|522|  65|
|  1|   xiliu| 50|6998|
|  7|  llihmj| 23| 565|
+---+--------+---+----+
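
Note that `createTempView` fails if a view with the same name already exists in the session, which is easy to hit when re-running lines in the shell. A sketch of the same query using `createOrReplaceTempView` instead; the view name `person` and the named columns are assumptions chosen for readability.

```scala
import org.apache.spark.sql.SparkSession

val spsession = SparkSession.builder().master("local[*]").getOrCreate()
import spsession.implicits._

// Sample rows copied from the output above, with assumed column names.
val rowDF = Seq(
  (2L, "zhangsan", 50, 866),
  (4L, "laoliu", 522, 30),
  (1L, "xiliu", 50, 6998)
).toDF("id", "name", "age", "fv")

// Replaces any existing temp view called "person" instead of throwing.
rowDF.createOrReplaceTempView("person")

val older = spsession.sql(
  "SELECT name, fv FROM person WHERE age >= 50 ORDER BY fv DESC")
older.show()
```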

scala> import spsession.implicits._
import spsession.implicits._

scala> lines.show
+-----------------+
|            value|
+-----------------+
|2,zhangsan,50,866|
|  4,laoliu,522,30|
|5,zhangsan,20,565|
|    6,limi,522,65|
|  1,xiliu,50,6998|
|  7,llihmj,23,565|
+-----------------+

scala> val wordDF=lines.flatMap(_.split(","))
wordDF: org.apache.spark.sql.Dataset[String] = [value: string]

scala> wordDF.groupBy($"value" as "word").count
res24: org.apache.spark.sql.DataFrame = [word: string, count: bigint]

scala> wordDF.groupBy($"value" as "word").agg(count("*") as "count")
res30: org.apache.spark.sql.DataFrame = [word: string, count: bigint]

scala> rowDF.groupBy($"_3" as "age").agg(count("*") as "count",avg($"_4") as "avg").show
+---+-----+------+
|age|count|   avg|
+---+-----+------+
| 20|    1| 565.0|
| 23|    1| 565.0|
| 50|    2|3932.0|
|522|    2|  47.5|
+---+-----+------+

scala> rowDF.groupBy($"_3" as "age").agg(count("*"),avg($"_4")).show
+---+--------+-------+
|age|count(1)|avg(_4)|
+---+--------+-------+
| 20|       1|  565.0|
| 23|       1|  565.0|
| 50|       2| 3932.0|
|522|       2|   47.5|
+---+--------+-------+
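
The shell imports `org.apache.spark.sql.functions._` automatically, which is why `count` and `avg` resolve above. Outside the shell the import has to be written out. A minimal sketch under that assumption, with named columns and a subset of the sample rows:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count}

val spsession = SparkSession.builder().master("local[*]").getOrCreate()
import spsession.implicits._

// Sample rows copied from the output above, with assumed column names.
val rowDF = Seq(
  (2L, "zhangsan", 50, 866),
  (1L, "xiliu", 50, 6998),
  (4L, "laoliu", 522, 30),
  (6L, "limi", 522, 65)
).toDF("id", "name", "age", "fv")

// Aliases give the aggregate columns stable names instead of count(1) / avg(fv).
val stats = rowDF
  .groupBy($"age")
  .agg(count("*").as("count"), avg($"fv").as("avg_fv"))
  .orderBy($"age")

stats.show()
```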

A DataFrame is a Dataset organized into named columns.

scala> val jsonDF=spsession.read.json("/tmp/pdf1json/part*")
jsonDF: org.apache.spark.sql.DataFrame = [age: bigint, fv: bigint ... 1 more field]

scala> spsession.read.json("/tmp/pdf1json/part*").show
+---+----+--------+
|age|  fv|    name|
+---+----+--------+
| 50|6998|   xiliu|
| 50| 866|zhangsan|
| 20| 565|zhangsan|
| 23| 565|  llihmj|
+---+----+--------+

scala> spsession.read.format("json").load("/tmp/pdf1json/part*").show
+---+----+--------+
|age|  fv|    name|
+---+----+--------+
| 50|6998|   xiliu|
| 50| 866|zhangsan|
| 20| 565|zhangsan|
| 23| 565|  llihmj|
+---+----+--------+

scala> val jsonDF=spsession.read.json("/tmp/pdf1json/part*")
jsonDF: org.apache.spark.sql.DataFrame = [age: bigint, fv: bigint ... 1 more field]

scala> jsonDF.cube("age").mean("fv").show
+----+-------+
| age|avg(fv)|
+----+-------+
|  20|  565.0|
|null| 2248.5|
|  50| 3932.0|
|  23|  565.0|
+----+-------+

scala> jsonDF.cube("age").agg(max("fv"),count("name"),sum("fv")).show
+----+-------+-----------+-------+
| age|max(fv)|count(name)|sum(fv)|
+----+-------+-----------+-------+
|  20|    565|          1|    565|
|null|   6998|          4|   8994|
|  50|   6998|          2|   7864|
|  23|    565|          1|    565|
+----+-------+-----------+-------+
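
The row with a `null` age in the `cube` output is the grand total over all ages, not a record whose age is missing. One way to make that explicit is the `grouping` function, which marks the rows where a column was aggregated away. A sketch using the four JSON rows shown above as inline data; the alias names `is_total` and `avg_fv` are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, grouping}

val spsession = SparkSession.builder().master("local[*]").getOrCreate()
import spsession.implicits._

// Inline copy of the rows read from /tmp/pdf1json above.
val jsonDF = Seq(
  (50L, 6998L, "xiliu"),
  (50L, 866L, "zhangsan"),
  (20L, 565L, "zhangsan"),
  (23L, 565L, "llihmj")
).toDF("age", "fv", "name")

// cube("age") yields one row per age plus one grand-total row;
// grouping("age") is 1 on the total row and 0 elsewhere.
val cubed = jsonDF
  .cube("age")
  .agg(grouping("age").as("is_total"), avg("fv").as("avg_fv"))

cubed.show()
```

With a single column, `cube` is equivalent to `groupBy` plus the total row; with several columns it adds subtotals for every combination.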

---------------------------------------------------------------

scala> val lines=spsession.read.textFile("/tmp/person.txt")
lines: org.apache.spark.sql.Dataset[String] = [value: string]

scala> lines.show
+-----------------+
|            value|
+-----------------+
|2,zhangsan,50,866|
|  4,laoliu,522,30|
|5,zhangsan,20,565|
|    6,limi,522,65|
|  1,xiliu,50,6998|
|  7,llihmj,23,565|
+-----------------+

scala> val lineds=lines.map(x=>{val arr=x.split(",");(arr(0),arr(1),arr(2),arr(3))})
lineds: org.apache.spark.sql.Dataset[(String, String, String, String)] = [_1: string, _2: string ... 2 more fields]

scala> lineds.show
+---+--------+---+----+
| _1|      _2| _3|  _4|
+---+--------+---+----+
|  2|zhangsan| 50| 866|
|  4|  laoliu|522|  30|
|  5|zhangsan| 20| 565|
|  6|    limi|522|  65|
|  1|   xiliu| 50|6998|
|  7|  llihmj| 23| 565|
+---+--------+---+----+

scala> val personDF= lineds.withColumnRenamed("_1","id").withColumnRenamed("_2","name")
personDF: org.apache.spark.sql.DataFrame = [id: string, name: string ... 2 more fields]

scala> personDF.show
+---+--------+---+----+
| id|    name| _3|  _4|
+---+--------+---+----+
|  2|zhangsan| 50| 866|
|  4|  laoliu|522|  30|
|  5|zhangsan| 20| 565|
|  6|    limi|522|  65|
|  1|   xiliu| 50|6998|
|  7|  llihmj| 23| 565|
+---+--------+---+----+

scala> personDF.sort($"id" desc).show
warning: there was one feature warning; re-run with -feature for details
+---+--------+---+----+
| id|    name| _3|  _4|
+---+--------+---+----+
|  7|  llihmj| 23| 565|
|  6|    limi|522|  65|
|  5|zhangsan| 20| 565|
|  4|  laoliu|522|  30|
|  2|zhangsan| 50| 866|
|  1|   xiliu| 50|6998|
+---+--------+---+----+
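
Two details of the sort above are worth noting. The feature warning comes from the postfix form `$"id" desc`; writing `$"id".desc` avoids it. Also, `id` is a string column here, so the ordering is lexicographic and only looks numeric because every id has one digit. A sketch of the difference, where the row with id `10` is a hypothetical addition that is not in the original file:

```scala
import org.apache.spark.sql.SparkSession

val spsession = SparkSession.builder().master("local[*]").getOrCreate()
import spsession.implicits._

// String ids, as in personDF above; "10" is a made-up row for illustration.
val personDF = Seq(("2", "zhangsan"), ("10", "wangwu"), ("7", "llihmj"))
  .toDF("id", "name")

// Lexicographic order: "7" sorts above "2", which sorts above "10".
val byString = personDF.sort($"id".desc)

// Casting to int first gives numeric order: 10, 7, 2.
val byNumber = personDF.sort($"id".cast("int").desc)

byString.show()
byNumber.show()
```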

scala> val lines=spsession.read.textFile("/tmp/person.txt")
lines: org.apache.spark.sql.Dataset[String] = [value: string]

scala> lines.map(x=>{val arr= x.split(",");(arr(0),arr(1),arr(2),arr(3))}).toDF("id","name","age","fv").show
+---+--------+---+----+
| id|    name|age|  fv|
+---+--------+---+----+
|  2|zhangsan| 50| 866|
|  4|  laoliu|522|  30|
|  5|zhangsan| 20| 565|
|  6|    limi|522|  65|
|  1|   xiliu| 50|6998|
|  7|  llihmj| 23| 565|
+---+--------+---+----+
