SparkSession
在2.0版本之前,使用Spark必须先创建SparkConf和SparkContext
catalog:目录
Spark2.0中引入了SparkSession的概念,SparkConf、SparkContext 和 SQLContext 都已经被封装在 SparkSession 当中,并且可以通过 builder 的方式创建;可以通过 SparkSession 创建并操作 Dataset 和 DataFrame
SparkSession The entry point to programming Spark with the Dataset and DataFrame API.
scala> import org.apache.spark.sql.SparkSession
SparkSession SparkSessionExtensions
scala> val spsession=SparkSession.builder().getOrCreate()
spsession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@577d07b
scala> session.
baseRelationToDataFrame conf emptyDataFrame implicits range sessionState sql streams udf
catalog createDataFrame emptyDataset listenerManager read sharedState sqlContext table version
close createDataset experimental newSession readStream sparkContext stop time
scala> spsession.read.
csv format jdbc json load option options orc parquet schema table text textFile
--------------------------------------------------------------------------------------------------------------------------------------
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> val lines=spsession.read.textFile("/tmp/person.txt")
lines: org.apache.spark.sql.Dataset[String] = [value: string]
//session的导入隐式转换
scala> import spsession.implicits._
import spsession.implicits._
scala> lines.show
+-----------------+
| value|
+-----------------+
|2,zhangsan,50,866|
| 4,laoliu,522,30|
|5,zhangsan,20,565|
| 6,limi,522,65|
| 1,xiliu,50,6998|
| 7,llihmj,23,565|
+-----------------+
scala> val rowrdd=lines.map(x=>{val arr=x.split("[,]");(arr(0).toLong,arr(1),arr(2).toInt,arr(3).toInt)})
rowrdd: org.apache.spark.sql.Dataset[(Long, String, Int, Int)] = [_1: bigint, _2: string ... 2 more fields]
scala> val personDF=rowrdd.toDF("id","name","age","fv")
personDF: org.apache.spark.sql.DataFrame = [id: bigint, name: string ... 2 more fields]
scala> personDF.printSchema
root
|-- id: long (nullable = false)
|-- name: string (nullable = true)
|-- age: integer (nullable = false)
|-- fv: integer (nullable = false)
scala> personDF.show
+---+--------+---+----+
| id| name|age| fv|
+---+--------+---+----+
| 2|zhangsan| 50| 866|
| 4| laoliu|522| 30|
| 5|zhangsan| 20| 565|
| 6| limi|522| 65|
| 1| xiliu| 50|6998|
| 7| llihmj| 23| 565|
+---+--------+---+----+
-------------------------------------------------------------------------
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> val spsession=SparkSession.builder().getOrCreate()
spsession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4c89c98a
scala> val lines=spsession.read.textFile("/tmp/person.txt")
lines: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val rowDF=lines.map(x=>{val arr=x.split("[,]");(arr(0).toLong,arr(1),arr(2).toInt,arr(3).toInt)})
rowDF: org.apache.spark.sql.Dataset[(Long, String, Int, Int)] = [_1: bigint, _2: string ... 2 more fields]
scala> rowDF.printSchema
root
|-- _1: long (nullable = false)
|-- _2: string (nullable = true)
|-- _3: integer (nullable = false)
|-- _4: integer (nullable = false)
scala> rowDF.show
+---+--------+---+----+
| _1| _2| _3| _4|
+---+--------+---+----+
| 2|zhangsan| 50| 866|
| 4| laoliu|522| 30|
| 5|zhangsan| 20| 565|
| 6| limi|522| 65|
| 1| xiliu| 50|6998|
| 7| llihmj| 23| 565|
+---+--------+---+----+
scala> rowDF.createTempView("Aaa")
scala> spsession.sql("select * from Aaa").show
+---+--------+---+----+
| _1| _2| _3| _4|
+---+--------+---+----+
| 2|zhangsan| 50| 866|
| 4| laoliu|522| 30|
| 5|zhangsan| 20| 565|
| 6| limi|522| 65|
| 1| xiliu| 50|6998|
| 7| llihmj| 23| 565|
+---+--------+---+----+
scala> import spsession.implicits._
import spsession.implicits._
scala> lines.show
+-----------------+
| value|
+-----------------+
|2,zhangsan,50,866|
| 4,laoliu,522,30|
|5,zhangsan,20,565|
| 6,limi,522,65|
| 1,xiliu,50,6998|
| 7,llihmj,23,565|
+-----------------+
scala> val wordDF=lines.flatMap(_.split(","))
wordDF: org.apache.spark.sql.Dataset[String] = [value: string]
scala> wordDF.groupBy($"value" as "word").count
res24: org.apache.spark.sql.DataFrame = [word: string, count: bigint]
scala> wordDF.groupBy($"value" as "word").agg(count("*") as "count")
res30: org.apache.spark.sql.DataFrame = [word: string, count: bigint]
scala> rowDF.groupBy($"_3" as "age").agg(count("*") as "count",avg($"_4") as "avg").show
+---+-----+------+
|age|count| avg|
+---+-----+------+
| 20| 1| 565.0|
| 23| 1| 565.0|
| 50| 2|3932.0|
|522| 2| 47.5|
+---+-----+------+
scala> rowDF.groupBy($"_3" as "age").agg(count("*"),avg($"_4")).show
+---+--------+-------+
|age|count(1)|avg(_4)|
+---+--------+-------+
| 20| 1| 565.0|
| 23| 1| 565.0|
| 50| 2| 3932.0|
|522| 2| 47.5|
+---+--------+-------+
A DataFrame is a Dataset organized into named columns.
scala> val jsonDF=spsession.read.json("/tmp/pdf1json/part*")
jsonDF: org.apache.spark.sql.DataFrame = [age: bigint, fv: bigint ... 1 more field]
scala> spsession.read.json("/tmp/pdf1json/part*").show
+---+----+--------+
|age| fv| name|
+---+----+--------+
| 50|6998| xiliu|
| 50| 866|zhangsan|
| 20| 565|zhangsan|
| 23| 565| llihmj|
+---+----+--------+
scala> spsession.read.format("json").load("/tmp/pdf1json/part*").show
+---+----+--------+
|age| fv| name|
+---+----+--------+
| 50|6998| xiliu|
| 50| 866|zhangsan|
| 20| 565|zhangsan|
| 23| 565| llihmj|
+---+----+--------+
scala> val jsonDF=spsession.read.json("/tmp/pdf1json/part*")
jsonDF: org.apache.spark.sql.DataFrame = [age: bigint, fv: bigint ... 1 more field]
scala> jsonDF.cube("age").mean("fv").show
+----+-------+
| age|avg(fv)|
+----+-------+
| 20| 565.0|
|null| 2248.5|
| 50| 3932.0|
| 23| 565.0|
+----+-------+
scala> jsonDF.cube("age").agg(max("fv"),count("name"),sum("fv")).show
+----+-------+-----------+-------+
| age|max(fv)|count(name)|sum(fv)|
+----+-------+-----------+-------+
| 20| 565| 1| 565|
|null| 6998| 4| 8994|
| 50| 6998| 2| 7864|
| 23| 565| 1| 565|
---------------------------------------------------------------
scala> val lines=spsession.read.textFile("/tmp/person.txt")
lines: org.apache.spark.sql.Dataset[String] = [value: string]
scala> lines.show
+-----------------+
| value|
+-----------------+
|2,zhangsan,50,866|
| 4,laoliu,522,30|
|5,zhangsan,20,565|
| 6,limi,522,65|
| 1,xiliu,50,6998|
| 7,llihmj,23,565|
+-----------------+
scala> val lineds=lines.map(x=>{val arr=x.split(",");(arr(0),arr(1),arr(2),arr(3))})
lineds: org.apache.spark.sql.Dataset[(String, String, String, String)] = [_1: string, _2: string ... 2 more fields]
scala> lineds.show
+---+--------+---+----+
| _1| _2| _3| _4|
+---+--------+---+----+
| 2|zhangsan| 50| 866|
| 4| laoliu|522| 30|
| 5|zhangsan| 20| 565|
| 6| limi|522| 65|
| 1| xiliu| 50|6998|
| 7| llihmj| 23| 565|
+---+--------+---+----+
scala> val personDF= lineds.withColumnRenamed("_1","id").withColumnRenamed("_2","name")
personDF: org.apache.spark.sql.DataFrame = [id: string, name: string ... 2 more fields]
scala> personDF.show
+---+--------+---+----+
| id| name| _3| _4|
+---+--------+---+----+
| 2|zhangsan| 50| 866|
| 4| laoliu|522| 30|
| 5|zhangsan| 20| 565|
| 6| limi|522| 65|
| 1| xiliu| 50|6998|
| 7| llihmj| 23| 565|
+---+--------+---+----+
scala> personDF.sort($"id" desc).show
warning: there was one feature warning; re-run with -feature for details
+---+--------+---+----+
| id| name| _3| _4|
+---+--------+---+----+
| 7| llihmj| 23| 565|
| 6| limi|522| 65|
| 5|zhangsan| 20| 565|
| 4| laoliu|522| 30|
| 2|zhangsan| 50| 866|
| 1| xiliu| 50|6998|
+---+--------+---+----+
scala> val lines=spsession.read.textFile("/tmp/person.txt")
lines: org.apache.spark.sql.Dataset[String] = [value: string]
scala> lines.map(x=>{val arr= x.split(",");(arr(0),arr(1),arr(2),arr(3))}).toDF("id","name","age","fv").show
+---+--------+---+----+
| id| name|age| fv|
+---+--------+---+----+
| 2|zhangsan| 50| 866|
| 4| laoliu|522| 30|
| 5|zhangsan| 20| 565|
| 6| limi|522| 65|
| 1| xiliu| 50|6998|
| 7| llihmj| 23| 565|
+---+--------+---+----+
SparkSession的更多相关文章
- 源码中的哲学——通过构建者模式创建SparkSession
spark2.2在使用的时候使用的是SparkSession,这个SparkSession创建的时候很明显的使用了创建者模式.通过观察源代码,简单的模拟了下,可以当作以后编码风格的参考: 官方使用 i ...
- [Spark SQL] SparkSession、DataFrame 和 DataSet 练习
本課主題 DataSet 实战 DataSet 实战 SparkSession 是 SparkSQL 的入口,然后可以基于 sparkSession 来获取或者是读取源数据来生存 DataFrameR ...
- 【sparkSQL】SparkSession的认识
https://www.cnblogs.com/zzhangyuhang/p/9039695.html https://www.jianshu.com/p/dea6a78b9dff 在Spark1.6 ...
- 【spark】SparkSession的API
SparkSession是一个比较重要的类,它的功能的实现,肯定包含比较多的函数,这里介绍下它包含哪些函数. builder函数public static SparkSession.Builder b ...
- pyspark SparkSession及dataframe基本操作
from pyspark import SparkContext, SparkConf import os from pyspark.sql.session import SparkSession f ...
- scala学习(3)-----wordcount【sparksession】
参考: spark中文官方网址:http://spark.apachecn.org/#/ https://www.iteblog.com/archives/1674.html 一.知识点: 1.Dat ...
- Spark2.0 VS Spark 1.* -------SparkSession的区别
Spark .0以前版本: val sparkConf = new SparkConf().setAppName("soyo") val spark = new SparkCont ...
- SparkSession - Spark SQL 的 入口
SparkSession - Spark SQL 的 入口 翻译自:https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/ ...
- spark教程(八)-SparkSession
spark 有三大引擎,spark core.sparkSQL.sparkStreaming, spark core 的关键抽象是 SparkContext.RDD: SparkSQL 的关键抽象是 ...
随机推荐
- 腾讯的模板引擎---artTemplate
主要方法如下5种,在此不详细说artTemplate的方法,主要记录三种使用artTemplate的方法. template(id, data) 根据 id 渲染模板.内部会根据document.ge ...
- 关于凑数问题的dfs
https://www.nowcoder.com/acm/contest/42/F 首先由于是单一解问题,所以使用返回值类型为bool的dfs 然后为了保证dfs的效率性,应该把加数dfs放在前面,不 ...
- 有锁Iphone 3GS 6.0.1 降级刷到4.2.1 完美越狱+解锁
2012-12-20 百度空间所写 前言:由于之前的3GS升级到6.0.1后会有重启之后无法打电话的情况,同学觉得这样很烦,搞得手机不像个手机了.但是这也是没办法的,毕竟是不完美越狱加软解锁的.于是, ...
- 【mysql】mac上基于tar.gz包安装mysql服务
一.准备工作 (1)下载mysql-5.7.21-macos10.13-x86_64.tar.gz,并将该压缩包移动至/usr/local目录下 (2)解压压缩包 二.安装 (1)将解压的包重命名为m ...
- 直接new一个对象出来
- TensorFlow笔记-07-神经网络优化-学习率,滑动平均
TensorFlow笔记-07-神经网络优化-学习率,滑动平均 学习率 学习率 learning_rate: 表示了每次参数更新的幅度大小.学习率过大,会导致待优化的参数在最小值附近波动,不收敛:学习 ...
- JVM(上)
堆.栈 JVM内存≍Heap(堆内存)+PermGen(方法区)+Thrend(栈)Heap(堆内存)=Young(年轻代)+Old(老年代),官方文档建议整个年轻代占整个堆内存的3/8,老年代占整个 ...
- 标 题: 有什么办法快速把pc上的网址发送到手机上
标 题: 有什么办法快速把pc上的网址发送到手机上 transfer2u, pushbullet都可以实现你说的功能,还可以把图片或者选中内容/剪贴板内容发送到手机.后者功能更强,还支持在电脑之间发 ...
- DllPlugin、DllReferencePlugin 可以提取的第三方库列表
DllPlugin.DllReferencePlugin 可以提取的第三方库列表: 'vue/dist/vue.esm.js', // 'vue/dist/vue.common.js' for web ...
- 大数据学习资料之SQL与NOSQL数据库
这几年的大数据热潮带动了一激活了一大批hadoop学习爱好者.有自学hadoop的,有报名培训班学习的.所有接触过hadoop的人都知道,单独搭建hadoop里每个组建都需要运行环境.修改配置文件测试 ...