Before version 2.0, using Spark required creating a SparkConf and a SparkContext first.

catalog: a directory; here, the session's catalog of databases, tables, and functions

Spark 2.0 introduced SparkSession. SparkConf, SparkContext, and SQLContext are all wrapped inside SparkSession, which is created through a builder. Datasets and DataFrames are then created and manipulated through the SparkSession.
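As a minimal sketch of the builder pattern described above, the snippet below creates a session and reaches the wrapped objects through it. The app name "SparkSessionDemo" and the local[*] master are illustrative assumptions, not values from this post. Inside spark-shell, getOrCreate() simply returns the session that is already running.

```scala
import org.apache.spark.sql.SparkSession

// "SparkSessionDemo" and local[*] are illustrative values, not from the original post.
val spark = SparkSession.builder()
  .appName("SparkSessionDemo")
  .master("local[*]")
  .getOrCreate() // returns the already-running session if there is one

// The objects that had to be created by hand before 2.0 are reachable from the session.
val sc = spark.sparkContext   // SparkContext
val conf = sc.getConf         // a copy of the SparkConf in use
val sqlCtx = spark.sqlContext // SQLContext, kept for backward compatibility

println(s"Spark version: ${spark.version}")
println(s"master: ${conf.get("spark.master")}")
```

In a standalone application the same lines would live inside a main method, and spark.stop() would normally be called at the end.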

SparkSession: the entry point to programming Spark with the Dataset and DataFrame API.

scala> import org.apache.spark.sql.SparkSession
SparkSession SparkSessionExtensions

scala> val spsession=SparkSession.builder().getOrCreate()
spsession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@577d07b

scala> spsession.
baseRelationToDataFrame conf emptyDataFrame implicits range sessionState sql streams udf
catalog createDataFrame emptyDataset listenerManager read sharedState sqlContext table version
close createDataset experimental newSession readStream sparkContext stop time

scala> spsession.read.
csv format jdbc json load option options orc parquet schema table text textFile

--------------------------------------------------------------------------------------------------------------------------------------

scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> val lines=spsession.read.textFile("/tmp/person.txt")
lines: org.apache.spark.sql.Dataset[String] = [value: string]

// Import the session's implicit conversions (encoders, toDF, and the $"..." column syntax)

scala> import spsession.implicits._
import spsession.implicits._

scala> lines.show
+-----------------+
|            value|
+-----------------+
|2,zhangsan,50,866|
|  4,laoliu,522,30|
|5,zhangsan,20,565|
|    6,limi,522,65|
|  1,xiliu,50,6998|
|  7,llihmj,23,565|
+-----------------+

scala> val rowrdd=lines.map(x=>{val arr=x.split("[,]");(arr(0).toLong,arr(1),arr(2).toInt,arr(3).toInt)})
rowrdd: org.apache.spark.sql.Dataset[(Long, String, Int, Int)] = [_1: bigint, _2: string ... 2 more fields]

scala> val personDF=rowrdd.toDF("id","name","age","fv")
personDF: org.apache.spark.sql.DataFrame = [id: bigint, name: string ... 2 more fields]

scala> personDF.printSchema
root
|-- id: long (nullable = false)
|-- name: string (nullable = true)
|-- age: integer (nullable = false)
|-- fv: integer (nullable = false)

scala> personDF.show
+---+--------+---+----+
| id|    name|age|  fv|
+---+--------+---+----+
|  2|zhangsan| 50| 866|
|  4|  laoliu|522|  30|
|  5|zhangsan| 20| 565|
|  6|    limi|522|  65|
|  1|   xiliu| 50|6998|
|  7|  llihmj| 23| 565|
+---+--------+---+----+
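The tuple-plus-toDF approach above works, but the column names are only attached at the end. A hedged alternative is to map each line to a case class, so the Dataset is typed and already carries the names. The Person case class below is hypothetical, and the sample rows are copied inline from the output above so that the sketch does not depend on /tmp/person.txt.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class mirroring the four comma-separated fields.
case class Person(id: Long, name: String, age: Int, fv: Int)

val spark = SparkSession.builder()
  .appName("PersonDataset") // illustrative name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Same rows as shown above, inlined instead of read from the file.
val lines = Seq(
  "2,zhangsan,50,866", "4,laoliu,522,30", "5,zhangsan,20,565",
  "6,limi,522,65", "1,xiliu,50,6998", "7,llihmj,23,565"
).toDS()

// Each line becomes a Person; the field names become the column names.
val people = lines.map { line =>
  val arr = line.split(",")
  Person(arr(0).toLong, arr(1), arr(2).toInt, arr(3).toInt)
}

people.printSchema()

// Typed filter: the lambda sees Person objects, not untyped rows.
val olderThan30 = people.filter(_.age > 30)
olderThan30.show()
```

In a compiled application the case class has to be defined at the top level rather than inside a method, otherwise Spark cannot derive an encoder for it.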

-------------------------------------------------------------------------

scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> val spsession=SparkSession.builder().getOrCreate()
spsession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4c89c98a

scala> val lines=spsession.read.textFile("/tmp/person.txt")
lines: org.apache.spark.sql.Dataset[String] = [value: string]

scala> val rowDF=lines.map(x=>{val arr=x.split("[,]");(arr(0).toLong,arr(1),arr(2).toInt,arr(3).toInt)})
rowDF: org.apache.spark.sql.Dataset[(Long, String, Int, Int)] = [_1: bigint, _2: string ... 2 more fields]

scala> rowDF.printSchema
root
|-- _1: long (nullable = false)
|-- _2: string (nullable = true)
|-- _3: integer (nullable = false)
|-- _4: integer (nullable = false)

scala> rowDF.show
+---+--------+---+----+
| _1|      _2| _3|  _4|
+---+--------+---+----+
|  2|zhangsan| 50| 866|
|  4|  laoliu|522|  30|
|  5|zhangsan| 20| 565|
|  6|    limi|522|  65|
|  1|   xiliu| 50|6998|
|  7|  llihmj| 23| 565|
+---+--------+---+----+

scala> rowDF.createTempView("Aaa")

scala> spsession.sql("select * from Aaa").show
+---+--------+---+----+
| _1|      _2| _3|  _4|
+---+--------+---+----+
|  2|zhangsan| 50| 866|
|  4|  laoliu|522|  30|
|  5|zhangsan| 20| 565|
|  6|    limi|522|  65|
|  1|   xiliu| 50|6998|
|  7|  llihmj| 23| 565|
+---+--------+---+----+
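Querying a temporary view is easier to read once the columns have real names. The sketch below is an illustration under assumptions: the view name "person" and the query are made up for this example, and the rows are inlined from the output above. It uses createOrReplaceTempView because createTempView throws an exception when a view with the same name already exists, which is easy to hit when re-running lines in the shell.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TempViewDemo") // illustrative name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Same rows as above, with named columns instead of _1 .. _4.
val people = Seq(
  (2L, "zhangsan", 50, 866), (4L, "laoliu", 522, 30), (5L, "zhangsan", 20, 565),
  (6L, "limi", 522, 65), (1L, "xiliu", 50, 6998), (7L, "llihmj", 23, 565)
).toDF("id", "name", "age", "fv")

// Safe to run repeatedly; createTempView would fail on the second run.
people.createOrReplaceTempView("person")

// Hypothetical query: people older than 30, highest fv first.
val result = spark.sql("SELECT name, fv FROM person WHERE age > 30 ORDER BY fv DESC")
result.show()
```

The view lives only as long as the session that registered it.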

scala> import spsession.implicits._
import spsession.implicits._

scala> lines.show
+-----------------+
|            value|
+-----------------+
|2,zhangsan,50,866|
|  4,laoliu,522,30|
|5,zhangsan,20,565|
|    6,limi,522,65|
|  1,xiliu,50,6998|
|  7,llihmj,23,565|
+-----------------+

scala> val wordDF=lines.flatMap(_.split(","))
wordDF: org.apache.spark.sql.Dataset[String] = [value: string]

scala> wordDF.groupBy($"value" as "word").count
res24: org.apache.spark.sql.DataFrame = [word: string, count: bigint]

scala> wordDF.groupBy($"value" as "word").agg(count("*") as "count")
res30: org.apache.spark.sql.DataFrame = [word: string, count: bigint]

scala> rowDF.groupBy($"_3" as "age").agg(count("*") as "count",avg($"_4") as "avg").show
+---+-----+------+
|age|count|   avg|
+---+-----+------+
| 20|    1| 565.0|
| 23|    1| 565.0|
| 50|    2|3932.0|
|522|    2|  47.5|
+---+-----+------+

scala> rowDF.groupBy($"_3" as "age").agg(count("*"),avg($"_4")).show
+---+--------+-------+
|age|count(1)|avg(_4)|
+---+--------+-------+
| 20|       1|  565.0|
| 23|       1|  565.0|
| 50|       2| 3932.0|
|522|       2|   47.5|
+---+--------+-------+
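The count and avg calls above resolve in spark-shell because the shell imports org.apache.spark.sql.functions._ automatically. Outside the shell that import has to be written explicitly. The following is a small sketch of the same aggregation as a self-contained script, with the rows inlined and the columns named (both are assumptions made for the example).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count} // implicit in spark-shell, explicit elsewhere

val spark = SparkSession.builder()
  .appName("AggDemo") // illustrative name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val people = Seq(
  (2L, "zhangsan", 50, 866), (4L, "laoliu", 522, 30), (5L, "zhangsan", 20, 565),
  (6L, "limi", 522, 65), (1L, "xiliu", 50, 6998), (7L, "llihmj", 23, 565)
).toDF("id", "name", "age", "fv")

// Group by age, then compute the row count and the mean fv per group.
// orderBy makes the row order deterministic; groupBy alone does not guarantee one.
val stats = people
  .groupBy($"age")
  .agg(count("*").as("count"), avg($"fv").as("avg"))
  .orderBy($"age")

stats.show()
```

Aliasing the aggregates with as(...) avoids generated column names such as count(1) and avg(_4).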

A DataFrame is a Dataset organized into named columns.

scala> val jsonDF=spsession.read.json("/tmp/pdf1json/part*")
jsonDF: org.apache.spark.sql.DataFrame = [age: bigint, fv: bigint ... 1 more field]

scala> spsession.read.json("/tmp/pdf1json/part*").show
+---+----+--------+
|age|  fv|    name|
+---+----+--------+
| 50|6998|   xiliu|
| 50| 866|zhangsan|
| 20| 565|zhangsan|
| 23| 565|  llihmj|
+---+----+--------+

scala> spsession.read.format("json").load("/tmp/pdf1json/part*").show
+---+----+--------+
|age|  fv|    name|
+---+----+--------+
| 50|6998|   xiliu|
| 50| 866|zhangsan|
| 20| 565|zhangsan|
| 23| 565|  llihmj|
+---+----+--------+

scala> val jsonDF=spsession.read.json("/tmp/pdf1json/part*")
jsonDF: org.apache.spark.sql.DataFrame = [age: bigint, fv: bigint ... 1 more field]

scala> jsonDF.cube("age").mean("fv").show
+----+-------+
| age|avg(fv)|
+----+-------+
|  20|  565.0|
|null| 2248.5|
|  50| 3932.0|
|  23|  565.0|
+----+-------+

scala> jsonDF.cube("age").agg(max("fv"),count("name"),sum("fv")).show
+----+-------+-----------+-------+
| age|max(fv)|count(name)|sum(fv)|
+----+-------+-----------+-------+
|  20|    565|          1|    565|
|null|   6998|          4|   8994|
|  50|   6998|          2|   7864|
|  23|    565|          1|    565|
+----+-------+-----------+-------+
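The row with a null age in the cube output is the grand total over all rows: cube computes the aggregate for every combination of the grouping columns, including "no grouping at all". The sketch below compares groupBy with cube on the same four rows (inlined here rather than read from the JSON files) and uses the grouping function to mark the total row, which is more reliable than testing for null when the data itself may contain null ages. The column names avg_fv and is_total are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, grouping}

val spark = SparkSession.builder()
  .appName("CubeDemo") // illustrative name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Same rows as the JSON output above.
val df = Seq(
  (50, 6998, "xiliu"), (50, 866, "zhangsan"), (20, 565, "zhangsan"), (23, 565, "llihmj")
).toDF("age", "fv", "name")

// groupBy: one row per distinct age.
val grouped = df.groupBy("age").agg(avg("fv").as("avg_fv"))

// cube: the same rows plus one grand-total row.
// grouping("age") is 1 on the row where age was aggregated away, 0 otherwise.
val cubed = df.cube("age").agg(avg("fv").as("avg_fv"), grouping("age").as("is_total"))

val grandTotal = cubed.filter($"is_total" === 1).first().getAs[Double]("avg_fv")

cubed.show()
println(s"grand total avg(fv) = $grandTotal")
```

With a single grouping column, cube and rollup give the same result. They differ only when there are two or more grouping columns.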

---------------------------------------------------------------

scala> val lines=spsession.read.textFile("/tmp/person.txt")
lines: org.apache.spark.sql.Dataset[String] = [value: string]

scala> lines.show
+-----------------+
|            value|
+-----------------+
|2,zhangsan,50,866|
|  4,laoliu,522,30|
|5,zhangsan,20,565|
|    6,limi,522,65|
|  1,xiliu,50,6998|
|  7,llihmj,23,565|
+-----------------+

scala> val lineds=lines.map(x=>{val arr=x.split(",");(arr(0),arr(1),arr(2),arr(3))})
lineds: org.apache.spark.sql.Dataset[(String, String, String, String)] = [_1: string, _2: string ... 2 more fields]

scala> lineds.show
+---+--------+---+----+
| _1|      _2| _3|  _4|
+---+--------+---+----+
|  2|zhangsan| 50| 866|
|  4|  laoliu|522|  30|
|  5|zhangsan| 20| 565|
|  6|    limi|522|  65|
|  1|   xiliu| 50|6998|
|  7|  llihmj| 23| 565|
+---+--------+---+----+

scala> val personDF= lineds.withColumnRenamed("_1","id").withColumnRenamed("_2","name")
personDF: org.apache.spark.sql.DataFrame = [id: string, name: string ... 2 more fields]

scala> personDF.show
+---+--------+---+----+
| id|    name| _3|  _4|
+---+--------+---+----+
|  2|zhangsan| 50| 866|
|  4|  laoliu|522|  30|
|  5|zhangsan| 20| 565|
|  6|    limi|522|  65|
|  1|   xiliu| 50|6998|
|  7|  llihmj| 23| 565|
+---+--------+---+----+

scala> personDF.sort($"id".desc).show
+---+--------+---+----+
| id|    name| _3|  _4|
+---+--------+---+----+
|  7|  llihmj| 23| 565|
|  6|    limi|522|  65|
|  5|zhangsan| 20| 565|
|  4|  laoliu|522|  30|
|  2|zhangsan| 50| 866|
|  1|   xiliu| 50|6998|
+---+--------+---+----+
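One caveat about the sort above: in this part of the session id is a string column, so the ordering is lexicographic. It looks correct here only because every id is a single digit. The sketch below uses hypothetical ids ("10", "9", "2"), which are not in the original data, to show the difference and one way around it by casting inside the sort expression.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SortDemo") // illustrative name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical ids chosen to expose the problem; stored as strings on purpose.
val ids = Seq("10", "9", "2").toDF("id")

// String comparison: "9" > "2" > "10".
val lexicographic = ids.sort($"id".desc).as[String].collect()

// Casting only inside the sort key keeps the column type but orders numerically.
val numeric = ids.sort($"id".cast("int").desc).as[String].collect()

println(lexicographic.mkString(", "))
println(numeric.mkString(", "))
```

Parsing the field with toLong while mapping the lines, as in the first example of this post, avoids the issue altogether.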

scala> val lines=spsession.read.textFile("/tmp/person.txt")
lines: org.apache.spark.sql.Dataset[String] = [value: string]

scala> lines.map(x=>{val arr= x.split(",");(arr(0),arr(1),arr(2),arr(3))}).toDF("id","name","age","fv").show
+---+--------+---+----+
| id|    name|age|  fv|
+---+--------+---+----+
|  2|zhangsan| 50| 866|
|  4|  laoliu|522|  30|
|  5|zhangsan| 20| 565|
|  6|    limi|522|  65|
|  1|   xiliu| 50|6998|
|  7|  llihmj| 23| 565|
+---+--------+---+----+
