Getting to Know Spark SQL

Spark SQL is a Spark module for processing structured data. The core programming abstraction it provides is the DataFrame.

DataFrame: it can be constructed from many sources, including structured data files, Hive tables, external relational databases, and existing RDDs.
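As a rough sketch of what reading from a few of these sources looks like (Spark 1.x API; the Parquet path and the JDBC connection details below are hypothetical placeholders, not from this post):

// assuming an existing SQLContext named sqlContext
val fromJson = sqlContext.read.json("hdfs://master:9000/student/2016113012/spark/students.json")
val fromParquet = sqlContext.read.parquet("hdfs://master:9000/some/path/students.parquet") // hypothetical path
val fromMysql = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/school") // hypothetical connection string
  .option("dbtable", "students")
  .load() // requires the JDBC driver jar on the classpath
// Hive tables additionally require a HiveContext (or a Spark build with Hive support)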

Creating a DataFrame

The data file students.json:

{"id":1, "name":"leo", "age":18}
{"id":2, "name":"jack", "age":19}
{"id":3, "name":"marry", "age":17}

Creating a DataFrame in spark-shell

// upload the data file to an HDFS directory
hadoop@master:~/wujiadong$ hadoop fs -put students.json /student/2016113012/spark
// start the spark shell
hadoop@slave01:~$ spark-shell
// import SQLContext
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
// create a SQLContext instance to work with the data
scala> val sql = new SQLContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sql: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@27acd9a7
// read the data
scala> val students = sql.read.json("hdfs://master:9000/student/2016113012/spark/students.json")
students: org.apache.spark.sql.DataFrame = [age: bigint, id: bigint ... 1 more field]
// display the data
scala> students.show
scala> students.show
+---+---+-----+
|age| id| name|
+---+---+-----+
| 18| 1| leo|
| 19| 2| jack|
| 17| 3|marry|
+---+---+-----+
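(The deprecation warning above appears because, from Spark 2.0 onward, SQLContext and the registerTempTable API used later in this post are superseded by SparkSession and createOrReplaceTempView; the examples here stick to the older 1.x-style API.)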

Common DataFrame operations

scala> students.show
+---+---+-----+
|age| id| name|
+---+---+-----+
| 18| 1| leo|
| 19| 2| jack|
| 17| 3|marry|
+---+---+-----+

scala> students.printSchema
root
 |-- age: long (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

scala> students.select("name").show
+-----+
| name|
+-----+
|  leo|
| jack|
|marry|
+-----+

scala> students.select(students("name"),students("age")+1).show
+-----+---------+
| name|(age + 1)|
+-----+---------+
|  leo|       19|
| jack|       20|
|marry|       18|
+-----+---------+

scala> students.filter(students("age")>18).show
+---+---+----+
|age| id|name|
+---+---+----+
| 19|  2|jack|
+---+---+----+

scala> students.groupBy("age").count().show
+---+-----+
|age|count|
+---+-----+
| 19|    1|
| 17|    1|
| 18|    1|
+---+-----+
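Once a DataFrame is registered as a temporary table, the same operations can also be written as SQL. A minimal sketch, reusing the sql SQLContext created above (the table name is arbitrary):

scala> students.registerTempTable("t_students")
scala> sql.sql("select name, age + 1 from t_students where age > 17").show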

Two ways to convert an RDD into a DataFrame

1) Reflection-based approach
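The input file students.txt is not shown in the post; judging from the program and its run output below, it presumably holds one comma-separated (id,name,age) record per line:

1,leo,17
2,marry,17
3,jack,18
4,tom,19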

package wujiadong_sparkSQL

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by Administrator on 2017/3/5.
  */
object RDDDataFrameReflection {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDDataFrameReflection")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val fileRDD = sc.textFile("hdfs://master:9000/student/2016113012/data/students.txt")
    val lineRDD = fileRDD.map(line => line.split(","))
    // associate the RDD with the case class
    val studentsRDD = lineRDD.map(x => Students(x(0).toInt, x(1), x(2).toInt))
    // in Scala, the reflection-based RDD-to-DataFrame conversion needs this implicit import
    import sqlContext.implicits._
    val studentsDF = studentsRDD.toDF()
    // register a temporary table
    studentsDF.registerTempTable("t_students")
    val df = sqlContext.sql("select * from t_students")
    df.rdd.foreach(row => println(row(0) + "," + row(1) + "," + row(2)))
    df.rdd.saveAsTextFile("hdfs://master:9000/student/2016113012/data/out")
  }
}

// the case class must be defined outside main, at top level
case class Students(id: Int, name: String, age: Int)

Run output:

hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.RDDDataFrameReflection  --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
17/03/05 22:46:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/05 22:46:48 INFO Slf4jLogger: Slf4jLogger started
17/03/05 22:46:48 INFO Remoting: Starting remoting
17/03/05 22:46:49 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:34921]
17/03/05 22:46:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/03/05 22:46:51 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/03/05 22:47:00 INFO FileInputFormat: Total input paths to process : 1
17/03/05 22:47:07 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/03/05 22:47:07 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/03/05 22:47:07 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/03/05 22:47:07 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/03/05 22:47:07 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
1,leo,17
2,marry,17
3,jack,18
4,tom,19
17/03/05 22:47:10 INFO FileOutputCommitter: Saved output of task 'attempt_201703052247_0001_m_000000_1' to hdfs://master:9000/student/2016113012/data/out/_temporary/0/task_201703052247_0001_m_000000

2) Programmatic interface approach

package wujiadong_sparkSQL

import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by Administrator on 2017/3/5.
  */
object RDDDataFrameBianchen {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDDataFrameBianchen")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // create an RDD from the given path
    val studentsRDD = sc.textFile("hdfs://master:9000/student/2016113012/data/students.txt").map(_.split(","))
    // map the RDD to an RDD of Rows
    val rowRDD = studentsRDD.map(x => Row(x(0).toInt, x(1), x(2).toInt))
    // construct the schema (metadata) programmatically
    val schema = StructType(
      List(
        StructField("id", IntegerType, true),
        StructField("name", StringType, true),
        StructField("age", IntegerType, true)
      )
    )
    // apply the schema to the row RDD
    val studentsDF = sqlContext.createDataFrame(rowRDD, schema)
    // register a temporary table
    studentsDF.registerTempTable("t_students")
    val df = sqlContext.sql("select * from t_students order by age")
    df.rdd.collect().foreach(row => println(row))
  }
}
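Of the two approaches, the reflection-based one gives more concise code and works well when the schema is known up front and can be captured in a case class; the programmatic approach is the fallback when the columns and their types are only known at runtime (for example, read from a configuration file).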

Run output:

hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.RDDDataFrameBianchen --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
17/03/06 11:07:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/06 11:07:27 INFO Slf4jLogger: Slf4jLogger started
17/03/06 11:07:27 INFO Remoting: Starting remoting
17/03/06 11:07:28 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:49756]
17/03/06 11:07:32 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/03/06 11:07:38 INFO FileInputFormat: Total input paths to process : 1
17/03/06 11:07:44 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/03/06 11:07:44 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/03/06 11:07:44 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/03/06 11:07:44 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/03/06 11:07:44 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
[1,leo,17]
[2,marry,17]
[3,jack,18]
[4,tom,19]
17/03/06 11:07:47 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/03/06 11:07:47 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
17/03/06 11:07:47 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

DataFrame vs. RDD

1) In Spark, a DataFrame is a distributed dataset built on top of RDDs, analogous to a two-dimensional table in a traditional database.

2) The main difference between a DataFrame and an RDD is that the former carries schema metadata: each column of the two-dimensional table a DataFrame represents has a name and a type.
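A minimal sketch of what that schema buys you, using the students DataFrame from earlier: with a plain RDD of rows you address fields by position and track types yourself, whereas a DataFrame lets you address columns by name and lets Spark check the types:

// RDD view: positional access, no column names
students.rdd.map(row => row.getLong(0)).take(3) // the age column, by position
// DataFrame view: named, typed columns
students.select("age").show()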

References

http://9269309.blog.51cto.com/9259309/1851673

http://blog.csdn.net/ronaldo4511/article/details/53406069

http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
