spark SQL初步认识

spark SQL是spark的一个模块,主要用于进行结构化数据的处理。它提供的最核心的编程抽象就是DataFrame。

DataFrame:它可以根据很多源进行构建,包括:结构化的数据文件,hive中的表,外部的关系型数据库,以及RDD

创建DataFrame

数据文件students.json

{"id":1, "name":"leo", "age":18}
{"id":2, "name":"jack", "age":19}
{"id":3, "name":"marry", "age":17}

spark-shell里创建DataFrame

//将文件上传到hdfs目录下
hadoop@master:~/wujiadong$ hadoop fs -put students.json /student/2016113012/spark
//启动spark shell
hadoop@slave01:~$ spark-shell
//导入SQLContext
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
//声明一个SQLContext的对象,以便对数据进行操作
scala> val sql = new SQLContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sql: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@27acd9a7
//读取数据
scala> val students = sql.read.json("hdfs://master:9000/student/2016113012/spark/students.json")
students: org.apache.spark.sql.DataFrame = [age: bigint, id: bigint ... 1 more field]
//显示数据
scala> students.show
+---+---+-----+
|age| id| name|
+---+---+-----+
| 18| 1| leo|
| 19| 2| jack|
| 17| 3|marry|
+---+---+-----+

DataFrame常用操作

scala> students.show
+---+---+-----+
|age| id| name|
+---+---+-----+
| 18| 1| leo|
| 19| 2| jack|
| 17| 3|marry|
+---+---+-----+ scala> students.printSchema
root
|-- age: long (nullable = true)
|-- id: long (nullable = true)
|-- name: string (nullable = true) scala> students.select("name").show
+-----+
| name|
+-----+
| leo|
| jack|
|marry|
+-----+ scala> students.select(students("name"),students("age")+1).show
+-----+---------+
| name|(age + 1)|
+-----+---------+
| leo| 19|
| jack| 20|
|marry| 18|
+-----+---------+ scala> students.filter(students("age")>18).show
+---+---+----+
|age| id|name|
+---+---+----+
| 19| 2|jack|
+---+---+----+ scala> students.groupBy("age").count().show
+---+-----+
|age|count|
+---+-----+
| 19| 1|
| 17| 1|
| 18| 1|
+---+-----+

两种方式将RDD转换成DataFrame

1)基于反射方式

package wujiadong_sparkSQL

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext} /**
* Created by Administrator on 2017/3/5.
*/
object RDDDataFrameReflection {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("rdddatafromareflection")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val fileRDD = sc.textFile("hdfs://master:9000/student/2016113012/data/students.txt")
val lineRDD = fileRDD.map(line => line.split(","))
//将RDD和case class关联
val studentsRDD = lineRDD.map(x => Students(x(0).toInt,x(1),x(2).toInt))
//在scala中使用反射方式,进行rdd到dataframe的转换,需要手动导入一个隐式转换
import sqlContext.implicits._
val studentsDF = studentsRDD.toDF()
//注册表
studentsDF.registerTempTable("t_students")
val df = sqlContext.sql("select * from t_students")
df.rdd.foreach(row => println(row(0)+","+row(1)+","+row(2)))
df.rdd.saveAsTextFile("hdfs://master:9000/student/2016113012/data/out") } }
//放到外面
case class Students(id:Int,name:String,age:Int)

运行结果

hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.RDDDataFrameReflection  --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
17/03/05 22:46:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/05 22:46:48 INFO Slf4jLogger: Slf4jLogger started
17/03/05 22:46:48 INFO Remoting: Starting remoting
17/03/05 22:46:49 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:34921]
17/03/05 22:46:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/03/05 22:46:51 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/03/05 22:47:00 INFO FileInputFormat: Total input paths to process : 1
17/03/05 22:47:07 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/03/05 22:47:07 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/03/05 22:47:07 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/03/05 22:47:07 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/03/05 22:47:07 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
1,leo,17
2,marry,17
3,jack,18
4,tom,19
17/03/05 22:47:10 INFO FileOutputCommitter: Saved output of task 'attempt_201703052247_0001_m_000000_1' to hdfs://master:9000/student/2016113012/data/out/_temporary/0/task_201703052247_0001_m_000000

2)编程接口方式

package wujiadong_sparkSQL

import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext} /**
* Created by Administrator on 2017/3/5.
*/
object RDDDataFrameBianchen {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("RDDDataFrameBianchen")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
//指定地址创建rdd
val studentsRDD = sc.textFile("hdfs://master:9000/student/2016113012/data/students.txt").map(_.split(","))
//将rdd映射到rowRDD
val RowRDD = studentsRDD.map(x => Row(x(0).toInt,x(1),x(2).toInt))
//以编程方式动态构造元素据
val schema = StructType(
List(
StructField("id",IntegerType,true),
StructField("name",StringType,true),
StructField("age",IntegerType,true)
)
)
//将schema信息映射到rowRDD
val studentsDF = sqlContext.createDataFrame(RowRDD,schema)
//注册表
studentsDF.registerTempTable("t_students")
val df = sqlContext.sql("select * from t_students order by age")
df.rdd.collect().foreach(row => println(row))
} }

运行结果

hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.RDDDataFrameBianchen --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
17/03/06 11:07:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/06 11:07:27 INFO Slf4jLogger: Slf4jLogger started
17/03/06 11:07:27 INFO Remoting: Starting remoting
17/03/06 11:07:28 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:49756]
17/03/06 11:07:32 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/03/06 11:07:38 INFO FileInputFormat: Total input paths to process : 1
17/03/06 11:07:44 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/03/06 11:07:44 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/03/06 11:07:44 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/03/06 11:07:44 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/03/06 11:07:44 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
[1,leo,17]
[2,marry,17]
[3,jack,18]
[4,tom,19]
17/03/06 11:07:47 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/03/06 11:07:47 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
17/03/06 11:07:47 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

DataFrame与RDD

1)在spark中,DataFrame是一种以RDD为基础的分布式数据集,类似于传统数据库中的二维表格

2)DataFrame与RDD的主要区别就是,前者带有schema元信息,即DataFrame所表示的二维表数据集的每一列都带有名称和类型

参考资料

http://9269309.blog.51cto.com/9259309/1851673

参考资料

http://blog.csdn.net/ronaldo4511/article/details/53406069

参考资料

http://spark.apache.org/docs/latest/sql-programming-guide.html#overview

spark SQL学习(认识spark SQL)的更多相关文章

  1. spark SQL学习(spark连接 mysql)

    spark连接mysql(打jar包方式) package wujiadong_sparkSQL import java.util.Properties import org.apache.spark ...

  2. spark SQL学习(spark连接hive)

    spark 读取hive中的数据 scala> import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql. ...

  3. SQL学习笔记之SQL查询练习题1

    (网络搜集) 0x00 表名和字段 –1.学生表 Student(s_id,s_name,s_birth,s_sex) –学生编号,学生姓名, 出生年月,学生性别 –2.课程表 Course(c_id ...

  4. SQL学习之SqlMap SQL注入

    sqlmap也是渗透中常用的一个注入工具,其实在注入工具方面,一个sqlmap就足够用了,只要你用的熟,秒杀各种工具,只是一个便捷性问题,sql注入另一方面就是手工党了,这个就另当别论了. 今天把我一 ...

  5. SQL学习笔记之SQL中INNER、LEFT、RIGHT JOIN的区别和用法详解

    0x00 建表准备 相信很多人在刚开始使用数据库的INNER JOIN.LEFT JOIN和RIGHT JOIN时,都不太能明确区分和正确使用这三种JOIN操作,本文通过一个简单的例子通俗易懂的讲解这 ...

  6. SQL学习笔记----更改SQL默认的端口号

    1.SQLServer配置管理器----SQLServer网络配置----MSSQLSERVER的协议---TCP/IP(已启用)---IP地址 清空素有的IP,在IPALL下更改默认的端口: 2. ...

  7. 大数据技术之_19_Spark学习_03_Spark SQL 应用解析 + Spark SQL 概述、解析 、数据源、实战 + 执行 Spark SQL 查询 + JDBC/ODBC 服务器

    第1章 Spark SQL 概述1.1 什么是 Spark SQL1.2 RDD vs DataFrames vs DataSet1.2.1 RDD1.2.2 DataFrame1.2.3 DataS ...

  8. Spark学习之Spark SQL

    一.简介 Spark SQL 提供了以下三大功能. (1) Spark SQL 可以从各种结构化数据源(例如 JSON.Hive.Parquet 等)中读取数据. (2) Spark SQL 不仅支持 ...

  9. Spark学习之Spark SQL(8)

    Spark学习之Spark SQL(8) 1. Spark用来操作结构化和半结构化数据的接口--Spark SQL. 2. Spark SQL的三大功能 2.1 Spark SQL可以从各种结构化数据 ...

随机推荐

  1. wcur LOCATE +

    w字符串处理 DROP PROCEDURE IF EXISTS w_unique; DELIMITER /w/ CREATE PROCEDURE w_unique() BEGIN DECLARE do ...

  2. Java 使用阿里云短信的API接口

    亲们上午好,写的不好的地方还望指正.谢谢各位! 引言 短信服务(Short Message Service)是阿里云为用户提供的一种通信服务的能力,支持快速发送短信验证码.短信通知等.(我这里只讲一个 ...

  3. Yii框架2.0的过滤器

    过滤器是 控制器 动作 执行之前或之后执行的对象. 例如访问控制过滤器可在动作执行之前来控制特殊终端用户是否有权限执行动作, 内容压缩过滤器可在动作执行之后发给终端用户之前压缩响应内容. 过滤器可包含 ...

  4. HDFS 手写mapreduce单词计数框架

    一.数据处理类 package com.css.hdfs; import java.io.BufferedReader; import java.io.IOException; import java ...

  5. ORM之基础操作进阶

    一.外键自关联(一对多) 1.建表 # 评论表 class Comment(models.Model): id = models.AutoField(primary_key=True) content ...

  6. springMvc异常之 For input string: "show"

    异常提示: For input string: "show" 异常原因: 使用@PathVariable方式获取值,返回类型为String return "redirec ...

  7. python web框架 MVC MTV

    WEB框架 MVC Model View Controller 数据库 模板文件 业务处理 MTV Model Template View 数据库 模板文件 业务处理

  8. Linux系统性能调优之性能分析

    1.Linux性能分析的目的1)找出系统性能瓶颈(包括硬件瓶颈和软件瓶颈):2)提供性能优化的方案(升级硬件?改进系统系统结构?):3)达到合理的硬件和软件配置:4)使系统资源使用达到最大的平衡.(一 ...

  9. instanceof判断参数是否是给定的类型

    if(ofj instanceof CLOB) {//判断ofj是否是CLOB类型,如果是则把CLOB内容解析出来,放入TZNR字段中并返回 CLOB ft = (CLOB)ofj; String c ...

  10. genymotion——在虚拟机中当中安装genymotion,启动已经新增好的设备时,提示:the virtual device got no ip address

    1.启动已经新增好的设备时,提示:the virtual device got no ip address,于是在网上搜索该问题,便得到提示,先启动virtual box中的该模拟设备,于是便启动,出 ...