A First Look at Spark SQL

Spark SQL is a Spark module for processing structured data. Its core programming abstraction is the DataFrame.

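DataFrame: a DataFrame can be built from many kinds of sources, including structured data files, Hive tables, external relational databases, and existing RDDs.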

Creating a DataFrame

Data file students.json (one JSON object per line):

{"id":1, "name":"leo", "age":18}
{"id":2, "name":"jack", "age":19}
{"id":3, "name":"marry", "age":17}

Creating a DataFrame in spark-shell

// Upload the file to an HDFS directory
hadoop@master:~/wujiadong$ hadoop fs -put students.json /student/2016113012/spark

// Start the Spark shell
hadoop@slave01:~$ spark-shell

// Import SQLContext
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext

// Create an SQLContext object to work with the data
scala> val sql = new SQLContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sql: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@27acd9a7

// Read the data
scala> val students = sql.read.json("hdfs://master:9000/student/2016113012/spark/students.json")
students: org.apache.spark.sql.DataFrame = [age: bigint, id: bigint ... 1 more field]

// Show the data
scala> students.show
+---+---+-----+
|age| id| name|
+---+---+-----+
| 18|  1|  leo|
| 19|  2| jack|
| 17|  3|marry|
+---+---+-----+
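
The deprecation warning in the session above appears because, from Spark 2.0 on, SparkSession is the recommended entry point and constructing an SQLContext directly is deprecated. The following is a minimal sketch of the same read using SparkSession. It assumes Spark 2.2 or later and local mode, and it keeps the three JSON records in memory instead of on HDFS so that it runs without a cluster. The object name StudentsJsonSketch and the helper load are made up for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StudentsJsonSketch {
  // Builds the students DataFrame from in-memory JSON lines
  // (assumption: stands in for the students.json file on HDFS).
  def load(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val lines = Seq(
      """{"id":1, "name":"leo", "age":18}""",
      """{"id":2, "name":"jack", "age":19}""",
      """{"id":3, "name":"marry", "age":17}"""
    ).toDS()
    // read.json accepts a Dataset[String] since Spark 2.2; a file path works too.
    spark.read.json(lines)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so no cluster is needed.
    val spark = SparkSession.builder()
      .appName("StudentsJsonSketch")
      .master("local[*]")
      .getOrCreate()

    val students = load(spark)
    students.printSchema()
    students.show()

    spark.stop()
  }
}
```

To read the real file, replace the in-memory lines with spark.read.json("hdfs://master:9000/student/2016113012/spark/students.json").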

Common DataFrame operations

// Show the contents of the DataFrame
scala> students.show
+---+---+-----+
|age| id| name|
+---+---+-----+
| 18|  1|  leo|
| 19|  2| jack|
| 17|  3|marry|
+---+---+-----+

// Print the schema (column names and types)
scala> students.printSchema
root
 |-- age: long (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

// Select a single column
scala> students.select("name").show
+-----+
| name|
+-----+
|  leo|
| jack|
|marry|
+-----+

// Select columns and compute an expression on one of them
scala> students.select(students("name"),students("age")+1).show
+-----+---------+
| name|(age + 1)|
+-----+---------+
|  leo|       19|
| jack|       20|
|marry|       18|
+-----+---------+

// Filter rows by a condition
scala> students.filter(students("age")>18).show
+---+---+----+
|age| id|name|
+---+---+----+
| 19|  2|jack|
+---+---+----+

// Group by a column and count the rows in each group
scala> students.groupBy("age").count().show
+---+-----+
|age|count|
+---+-----+
| 19|    1|
| 17|    1|
| 18|    1|
+---+-----+
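
The same operations can also be written as SQL once the DataFrame is registered as a temporary view. The sketch below is one possible way to do that. It assumes Spark 2.0 or later and local mode, and it builds the three students in memory. The object name StudentsSqlSketch, the helper students, and the view name students are all chosen for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StudentsSqlSketch {
  // The same three students as in students.json, built in memory for the sketch.
  def students(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1L, "leo", 18L), (2L, "jack", 19L), (3L, "marry", 17L))
      .toDF("id", "name", "age")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so no cluster is needed.
    val spark = SparkSession.builder()
      .appName("StudentsSqlSketch")
      .master("local[*]")
      .getOrCreate()

    // Make the DataFrame queryable by name from SQL.
    students(spark).createOrReplaceTempView("students")

    // Counterpart of students.select("name")
    spark.sql("SELECT name FROM students").show()
    // Counterpart of students.select(students("name"), students("age") + 1)
    spark.sql("SELECT name, age + 1 FROM students").show()
    // Counterpart of students.filter(students("age") > 18)
    spark.sql("SELECT * FROM students WHERE age > 18").show()
    // Counterpart of students.groupBy("age").count()
    spark.sql("SELECT age, COUNT(*) AS cnt FROM students GROUP BY age").show()

    spark.stop()
  }
}
```

Both styles go through the same query engine, so the choice between method calls and SQL strings is mostly a matter of readability.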

Two ways to convert an RDD to a DataFrame

1) Reflection-based approach

package wujiadong_sparkSQL

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by Administrator on 2017/3/5.
  */
object RDDDataFrameReflection {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDDataFrameReflection")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Read the text file and split each line into its comma-separated fields
    val fileRDD = sc.textFile("hdfs://master:9000/student/2016113012/data/students.txt")
    val lineRDD = fileRDD.map(line => line.split(","))
    // Map each line to the Students case class
    val studentsRDD = lineRDD.map(x => Students(x(0).toInt, x(1), x(2).toInt))
    // In Scala, the reflection-based RDD-to-DataFrame conversion needs the implicit conversions imported explicitly
    import sqlContext.implicits._
    val studentsDF = studentsRDD.toDF()
    // Register the DataFrame as a temporary table
    // (registerTempTable is deprecated since Spark 2.0 in favor of createOrReplaceTempView)
    studentsDF.registerTempTable("t_students")
    val df = sqlContext.sql("select * from t_students")
    // Print every row, then save the result to HDFS
    df.rdd.foreach(row => println(row(0) + "," + row(1) + "," + row(2)))
    df.rdd.saveAsTextFile("hdfs://master:9000/student/2016113012/data/out")
  }
}

// The case class must be defined outside main
case class Students(id: Int, name: String, age: Int)

Output

hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.RDDDataFrameReflection  --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
17/03/05 22:46:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/05 22:46:48 INFO Slf4jLogger: Slf4jLogger started
17/03/05 22:46:48 INFO Remoting: Starting remoting
17/03/05 22:46:49 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:34921]
17/03/05 22:46:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/03/05 22:46:51 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/03/05 22:47:00 INFO FileInputFormat: Total input paths to process : 1
17/03/05 22:47:07 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/03/05 22:47:07 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/03/05 22:47:07 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/03/05 22:47:07 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/03/05 22:47:07 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
1,leo,17
2,marry,17
3,jack,18
4,tom,19
17/03/05 22:47:10 INFO FileOutputCommitter: Saved output of task 'attempt_201703052247_0001_m_000000_1' to hdfs://master:9000/student/2016113012/data/out/_temporary/0/task_201703052247_0001_m_000000
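
For readers on Spark 2.0 or later, the reflection-based conversion can be sketched with SparkSession as follows. This is an illustration under stated assumptions, not the program that produced the log above: it runs in local mode, and the four input lines are modeled on the printed result because the contents of students.txt are not shown. The names Student, ReflectionSketch, and toStudentsDF are made up for this sketch.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Top-level case class: its field names and types become the DataFrame schema.
case class Student(id: Int, name: String, age: Int)

object ReflectionSketch {
  // Parses "id,name,age" lines into Student objects and converts the RDD with toDF().
  def toStudentsDF(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._ // provides toDF() on an RDD of case-class instances
    spark.sparkContext
      .parallelize(lines)
      .map(_.split(","))
      .map(f => Student(f(0).trim.toInt, f(1).trim, f(2).trim.toInt))
      .toDF()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so no cluster or HDFS is needed.
    val spark = SparkSession.builder()
      .appName("ReflectionSketch")
      .master("local[*]")
      .getOrCreate()

    // Assumed sample lines, modeled on the output shown in the text.
    val lines = Seq("1,leo,17", "2,marry,17", "3,jack,18", "4,tom,19")
    val df = toStudentsDF(spark, lines)

    // createOrReplaceTempView is the Spark 2.x replacement for registerTempTable.
    df.createOrReplaceTempView("t_students")
    spark.sql("SELECT * FROM t_students").show()

    spark.stop()
  }
}
```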

2) Programmatic interface approach

package wujiadong_sparkSQL

import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by Administrator on 2017/3/5.
  */
object RDDDataFrameBianchen {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDDataFrameBianchen")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Create an RDD from the given path and split each line into fields
    val studentsRDD = sc.textFile("hdfs://master:9000/student/2016113012/data/students.txt").map(_.split(","))
    // Map the RDD to an RDD of Row objects
    val rowRDD = studentsRDD.map(x => Row(x(0).toInt, x(1), x(2).toInt))
    // Build the schema (metadata) programmatically
    val schema = StructType(
      List(
        StructField("id", IntegerType, true),
        StructField("name", StringType, true),
        StructField("age", IntegerType, true)
      )
    )
    // Apply the schema to the Row RDD
    val studentsDF = sqlContext.createDataFrame(rowRDD, schema)
    // Register the DataFrame as a temporary table
    studentsDF.registerTempTable("t_students")
    val df = sqlContext.sql("select * from t_students order by age")
    // Collect the rows to the driver and print them
    df.rdd.collect().foreach(row => println(row))
  }
}

Output

hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.RDDDataFrameBianchen --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
17/03/06 11:07:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/06 11:07:27 INFO Slf4jLogger: Slf4jLogger started
17/03/06 11:07:27 INFO Remoting: Starting remoting
17/03/06 11:07:28 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:49756]
17/03/06 11:07:32 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/03/06 11:07:38 INFO FileInputFormat: Total input paths to process : 1
17/03/06 11:07:44 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/03/06 11:07:44 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/03/06 11:07:44 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/03/06 11:07:44 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/03/06 11:07:44 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
[1,leo,17]
[2,marry,17]
[3,jack,18]
[4,tom,19]
17/03/06 11:07:47 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/03/06 11:07:47 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
17/03/06 11:07:47 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
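
The programmatic approach is mainly useful when the columns are only known at run time, so no case class can be written in advance. A minimal SparkSession-based sketch of it follows. It assumes Spark 2.0 or later and local mode, and it uses in-memory lines modeled on the printed result, since the contents of students.txt are not shown. The names SchemaSketch and toStudentsDF are made up for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SchemaSketch {
  // Schema built at run time: column name, column type, nullability.
  val schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = true),
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)
  ))

  // Parses "id,name,age" lines into Rows and attaches the schema to them.
  def toStudentsDF(spark: SparkSession, lines: Seq[String]): DataFrame = {
    val rowRDD = spark.sparkContext
      .parallelize(lines)
      .map(_.split(","))
      .map(f => Row(f(0).trim.toInt, f(1).trim, f(2).trim.toInt))
    spark.createDataFrame(rowRDD, schema)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so no cluster or HDFS is needed.
    val spark = SparkSession.builder()
      .appName("SchemaSketch")
      .master("local[*]")
      .getOrCreate()

    // Assumed sample lines, modeled on the output shown in the text.
    val lines = Seq("1,leo,17", "2,marry,17", "3,jack,18", "4,tom,19")
    val df = toStudentsDF(spark, lines)

    // Sort by age and print the rows on the driver.
    df.orderBy("age").collect().foreach(println)

    spark.stop()
  }
}
```

The values placed in each Row must match the declared types (here Int, String, Int); a mismatch typically surfaces only when the data is evaluated, not at compile time.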

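DataFrame vs. RDD

1) In Spark, a DataFrame is a distributed dataset built on top of RDDs, similar to a two-dimensional table in a traditional database.

2) The main difference between a DataFrame and an RDD is that a DataFrame carries schema metadata: every column of the two-dimensional table it represents has a name and a type.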
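
The difference can be made concrete with a small sketch: a plain RDD of split lines only knows that each element is an array of strings, while the DataFrame can list every column with its name and type. The sketch assumes Spark 2.0 or later and local mode. The object name SchemaVsRddSketch and the helper columnTypes are made up for this illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object SchemaVsRddSketch {
  // Lists (column name, column type) pairs taken from the DataFrame's schema.
  def columnTypes(df: DataFrame): Seq[(String, String)] =
    df.schema.fields.toSeq.map(f => (f.name, f.dataType.simpleString))

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so no cluster is needed.
    val spark = SparkSession.builder()
      .appName("SchemaVsRddSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A plain RDD: Spark knows nothing about what the fields mean.
    val raw: RDD[Array[String]] =
      spark.sparkContext.parallelize(Seq("1,leo,18", "2,jack,19")).map(_.split(","))
    println(raw.first().mkString("|"))

    // A DataFrame over the same data: every column has a name and a type.
    val df = raw.map(f => (f(0).toInt, f(1), f(2).toInt)).toDF("id", "name", "age")
    columnTypes(df).foreach { case (name, tpe) => println(s"$name: $tpe") }

    spark.stop()
  }
}
```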

