1、创建数据框架 Creating DataFrames

val df = spark.read.json("file:///usr/local/spark/examples/src/main/resources/people.json");
df.show();

写到hdfs路径:
df.select("age", "name").write.save("examples/src/main/resources/peopleOUT.json")

再读出来:
val peopleDF = spark.read.format("parquet").load("/user/root/examples/src/main/resources/peopleOUT.json")
格式是parquet,是hdfs保存后的列格式存储格式,是一个通用存储格式。
注意:如果用json来读肯定就会乱码了
可以直接show,也可以通过路径来写SQL
val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/peopleOUT.json`")

2、如果没有向HDFS文件系统里写数据,试试这条语句

val df = spark.read.json("examples/src/main/resources/people.json")

会提示错误:
org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://localhost:9000/user/root/examples/src/main/resources/people.json;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)

3、弱类型数据集查询

import spark.implicits._

// 数据集结构
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// 只显示 "name" 列
df.select("name").show()
// +-------+
// | name|
// +-------+
// |Michael|
// | Andy|
// | Justin|
// +-------+

// 年龄加1
df.select($"name", $"age" + 1).show()
// +-------+---------+
// | name|(age + 1)|
// +-------+---------+
// |Michael| null|
// | Andy| 31|
// | Justin| 20|
// +-------+---------+

// 年龄大于21
df.filter($"age" > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+

// 按年龄分组计数
df.groupBy("age").count().show()

4、使用sql查询的方式

Running SQL Queries

df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()

全局临时视图
df.createGlobalTempView("people")

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()

5、定义一个自定义类型

case class Person(name: String, age: Long)

6、schema反射型推理模式 Inferring the Schema

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
.textFile("examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// | value|
// +------------+
// |Name: Justin|
// +------------+

// or by field name
teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
// +------------+
// | value|
// +------------+
// |Name: Justin|
// +------------+

// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
// Array(Map("name" -> "Justin", "age" -> 19))

7、schema指定式查询模式

import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
.map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
.map(_.split(","))
.map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// | value|
// +-------------+
// |Name: Michael|
// | Name: Andy|
// | Name: Justin|
// +-------------+

8、自定义聚合函数

http://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-user-defined-aggregate-functions

9、parquet格式的读写

import spark.implicits._

val peopleDF = spark.read.json("examples/src/main/resources/people.json")

// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write.parquet("people.parquet")

// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("people.parquet")

// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView("parquetFile")
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
namesDF.map(attributes => "Name: " + attributes(0)).show()

10、手写json的数据源

val otherPeopleRDD = spark.sparkContext.makeRDD(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()
// +---------------+----+
// | address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+

11、json数据源

val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)

// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()

12、hive数据源

import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

case class Record(key: Int, value: String)

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = "spark-warehouse"

val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()

import spark.implicits._
import spark.sql

sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT * FROM src").show()
// +---+-------+
// |key| value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...

// Aggregation queries are also supported.
sql("SELECT COUNT(*) FROM src").show()
// +--------+
// |count(1)|
// +--------+
// | 500 |
// +--------+

// The results of SQL queries are themselves DataFrames and support all normal functions.
val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")

// The items in DaraFrames are of type Row, which allows you to access each column by ordinal.
val stringsDS = sqlDF.map {
case Row(key: Int, value: String) => s"Key: $key, Value: $value"
}
stringsDS.show()
// +--------------------+
// | value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...

// You can also use DataFrames to create temporary views within a SparkSession.
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.createOrReplaceTempView("records")

// Queries can then join DataFrame data with data stored in Hive.
sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// | 2| val_2| 2| val_2|
// | 4| val_4| 4| val_4|
// | 5| val_5| 5| val_5|
// ...

13、jdbc数据源,表的读写(整表)

比如用mysql,首先要把jar文件(如mysql-connector-java-5.1.26.jar)放在jars目录,然后用jar包参数启动

spark-shell --driver-class-path /usr/local/spark/jars/mysql-connector-java-5.1.21.jar --jars /usr/local/spark/jars/mysql-connector-java-5.1.21.jar

val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/openfire").option("dbtable", "ofUser").option("user", "root").option("password", "mysql").load()

// Saving data to a JDBC source
jdbcDF.write
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.save()

jdbcDF2.write
.jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)

Spark基础脚本入门实践1的更多相关文章

  1. Spark基础脚本入门实践3:Pair RDD开发

    Pair RDD转化操作 val rdd = sc.parallelize(List((1,2),(3,4),(3,6))) //reduceByKey,通过key来做合并val r1 = rdd.r ...

  2. Spark基础脚本入门实践2:基础开发

    1.最基本的Map用法 val data = Array(1, 2, 3, 4, 5)val distData = sc.parallelize(data)val result = distData. ...

  3. 【原创 Hadoop&Spark 动手实践 5】Spark 基础入门,集群搭建以及Spark Shell

    Spark 基础入门,集群搭建以及Spark Shell 主要借助Spark基础的PPT,再加上实际的动手操作来加强概念的理解和实践. Spark 安装部署 理论已经了解的差不多了,接下来是实际动手实 ...

  4. 0.Python 爬虫之Scrapy入门实践指南(Scrapy基础知识)

    目录 0.0.Scrapy基础 0.1.Scrapy 框架图 0.2.Scrapy主要包括了以下组件: 0.3.Scrapy简单示例如下: 0.4.Scrapy运行流程如下: 0.5.还有什么? 0. ...

  5. sass、less和stylus的安装使用和入门实践

    刚 开始的时候,说实话,我很反感使用css预处理器这种新玩意的,因为其中涉及到了编程的东西,私以为很复杂,而且考虑到项目不是一天能够完成的,也很少是 一个人完成的,对于这种团队的项目开发,前端实践用c ...

  6. Spark在美团的实践

    https://tech.meituan.com/2016/03/31/spark-in-meituan.html 本文已发表在<程序员>杂志2016年4月期. 前言 美团是数据驱动的互联 ...

  7. cmd 与 bash 基础命令入门

    身为一个程序员会用命令行来进行一些简单的操作,不是显得很装逼嘛!?嘿嘿~ ヾ(>∀<) cmd 与 bash 基础命令入门       简介       CMD 基础命令          ...

  8. IM开发者的零基础通信技术入门(二):通信交换技术的百年发展史(下)

    1.系列文章引言 1.1 适合谁来阅读? 本系列文章尽量使用最浅显易懂的文字.图片来组织内容,力求通信技术零基础的人群也能看懂.但个人建议,至少稍微了解过网络通信方面的知识后再看,会更有收获.如果您大 ...

  9. IM开发者的零基础通信技术入门(一):通信交换技术的百年发展史(上)

    [来源申明]本文原文来自:微信公众号“鲜枣课堂”,官方网站:xzclass.com,原题为:<通信交换的百年沧桑(上)>,本文引用时已征得原作者同意.为了更好的内容呈现,即时通讯网在收录时 ...

随机推荐

  1. uboot 设备树 libfdt

    http://www.cnblogs.com/leaven/p/6295999.html

  2. Java学习笔记(十四):java常用的包

  3. Java 基本类型和包装类型

    讲基本类型和包装类型之前,首先要介绍,装箱和拆箱 装箱:基本类型转化为包装类型 拆箱:包装类型转化为拆箱类型 为什么要有包装类型?Java是面向对象的语言,Java中一切都是对象除了基本数据类型,所以 ...

  4. python--第十天总结(Select/Poll/Epoll使用 )

    首先列一下,sellect.poll.epoll三者的区别 select select最早于1983年出现在4.2BSD中,它通过一个select()系统调用来监视多个文件描述符的数组,当select ...

  5. [leetcode]35. Search Insert Position寻找插入位置

    Given a sorted array and a target value, return the index if the target is found. If not, return the ...

  6. Step by Step Guide on Yanhua ACDP Clear BMW EGS ISN

    Yanhua Mini ACDP authorize new function on BMW EGS ISN clearing.So here UOBDII want to share this st ...

  7. GUI学习之〇——PyQt5安装

    GUI(Graphical User Interface)是程序和软件使用者的接口,好的GUI是一个良好的软件的前提,在这里演示一下用PyQt5做一个GUI的方法 软件需求:python3.6 用的是 ...

  8. zeromq学习记录(三)使用ZMQ_PULL ZMQ_PUSH

    /************************************************************** 技术博客 http://www.cnblogs.com/itdef/   ...

  9. Zookeeper系列1 快速入门

    Zookeeper的简介这里我就不说了,在接下来的几篇文章会涉及zookeeper环境搭建,watcher以及相关配置说明, 三种操作zookeeper的方式(原生API方式,zkclient,Cur ...

  10. wamp环境搭建(apache安装,mysql安装,php安装)

    1.软件安装说明 WAMP:Window操作系统+Apache软件+PHP解析器+MySQL软件 2.Apache执行流程 用户向服务器端发送请求àDNS解析àIP地址à端口àApache服务 Apa ...