1、创建数据框架 Creating DataFrames

val df = spark.read.json("file:///usr/local/spark/examples/src/main/resources/people.json");
df.show();

写到hdfs路径:
df.select("age", "name").write.save("examples/src/main/resources/peopleOUT.json")

再读出来:
val peopleDF = spark.read.format("parquet").load("/user/root/examples/src/main/resources/peopleOUT.json")
格式是parquet,是hdfs保存后的列格式存储格式,是一个通用存储格式。
注意:如果用json来读肯定就会乱码了
可以直接show,也可以通过路径来写SQL
val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/peopleOUT.json`")

2、如果没有向HDFS文件系统里写数据,试试这条语句

val df = spark.read.json("examples/src/main/resources/people.json")

会提示错误:
org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://localhost:9000/user/root/examples/src/main/resources/people.json;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)

3、弱类型数据集查询

import spark.implicits._

// 数据集结构
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// 只显示 "name" 列
df.select("name").show()
// +-------+
// | name|
// +-------+
// |Michael|
// | Andy|
// | Justin|
// +-------+

// 年龄加1
df.select($"name", $"age" + 1).show()
// +-------+---------+
// | name|(age + 1)|
// +-------+---------+
// |Michael| null|
// | Andy| 31|
// | Justin| 20|
// +-------+---------+

// 年龄大于21
df.filter($"age" > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+

// 按年龄分组计数
df.groupBy("age").count().show()

4、使用sql查询的方式

Running SQL Queries

df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()

全局临时视图
df.createGlobalTempView("people")

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()

5、定义一个自定义类型

case class Person(name: String, age: Long)

6、schema反射型推理模式 Inferring the Schema

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
.textFile("examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// | value|
// +------------+
// |Name: Justin|
// +------------+

// or by field name
teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
// +------------+
// | value|
// +------------+
// |Name: Justin|
// +------------+

// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
// Array(Map("name" -> "Justin", "age" -> 19))

7、schema指定式查询模式

import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
.map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
.map(_.split(","))
.map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// | value|
// +-------------+
// |Name: Michael|
// | Name: Andy|
// | Name: Justin|
// +-------------+

8、自定义聚合函数

http://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-user-defined-aggregate-functions

9、parquet格式的读写

import spark.implicits._

val peopleDF = spark.read.json("examples/src/main/resources/people.json")

// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write.parquet("people.parquet")

// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("people.parquet")

// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView("parquetFile")
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
namesDF.map(attributes => "Name: " + attributes(0)).show()

10、手写json的数据源

val otherPeopleRDD = spark.sparkContext.makeRDD(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()
// +---------------+----+
// | address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+

11、json数据源

val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)

// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()

12、hive数据源

import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

case class Record(key: Int, value: String)

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = "spark-warehouse"

val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()

import spark.implicits._
import spark.sql

sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT * FROM src").show()
// +---+-------+
// |key| value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...

// Aggregation queries are also supported.
sql("SELECT COUNT(*) FROM src").show()
// +--------+
// |count(1)|
// +--------+
// | 500 |
// +--------+

// The results of SQL queries are themselves DataFrames and support all normal functions.
val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")

// The items in DaraFrames are of type Row, which allows you to access each column by ordinal.
val stringsDS = sqlDF.map {
case Row(key: Int, value: String) => s"Key: $key, Value: $value"
}
stringsDS.show()
// +--------------------+
// | value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...

// You can also use DataFrames to create temporary views within a SparkSession.
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.createOrReplaceTempView("records")

// Queries can then join DataFrame data with data stored in Hive.
sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// | 2| val_2| 2| val_2|
// | 4| val_4| 4| val_4|
// | 5| val_5| 5| val_5|
// ...

13、jdbc数据源,表的读写(整表)

比如用mysql,首先要把jar文件(如mysql-connector-java-5.1.26.jar)放在jars目录,然后用jar包参数启动

spark-shell --driver-class-path /usr/local/spark/jars/mysql-connector-java-5.1.21.jar --jars /usr/local/spark/jars/mysql-connector-java-5.1.21.jar

val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/openfire").option("dbtable", "ofUser").option("user", "root").option("password", "mysql").load()

// Saving data to a JDBC source
jdbcDF.write
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.save()

jdbcDF2.write
.jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)

Spark基础脚本入门实践1的更多相关文章

  1. Spark基础脚本入门实践3:Pair RDD开发

    Pair RDD转化操作 val rdd = sc.parallelize(List((1,2),(3,4),(3,6))) //reduceByKey,通过key来做合并val r1 = rdd.r ...

  2. Spark基础脚本入门实践2:基础开发

    1.最基本的Map用法 val data = Array(1, 2, 3, 4, 5)val distData = sc.parallelize(data)val result = distData. ...

  3. 【原创 Hadoop&Spark 动手实践 5】Spark 基础入门,集群搭建以及Spark Shell

    Spark 基础入门,集群搭建以及Spark Shell 主要借助Spark基础的PPT,再加上实际的动手操作来加强概念的理解和实践. Spark 安装部署 理论已经了解的差不多了,接下来是实际动手实 ...

  4. 0.Python 爬虫之Scrapy入门实践指南(Scrapy基础知识)

    目录 0.0.Scrapy基础 0.1.Scrapy 框架图 0.2.Scrapy主要包括了以下组件: 0.3.Scrapy简单示例如下: 0.4.Scrapy运行流程如下: 0.5.还有什么? 0. ...

  5. sass、less和stylus的安装使用和入门实践

    刚 开始的时候,说实话,我很反感使用css预处理器这种新玩意的,因为其中涉及到了编程的东西,私以为很复杂,而且考虑到项目不是一天能够完成的,也很少是 一个人完成的,对于这种团队的项目开发,前端实践用c ...

  6. Spark在美团的实践

    https://tech.meituan.com/2016/03/31/spark-in-meituan.html 本文已发表在<程序员>杂志2016年4月期. 前言 美团是数据驱动的互联 ...

  7. cmd 与 bash 基础命令入门

    身为一个程序员会用命令行来进行一些简单的操作,不是显得很装逼嘛!?嘿嘿~ ヾ(>∀<) cmd 与 bash 基础命令入门       简介       CMD 基础命令          ...

  8. IM开发者的零基础通信技术入门(二):通信交换技术的百年发展史(下)

    1.系列文章引言 1.1 适合谁来阅读? 本系列文章尽量使用最浅显易懂的文字.图片来组织内容,力求通信技术零基础的人群也能看懂.但个人建议,至少稍微了解过网络通信方面的知识后再看,会更有收获.如果您大 ...

  9. IM开发者的零基础通信技术入门(一):通信交换技术的百年发展史(上)

    [来源申明]本文原文来自:微信公众号“鲜枣课堂”,官方网站:xzclass.com,原题为:<通信交换的百年沧桑(上)>,本文引用时已征得原作者同意.为了更好的内容呈现,即时通讯网在收录时 ...

随机推荐

  1. Java学习笔记(十八):static关键字

  2. flash/flex builder在IE中stage.stageWidth始终为0的解决办法

    这应该是IE的bug,解决办法: 原作者:菩提树下的杨过出处:http://yjmyzz.cnblogs.com stage.align=StageAlign.TOP_LEFT; stage.scal ...

  3. Python设计模式 - UML - 部署图(Deployment Diagram)

    简介 部署图也称配置图,用来显示系统中硬件和软件的物理架构.从中可以了解到软件和硬件组件之间的物理拓扑.连接关系以及处理节点的分布情况. 部署图建模步骤 - 找出需要进行部署的各类节点,如网络硬件设备 ...

  4. Apache无法正常启动(配置多个监听端口)

    Apache监测多个端口配置: 1.conf->extra->httpd-vhosts.conf  检查配置项是否写错 2.http.conf listen端口是否监听正确 3.环境变量中 ...

  5. Python开发【第六篇】:面向对象

    configparser模块 configparser用于处理特定格式的文件,其本质是利用open来操作文件. 文件a.txt [section1] k1 = 123 k2:v2   [section ...

  6. django mysql 数据库配置

    在settings.py中保存了数据库的连接配置信息,Django默认初始配置使用sqlite数据库. DATABASES = { 'default': { 'ENGINE': 'django.db. ...

  7. [leetcode]72. Edit Distance 最少编辑步数

    Given two words word1 and word2, find the minimum number of operations required to convert word1 to ...

  8. [leetcode]34.Find First and Last Position of Element in Sorted Array找区间

    Given an array of integers nums sorted in ascending order, find the starting and ending position of ...

  9. mysql启动错误,提示crash 错误

    :: mysqld_safe Starting mysqld daemon with databases from /data/mysql_data -- :: [Note] Plugin 'FEDE ...

  10. ahk保存

    python ;把大写禁用了,因为确实基本不用.`表示删除,caplock+ijkl可以控制光标 SetCapsLockState , AlwaysOff ;用;p来替换书写经常不好使,因为输入多个字 ...