1 APIs involved

  BaseRelation: represents a collection of tuples with a known schema; this is where the data source defines its schema.
  TableScan: provides a way to scan the data and produce an RDD[Row] from it.
  RelationProvider: takes a map of parameters and returns a BaseRelation.
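
For orientation, these interfaces live in org.apache.spark.sql.sources. A simplified sketch of their signatures (the real definitions carry more members, which are elided here) looks like this:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.StructType

// Simplified restatement of the Data Source API (v1) interfaces used below.
abstract class BaseRelation {
  def sqlContext: SQLContext
  def schema: StructType          // the known schema of the tuples
}

trait TableScan {
  def buildScan(): RDD[Row]       // full scan: produce every row
}

trait RelationProvider {
  // called by Spark with the options passed to the DataFrameReader
  def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation
}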

  

2 Code implementation

  Define the relation

package cn.zj.spark.sql.datasource

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, RelationProvider, SchemaRelationProvider}
import org.apache.spark.sql.types.StructType

/**
 * Created by rana on 29/9/16.
 */
class DefaultSource extends RelationProvider with SchemaRelationProvider with CreatableRelationProvider {

  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = {
    createRelation(sqlContext, parameters, null)
  }

  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String], schema: StructType): BaseRelation = {
    val path = parameters.get("path")
    path match {
      case Some(p) => new CustomDatasourceRelation(sqlContext, p, schema)
      case _ => throw new IllegalArgumentException("Path is required for custom-datasource format!!")
    }
  }

  override def createRelation(sqlContext: SQLContext, mode: SaveMode, parameters: Map[String, String],
                              data: DataFrame): BaseRelation = {
    val path = parameters.getOrElse("path", "./output/") // can throw an exception/error, it's just for this tutorial
    val fsPath = new Path(path)
    val fs = fsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)

    mode match {
      case SaveMode.Append => sys.error("Append mode is not supported by " + this.getClass.getCanonicalName); sys.exit(1)
      case SaveMode.Overwrite => fs.delete(fsPath, true)
      case SaveMode.ErrorIfExists => sys.error("Given path: " + path + " already exists!!"); sys.exit(1)
      case SaveMode.Ignore => sys.exit()
    }

    val formatName = parameters.getOrElse("format", "customFormat")
    formatName match {
      case "customFormat" => saveAsCustomFormat(data, path, mode)
      case "json" => saveAsJson(data, path, mode)
      case _ => throw new IllegalArgumentException(formatName + " is not supported!!!")
    }
    createRelation(sqlContext, parameters, data.schema)
  }

  private def saveAsJson(data: DataFrame, path: String, mode: SaveMode): Unit = {
    /**
     * Here the DataFrame's own API is used to store the data as JSON.
     * You can have your own APIs and ways of saving!!
     */
    data.write.mode(mode).json(path)
  }

  private def saveAsCustomFormat(data: DataFrame, path: String, mode: SaveMode): Unit = {
    /**
     * Here the data is saved as a simple text file whose values are separated by "|".
     * But you can have your own way to store it, without any restriction.
     */
    val customFormatRDD = data.rdd.map(row => {
      row.toSeq.map(value => value.toString).mkString("|")
    })
    customFormatRDD.saveAsTextFile(path)
  }
}
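
Because the class is named DefaultSource, Spark can find it from the package name alone: read.format("cn.zj.spark.sql.datasource") resolves to cn.zj.spark.sql.datasource.DefaultSource. A hedged write example that would exercise the CreatableRelationProvider path above (the DataFrame df, the "json" option value and the output directory are assumptions for illustration):

import org.apache.spark.sql.SaveMode

// Hypothetical writer usage; df is any DataFrame you want to save.
df.write
  .format("cn.zj.spark.sql.datasource")   // resolves to ...datasource.DefaultSource
  .option("format", "json")               // or omit it to fall back to "customFormat"
  .mode(SaveMode.Overwrite)
  .save("1229practice/output/")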

  Define the schema and the code that reads the data

package cn.zj.spark.sql.datasource

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

/**
 * Created by rana on 29/9/16.
 */
class CustomDatasourceRelation(override val sqlContext: SQLContext, path: String, userSchema: StructType)
  extends BaseRelation with TableScan with PrunedScan with PrunedFilteredScan with Serializable {

  override def schema: StructType = {
    if (userSchema != null) {
      userSchema
    } else {
      StructType(
        StructField("id", IntegerType, false) ::
        StructField("name", StringType, true) ::
        StructField("gender", StringType, true) ::
        StructField("salary", LongType, true) ::
        StructField("expenses", LongType, true) :: Nil
      )
    }
  }

  override def buildScan(): RDD[Row] = {
    println("TableScan: buildScan called...")

    val schemaFields = schema.fields
    // Reading the files' content
    val rdd = sqlContext.sparkContext.wholeTextFiles(path).map(f => f._2)

    val rows = rdd.map(fileContent => {
      val lines = fileContent.split("\n")
      val data = lines.map(line => line.split(",").map(word => word.trim).toSeq)
      val tmp = data.map(words => words.zipWithIndex.map {
        case (value, index) =>
          val colName = schemaFields(index).name
          Util.castTo(if (colName.equalsIgnoreCase("gender")) {
            if (value.toInt == 1) "Male" else "Female"
          } else value,
            schemaFields(index).dataType)
      })
      tmp.map(s => Row.fromSeq(s))
    })

    rows.flatMap(e => e)
  }

  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    println("PrunedScan: buildScan called...")

    val schemaFields = schema.fields
    // Reading the files' content
    val rdd = sqlContext.sparkContext.wholeTextFiles(path).map(f => f._2)

    val rows = rdd.map(fileContent => {
      val lines = fileContent.split("\n")
      val data = lines.map(line => line.split(",").map(word => word.trim).toSeq)
      val tmp = data.map(words => words.zipWithIndex.map {
        case (value, index) =>
          val colName = schemaFields(index).name
          val castedValue = Util.castTo(if (colName.equalsIgnoreCase("gender")) {
            if (value.toInt == 1) "Male" else "Female"
          } else value,
            schemaFields(index).dataType)
          if (requiredColumns.contains(colName)) Some(castedValue) else None
      })
      tmp.map(s => Row.fromSeq(s.filter(_.isDefined).map(value => value.get)))
    })

    rows.flatMap(e => e)
  }

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    println("PrunedFilteredScan: buildScan called...")
    println("Filters: ")
    filters.foreach(f => println(f.toString))

    var customFilters: Map[String, List[CustomFilter]] = Map[String, List[CustomFilter]]()
    filters.foreach(f => f match {
      case EqualTo(attr, value) =>
        println("EqualTo filter is used!! Attribute: " + attr + " Value: " + value)
        /**
         * Since only one filter is implemented for now, keeping a list per attribute may not look
         * very useful: an attribute can only be equal to one value at a time. But a list becomes
         * useful once several filters apply to the same attribute, for example:
         * attr > 5 && attr < 10
         * You can add more filters to this code and try them out. Only the equalTo filter is
         * implemented here to illustrate the concept.
         */
        customFilters = customFilters ++ Map(attr -> {
          customFilters.getOrElse(attr, List[CustomFilter]()) :+ new CustomFilter(attr, value, "equalTo")
        })
      case _ => println("filter: " + f.toString + " is not implemented by us!!")
    })

    val schemaFields = schema.fields
    // Reading the files' content
    val rdd = sqlContext.sparkContext.wholeTextFiles(path).map(f => f._2)

    val rows = rdd.map(file => {
      val lines = file.split("\n")
      val data = lines.map(line => line.split(",").map(word => word.trim).toSeq)

      val filteredData = data.map(s => if (customFilters.nonEmpty) {
        var includeInResultSet = true
        s.zipWithIndex.foreach {
          case (value, index) =>
            val attr = schemaFields(index).name
            val filtersList = customFilters.getOrElse(attr, List())
            if (filtersList.nonEmpty) {
              if (CustomFilter.applyFilters(filtersList, value, schema)) {
              } else {
                includeInResultSet = false
              }
            }
        }
        if (includeInResultSet) s else Seq()
      } else s)

      val tmp = filteredData.filter(_.nonEmpty).map(s => s.zipWithIndex.map {
        case (value, index) =>
          val colName = schemaFields(index).name
          val castedValue = Util.castTo(if (colName.equalsIgnoreCase("gender")) {
            if (value.toInt == 1) "Male" else "Female"
          } else value,
            schemaFields(index).dataType)
          if (requiredColumns.contains(colName)) Some(castedValue) else None
      })
      tmp.map(s => Row.fromSeq(s.filter(_.isDefined).map(value => value.get)))
    })

    rows.flatMap(e => e)
  }
}
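
The relation above refers to a CustomFilter helper that the post does not list; the full version is in the repository linked at the end. A minimal sketch, assuming only the equalTo case used above, could look like this:

package cn.zj.spark.sql.datasource

import org.apache.spark.sql.types.StructType

// Minimal sketch of the helper referenced by CustomDatasourceRelation; only "equalTo" is handled.
case class CustomFilter(attr: String, value: Any, filter: String)

object CustomFilter {
  // Returns true when the raw string value read from the file satisfies every filter on this attribute.
  def applyFilters(filters: List[CustomFilter], value: String, schema: StructType): Boolean = {
    filters.forall { f =>
      val dataType = schema(f.attr).dataType
      f.filter match {
        case "equalTo" => Util.castTo(value, dataType) == Util.castTo(f.value.toString, dataType)
        case _         => true // filter types that are not implemented are simply not applied
      }
    }
  }
}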

  Type conversion helper

package cn.zj.spark.sql.datasource

import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StringType}

/**
 * Created by rana on 30/9/16.
 */
object Util {
  def castTo(value: String, dataType: DataType) = {
    dataType match {
      case _: IntegerType => value.toInt
      case _: LongType => value.toLong
      case _: StringType => value
    }
  }
}
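
A quick illustration of how castTo behaves with the types used in the default schema (the match is not exhaustive, so any other DataType would throw a scala.MatchError):

// assumes the same imports as the Util object above
Util.castTo("10002", IntegerType)  // => 10002  (Int)
Util.castTo("50000", LongType)     // => 50000L (Long)
Util.castTo("Male", StringType)    // => "Male" (String)
// e.g. Util.castTo("1.5", DoubleType) would fail with a MatchError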

3 pom dependency configuration

  

<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.2.0</spark.version>
    <!--<hadoop.version>2.6.0-cdh5.7.0</hadoop.version>-->
    <!--<hbase.version>1.2.0-cdh5.7.0</hbase.version>-->
    <encoding>UTF-8</encoding>
</properties>

<dependencies>
    <!-- Spark core dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark SQL dependency -->
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>
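
For readers who build with sbt instead of Maven, a roughly equivalent build.sbt (an assumption; the original project uses Maven) would be:

// build.sbt sketch: same Spark 2.2.0 / Scala 2.11 dependencies as the pom above
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql"  % "2.2.0"
)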

4 Test code and test data

package cn.zj.spark.sql.datasource

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * Created by rana on 29/9/16.
 */
object app extends App {
  println("Application started...")

  val conf = new SparkConf().setAppName("spark-custom-datasource")
  val spark = SparkSession.builder().config(conf).master("local").getOrCreate()

  val df = spark.sqlContext.read.format("cn.zj.spark.sql.datasource").load("1229practice/data/")

  df.createOrReplaceTempView("test")
  spark.sql("select * from test where salary = 50000").show()

  println("Application Ended...")
}
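
A hypothetical extra query shows column pruning in action: projecting a subset of columns lets Spark pass only those names down as requiredColumns, while the salary predicate above arrives as an EqualTo filter in buildScan(requiredColumns, filters).

// Hypothetical additional query: only "id" and "name" reach the data source as requiredColumns.
spark.sql("select id, name from test").show()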

  

Data (columns are id, name, gender with 1 = Male / 0 = Female, salary, and expenses, matching the default schema above):

  

10002, Alice Heady, 0, 20000, 8000
10003, Jenny Brown, 0, 30000, 120000
10004, Bob Hayden, 1, 40000, 16000
10005, Cindy Heady, 0, 50000, 20000
10006, Doug Brown, 1, 60000, 24000
10007, Carolina Hayden, 0, 70000, 280000

  

Reference: http://sparkdatasourceapi.blogspot.com/2016/10/spark-data-source-api-write-custom.html

The complete code is available at git@github.com:ZhangJin1988/spark-extend-dataSource.git
