Custom External Data Sources in Spark SQL

1 APIs involved

  BaseRelation: represents a collection of tuples with a known schema, and provides the method that defines that schema.

  TableScan: provides a way to scan the data and produce an RDD[Row] from it.

  RelationProvider: takes a map of parameters and returns a BaseRelation.
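
To see how the three interfaces fit together before reading the full implementation, here is a minimal sketch. It is not part of the original post: the package `example.range`, the class `RangeRelation`, and the option `n` are made up for illustration. It assumes the Spark 2.2 dependencies from section 3.

```scala
package example.range // hypothetical package, used only for this sketch

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// BaseRelation + TableScan: a known schema plus a full scan that yields RDD[Row].
class RangeRelation(override val sqlContext: SQLContext, n: Int)
  extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(StructField("value", IntegerType, nullable = false) :: Nil)

  // The range is built on the driver; the closure below does not reference `this`.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until n).map(i => Row(i))
}

// RelationProvider: Spark looks for a class named DefaultSource
// in the package passed to format(...).
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new RangeRelation(sqlContext, parameters.getOrElse("n", "3").toInt)
}
```

With this on the classpath, `spark.read.format("example.range").option("n", "5").load()` should give a single-column DataFrame holding the values 0 to 4.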

2 Implementation

  Defining the relation

package cn.zj.spark.sql.datasource

import org.apache.hadoop.fs.Path

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}

import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, RelationProvider, SchemaRelationProvider}

import org.apache.spark.sql.types.StructType

/**

  * Created by rana on 29/9/16.

  */

class DefaultSource extends RelationProvider with SchemaRelationProvider with CreatableRelationProvider {

  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = {

    createRelation(sqlContext, parameters, null)

  }

  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String], schema: StructType): BaseRelation = {

    val path = parameters.get("path")

    path match {

      case Some(p) => new CustomDatasourceRelation(sqlContext, p, schema)

      case _ => throw new IllegalArgumentException("Path is required for custom-datasource format!!")

    }

  }

  override def createRelation(sqlContext: SQLContext, mode: SaveMode, parameters: Map[String, String],

                              data: DataFrame): BaseRelation = {

    // Falls back to "./output/" for this tutorial; a real data source should probably fail when no path is given

    val path = parameters.getOrElse("path", "./output/")

    val fsPath = new Path(path)

    val fs = fsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)

    // Throw instead of calling sys.exit, which would terminate the whole driver JVM

    val shouldWrite = mode match {

      case SaveMode.Append =>

        throw new UnsupportedOperationException("Append mode is not supported by " + this.getClass.getCanonicalName)

      case SaveMode.Overwrite =>

        fs.delete(fsPath, true)

        true

      case SaveMode.ErrorIfExists =>

        if (fs.exists(fsPath)) throw new IllegalStateException("Given path: " + path + " already exists!!")

        true

      case SaveMode.Ignore => !fs.exists(fsPath) // existing output is left untouched

    }

    if (shouldWrite) {

      val formatName = parameters.getOrElse("format", "customFormat")

      formatName match {

        case "customFormat" => saveAsCustomFormat(data, path, mode)

        case "json" => saveAsJson(data, path, mode)

        case _ => throw new IllegalArgumentException(formatName + " is not supported!!!")

      }

    }

    createRelation(sqlContext, parameters, data.schema)

  }

  private def saveAsJson(data : DataFrame, path : String, mode: SaveMode): Unit = {

    /**

      * Here, I am using the dataframe's Api for storing it as json.

      * you can have your own apis and ways for saving!!

      */

    data.write.mode(mode).json(path)

  }

  private def saveAsCustomFormat(data : DataFrame, path : String, mode: SaveMode): Unit = {

    /**

      * Here, I am  going to save this as simple text file which has values separated by "|".

      * But you can have your own way to store without any restriction.

      */

    val customFormatRDD = data.rdd.map(row => {

      // Nullable columns may contain null, so avoid calling toString on it

      row.toSeq.map(value => if (value == null) "" else value.toString).mkString("|")

    })

    customFormatRDD.saveAsTextFile(path)

  }

}
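
The save-mode handling is the easiest part of a writable source to get wrong, so it can help to look at it as a small decision table on its own. The sketch below is not from the original post. `Mode`, `Action`, and `SaveModePlanner` are hypothetical names standing in for Spark's `SaveMode` so that the snippet runs without Spark. It follows the documented meaning of the modes: Overwrite replaces existing data, ErrorIfExists fails only when data is already there, and Ignore silently does nothing in that case. Append is rejected, as in the class above.

```scala
// Hypothetical stand-ins for Spark's SaveMode values, so the sketch runs without Spark.
sealed trait Mode
case object Append extends Mode
case object Overwrite extends Mode
case object ErrorIfExists extends Mode
case object Ignore extends Mode

// What the writer should do before touching the target path.
sealed trait Action
case object WriteNew extends Action          // path is free: just write
case object DeleteThenWrite extends Action   // remove existing output, then write
case object Skip extends Action              // leave existing output untouched
final case class Fail(reason: String) extends Action

object SaveModePlanner {
  // Pure decision function: no file system access, easy to unit test.
  def plan(mode: Mode, pathExists: Boolean): Action = mode match {
    case Append                      => Fail("Append mode is not supported")
    case Overwrite if pathExists     => DeleteThenWrite
    case ErrorIfExists if pathExists => Fail("path already exists")
    case Ignore if pathExists        => Skip
    case _                           => WriteNew
  }
}
```

Keeping the decision separate from the Hadoop `FileSystem` calls is only one possible design, but it makes each mode's behaviour visible at a glance.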

  Defining the schema and the data-reading code

package cn.zj.spark.sql.datasource

import org.apache.spark.rdd.RDD

import org.apache.spark.sql.{Row, SQLContext}

import org.apache.spark.sql.sources._

import org.apache.spark.sql.types._

/**

  * Created by rana on 29/9/16.

  */

class CustomDatasourceRelation(override val sqlContext : SQLContext, path : String, userSchema : StructType)

  extends BaseRelation with TableScan with PrunedScan with PrunedFilteredScan with Serializable {

  override def schema: StructType = {

    if (userSchema != null) {

      userSchema

    } else {

      StructType(

        StructField("id", IntegerType, false) ::

        StructField("name", StringType, true) ::

        StructField("gender", StringType, true) ::

        StructField("salary", LongType, true) ::

        StructField("expenses", LongType, true) :: Nil

      )

    }

  }

  override def buildScan(): RDD[Row] = {

    println("TableScan: buildScan called...")

    val schemaFields = schema.fields

    // Reading the file's content

    val rdd = sqlContext.sparkContext.wholeTextFiles(path).map(f => f._2)

    val rows = rdd.map(fileContent => {

      val lines = fileContent.split("\n")

      val data = lines.map(line => line.split(",").map(word => word.trim).toSeq)

      val tmp = data.map(words => words.zipWithIndex.map{

        case (value, index) =>

          val colName = schemaFields(index).name

          Util.castTo(if (colName.equalsIgnoreCase("gender")) {if(value.toInt == 1) "Male" else "Female"} else value,

            schemaFields(index).dataType)

      })

      tmp.map(s => Row.fromSeq(s))

    })

    rows.flatMap(e => e)

  }

  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {

    println("PrunedScan: buildScan called...")

    val schemaFields = schema.fields

    // Reading the file's content

    val rdd = sqlContext.sparkContext.wholeTextFiles(path).map(f => f._2)

    val rows = rdd.map(fileContent => {

      val lines = fileContent.split("\n")

      val data = lines.map(line => line.split(",").map(word => word.trim).toSeq)

      val tmp = data.map(words => words.zipWithIndex.map{

        case (value, index) =>

          val colName = schemaFields(index).name

          val castedValue = Util.castTo(if (colName.equalsIgnoreCase("gender")) {if(value.toInt == 1) "Male" else "Female"} else value,

                                        schemaFields(index).dataType)

          colName -> castedValue

      }.toMap)

      // Spark expects each Row to hold the values in the order of requiredColumns, not in schema order

      tmp.map(valuesByColumn => Row.fromSeq(requiredColumns.map(c => valuesByColumn(c)).toSeq))

    })

    rows.flatMap(e => e)

  }

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {

    println("PrunedFilterScan: buildScan called...")

    println("Filters: ")

    filters.foreach(f => println(f.toString))

    var customFilters: Map[String, List[CustomFilter]] = Map[String, List[CustomFilter]]()

    filters.foreach( f => f match {

      case EqualTo(attr, value) =>

        println("EqualTo filter is used!!" + "Attribute: " + attr + " Value: " + value)

        /**

          * Only one filter type is implemented for now, so keeping a list per attribute may not seem to make

          * much sense: an attribute can be equal to only one value at a time, so there is little point in

          * storing the same kind of filter twice.

          * The list becomes useful once more than one filter applies to the same attribute. Take the condition

          * below, for example:

          * attr > 5 && attr < 10

          * For such cases it is better to keep a list.

          * You can add more filters to this code and try them. Only the EqualTo filter is implemented here

          * to illustrate the concept.

          */

        customFilters = customFilters ++ Map(attr -> {

          customFilters.getOrElse(attr, List[CustomFilter]()) :+ new CustomFilter(attr, value, "equalTo")

        })

      case _ => println("filter: " + f.toString + " is not implemented by us!!")

    })

    val schemaFields = schema.fields

    // Reading the file's content

    val rdd = sqlContext.sparkContext.wholeTextFiles(path).map(f => f._2)

    val rows = rdd.map(file => {

      val lines = file.split("\n")

      val data = lines.map(line => line.split(",").map(word => word.trim).toSeq)

      val filteredData = data.map(s => if (customFilters.nonEmpty) {

        var includeInResultSet = true

        s.zipWithIndex.foreach {

          case (value, index) =>

            val attr = schemaFields(index).name

            val filtersList = customFilters.getOrElse(attr, List())

            if (filtersList.nonEmpty) {

              if (CustomFilter.applyFilters(filtersList, value, schema)) {

              } else {

                includeInResultSet = false

              }

            }

        }

        if (includeInResultSet) s else Seq()

      } else s)

      val tmp = filteredData.filter(_.nonEmpty).map(s => s.zipWithIndex.map {

        case (value, index) =>

          val colName = schemaFields(index).name

          val castedValue = Util.castTo(if (colName.equalsIgnoreCase("gender")) {

            if (value.toInt == 1) "Male" else "Female"

          } else value,

            schemaFields(index).dataType)

          colName -> castedValue

      }.toMap)

      // Spark expects each Row to hold the values in the order of requiredColumns, not in schema order

      tmp.map(valuesByColumn => Row.fromSeq(requiredColumns.map(c => valuesByColumn(c)).toSeq))

    })

    rows.flatMap(e => e)

  }

}
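
The relation above refers to a CustomFilter class that is not listed in this post, so the core idea of a pruned, filtered scan can be hard to see. The sketch below is not the author's code. It is a Spark-free illustration in which `PrunedFilteredScanSketch` and `Equal` are hypothetical names, and the column list is assumed to match the default schema above. It shows the two things such a scan has to do: drop records that fail an equality filter, and return the remaining values in the order of `requiredColumns`.

```scala
object PrunedFilteredScanSketch {
  // Column names in file order; assumed to mirror the default schema of the relation.
  val columns: Vector[String] = Vector("id", "name", "gender", "salary", "expenses")

  // Hypothetical, simplified stand-in for Spark's EqualTo(attribute, value) filter.
  final case class Equal(attribute: String, value: String)

  def scan(lines: Seq[String],
           requiredColumns: Seq[String],
           filters: Seq[Equal]): Seq[Seq[String]] =
    lines
      // split each comma-separated line into trimmed fields
      .map(_.split(",").map(_.trim).toVector)
      // keep a record only if every filter matches the raw value of its column
      .filter(record => filters.forall(f => record(columns.indexOf(f.attribute)) == f.value))
      // emit values in the order of requiredColumns, not in file order
      .map(record => requiredColumns.map(c => record(columns.indexOf(c))))
}
```

For instance, scanning the two lines `10004, Bob Hayden, 1, 40000, 16000` and `10005, Cindy Heady, 0, 50000, 20000` with required columns `name, id` and the filter `Equal("salary", "50000")` keeps only the second record and returns its name before its id. The sketch compares raw strings and does no type casting, and an unknown column name would fail with an index error; a real scan would cast values as `Util.castTo` does.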

  Type conversion utility

package cn.zj.spark.sql.datasource

import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StringType}

/**

  * Created by rana on 30/9/16.

  */

object Util {

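  // Converts the raw string read from the file into the JVM type that matches the column's DataType

  def castTo(value : String, dataType : DataType): Any = {

    dataType match {

      case _ : IntegerType => value.toInt

      case _ : LongType => value.toLong

      case _ : StringType => value

      case other => throw new IllegalArgumentException("Unsupported data type: " + other)

    }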

  }

}

3 pom.xml dependencies

    <properties>

        <maven.compiler.source>1.8</maven.compiler.source>

        <maven.compiler.target>1.8</maven.compiler.target>

        <scala.version>2.11.8</scala.version>

        <spark.version>2.2.0</spark.version>

        <!--<hadoop.version>2.6.0-cdh5.7.0</hadoop.version>-->

        <!--<hbase.version>1.2.0-cdh5.7.0</hbase.version>-->

        <encoding>UTF-8</encoding>

    </properties>

    <dependencies>

        <!-- Spark core dependency -->

        <dependency>

            <groupId>org.apache.spark</groupId>

            <artifactId>spark-core_2.11</artifactId>

            <version>${spark.version}</version>

        </dependency>

        <!-- Spark SQL dependency -->

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->

        <dependency>

            <groupId>org.apache.spark</groupId>

            <artifactId>spark-sql_2.11</artifactId>

            <version>${spark.version}</version>

        </dependency>

    </dependencies>

4 Test code and test data

package cn.zj.spark.sql.datasource

import org.apache.spark.SparkConf

import org.apache.spark.sql.SparkSession

/**

  * Created by rana on 29/9/16.

  */

object app extends App {

  println("Application started...")

  val conf = new SparkConf().setAppName("spark-custom-datasource")

  val spark = SparkSession.builder().config(conf).master("local").getOrCreate()

  val df = spark.sqlContext.read.format("cn.zj.spark.sql.datasource").load("1229practice/data/")

  df.createOrReplaceTempView("test")

  spark.sql("select * from test where salary = 50000").show()

  println("Application Ended...")

}
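
The test above only exercises the read path. The following sketch, which is not part of the original post, shows how the same source could be used with column pruning and for writing. The object name `WriteBackExample` and the output directory `out/json/` are assumptions, and the `format` option is the one that `DefaultSource` reads from its parameters.

```scala
package cn.zj.spark.sql.datasource

import org.apache.spark.sql.{SaveMode, SparkSession}

object WriteBackExample extends App {
  val spark = SparkSession.builder()
    .appName("spark-custom-datasource-write")
    .master("local[*]")
    .getOrCreate()

  val df = spark.read.format("cn.zj.spark.sql.datasource").load("1229practice/data/")

  // Selecting a subset of columns is served by a pruned scan of the relation.
  df.select("name", "salary").show()

  // "out/json/" is a hypothetical output directory.
  // The "format" option selects the JSON branch in DefaultSource.createRelation.
  df.write
    .format("cn.zj.spark.sql.datasource")
    .mode(SaveMode.Overwrite)
    .option("format", "json")
    .save("out/json/")

  spark.stop()
}
```

Leaving out the `format` option would fall back to the `|`-separated custom format.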

Data

10002, Alice Heady, 0, 20000, 8000

10003, Jenny Brown, 0, 30000, 120000

10004, Bob Hayden, 1, 40000, 16000

10005, Cindy Heady, 0, 50000, 20000

10006, Doug Brown, 1, 60000, 24000

10007, Carolina Hayden, 0, 70000, 280000

Reference: http://sparkdatasourceapi.blogspot.com/2016/10/spark-data-source-api-write-custom.html

The complete code is available at git@github.com:ZhangJin1988/spark-extend-dataSource.git
