When building a custom Spark SQL data source, the schema of the Spark SQL table has to be reconciled with the schema of the HBase table.

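To define a custom data source in Spark, you can implement these three interfaces:

BaseRelation: represents an abstract data source made up of rows with a known schema (a relational table).
TableScan: scans the whole table and returns the data as an RDD[Row].
RelationProvider: as the name suggests, returns a data source (a BaseRelation) built from the parameters the user supplies.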

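TableScan is the coarsest-grained query: it scans the entire table in one pass. If you need to discard data at the source at a finer granularity, you can implement one of these instead:

PrunedScan: column pruning

PrunedFilteredScan: column pruning + filtering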

So, to connect to HBase, define an HBase relation:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types._

class DefaultSource extends RelationProvider {
  def createRelation(sqlContext: SQLContext, parameters: Map[String, String]) = {
    HBaseRelation(parameters)(sqlContext)
  }
}

case class HBaseRelation(@transient val hbaseProps: Map[String, String])(@transient val sqlContext: SQLContext)
  extends BaseRelation with Serializable with TableScan {

  val hbaseTableName = hbaseProps.getOrElse("hbase_table_name", sys.error("missing option: hbase_table_name"))
  val hbaseTableSchema = hbaseProps.getOrElse("hbase_table_schema", sys.error("missing option: hbase_table_schema"))
  val registerTableSchema = hbaseProps.getOrElse("sparksql_table_schema", sys.error("missing option: sparksql_table_schema"))
  val rowRange = hbaseProps.getOrElse("row_range", "->")

  // Split "start->end" into the start row and the end row.
  val range = rowRange.split("->", -1)
  val startRowKey = range(0).trim
  val endRowKey = range(1).trim

  val tempHBaseFields = extractHBaseSchema(hbaseTableSchema) // temporary, untyped fields; do not use elsewhere
  val registerTableFields = extractRegisterSchema(registerTableSchema)
  val tempFieldRelation = tableSchemaFieldMapping(tempHBaseFields, registerTableFields)
  val hbaseTableFields = feedTypes(tempFieldRelation)
  val fieldsRelations = tableSchemaFieldMapping(hbaseTableFields, registerTableFields)
  val queryColumns = getQueryTargetCloumns(hbaseTableFields)

  // Copy the type of each registered Spark SQL field onto its HBase field.
  def feedTypes(mapping: Map[HBaseSchemaField, RegisteredSchemaField]): Array[HBaseSchemaField] = {
    val hbaseFields = mapping.map {
      case (k, v) => k.copy(fieldType = v.fieldType)
    }
    hbaseFields.toArray
  }

  // The row key is declared as ":key" (empty column family, column name "key").
  def isRowKey(field: HBaseSchemaField): Boolean = {
    val cfColArray = field.fieldName.split(":", -1)
    val cfName = cfColArray(0)
    val colName = cfColArray(1)
    cfName == "" && colName == "key"
  }

  // Space-separated "family:column" list for TableInputFormat.SCAN_COLUMNS; the row key is not a column.
  def getQueryTargetCloumns(hbaseTableFields: Array[HBaseSchemaField]): String =
    hbaseTableFields.filterNot(isRowKey).map(_.fieldName).mkString(" ")

  lazy val schema = {
    val fields = hbaseTableFields.map { field =>
      val name = fieldsRelations.getOrElse(field, sys.error("table schema does not match the definition.")).fieldName
      val relatedType = field.fieldType match {
        case "String" => SchemaType(StringType, nullable = false)
        case "Int"    => SchemaType(IntegerType, nullable = false)
        case "Long"   => SchemaType(LongType, nullable = false)
        case "Float"  => SchemaType(FloatType, nullable = false) // Resolver handles Float, so the schema must too
        case "Double" => SchemaType(DoubleType, nullable = false)
      }
      StructField(name, relatedType.dataType, relatedType.nullable)
    }
    StructType(fields)
  }

  // Pair the i-th HBase field with the i-th registered field.
  // The column-order problem discussed later in this post originates in the toMap call here.
  def tableSchemaFieldMapping(externalHBaseTable: Array[HBaseSchemaField], registerTable: Array[RegisteredSchemaField]): Map[HBaseSchemaField, RegisteredSchemaField] = {
    if (externalHBaseTable.length != registerTable.length) sys.error("columns size not match in definition!")
    val rs: Array[(HBaseSchemaField, RegisteredSchemaField)] = externalHBaseTable.zip(registerTable)
    rs.toMap
  }

  /**
   * Parse the Spark SQL schema that will be registered, e.g.
   * registerTableSchema = "(rowkey String, value String, column_a String)"
   */
  def extractRegisterSchema(registerTableSchema: String): Array[RegisteredSchemaField] = {
    val fieldsStr = registerTableSchema.trim.drop(1).dropRight(1) // strip the surrounding parentheses
    val fieldsArray = fieldsStr.split(",").map(_.trim)
    fieldsArray.map { fildString =>
      val splitedField = fildString.split("\\s+", -1)
      RegisteredSchemaField(splitedField(0), splitedField(1))
    }
  }

  // Parse the HBase schema, e.g. "(:key, cf:col_a)". Types are filled in later by feedTypes.
  def extractHBaseSchema(externalTableSchema: String): Array[HBaseSchemaField] = {
    val fieldsStr = externalTableSchema.trim.drop(1).dropRight(1)
    val fieldsArray = fieldsStr.split(",").map(_.trim)
    fieldsArray.map(fildString => HBaseSchemaField(fildString, ""))
  }

  // By making this a lazy val we keep the RDD around, amortizing the cost of locating splits.
  lazy val buildScan = {
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", GlobalConfigUtils.hbaseQuorem)
    hbaseConf.set(TableInputFormat.INPUT_TABLE, hbaseTableName)
    hbaseConf.set(TableInputFormat.SCAN_COLUMNS, queryColumns)
    hbaseConf.set(TableInputFormat.SCAN_ROW_START, startRowKey)
    hbaseConf.set(TableInputFormat.SCAN_ROW_STOP, endRowKey)

    val hbaseRdd = sqlContext.sparkContext.newAPIHadoopRDD(
      hbaseConf,
      classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result]
    )

    // Turn each HBase Result into a Row, resolving fields in hbaseTableFields order.
    hbaseRdd.map(_._2).map { result =>
      Row.fromSeq(hbaseTableFields.map(field => Resolver.resolve(field, result)).toSeq)
    }
  }

  private case class SchemaType(dataType: DataType, nullable: Boolean)
}
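The option parsing inside HBaseRelation is easy to try out on its own. Below is a minimal sketch in plain Scala, with no Spark or HBase on the classpath; `RegisteredField`, `HBaseField` and `OptionParsing` are hypothetical stand-ins invented for this illustration, not names from the code above.

```scala
// Hypothetical stand-ins for the post's case classes, so the sketch runs on its own.
final case class RegisteredField(fieldName: String, fieldType: String)
final case class HBaseField(fieldName: String)

object OptionParsing {
  // "a->b" => ("a", "b"); the default "->" => ("", ""), i.e. no start or stop row is given.
  def parseRowRange(rowRange: String): (String, String) = {
    val parts = rowRange.split("->", -1) // limit -1 keeps trailing empty strings
    (parts(0).trim, parts(1).trim)
  }

  // "(id String, city_name String)" => fields in the order they were written.
  def parseRegistered(schema: String): Array[RegisteredField] =
    schema.trim.drop(1).dropRight(1).split(",").map(_.trim).map { field =>
      val parts = field.split("\\s+")
      RegisteredField(parts(0), parts(1))
    }

  // "(:key, MM:city_name)" => HBase columns in the order they were written.
  def parseHBase(schema: String): Array[HBaseField] =
    schema.trim.drop(1).dropRight(1).split(",").map(_.trim).map(HBaseField(_))
}
```

Both parsers return arrays, so the order in which the user wrote the fields is still intact at this stage. A `row_range` value without `->` would make `parts(1)` throw, which is also true of the original code.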


The HBase schema definitions:

import org.apache.spark.sql.SQLContext

package object hbase {

  abstract class SchemaField extends Serializable

  case class RegisteredSchemaField(fieldName: String, fieldType: String) extends SchemaField with Serializable

  case class HBaseSchemaField(fieldName: String, fieldType: String) extends SchemaField with Serializable

  case class Parameter(name: String)

  // Option keys understood by HBaseRelation.
  protected val SPARK_SQL_TABLE_SCHEMA = Parameter("sparksql_table_schema")
  protected val HBASE_TABLE_NAME = Parameter("hbase_table_name")
  protected val HBASE_TABLE_SCHEMA = Parameter("hbase_table_schema")
  protected val ROW_RANGE = Parameter("row_range")

  /**
   * Adds a method, `hbaseTable`, to SQLContext that allows reading data stored in an HBase table.
   */
  implicit class HBaseContext(sqlContext: SQLContext) {
    def hbaseTable(sparksqlTableSchema: String, hbaseTableName: String, hbaseTableSchema: String, rowRange: String = "->") = {
      val params = Map(
        SPARK_SQL_TABLE_SCHEMA.name -> sparksqlTableSchema,
        HBASE_TABLE_NAME.name -> hbaseTableName,
        HBASE_TABLE_SCHEMA.name -> hbaseTableSchema,
        ROW_RANGE.name -> rowRange // "start->end"; parsed into start row and end row by HBaseRelation
      )
      sqlContext.baseRelationToDataFrame(HBaseRelation(params)(sqlContext))
    }
  }
}

The data types declared in the schema also need to be handled when reading values:

import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes

object Resolver extends Serializable {

  // Resolve the row key if the field is ":key", otherwise resolve a regular column.
  def resolve(hbaseField: HBaseSchemaField, result: Result): Any = {
    val cfColArray = hbaseField.fieldName.split(":", -1)
    val cfName = cfColArray(0)
    val colName = cfColArray(1)
    if (cfName == "" && colName == "key") {
      resolveRowKey(result, hbaseField.fieldType)
    } else {
      resolveColumn(result, cfName, colName, hbaseField.fieldType)
    }
  }

  // The row key is treated as text and parsed into the requested type.
  def resolveRowKey(result: Result, resultType: String): Any = {
    val rowKeyText = result.getRow.map(_.toChar).mkString
    resultType match {
      case "String" => rowKeyText
      case "Int"    => rowKeyText.toInt
      case "Long"   => rowKeyText.toLong
      case "Float"  => rowKeyText.toFloat // the original called toLong here
      case "Double" => rowKeyText.toDouble
    }
  }

  // Column values are decoded with HBase's binary Bytes helpers; a missing cell gets a default value.
  def resolveColumn(result: Result, columnFamily: String, columnName: String, resultType: String): Any = {
    val family = columnFamily.getBytes
    val qualifier = columnName.getBytes
    if (result.containsColumn(family, qualifier)) {
      val value = result.getValue(family, qualifier)
      resultType match {
        case "String" => Bytes.toString(value)
        case "Int"    => Bytes.toInt(value)
        case "Long"   => Bytes.toLong(value)
        case "Float"  => Bytes.toFloat(value)
        case "Double" => Bytes.toDouble(value)
      }
    } else {
      resultType match {
        case "String" => ""
        case "Int"    => 0
        case "Long"   => 0L
        case "Float"  => 0.0f // the original had no default for Float
        case "Double" => 0.0
      }
    }
  }
}
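Note that Resolver decodes the row key as text (digits parsed from characters) but decodes columns with `Bytes.toInt` and friends, which expect HBase's fixed-width binary layout. The sketch below illustrates the difference with plain JDK classes only; `EncodingDemo` and its methods are hypothetical names, and `ByteBuffer` is used here as a stand-in for the 4-byte big-endian layout that HBase's `Bytes.toBytes(int)` produces.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

object EncodingDemo {
  // Text encoding: the digits of the number, one byte per character.
  def intAsText(n: Int): Array[Byte] = n.toString.getBytes(UTF_8)

  // Binary encoding: always 4 bytes, big-endian.
  def intAsBinary(n: Int): Array[Byte] = ByteBuffer.allocate(4).putInt(n).array()

  // Decoders matching each encoding.
  def decodeText(bytes: Array[Byte]): Int = new String(bytes, UTF_8).toInt
  def decodeBinary(bytes: Array[Byte]): Int = ByteBuffer.wrap(bytes).getInt
}
```

Each decoder round-trips its own encoding, but mixing them goes wrong: `decodeBinary(intAsText(1234))` reads the four ASCII digit bytes as one integer and returns 825373492 instead of 1234. So a column declared as "Int" in `sparksql_table_schema` has to have been written in binary form, while a value written as a string should be declared as "String", as in the test that follows.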


A quick test:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object CustomHbaseTest {
  def main(args: Array[String]): Unit = {
    val startTime = System.currentTimeMillis()
    val sparkConf: SparkConf = new SparkConf()
      .setMaster("local[6]")
      .setAppName("query")
      .set("spark.worker.timeout", GlobalConfigUtils.sparkWorkTimeout)
      .set("spark.cores.max", GlobalConfigUtils.sparkMaxCores)
      .set("spark.rpc.askTimeout", GlobalConfigUtils.sparkRpcTimeout)
      .set("spark.task.maxFailures", GlobalConfigUtils.sparkTaskMaxFailures) // key was misspelled "macFailures"
      .set("spark.speculation", GlobalConfigUtils.sparkSpeculation)
      .set("spark.driver.allowMultipleContexts", GlobalConfigUtils.sparkAllowMutilpleContext) // key was misspelled
      .set("spark.buffer.pageSize", GlobalConfigUtils.sparkBuferSize)
      // spark.serializer was set twice in the original; only the last value took effect.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.driver.host", "localhost")

    val sparkSession: SparkSession = SparkSession.builder()
      .config(sparkConf)
      .enableHiveSupport() // enable Hive support
      .getOrCreate()

    val hbasetable = sparkSession
      .read
      .format("com.df.test_custom.customSource")
      .options(
        Map(
          "sparksql_table_schema" -> "(id String, create_time String , open_lng String , open_lat String , begin_address_code String , charge_mileage String , city_name String , vehicle_license String)",
          "hbase_table_name" -> "order_info",
          "hbase_table_schema" -> "(MM:id , MM:create_time , MM:open_lng , MM:open_lat , MM:begin_address_code , MM:charge_mileage , MM:city_name , MM:vehicle_license)"
        )).load()

    hbasetable.createOrReplaceTempView("orderData")

    sparkSession.sql(
      """
        |select * from orderData
      """.stripMargin).show()

    val endTime = System.currentTimeMillis()
    println(s"Elapsed time: ${endTime - startTime}")
  }
}


With all the code in place the job ran, but the values that came back did not line up with their columns.

For example:

var hbasetable = sparkSession
  .read
  .format("com.df.test_custom.customSource")
  .options(
    Map(
      "sparksql_table_schema" -> "(id String, create_time String , open_lng String , open_lat String , begin_address_code String , charge_mileage String , city_name String , vehicle_license String)",
      "hbase_table_name" -> "order_info",
      "hbase_table_schema" -> "(MM:id , MM:create_time , MM:open_lng , MM:open_lat , MM:begin_address_code , MM:charge_mileage , MM:city_name , MM:vehicle_license)"
    )).load()

The Spark SQL table schema and the HBase schema are specified as in the code above.

But this query returned something different:

hbasetable.createOrReplaceTempView("orderData")

sparkSession.sql(
  """
    |select * from orderData
  """.stripMargin).show()

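The output showed that many of the columns were out of order: the values under a column name belonged to a different column.

The cause of the problem: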

def tableSchemaFieldMapping(externalHBaseTable: Array[HBaseSchemaField], registerTable: Array[RegisteredSchemaField]): Map[HBaseSchemaField, RegisteredSchemaField] = {
  if (externalHBaseTable.length != registerTable.length) sys.error("columns size not match in definition!")
  val rs: Array[(HBaseSchemaField, RegisteredSchemaField)] = externalHBaseTable.zip(registerTable)
  rs.toMap
}

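Note the last line: rs.toMap

A Scala Map built this way does not guarantee iteration order. feedTypes iterates over this map to rebuild hbaseTableFields, so the HBase columns come back in hash order and are then zipped against the registered fields, which are still in their original order. That is how names and values drift apart. A small example: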

object TestMap {
  def main(args: Array[String]): Unit = {
    val arr1 = Array("java", "scla", "javascripe", "ii", "wqe", "qaz")
    val arr2 = Array("java", "scla", "javascripe", "ii", "wqe", "qaz")
    val toMap: Map[String, String] = arr1.zip(arr2).toMap
    for ((k, v) <- toMap) {
      println(s"k :${k} , v:${v}")
    }
  }
}

Running it prints the entries in a different order from the one zip produced, so the problem really is in toMap.
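To make the ordering behaviour concrete, here is a small self-contained sketch that reuses the array from the example. `MapOrderDemo` is a name made up for this illustration. It relies on two properties of the Scala standard library: an immutable `Map` with up to four entries is backed by small fixed-size classes that keep insertion order, while larger ones are hash-based, and `ListMap` iterates in insertion order at any size.

```scala
import scala.collection.immutable.ListMap

object MapOrderDemo {
  val names: Array[String] = Array("java", "scla", "javascripe", "ii", "wqe", "qaz")
  val pairs: Array[(String, String)] = names.zip(names)

  // Four entries or fewer: insertion order happens to be kept.
  val small: Map[String, String] = pairs.take(4).toMap

  // Five entries or more: hash-based, so iteration order is unspecified.
  val hashed: Map[String, String] = pairs.toMap

  // ListMap iterates in insertion order regardless of size.
  val ordered: ListMap[String, String] = ListMap(pairs.toSeq: _*)

  def main(args: Array[String]): Unit = {
    println(small.keys.mkString(", "))
    println(hashed.keys.mkString(", "))
    println(ordered.keys.mkString(", "))
  }
}
```

This also explains why the bug can stay hidden in small tests: with four columns or fewer the mapping appears to work, and it only breaks once the table has more.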

The fix:

Since JDK 1.4, Java has provided a Map that preserves insertion order: LinkedHashMap.

So the solution is to replace the Scala Map with a LinkedHashMap.

1) Modify feedTypes

// Requires: import java.util
//           import scala.collection.JavaConverters._
def feedTypes(mapping: util.LinkedHashMap[HBaseSchemaField, RegisteredSchemaField]): Array[HBaseSchemaField] = {
  // asScala exposes the Java map to Scala collection methods; iteration follows insertion order.
  val hbaseFields = mapping.asScala.iterator.map {
    case (k, v) => k.copy(fieldType = v.fieldType)
  }
  hbaseFields.toArray
}

// Previous version, kept for comparison:
// def feedTypes(mapping: Map[HBaseSchemaField, RegisteredSchemaField]): Array[HBaseSchemaField] = {
//   val hbaseFields = mapping.map {
//     case (k, v) =>
//       val field = k.copy(fieldType = v.fieldType)
//       field
//   }
//   hbaseFields.toArray
// }

2) Modify tableSchemaFieldMapping

def tableSchemaFieldMapping(externalHBaseTable: Array[HBaseSchemaField], registerTable: Array[RegisteredSchemaField]): util.LinkedHashMap[HBaseSchemaField, RegisteredSchemaField] = {
  if (externalHBaseTable.length != registerTable.length) sys.error("columns size not match in definition!")
  val rs: Array[(HBaseSchemaField, RegisteredSchemaField)] = externalHBaseTable.zip(registerTable)
  // Insert the pairs one by one so the map remembers the zip order.
  val linkedHashMap = new util.LinkedHashMap[HBaseSchemaField, RegisteredSchemaField]()
  for (arr <- rs) {
    linkedHashMap.put(arr._1, arr._2)
  }
  linkedHashMap
}
// Callers that use Scala Map methods on the result (for example fieldsRelations.getOrElse in schema)
// need .asScala as well, or the implicit scala.collection.JavaConversions._ on older Scala versions.

// Previous version, kept for comparison:
// def tableSchemaFieldMapping(externalHBaseTable: Array[HBaseSchemaField], registerTable: Array[RegisteredSchemaField]): Map[HBaseSchemaField, RegisteredSchemaField] = {
//   if (externalHBaseTable.length != registerTable.length) sys.error("columns size not match in definition!")
//   val rs: Array[(HBaseSchemaField, RegisteredSchemaField)] = externalHBaseTable.zip(registerTable)
//   rs.toMap
// }
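As an alternative to the Java class, the same fix can be sketched with Scala's own `scala.collection.mutable.LinkedHashMap`, which also iterates in insertion order and needs no Java-to-Scala conversion. This is an illustrative variant, not the code the post ran; the two case classes are redeclared here only so the sketch compiles on its own, and `OrderedMapping` is a hypothetical name.

```scala
import scala.collection.mutable

// Redeclared copies of the post's case classes, so the sketch is self-contained.
final case class HBaseSchemaField(fieldName: String, fieldType: String)
final case class RegisteredSchemaField(fieldName: String, fieldType: String)

object OrderedMapping {
  // Pair the i-th HBase field with the i-th registered field, remembering that order.
  def tableSchemaFieldMapping(
      externalHBaseTable: Array[HBaseSchemaField],
      registerTable: Array[RegisteredSchemaField]
  ): mutable.LinkedHashMap[HBaseSchemaField, RegisteredSchemaField] = {
    require(externalHBaseTable.length == registerTable.length, "columns size not match in definition!")
    mutable.LinkedHashMap(externalHBaseTable.zip(registerTable).toSeq: _*)
  }

  // Copy each registered type onto its HBase field, in insertion order.
  def feedTypes(
      mapping: mutable.LinkedHashMap[HBaseSchemaField, RegisteredSchemaField]
  ): Array[HBaseSchemaField] =
    mapping.iterator.map { case (k, v) => k.copy(fieldType = v.fieldType) }.toArray
}
```

With this variant `fieldsRelations.getOrElse(...)` in `schema` keeps working unchanged, because the result is still a Scala map.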

Then rerun the test code. Result:

It works!

PS: the code above can be copied and used as-is.

One more thing:

var hbasetable = sparkSession
  .read
  .format("com.df.test_custom.customSource")
  .options(
    Map(
      "sparksql_table_schema" -> "(id String, create_time String , open_lng String , open_lat String , begin_address_code String , charge_mileage String , city_name String , vehicle_license String)",
      "hbase_table_name" -> "order_info",
      "hbase_table_schema" -> "(MM:id , MM:create_time , MM:open_lng , MM:open_lat , MM:begin_address_code , MM:charge_mileage , MM:city_name , MM:vehicle_license)"
    )).load()

The fields in sparksql_table_schema and hbase_table_schema must be listed in the same order.
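The reason is that the two lists are paired purely by position. A minimal sketch of that pairing, with `PositionalPairing` as a hypothetical helper written for this illustration:

```scala
object PositionalPairing {
  // Pair the i-th Spark SQL column name with the i-th HBase column, exactly as zip does.
  def pair(sparkColumns: Seq[String], hbaseColumns: Seq[String]): Seq[(String, String)] = {
    require(sparkColumns.length == hbaseColumns.length, "column counts differ")
    sparkColumns.zip(hbaseColumns)
  }
}
```

Calling `pair(Seq("id", "city_name"), Seq("MM:city_name", "MM:id"))` raises no error; it simply labels the `MM:city_name` values as `id`. Only the lengths are checked, so a mismatch in order has to be caught by the person writing the two options.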
