Preface

In everyday Spark processing we often need to convert a date value such as 2012-12-12 into a long Unix time and then compute with it. Here is a sample:

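import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(
  Array(
    StructField("id", IntegerType, nullable = true),
    StructField("birth", DateType, nullable = true),
    StructField("time", TimestampType, nullable = true)
  ))

// The id values were lost in the source; 1 and 2 are assumed placeholders.
val data = Seq(
  Row(1, Date.valueOf("2012-12-12"), Timestamp.valueOf("2016-09-30 03:03:00")),
  Row(2, Date.valueOf("2016-12-14"), Timestamp.valueOf("2016-12-14 03:03:00")))

// `spark` is an existing SparkSession.
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)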

Problem & Solution

The most intuitive approach is to cast the DateType column directly to LongType:

df.select(df.col("birth").cast(LongType))

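But every value comes out as null. Why? The answer is in org.apache.spark.sql.catalyst.expressions.Cast. Look first at the canCast method: DateType can in fact be cast to NumericType. Then look at castToLong further down: case DateType => buildCast[Int](_, d => null) simply returns null. The commit history shows that this went back and forth, and in the end it returns null to stay consistent with Hive.

Although DateType cannot be cast to long directly, TimestampType can, so the fix is to cast DateType to TimestampType first. That brings up a really nasty problem: time zones.

First, one thing needs settling: is the 2012-12-12 we put into Spark in UTC or in our current time zone? Unless configured otherwise, 2012-12-12 means 2012-12-12 00:00:00 in the current time zone. On a machine in GMT+8 that corresponds to 2012-12-11 16:00:00 UTC, which is 8 hours earlier. This also shows that Spark applies a time zone when it ingests Date data.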
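The 8-hour gap described above can be checked without Spark. The following is a minimal sketch using only java.time; the object and method names are made up for illustration, and Asia/Shanghai is assumed as the GMT+8 zone.

```scala
import java.time.{LocalDate, ZoneId}

object DateEpoch {
  // Epoch seconds of midnight at the start of `date`, as seen in `zone`.
  def midnightEpochSeconds(date: String, zone: String): Long =
    LocalDate.parse(date).atStartOfDay(ZoneId.of(zone)).toEpochSecond

  def main(args: Array[String]): Unit = {
    val local = midnightEpochSeconds("2012-12-12", "Asia/Shanghai") // GMT+8
    val utc   = midnightEpochSeconds("2012-12-12", "UTC")
    // Local midnight in GMT+8 happens 8 hours before UTC midnight.
    println(s"GMT+8 midnight: $local, UTC midnight: $utc, gap: ${(utc - local) / 3600}h")
  }
}
```

The same calendar date maps to two different instants depending on the zone used to interpret it, which is why the zone in effect matters when the date is later cast to a timestamp.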

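Now look at the code that casts DateType to TimestampType: buildCast[Int](_, d => DateTimeUtils.daysToMillis(d, timeZone) * 1000). A time zone is involved here, and by default Spark SQL uses the time zone of the current machine. In most cases, though, the underlying data, such as this 2016-09-30, represents UTC time, and it is still UTC time while Spark processes it; only a time that goes out through JDBC turns into a result in the target time zone. After some experimenting, there are two solutions:

Set Spark's default time zone with config("spark.sql.session.timeZone", "UTC"). This is the most intuitive option: you can then simply write df.select(df.col("birth").cast(TimestampType).cast(LongType)).

Leave the conf alone and tackle it head-on: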

df.select(from_utc_timestamp(to_utc_timestamp(df.col("birth"), TimeZone.getTimeZone("UTC").getID), TimeZone.getDefault.getID).cast(LongType))

As you can see, there are casts all over the place. Here is the difference:

Without UTC configured:

from_utc_timestamp(to_utc_timestamp(lit("2012-12-11 16:00:00"), TimeZone.getTimeZone("UTC").getID), TimeZone.getDefault.getID)

With UTC configured (8 hours more):

from_utc_timestamp(to_utc_timestamp(lit("2012-12-12 00:00:00"), TimeZone.getTimeZone("UTC").getID), TimeZone.getDefault.getID)
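To see why the second solution lands on UTC midnight, here is a rough model of the shift that to_utc_timestamp and from_utc_timestamp apply, written with java.time only. It is a simplified sketch, not Spark's implementation: the names are hypothetical, values are epoch seconds, and daylight-saving edge cases are ignored.

```scala
import java.time.{Instant, ZoneId}

object UtcShift {
  // Offset of `zone` from UTC, in seconds, at the given instant.
  private def offsetSeconds(epochSec: Long, zone: String): Long =
    ZoneId.of(zone).getRules.getOffset(Instant.ofEpochSecond(epochSec)).getTotalSeconds.toLong

  // Model of from_utc_timestamp: move the value forward by the zone offset.
  def fromUtc(epochSec: Long, zone: String): Long = epochSec + offsetSeconds(epochSec, zone)

  // Model of to_utc_timestamp: move the value back by the zone offset.
  def toUtc(epochSec: Long, zone: String): Long = epochSec - offsetSeconds(epochSec, zone)
}
```

Under this model, to_utc_timestamp with "UTC" changes nothing, and from_utc_timestamp with a GMT+8 default zone adds 8 hours, so 2012-12-11 16:00:00 UTC (1355241600) becomes 1355270400, the epoch second of 2012-12-12 00:00:00 UTC.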

  /**
   * Returns true iff we can cast `from` type to `to` type.
   */
  def canCast(from: DataType, to: DataType): Boolean = (from, to) match {
    case (fromType, toType) if fromType == toType => true
    case (NullType, _) => true
    case (_, StringType) => true
    case (StringType, BinaryType) => true
    case (StringType, BooleanType) => true
    case (DateType, BooleanType) => true
    case (TimestampType, BooleanType) => true
    case (_: NumericType, BooleanType) => true
    case (StringType, TimestampType) => true
    case (BooleanType, TimestampType) => true
    case (DateType, TimestampType) => true
    case (_: NumericType, TimestampType) => true
    case (StringType, DateType) => true
    case (TimestampType, DateType) => true
    case (StringType, CalendarIntervalType) => true
    case (StringType, _: NumericType) => true
    case (BooleanType, _: NumericType) => true
    case (DateType, _: NumericType) => true
    case (TimestampType, _: NumericType) => true
    case (_: NumericType, _: NumericType) => true
    ...
  }

  private[this] def castToLong(from: DataType): Any => Any = from match {
    case StringType =>
      val result = new LongWrapper()
      buildCast[UTF8String](_, s => if (s.toLong(result)) result.value else null)
    case BooleanType =>
      buildCast[Boolean](_, b => if (b) 1L else 0L)
    case DateType =>
      buildCast[Int](_, d => null)
    case TimestampType =>
      buildCast[Long](_, t => timestampToLong(t))
    case x: NumericType =>
      b => x.numeric.asInstanceOf[Numeric[Any]].toLong(b)
  }

  // TimestampConverter
  private[this] def castToTimestamp(from: DataType): Any => Any = from match {
    ...
    case DateType =>
      buildCast[Int](_, d => DateTimeUtils.daysToMillis(d, timeZone) * 1000)
    // TimestampWritable.decimalToTimestamp
    ...
  }

  /**
   * Given a timestamp, which corresponds to a certain time of day in the given timezone, returns
   * another timestamp that corresponds to the same time of day in UTC.
   * @group datetime_funcs
   * @since 1.5.0
   */
  def to_utc_timestamp(ts: Column, tz: String): Column = withExpr {
    ToUTCTimestamp(ts.expr, Literal(tz))
  }

  /**
   * Given a timestamp, which corresponds to a certain time of day in UTC, returns another timestamp
   * that corresponds to the same time of day in the given timezone.
   * @group datetime_funcs
   * @since 1.5.0
   */
  def from_utc_timestamp(ts: Column, tz: String): Column = withExpr {
    FromUTCTimestamp(ts.expr, Literal(tz))
  }

Deep dive

Reading the configuration source:

  val SESSION_LOCAL_TIMEZONE = buildConf("spark.sql.session.timeZone").stringConf.createWithDefaultFunction(() => TimeZone.getDefault.getID)

def sessionLocalTimeZone: String = getConf(SQLConf.SESSION_LOCAL_TIMEZONE)

/**
 * Replace [[TimeZoneAwareExpression]] without timezone id by its copy with session local
 * time zone.
 */
case class ResolveTimeZone(conf: SQLConf) extends Rule[LogicalPlan] {
  private val transformTimeZoneExprs: PartialFunction[Expression, Expression] = {
    case e: TimeZoneAwareExpression if e.timeZoneId.isEmpty =>
      e.withTimeZone(conf.sessionLocalTimeZone)
    // Casts could be added in the subquery plan through the rule TypeCoercion while coercing
    // the types between the value expression and list query expression of IN expression.
    // We need to subject the subquery plan through ResolveTimeZone again to setup timezone
    // information for time zone aware expressions.
    case e: ListQuery => e.withNewPlan(apply(e.plan))
  }

  override def apply(plan: LogicalPlan): LogicalPlan =
    plan.transformAllExpressions(transformTimeZoneExprs)

  def resolveTimeZones(e: Expression): Expression = e.transform(transformTimeZoneExprs)
}

/**
 * Mix-in trait for constructing valid [[Cast]] expressions.
 */
trait CastSupport {
  /**
   * Configuration used to create a valid cast expression.
   */
  def conf: SQLConf

  /**
   * Create a Cast expression with the session local time zone.
   */
  def cast(child: Expression, dataType: DataType): Cast = {
    Cast(child, dataType, Option(conf.sessionLocalTimeZone))
  }
}
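The idea behind ResolveTimeZone can be illustrated with a toy expression tree. This is only a sketch of the pattern, not Spark code: Expr, Literal and ToyCast are hypothetical stand-ins, and the session time zone is passed in as a plain string.

```scala
object ToyResolveTimeZone {
  sealed trait Expr
  case class Literal(value: String) extends Expr
  // A time-zone-aware node: `timeZoneId` stays empty until the rule fills it in.
  case class ToyCast(child: Expr, to: String, timeZoneId: Option[String] = None) extends Expr

  // Walk the tree and give every zone-less ToyCast the session time zone.
  // Nodes that already carry a zone keep it.
  def resolve(e: Expr, sessionZone: String): Expr = e match {
    case ToyCast(child, to, tz) =>
      ToyCast(resolve(child, sessionZone), to, tz.orElse(Some(sessionZone)))
    case other => other
  }
}
```

This mirrors why a plain cast(TimestampType) written by the user ends up using spark.sql.session.timeZone: the expression is created without a zone, and the analyzer rule supplies one later.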

In org.apache.spark.sql.catalyst.analysis.Analyzer#batches you can see that ResolveTimeZone is part of the Resolution batch:

  lazy val batches: Seq[Batch] = Seq(
    Batch("Resolution", fixedPoint,
      ResolveTableValuedFunctions ::
      ResolveRelations ::
      ResolveReferences ::
      ...
      ResolveTimeZone(conf) ::
      ResolvedUuidExpressions ::
      TypeCoercion.typeCoercionRules(conf) ++
      extendedResolutionRules : _*),
    Batch("Post-Hoc Resolution", Once, postHocResolutionRules: _*),
    Batch("View", Once,
      AliasViewChild(conf)),
    Batch("Nondeterministic", Once,
      PullOutNondeterministic),
    Batch("UDF", Once,
      HandleNullInputsForUDF),
    Batch("FixNullability", Once,
      FixNullability),
    Batch("Subquery", Once,
      UpdateOuterReferences),
    Batch("Cleanup", fixedPoint,
      CleanupAliases)
  )

Test Example

Understanding time zones

How a sql.Timestamp object appears under different time zones:

In GMT+8:

Timestamp "2014-06-24 07:22:15.0"
    - fastTime =
    - "2014-06-24T07:22:15.000+0700"

In GMT+7 it displays as follows; note that the millisecond value is the same:

Timestamp "2014-06-24 06:22:15.0"
    - fastTime =
    - "2014-06-24T06:22:15.000+0700"
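The same effect can be reproduced with java.time alone. This is a small sketch with hypothetical names; it builds the instant from the GMT+8 wall-clock time shown above and renders it in both zones.

```scala
import java.time.{Instant, ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter

object SameInstant {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Epoch milliseconds of 2014-06-24 07:22:15 as read on a GMT+8 clock.
  val epochMillis: Long =
    ZonedDateTime.of(2014, 6, 24, 7, 22, 15, 0, ZoneId.of("GMT+8")).toInstant.toEpochMilli

  // Wall-clock rendering of one epoch-millisecond value in a given zone.
  def render(millis: Long, zone: String): String =
    fmt.format(Instant.ofEpochMilli(millis).atZone(ZoneId.of(zone)))
}
```

Calling SameInstant.render(SameInstant.epochMillis, "GMT+8") and then the same with "GMT+7" yields two different strings for one unchanged millisecond value: only the rendering depends on the zone.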

 test("ColumnBatch") {
    val schema = StructType(
      Array(
        StructField("id", IntegerType, nullable = true),
        StructField("birth", DateType, nullable = true),
        StructField("time", TimestampType, nullable = true)
      ))
    // The capacity argument was lost in the source; 1024 is an assumed value.
    val columnarBatch = ColumnarBatch.allocate(schema, MemoryMode.ON_HEAP, 1024)
    val c0 = columnarBatch.column(0)
    val c1 = columnarBatch.column(1)
    val c2 = columnarBatch.column(2)
    // Row index 0 is assumed; the id value was lost in the source, 1 is a placeholder.
    c0.putInt(0, 1)
    // 1355241600 seconds / 3600 / 24 converts seconds to days
    c1.putInt(0, 1355241600 / 3600 / 24)
    // microseconds
    c2.putLong(0, 1355285532000000L)
    val internal0 = columnarBatch.getRow(0)
    // A way of converting an internal row to an unsafe row:
    //val convert = UnsafeProjection.create(schema)
    //val internal = convert.apply(internal0)
    val enc = RowEncoder.apply(schema).resolveAndBind()
    val row = enc.fromRow(internal0)
    val df = spark.createDataFrame(Lists.newArrayList(row), schema)

    TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
    val tsStr0 = df.select(col("time")).head().getTimestamp(0).toString
    val ts0 = df.select(col("time").cast(LongType)).head().getLong(0)

    TimeZone.setDefault(TimeZone.getTimeZone("GMT+8"))
    val tsStr1 = df.select(col("time")).head().getTimestamp(0).toString
    val ts1 = df.select(col("time").cast(LongType)).head().getLong(0)

    // The original assert(true, ...) calls always passed; compare the values directly.
    assert("2012-12-12 04:12:12.0" == tsStr0)
    assert("2012-12-12 12:12:12.0" == tsStr1)
    // After casting to long, the epoch values are identical.
    assert(ts0 == ts1)
  }
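The constants in the test can be sanity-checked outside Spark. The sketch below uses hypothetical names and assumes the date column is written as 1355241600 / 3600 / 24, as the comment in the test suggests. DateType holds days since 1970-01-01 and TimestampType holds microseconds since the epoch.

```scala
import java.time.{Instant, LocalDate, ZoneId}

object TestConstants {
  // Integer division truncates: 1355241600 s is 2012-12-11 16:00:00 UTC,
  // so the day count falls on 2012-12-11 rather than 2012-12-12.
  val birthDays: Int = 1355241600 / 3600 / 24
  val birth: LocalDate = LocalDate.ofEpochDay(birthDays.toLong)

  // Microseconds since the epoch, as stored in the time column.
  val timeMicros: Long = 1355285532000000L

  // Wall-clock time of `timeMicros` in the given zone, in ISO format.
  def wallClock(zone: String): String =
    Instant.ofEpochSecond(timeMicros / 1000000L)
      .atZone(ZoneId.of(zone)).toLocalDateTime.toString
}
```

The two wall-clock strings, 04:12:12 in UTC and 12:12:12 in GMT+8, match the strings asserted in the test, while the underlying microsecond value never changes.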

Bonus: ImplicitCastInputTypes

We defined our own Expr that requires two inputs of type DateType.

case class MockExpr(d0: Expression, d1: Expression)
  extends BinaryExpression with ImplicitCastInputTypes {
  override def left: Expression = d0
  override def right: Expression = d1
  override def inputTypes: Seq[AbstractDataType] = Seq(DateType, DateType)
  override def dataType: DataType = IntegerType
  override def nullSafeEval(date0: Any, date1: Any): Any = {
    ...
  }
}

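Suppose we make the following call, passing timestamps instead of dates. Does it behave as expected? Yes, it does, because of ImplicitCastInputTypes.

val tsc1 = lit("2012-11-12 12:12:12.0").cast(TimestampType)
val tsc2 = lit("2012-12-12 12:12:12.0").cast(TimestampType)
Column(MockExpr(tsc1.expr, tsc2.expr))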

org.apache.spark.sql.catalyst.analysis.TypeCoercion.ImplicitTypeCasts

case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
  val children: Seq[Expression] = e.children.zip(e.inputTypes).map { case (in, expected) =>
    // If we cannot do the implicit cast, just use the original input.
    implicitCast(in, expected).getOrElse(in)
  }
  e.withNewChildren(children)

def implicitCast(e: Expression, expectedType: AbstractDataType): Option[Expression] = {
  implicitCast(e.dataType, expectedType).map { dt =>
    if (dt == e.dataType) e else Cast(e, dt)
  }
}
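The coercion step above can be mimicked with a toy model. Everything here is hypothetical: types are plain strings, and the table of allowed implicit casts is an assumption made for the example, not Spark's actual rule set.

```scala
object ToyImplicitCast {
  sealed trait Expr { def dataType: String }
  case class Leaf(dataType: String) extends Expr
  case class ToyCast(child: Expr, dataType: String) extends Expr

  // Assumed toy table of (from, to) pairs that may be cast implicitly.
  private val allowed = Set("timestamp" -> "date", "string" -> "date")

  // Some(expr) when the input already matches or can be cast, None otherwise.
  def implicitCast(e: Expr, expected: String): Option[Expr] =
    if (e.dataType == expected) Some(e)
    else if (allowed(e.dataType -> expected)) Some(ToyCast(e, expected))
    else None

  // Pair each child with its expected type; keep the original input when no cast applies.
  def coerce(children: Seq[Expr], expected: Seq[String]): Seq[Expr] =
    children.zip(expected).map { case (in, exp) => implicitCast(in, exp).getOrElse(in) }
}
```

With two timestamp inputs and an expected signature of (date, date), both children come back wrapped in a cast to date, which is the shape of what happens to MockExpr's arguments.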

org.apache.spark.sql.catalyst.expressions.Cast#castToDate#DateConverter

private[this] def castToDate(from: DataType): Any => Any = from match {
  case StringType =>
    buildCast[UTF8String](_, s => DateTimeUtils.stringToDate(s).orNull)
  case TimestampType =>
    // throw valid precision more than seconds, according to Hive.
    // Timestamp.nanos is in 0 to 999,999,999, no more than a second.
    buildCast[Long](_, t => DateTimeUtils.millisToDays(t / 1000L, timeZone))
}
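The timestamp-to-date direction depends on the time zone as well. Below is a minimal java.time sketch of the same idea, with hypothetical names; it is not the DateTimeUtils implementation, only the equivalent calendar arithmetic.

```scala
import java.time.{Instant, ZoneId}

object MicrosToDays {
  // Days since 1970-01-01 of the calendar date that `micros` falls on in `zone`.
  def toDays(micros: Long, zone: String): Long =
    Instant.ofEpochSecond(Math.floorDiv(micros, 1000000L))
      .atZone(ZoneId.of(zone)).toLocalDate.toEpochDay
}
```

For 1355241600000000 microseconds (2012-12-11 16:00:00 UTC) this gives day 15685 in UTC but day 15686 in Asia/Shanghai, so the implicit cast applied to the timestamp arguments of MockExpr can produce different dates under different session time zones.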

Spark DateType cast pitfalls