What problem does flatMapGroupsWithState solve?

flatMapGroupsWithState was added to Spark Structured Streaming (supported since Spark 2.2.0) for the following reasons:

1) It can express aggregation logic (the work of agg) as a user-defined stateful function;

2) As of the latest release at the time of writing (Spark 2.3.2), Structured Streaming still does not support applying agg more than once to a streaming Dataset. flatMapGroupsWithState can stand in for agg, and it is also allowed before an agg when the sink runs in Append mode.

Note: although it may be used before an agg, the precondition is that the output (sink) mode is Append.
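What flatMapGroupsWithState buys you is the ability to keep arbitrary per-key state between micro-batches. Stripped of Spark, its contract is roughly (key, newValues, savedState) => output, with the state updated as a side effect. A minimal Spark-free sketch of the counting case (all names here are illustrative, not Spark APIs):

```scala
// Spark-free sketch of the per-key stateful count that the
// flatMapGroupsWithState walkthrough below performs. `updateKey`
// mirrors the (key, values, state) contract: fold the size of the
// current micro-batch into the count saved from earlier batches.
// All names are illustrative; this is not a Spark API.
object StatefulCountSketch {
  type DeviceId = Int
  type State    = Map[DeviceId, Long]

  def updateKey(deviceId: DeviceId, batchSize: Int, state: State): (State, Long) = {
    val newCount = state.getOrElse(deviceId, 0L) + batchSize
    (state.updated(deviceId, newCount), newCount)
  }

  def main(args: Array[String]): Unit = {
    val (s1, c1) = updateKey(deviceId = 3, batchSize = 2, state = Map.empty)
    val (_,  c2) = updateKey(deviceId = 3, batchSize = 1, state = s1)
    println(c1) // 2 -- first batch for device 3
    println(c2) // 3 -- the count survived into the second batch
  }
}
```

The point of the sketch: the state outlives any single batch, which is exactly what a chained agg cannot do for you in Append mode.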

A usage example of flatMapGroupsWithState (adapted from the web):

Reference: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KeyValueGroupedDataset-flatMapGroupsWithState.html

Note: the code below implements the equivalent of "select deviceId, count(0) as count from tbName group by deviceId".

1) Define a Signal entity class (on Spark 2.3.0):

scala> spark.version
res0: String = 2.3.0-SNAPSHOT

import java.sql.Timestamp
type DeviceId = Int
case class Signal(timestamp: java.sql.Timestamp, value: Long, deviceId: DeviceId)

2) Generate some test data with the Rate source (a randomized live stream) and inspect the execution plan:

// input stream
import org.apache.spark.sql.functions._
val signals = spark.
readStream.
format("rate").
option("rowsPerSecond", 1).
load.
withColumn("value", $"value" % 10). // <-- randomize the values (just for fun)
withColumn("deviceId", rint(rand() * 10) cast "int"). // <-- 10 devices randomly assigned to values
as[Signal] // <-- convert to our type (from "unpleasant" Row)
scala> signals.explain
== Physical Plan ==
*Project [timestamp#0, (value#1L % 10) AS value#5L, cast(ROUND((rand(4440296395341152993) * 10.0)) as int) AS deviceId#9]
+- StreamingRelation rate, [timestamp#0, value#1L]

3) Group the Rate source stream with groupByKey and implement the aggregation with flatMapGroupsWithState:

// stream processing using flatMapGroupsWithState operator
val device: Signal => DeviceId = { case Signal(_, _, deviceId) => deviceId }
val signalsByDevice = signals.groupByKey(device)

import org.apache.spark.sql.streaming.GroupState
type Key = Int
type Count = Long
type State = Map[Key, Count]
case class EventsCounted(deviceId: DeviceId, count: Long)
def countValuesPerKey(deviceId: Int, signalsPerDevice: Iterator[Signal], state: GroupState[State]): Iterator[EventsCounted] = {
val values = signalsPerDevice.toList
println(s"Device: $deviceId")
println(s"Signals (${values.size}):")
values.zipWithIndex.foreach { case (v, idx) => println(s"$idx. $v") }
println(s"State: $state")
// update the state with the count of elements for the key
val initialState: State = Map(deviceId -> 0)
val oldState = state.getOption.getOrElse(initialState)
// "newValue" to highlight that the state is for this key only
val newValue = oldState(deviceId) + values.size
val newState = Map(deviceId -> newValue)
state.update(newState)
// do not return signalsPerDevice here: it was already consumed by
// toList above, and iterators are one-pass, so returning it would
// (subtly) emit no elements
Iterator(EventsCounted(deviceId, newValue))
}
import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}

val signalCounter = signalsByDevice.flatMapGroupsWithState(
outputMode = OutputMode.Append,
timeoutConf = GroupStateTimeout.NoTimeout)(func = countValuesPerKey)
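The comment inside countValuesPerKey about one-pass iterators is worth demonstrating: once signalsPerDevice has been drained with toList, returning it from the function would emit zero rows. A Spark-free illustration:

```scala
// Iterators in Scala are one-pass: materializing one with toList
// exhausts it, and a second traversal yields nothing. This is why
// countValuesPerKey must build its result from the saved List rather
// than return the already-consumed signalsPerDevice iterator.
object IteratorPitfall {
  def main(args: Array[String]): Unit = {
    val it     = Iterator("a", "b", "c")
    val values = it.toList   // consumes the iterator
    println(values.size)     // 3
    println(it.hasNext)      // false -- nothing left to emit
    println(it.toList)       // List() -- a subtle source of empty output
  }
}
```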

4) Print the aggregation results with a Console sink:

import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
val sq = signalCounter.
writeStream.
format("console").
option("truncate", false).
trigger(Trigger.ProcessingTime(10.seconds)).
outputMode(OutputMode.Append).
start

5) Console output

The println calls inside countValuesPerKey interleave with the tables printed by the Console sink (the numeric values were blanked in the captured log below). The very first trigger fires before any rows arrive, so its table is empty:

...
-------------------------------------------
Batch:
-------------------------------------------
+--------+-----+
|deviceId|count|
+--------+-----+
+--------+-----+
...
// :: INFO StreamExecution: Streaming query made progress: {
  "id" : "a43822a6-500b-4f02-9133-53e9d39eedbf",
  "runId" : "79cb037e-0f28-4faf-a03e-2572b4301afe",
  "name" : null,
  "timestamp" : "2017-08-21T06:57:26.719Z",
  "batchId" : ,
  "numInputRows" : ,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "addBatch" : ,
    "getBatch" : ,
    "getOffset" : ,
    "queryPlanning" : ,
    "triggerExecution" : ,
    "walCommit" :
  },
  "stateOperators" : [ {
    "numRowsTotal" : ,
    "numRowsUpdated" : ,
    "memoryUsedBytes" :
  } ],
  "sources" : [ {
    "description" : "RateSource[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=8]",
    "startOffset" : null,
    "endOffset" : ,
    "numInputRows" : ,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "ConsoleSink[numRows=20, truncate=false]"
  }
}
// :: DEBUG StreamExecution: batch committed
...

In the first batch that carries data, every group's state is still undefined (GroupState(<undefined>)); from the next batch on, groups seen before carry their saved count (GroupState(Map( -> ))). Across batches the output looks like:

...
-------------------------------------------
Batch:
-------------------------------------------
Device:
Signals ():
. Signal(-- ::27.682,,)
State: GroupState(<undefined>)
Device:
Signals ():
. Signal(-- ::32.682,,)
. Signal(-- ::35.682,,)
State: GroupState(Map( -> ))
...
+--------+-----+
|deviceId|count|
+--------+-----+
|        |     |
|        |     |
|        |     |
+--------+-----+
...

(A "Streaming query made progress" JSON block like the one above follows every batch.)
// :: DEBUG StreamExecution: batch committed
...

In the end, stop the query; the state statistics are available through stateOperators:

sq.stop

scala> println(sq.lastProgress.stateOperators(0).prettyJson)
{
  "numRowsTotal" : ,
  "numRowsUpdated" : ,
  "memoryUsedBytes" :
}
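Since the numeric values in the captured console output were blanked, the behaviour can also be reproduced without Spark by folding micro-batches of deviceIds through a state map, mimicking what GroupState accumulates across triggers (an illustrative sketch, not a Spark API):

```scala
// Spark-free simulation of the query above: each element of `batches`
// plays the role of one micro-batch of deviceIds from the rate source,
// and the returned Map is the per-device count that GroupState
// accumulates across triggers.
object SimulateBatches {
  def run(batches: Seq[Seq[Int]]): Map[Int, Long] =
    batches.foldLeft(Map.empty[Int, Long]) { (state, batch) =>
      batch.foldLeft(state) { (s, id) =>
        s.updated(id, s.getOrElse(id, 0L) + 1L)
      }
    }

  def main(args: Array[String]): Unit = {
    // two micro-batches: devices 1,2,2 arrive first, then 2,3
    val finalState = run(Seq(Seq(1, 2, 2), Seq(2, 3)))
    finalState.toSeq.sortBy(_._1).foreach { case (id, n) =>
      println(s"device $id -> $n") // device 2 ends at 3: state carried over
    }
  }
}
```

Note how device 2's count reaches 3 only because the second batch starts from the state left by the first; that carry-over is exactly what a plain per-batch aggregation lacks.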
