Pulling data from Flume in Spark Streaming

I found the solution here:

https://issues.apache.org/jira/browse/SPARK-1729

What follows is purely my own understanding; if you spot a problem, please leave a comment.

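Flume itself does not support a publish/subscribe model the way Kafka does, which means Spark cannot simply go to Flume and pull data. So the developers came up with a clever workaround.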

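In Flume, a sink actively takes data from its channel. The idea is therefore to write a custom sink that listens on a port of its own, and to have Spark Streaming connect to that sink. Streaming then decides whether to fetch data and how often. The net effect is that Streaming pulls data from Flume, which is exactly the requirement.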
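The handshake can be pictured with a toy model. This is only a sketch, not Flume or Spark code: `ToyPollingSink` is a hypothetical class, the "channel" is a plain in-memory `Vector`, and timeouts and threading are left out. The method names `getEventBatch`, `ack` and `nack` mirror the calls described in the `SparkSink` documentation comment further down.

```scala
// Toy model of the pull-based handshake. None of this is a Flume or Spark API.
final class ToyPollingSink(initial: Seq[String]) {
  // Events waiting in the "channel".
  private var channel: Vector[String] = initial.toVector
  // Open "transactions", keyed by sequence number, holding the events handed out.
  private var inFlight: Map[Int, Vector[String]] = Map.empty
  private var nextSeq = 0

  // The consumer decides when to call this and how many events it wants.
  def getEventBatch(n: Int): (Int, Vector[String]) = {
    val (batch, rest) = channel.splitAt(n)
    channel = rest
    val seq = nextSeq
    nextSeq += 1
    inFlight += (seq -> batch)
    (seq, batch)
  }

  // Ack: commit the transaction, the events are gone for good.
  // An ack for an unknown (already rolled back) sequence number is ignored.
  def ack(seq: Int): Unit = inFlight -= seq

  // Nack: roll back, the events return to the head of the channel.
  def nack(seq: Int): Unit = {
    inFlight.get(seq).foreach(batch => channel = batch ++ channel)
    inFlight -= seq
  }

  // Number of events still waiting in the channel.
  def remaining: Int = channel.size
}
```

With three events in the channel, a consumer that requests two and then nacks sees the same two events again on its next request; only an ack removes them permanently. The sink never pushes anything on its own.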

A clever approach indeed. Still, if you really need publish/subscribe, I think you are better off using Kafka.

Finally, here is how to use it.

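First, compile the following code into a jar and deploy it in Flume. The code is reproduced from the Spark source; if it turns out to depend on helper classes and the like, look through the Scala files in the same directory.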

package org.apache.spark.streaming.flume.sink

import java.net.InetSocketAddress
import java.util.concurrent._

import org.apache.avro.ipc.NettyServer
import org.apache.avro.ipc.specific.SpecificResponder
import org.apache.flume.Context
import org.apache.flume.Sink.Status
import org.apache.flume.conf.{Configurable, ConfigurationException}
import org.apache.flume.sink.AbstractSink

/**
 * A sink that uses Avro RPC to run a server that can be polled by Spark's
 * FlumePollingInputDStream. This sink has the following configuration parameters:
 *
 * hostname - The hostname to bind to. Default: 0.0.0.0
 * port - The port to bind to. (No default - mandatory)
 * timeout - Time in seconds after which a transaction is rolled back,
 * if an ACK is not received from Spark within that time (Default: 60)
 * threads - Number of threads to use to receive requests from Spark (Default: 10)
 *
 * This sink is unlike other Flume sinks in the sense that it does not push data,
 * instead the process method in this sink simply blocks the SinkRunner the first time it is
 * called. This sink starts up an Avro IPC server that uses the SparkFlumeProtocol.
 *
 * Each time a getEventBatch call comes, creates a transaction and reads events
 * from the channel. When enough events are read, the events are sent to the Spark receiver and
 * the thread itself is blocked and a reference to it saved off.
 *
 * When the ack for that batch is received,
 * the thread which created the transaction is retrieved and it commits the transaction with the
 * channel from the same thread it was originally created in (since Flume transactions are
 * thread local). If a nack is received instead, the sink rolls back the transaction. If no ack
 * is received within the specified timeout, the transaction is rolled back too. If an ack comes
 * after that, it is simply ignored and the events get re-sent.
 *
 */
class SparkSink extends AbstractSink with Logging with Configurable {

  // Size of the pool to use for holding transaction processors.
  private var poolSize: Integer = SparkSinkConfig.DEFAULT_THREADS

  // Timeout for each transaction. If spark does not respond in this much time,
  // rollback the transaction
  private var transactionTimeout = SparkSinkConfig.DEFAULT_TRANSACTION_TIMEOUT

  // Address info to bind on
  private var hostname: String = SparkSinkConfig.DEFAULT_HOSTNAME
  private var port: Int = 0

  private var backOffInterval: Int = 200

  // Handle to the server
  private var serverOpt: Option[NettyServer] = None

  // The handler that handles the callback from Avro
  private var handler: Option[SparkAvroCallbackHandler] = None

  // Latch that blocks off the Flume framework from wasting 1 thread.
  private val blockingLatch = new CountDownLatch(1)

  override def start(): Unit = {
    logInfo("Starting Spark Sink: " + getName + " on port: " + port + " and interface: " +
      hostname + " with " + "pool size: " + poolSize + " and transaction timeout: " +
      transactionTimeout + ".")
    handler = Option(new SparkAvroCallbackHandler(poolSize, getChannel, transactionTimeout,
      backOffInterval))
    val responder = new SpecificResponder(classOf[SparkFlumeProtocol], handler.get)
    // Using the constructor that takes specific thread-pools requires bringing in netty
    // dependencies which are being excluded in the build. In practice,
    // Netty dependencies are already available on the JVM as Flume would have pulled them in.
    serverOpt = Option(new NettyServer(responder, new InetSocketAddress(hostname, port)))
    serverOpt.foreach(server => {
      logInfo("Starting Avro server for sink: " + getName)
      server.start()
    })
    super.start()
  }

  override def stop(): Unit = {
    logInfo("Stopping Spark Sink: " + getName)
    handler.foreach(callbackHandler => {
      callbackHandler.shutdown()
    })
    serverOpt.foreach(server => {
      logInfo("Stopping Avro Server for sink: " + getName)
      server.close()
      server.join()
    })
    blockingLatch.countDown()
    super.stop()
  }

  override def configure(ctx: Context): Unit = {
    import SparkSinkConfig._
    hostname = ctx.getString(CONF_HOSTNAME, DEFAULT_HOSTNAME)
    port = Option(ctx.getInteger(CONF_PORT)).
      getOrElse(throw new ConfigurationException("The port to bind to must be specified"))
    poolSize = ctx.getInteger(THREADS, DEFAULT_THREADS)
    transactionTimeout = ctx.getInteger(CONF_TRANSACTION_TIMEOUT, DEFAULT_TRANSACTION_TIMEOUT)
    backOffInterval = ctx.getInteger(CONF_BACKOFF_INTERVAL, DEFAULT_BACKOFF_INTERVAL)
    logInfo("Configured Spark Sink with hostname: " + hostname + ", port: " + port + ", " +
      "poolSize: " + poolSize + ", transactionTimeout: " + transactionTimeout + ", " +
      "backoffInterval: " + backOffInterval)
  }

  override def process(): Status = {
    // This method is called in a loop by the Flume framework - block it until the sink is
    // stopped to save CPU resources. The sink runner will interrupt this thread when the sink is
    // being shut down.
    logInfo("Blocking Sink Runner, sink will continue to run..")
    blockingLatch.await()
    Status.BACKOFF
  }

  private[flume] def getPort(): Int = {
    serverOpt
      .map(_.getPort)
      .getOrElse(
        throw new RuntimeException("Server was not started!")
      )
  }

  /**
   * Pass in a [[CountDownLatch]] for testing purposes. This latch is counted down when each
   * batch is received. The test can simply call await on this latch till the expected number of
   * batches are received.
   * @param latch the latch to count down
   */
  private[flume] def countdownWhenBatchReceived(latch: CountDownLatch): Unit = {
    handler.foreach(_.countDownWhenBatchAcked(latch))
  }
}

/**
 * Configuration parameters and their defaults.
 */
private[flume]
object SparkSinkConfig {
  val THREADS = "threads"
  val DEFAULT_THREADS = 10

  val CONF_TRANSACTION_TIMEOUT = "timeout"
  val DEFAULT_TRANSACTION_TIMEOUT = 60

  val CONF_HOSTNAME = "hostname"
  val DEFAULT_HOSTNAME = "0.0.0.0"

  val CONF_PORT = "port"

  val CONF_BACKOFF_INTERVAL = "backoffInterval"
  val DEFAULT_BACKOFF_INTERVAL = 200
}
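Once the jar is on the Flume agent's classpath, the sink still has to be declared in the agent's properties file. The fragment below is a minimal sketch: the agent name `agent`, the sink name `spark`, the channel name `memoryChannel` and the port `9999` are placeholders I chose for illustration, while the `hostname`, `port`, `timeout` and `threads` keys and their defaults come from `SparkSinkConfig` above.

```properties
# Hypothetical agent, sink and channel names; adapt them to your own agent definition.
agent.sinks = spark
agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark.channel = memoryChannel

# port is mandatory; 9999 is only an example value.
agent.sinks.spark.hostname = 0.0.0.0
agent.sinks.spark.port = 9999

# Optional, shown with the defaults from SparkSinkConfig.
agent.sinks.spark.timeout = 60
agent.sinks.spark.threads = 10
```

The host and port set here are the ones the Spark Streaming job must be given when it connects to the sink.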

Then use the following code in your Streaming application:

package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam
import java.net.InetSocketAddress

/**
 *  Produces a count of events received from Flume.
 *
 *  This should be used in conjunction with the Spark Sink running in a Flume agent. See
 *  the Spark Streaming programming guide for more details.
 *
 *  Usage: FlumePollingEventCount <host> <port>
 *    `host` is the host on which the Spark Sink is running.
 *    `port` is the port at which the Spark Sink is listening.
 *
 *  To run this example:
 *    `$ bin/run-example org.apache.spark.examples.streaming.FlumePollingEventCount [host] [port] `
 */
object FlumePollingEventCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println(
        "Usage: FlumePollingEventCount <host> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    val Array(host, IntParam(port)) = args

    val batchInterval = Milliseconds(2000)

    // Create the context and set the batch interval
    val sparkConf = new SparkConf().setAppName("FlumePollingEventCount")
    val ssc = new StreamingContext(sparkConf, batchInterval)

    // Create a flume stream that polls the Spark Sink running in a Flume agent
    val stream = FlumeUtils.createPollingStream(ssc, host, port)

    // Print out the count of events received from this server in each batch
    stream.count().map(cnt => "Received " + cnt + " flume events." ).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
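To compile a job like this outside the Spark source tree, the Flume integration module has to be on the classpath. A possible `build.sbt` fragment is sketched below, assuming an sbt build; the version string is a placeholder that you need to replace with the Spark release you actually run, and the module is only published for Spark releases that still ship Flume support.

```scala
// build.sbt fragment (sketch). Replace the placeholder with your Spark version.
val sparkVersion = "<your Spark version>"

libraryDependencies ++= Seq(
  // Core streaming API, normally provided by the cluster at runtime.
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  // Provides FlumeUtils.createPollingStream used in the example above.
  "org.apache.spark" %% "spark-streaming-flume" % sparkVersion
)
```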
