I came across the solution here:

https://issues.apache.org/jira/browse/SPARK-1729

What follows is purely my own understanding; if anything is wrong, please leave a comment.

Flume itself does not support Kafka-style publish/subscribe, which means Spark cannot simply go and pull data from Flume. So the Spark developers came up with a clever workaround.

In Flume, sinks actively pull data from the channel. So the trick is to write a custom sink that runs its own listener, have Spark Streaming connect to that sink first, and let Streaming decide whether to fetch data and how often. That effectively gives you exactly what we wanted: Streaming pulling data from Flume.

A genuinely clever trick. That said, if you really do need publish/subscribe, I would still just use Kafka...

Finally, here is how to use it.

First, compile the following code into a jar and deploy it with Flume. The code is reposted from here (if it turns out to depend on some helper classes, look for them in the Scala files in the same source directory).

package org.apache.spark.streaming.flume.sink

import java.net.InetSocketAddress
import java.util.concurrent._

import org.apache.avro.ipc.NettyServer
import org.apache.avro.ipc.specific.SpecificResponder
import org.apache.flume.Context
import org.apache.flume.Sink.Status
import org.apache.flume.conf.{Configurable, ConfigurationException}
import org.apache.flume.sink.AbstractSink

/**
 * A sink that uses Avro RPC to run a server that can be polled by Spark's
 * FlumePollingInputDStream. This sink has the following configuration parameters:
 *
 * hostname - The hostname to bind to. Default: 0.0.0.0
 * port - The port to bind to. (No default - mandatory)
 * timeout - Time in seconds after which a transaction is rolled back,
 * if an ACK is not received from Spark within that time
 * threads - Number of threads to use to receive requests from Spark (Default: 10)
 *
 * This sink is unlike other Flume sinks in the sense that it does not push data,
 * instead the process method in this sink simply blocks the SinkRunner the first time it is
 * called. This sink starts up an Avro IPC server that uses the SparkFlumeProtocol.
 *
 * Each time a getEventBatch call comes, creates a transaction and reads events
 * from the channel. When enough events are read, the events are sent to the Spark receiver and
 * the thread itself is blocked and a reference to it saved off.
 *
 * When the ack for that batch is received,
 * the thread which created the transaction is retrieved and it commits the transaction with the
 * channel from the same thread it was originally created in (since Flume transactions are
 * thread local). If a nack is received instead, the sink rolls back the transaction. If no ack
 * is received within the specified timeout, the transaction is rolled back too. If an ack comes
 * after that, it is simply ignored and the events get re-sent.
 *
 */
class SparkSink extends AbstractSink with Logging with Configurable {

  // Size of the pool to use for holding transaction processors.
  private var poolSize: Integer = SparkSinkConfig.DEFAULT_THREADS

  // Timeout for each transaction. If spark does not respond in this much time,
  // rollback the transaction
  private var transactionTimeout = SparkSinkConfig.DEFAULT_TRANSACTION_TIMEOUT

  // Address info to bind on
  private var hostname: String = SparkSinkConfig.DEFAULT_HOSTNAME
  private var port: Int = 0

  private var backOffInterval: Int = 200

  // Handle to the server
  private var serverOpt: Option[NettyServer] = None

  // The handler that handles the callback from Avro
  private var handler: Option[SparkAvroCallbackHandler] = None

  // Latch that blocks off the Flume framework from wasting 1 thread.
  private val blockingLatch = new CountDownLatch(1)

  override def start() {
    logInfo("Starting Spark Sink: " + getName + " on port: " + port + " and interface: " +
      hostname + " with " + "pool size: " + poolSize + " and transaction timeout: " +
      transactionTimeout + ".")
    handler = Option(new SparkAvroCallbackHandler(poolSize, getChannel, transactionTimeout,
      backOffInterval))
    val responder = new SpecificResponder(classOf[SparkFlumeProtocol], handler.get)
    // Using the constructor that takes specific thread-pools requires bringing in netty
    // dependencies which are being excluded in the build. In practice,
    // Netty dependencies are already available on the JVM as Flume would have pulled them in.
    serverOpt = Option(new NettyServer(responder, new InetSocketAddress(hostname, port)))
    serverOpt.foreach(server => {
      logInfo("Starting Avro server for sink: " + getName)
      server.start()
    })
    super.start()
  }

  override def stop() {
    logInfo("Stopping Spark Sink: " + getName)
    handler.foreach(callbackHandler => {
      callbackHandler.shutdown()
    })
    serverOpt.foreach(server => {
      logInfo("Stopping Avro Server for sink: " + getName)
      server.close()
      server.join()
    })
    blockingLatch.countDown()
    super.stop()
  }

  override def configure(ctx: Context) {
    import SparkSinkConfig._
    hostname = ctx.getString(CONF_HOSTNAME, DEFAULT_HOSTNAME)
    port = Option(ctx.getInteger(CONF_PORT)).
      getOrElse(throw new ConfigurationException("The port to bind to must be specified"))
    poolSize = ctx.getInteger(THREADS, DEFAULT_THREADS)
    transactionTimeout = ctx.getInteger(CONF_TRANSACTION_TIMEOUT, DEFAULT_TRANSACTION_TIMEOUT)
    backOffInterval = ctx.getInteger(CONF_BACKOFF_INTERVAL, DEFAULT_BACKOFF_INTERVAL)
    logInfo("Configured Spark Sink with hostname: " + hostname + ", port: " + port + ", " +
      "poolSize: " + poolSize + ", transactionTimeout: " + transactionTimeout + ", " +
      "backoffInterval: " + backOffInterval)
  }

  override def process(): Status = {
    // This method is called in a loop by the Flume framework - block it until the sink is
    // stopped to save CPU resources. The sink runner will interrupt this thread when the sink is
    // being shut down.
    logInfo("Blocking Sink Runner, sink will continue to run..")
    blockingLatch.await()
    Status.BACKOFF
  }

  private[flume] def getPort(): Int = {
    serverOpt
      .map(_.getPort)
      .getOrElse(
        throw new RuntimeException("Server was not started!")
      )
  }

  /**
   * Pass in a [[CountDownLatch]] for testing purposes. This latch is counted down when each
   * batch is received. The test can simply call await on this latch till the expected number of
   * batches are received.
   * @param latch
   */
  private[flume] def countdownWhenBatchReceived(latch: CountDownLatch) {
    handler.foreach(_.countDownWhenBatchAcked(latch))
  }
}

/**
 * Configuration parameters and their defaults.
 */
private[flume]
object SparkSinkConfig {
  val THREADS = "threads"
  val DEFAULT_THREADS = 10

  val CONF_TRANSACTION_TIMEOUT = "timeout"
  val DEFAULT_TRANSACTION_TIMEOUT = 60

  val CONF_HOSTNAME = "hostname"
  val DEFAULT_HOSTNAME = "0.0.0.0"

  val CONF_PORT = "port"

  val CONF_BACKOFF_INTERVAL = "backoffInterval"
  val DEFAULT_BACKOFF_INTERVAL = 200
}
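
For reference, a minimal sketch of the Flume agent configuration that wires this sink in might look like the snippet below, assuming the compiled jar (and its dependencies) is already on the Flume classpath. The agent name (agent1), channel name (memChannel), and the port are made-up placeholders for your own setup; the property names hostname, port, threads and timeout come straight from the sink's configure() method and SparkSinkConfig above.

# agent1, memChannel and the port number are placeholders for illustration only
agent1.channels = memChannel
agent1.sinks = sparkSink

agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 10000

# Fully qualified class name of the sink defined above
agent1.sinks.sparkSink.type = org.apache.spark.streaming.flume.sink.SparkSink
agent1.sinks.sparkSink.channel = memChannel
agent1.sinks.sparkSink.hostname = 0.0.0.0
agent1.sinks.sparkSink.port = 9999
# Optional - defaults come from SparkSinkConfig (threads = 10, timeout = 60 seconds)
agent1.sinks.sparkSink.threads = 10
agent1.sinks.sparkSink.timeout = 60

The sources (and their binding to memChannel) are omitted here; any regular Flume source should work, since the only unusual part of this agent is the sink.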


Then, in your Spark Streaming application, use code like the following:

package org.apache.spark.examples.streaming

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam

/**
 * Produces a count of events received from Flume.
 *
 * This should be used in conjunction with the Spark Sink running in a Flume agent. See
 * the Spark Streaming programming guide for more details.
 *
 * Usage: FlumePollingEventCount <host> <port>
 * `host` is the host on which the Spark Sink is running.
 * `port` is the port at which the Spark Sink is listening.
 *
 * To run this example:
 * `$ bin/run-example org.apache.spark.examples.streaming.FlumePollingEventCount [host] [port] `
 */
object FlumePollingEventCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println(
        "Usage: FlumePollingEventCount <host> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    val Array(host, IntParam(port)) = args

    val batchInterval = Milliseconds(2000)

    // Create the context and set the batch size
    val sparkConf = new SparkConf().setAppName("FlumePollingEventCount")
    val ssc = new StreamingContext(sparkConf, batchInterval)

    // Create a flume stream that polls the Spark Sink running in a Flume agent
    val stream = FlumeUtils.createPollingStream(ssc, host, port)

    // Print out the count of events received from this server in each batch
    stream.count().map(cnt => "Received " + cnt + " flume events." ).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
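
One note on the example: it only uses the single-address overload of FlumeUtils.createPollingStream, which is why the InetSocketAddress and StorageLevel imports look unused. If you run the Spark Sink in several Flume agents, FlumeUtils also accepts a list of addresses plus an explicit storage level. A minimal sketch, to be dropped into the same main method above (the host names and port are placeholders for your own agents):

    // Hypothetical hosts running the SparkSink defined earlier; replace with your own agents.
    val addresses = Seq(
      new InetSocketAddress("flume-host-1", 9999),
      new InetSocketAddress("flume-host-2", 9999))
    val multiStream = FlumeUtils.createPollingStream(
      ssc, addresses, StorageLevel.MEMORY_AND_DISK_SER_2)

Also remember that the driver needs the spark-streaming-flume artifact matching your Spark version on its compile classpath, otherwise FlumeUtils will not resolve.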

