有两种方式,一种是sparkstreaming中的driver起监听,flume来推数据;另一种是sparkstreaming按照时间策略轮训的向flume拉数据。

最开始我以为只有第一种方法,但是尼玛问题在于driver起来的结点是没谱的,所以每次我重启streaming后发现尼玛每次都要修改flume的sinks,蛋疼死了,后来才发现有后面的方法,好吧,把不同的方法代码写出来,其实变化不大。(代码转自官方的githup)

第一种,监听端口:

package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam /**
* Produces a count of events received from Flume.
*
* This should be used in conjunction with an AvroSink in Flume. It will start
* an Avro server on at the request host:port address and listen for requests.
* Your Flume AvroSink should be pointed to this address.
*
* Usage: FlumeEventCount <host> <port>
* <host> is the host the Flume receiver will be started on - a receiver
* creates a server and listens for flume events.
* <port> is the port the Flume receiver will listen on.
*
* To run this example:
* `$ bin/run-example org.apache.spark.examples.streaming.FlumeEventCount <host> <port> `
*/
object FlumeEventCount {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println(
"Usage: FlumeEventCount <host> <port>")
System.exit(1)
} StreamingExamples.setStreamingLogLevels() val Array(host, IntParam(port)) = args val batchInterval = Milliseconds(2000) // Create the context and set the batch size
val sparkConf = new SparkConf().setAppName("FlumeEventCount")
val ssc = new StreamingContext(sparkConf, batchInterval) // Create a flume stream
val stream = FlumeUtils.createStream(ssc, host, port, StorageLevel.MEMORY_ONLY_SER_2) // Print out the count of events received from this server in each batch
stream.count().map(cnt => "Received " + cnt + " flume events." ).print() ssc.start()
ssc.awaitTermination()
}
}

第二种是轮训主动向flume拿数据

package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam
import java.net.InetSocketAddress /**
* Produces a count of events received from Flume.
*
* This should be used in conjunction with the Spark Sink running in a Flume agent. See
* the Spark Streaming programming guide for more details.
*
* Usage: FlumePollingEventCount <host> <port>
* `host` is the host on which the Spark Sink is running.
* `port` is the port at which the Spark Sink is listening.
*
* To run this example:
* `$ bin/run-example org.apache.spark.examples.streaming.FlumePollingEventCount [host] [port] `
*/
object FlumePollingEventCount {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println(
"Usage: FlumePollingEventCount <host> <port>")
System.exit(1)
} StreamingExamples.setStreamingLogLevels() val Array(host, IntParam(port)) = args val batchInterval = Milliseconds(2000) // Create the context and set the batch size
val sparkConf = new SparkConf().setAppName("FlumePollingEventCount")
val ssc = new StreamingContext(sparkConf, batchInterval) // Create a flume stream that polls the Spark Sink running in a Flume agent
val stream = FlumeUtils.createPollingStream(ssc, host, port) // Print out the count of events received from this server in each batch
stream.count().map(cnt => "Received " + cnt + " flume events." ).print() ssc.start()
ssc.awaitTermination()
}
}

  

spark streaming中使用flume数据源的更多相关文章

  1. Spark Streaming中向flume拉取数据

    在这里看到的解决方法 https://issues.apache.org/jira/browse/SPARK-1729 请是个人理解,有问题请大家留言. 其实本身flume是不支持像KAFKA一样的发 ...

  2. Spark Streaming中的操作函数分析

    根据Spark官方文档中的描述,在Spark Streaming应用中,一个DStream对象可以调用多种操作,主要分为以下几类 Transformations Window Operations J ...

  3. Spark Streaming中的操作函数讲解

    Spark Streaming中的操作函数讲解 根据根据Spark官方文档中的描述,在Spark Streaming应用中,一个DStream对象可以调用多种操作,主要分为以下几类 Transform ...

  4. spark streaming中维护kafka偏移量到外部介质

    spark streaming中维护kafka偏移量到外部介质 以kafka偏移量维护到redis为例. redis存储格式 使用的数据结构为string,其中key为topic:partition, ...

  5. Spark Streaming中动态Batch Size实现初探

    本期内容 : BatchDuration与 Process Time 动态Batch Size Spark Streaming中有很多算子,是否每一个算子都是预期中的类似线性规律的时间消耗呢? 例如: ...

  6. flink和spark Streaming中的Back Pressure

    Spark Streaming的back pressure 在讲flink的back pressure之前,我们先讲讲Spark Streaming的back pressure.Spark Strea ...

  7. spark streaming中使用checkpoint

    从官方的Programming Guides中看到的 我理解streaming中的checkpoint有两种,一种指的是metadata的checkpoint,用于恢复你的streaming:一种是r ...

  8. Spark Streaming数据限流简述

      Spark Streaming对实时数据流进行分析处理,源源不断的从数据源接收数据切割成一个个时间间隔进行处理:   流处理与批处理有明显区别,批处理中的数据有明显的边界.数据规模已知:而流处理数 ...

  9. Apache Spark 2.2.0 中文文档 - Spark Streaming 编程指南 | ApacheCN

    Spark Streaming 编程指南 概述 一个入门示例 基础概念 依赖 初始化 StreamingContext Discretized Streams (DStreams)(离散化流) Inp ...

随机推荐

  1. Emmet插件使用方法总结

    Emmet插件使用方法总结 在前端开发的过程中,一大部分的工作是写 HTML.CSS 代码.特别是手动编写 HTML 代码的时候,效率会特别低下,因为需要敲打很多尖括号,而且很多标签都需要闭合标签等. ...

  2. metinfo首页内容简介

    http://www.hlbaozhuangji.cn/manage/content/other_info.php?anyid=31&lang=cn 首页内容简介: select * from ...

  3. 为自己的git添加alias,命令缩写

    在多人协作开发时,一般用git来进行代码管理.git有一些命令如:git pull . git push等等,这些命令可以设置alias,也就是缩写.如:git pull 是 git pl, git ...

  4. 用纯原生js实现jquery的ready函数(两种实现)

    第一种实现方式: var dom = new function() { var dom = []; dom.isReady = false; dom.isFunction = function(obj ...

  5. socket基本

    fd_set用法: http://blog.sina.com.cn/s/blog_5c8d13830100erzs.htm socket连接: lpszHost="127.0.0.1&quo ...

  6. [Effective JavaScript 笔记]第39条:不要重用父类的属性名

    假设想给上节讲的场景图库添加收集诊断信息的功能.这对于调试和性能分析很有用. 38条示例续 给每个Actor实例一个唯一的标识数. 添加标识数 function Actor(scene,x,y){ t ...

  7. characterCustomezition的资源打包代码分析

    using System.Collections.Generic; using System.IO; using UnityEditor; using UnityEngine; class Creat ...

  8. IOS 页面之间的跳转

    1.UINavigationController popToViewController 对应popViewControllerAnimated: 也可以使用: [self.navigationCon ...

  9. Centos 7 安装LAMP环境

    一.安装Centos 官网下载Centos 7刻录成光盘后安装 二.安装apache yum install httpd #根据提示,输入Y安装即可成功安装 systemctl start httpd ...

  10. sort如何按指定的列排序

    <1>[root@localhost company]# cat test 06d7            145             4192542506e1            ...