Spark Streaming Official Documentation Study Notes, Part 2

Accumulators and Broadcast Variables

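Accumulators and broadcast variables cannot be recovered from a checkpoint.

If you enable checkpointing and also use these variables, you need to create lazily instantiated singleton instances of them, so that they can be re-instantiated after the driver restarts.

An example follows: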

def getWordBlacklist(sparkContext):
    if ('wordBlacklist' not in globals()):
        globals()['wordBlacklist'] = sparkContext.broadcast(["a", "b", "c"])
    return globals()['wordBlacklist']

def getDroppedWordsCounter(sparkContext):
    if ('droppedWordsCounter' not in globals()):
        globals()['droppedWordsCounter'] = sparkContext.accumulator(0)
    return globals()['droppedWordsCounter']

def echo(time, rdd):
    # Get or register the blacklist Broadcast
    blacklist = getWordBlacklist(rdd.context)
    # Get or register the droppedWordsCounter Accumulator
    droppedWordsCounter = getDroppedWordsCounter(rdd.context)

    # Use blacklist to drop words and use droppedWordsCounter to count them
    def filterFunc(wordCount):
        if wordCount[0] in blacklist.value:
            droppedWordsCounter.add(wordCount[1])
            return False
        else:
            return True

    counts = "Counts at time %s %s" % (time, rdd.filter(filterFunc).collect())
    print(counts)

wordCounts.foreachRDD(echo)

DataFrame and SQL Operations

Creating a lazily instantiated singleton instance of SparkSession allows the application to recover from failures.

from pyspark.sql import Row, SparkSession

# Lazily instantiated global instance of SparkSession
def getSparkSessionInstance(sparkConf):
    if ('sparkSessionSingletonInstance' not in globals()):
        globals()['sparkSessionSingletonInstance'] = SparkSession\
            .builder\
            .config(conf=sparkConf)\
            .getOrCreate()
    return globals()['sparkSessionSingletonInstance']

...

# DataFrame operations inside your streaming program

words = ... # DStream of strings

def process(time, rdd):
    print("========= %s =========" % str(time))
    try:
        # Get the singleton instance of SparkSession
        spark = getSparkSessionInstance(rdd.context.getConf())

        # Convert RDD[String] to RDD[Row] to DataFrame
        rowRdd = rdd.map(lambda w: Row(word=w))
        wordsDataFrame = spark.createDataFrame(rowRdd)

        # Creates a temporary view using the DataFrame
        wordsDataFrame.createOrReplaceTempView("words")

        # Do word count on table using SQL and print it
        wordCountsDataFrame = spark.sql("select word, count(*) as total from words group by word")
        wordCountsDataFrame.show()
    except:
        pass

words.foreachRDD(process)

MLlib Operations

MLlib provides streaming machine learning algorithms (e.g. Streaming Linear Regression, Streaming KMeans, etc.).

These can simultaneously learn from the streaming data and apply the model to the streaming data.

For a much larger class of machine learning algorithms, you can train a model offline (i.e. using historical data) and then apply it online to streaming data.

Caching / Persistence

DStreams also allow developers to persist the stream's data in memory.

Calling the persist() method on a DStream automatically persists every RDD of that DStream in memory.

For window-based operations like reduceByWindow and reduceByKeyAndWindow and state-based operations like updateStateByKey, this is implicitly true: DStreams generated by these operations are persisted automatically, without calling persist().

For input streams that receive data over the network (such as Kafka, Flume, sockets, etc.), the default persistence level replicates the data to two nodes for fault tolerance.

For a plain RDD, you can mark it to be persisted using its persist() or cache() methods.

Storage levels are set by passing a StorageLevel object to persist().

The cache() method is shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY.

Unlike RDDs, the default persistence level of DStreams keeps the data serialized in memory.

Checkpointing

Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system such that it can recover from failures. There are two types of data that are checkpointed.

Metadata checkpointing - Saving of the information defining the streaming computation to fault-tolerant storage like HDFS. This is used to recover from failure of the node running the driver of the streaming application. Metadata includes:

Configuration - The configuration that was used to create the streaming application.

DStream operations - The set of DStream operations that define the streaming application.

Incomplete batches - Batches whose jobs are queued but have not completed yet.

Data checkpointing - Saving of the generated RDDs to reliable storage.

When to enable Checkpointing

Usage of stateful transformations - If either updateStateByKey or reduceByKeyAndWindow (with inverse function) is used in the application, then the checkpoint directory must be provided to allow for periodic RDD checkpointing.
Recovering from failures of the driver running the application - Metadata checkpoints are used to recover with progress information.
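
As a rough illustration of a stateful transformation that needs a checkpoint directory, the sketch below defines an update function of the shape updateStateByKey expects: it receives the new values for one key in the current batch plus the previous state, which is None the first time the key is seen. The name update_running_count is hypothetical, and the commented-out wiring assumes a pair DStream called pairs and a StreamingContext called ssc.

```python
def update_running_count(new_values, running_count):
    """Combine this batch's values for one key with the previous state."""
    # running_count is None the first time a key appears.
    return sum(new_values) + (running_count or 0)


# Hypothetical wiring inside a streaming program (not executed here):
#   ssc.checkpoint(checkpointDirectory)   # required for stateful operations
#   runningCounts = pairs.updateStateByKey(update_running_count)
```

Because the state for every key depends on all earlier batches, periodic RDD checkpointing is what keeps the lineage from growing without bound.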

How to configure Checkpointing

Checkpointing can be enabled by setting a directory in a fault-tolerant, reliable file system (e.g., HDFS, S3, etc.) to which the checkpoint information will be saved.

This is done by calling streamingContext.checkpoint(checkpointDirectory).

If you want the application to recover from driver failures, you should rewrite your streaming application to have the following behavior:

When the program is being started for the first time, it will create a new StreamingContext, set up all the streams and then call start().
When the program is being restarted after failure, it will re-create a StreamingContext from the checkpoint data in the checkpoint directory.

This behavior is made simple by using StreamingContext.getOrCreate:

# Function to create and setup a new StreamingContext
def functionToCreateContext():
    sc = SparkContext(...)   # new context
    ssc = StreamingContext(...)
    lines = ssc.socketTextStream(...) # create DStreams
    ...
    ssc.checkpoint(checkpointDirectory)   # set checkpoint directory
    return ssc

# Get StreamingContext from checkpoint data or create a new one
context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext)

# Do additional setup on context that needs to be done,
# irrespective of whether it is being started or restarted
context. ...

# Start the context
context.start()
context.awaitTermination()

If checkpointDirectory exists, the context is re-created from the checkpoint data.

If the directory does not exist, functionToCreateContext is called to create a new context.

You can also explicitly create a StreamingContext from the checkpoint data and start the computation by using

StreamingContext.getOrCreate(checkpointDirectory, None).

In addition to using getOrCreate, you also need to ensure that the driver process gets restarted automatically on failure. This is further discussed in the Deployment section.

At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput.

The default interval is a multiple of the batch interval that is at least 10 seconds.

It can be set by using dstream.checkpoint(checkpointInterval).

Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
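
The interval rules above can be worked through with a little arithmetic. This is only a sketch: the helper names are hypothetical, and it reads "a multiple of the batch interval that is at least 10 seconds" as the smallest such multiple, which is an assumption about the exact rounding. Intervals are in milliseconds to avoid floating-point rounding.

```python
def default_checkpoint_interval_ms(batch_interval_ms, minimum_ms=10000):
    """Smallest multiple of the batch interval that is at least minimum_ms."""
    if batch_interval_ms <= 0:
        raise ValueError("batch interval must be positive")
    multiples = -(-minimum_ms // batch_interval_ms)  # ceiling division
    return multiples * batch_interval_ms


def suggested_checkpoint_range_ms(sliding_interval_ms):
    """The 5 - 10 sliding-interval range suggested above, in milliseconds."""
    return 5 * sliding_interval_ms, 10 * sliding_interval_ms


for batch_ms in (500, 1000, 3000, 15000):
    print(batch_ms, default_checkpoint_interval_ms(batch_ms))
```

For a 3 second batch interval, for example, this gives 12 seconds, since 9 seconds would fall short of the 10 second minimum.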

Deploying Applications

Requirements

Cluster with a cluster manager
Package the application JAR - If you are using spark-submit to start the application, then you will not need to provide Spark and Spark Streaming in the JAR. However, if your application uses advanced sources (e.g. Kafka, Flume), then you will have to package the extra artifact they link to, along with their dependencies, in the JAR that is used to deploy the application.
Configuring sufficient memory for the executors - Note that if you are doing 10 minute window operations, the system has to keep at least the last 10 minutes of data in memory. So the memory requirements for the application depend on the operations used in it.
Configuring checkpointing
Configuring automatic restart of the application driver - How this is done depends on the cluster manager:

Spark Standalone - The Standalone cluster manager can be instructed to supervise the driver, and relaunch it if the driver fails either due to a non-zero exit code or due to failure of the node running the driver.
YARN - YARN supports automatically restarting an application.
Mesos - Marathon has been used to achieve this with Mesos.

Configuring write ahead logs - If enabled, all the data received from a receiver gets written into a write ahead log in the configuration checkpoint directory.
Setting the max receiving rate
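
Several of the requirements above end up as spark-submit options. The sketch below assembles such a command line for a Standalone cluster: --supervise asks the Standalone manager to relaunch a failed driver, and the two --conf entries use the Spark properties spark.streaming.receiver.writeAheadLog.enable and spark.streaming.receiver.maxRate. The function name, JAR path, main class, and master URL are all hypothetical placeholders.

```python
def build_submit_command(app_jar, main_class, master_url, max_rate_per_receiver):
    """Assemble a spark-submit command line as a list of arguments."""
    conf = {
        # Write received data to a write ahead log in the checkpoint directory.
        "spark.streaming.receiver.writeAheadLog.enable": "true",
        # Upper bound on records per second for each receiver.
        "spark.streaming.receiver.maxRate": str(max_rate_per_receiver),
    }
    command = [
        "spark-submit",
        "--class", main_class,
        "--master", master_url,
        "--deploy-mode", "cluster",
        "--supervise",  # Standalone: relaunch the driver if it fails
    ]
    for key, value in sorted(conf.items()):
        command += ["--conf", "%s=%s" % (key, value)]
    command.append(app_jar)
    return command


# Placeholder values for illustration only.
print(" ".join(build_submit_command(
    "streaming-app.jar", "com.example.StreamingApp",
    "spark://master-host:7077", 1000)))
```

Building the command as a list keeps each argument separate, so it could be passed to subprocess.run without shell quoting concerns.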

Upgrading Application Code

There are two mechanisms for upgrading application code:

The upgraded application is started and run in parallel with the existing one. Once the new one (receiving the same data as the old one) has been warmed up and is ready for prime time, the old one can be brought down. This requires data sources that can send data to two destinations.
The existing application is shut down gracefully, that is, it stops only after all received data has been completely processed. Then the upgraded application can be started, which will start processing from the same point where the earlier application left off. This requires a data source that can buffer its data.
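
For the graceful-shutdown mechanism, a minimal sketch in PySpark might look like the following. The wrapper name stop_gracefully is hypothetical; it relies on StreamingContext.stop accepting the keyword arguments stopSparkContext and stopGraceFully (PySpark's spelling of the latter).

```python
def stop_gracefully(streaming_context):
    """Stop a StreamingContext only after received data has been processed."""
    # PySpark spells this keyword "stopGraceFully".
    streaming_context.stop(stopSparkContext=True, stopGraceFully=True)
```

The function only needs an object with a compatible stop method, so it can be exercised with a stub in tests before being pointed at a real context.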

Monitoring Applications

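The streaming tab of the Spark web UI shows statistics for a running application. For a driver running locally on the default port, it is at http://localhost:4040/streaming/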

Performance Tuning

Goals and approaches:

Reducing the processing time of each batch of data by efficiently using cluster resources.
Setting the right batch size such that the batches of data can be processed as fast as they are received (that is, data processing keeps up with the data ingestion).
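
To make the second point concrete, here is a small simulation, offered as a sketch under simplifying assumptions: batches are processed one at a time, in order, and a batch becomes ready exactly one batch interval after the previous one. The function name scheduling_delays is hypothetical. If processing keeps up, the delay stays at zero; if each batch takes longer than the interval, the delay grows without bound.

```python
def scheduling_delays(processing_times_s, batch_interval_s):
    """Seconds each batch waits in the queue before its processing starts."""
    delays = []
    delay = 0.0
    for processing_time in processing_times_s:
        delays.append(delay)
        # The next batch becomes ready one interval later; it waits for
        # whatever backlog remains once this batch has finished.
        delay = max(0.0, delay + processing_time - batch_interval_s)
    return delays


print(scheduling_delays([1, 1, 1], 2))  # keeps up with a 2 s interval
print(scheduling_delays([3, 3, 3], 2))  # falls behind with a 2 s interval
```

A steadily increasing delay like the second case is the signal that either the processing time must be reduced or the batch interval increased.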

Level of Parallelism in Data Receiving

Level of Parallelism in Data Processing
