http://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/41378.pdf

 

为什么要做MillWheel?

因为当前的其他的流式系统,无法同时满足 fault tolerance, versatility, and scalability 的需求。

Spark Streaming [34] and Sonora [32] do excellent jobs of efficient checkpointing, but limit the space of operators that are available to user code.

S4 [26] does not provide fully fault-tolerant persistent state

Storm’s [23] exactly-once mechanism for record delivery, Trident [22], requires strict transaction ordering to operate.

Streaming SQL systems [1] [2] [5] [6] [21] [24] provide succinct and simple solutions to many streaming problems, but intuitive state abstractions and complex application logic (e.g. matrix multiplication) are more naturally expressed using the operational flow of an imperative language rather than a declarative language like SQL.

Note,imperative language, declarative language, function language。refer:http://stackoverflow.com/questions/1784664/what-is-the-difference-between-declarative-and-imperative-programming

 

具体需求

先描述应用场景,

Google’s Zeitgeist pipeline is used to track trends in web queries.
This pipeline ingests a continuous input of search queries and performs anomaly detection, outputting queries which are spiking or dipping as quickly as possible.

Google’s Zeitgeist 这个服务用于 track web 查询的趋势的,对持续的 search queries 进行 anomaly detection,尽可能快的发现spiking or dipping。

架构如下,

Our approach is to bucket records into one-second intervals and to compare the actual traffic for each time bucket to the expected traffic that the model predicts.
If these quantities are consistently different over a non-trivial number of buckets, then we have high confidence that a query is spiking or dipping.
In parallel, we update the model with the newly received data and store it for future use.

 

场景中关键的几点,

Persistent Storage: It is important to note that this implementation requires both short- and long-term storage.
A spike may only last a few seconds, and thus depend on state from a small window of time, whereas model data can correspond to months of continuous updates.

 

LowWatermarks:
在现实的场景中,网络环境是很复杂的,当一个时间点出现dipping的时候,有两种可能性,
真正的dipping,这个点query确实变少了
由于网络或其他问题,数据被delay了,还没有收到

那么自然产生的问题,我如何知道这个时间点的数据是否到齐?

MillWheel addresses this by providing a low watermark for incoming data for each processing stage (e.g. Window Counter, Model Calculator), which indicates that all data up to a given timestamp has been received.

MillWheel提供 low watermark机制来告诉你什么时候数据会到齐。

当然low watermark往往也是启发式得到的,其实并不能完美的解这个问题,只能说如果过了 low watermark 还没有数据来,我们有 high confidence 来说应该是没有数据,而不是被delay

 

Duplicate Prevention: For Zeitgeist, duplicate record deliveries could cause spurious spikes.
我们要在平台层面保证exactly-once

 

整理出的详细需求如下:

• Data should be available to consumers as soon as it is published (i.e. there are no system-intrinsic barriers to ingesting inputs and providing output data). 比如micro-batch就是种 system-intrinsic barriers
• Persistent state abstractions should be available to user code, and should be integrated into the system’s overall consistency model.
• Out-of-order data should be handled gracefully by the system. 可以处理时间乱序的数据
• A monotonically increasing low watermark of data timestamps should be computed by the system. 系统会生成 low watermarker
• Latency should stay constant as the system scales to more machines. 保证 latency
• The system should provide exactly-once delivery of records. 保证 exactly-once 语义

 

SYSTEM OVERVIEW

Abstractly, inputs and outputs in MillWheel are represented by (key, value, timestamp) triples.

 

Computations,等同于Bolt
Application logic lives in computations, which encapsulate arbitrary user code.

 

Keys
Keys are the primary abstraction for aggregation and comparison between different records in MillWheel.
For every record in the system, the consumer specifies a key extraction function, which assigns a key to the record.

 

注意在,millwhell中,相同key的record是被串行处理的,只有不同key的record才可以被并行处理

 

Streams,等同于Storm里面的流
Streams are the delivery mechanism between different computations in MillWheel.

 

Persistent State
In its most basic form, persistent state in MillWheel is an opaque byte string that is managed on a per-key basis.
The user provides serialization and deserialization routines (such as translating a rich data structure in and out of its wire format), for which a variety of convenient mechanisms (e.g. Protocol Buffers [13]) exist.
Persistent state is backed by a replicated, highly available data store (e.g. Bigtable [7] or Spanner [9]), which ensures data integrity in a way that is completely transparent to the end user.
Common uses of state include counters aggregated over windows of records and buffered data for a join.

这里persistent state,可以认为是checkpoint,注意,MillWheel的checkpoint是 per-key basis的,可以在MillWheel起到很关键的作用
用户需要提供序列号和反序列化的逻辑,这些checkpoint往往被存到像bigtable这样的分布式存储中
往往像有状态的computation就需要存persistent state,比如基于窗口的聚合计数,或流join

 

Low Watermarks

对于computation,当给定low watermark,就不应该收到比它还早的数据

Definition: We provide a recursive definition of low watermarks based on a pipeline’s data flow.

min(oldest work of A, low watermark of C : C outputs to A)

oldest work of A,是A中最老的record的时间戳
而C是A的父节点,那么A的low watermark不可能比C迟,因为A一定比C迟收到数据,所以A的low watermark一定是小于等于C的low watermark的

这样递归的结果是,最终low watermark会取决于injector(即,源),而对于injector的input,肯定是外部系统比如kafka这样的队列,或文件系统,那么injector怎么知道它的low watermark

injector其实是不知道的,只能做estimate,比如对于文件系统,可以以文件的create时间作为low watermark,文件里面一定不会有比create time更早的记录

所以low watermark机制,是无法完美解这个问题的,都会有too fast,too late的问题

 

Timers,即trigger,解决when的问题

A simple implementation of dips in Zeitgeist would set a low watermark timer for the end of a given time bucket, and report a dip if the observed traffic falls well below the model’s prediction.

 

FAULT TOLERANCE

终于到了关键的地方了,

Delivery Guarantees

Exactly-Once Delivery

MillWheel是如何保证exactly-once语义的,

Upon receipt of an input record for a computation, the MillWheel framework performs the following steps:

• The record is checked against deduplication data from previous deliveries; duplicates are discarded.
• User code is run for the input record, possibly resulting in pending changes to timers, state, and productions.
• Pending changes are committed to the backing store.
• Senders are ACKed.
• Pending downstream productions are sent.

两点需要注意的,

一是,它会去重,这样可以保证exactly-once,如何去后面说
其实一般的streaming系统都可以做到at-least once,所以做到exactly-once,只需要做到去重即可
你可以依赖外部存储,或者系统里面直接做掉

二是,对中间状态做checkpoint

MillWheel如何在系统层面做去重,

The system assigns unique IDs to all records at production time.
We identify duplicate records by including this unique ID for the record in the same atomic write as the state modification.
If the same record is later retried, we can compare it to the journaled ID, and discard and ACK the duplicate.

通过为每个record增加unique id

为了快速知道这个id是否出现过,使用bloom filter

Since we cannot necessarily store all duplication data in-memory, we maintain a Bloom filter of known record fingerprints, to provide a fast path for records that we have provably never seen before.

 

如果filter miss,我们需要读后端存储才能判断是否是duplicate

In the event of a filter miss, we must read the backing store to determine whether a record is a duplicate.

这个怎么实现?怎么判断是filter miss,还是新出现的record?出现duplicate毕竟不是经常发生的

 

为了防止record id爆掉,需要回收,有个问题?回收后,bloom filter需要重新初始化吗,还是说bloom filter本身是支持过期的

Record IDs for past deliveries are garbage collected after MillWheel can guarantee that all internal senders have finished retrying.

 

Strong Productions

We checkpoint produced records before delivery in the same atomic write as state modification.
We call this pattern of checkpointing before record production strong productions.

这部用以保证at-least once,storm是通过spout超时重发的,后续的系统很少继续沿用这个方式,因为这样做周期太长

Millwheel或Linkedin的Samza都是采用local重发的方式,比如MillWheel,在produce record之前,会把checkpoint和状态修改放在一个原子写中做掉,checkpoint往往写入bigtable中
当然下层节点,成功处理完该record,会send回acker,这时,我们可以把checkpoint删除

如果这时crash,我们可以come back时,从checkpoint中读出record,重新produce

如果之前不做checkpoint,当come back时,会以当前状态(比如计数,有可能新到数据已产生更新)来produce,这样就会产生不一致

另外区别于persistent state,这里checkpoint特指produced record

 

Weak Productions and Idempotency

MillWheel通过 record id 和 Strong Production 来保证 exactly-once 语义,这其中也是有很多代价的,有些场景不需要保证exactly-once,at-least onces就足够了,比如很多无状态的场景

所以他提供Weak production来满足这种需求。

不需要保证exactly-once,就不去重就ok了,disabling exactly-once can be accomplished simply by skipping the deduplication pass

是不是checkpoint produced records也可以完全去掉了,直接produce,然后等ack,失败或超时就重发,那这样就和storm一样了,链路长的时候,周期会很长

MillWheel提供的优化就是 weak productions,

比如,对于A-》B-》C的链路

B-》C的produce,超过1s还没有返回

我们这时候,对该produce进行checkpoint,然后直接ack A,避免A继续等待

当然B会继续等待,直到收到C的ack,才将该checkpoint删除

如果此时B Crash,那么当B restart,他会自己去replay上次的produce,对A透明,直到成功,才会删除checkpoint

 

State Manipulation

In implementing mechanisms to manipulate user state in MillWheel, we discuss both the “hard” state that is persisted to our backing store and the “soft” state which includes any in-memory caches or aggregates.
We must satisfy the following user-visible guarantees:

• The system does not lose data.
• Updates to state must obey exactly-once semantics.
• All persisted data throughout the system must be consistent at any given point in time.
• Low watermarks must reflect all pending state in the system.
• Timers must fire in-order for a given key.

首先,为了避免不一致,所有per-key的操作,包含persist,checkpoint,状态更新,都会在一个原子写中完成
To avoid inconsistencies in persisted state (e.g. between timers, user state, and production checkpoints), we wrap all per-key updates in a single atomic operation.

 

再者,对于僵尸writer或由于网络延迟导致的延迟写,采用sequencer的方式,每个写都有sequence id,过期的写请求会被丢弃;并且在新的workers启动时需要invalid之前的sequencers

As work may shift between machines (due to load balancing, failures, or other reasons) a major threat to our data consistency is the possibility of zombie writers and network remnants issuing stale writes to our backing store.
To address this possibility, we attach a sequencer token to each write, which the mediator of the backing store checks for validity before allowing the write to commit.
New workers invalidate any extant sequencers before starting work, so that no remnant writes can succeed thereafter.

所以,对于MillWheel,对于一个给定的key,只能有一个worker writer有权限执行写操作,这个是MillWheel保证写一致性的关键

Thus, we can guarantee that, for a given key, only a single worker can write to that key at a particular point in time.

In order to quickly recover from unplanned process failures, each computation worker in MillWheel can checkpoint its state at an arbitrarily fine granularity (in practice, sub-second or per-record granularity is standard, depending on input volume). Our use of always-consistent soft state allows us to minimize the number of occasions when we must scan these checkpoints to specific cases – machine failures or load-balancing events.
When we do perform scans, these can often be asynchronous, allowing the computation to continue processing input records while the scan progresses.

MillWheel: Fault-Tolerant Stream Processing at Internet Scale的更多相关文章

  1. Stream Processing 101: From SQL to Streaming SQL in 10 Minutes

    转自:https://wso2.com/library/articles/2018/02/stream-processing-101-from-sql-to-streaming-sql-in-ten- ...

  2. Apache Samza - Reliable Stream Processing atop Apache Kafka and Hadoop YARN

    http://engineering.linkedin.com/data-streams/apache-samza-linkedins-real-time-stream-processing-fram ...

  3. [转]Amazon DynamoDB – a Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications

    This article is from blog of Amazon CTO Werner Vogels. -------------------- Today is a very exciting ...

  4. 腾讯大数据平台Oceanus: A one-stop platform for real time stream processing powered by Apache Flink

    January 25, 2019Use Cases, Apache Flink The Big Data Team at Tencent     In recent years, the increa ...

  5. Stream processing with Apache Flink and Minio

    转自:https://blog.minio.io/stream-processing-with-apache-flink-and-minio-10da85590787 Modern technolog ...

  6. 13 Stream Processing Patterns for building Streaming and Realtime Applications

    原文:https://iwringer.wordpress.com/2015/08/03/patterns-for-streaming-realtime-analytics/ Introduction ...

  7. 1.1 Introduction中 Kafka for Stream Processing官网剖析(博主推荐)

    不多说,直接上干货! 一切来源于官网 http://kafka.apache.org/documentation/ Kafka for Stream Processing kafka的流处理 It i ...

  8. Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing

    https://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf Discretized Streams: A Fault-Tol ...

  9. Storm(2) - Log Stream Processing

    Introduction This chapter will present an implementation recipe for an enterprise log storage and a ...

随机推荐

  1. vijos 1038 括号+路径 ***

    链接:点我 就是自己写不出来 #include <cstdio> #include <climits> #include <memory.h> using name ...

  2. C++的那些事:容器和泛型算法

    一.顺序容器 1,标准库定义了3种类型的顺序容器:vector.list和deque.它们的差别主要在于访问元素的方式,以及添加或删除元素相关操作运算代价.标准库还提供了三种容器适配器:stack.q ...

  3. 网站构建学习笔记(0)——基本概念了解及资源学习(copy自w3school)

    一.学习方面 1.WWW - 万维网 什么是 WWW? WWW 指万维网(World Wide Web) 万维网常被称为Web Web 是由遍布全球的计算机所组成的网络 所有 Web 中的计算机都可以 ...

  4. 建模算法(五)——图与网络

    (一)图与网络的基本概念 一.无向图 含有的元素为顶点,弧和权重,但是没有方向 二.有向图 含有的元素为顶点,弧和权重,弧具有方向. 三.有限图.无限图 顶点和边有限就是有限图,否则就是无限图. 四. ...

  5. Web安全测试之XSS

    Web安全测试之XSS XSS 全称(Cross Site Scripting) 跨站脚本攻击, 是Web程序中最常见的漏洞.指攻击者在网页中嵌入客户端脚本(例如JavaScript), 当用户浏览此 ...

  6. General part中方向选取的作用

    这个方向是零部件坐标系 part coordinate system, 也叫local part reference frame. 这个方向要注意, 因为质心的方位由它决定,同时下面的输入的转动惯量就 ...

  7. object-c [self class] 和 [self _cmd]

    [self class] 返回当前类名 [self _cmd] 返回当前方法名 self 是类的隐藏的参数,指向当前当前调用方法的类 另一个隐藏参数是_cmd,代表当前类方法的selector

  8. 【wikioi】1403 新三国争霸(dp+kruskal)

    http://wikioi.com/problem/1403/ 一开始的确感觉和bzoj1003很像,不同的是这里还要求联通,求最小的边. 我们可以想到用最小生成树(为嘛我自己想不到呢..) 我们可以 ...

  9. Mina传递对象

    利用Apache MINA来传递对象,这对了MINA来说非常容易,并且这也是Java网络编程中很常用的应用. 首先看两个用来传递的Java对象MyRequestObject和MyResponseObj ...

  10. (转)微信公众平台开发之基于百度 BAE3.0 的开发环境搭建(采用 Baidu Eclipse)

    原文传送门(http://blog.csdn.net/bingtianxuelong/article/details/17843111) 版本说明:     V1:         2014-2-13 ...