FlinkCEP - Complex event processing for Flink
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/libs/cep.html
首先目的是匹配pattern sequence
pattern Sequence是由多个pattern构成
DataStream<Event> input = ...
Pattern<Event, ?> pattern = Pattern.<Event>begin("start").where(
new SimpleCondition<Event>() {
@Override
public boolean filter(Event event) {
return event.getId() == 42;
}
}
).next("middle").subtype(SubEvent.class).where(
new SimpleCondition<Event>() {
@Override
public boolean filter(SubEvent subEvent) {
return subEvent.getVolume() >= 10.0;
}
}
).followedBy("end").where(
new SimpleCondition<Event>() {
@Override
public boolean filter(Event event) {
return event.getName().equals("end");
}
}
);
PatternStream<Event> patternStream = CEP.pattern(input, pattern);
DataStream<Alert> result = patternStream.select(
new PatternSelectFunction<Event, Alert> {
@Override
public Alert select(Map<String, List<Event>> pattern) throws Exception {
return createAlertFrom(pattern);
}
}
});
如例子中,这个pattern Sequence由3个pattern组成,begin,next,followedBy
pattern Sequence的第一个pattern都是begin
每个pattern都需要有一个唯一的名字,比如这里的start,middle,end
每个pattern也可以设置condition,比如where
Pattern
Pattern可以分为两种,Individual Patterns,Complex Patterns.
对于individual patterns,又可以分为singleton pattern, or a looping one
通俗点,singleton pattern指出现一次,而looping指可能出现多次,在有限自动机里面匹配相同的pattern就形成looping
比如,对于a b+ c? d
b+就是looping,而其他的都是singleton
对于singleton pattern可以加上Quantifiers,就变成looping
// expecting 4 occurrences
start.times(4); // expecting 0 or 4 occurrences
start.times(4).optional(); // expecting 1 or more occurrences
start.oneOrMore(); // expecting 0 or more occurrences
start.oneOrMore().optional();
同一个pattenr的多次匹配可以定义Contiguity
illustrate the above with an example, a pattern sequence "a+ b" (one or more "a"’s followed by a "b") with input "a1", "c", "a2", "b" will have the following results:
Strict Contiguity:
{a2 b}– the"c"after"a1"causes"a1"to be discarded.Relaxed Contiguity:
{a1 b}and{a1 a2 b}–cis simply ignored.Non-Deterministic Relaxed Contiguity:
{a1 b},{a2 b}, and{a1 a2 b}.
oneOrMore() and times()) the default is relaxed contiguity. If you want strict contiguity, you have to explicitly specify it by using the consecutive() call, and if you want non-deterministic relaxed contiguity you can use the allowCombinations() call
consecutive() 的使用例子,
Pattern.<Event>begin("start").where(new SimpleCondition<Event>() {
@Override
public boolean filter(Event value) throws Exception {
return value.getName().equals("c");
}
})
.followedBy("middle").where(new SimpleCondition<Event>() {
@Override
public boolean filter(Event value) throws Exception {
return value.getName().equals("a");
}
}).oneOrMore().consecutive()
.followedBy("end1").where(new SimpleCondition<Event>() {
@Override
public boolean filter(Event value) throws Exception {
return value.getName().equals("b");
}
});
Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B
with consecutive applied: {C A1 B}, {C A1 A2 B}, {C A1 A2 A3 B}
without consecutive applied: {C A1 B}, {C A1 A2 B}, {C A1 A2 A3 B}, {C A1 A2 A3 A4 B}
这是针对单个pattern的Contiguity,后面还可以定义pattern之间的Contiguity
当然对于Pattern,很关键的是Conditions
就是条件,怎么样算匹配上?
Conditions 分为好几种,
Simple Conditions
start.where(new SimpleCondition<Event>() {
@Override
public boolean filter(Event value) {
return value.getName().startsWith("foo");
}
});
很容易理解,单纯的根据当前Event来判断
Iterative Conditions
This is how you can specify a condition that accepts subsequent events based on properties of the previously accepted events or some statistic over a subset of them.
即当判断这个条件是否满足时,需要参考之前已经匹配过的pattern,所以称为iterative
Below is the code for an iterative condition that accepts the next event for a pattern named “middle” if its name starts with “foo”, and if the sum of the prices of the previously accepted events for that pattern plus the price of the current event do not exceed the value of 5.0. Iterative conditions can be very powerful, especially in combination with looping patterns, e.g. oneOrMore().
middle.oneOrMore().where(new IterativeCondition<SubEvent>() {
@Override
public boolean filter(SubEvent value, Context<SubEvent> ctx) throws Exception {
if (!value.getName().startsWith("foo")) {
return false;
}
double sum = value.getPrice();
for (Event event : ctx.getEventsForPattern("middle")) {
sum += event.getPrice();
}
return Double.compare(sum, 5.0) < 0;
}
});
首先这是个oneOrMore,可以匹配一个或多个,但匹配每一个时,除了判断是否以“foo”开头外
还要判断和之前匹配的event的price的求和小于5
这里主要用到ctx.getEventsForPattern,取出某个名字的pattern当前的所有的匹配
Combining Conditions
pattern.where(new SimpleCondition<Event>() {
@Override
public boolean filter(Event value) {
return ... // some condition
}
}).or(new SimpleCondition<Event>() {
@Override
public boolean filter(Event value) {
return ... // or condition
}
});
可以有多个条件,where表示“and”语义,而or表示“or” 语义
Pattern Sequence
sequence是有多个pattern组成,那么多个pattern之间是什么关系?
A pattern sequence has to start with an initial pattern, as shown below:
Pattern<Event, ?> start = Pattern.<Event>begin("start");
每个sequence都必须要有个开始,begin
Next, you can append more patterns to your pattern sequence by specifying the desired contiguity conditions between them.
next(), for strict,followedBy(), for relaxed, andfollowedByAny(), for non-deterministic relaxed contiguity.
or
notNext(), if you do not want an event type to directly follow anothernotFollowedBy(), if you do not want an event type to be anywhere between two other event types
在begin开始后, 可以加上各种pattern,pattern之间的Contiguity关系有上面几种
例子,
As an example, a pattern a b, given the event sequence"a", "c", "b1", "b2", will give the following results:
Strict Contiguity between
aandb:{}(no match) – the"c"after"a"causes"a"to be discarded.Relaxed Contiguity between
aandb:{a b1}– as relaxed continuity is viewed as “skip non-matching events till the next matching one”.Non-Deterministic Relaxed Contiguity between
aandb:{a b1},{a b2}– as this is the most general form.
temporal constraint
一个sequence还可以指定时间限制,supported for both processing and event time
next.within(Time.seconds(10));
Detecting Patterns
当定义好pattern sequence后,我们需要真正的去detect,
DataStream<Event> input = ...
Pattern<Event, ?> pattern = ... PatternStream<Event> patternStream = CEP.pattern(input, pattern);
生成PatternStream
The input stream can be keyed or non-keyed depending on your use-case
Applying your pattern on a non-keyed stream will result in a job with parallelism equal to 1
如果non-keyed stream,并发只能是1
如果是keyed stream,不同的key可以单独的detect pattern,所以可以并发
Once you have obtained a PatternStream you can select from detected event sequences via the select or flatSelect methods.
对于PatternStream,可以用
PatternSelectFunction
PatternFlatSelectFunction
class MyPatternSelectFunction<IN, OUT> implements PatternSelectFunction<IN, OUT> {
@Override
public OUT select(Map<String, List<IN>> pattern) {
IN startEvent = pattern.get("start").get(0);
IN endEvent = pattern.get("end").get(0);
return new OUT(startEvent, endEvent);
}
}
对于PatternSelectFunction需要实现select接口,
参数是Map<String, List<IN>> pattern,这是一个匹配成功的pattern sequence,key是pattern名,后面是list是因为对于looping可能有多个匹配值
而对于PatternFlatSelectFunction,只是在接口上多了Collector,这样可以输出多个值
class MyPatternFlatSelectFunction<IN, OUT> implements PatternFlatSelectFunction<IN, OUT> {
@Override
public void select(Map<String, List<IN>> pattern, Collector<OUT> collector) {
IN startEvent = pattern.get("start").get(0);
IN endEvent = pattern.get("end").get(0);
for (int i = 0; i < startEvent.getValue(); i++ ) {
collector.collect(new OUT(startEvent, endEvent));
}
}
}
源码
首先是定义pattern,虽然pattern定义比较复杂,但是实现比较简单
最终,
org.apache.flink.cep.nfa.compiler.NFACompiler
会将pattern sequence转化为 NFA,非确定有限状态机,sequence匹配的大部分逻辑都是通过NFA来实现的,就不详细描写了
最终调用到,patternStream.select产生结果流
public <R> SingleOutputStreamOperator<R> select(final PatternSelectFunction<T, R> patternSelectFunction, TypeInformation<R> outTypeInfo) {
SingleOutputStreamOperator<Map<String, List<T>>> patternStream =
CEPOperatorUtils.createPatternStream(inputStream, pattern);
return patternStream.map(
new PatternSelectMapper<>(
patternStream.getExecutionEnvironment().clean(patternSelectFunction)))
.returns(outTypeInfo);
}
CEPOperatorUtils.createPatternStream
if (inputStream instanceof KeyedStream) {
// We have to use the KeyedCEPPatternOperator which can deal with keyed input streams
KeyedStream<T, K> keyedStream = (KeyedStream<T, K>) inputStream;
TypeSerializer<K> keySerializer = keyedStream.getKeyType().createSerializer(keyedStream.getExecutionConfig());
patternStream = keyedStream.transform(
"KeyedCEPPatternOperator",
(TypeInformation<Map<String, List<T>>>) (TypeInformation<?>) TypeExtractor.getForClass(Map.class),
new KeyedCEPPatternOperator<>(
inputSerializer,
isProcessingTime,
keySerializer,
nfaFactory,
true));
} else {
KeySelector<T, Byte> keySelector = new NullByteKeySelector<>();
TypeSerializer<Byte> keySerializer = ByteSerializer.INSTANCE;
patternStream = inputStream.keyBy(keySelector).transform(
"CEPPatternOperator",
(TypeInformation<Map<String, List<T>>>) (TypeInformation<?>) TypeExtractor.getForClass(Map.class),
new KeyedCEPPatternOperator<>(
inputSerializer,
isProcessingTime,
keySerializer,
nfaFactory,
false
)).forceNonParallel();
}
关键就是,生成KeyedCEPPatternOperator
public class KeyedCEPPatternOperator<IN, KEY> extends AbstractKeyedCEPPatternOperator<IN, KEY, Map<String, List<IN>>>
AbstractKeyedCEPPatternOperator
最关键的就是当一个StreamRecord过来时,我们怎么处理他
@Override
public void processElement(StreamRecord<IN> element) throws Exception {
if (isProcessingTime) {
// there can be no out of order elements in processing time
NFA<IN> nfa = getNFA();
processEvent(nfa, element.getValue(), getProcessingTimeService().getCurrentProcessingTime());
updateNFA(nfa); } else { long timestamp = element.getTimestamp();
IN value = element.getValue(); // In event-time processing we assume correctness of the watermark.
// Events with timestamp smaller than the last seen watermark are considered late.
// Late events are put in a dedicated side output, if the user has specified one. if (timestamp >= lastWatermark) { //只处理非late record // we have an event with a valid timestamp, so
// we buffer it until we receive the proper watermark. saveRegisterWatermarkTimer(); List<IN> elementsForTimestamp = elementQueueState.get(timestamp);
if (elementsForTimestamp == null) {
elementsForTimestamp = new ArrayList<>();
} if (getExecutionConfig().isObjectReuseEnabled()) {
// copy the StreamRecord so that it cannot be changed
elementsForTimestamp.add(inputSerializer.copy(value));
} else {
elementsForTimestamp.add(element.getValue());
}
elementQueueState.put(timestamp, elementsForTimestamp);
}
}
}
可以看到,如果是isProcessingTime,非常简单,直接丢给NFA处理就好
但如果是eventTime,问题就复杂了,因为要解决乱序问题,不能直接交给NFA处理
需要做cache,所以看看elementQueueState
private transient MapState<Long, List<IN>> elementQueueState;
elementQueueState = getRuntimeContext().getMapState(
new MapStateDescriptor<>(
EVENT_QUEUE_STATE_NAME,
LongSerializer.INSTANCE,
new ListSerializer<>(inputSerializer)
)
);
elementQueueState,记录所有时间点上的record list
onEventTime中会触发对elementQueueState上数据的处理,
@Override
public void onEventTime(InternalTimer<KEY, VoidNamespace> timer) throws Exception { // 1) get the queue of pending elements for the key and the corresponding NFA,
// 2) process the pending elements in event time order by feeding them in the NFA
// 3) advance the time to the current watermark, so that expired patterns are discarded.
// 4) update the stored state for the key, by only storing the new NFA and priority queue iff they
// have state to be used later.
// 5) update the last seen watermark. // STEP 1
PriorityQueue<Long> sortedTimestamps = getSortedTimestamps(); // 把elementQueueState的key按时间排序,PriorityQueue就是堆排序
NFA<IN> nfa = getNFA(); // STEP 2
while (!sortedTimestamps.isEmpty() && sortedTimestamps.peek() <= timerService.currentWatermark()) { // peek从小的时间取起,如果小于currentWatermark,就触发
long timestamp = sortedTimestamps.poll();
for (IN element: elementQueueState.get(timestamp)) { // 把该时间对应的record list拿出来处理
processEvent(nfa, element, timestamp);
}
elementQueueState.remove(timestamp);
} // STEP 3
advanceTime(nfa, timerService.currentWatermark()); // STEP 4
if (sortedTimestamps.isEmpty()) {
elementQueueState.clear();
}
updateNFA(nfa); if (!sortedTimestamps.isEmpty() || !nfa.isEmpty()) {
saveRegisterWatermarkTimer();
} // STEP 5
updateLastSeenWatermark(timerService.currentWatermark()); // 更新lastWatermark
}
onEventTime在何时被调用,
AbstractStreamOperator中有个
InternalTimeServiceManager timeServiceManager
来管理所有的time service
在AbstractKeyedCEPPatternOperator中open的时候会,会创建这个time service,并把AbstractKeyedCEPPatternOperator作为triggerTarget传入
timerService = getInternalTimerService(
"watermark-callbacks",
VoidNamespaceSerializer.INSTANCE,
this);
在processElement会调用
saveRegisterWatermarkTimer();
long currentWatermark = timerService.currentWatermark();
// protect against overflow
if (currentWatermark + 1 > currentWatermark) {
timerService.registerEventTimeTimer(VoidNamespace.INSTANCE, currentWatermark + 1);
}
这个逻辑看起来非常tricky,其实就是往timeService你们注册currentWatermark + 1的timer
AbstractStreamOperator中,当收到watermark的时候,
public void processWatermark(Watermark mark) throws Exception {
if (timeServiceManager != null) {
timeServiceManager.advanceWatermark(mark);
}
output.emitWatermark(mark);
}
timeServiceManager.advanceWatermark其实就是调用其中每一个time service的advanceWatermark
当前time service的实现,只有HeapInternalTimerService
HeapInternalTimerService.advanceWatermark
public void advanceWatermark(long time) throws Exception {
currentWatermark = time; // 更新currentWatermark
InternalTimer<K, N> timer;
while ((timer = eventTimeTimersQueue.peek()) != null && timer.getTimestamp() <= time) { // 从eventTimeTimersQueue取出一个timer,判断如果小于当前的watermark,记得我们注册过一个上个watermark+1的timer
Set<InternalTimer<K, N>> timerSet = getEventTimeTimerSetForTimer(timer);
timerSet.remove(timer);
eventTimeTimersQueue.remove();
keyContext.setCurrentKey(timer.getKey());
triggerTarget.onEventTime(timer); // 调用到onEventTime
}
}
这里还有个需要注意的点,对于KeyedStream,怎么保证不同key独立detect pattern sequence?
对于keyed state,elementQueueState,本身就是按key独立的,所以天然就支持
FlinkCEP - Complex event processing for Flink的更多相关文章
- An Overview of Complex Event Processing
An Overview of Complex Event Processing 复杂事件处理技术概览(一) 翻译前言:我在理解复杂事件处理(CEP)方面一直有这样的困惑--为什么这种计算模式是有效的, ...
- How to scale Complex Event Processing (CEP)/ Streaming SQL Systems?
转自:https://iwringer.wordpress.com/2012/05/18/how-to-scale-complex-event-processing-cep-systems/ What ...
- Understanding Complex Event Processing (CEP)/ Streaming SQL Operators with WSO2 CEP (Siddhi)
转自:https://iwringer.wordpress.com/2013/08/07/understanding-complex-event-processing-cep-operators-wi ...
- An Overview of Complex Event Processing2
An Overview of Complex Event Processing 翻译前言:感觉作者有点夸夸其谈兼絮絮叨叨,但文章还是很有用的.原文<An Overview of Complex ...
- Flafka: Apache Flume Meets Apache Kafka for Event Processing
The new integration between Flume and Kafka offers sub-second-latency event processing without the n ...
- OpenGL的GLUT事件处理(Event Processing)窗口管理(Window Management)函数[转]
GLUT事件处理(Event Processing)窗口管理(Window Management)函数 void glutMainLoop(void) 让glut程序进入事件循环.在一个glut程序中 ...
- 流计算技术实战 - CEP
CEP,Complex event processing Wiki定义 "Complex event processing, or CEP, is event processing that ...
- 【翻译】FlinkCEP-Flink的复杂事件处理
本文翻译自官网:FlinkCEP - Complex event processing for Flink FlinkCEP是在Flink之上实现的复杂事件处理(CEP)库. 它使您可以检测无穷无尽的 ...
- Flink架构、原理与部署测试
Apache Flink是一个面向分布式数据流处理和批量数据处理的开源计算平台,它能够基于同一个Flink运行时,提供支持流处理和批处理两种类型应用的功能. 现有的开源计算方案,会把流处理和批处理作为 ...
随机推荐
- JAVA 线程池架构浅析
经历了Java内存模型.JUC基础之AQS.CAS.Lock.并发工具类.并发容器.阻塞队列.atomic类后,我们开始JUC的最后一部分:线程池.在这个部分你将了解到下面几个部分: 线程池的基础架构 ...
- win10安装windows live writer 错误:OnCatalogResult:0x80190194
到官网下载了一个在线安装程序,可是一运行就提示无法安装,显式错误"OnCatalogResult:0x80190194",如下图所示 找到windows live安装程序的安装日志 ...
- linux每日命令(11):cat命令
cat命令的用途是连接文件或标准输入并打印.这个命令常用来显示文件内容,或者将几个文件连接起来显示,或者从标准输入读取内容并显示,它常与重定向符号配合使用. 一.命令格式: cat [参数] [文件] ...
- 24款最好的jQuery日期时间选择器插件
如果你正在创建一个网络表单,有很多事情你需要在你的应用程序中使用.有时您需要特别的输入,从用户的日期和时间,如发票日期,生日,交货时间,或任何其他此类信息.如果你有这样的需要,可以极大地从动态的jQu ...
- [转] BootStrap table增加一列显示序号
原文地址:https://blog.csdn.net/aboboo5200/article/details/78839208 最近由于项目需要,使用BootStrap table做数据展示,其中要在第 ...
- 【iCore4 双核心板_FPGA】例程十四:基于I2C的ARM与FPGA通信实验
实验现象: 1.先烧写ARM程序,然后烧写FPGA程序. 2.打开串口精灵,通过串口精灵给ARM发送数据从而给FPGA发送数据 ,会接收到字符GINGKO. 3.通过串口精灵发送命令可以控制ARM·L ...
- java获取视频的第一帧
//------------maven配置文件--------------- <dependency> <groupId>org.bytedeco</groupId> ...
- Java知多少(47)多重catch语句的使用
某些情况,由单个代码段可能引起多个异常.处理这种情况,你可以定义两个或更多的catch子句,每个子句捕获一种类型的异常.当异常被引发时,每一个catch子句被依次检查,第一个匹配异常类型的子句执行.当 ...
- Java知多少(56)线程模型
Java运行系统在很多方面依赖于线程,所有的类库设计都考虑到多线程.实际上,Java使用线程来使整个环境异步.这有利于通过防止CPU循环的浪费来减少无效部分. 为更好的理解多线程环境的优势可以将它与它 ...
- Java知多少(60)isAlive()和join()的使用
如前所述,通常你希望主线程最后结束.在前面的例子中,这点是通过在main()中调用sleep()来实现的,经过足够长时间的延迟以确保所有子线程都先于主线程结束.然而,这不是一个令人满意的解决方法,它也 ...