Error Handling Elements in Apache Beam Pipelines

Mar 15

I have noticed a shortage of documentation and examples outside of the official Beam docs, likely because data pipelines are often intimately linked with business logic. While working with streaming pipelines, I developed a simple error handling technique to reduce the disruption that errors cause to streaming or long-running jobs. This post explains that technique and walks through a simple demo pipeline.

Apache Beam is a high-level model for programming data processing pipelines. It provides language interfaces in both Java and Python, though Java support is more feature-complete.

Beam supports running in two modes: batch and streaming. In batch mode, a finite data set is read in, processed, then output in one huge chunk. Streaming mode allows data to be continuously read in from a streaming source (such as a message queue), processed in small chunks, and output as processing occurs. Streaming allows analytics to be performed in “real time” as events occur. This is extremely valuable for telemetry and logging, where engineers or other systems need feedback as events happen.

Beam pipelines are composed of a series of typed data sets (PCollections), and transforms. Transforms take a PCollection, perform a programmer-defined operation on the collection elements, then output zero or more new PCollections as a result.

The problem with these transforms is that they need to eventually operate on data. As anyone familiar with handling user input or data from large systems can attest, that data can be malformed, or just unexpected. If a bad piece of data enters the system, it may cause the entire pipeline to crash. This is a waste of time and compute resources at best, but can also result in losing in-memory streaming data, or disrupting downstream systems relying on the Beam output.

To prevent a catastrophic failure, you need graceful error handling in your pipeline. The easiest way to do this is to add a try/catch block within each transform, so that one bad element does not shut down the pipeline and all other elements can still be processed.

A basic try/catch around a string conversion.
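
As a minimal sketch of that idea in plain Java (the class and method names are illustrative and not taken from the demo repo), the loop below stands in for the per-element body of a transform. A string that cannot be converted is reported and skipped, and the remaining elements are still processed.

```java
import java.util.ArrayList;
import java.util.List;

final class SafeConversion {
  // Converts each string to an int. A bad element is reported and skipped
  // instead of aborting the whole run.
  static List<Integer> convertAll(List<String> elements) {
    List<Integer> converted = new ArrayList<>();
    for (String element : elements) {
      try {
        converted.add(Integer.parseInt(element.trim()));
      } catch (NumberFormatException e) {
        System.err.println("Skipping unparseable element: " + element);
      }
    }
    return converted;
  }
}
```

In a Beam pipeline the same try/catch would sit inside the transform's per-element method rather than inside a loop.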

This is a start, but it’s not enough on its own. You’ll want to record failures: which data failed in which transform, and why. To do this, you’ll need a data structure to store these errors, and an output channel for them.

The data structure for a failure should contain:

Source data in some form (data ID, the raw data fed into the transform, or the raw data precursor that was fed into the pipeline).
The reason for the failure.
The transform that failed.

Example constructor of a Failure object.
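
A sketch of what such a class might look like follows. The field names and constructor parameters are assumptions made for illustration, not the demo repo's actual definition. The class is marked Serializable on the assumption that the pipeline would encode it with a serialization-based coder.

```java
import java.io.Serializable;

/** Illustrative failure record; field and parameter names are assumptions. */
final class Failure implements Serializable {
  private static final long serialVersionUID = 1L;

  final String failedTransform; // the transform that rejected the element
  final String sourceData;      // the raw input, or an ID that identifies it
  final String reason;          // why the transform failed

  Failure(String failedTransform, String sourceData, Throwable error) {
    this.failedTransform = failedTransform;
    this.sourceData = sourceData;
    // Keep the exception type too, since some exceptions carry a null message.
    this.reason = error.getClass().getSimpleName() + ": " + error.getMessage();
  }

  @Override
  public String toString() {
    return failedTransform + " failed on [" + sourceData + "]: " + reason;
  }
}
```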

We can instantiate a Failure if an exception or error is thrown during a transform.

Parsing some fields out of auditd log strings. In this example, we use an inappropriately small number type. If the number is too large for an Integer, the transform outputs a Failure object, and continues processing elements.
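
The following plain-Java sketch illustrates the same situation without depending on Beam. It pulls the event ID out of an auditd record header and deliberately parses it as an Integer. The class names, the regular expression, and the small Result holder (standing in for a success value or a Failure object) are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AuditdEventIdParser {
  // Matches the "msg=audit(<seconds>.<millis>:<event id>)" header of a record.
  private static final Pattern HEADER =
      Pattern.compile("msg=audit\\((\\d+)\\.(\\d+):(\\d+)\\)");

  /** Holds either a parsed event ID or a failure description, never both. */
  static final class Result {
    final Integer eventId;      // null when parsing failed
    final String failureReason; // null when parsing succeeded
    final String rawLine;       // source data, kept for later review

    Result(Integer eventId, String failureReason, String rawLine) {
      this.eventId = eventId;
      this.failureReason = failureReason;
      this.rawLine = rawLine;
    }

    boolean isFailure() {
      return failureReason != null;
    }
  }

  static Result parse(String line) {
    try {
      Matcher m = HEADER.matcher(line);
      if (!m.find()) {
        throw new IllegalArgumentException("no audit header found");
      }
      // Deliberately too small: IDs above Integer.MAX_VALUE throw here.
      return new Result(Integer.parseInt(m.group(3)), null, line);
    } catch (RuntimeException e) {
      // Record the failure and let the caller move on to the next element.
      String reason = e.getClass().getSimpleName() + ": " + e.getMessage();
      return new Result(null, reason, line);
    }
  }
}
```

Calling parse on a line whose event ID does not fit in an Integer returns a Result with isFailure() true and the raw line preserved, so the caller can keep going with the next element.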

Next, we need to be able to record the failure for developers to reference.

By default a Beam transform has a single output PCollection, but it can produce several. Such a transform returns a PCollectionTuple, and TupleTag objects identify both which PCollection an element is written to and which PCollection to fetch from the tuple afterwards. This has many uses; here we use it to output a PCollection of successful results and a separate PCollection of Failure objects.

Accessing the PCollections stored in a PCollectionTuple.

In the demo repo, successes and failures are simply written to files. In a real pipeline, they would likely be sent to a database, or a message queue for additional processing or reporting.

You may also want to extend coverage beyond just handling thrown exceptions. For example, we could validate that all required data is present and falls within expected parameters (e.g., all user IDs are ≥ 0), to prevent logical errors, missing records, or DB insertion failures further along. That validation could be folded into the Failure class, or it could use a new Invalid class with its own PCollection.
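
A validation step of this kind could be sketched as follows, in plain Java. The record shape (a map of field names to strings) and the user_id field name are assumptions made for the example. An empty list means the record passed; anything else could be wrapped in a Failure or Invalid object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class RecordValidator {
  // Returns a description of every problem found; an empty list means valid.
  static List<String> validate(Map<String, String> record) {
    List<String> problems = new ArrayList<>();
    String userId = record.get("user_id"); // hypothetical field name
    if (userId == null || userId.isEmpty()) {
      problems.add("user_id is missing");
    } else {
      try {
        if (Long.parseLong(userId) < 0) {
          problems.add("user_id is negative: " + userId);
        }
      } catch (NumberFormatException e) {
        problems.add("user_id is not a number: " + userId);
      }
    }
    return problems;
  }
}
```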

This covers the handling of the elements themselves, but there are many design decisions beyond that, starting with: what next? Data scientists or developers must review the errors and discard data that is outright bad. If data is merely in an unexpected format, or exposed a now-fixed bug in the pipeline, then that data should be re-processed. It’s common (more so in batch pipelines) to retry a whole dataset after any bugs in the pipeline are addressed. This is time-consuming to process, but easy to support, and it allows grouped data (sums, aggregates, etc.) to be corrected by adding the missing data. Some pipelines may retry only individual elements, if the pipeline is a 1-in-1-out process.

There is a GitHub repo at https://github.com/vllry/beam-errorhandle-example which shows the full proof of concept using auditd log files.

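import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Input, Output and process() are placeholders; LOG is assumed to be an SLF4J Logger.
final TupleTag<Output> successTag = new TupleTag<Output>() {};
final TupleTag<Input> deadLetterTag = new TupleTag<Input>() {};

PCollection<Input> input = /* … */;

PCollectionTuple outputTuple = input.apply(ParDo.of(new DoFn<Input, Output>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    try {
      // Successful results go to the main output.
      c.output(process(c.element()));
    } catch (Exception e) {
      LOG.error("Failed to process input {} -- adding to dead letter file",
          c.element(), e);
      // Route the failed input to the dead letter output instead of rethrowing.
      c.output(deadLetterTag, c.element());
    }
  }
}).withOutputTags(successTag, TupleTagList.of(deadLetterTag)));

// Write the dead letter inputs to a BigQuery table for later analysis
outputTuple.get(deadLetterTag)
  .apply(BigQueryIO.write(...));

// Retrieve the successful elements...
PCollection<Output> success = outputTuple.get(successTag);
// and continue processing as desired ...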
