Debezium for PostgreSQL to Kafka
In this article, we discuss why the read and write data models should be segregated and why event sourcing should be used to capture detailed data changes. Both aspects are critical for data analysis in the big data world. We compare several candidate solutions and conclude that a CDC (change data capture) strategy is a good match for the CQRS pattern.
Context and Problem
To support business decision-making, we need fresh and accurate data that is available where and when we need it, often in real time.
But:
- as business analysts run their analyses, the production databases are (or will be) overloaded;
- some process details (the transaction stream) that are valuable for analysis may have been overwritten;
- OLTP data models may not be well suited to analysis.
We hope to come up with an efficient solution that captures the detailed transaction stream and ingests the data into Hadoop for analysis.

CQRS and Event Sourcing Pattern
CQRS-based systems use separate read and write data models, each tailored to relevant tasks and often located in physically separate stores.
Event-sourcing: Instead of storing just the current state of the data in a domain, use an append-only store to record the full series of actions taken on that data.

Decouple: one team of developers can focus on the complex domain model that is part of the write model, and another team can focus on the read model and the user interfaces.
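To make the two patterns concrete, here is a minimal sketch in Python. The account events, the `EventStore` class, and the `project_balances` function are hypothetical names chosen for illustration; the point is only that the write side appends events and the read side derives its own view from them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    account: str
    kind: str      # "deposited" or "withdrawn" (hypothetical event types)
    amount: int


@dataclass
class EventStore:
    """Write model: an append-only log; events are never updated in place."""
    events: list = field(default_factory=list)

    def append(self, event: Event) -> None:
        self.events.append(event)


def project_balances(events) -> dict:
    """Read model: fold the full event history into current balances."""
    balances: dict = {}
    for e in events:
        sign = 1 if e.kind == "deposited" else -1
        balances[e.account] = balances.get(e.account, 0) + sign * e.amount
    return balances


store = EventStore()
store.append(Event("alice", "deposited", 100))
store.append(Event("alice", "withdrawn", 30))
store.append(Event("bob", "deposited", 50))

balances = project_balances(store.events)
print(balances)
```

Because the history is kept, a different read model (for example, total withdrawals per day) could be built later from the same events without touching the write side.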
Ingest Solutions - dual writes
Dual Write
- adds complexity to the business system
- is less fault tolerant when the backend message queue is blocked or under maintenance
- suffers from race conditions and consistency problems
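The race condition can be illustrated with a small deterministic simulation. This is only a sketch: the two clients, the key `x`, and the in-memory `database` and `queue` are stand-ins, and the interleaving shown is just one possible ordering.

```python
database = {}     # stand-in for the primary database
queue = []        # stand-in for the message queue
downstream = {}   # state a consumer rebuilds from the queue

# Each client performs two independent writes: database first, then queue.
# The steps below interleave the two clients in one possible order.
database["x"] = "A"          # client A writes to the database
database["x"] = "B"          # client B writes to the database
queue.append(("x", "B"))     # client B publishes its change
queue.append(("x", "A"))     # client A publishes late

# The consumer applies queue messages in arrival order.
for key, value in queue:
    downstream[key] = value

print(database["x"], downstream["x"])
```

The database ends with B's value while the downstream copy ends with A's, and nothing in either system signals that they have diverged.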
Business log
- raises data sensitivity concerns
- adds complexity to the business system

Ingest Solutions - database operations
Snapshot
- data in the database is constantly changing, so the snapshot is already out-of-date by the time it’s loaded
- even if you take a snapshot once a day, you still have one-day-old data in the downstream system
- on a large database those snapshots and bulk loads can become very expensive
Data offload
- adds operational complexity
- is unable to meet low-latency requirements
- can’t handle delete operations
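Why a query-based offload misses deletes can be seen in a short sketch. It assumes an offload job that polls for rows changed since the last run using an `updated_at` column; the `orders` table, its columns, and the `offload` function are hypothetical, and SQLite stands in for the production database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, "new", 10), (2, "new", 11)])


def offload(conn, replica: dict, last_sync: int) -> int:
    """Copy rows changed since last_sync; return the new high-water mark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_sync,),
    ).fetchall()
    for row_id, status, updated_at in rows:
        replica[row_id] = status
        last_sync = max(last_sync, updated_at)
    return last_sync


replica = {}
mark = offload(conn, replica, 0)      # first run copies both rows

conn.execute("UPDATE orders SET status = 'paid', updated_at = 20 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")
mark = offload(conn, replica, mark)   # second run sees the update only

remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(replica, remaining)
```

The deleted row leaves nothing behind for the polling query to find, so order 2 stays in the replica indefinitely.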
Ingest Solutions - capture data change
Process only the “diff” of changes:
- write all your data to only one primary DB;
- extract two things from that database: a consistent snapshot and a real-time stream of changes.
Benefits:
- is decoupled from the business system
- achieves a latency of less than a second
- delivers the stream in the order of the writes, so there are fewer race conditions
- uses a pull strategy that is robust to data corruption (the log can be replayed)
- supports as many different data consumers as required
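The snapshot-plus-stream idea and the log-replay benefit can be sketched as follows. The tuple layout `(position, op, key, value)` and the function `apply_changes` are assumptions made for this example, not the format of any particular CDC tool; `position` plays the role of a log offset.

```python
def apply_changes(replica: dict, applied_upto: int, changes) -> int:
    """Apply ordered (position, op, key, value) changes; skip already-applied ones."""
    for position, op, key, value in changes:
        if position <= applied_upto:
            continue  # already reflected in the replica, so replay is harmless
        if op == "delete":
            replica.pop(key, None)
        else:  # "insert" or "update"
            replica[key] = value
        applied_upto = position
    return applied_upto


# Consistent snapshot taken at log position 2 (hypothetical data).
replica = {"a": 1, "b": 2}
snapshot_position = 2

log = [
    (1, "insert", "a", 1),
    (2, "insert", "b", 2),
    (3, "update", "a", 10),
    (4, "delete", "b", None),
    (5, "insert", "c", 3),
]

position = apply_changes(replica, snapshot_position, log)
# Replaying the whole log again, e.g. after a consumer restart, changes nothing.
position = apply_changes(replica, position, log)
print(replica, position)
```

Tracking the last applied position is what lets a consumer restart from the log without double-applying changes, and it is also how deletes reach the downstream copy.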

Ingest Solutions - wrapup
Viewing the data applications as sitting downstream of the business applications, we will focus on the ‘capture changes to data’ components.

Open Source for Postgres to Kafka
Sqoop
can only take full snapshots of a database and cannot capture an ongoing stream of changes. Also, the transactional consistency of its snapshots is not well supported (Apache).
pg_kafka
is a Kafka producer client in a Postgres function, so we could potentially produce to Kafka from a trigger (MIT License).
bottledwater-pg
is a change data capture (CDC) tool specifically for PostgreSQL into Kafka (Apache License 2.0, from Confluent Inc.).
debezium-pg
is a change data capture tool for a variety of databases (Apache License 2.0, from Red Hat).

Among these, Debezium for Postgres is the comparatively better choice.
Debezium for Postgres Architecture
debezium/postgres-decoderbufs
- manually build the output plugin
- change the PG configuration to preload the library file, then restart the PG service
debezium/debezium
- compile and package the dependent jar files
Kafka Connect
- deploy a distributed Kafka Connect service
- start a Debezium connector in Kafka Connect
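Starting a connector is done through the Kafka Connect REST API by posting a JSON document to `/connectors`. The sketch below only builds the request. The worker URL, connector name, host, credentials, and database name are placeholders, and the property names are the ones commonly documented for the Debezium PostgreSQL connector, so they should be checked against the Debezium version actually deployed.

```python
import json
import urllib.request


def build_connector_request(connect_url: str, name: str, config: dict) -> urllib.request.Request:
    """Build the POST request that registers a connector with Kafka Connect."""
    payload = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url}/connectors",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


connector_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "decoderbufs",                  # the output plugin built earlier
    "database.hostname": "pg.example.internal",    # placeholder
    "database.port": "5432",
    "database.user": "debezium",                   # placeholder
    "database.password": "change-me",              # placeholder
    "database.dbname": "inventory",                # placeholder
    # Logical server name; newer Debezium releases use "topic.prefix" instead.
    "database.server.name": "pg1",
}

request = build_connector_request("http://127.0.0.1:8083", "pg-cdc-source", connector_config)
# To register the connector against a running worker:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
print(request.get_method(), request.full_url)
```

Keeping the request construction separate from the network call makes the configuration easy to review or test before it is sent to a live worker.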
HBase connector
- development work: implement an HBase connector for PG CDC events
- start an HBase connector in Kafka Connect
Spark Streaming
- development work: implement data processing functions atop Spark Streaming
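The core of the downstream development work is turning each change event into a put or a delete. The sketch below assumes the event has already been deserialized to the Debezium envelope (`op`, `before`, `after`) and that rows are keyed by an `id` column. A plain dict stands in for the HBase table, and `to_mutation` and `apply_events` are hypothetical helper names.

```python
def to_mutation(event: dict, key_field: str = "id"):
    """Translate one change event into a (action, key, row) mutation."""
    op = event["op"]
    if op in ("c", "u", "r"):      # create, update, snapshot read
        row = event["after"]
        return ("put", row[key_field], row)
    if op == "d":                  # delete: only "before" carries the row key
        return ("delete", event["before"][key_field], None)
    raise ValueError(f"unsupported op: {op!r}")


def apply_events(table: dict, events) -> dict:
    """Apply change events in order to a key-value table."""
    for event in events:
        action, key, row = to_mutation(event)
        if action == "put":
            table[key] = row
        else:
            table.pop(key, None)
    return table


# Hypothetical events for an "orders" table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2}, "after": None},
]

table = apply_events({}, events)
print(table)
```

A real connector or streaming job would replace the dict operations with writes to the target store, but the mapping from `op` to put or delete stays the same.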

Considerations
Reliability
For example:
- detect data source exceptions or source relocation, and automatically or manually restart data capture tasks or redirect them to the new data source;
- monitor data quality and latency.
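Latency monitoring can start from the timestamps carried in each change event. The sketch below assumes Debezium-style events whose `source.ts_ms` field records when the change was made in the database; the function names, the sample events, and the 1000 ms threshold are illustrative only.

```python
import time


def capture_lag_ms(event: dict, now_ms=None) -> int:
    """Milliseconds between the change in the source database and now."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - event["source"]["ts_ms"]


def lagging_events(events, threshold_ms: int, now_ms: int) -> list:
    """Return the events whose capture lag exceeds the threshold."""
    return [e for e in events if capture_lag_ms(e, now_ms) > threshold_ms]


# Hypothetical events observed by a consumer at now_ms = 10_000.
events = [
    {"op": "u", "source": {"ts_ms": 9_800}},   # 200 ms behind
    {"op": "u", "source": {"ts_ms": 7_500}},   # 2500 ms behind
]

slow = lagging_events(events, threshold_ms=1_000, now_ms=10_000)
print([capture_lag_ms(e, 10_000) for e in events], len(slow))
```

Passing `now_ms` explicitly keeps the check deterministic; in production the default wall-clock time would be used, and the result would feed an alert or a dashboard.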
Scalability
- detect load pressure on the data source, and automatically or manually scale out data capture tasks.
Maintainability
- GUI for system monitoring, data quality checks, latency statistics, etc.;
- GUI for configuring the scale-out of data capture tasks.
Other CDC solutions
Databus (LinkedIn): no native support for PG
Wormhole (Facebook): not open source
Sherpa (Yahoo!): not open source
BottledWater (Confluent): Postgres only (no longer maintained)
Maxwell: MySQL only
Debezium (Red Hat): good
Mongoriver: MongoDB only
GoldenGate (Oracle): for Oracle and MySQL, free but not open source
Canal & Otter (Alibaba): for replication in the MySQL world