PipelineWise illustrates the power of Singer

Stitch is based on Singer, an open source standard for moving data between databases, web APIs, files, queues, and just about anything else. Because it's open source, anyone can use Singer to write data extraction and loading scripts or more comprehensive utilities. TransferWise, the company I work for, used Singer to create a data pipeline framework called PipelineWise that replicates data from multiple sources to multiple destinations.
TransferWise uses more than a hundred microservices, which means we have hundreds of different types of data sources (MySQL, PostgreSQL, Kafka, Zendesk, Jira, etc.). We wanted to create a centralised analytics data store that could hold data from all of our sources, with due attention paid to security and scalability. We wanted to use change data capture (CDC) wherever possible to keep lag low. In addition, our solution had to:
- Apply schema changes automatically
- Avoid vendor lock-in — we wanted access to the source code to develop new features and fix issues quickly
- Keep configuration as code
We looked at traditional ETL tools, commercial replication tools, and Kafka streaming ETL. None of them met all of our needs. (You can read more details in my post on Medium.)
After several months we found the Singer specification and realised that we could get to a solution more quickly by building on this great work.
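To make the Singer idea concrete, here is a minimal sketch of the message flow the specification describes: a tap writes SCHEMA, RECORD, and STATE messages as one JSON object per line, and a target reads them back. This is not PipelineWise code. The stream name `users`, the `last_id` bookmark, and the helper functions `emit`, `run_tap`, and `run_target` are all hypothetical names chosen for illustration.

```python
import io
import json
import sys


def emit(message, out):
    """Write one Singer message as a single line of JSON."""
    out.write(json.dumps(message) + "\n")


def run_tap(rows, out=sys.stdout):
    """Emit a SCHEMA message, one RECORD per row, then a STATE bookmark."""
    stream = "users"  # hypothetical stream name
    emit({
        "type": "SCHEMA",
        "stream": stream,
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            },
        },
        "key_properties": ["id"],
    }, out)
    last_id = None
    for row in rows:
        emit({"type": "RECORD", "stream": stream, "record": row}, out)
        last_id = row["id"]
    # The STATE value is free-form; this bookmark layout is an assumption.
    emit({"type": "STATE", "value": {stream: {"last_id": last_id}}}, out)


def run_target(lines):
    """Collect records per stream and remember the most recent state."""
    records, state = {}, None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            records.setdefault(message["stream"], []).append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
    return records, state


if __name__ == "__main__":
    # Pipe the tap into the target through an in-memory buffer.
    buffer = io.StringIO()
    run_tap([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}], out=buffer)
    print(run_target(buffer.getvalue().splitlines()))
```

Because the tap and the target only share this line-oriented JSON format, any source can in principle be paired with any destination, which is what makes it practical to replicate from many sources to many destinations.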
A data pipeline is born
Our analytics platform team created PipelineWise as an experiment in close cooperation with our data analysts and some of the product teams that use the data. It proved to be successful — PipelineWise now meets all of our initial requirements. We use it to replicate hundreds of gigabytes of data every day from 120 microservices, 1,500+ tables, and a bunch of external tools into our Snowflake data warehouse, with only minutes of lag.

Monitoring with Grafana: Replicating 120 data sources, 1,500+ tables into Snowflake with PipelineWise on three nodes of c5.2xlarge EC2 instances
Like any tool, PipelineWise has limitations:
- Not real-time: The currently supported target connectors are microbatch-oriented. We have to load data from S3 via the COPY command into Snowflake or Amazon Redshift because individual INSERT statements are inefficient. Creating these batches adds an extra layer to the process, so replication is not real-time. The replication lag from source to target is between 5 and 30 minutes depending on the data source.
- Very active transactional tables: PipelineWise tries to do parallel processing wherever possible. Microbatches are created in parallel as well, one batch for each table, but currently we can’t split the creation of a single batch across parallel workers. This means that replicating extremely large tables that receive millions of INSERTs and UPDATEs, but no DELETEs, can be slow when the CDC replication method is enabled. In this case key-based incremental replication is faster and still reliable, as there are no deleted rows in the source.
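The key-based incremental replication mentioned in the second limitation can be sketched as follows. This is an illustration of the general technique, not PipelineWise's implementation: each run selects only rows whose replication key is greater than a stored bookmark, then advances the bookmark. The function name `extract_incremental`, the `orders` table, and the integer `updated_at` column are assumptions made for the example, and SQLite stands in for the source database.

```python
import sqlite3


def extract_incremental(conn, table, key_column, bookmark, batch_size=1000):
    """Return rows with key_column > bookmark, plus the advanced bookmark.

    table and key_column are interpolated into the SQL, so they must come
    from trusted configuration, never from user input.
    """
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {key_column} > ? ORDER BY {key_column} LIMIT ?"
    )
    cursor = conn.execute(query, (bookmark, batch_size))
    columns = [description[0] for description in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    # Keep the old bookmark when nothing new was found.
    new_bookmark = rows[-1][key_column] if rows else bookmark
    return rows, new_bookmark


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 10.0, 100), (2, 20.0, 101), (3, 30.0, 102)],
    )

    # First run: everything is newer than the initial bookmark.
    rows, bookmark = extract_incremental(conn, "orders", "updated_at", 0)
    print(len(rows), bookmark)

    # An UPDATE that bumps the key is picked up on the next run.
    conn.execute("UPDATE orders SET amount = 25.0, updated_at = 103 WHERE id = 2")
    rows, bookmark = extract_incremental(conn, "orders", "updated_at", bookmark)
    print(rows, bookmark)

    # A DELETE leaves no row to select, so this method never sees it.
    conn.execute("DELETE FROM orders WHERE id = 1")
    rows, bookmark = extract_incremental(conn, "orders", "updated_at", bookmark)
    print(rows, bookmark)
```

The last step shows why this method is only reliable when the source has no deleted rows: a deleted row simply stops appearing in the query, so the target never learns about it. The strict `>` comparison is also a simplification, since rows that share the bookmark value and are committed later would be missed.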
An evolving solution
PipelineWise is likely to evolve for some time to come, but it’s mature enough to release back to the open source community. Our hope is that others might benefit from and contribute toward the project, and possibly open up new and exciting ways of analysing data.
For detailed information on PipelineWise features and architecture, check out the documentation.