PipelineWise illustrates the power of Singer

Stitch is based on Singer, an open source standard for moving data between databases, web APIs, files, queues, and just about anything else. Because it's open source, anyone can use Singer to write data extraction and loading scripts or more comprehensive utilities. TransferWise, the company I work for, used Singer to create a data pipeline framework called PipelineWise that replicates data from multiple sources to multiple destinations.
TransferWise uses more than a hundred microservices, which means we have hundreds of data sources of different types (MySQL, PostgreSQL, Kafka, Zendesk, Jira, etc.). We wanted to create a centralised analytics data store that could hold data from all of our sources, with due attention paid to security and scalability. We wanted to use change data capture (CDC) wherever possible to keep lag low. In addition, our solution had to:
- Apply schema changes automatically
- Avoid vendor lock-in — we wanted access to the source code to develop new features and fix issues quickly
- Keep configuration as code
We looked at traditional ETL tools, commercial replication tools, and Kafka streaming ETL. None of them met all of our needs. (You can read more details in my post on Medium.)
After several months we found the Singer specification and realised that we could get to a solution more quickly by building on this great work.
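To give a feel for what building on Singer means in practice, here is a minimal sketch of the message flow the specification describes: a tap writes SCHEMA, RECORD, and STATE messages as one JSON object per line, and a target reads them. The `users` stream, its fields, the `last_id` bookmark, and the `run_tap` / `run_target` helpers are all hypothetical names chosen for illustration. This is not PipelineWise code.

```python
import io
import json
import sys


def emit(message, out=sys.stdout):
    """Write one Singer message as a single line of JSON."""
    out.write(json.dumps(message) + "\n")


def run_tap(rows, out=sys.stdout):
    """Emit SCHEMA, RECORD and STATE messages for a hypothetical 'users' stream."""
    emit({
        "type": "SCHEMA",
        "stream": "users",
        "key_properties": ["id"],
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            },
        },
    }, out)
    last_id = None
    for row in rows:
        emit({"type": "RECORD", "stream": "users", "record": row}, out)
        last_id = row["id"]
    # The STATE value is free-form; 'last_id' is an assumed bookmark layout.
    emit({"type": "STATE", "value": {"users": {"last_id": last_id}}}, out)


def run_target(lines):
    """Toy target: collect records per stream and remember the last STATE."""
    records = {}
    state = None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            records.setdefault(message["stream"], []).append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
    return records, state


if __name__ == "__main__":
    buffer = io.StringIO()
    run_tap([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Linus"}], buffer)
    print(run_target(buffer.getvalue().splitlines()))
```

Because the two sides only share a stream of JSON lines, any tap can in principle be paired with any target, which is what makes replicating from multiple sources to multiple destinations practical.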
A data pipeline is born
Our analytics platform team created PipelineWise as an experiment in close cooperation with our data analysts and some of the product teams that use the data. It proved to be successful — PipelineWise now meets all of our initial requirements. We use it to replicate hundreds of gigabytes of data every day from 120 microservices, 1,500+ tables, and a bunch of external tools into our Snowflake data warehouse, with only minutes of lag.

Monitoring with Grafana: Replicating 120 data sources, 1,500+ tables into Snowflake with PipelineWise on three nodes of c5.2xlarge EC2 instances
Like any tool, PipelineWise has limitations:
- Not real-time: The currently supported target connectors are microbatch-oriented. We have to load data from S3 via the COPY command into Snowflake or Amazon Redshift because individual INSERT statements are inefficient. Creating these batches adds an extra layer to the process, so replication is not real-time. The replication lag from source to target is between 5 and 30 minutes depending on the data source.
- Very active transactional tables: PipelineWise tries to do parallel processing wherever possible. Microbatches are created in parallel as well, one batch for each table, but currently we can’t create one individual batch in parallel. This means that replicating extremely large tables that receive millions of INSERTs and UPDATEs, and no DELETEs, can be slow when the CDC replication method is enabled. In this case key-based incremental replication is faster and still reliable, as there are no deleted rows in the source.
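The idea behind key-based incremental replication with microbatches can be sketched as follows. This is a simplified illustration, not PipelineWise's implementation: the `incremental_batches` function, the `updated_at` replication key, the strict "greater than" comparison, and the in-memory list standing in for a source table are all assumptions.

```python
def incremental_batches(rows, key, bookmark=None, batch_size=2):
    """Yield (batch, new_bookmark) for rows whose key is greater than the bookmark.

    'rows' stands in for a source table; a real tap would run a query
    filtered and ordered by the replication key instead.
    """
    pending = sorted(
        (row for row in rows if bookmark is None or row[key] > bookmark),
        key=lambda row: row[key],
    )
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        # The highest key in the batch becomes the bookmark for the next run.
        yield batch, batch[-1][key]


if __name__ == "__main__":
    table = [
        {"id": 1, "updated_at": 10},
        {"id": 2, "updated_at": 20},
        {"id": 3, "updated_at": 30},
    ]

    # First run: no bookmark yet, so every row is replicated in microbatches.
    bookmark = None
    for batch, bookmark in incremental_batches(table, "updated_at", bookmark):
        print("load batch", [row["id"] for row in batch], "bookmark", bookmark)

    # One INSERT and one UPDATE happen in the source.
    table.append({"id": 4, "updated_at": 40})
    table[0]["updated_at"] = 50

    # Second run: only the new and the changed row are picked up.
    for batch, bookmark in incremental_batches(table, "updated_at", bookmark):
        print("load batch", [row["id"] for row in batch], "bookmark", bookmark)
```

A row deleted from the source would never show up in a later batch, which is why this method is only reliable for tables without deletes.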
An evolving solution
PipelineWise is likely to evolve for some time to come, but it’s mature enough to release back to the open source community. Our hope is that others might benefit from and contribute toward the project, and possibly open up new and exciting ways of analysing data.
For detailed information on PipelineWise features and architecture, check out the documentation.