PipelineWise illustrates the power of Singer

Stitch is based on Singer, an open source standard for moving data between databases, web APIs, files, queues, and just about anything else. Because it's open source, anyone can use Singer to write data extraction and loading scripts or more comprehensive utilities. TransferWise, the company I work for, used Singer to create a data pipeline framework called PipelineWise that replicates data from multiple sources to multiple destinations.
TransferWise uses more than a hundred microservices, which means we have hundreds of different type of data sources (MySQL, PostgreSQL, Kafka, Zendesk, Jira, etc.). We wanted to create a centralised analytics data store that could hold data from all of our sources, with due attention paid to security and scalability. We wanted to use change data capture (CDC) wherever possible to keep lag low. In addition, our solution had to:
- Apply schema changes automatically
- Avoid vendor lock-in — we wanted access to the source code to develop new features and fix issues quickly
- Keep configuration as code
We looked at traditional ETL tools, commercial replication tools, and Kafka streaming ETL. None of them met all of our needs. (You can read more details in my post on Medium.)
After several months we found the Singer specification and realised that we could get to a solution more quickly by building on this great work.
A data pipeline is born
Our analytics platform team created PipelineWise as an experiment in close cooperation with our data analysts and some of the product teams that use the data. It proved to be successful — PipelineWise now meets all of our initial requirements. We use it to replicate hundreds of gigabytes of data every day from 120 microservices, 1,500+ tables, and a bunch of external tools into our Snowflake data warehouse, with only minutes of lag.
Monitoring with Grafana: Replicating 120 data sources, 1,500+ tables into Snowflake with PipelineWise on three nodes of c5.2xlarge EC2 instances
Like any tool, PipelineWise has limitations:
- Not real-time: The currently supported target connectors are microbatch-oriented. We have to load data from S3 via the COPY command into Snowflake or Amazon Redshift because individual
INSERT
statements are inefficient. Creating these batches adds an extra layer to the process, so replication is not real-time. The replication lag from source to target is between 5 and 30 minutes depending on the data source. - Very active transactional tables: PipelineWise tries to do parallel processing wherever possible. Microbatches are created in parallel as well, one batch for each table, but currently we can’t create one individual batch in parallel. This means that replicating extremely large tables with millions of only
INSERTS
andUPDATES
can be slow when the CDC replication method is enabled. In this case key-based incremental replication is faster and still reliable, as there are no deleted rows in source.
An evolving solution
PipelineWise is likely to evolve for some time to come, but it’s mature enough to release back to the open source community. Our hope is that others might benefit from and contribute toward the project, and possibly open up new and exciting ways of analysing data.
For detailed information on PipelineWise features and architecture, check out the documentation.
PipelineWise illustrates the power of Singer的更多相关文章
- pipelinewise 学习二 创建一个简单的pipeline
pipelinewise 提供了方便的创建简单pipeline的命令,可以简化pipeline 的创建,同时也可以帮我们学习 生成demo pipeline pipelinewise init --n ...
- xv6课本翻译之——第0章 操作系统接口
Chapter 0 第0章 Operating system interfaces 操作系统接口 The job of an operating system is to share a comput ...
- NCE3
Lesson1 A puma at large Pumas are large, cat-like animals which are found in America. When reports ...
- vim 的寄存器
If you've been following my series on Vim, it should be clear now that Vim has a pretty clear philos ...
- 2 Advanced Read/Write Splitting with PHP’s MySQLnd
原文地址需FQ才能看 https://blog.engineyard.com/2014/advanced-read-write-splitting-with-phps-mysqlnd In part ...
- New Concept English three (45)
31w/m 65error In democratic countries any efforts to restrict the freedom of the press are rightly c ...
- book-rev8 Chapter 0 Operating system interfaces
Chapter 0 第0章 Operating system interfaces 操作系统接口 The job of an operating system is to share a comput ...
- pipelinewise 基于singer 指南的的数据pipeline 工具
pipelinewise 是基于开源singer 指南开发的数据pipeline工具,与singer tap 以及target 兼容 支持的特性 内置的elt 特性 轻量级 支持多种复制方法,cdc( ...
- 无线电源传输 Wireless Power Consortium (WPC) Communication
Universally Compatible Wireless Power Using the Qi Protocol Wireless charging of portable electronic ...
随机推荐
- 示例:WPF中Slider控件封装的缓冲播放进度条控件
原文:示例:WPF中Slider控件封装的缓冲播放进度条控件 一.目的:模仿播放器播放进度条,支持缓冲任务功能 二.进度: 实现类似播放器中带缓存的播放样式(播放区域.缓冲区域.全部区域等样式) 实现 ...
- php for循环a到z
首先先介绍2个php内置函数 ord(string):函数返回字符串的首个字符的 ASCII 值.//string:必需.要从中获得 ASCII 值的字符串. chr(ascll): 函数从指定的 A ...
- phoenix kerberos 连接配置
1. 官网资料 Use JDBC to get a connection to an HBase cluster like this: Connection conn = DriverManager. ...
- ubuntu开发常用收集
命令: 1.http://blog.csdn.net/simongeek/article/details/45271089 2.http://www.jianshu.com/p/654be9c0f13 ...
- System.ArgumentException:路由集合中已存在名为“XXX”的路由。路由名称必须唯一。
软件环境:Visual Studio 2017 + MVC4 + EF6 问题描述:System.ArgumentException:路由集合中已存在名为“XXX”的路由.路由名称必须唯一. 解决办法 ...
- Django--视图函数
目录 视图函数 HttpRequest对象 request属性 request常用方法 HttpResponse对象 render() redirect() JsonResponse 视图函数 一个视 ...
- 20、解决Vue使用bus兄弟组件间传值,第一次监听不到数据
1.新建bus.js文件: import Vue from 'vue' export default new Vue; 2.在需要通信组件A,B中引入bus: A组件: import Bus from ...
- 英语catarinite天铁托甲catarinite镍铁陨石
中文名称:天铁 镍铁陨石(catarinite)是铁质陨石的一种,发现地有南非,美国,澳洲,南极大陆等.镍铁陨石有著美丽的韦德曼交纹及高含量的镍成分,因此不易氧化,大部分都用来制造饰品或展现陨石韦德曼 ...
- 在Node.js中使用ejsexcel输出EXCEL文件
1.背景 在Nodejs应用程序中输出Excel,第一印象想到的一般是node-xlsx,这类插件不仅需要我们通过JS写入数据,还需要通过JS进行EXCEL显示样式的管理. 这是个大问题,不仅代码冗余 ...
- Android Drawable和Bitmap区别
一.相关概念 1.Drawable就是一个可画的对象,其可能是一张位图(BitmapDrawable),也可能是一个图形(ShapeDrawable),还有可能是一个图层(LayerDrawable) ...