PipelineWise illustrates the power of Singer

Stitch is based on Singer, an open source standard for moving data between databases, web APIs, files, queues, and just about anything else. Because it's open source, anyone can use Singer to write data extraction and loading scripts or more comprehensive utilities. TransferWise, the company I work for, used Singer to create a data pipeline framework called PipelineWise that replicates data from multiple sources to multiple destinations.
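To make the idea concrete, here is a minimal sketch of the message stream the Singer specification describes: a tap writes `SCHEMA`, `RECORD` and `STATE` messages as one JSON object per line, and a target reads them. The `users` stream, its fields, the `last_id` bookmark and the helper names (`write_message`, `run_tap`, `run_target`) are assumptions made for illustration. This is not PipelineWise code.

```python
import json
import sys


def write_message(message, out=sys.stdout):
    """Write one Singer message as a single line of JSON."""
    out.write(json.dumps(message) + "\n")


def run_tap(rows, out=sys.stdout):
    """Emit SCHEMA, RECORD and STATE messages for a hypothetical 'users' stream."""
    write_message({
        "type": "SCHEMA",
        "stream": "users",
        "key_properties": ["id"],
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            },
        },
    }, out)
    last_id = None
    for row in rows:
        write_message({"type": "RECORD", "stream": "users", "record": row}, out)
        last_id = row["id"]
    # The STATE value is free-form JSON; here it stores a made-up 'last_id' bookmark.
    write_message({"type": "STATE", "value": {"users": {"last_id": last_id}}}, out)


def run_target(lines):
    """Toy target: collect records per stream and remember the latest state."""
    records, state = {}, None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            records.setdefault(message["stream"], []).append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
    return records, state


if __name__ == "__main__":
    run_tap([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}])
```

Because the two sides only share this line-oriented JSON format, any tap can in principle be piped into any target, which is what makes the standard reusable across sources and destinations.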
TransferWise uses more than a hundred microservices, which means we have hundreds of different types of data sources (MySQL, PostgreSQL, Kafka, Zendesk, Jira, etc.). We wanted to create a centralised analytics data store that could hold data from all of our sources, with due attention paid to security and scalability. We wanted to use change data capture (CDC) wherever possible to keep lag low. In addition, our solution had to:
- Apply schema changes automatically
- Avoid vendor lock-in — we wanted access to the source code to develop new features and fix issues quickly
- Keep configuration as code
We looked at traditional ETL tools, commercial replication tools, and Kafka streaming ETL. None of them met all of our needs. (You can read more details in my post on Medium.)
After several months we found the Singer specification and realised that we could get to a solution more quickly by building on this great work.
A data pipeline is born
Our analytics platform team created PipelineWise as an experiment in close cooperation with our data analysts and some of the product teams that use the data. It proved to be successful — PipelineWise now meets all of our initial requirements. We use it to replicate hundreds of gigabytes of data every day from 120 microservices, 1,500+ tables, and a bunch of external tools into our Snowflake data warehouse, with only minutes of lag.

Monitoring with Grafana: Replicating 120 data sources, 1,500+ tables into Snowflake with PipelineWise on three nodes of c5.2xlarge EC2 instances
Like any tool, PipelineWise has limitations:
- Not real-time: The currently supported target connectors are microbatch-oriented. We have to load data from S3 via the COPY command into Snowflake or Amazon Redshift because individual INSERT statements are inefficient. Creating these batches adds an extra layer to the process, so replication is not real-time. The replication lag from source to target is between 5 and 30 minutes depending on the data source.
- Very active transactional tables: PipelineWise tries to do parallel processing wherever possible. Microbatches are created in parallel as well, one batch for each table, but currently we can't create one individual batch in parallel. This means that replicating extremely large tables that receive millions of INSERTs and UPDATEs, and no DELETEs, can be slow when the CDC replication method is enabled. In this case key-based incremental replication is faster and still reliable, as there are no deleted rows in the source.
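As an illustration of the key-based incremental approach mentioned above, the following sketch selects only the rows whose replication key has moved past a stored bookmark. It uses an in-memory SQLite database as a stand-in source; the `transfers` table, the `updated_at` key and the function names are hypothetical, and this is not how PipelineWise itself is implemented.

```python
import sqlite3


def extract_incremental(conn, table, key, bookmark):
    """Return rows whose replication key exceeds the bookmark, plus the new bookmark."""
    # Table and column names are interpolated, so they must come from trusted config.
    cursor = conn.execute(
        f"SELECT * FROM {table} WHERE {key} > ? ORDER BY {key}", (bookmark,)
    )
    columns = [description[0] for description in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    new_bookmark = rows[-1][key] if rows else bookmark
    return rows, new_bookmark


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transfers (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO transfers VALUES (?, ?, ?)", [(1, 10.0, 100), (2, 25.5, 101)]
    )
    # First run starts from bookmark 0 and picks up every existing row.
    first, bookmark = extract_incremental(conn, "transfers", "updated_at", 0)
    # A later insert and an update both move updated_at past the bookmark.
    conn.execute("INSERT INTO transfers VALUES (3, 7.0, 102)")
    conn.execute("UPDATE transfers SET amount = 11.0, updated_at = 103 WHERE id = 1")
    # Second run returns only the new and the changed row.
    second, bookmark = extract_incremental(conn, "transfers", "updated_at", bookmark)
    return first, second, bookmark
```

A query like this never sees a deleted row, which is why the approach is only reliable for tables where rows are inserted and updated but not deleted.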
An evolving solution
PipelineWise is likely to evolve for some time to come, but it’s mature enough to release back to the open source community. Our hope is that others might benefit from and contribute toward the project, and possibly open up new and exciting ways of analysing data.
For detailed information on PipelineWise features and architecture, check out the documentation.