PipelineWise illustrates the power of Singer

Stitch is based on Singer, an open source standard for moving data between databases, web APIs, files, queues, and just about anything else. Because it's open source, anyone can use Singer to write data extraction and loading scripts or more comprehensive utilities. TransferWise, the company I work for, used Singer to create a data pipeline framework called PipelineWise that replicates data from multiple sources to multiple destinations.
TransferWise uses more than a hundred microservices, which means we have hundreds of different type of data sources (MySQL, PostgreSQL, Kafka, Zendesk, Jira, etc.). We wanted to create a centralised analytics data store that could hold data from all of our sources, with due attention paid to security and scalability. We wanted to use change data capture (CDC) wherever possible to keep lag low. In addition, our solution had to:
- Apply schema changes automatically
- Avoid vendor lock-in — we wanted access to the source code to develop new features and fix issues quickly
- Keep configuration as code
We looked at traditional ETL tools, commercial replication tools, and Kafka streaming ETL. None of them met all of our needs. (You can read more details in my post on Medium.)
After several months we found the Singer specification and realised that we could get to a solution more quickly by building on this great work.
A data pipeline is born
Our analytics platform team created PipelineWise as an experiment in close cooperation with our data analysts and some of the product teams that use the data. It proved to be successful — PipelineWise now meets all of our initial requirements. We use it to replicate hundreds of gigabytes of data every day from 120 microservices, 1,500+ tables, and a bunch of external tools into our Snowflake data warehouse, with only minutes of lag.

Monitoring with Grafana: Replicating 120 data sources, 1,500+ tables into Snowflake with PipelineWise on three nodes of c5.2xlarge EC2 instances
Like any tool, PipelineWise has limitations:
- Not real-time: The currently supported target connectors are microbatch-oriented. We have to load data from S3 via the COPY command into Snowflake or Amazon Redshift because individual
INSERTstatements are inefficient. Creating these batches adds an extra layer to the process, so replication is not real-time. The replication lag from source to target is between 5 and 30 minutes depending on the data source. - Very active transactional tables: PipelineWise tries to do parallel processing wherever possible. Microbatches are created in parallel as well, one batch for each table, but currently we can’t create one individual batch in parallel. This means that replicating extremely large tables with millions of only
INSERTSandUPDATEScan be slow when the CDC replication method is enabled. In this case key-based incremental replication is faster and still reliable, as there are no deleted rows in source.
An evolving solution
PipelineWise is likely to evolve for some time to come, but it’s mature enough to release back to the open source community. Our hope is that others might benefit from and contribute toward the project, and possibly open up new and exciting ways of analysing data.
For detailed information on PipelineWise features and architecture, check out the documentation.
PipelineWise illustrates the power of Singer的更多相关文章
- pipelinewise 学习二 创建一个简单的pipeline
pipelinewise 提供了方便的创建简单pipeline的命令,可以简化pipeline 的创建,同时也可以帮我们学习 生成demo pipeline pipelinewise init --n ...
- xv6课本翻译之——第0章 操作系统接口
Chapter 0 第0章 Operating system interfaces 操作系统接口 The job of an operating system is to share a comput ...
- NCE3
Lesson1 A puma at large Pumas are large, cat-like animals which are found in America. When reports ...
- vim 的寄存器
If you've been following my series on Vim, it should be clear now that Vim has a pretty clear philos ...
- 2 Advanced Read/Write Splitting with PHP’s MySQLnd
原文地址需FQ才能看 https://blog.engineyard.com/2014/advanced-read-write-splitting-with-phps-mysqlnd In part ...
- New Concept English three (45)
31w/m 65error In democratic countries any efforts to restrict the freedom of the press are rightly c ...
- book-rev8 Chapter 0 Operating system interfaces
Chapter 0 第0章 Operating system interfaces 操作系统接口 The job of an operating system is to share a comput ...
- pipelinewise 基于singer 指南的的数据pipeline 工具
pipelinewise 是基于开源singer 指南开发的数据pipeline工具,与singer tap 以及target 兼容 支持的特性 内置的elt 特性 轻量级 支持多种复制方法,cdc( ...
- 无线电源传输 Wireless Power Consortium (WPC) Communication
Universally Compatible Wireless Power Using the Qi Protocol Wireless charging of portable electronic ...
随机推荐
- k8s-架构中各个组件介绍
参考链接:https://github.com/opsnull/follow-me-install-kubernetes-cluster kubernetes 概述 1.kubernetes 是什么 ...
- SAP替代,出口U904在RGGBS000中未生成
报错.提示出口U904在RGGBS000中未生成. 一般情况下需要到 程序RGGBS000 中,在form:get_exit_titles 中增加下列代码. exits-name = 'U904. e ...
- Vert.x(vertx)发送 HTTP/HTTPS请求
Vert.x Web服务有两种协议,一种是HTTP,另外一种是使用ssl的HTTPS,请求的方式有五种,分别是get.post.put.delete.head.为了简单,服务端主要实现对HTTP协议的 ...
- [Windows] - Windows/Office纯绿色一键激活工具及方法
瘟到死网上有很多一件键激活工具(如KMS),但许多带毒或报毒.这里给出一个纯绿色命令行一键激活,及自已搭建激活服务器的方法. KMS现在算法都是公开的了,可以自行在网上找到,这里不详述. 使用命令行一 ...
- windows下vmware和Hyper-v共存方法
问题描述:环境:windows server 2012 r2系统下安装Hyper-v后,再安装Vmware 在Vmware中创建虚拟机,安装虚拟机系统的时候,vmware提示:VMware Works ...
- linux memcached 的安装
linux memcached安装yum -y install libevent libevent-deve yum list memcached yum -y install memcached m ...
- pandas-18 reindex用法
pandas-18 reindex用法 pandas中的reindex方法可以为series和dataframe添加或者删除索引. 方法:serise.reindex().dataframe.rein ...
- vue.js 如何加载本地json文件
在项目开发的过程中,因为无法和后台的数据做交互,所以我们可以自建一个假数据文件(如data.json)到项目文件夹中,这样我们就可以模仿后台的数据进行开发.但是,如何在一个vue.js 项目中引入本地 ...
- springBoot 发布war/jar包到tomcat(idea)
参考链接:https://blog.csdn.net/qq1076472549/article/details/81318729 1.启动类目录新增打包类: 2.pom.xml新增依赖:<pa ...
- SYN泛洪攻击原理及防御
拒绝服务攻击时,攻击者想非法占用被攻击者的一些资源,比如如:带宽,CPU,内存等等,使得被攻击者无法响应正常用户的请求. 讲泛洪攻击之前,我们先了解一下DoS攻击和DDoS攻击,这两个攻击大体相同,前 ...