Stitch is based on Singer, an open source standard for moving data between databases, web APIs, files, queues, and just about anything else. Because it's open source, anyone can use Singer to write data extraction and loading scripts or more comprehensive utilities. TransferWise, the company I work for, used Singer to create a data pipeline framework called PipelineWise that replicates data from multiple sources to multiple destinations.
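At its core, the Singer specification defines a stream of JSON messages passed between a "tap" (extractor) and a "target" (loader) over a Unix pipe. The sketch below is illustrative rather than PipelineWise code: it shows the three message types — SCHEMA, RECORD, and STATE — that make up that stream, with a hypothetical `users` stream as the example.

```python
import json

def schema_message(stream, schema, key_properties):
    """SCHEMA message: describes the shape of the rows that follow."""
    return {"type": "SCHEMA", "stream": stream,
            "schema": schema, "key_properties": key_properties}

def record_message(stream, record):
    """RECORD message: one extracted row of data."""
    return {"type": "RECORD", "stream": stream, "record": record}

def state_message(value):
    """STATE message: a bookmark the target echoes back so the next run
    can resume where the last one stopped."""
    return {"type": "STATE", "value": value}

if __name__ == "__main__":
    # A tap writes these as JSON objects, one per line, to stdout;
    # a target reads them from stdin. The pipe is the whole interface.
    messages = [
        schema_message(
            "users",
            {"type": "object",
             "properties": {"id": {"type": "integer"},
                            "email": {"type": "string"}}},
            ["id"]),
        record_message("users", {"id": 1, "email": "jane@example.com"}),
        state_message({"users": {"last_id": 1}}),
    ]
    for message in messages:
        print(json.dumps(message))
```

Because every tap and target speaks this one protocol, any extractor can be combined with any loader — which is what makes a many-sources-to-many-destinations framework practical to build on top of it.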

TransferWise uses more than a hundred microservices, which means we have hundreds of different types of data sources (MySQL, PostgreSQL, Kafka, Zendesk, Jira, etc.). We wanted to create a centralised analytics data store that could hold data from all of our sources, with due attention paid to security and scalability. We wanted to use change data capture (CDC) wherever possible to keep lag low. In addition, our solution had to:

  • Apply schema changes automatically
  • Avoid vendor lock-in — we wanted access to the source code to develop new features and fix issues quickly
  • Keep configuration as code

We looked at traditional ETL tools, commercial replication tools, and Kafka streaming ETL. None of them met all of our needs. (You can read more details in my post on Medium.)

After several months we found the Singer specification and realised that we could get to a solution more quickly by building on this great work.

A data pipeline is born

Our analytics platform team created PipelineWise as an experiment in close cooperation with our data analysts and some of the product teams that use the data. It proved to be successful — PipelineWise now meets all of our initial requirements. We use it to replicate hundreds of gigabytes of data every day from 120 microservices, 1,500+ tables, and a bunch of external tools into our Snowflake data warehouse, with only minutes of lag.


Monitoring with Grafana: Replicating 120 data sources, 1,500+ tables into Snowflake with PipelineWise on three nodes of c5.2xlarge EC2 instances

Like any tool, PipelineWise has limitations:

  • Not real-time: The currently supported target connectors are microbatch-oriented. We have to load data from S3 via the COPY command into Snowflake or Amazon Redshift because individual INSERT statements are inefficient. Creating these batches adds an extra layer to the process, so replication is not real-time. The replication lag from source to target is between 5 and 30 minutes depending on the data source.
  • Very active transactional tables: PipelineWise tries to do parallel processing wherever possible. Microbatches are created in parallel as well, one batch per table, but currently we can’t parallelise the creation of a single batch. This means that replicating extremely large tables that receive millions of INSERTs and UPDATEs can be slow when the CDC replication method is enabled. In that case key-based incremental replication is faster and still reliable, as long as there are no deleted rows in the source.
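The microbatch idea behind the first limitation can be sketched as follows. This is a simplified illustration, not PipelineWise code: `upload_to_s3` and `run_copy` are hypothetical placeholders for the real S3 upload and the warehouse COPY command, and the threshold is arbitrary.

```python
import json

class MicroBatcher:
    """Accumulate rows per table, then load each table's batch in one COPY.

    Illustrative sketch only: individual INSERT statements are inefficient
    in Snowflake/Redshift, so rows are buffered and loaded in bulk, which
    is why replication is microbatched rather than real-time.
    """

    def __init__(self, flush_threshold=10000):
        self.flush_threshold = flush_threshold
        self.batches = {}  # table name -> list of buffered rows

    def add(self, table, row):
        batch = self.batches.setdefault(table, [])
        batch.append(row)
        # Each table's batch is flushed independently, so tables can be
        # processed in parallel -- but one batch is built serially.
        if len(batch) >= self.flush_threshold:
            self.flush(table)

    def flush(self, table):
        """Write the buffered rows for one table and return how many loaded."""
        rows = self.batches.pop(table, [])
        if not rows:
            return 0
        payload = "\n".join(json.dumps(row) for row in rows)
        key = self.upload_to_s3(table, payload)  # hypothetical helper
        self.run_copy(table, key)                # hypothetical helper
        return len(rows)

    def upload_to_s3(self, table, payload):
        # Placeholder: a real pipeline would write the batch file to S3.
        return f"s3://bucket/{table}/batch.jsonl"

    def run_copy(self, table, key):
        # Placeholder: a real pipeline would issue
        # COPY INTO <table> FROM <stage/key> against the warehouse.
        pass
```

Buffering and flushing like this is what introduces the 5–30 minute lag: a row only reaches the warehouse when its table's batch is flushed.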

An evolving solution

PipelineWise is likely to evolve for some time to come, but it’s mature enough to release back to the open source community. Our hope is that others might benefit from and contribute toward the project, and possibly open up new and exciting ways of analysing data.

For detailed information on PipelineWise features and architecture, check out the documentation.
