Supercharging your ETL with Airflow and Singer

Earlier this year we introduced Singer, an open source project that helps data teams build simple, composable ETL. Singer provides a standard way for anyone to pull data from and send data to any source.
For many companies, however, moving data from A to B is only part of the problem. Data extraction is often one piece of a more complex workflow that involves scheduled tasks, complex dependencies, and the need for a scalable, distributed architecture.
Enter Apache Airflow. Originally developed at Airbnb and now a part of the Apache Incubator, Airflow takes the simplicity of a cron scheduler and adds all the facets of a modern workflow tool: dependency graphs, detailed logging, automated notifications, scalable infrastructure, and a graphical user interface.
A dependency tree and history of task runs from Airflow’s UI
Imagine a company that relies on data from multiple data sources, including SaaS tools, databases, and flat files. Several times a day this company might want to ingest new data from these sources in parallel. The company might manipulate it in some way, then dump the output into a data warehouse.
Airflow and Singer can make all of that happen. With a few lines of code, you can use Airflow to easily schedule and run Singer tasks, which can then trigger the remainder of your workflow.
A real-world example
Let’s look at a real-world example developed by a member of the Singer community. In this scenario we’re going to be pulling in CSV files, but Singer can work with any data source.
Our user has a specific sequence of tasks they need to complete each morning:
- Download new compressed CSV files from an AWS S3 bucket
- Decompress those files
- Use a Singer CSV tap to push the data to a Singer Stitch target. In this example we’re dumping data into Amazon Redshift, but you could target Google BigQuery or Postgres, too.
- Delete the compressed and decompressed files
This entire workflow, including all scripts, logging, and the Airflow implementation itself, is accomplished in fewer than 160 lines of Python code in this repo. Let’s see how it’s done.
Firing up Airflow
First we get Airflow running as described on the project’s Quick Start page with four commands:
```shell
# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow

# install from pypi using pip
pip install airflow

# initialize the database
airflow initdb

# start the web server, default port is 8080
airflow webserver -p 8080
```
Upon running that last command, you should see some ASCII art, letting you know that the web server is online:

Now, point your browser to http://localhost:8080/ to see a screen that looks like this:

At this point, you’re ready to create your own Airflow DAG (Directed Acyclic Graph) to perform data workflow tasks. For our purposes, we can get a ready-made DAG by cloning the airflow-singer repo:
```shell
git clone git@github.com:robertjmoore/airflow-singer.git
```
Customizing the repo
For Airflow to find the DAG in this repo, you'll need to tweak the dags_folder variable in the ~/airflow/airflow.cfg file to point to the dags directory inside the repo:
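The relevant line lives in the [core] section of the config file and looks something like this (the exact path depends on where you cloned the repo):

```ini
[core]
# Point Airflow at the dags/ directory inside the cloned repo
dags_folder = /home/you/airflow-singer/dags
```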

You’ll also want to make a few tweaks to the singer.py file in the repo’s dags folder to reflect your contact info and the location of the repo on your local file system:

Restart the web server with the command airflow webserver -p 8080, then refresh the Airflow UI in your browser. You should now see the DAG from our repo:

Clicking on it will show us the Graph View, which lays out the steps taken each morning when the DAG is run:

This dependency map is governed by a few lines of code inside the dags/singer.py file. Let’s unpack a little of what’s going on.
Exploring the DAG

This tiny file defines the whole graph. Each of these tasks is a step in the DAG, and the final four lines draw out the dependencies that exist between them.
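For readers who can't see the screenshot, a minimal sketch of what such a DAG file looks like follows. It uses the Airflow 1.x API the article describes; the task IDs, script names, and paths are illustrative rather than copied from the repo:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "you",
    "start_date": datetime(2017, 8, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG("singer_csv_to_stitch", default_args=default_args,
          schedule_interval="@daily")

# Each step shells out to a script (or to Singer itself) and must
# exit successfully before downstream tasks run. {{ ds }} is Airflow's
# execution-date template variable, rendered into each command.
download = BashOperator(
    task_id="download_from_s3",
    bash_command="python ~/airflow-singer/scripts/download.py {{ ds }}",
    dag=dag)

decompress = BashOperator(
    task_id="decompress_files",
    bash_command="python ~/airflow-singer/scripts/decompress.py {{ ds }}",
    dag=dag)

write_config = BashOperator(
    task_id="write_csv_config",
    bash_command="python ~/airflow-singer/scripts/write_config.py {{ ds }}",
    dag=dag)

run_singer = BashOperator(
    task_id="tap_csv_to_stitch",
    bash_command="tap-csv -c ~/config/csv-config.json | "
                 "target-stitch -c ~/config/stitch_config.json",
    dag=dag)

cleanup = BashOperator(
    task_id="delete_files",
    bash_command="python ~/airflow-singer/scripts/cleanup.py {{ ds }}",
    dag=dag)

# These final lines draw out the dependency chain shown in the Graph View.
download >> decompress >> write_config >> run_singer >> cleanup
```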
You’ll notice that, in this file, each step is a BashOperator that calls a specific command-line task and waits for its successful completion. Airflow supports a number of other operators and allows you to build your own. This makes it easy for a DAG to include interactions with databases, email services, and chat tools like Slack.
Interacting with Singer
To get a better idea of how Singer is integrated, check out the individual files in the scripts/ directory. You'll find Python scripts that download data from an Amazon S3 bucket, extract that data, and delete the files on completion.
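The decompress step, for instance, needs nothing beyond the standard library. A sketch (the file layout is illustrative; the repo's actual script may differ):

```python
import gzip
import shutil

def decompress(src_path):
    """Decompress a .gz file alongside the original and return the new path."""
    if not src_path.endswith(".gz"):
        raise ValueError("expected a .gz file: %s" % src_path)
    dest_path = src_path[:-3]  # strip the .gz suffix
    # Stream the decompressed bytes to disk without loading the whole file
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path
```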
The most interesting step is the process of using Singer to extract the data from the CSV files and push it to a target – namely Stitch.
We should also note that the CSV tap requires a config file telling it where to find the CSV files it should read, so one step in our DAG generates that JSON config file and points the tap at the files we just extracted. We do this by writing a few lines of JSON. Note that we use Airflow's execution_date variable across our various scripts to be sure we deposit and retrieve the files from the same path.
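A sketch of that config-generation step. The exact fields the CSV tap expects depend on the tap's version, and the entity name, key column, and directory layout here are illustrative, not copied from the repo:

```python
import json
import os

def write_csv_config(execution_date, config_path="csv-config.json"):
    """Point the CSV tap at the files extracted for this run.

    The same execution_date is used by the download and decompress
    steps, so every task in the DAG works against the same path.
    """
    data_dir = os.path.join("/tmp", "singer", execution_date)
    config = {
        "files": [
            {
                "entity": "orders",                                 # destination table name
                "file_path": os.path.join(data_dir, "orders.csv"),  # extracted CSV for this run
                "keys": ["id"],                                     # primary key column(s)
            }
        ]
    }
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return config

cfg = write_csv_config("2017-08-01")
```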

Once that config file has been generated, we call Singer to do all the work in a single command line:
```shell
tap-csv -c ~/config/csv-config.json | target-stitch -c ~/config/stitch_config.json
```
This doesn’t even require a special Python script — the entire instruction is laid out in a single line of the singer.py DAG file.

Conclusion
As you can see, incorporating Singer into your Airflow DAGs gives you a powerful way to move data automatically. Anyone can extract and load data with a one-line instruction, using a growing ecosystem of taps and targets.