Getting to Know the Airflow DAG
In the previous post, Airflow's First DAG, we got our first task up and running. This post fleshes that task out.
A quick review of what that task contained:
- We named the DAG Hello-World; this name is the dag_id
- We added a description as supplementary notes
- We defined the schedule interval, schedule_interval, which is a cron expression
- We introduced a bash task
- There is one important parameter, default_args, which holds the DAG-level defaults
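To make the recap concrete, here is a rough sketch of what that first DAG looked like. The description text and the exact cron expression are my assumptions here, not copied from the previous post:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# default_args are handed to every operator in this DAG as constructor defaults
default_args = {
    'owner': 'ryanmiao',
    'depends_on_past': False,
    'start_date': datetime(2019, 5, 1, 9),
}

dag = DAG(
    dag_id='Hello-World',              # the DAG's name
    description='my first DAG',        # assumed description text
    default_args=default_args,
    schedule_interval='0 9 * * *',     # assumed cron expression: daily at 09:00
)

t1 = BashOperator(
    task_id='hello',
    bash_command="echo 'Hello World, today is {{ ds }}'",
    dag=dag,
)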
How to run different kinds of tasks
In Airflow, different kinds of operations are run by bringing in different operators. Some are built in:
https://github.com/apache/airflow/tree/master/airflow/operators
Third parties have contributed more:
https://github.com/apache/airflow/tree/master/airflow/contrib/operators
You can also write your own plugin to create a custom task type.
When you want to use these operators, just import them:
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator
from operators.rdbms_to_redis_operator import RDBMS2RedisOperator
from operators.rdbms_to_hive_operator import RDBMS2HiveOperator
from operators.hive_to_rdbms_operator import Hive2RDBMSOperator
Then fill in the required parameters:
t1 = BashOperator(task_id="hello",
                  bash_command="echo 'Hello World, today is {{ ds }}'",
                  dag=dag)
Refer to https://github.com/apache/airflow/tree/master/airflow/example_dags and the source code to see how these operators are used.
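For instance, the DummyOperator and BranchPythonOperator imported above can be combined like this. This is only a sketch under my own assumptions: the branching rule and task names are made up, and it reuses the dag object defined earlier.

from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator


def choose_branch(**context):
    # illustrative rule: follow one task_id on weekdays, another on weekends
    return 'weekday_task' if context['execution_date'].weekday() < 5 else 'weekend_task'


branch = BranchPythonOperator(
    task_id='branch',
    python_callable=choose_branch,
    provide_context=True,   # Airflow 1.x: pass ds, execution_date, etc. to the callable
    dag=dag,
)

weekday_task = DummyOperator(task_id='weekday_task', dag=dag)
weekend_task = DummyOperator(task_id='weekend_task', dag=dag)

# the branch task skips whichever downstream task it did not return
branch >> [weekday_task, weekend_task]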
How to get the task execution date
This topic deserves a post of its own; here is just a quick overview. The task date is available through Jinja template variables, and the following ones cover most everyday needs:
templated_command = """
echo "current bizdate is: {{ ds }} "
echo "current bizdate in number: {{ ds_nodash }} "
echo "7days after: {{ macros.ds_add(ds, 7)}} "
echo "5 days ago: {{ macros.ds_add(ds, -5) }} "
echo "bizdate iso8601 {{ ts }} "
echo "bizdate format: {{ execution_date.strftime("%d-%m-%Y") }} "
echo "bizdate 5 days ago format: {{ (execution_date - macros.timedelta(days=5)).strftime("%Y-%m-%d") }} "
"""
t1 = BashOperator(
    task_id='print_date1',
    bash_command=templated_command,
    # on_success_callback=compass_utils.success_callback(dingding_conn_id='dingding_bigdata', receivers="ryanmiao"),
    dag=dag)
In the execution log, the rendered command looks like this:
echo "current bizdate is: 2019-09-28 "
echo "current bizdate in number: 20190928 "
echo "7days after: 2019-10-05 "
echo "5 days ago: 2019-09-23 "
echo "bizdate iso8601 2019-09-28T01:00:00+08:00 "
echo "bizdate format: 28-09-2019 "
echo "bizdate 5 days ago format: 2019-09-23 "
Alerting
Tasks run on their own, but we need to be told how they went — notify me on success, on failure, or both.
default_args = {
    'owner': 'ryanmiao',
    'depends_on_past': False,
    'start_date': datetime(2019, 5, 1, 9),
    'email': ['ryan.miao@nf-3.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    # 'on_failure_callback': compass_utils.ding_failure_callback('dingding_bigdata'),
    # 'on_success_callback': compass_utils.ding_success_callback('dingding_bigdata')
}
The built-in email-on-failure notification requires the email (SMTP) settings in the Airflow configuration file. In practice, though, we usually have our own notification service with its own authentication, so Airflow also provides notification callbacks:
- on_failure_callback: a Python function executed when the task fails
- on_success_callback: a Python function executed when the task succeeds
For example, suppose I want to add a DingTalk notification.
from airflow.contrib.operators.dingding_operator import DingdingOperator


def failure_callback(context):
    """
    The function that will be executed on failure.

    :param context: The context of the executed task.
    :type context: dict
    """
    message = 'AIRFLOW TASK FAILURE TIPS:\n' \
              'DAG: {}\n' \
              'TASKS: {}\n' \
              'Reason: {}\n' \
        .format(context['task_instance'].dag_id,
                context['task_instance'].task_id,
                context['exception'])
    return DingdingOperator(
        task_id='dingding_success_callback',
        dingding_conn_id='dingding_default',
        message_type='text',
        message=message,
        at_all=True,
    ).execute(context)


default_args['on_failure_callback'] = failure_callback
Configure the DingTalk group token under Admin -> Connections in the web UI, then reference that connection id (dingding_conn_id) here.
Likewise, we could make an HTTP request to our own notification service to send emails, place phone calls, or anything else we want to customize. A later post will cover custom plugins for building this kind of custom notification.
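As a sketch of that idea — the endpoint URL and payload fields below are placeholders for your own service, not a real API — a failure callback can simply POST the task information with the requests library:

import requests


def http_failure_callback(context):
    ti = context['task_instance']
    payload = {
        'dag_id': ti.dag_id,
        'task_id': ti.task_id,
        'execution_date': str(context['execution_date']),
        'reason': str(context.get('exception')),
    }
    # placeholder endpoint for an in-house notification service
    requests.post('http://notify.example.com/api/alert', json=payload, timeout=10)


# wire it in instead of the DingTalk callback above (only one callable can be assigned)
default_args['on_failure_callback'] = http_failure_callback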
Task dependencies in a DAG
Declaring task dependencies in a DAG is simple:
a >> b           # b depends on a
a << b           # a depends on b
a >> b >> c      # dependencies can be chained
[a, b] >> c      # a task can depend on several tasks
Each dependency statement goes on its own line, and together they assemble the complete dependency graph.
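A small sketch that puts these forms together (the task names are made up, and the dag object is assumed to be in scope):

from airflow.operators.dummy_operator import DummyOperator

extract_a = DummyOperator(task_id='extract_a', dag=dag)
extract_b = DummyOperator(task_id='extract_b', dag=dag)
transform = DummyOperator(task_id='transform', dag=dag)
load = DummyOperator(task_id='load', dag=dag)

# both extracts must finish before transform runs, and transform precedes load
[extract_a, extract_b] >> transform
transform >> load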
Some DAG parameters
Let's start with the docstring from the source code:
"""
A dag (directed acyclic graph) is a collection of tasks with directional
dependencies. A dag also has a schedule, a start date and an end date
(optional). For each schedule, (say daily or hourly), the DAG needs to run
each individual tasks as their dependencies are met. Certain tasks have
the property of depending on their own past, meaning that they can't run
until their previous schedule (and upstream tasks) are completed.
DAGs essentially act as namespaces for tasks. A task_id can only be
added once to a DAG.
:param dag_id: The id of the DAG
:type dag_id: str
:param description: The description for the DAG to e.g. be shown on the webserver
:type description: str
:param schedule_interval: Defines how often that DAG runs, this
timedelta object gets added to your latest task instance's
execution_date to figure out the next schedule
:type schedule_interval: datetime.timedelta or
dateutil.relativedelta.relativedelta or str that acts as a cron
expression
:param start_date: The timestamp from which the scheduler will
attempt to backfill
:type start_date: datetime.datetime
:param end_date: A date beyond which your DAG won't run, leave to None
for open ended scheduling
:type end_date: datetime.datetime
:param template_searchpath: This list of folders (non relative)
defines where jinja will look for your templates. Order matters.
Note that jinja/airflow includes the path of your DAG file by
default
:type template_searchpath: str or list[str]
:param template_undefined: Template undefined type.
:type template_undefined: jinja2.Undefined
:param user_defined_macros: a dictionary of macros that will be exposed
in your jinja templates. For example, passing ``dict(foo='bar')``
to this argument allows you to ``{{ foo }}`` in all jinja
templates related to this DAG. Note that you can pass any
type of object here.
:type user_defined_macros: dict
:param user_defined_filters: a dictionary of filters that will be exposed
in your jinja templates. For example, passing
``dict(hello=lambda name: 'Hello %s' % name)`` to this argument allows
you to ``{{ 'world' | hello }}`` in all jinja templates related to
this DAG.
:type user_defined_filters: dict
:param default_args: A dictionary of default parameters to be used
as constructor keyword parameters when initialising operators.
Note that operators have the same hook, and precede those defined
here, meaning that if your dict contains `'depends_on_past': True`
here and `'depends_on_past': False` in the operator's call
`default_args`, the actual value will be `False`.
:type default_args: dict
:param params: a dictionary of DAG level parameters that are made
accessible in templates, namespaced under `params`. These
params can be overridden at the task level.
:type params: dict
:param concurrency: the number of task instances allowed to run
concurrently
:type concurrency: int
:param max_active_runs: maximum number of active DAG runs, beyond this
number of DAG runs in a running state, the scheduler won't create
new active DAG runs
:type max_active_runs: int
:param dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns, and only once the
# of active DagRuns == max_active_runs.
:type dagrun_timeout: datetime.timedelta
:param sla_miss_callback: specify a function to call when reporting SLA
timeouts.
:type sla_miss_callback: types.FunctionType
:param default_view: Specify DAG default view (tree, graph, duration,
gantt, landing_times)
:type default_view: str
:param orientation: Specify DAG orientation in graph view (LR, TB, RL, BT)
:type orientation: str
:param catchup: Perform scheduler catchup (or only run latest)? Defaults to True
:type catchup: bool
:param on_failure_callback: A function to be called when a DagRun of this dag fails.
A context dictionary is passed as a single parameter to this function.
:type on_failure_callback: callable
:param on_success_callback: Much like the ``on_failure_callback`` except
that it is executed when the dag succeeds.
:type on_success_callback: callable
:param access_control: Specify optional DAG-level permissions, e.g.,
"{'role1': {'can_dag_read'}, 'role2': {'can_dag_read', 'can_dag_edit'}}"
:type access_control: dict
:param is_paused_upon_creation: Specifies if the dag is paused when created for the first time.
If the dag exists already, this flag will be ignored. If this optional parameter
is not specified, the global config setting will be used.
:type is_paused_upon_creation: bool or None
"""
Emmm, I won't walk through every parameter here; I prefer to learn each one when I actually need it, with the docstring open alongside.
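Still, to ground a few of the common ones, here is a sketch of a DAG built with several of the parameters above; the values are only illustrative, not recommendations:

from datetime import datetime, timedelta

from airflow import DAG

dag = DAG(
    dag_id='param_demo',
    description='demonstrates a few DAG-level parameters',
    schedule_interval='0 2 * * *',           # a cron string; a timedelta also works
    start_date=datetime(2019, 5, 1),
    catchup=False,                           # only run the latest schedule, skip backfill
    max_active_runs=1,                       # at most one DagRun in the running state
    concurrency=4,                           # at most four task instances at once
    dagrun_timeout=timedelta(hours=2),
    user_defined_macros={'env': 'prod'},     # available in templates as {{ env }}
    default_args={'owner': 'ryanmiao', 'retries': 1},
)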
Summary
A DAG is simple to compose: the declarative Python syntax is easier to organize and understand than properties files or YAML configuration. Define the DAG parameters, define the tasks with operators, define the dependencies between tasks, and you are done.