Azkaban 3.45

I. Introduction

1 Official site

https://azkaban.github.io/

Azkaban was implemented at LinkedIn to solve the problem of Hadoop job dependencies. We had jobs that needed to run in order, from ETL jobs to data analytics products.

Initially a single server solution, with the increased number of Hadoop users over the years, Azkaban has evolved to be a more robust solution.

(Unfortunately, in the current version the WebServer is still a single point of failure.)

Azkaban consists of 3 key components:

Relational Database (MySQL)
AzkabanWebServer
AzkabanExecutorServer


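2 Deployment

3 Database schema

projects: projects

project_flows: flow definitions

execution_flows: flow instances (executions)

execution_jobs: job instances

triggers: schedule definitions

Note: much of the data in these tables is stored encoded. enc_type is the encoding type (the corresponding enum is EncodingType): 2 means gzip, and any other value means no encoding. For 2, call GZIPUtils.transformBytesToObject to recover the original string.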
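To make the enc_type rule concrete, here is a minimal Python sketch of the decoding step. It is not Azkaban's code (that is Java, in GZIPUtils); the function name decode_blob, the UTF-8 charset, and the JSON payload are assumptions made for illustration.

```python
import gzip
import json

GZIP = 2  # enc_type value that the note identifies as gzip


def decode_blob(enc_type: int, data: bytes) -> str:
    """Return the original string stored in an encoded column (UTF-8 assumed)."""
    raw = gzip.decompress(data) if enc_type == GZIP else data
    return raw.decode("utf-8")


# Round trip with a made-up payload standing in for a flow_data value.
payload = json.dumps({"flowId": "demo", "status": "READY"})
blob = gzip.compress(payload.encode("utf-8"))

decoded_gzip = decode_blob(2, blob)                      # gzip-encoded row
decoded_plain = decode_blob(1, payload.encode("utf-8"))  # unencoded row
```

The same helper could be pointed at a column read straight from MySQL when inspecting a row by hand, provided the charset assumption holds.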

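4 Concepts

- Job: the smallest unit of execution, a single node in the DAG.

- Flow: a workflow made up of multiple Jobs, whose ordering is declared through each Job's dependencies property.

- Trigger: a schedule that fires a Flow according to a given cron expression.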
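The Job/Flow relationship is just a dependency graph, which a few lines of Python can illustrate. The job names below are hypothetical, and the standard-library TopologicalSorter stands in for Azkaban's own graph handling.

```python
from graphlib import TopologicalSorter

# Hypothetical flow: each job name maps to the set of jobs it depends on.
flow = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"extract"},
    "report": {"clean", "aggregate"},
}

# Any valid order runs a job only after all of its dependencies.
run_order = list(TopologicalSorter(flow).static_order())
```

Here "clean" and "aggregate" have no dependency on each other, so either may come first; only their position relative to "extract" and "report" is fixed.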

II. Code walkthrough

1 Startup

Web Server

AzkabanWebServer.main

launch

prepareAndStartServer

configureRoutes

TriggerManager.start

FlowTriggerService.start

recoverIncompleteTriggerInstances

SELECT %s FROM execution_dependencies WHERE trigger_instance_id in (SELECT trigger_instance_id FROM execution_dependencies WHERE dep_status = %s or dep_status = %s or (dep_status = %s and flow_exec_id = %s))

FlowTriggerScheduler.start

ExecutorManager

setupExecutors

loadRunningFlows

QueueProcessorThread.run

ExecutingManagerUpdaterThread.run

Executor Server

AzkabanExecutorServer.main

launch

AzkabanExecutorServer.start

insertExecutorEntryIntoDB

2 Flow execution

The Web Server has two entry points:

ExecuteFlowAction.doAction

ExecutorServlet.ajaxExecuteFlow

The Web Server dispatches the flow:

ExecutorManager.submitExecutableFlow

JdbcExecutorLoader.uploadExecutableFlow

INSERT INTO execution_flows (project_id, flow_id, version, status, submit_time, submit_user, update_time) values (?,?,?,?,?,?,?)

ExecutorLoader.addActiveExecutableReference

INSERT INTO active_executing_flows (exec_id, update_time) values (?,?)

queuedFlows.enqueue

QueueProcessorThread.run

processQueuedFlows

ExecutorManager.selectExecutorAndDispatchFlow (get from queuedFlows)

selectExecutor

dispatch

JdbcExecutorLoader.assignExecutor

UPDATE execution_flows SET executor_id=? where exec_id=?

ExecutorApiGateway.callWithExecutable (calls the Executor Server)
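The dispatch chain above boils down to: take an execution off the queue, pick an executor, record the assignment, then call the executor. A rough Python sketch of that loop follows. The Executor class and the least-loaded selection policy are assumptions for illustration; the notes here do not say how selectExecutor actually chooses.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Executor:
    executor_id: int
    running: list = field(default_factory=list)


def select_executor(executors):
    # Assumed policy: pick the executor with the fewest running flows.
    return min(executors, key=lambda e: len(e.running))


def process_queued_flows(queued, executors, assignments):
    """Drain the queue, assigning each execution id to an executor."""
    while queued:
        exec_id = queued.popleft()
        chosen = select_executor(executors)
        # Stands in for: UPDATE execution_flows SET executor_id=? where exec_id=?
        assignments[exec_id] = chosen.executor_id
        # Stands in for the remote call to the Executor Server.
        chosen.running.append(exec_id)


executors = [Executor(1), Executor(2)]
queued = deque([101, 102, 103])
assignments = {}
process_queued_flows(queued, executors, assignments)
```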

The Executor Server runs the flow:

ExecutorServlet.doGet

handleAjaxExecute

FlowRunnerManager.submitFlow

JdbcExecutorLoader.fetchExecutableFlow

SELECT exec_id, enc_type, flow_data FROM execution_flows WHERE exec_id=?

FlowPreparer.setup

FlowRunner.run

setupFlowExecution

updateFlow

UPDATE execution_flows SET status=?,update_time=?,start_time=?,end_time=?,enc_type=?,flow_data=? WHERE exec_id=?

runFlow

progressGraph

runReadyJob

runExecutableNode

JobRunner.run

uploadExecutableNode

INSERT INTO execution_jobs (exec_id, project_id, version, flow_id, job_id, start_time, end_time, status, input_params, attempt) VALUES (?,?,?,?,?,?,?,?,?,?)

prepareJob

runJob

Job.run (ProcessJob, JavaJob)
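The runFlow / progressGraph / runReadyJob part of this chain can be pictured as a loop that keeps starting whichever jobs have all their dependencies finished. The sketch below is a simplified, single-threaded Python version of that idea. The status names and the rule that a failure cancels downstream jobs are assumptions, not taken from the Azkaban source.

```python
def run_flow(dependencies, run_job):
    """Run jobs as their dependencies succeed; return the final status per job.

    dependencies: job name -> list of job names it depends on
    run_job: callable returning True on success, False on failure
    """
    status = {job: "READY" for job in dependencies}
    progressed = True
    while progressed:
        progressed = False
        for job, deps in dependencies.items():
            if status[job] != "READY":
                continue
            if any(status[d] in ("FAILED", "CANCELLED") for d in deps):
                # An upstream failure means this job can never run.
                status[job] = "CANCELLED"
                progressed = True
            elif all(status[d] == "SUCCEEDED" for d in deps):
                status[job] = "SUCCEEDED" if run_job(job) else "FAILED"
                progressed = True
    return status


# Hypothetical flow in which job "b" fails.
deps = {"a": [], "b": ["a"], "c": ["b"], "d": ["a"]}
final_status = run_flow(deps, run_job=lambda job: job != "b")
```

Job "d" still succeeds because it depends only on "a", while "c" is cancelled by the failure of "b".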

The Web Server polls flow status:

ExecutingManagerUpdaterThread.run

getFlowToExecutorMap

ExecutorApiGateway.callWithExecutionId

updateExecution
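One polling pass can be sketched as: group running executions by executor, ask each executor once for their status, and drop the ones that have finished. In the Python sketch below, the set of terminal status names and the fetch_status callback are placeholders for the real remote call.

```python
from collections import defaultdict

FINISHED = {"SUCCEEDED", "FAILED", "KILLED"}  # assumed terminal states


def poll_once(running, fetch_status):
    """running: exec_id -> executor_id.
    fetch_status(executor_id, exec_ids) -> {exec_id: status}.
    Removes finished executions from `running` and returns them."""
    by_executor = defaultdict(list)
    for exec_id, executor_id in running.items():
        by_executor[executor_id].append(exec_id)

    finished = {}
    for executor_id, exec_ids in by_executor.items():
        # One call per executor, covering all of its executions.
        for exec_id, status in fetch_status(executor_id, exec_ids).items():
            if status in FINISHED:
                finished[exec_id] = status
                del running[exec_id]
    return finished


# Fake executor responses for three hypothetical executions.
statuses = {1: "RUNNING", 2: "SUCCEEDED", 3: "FAILED"}
running = {1: 10, 2: 10, 3: 20}
done = poll_once(running, lambda executor_id, ids: {i: statuses[i] for i in ids})
```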

3 Schedule execution

TriggerManager.start

loadTriggers

SELECT trigger_id, trigger_source, modify_time, enc_type, data FROM triggers

TriggerScannerThread.start

checkAllTriggers

onTriggerTrigger

TriggerAction.doAction

ExecuteFlowAction.doAction
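The scanner thread's job is to walk all loaded triggers, fire the action of any that are due, and move them to their next run time. A small Python sketch of one scan follows. The Trigger dataclass is hypothetical, and a fixed interval stands in for the cron expression to keep the example free of third-party libraries.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trigger:
    trigger_id: int
    next_fire: float             # epoch seconds of the next scheduled run
    period: float                # fixed interval standing in for a cron expression
    action: Callable[[], None]   # e.g. submit a flow for execution


def check_all_triggers(triggers, now):
    """Fire every trigger that is due; return the ids that fired."""
    fired = []
    for trigger in triggers:
        if now >= trigger.next_fire:
            trigger.action()
            # Advance past `now` so a late scan does not fire repeatedly.
            while trigger.next_fire <= now:
                trigger.next_fire += trigger.period
            fired.append(trigger.trigger_id)
    return fired


submitted = []
triggers = [
    Trigger(1, next_fire=100.0, period=60.0, action=lambda: submitted.append("flow_a")),
    Trigger(2, next_fire=500.0, period=60.0, action=lambda: submitted.append("flow_b")),
]
fired_ids = check_all_triggers(triggers, now=130.0)
```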

Note: there is a second, completely independent scheduling mechanism, controlled by azkaban.server.schedule.enable_quartz (default false). The following call chain registers jobs with Quartz:

ProjectManagerServlet.ajaxHandleUpload

SELECT id, name, active, modified_time, create_time, version, last_modified_by, description, enc_type, settings_blob FROM projects WHERE name=? AND active=true

ProjectManager.loadAllProjectFlows

SELECT project_id, version, flow_id, modified_time, encoding_type, json FROM project_flows WHERE project_id=? AND version=?

FlowTriggerScheduler.scheduleAll

SELECT MAX(flow_version) FROM project_flow_files WHERE project_id=? AND project_version=? AND flow_name=?

SELECT flow_file FROM project_flow_files WHERE project_id=? AND project_version=? AND flow_name=? AND flow_version=?

registerJob

The Quartz job then executes as follows:

FlowTriggerQuartzJob.execute

FlowTriggerService.startTrigger

TriggerInstanceProcessor.processSucceed

TriggerInstanceProcessor.executeFlowAndUpdateExecID

ExecutorManager.submitExecutableFlow

4 Job execution

Job is the core interface for jobs; every concrete job type implements it:

Job

AbstractJob

AbstractProcessJob

ProcessJob (shell jobs)

JavaProcessJob (Java jobs)

JavaJob
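The pattern behind this class list is a single run() contract with process-based implementations layered on top. The Python analogue below borrows the class names for readability only. The constructor arguments and the way the java command line is built are assumptions, not Azkaban's actual signatures.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class Job(ABC):
    """Core contract: every job type knows how to run itself."""

    @abstractmethod
    def run(self) -> None:
        ...


class ProcessJob(Job):
    """Runs an external command, as a shell job would."""

    def __init__(self, job_id, command):
        self.job_id = job_id
        self.command = command
        self.output = None

    def run(self) -> None:
        # check=True raises if the command exits with a non-zero status.
        result = subprocess.run(self.command, capture_output=True, text=True, check=True)
        self.output = result.stdout.strip()


class JavaProcessJob(ProcessJob):
    """A process job whose command launches a JVM (assumed command layout)."""

    def __init__(self, job_id, main_class, classpath="."):
        super().__init__(job_id, ["java", "-cp", classpath, main_class])


hello = ProcessJob("hello", [sys.executable, "-c", "print('hi')"])
hello.run()
java_job = JavaProcessJob("demo", "com.example.Main")  # constructed only, not run
```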
