Azkaban 3.45

I Introduction

1 Official site

https://azkaban.github.io/

Azkaban was implemented at LinkedIn to solve the problem of Hadoop job dependencies. We had jobs that needed to run in order, from ETL jobs to data analytics products.

Initially a single server solution, with the increased number of Hadoop users over the years, Azkaban has evolved to be a more robust solution.

(Unfortunately, in the current version the web server is still a single point of failure.)

Azkaban consists of 3 key components:

Relational Database (MySQL)
AzkabanWebServer
AzkabanExecutorServer

That is, Azkaban has three core components: MySQL, the web server, and the executor server.

2 Deployment

3 Database tables

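projects: projects

project_flows: flow definitions

execution_flows: flow instances (one row per execution of a flow)

execution_jobs: job instances (one row per job run)

triggers: schedule definitions

PS: much of the data in these tables is stored encoded. The enc_type column gives the encoding type (the corresponding enum is EncodingType): 2 means gzip, any other value means no encoding. For type 2, call GZIPUtils.transformBytesToObject to recover the original string.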
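As a rough illustration of what decoding such a column involves, here is a minimal Python sketch. It is not Azkaban's code: the function name decode_blob is hypothetical, and it assumes the stored payload is UTF-8 JSON text, which the note above does not state.

```python
import gzip
import json

GZIP = 2  # enc_type value that the note above identifies as gzip


def decode_blob(enc_type: int, data: bytes):
    """Decode a blob column: gunzip when enc_type is 2, otherwise use the bytes as-is."""
    raw = gzip.decompress(data) if enc_type == GZIP else data
    # Assumption: the payload is UTF-8 encoded JSON text.
    return json.loads(raw.decode("utf-8"))


# Round trip with a made-up payload
payload = {"flowId": "demo", "status": "READY"}
blob = gzip.compress(json.dumps(payload).encode("utf-8"))
decoded = decode_blob(2, blob)
plain = decode_blob(1, json.dumps(payload).encode("utf-8"))
```

The same helper covers both cases, so a query result can be passed through it row by row using the row's own enc_type.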

4 Concepts

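• Job: the smallest unit of execution, a single node in the DAG

• Flow: a workflow made up of multiple Jobs, whose ordering is set through each Job's dependencies setting

• Trigger: a schedule that launches a Flow according to the given cron expression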
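To make the Job/Flow relationship concrete, the sketch below models a flow as a mapping from each job to the jobs it depends on and derives a valid run order. The job names are invented, and graphlib (Python 3.9+) stands in for whatever graph logic Azkaban uses internally.

```python
from graphlib import TopologicalSorter

# Hypothetical flow: each job maps to the list of jobs it depends on.
flow = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load", "transform"],
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(flow).static_order())
print(order)
```

A cycle in the dependencies would make static_order() raise graphlib.CycleError, which is why a flow has to be a DAG.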

II Code walkthrough

1 Startup

Web Server

AzkabanWebServer.main

launch

prepareAndStartServer

configureRoutes

TriggerManager.start

FlowTriggerService.start

recoverIncompleteTriggerInstances

SELECT %s FROM execution_dependencies WHERE trigger_instance_id in (SELECT trigger_instance_id FROM execution_dependencies WHERE dep_status = %s or dep_status = %s or (dep_status = %s and flow_exec_id = %s))

FlowTriggerScheduler.start

ExecutorManager

setupExecutors

loadRunningFlows

QueueProcessorThread.run

ExecutingManagerUpdaterThread.run

Executor Server

AzkabanExecutorServer.main

launch

AzkabanExecutorServer.start

insertExecutorEntryIntoDB

2 Flow execution

The web server has two entry points:

ExecuteFlowAction.doAction

ExecutorServlet.ajaxExecuteFlow

The web server dispatches the flow:

ExecutorManager.submitExecutableFlow

JdbcExecutorLoader.uploadExecutableFlow

INSERT INTO execution_flows (project_id, flow_id, version, status, submit_time, submit_user, update_time) values (?,?,?,?,?,?,?)

ExecutorLoader.addActiveExecutableReference

INSERT INTO active_executing_flows (exec_id, update_time) values (?,?)

queuedFlows.enqueue

QueueProcessorThread.run

processQueuedFlows

ExecutorManager.selectExecutorAndDispatchFlow (get from queuedFlows)

selectExecutor

dispatch

JdbcExecutorLoader.assignExecutor

UPDATE execution_flows SET executor_id=? where exec_id=?

ExecutorApiGateway.callWithExecutable (calls the Executor Server)
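The dispatch path above (enqueue, take from the queue, select an executor, record the assignment, call the executor) can be sketched as follows. This is an illustration only: the Executor class and the "fewest running flows" selection rule are assumptions, not Azkaban's actual selection logic.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Executor:
    id: int
    running: list = field(default_factory=list)


def select_executor(executors):
    # Hypothetical rule: pick the executor with the fewest running flows.
    return min(executors, key=lambda e: len(e.running))


def process_queued_flows(queue, executors, assignments):
    """Drain the queue, assigning each queued execution to an executor."""
    while queue:
        exec_id = queue.popleft()
        chosen = select_executor(executors)
        # Stands in for: UPDATE execution_flows SET executor_id=? where exec_id=?
        assignments[exec_id] = chosen.id
        # Stands in for the remote call to the executor server.
        chosen.running.append(exec_id)


queued = deque([101, 102, 103])
executors = [Executor(1), Executor(2)]
assignments = {}
process_queued_flows(queued, executors, assignments)
```

Keeping selection separate from dispatch mirrors the selectExecutor / dispatch split in the call chain, so the selection rule can change without touching the queue handling.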

The executor server runs the flow:

ExecutorServlet.doGet

handleAjaxExecute

FlowRunnerManager.submitFlow

JdbcExecutorLoader.fetchExecutableFlow

SELECT exec_id, enc_type, flow_data FROM execution_flows WHERE exec_id=?

FlowPreparer.setup

FlowRunner.run

setupFlowExecution

updateFlow

UPDATE execution_flows SET status=?,update_time=?,start_time=?,end_time=?,enc_type=?,flow_data=? WHERE exec_id=?

runFlow

progressGraph

runReadyJob

runExecutableNode

JobRunner.run

uploadExecutableNode

INSERT INTO execution_jobs (exec_id, project_id, version, flow_id, job_id, start_time, end_time, status, input_params, attempt) VALUES (?,?,?,?,?,?,?,?,?,?)

prepareJob

runJob

Job.run (ProcessJob, JavaJob)
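The runFlow / progressGraph / runReadyJob steps amount to repeatedly finding jobs whose dependencies have finished and running them. A minimal single-threaded sketch of that idea follows; the status names and the rule that jobs downstream of a failure are cancelled are illustrative assumptions, and the real FlowRunner runs jobs concurrently.

```python
def run_flow(dependencies, run_job):
    """dependencies: {job: [jobs it depends on]}; run_job(job) returns True on success."""
    status = {job: "READY" for job in dependencies}
    progressed = True
    while progressed:
        progressed = False
        for job, deps in dependencies.items():
            if status[job] != "READY":
                continue
            if any(status[d] in ("FAILED", "CANCELLED") for d in deps):
                # A dependency did not succeed, so this job will never become runnable.
                status[job] = "CANCELLED"
                progressed = True
            elif all(status[d] == "SUCCEEDED" for d in deps):
                # Every dependency succeeded: the job is ready, so run it now.
                status[job] = "SUCCEEDED" if run_job(job) else "FAILED"
                progressed = True
    return status


deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
result = run_flow(deps, lambda job: job != "b")  # pretend job "b" fails
```

The loop stops once a full pass changes nothing, which happens when every job has reached a final status.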

The web server polls flow status:

ExecutingManagerUpdaterThread.run

getFlowToExecutorMap

ExecutorApiGateway.callWithExecutionId

updateExecution
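The polling chain above suggests grouping running executions by executor and asking each executor once per cycle. A hedged sketch of one polling pass is below; fetch_status is a hypothetical stand-in for the remote call, and the set of final statuses is an assumption.

```python
from collections import defaultdict


def group_by_executor(running):
    """running: {exec_id: executor_id} -> {executor_id: [exec_id, ...]}"""
    grouped = defaultdict(list)
    for exec_id, executor_id in running.items():
        grouped[executor_id].append(exec_id)
    return dict(grouped)


def poll_once(running, fetch_status, finished=("SUCCEEDED", "FAILED", "KILLED")):
    """Ask each executor about its flows once and drop the ones that have finished."""
    for executor_id, exec_ids in group_by_executor(running).items():
        statuses = fetch_status(executor_id, exec_ids)  # hypothetical remote call
        for exec_id, status in statuses.items():
            if status in finished:
                del running[exec_id]
    return running


running = {1: "A", 2: "A", 3: "B"}
grouped = group_by_executor(running)


def fake_fetch(executor_id, exec_ids):
    # Made-up response: execution 2 has finished, the rest are still running.
    return {i: ("SUCCEEDED" if i == 2 else "RUNNING") for i in exec_ids}


remaining = poll_once(running, fake_fetch)
```

Batching by executor keeps the number of remote calls proportional to the number of executors rather than the number of running flows.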

3 Schedule execution

TriggerManager.start

loadTriggers

SELECT trigger_id, trigger_source, modify_time, enc_type, data FROM triggers

TriggerScannerThread.start

checkAllTriggers

onTriggerTrigger

TriggerAction.doAction

ExecuteFlowAction.doAction
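The scanner thread's job, checking every trigger and firing the ones that are due, can be sketched like this. The Trigger fields are invented for illustration, and a fixed period replaces real cron evaluation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trigger:
    trigger_id: int
    next_check_time: float  # epoch seconds
    period: float  # hypothetical fixed interval standing in for a cron expression
    action: Callable[[], None]


def check_all_triggers(triggers, now):
    """Fire every trigger that is due and advance its next check time."""
    fired = []
    for t in triggers:
        if t.next_check_time <= now:
            t.action()  # plays the role of TriggerAction.doAction
            fired.append(t.trigger_id)
            t.next_check_time += t.period
    return fired


launched = []
triggers = [
    Trigger(1, 100.0, 60.0, lambda: launched.append("flowA")),
    Trigger(2, 500.0, 60.0, lambda: launched.append("flowB")),
]
fired = check_all_triggers(triggers, now=120.0)
```

In a real scanner this function would run in a loop with a sleep between passes; only the single pass is shown here.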

PS: there is a second, completely independent scheduling mechanism, controlled by azkaban.server.schedule.enable_quartz (default false). The following chain registers jobs with Quartz:

ProjectManagerServlet.ajaxHandleUpload

SELECT id, name, active, modified_time, create_time, version, last_modified_by, description, enc_type, settings_blob FROM projects WHERE name=? AND active=true

ProjectManager.loadAllProjectFlows

SELECT project_id, version, flow_id, modified_time, encoding_type, json FROM project_flows WHERE project_id=? AND version=?

FlowTriggerScheduler.scheduleAll

SELECT MAX(flow_version) FROM project_flow_files WHERE project_id=? AND project_version=? AND flow_name=?

SELECT flow_file FROM project_flow_files WHERE project_id=? AND project_version=? AND flow_name=? AND flow_version=?

registerJob

The Quartz job then executes as follows:

FlowTriggerQuartzJob.execute

FlowTriggerService.startTrigger

TriggerInstanceProcessor.processSucceed

TriggerInstanceProcessor.executeFlowAndUpdateExecID

ExecutorManager.submitExecutableFlow

4 Job execution

Job is the core interface for jobs; every concrete job type implements it:

Job

AbstractJob

AbstractProcessJob

ProcessJob (shell jobs)

JavaProcessJob (Java jobs)

JavaJob
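The shape of this hierarchy, an interface with a run method, a shared base class, and a process-based subclass, can be mirrored in a few lines. This is a Python analogy rather than Azkaban's Java code: it collapses AbstractProcessJob into ProcessJob, and the constructor arguments are assumptions.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class Job(ABC):
    """Core interface: every job type provides run()."""

    @abstractmethod
    def run(self) -> None: ...


class AbstractJob(Job):
    """Shared state for all jobs (here only an id)."""

    def __init__(self, job_id):
        self.job_id = job_id


class ProcessJob(AbstractJob):
    """Runs an external command as a child process."""

    def __init__(self, job_id, command):
        super().__init__(job_id)
        self.command = command
        self.output = None

    def run(self):
        # check=True raises CalledProcessError if the command exits non-zero.
        done = subprocess.run(self.command, capture_output=True, text=True, check=True)
        self.output = done.stdout.strip()


# Use the current Python interpreter as a portable stand-in for a shell command.
job = ProcessJob("hello", [sys.executable, "-c", "print('hello from job')"])
job.run()
```

Because callers only depend on Job.run, a runner can execute any job type without knowing whether it launches a process or runs in-process.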

