联童科技基于incubator-dolphinscheduler从0到1构建大数据调度平台之路

联童科技是一家智能化母婴童产业平台，从事母婴童行业以及互联网技术多年，拥有丰富的母婴门店运营和系统开发经验，在会员经营和商品经营方面，能够围绕会员需求，深入场景，更贴近合作伙伴和消费者，提供最优服务产品，公司致力于以技术来驱动母婴童产业的发展，公司也希望借助于大数据为客户提供更多智能数据分析和决策分析，大数据是公司重点发展的一部分，公司从成立初期起就搭建了大数据团队，有了大数据团队后，大数据调度平台的构建自然是最基础也是最重要的环节。

一、为什么选择incubator-dolphinscheduler

1、incubator-dolphinscheduler是一个由国内公司发起的开源项目，中国本土社区成员非常活跃，更加容易去进行社区沟通，同时联童也希望能加入到这个社区中，一起把这个由本土成员为主成立的社区做的更好。

2、incubator-dolphinscheduler 能够支撑非常多的应用场景

以DAG图的方式将Task按照任务的依赖关系关联起来，可实时可视化监控任务的运行状态
支持丰富的任务类型：Shell、MR、Spark、SQL(mysql、postgresql、hive、sparksql),Python,Sub_Process、Procedure，flink，datax，sqoop，http等
支持工作流定时调度、依赖调度、手动调度、手动暂停/停止/恢复，同时支持失败重试/告警、从指定节点恢复失败、Kill任务等操作
支持工作流优先级、任务优先级及任务的故障转移及任务超时告警/失败
支持工作流全局参数及节点自定义参数设置
支持资源文件的在线上传/下载，管理等，支持在线文件创建、编辑
支持任务日志在线查看及滚动、在线下载日志等
实现集群HA，通过Zookeeper实现Master集群和Worker集群去中心化
支持对Master/Worker cpu load，memory，cpu在线查看
支持工作流运行历史树形/甘特图展示、支持任务状态统计、流程状态统计
支持补数
支持多租户
支持国际化

其中DAG图借鉴自spark ，在dolphinscheduler 一个工作流可以对应多个工作任务，每一个工作任务对应一个DAG中的节点。

3、incubator-dolphinscheduler在保证了高并发和高可用的设计时，架构思路也相对简单，技术架构中没有引入非常多的复杂技术组件，降低了学习和维护的成本。

incubator-dolphinscheduler在设计时，除了zookeeper外，没有引入太多复杂的技术组件。整个架构以zookeeper 作为集群管理，采用去中心化思想进行设计。

二、incubator-dolphinscheduler功能的不足

1、无法支持串行调度策略

incubator-dolphinscheduler 在一开始设计时，只支持并行调度，不支持串行调度，而在联童中，大部分场景都是需要串行运行的，也就是每一个工作流任务都只能有一个实例在运行，同一个工作流任务中必须要等前一个实例执行结束，下一个实例才能开始执行，这种场景大多出现在准实时任务中。

2、任务依赖不够强大，只能支持被动等待依赖执行成功，无法主动触发下游工作流实例运行

如下图所示，只能支持在创建任务时，被动去等待依赖执行成功，无法在当前任务执行成功后，主动去触发别的工作流任务执行。

3、部分模块中用户体验不足，并且在数据量大时，部分模块数据查询性能较慢

4、缺少比较完备的监控体系

在 incubator-dolphinscheduler 只提供了一些简单的监控，当有多大几千个任务在运行时，很难做到完备监控，更是缺少对每一个任务运行的性能分析。

三、我们对于incubator-dolphinscheduler的功能升级开发

1、增加串行调度的支持

如下图所示，我们在原有并行执行的基础上，增加了串行执行方式。

在串行执行时，我们还增加了串行执行的队列功能，每一任务都可以指定队列的长度大小。

2、增加主动触发下游工作流实例运行

如下图所示，我们在原有并行执行的基础上，增加主动触发下游一个或者多个工作流实例运行。

运行后效果如下：

3、一些较大的Bug修复

联童在使用 incubator-dolphinscheduler时，同样也踩过不少的坑，这里我们举其中一个例子，比如在内部使用时，同事反馈最多的问题就是调度任务的日志刷新不及时，有时候很久才能刷新出日志。后来经过源码分析，发现是源码中存在了一些不太健壮的处理导致了这个问题。

incubator-dolphinscheduler 中AbstractCommandExecutor.java 部分源码

/*

 * Licensed to the Apache Software Foundation (ASF) under one or more

 * contributor license agreements.  See the NOTICE file distributed with

 * this work for additional information regarding copyright ownership.

 * The ASF licenses this file to You under the Apache License, Version 2.0

 * (the "License"); you may not use this file except in compliance with

 * the License.  You may obtain a copy of the License at

 *

 *    http://www.apache.org/licenses/LICENSE-2.0

 *

 * Unless required by applicable law or agreed to in writing, software

 * distributed under the License is distributed on an "AS IS" BASIS,

 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

 * See the License for the specific language governing permissions and

 * limitations under the License.

 */

package org.apache.dolphinscheduler.server.worker.task;

import static org.apache.dolphinscheduler.common.Constants.EXIT_CODE_FAILURE;

import static org.apache.dolphinscheduler.common.Constants.EXIT_CODE_KILL;

import static org.apache.dolphinscheduler.common.Constants.EXIT_CODE_SUCCESS;

import org.apache.dolphinscheduler.common.Constants;

import org.apache.dolphinscheduler.common.enums.ExecutionStatus;

import org.apache.dolphinscheduler.common.thread.Stopper;

import org.apache.dolphinscheduler.common.thread.ThreadUtils;

import org.apache.dolphinscheduler.common.utils.HadoopUtils;

import org.apache.dolphinscheduler.common.utils.LoggerUtils;

import org.apache.dolphinscheduler.common.utils.OSUtils;

import org.apache.dolphinscheduler.common.utils.StringUtils;

import org.apache.dolphinscheduler.server.entity.TaskExecutionContext;

import org.apache.dolphinscheduler.server.utils.ProcessUtils;

import org.apache.dolphinscheduler.server.worker.cache.TaskExecutionContextCacheManager;

import org.apache.dolphinscheduler.server.worker.cache.impl.TaskExecutionContextCacheManagerImpl;

import org.apache.dolphinscheduler.service.bean.SpringApplicationContext;

import java.io.BufferedReader;

import java.io.File;

import java.io.FileInputStream;

import java.io.IOException;

import java.io.InputStreamReader;

import java.lang.reflect.Field;

import java.nio.charset.StandardCharsets;

import java.util.ArrayList;

import java.util.Collections;

import java.util.LinkedList;

import java.util.List;

import java.util.concurrent.ExecutorService;

import java.util.concurrent.TimeUnit;

import java.util.function.Consumer;

import java.util.regex.Matcher;

import java.util.regex.Pattern;

import org.slf4j.Logger;

/**

 * abstract command executor

 */

public abstract class AbstractCommandExecutor {

    /**

     * rules for extracting application ID

     */

    protected static final Pattern APPLICATION_REGEX = Pattern.compile(Constants.APPLICATION_REGEX);

    protected StringBuilder varPool = new StringBuilder();

    /**

     * process

     */

    private Process process;

    /**

     * log handler

     */

    protected Consumer<List<String>> logHandler;

    /**

     * logger

     */

    protected Logger logger;

    /**

     * log list

     */

    protected final List<String> logBuffer;

    /**

     * taskExecutionContext

     */

    protected TaskExecutionContext taskExecutionContext;

    /**

     * taskExecutionContextCacheManager

     */

    private TaskExecutionContextCacheManager taskExecutionContextCacheManager;

    public AbstractCommandExecutor(Consumer<List<String>> logHandler,

                                   TaskExecutionContext taskExecutionContext,

                                   Logger logger) {

        this.logHandler = logHandler;

        this.taskExecutionContext = taskExecutionContext;

        this.logger = logger;

        this.logBuffer = Collections.synchronizedList(new ArrayList<>());

        this.taskExecutionContextCacheManager = SpringApplicationContext.getBean(TaskExecutionContextCacheManagerImpl.class);

    }

    /**

     * build process

     *

     * @param commandFile command file

     * @throws IOException IO Exception

     */

    private void buildProcess(String commandFile) throws IOException {

        // setting up user to run commands

        List<String> command = new LinkedList<>();

        //init process builder

        ProcessBuilder processBuilder = new ProcessBuilder();

        // setting up a working directory

        processBuilder.directory(new File(taskExecutionContext.getExecutePath()));

        // merge error information to standard output stream

        processBuilder.redirectErrorStream(true);

        // setting up user to run commands

        command.add("sudo");

        command.add("-u");

        command.add(taskExecutionContext.getTenantCode());

        command.add(commandInterpreter());

        command.addAll(commandOptions());

        command.add(commandFile);

        // setting commands

        processBuilder.command(command);

        process = processBuilder.start();

        // print command

        printCommand(command);

    }

    ..........

    /**

     * get the standard output of the process

     *

     * @param process process

     */

    private void parseProcessOutput(Process process) {

        String threadLoggerInfoName = String.format(LoggerUtils.TASK_LOGGER_THREAD_NAME + "-%s", taskExecutionContext.getTaskAppId());

        ExecutorService parseProcessOutputExecutorService = ThreadUtils.newDaemonSingleThreadExecutor(threadLoggerInfoName);

        parseProcessOutputExecutorService.submit(new Runnable() {

            @Override

            public void run() {

                BufferedReader inReader = null;

                try {

                    inReader = new BufferedReader(new InputStreamReader(process.getInputStream()));

                    String line;

                    long lastFlushTime = System.currentTimeMillis();

                    while ((line = inReader.readLine()) != null) {

                        if (line.startsWith("${setValue(")) {

                            varPool.append(line.substring("${setValue(".length(), line.length() - 2));

                            varPool.append("$VarPool$");

                        } else {

                            logBuffer.add(line);

                            lastFlushTime = flush(lastFlushTime);

                        }

                    }

                } catch (Exception e) {

                    logger.error(e.getMessage(), e);

                } finally {

                    clear();

                    close(inReader);

                }

            }

        });

        parseProcessOutputExecutorService.shutdown();

    }

................

    /**

     * when log buffer siz or flush time reach condition , then flush

     *

     * @param lastFlushTime last flush time

     * @return last flush time

     */

    private long flush(long lastFlushTime) {

        long now = System.currentTimeMillis();

        /**

         * when log buffer siz or flush time reach condition , then flush

         */

        if (logBuffer.size() >= Constants.DEFAULT_LOG_ROWS_NUM || now - lastFlushTime > Constants.DEFAULT_LOG_FLUSH_INTERVAL) {

            lastFlushTime = now;

            /** log handle */

            logHandler.accept(logBuffer);

            logBuffer.clear();

        }

        return lastFlushTime;

    }

    /**

     * close buffer reader

     *

     * @param inReader in reader

     */

    private void close(BufferedReader inReader) {

        if (inReader != null) {

            try {

                inReader.close();

            } catch (IOException e) {

                logger.error(e.getMessage(), e);

            }

        }

    }

    protected List<String> commandOptions() {

        return Collections.emptyList();

    }

    protected abstract String buildCommandFilePath();

    protected abstract String commandInterpreter();

    protected abstract void createCommandFileIfNotExists(String execCommand, String commandFile) throws IOException;

}

在这段源码中，parseProcessOutput(Process process) 方法是负责任务日志的获取以及Flush。但是由于采用了BufferedReader 中的readLine() 方法来读取任务进程的process.getInputStream()日志，由于readLine() 是一个阻塞方法，

flush(long lastFlushTime) 方法在处理时有一个判断条件if (logBuffer.size() >= Constants.DEFAULT_LOG_ROWS_NUM || now - lastFlushTime > Constants.DEFAULT_LOG_FLUSH_INTERVAL)，只有当日志条数达到64条或者间隔1s时才会

flush。按理说，代码其实是要实现至少每隔1s会flash 一次日志，但是由于readLine() 是一个阻塞方法，所以并不会一直在执行，而是readLine()必须是读取到新数据后，才会执行flush方法。那么在出现1s内产生的任务日志不满足64条，而任务又很久没有新日志出现时，就会触发这个bug。例如执行如下一个shell 脚本任务，由于每个执行步骤产生的日志少，而且每个步骤执行的时间又很久，时间间隔很大，就会出现很久都不会刷新上一次产生的日志。

#!/bin/bash

echo "hello world"

exec 10m

sleep 100000s

echo "hello world2"

exec 10m

sleep 100000s

echo "hello world3"

exec 10m

sleep 100000s

之后我们对这段源码进行了重写，采用了两个线程进行处理，一个线程负责readline()，一个线程负责flush.做到在readline()方法的线程阻塞时，不影响flush线程的处理。

public abstract class AbstractCommandExecutor {

    /**

     * rules for extracting application ID

     */

    protected static final Pattern APPLICATION_REGEX = Pattern.compile(Constants.APPLICATION_REGEX);

    /**

     * process

     */

    private Process process;

    /**

     * log handler

     */

    protected Consumer<List<String>> logHandler;

    /**

     * logger

     */

    protected Logger logger;

    /**

     * log list

     */

    protected final List<String> logBuffer;

    protected boolean logOutputIsScuccess = false;

    /**

     * taskExecutionContext

     */

    protected TaskExecutionContext taskExecutionContext;

    /**

     * taskExecutionContextCacheManager

     */

    private TaskExecutionContextCacheManager taskExecutionContextCacheManager;

.........

 /**

     * get the standard output of the process

     *

     * @param process process

     */

    private void parseProcessOutput(Process process) {

        String threadLoggerInfoName = String.format(LoggerUtils.TASK_LOGGER_THREAD_NAME + "-%s", taskExecutionContext.getTaskAppId());

        ExecutorService getOutputLogService = ThreadUtils.newDaemonSingleThreadExecutor(threadLoggerInfoName + "-" + "getOutputLogService");

        getOutputLogService.submit(() -> {

            BufferedReader inReader = null;

            try {

                inReader = new BufferedReader(new InputStreamReader(process.getInputStream()));

                String line;while ((line = inReader.readLine()) != null) {

                    logBuffer.add(line);

                }

            } catch (Exception e) {

                logger.error(e.getMessage(), e);

            } finally {

                logOutputIsScuccess = true;

                close(inReader);

            }

        });

        getOutputLogService.shutdown();

        ExecutorService parseProcessOutputExecutorService = ThreadUtils.newDaemonSingleThreadExecutor(threadLoggerInfoName);

        parseProcessOutputExecutorService.submit(() -> {

            try {

                long lastFlushTime = System.currentTimeMillis();

                while (logBuffer.size() > 0 || !logOutputIsScuccess) {

                    if (logBuffer.size() > 0) {

                        lastFlushTime = flush(lastFlushTime);

                    } else {

                        Thread.sleep(Constants.DEFAULT_LOG_FLUSH_INTERVAL);

                    }

                }

            } catch (Exception e) {

                logger.error(e.getMessage(), e);

            } finally {

                clear();

            }

        });

        parseProcessOutputExecutorService.shutdown();

    }

.......

    /**

     * when log buffer siz or flush time reach condition , then flush

     *

     * @param lastFlushTime last flush time

     * @return last flush time

     */

    private long flush(long lastFlushTime) throws InterruptedException {

        long now = System.currentTimeMillis();

        /**

         * when log buffer siz or flush time reach condition , then flush

         */

        if (logBuffer.size() >= Constants.DEFAULT_LOG_ROWS_NUM || now - lastFlushTime > Constants.DEFAULT_LOG_FLUSH_INTERVAL) {

            lastFlushTime = now;

            /** log handle */

            logHandler.accept(logBuffer);

            logBuffer.clear();

        }

        return lastFlushTime;

    }

.......

}

4、将调度系统的监控接入到prometheus和grafana中

incubator-dolphinscheduler 只提供了一些如下的简单实时监控，尤其缺少对任务的监控。

联童在此基础上，引入了prometheus和grafana。

使用prometheus和grafana 不但可以监控到调度系统任务的总体运行，也可以监控到单个任务的运行耗时曲线等。

5、对incubator-dolphinscheduler 的性能优化

待稍后晚点补充

四、联童对于开源社区的拥抱和回馈

联童虽然是一家新兴起的母婴童公司，但是在成立的初始，就秉承着以技术来驱动母婴童产业的发展，公司拥有一个非常好的技术团队，也一直在拥抱开源社区，目前已经引入了incubator-dolphinscheduler、prometheus、grafana 、hadoop、spark、flink、hive、presto......等很多开源项目来支撑公司的技术驱动。在未来，联童也一定回不断的去回馈开源社区，去提供更多的Pull requests，贡献自己的一份力量。

联童科技基于incubator-dolphinscheduler从0到1构建大数据调度平台之路的更多相关文章

从 Airflow 到 Apache DolphinScheduler，有赞大数据开发平台的调度系统演进
点击上方蓝字关注我们作者 | 宋哲琦 ✎ 编者按在不久前的 Apache DolphinScheduler Meetup 2021 上,有赞大数据开发平台负责人宋哲琦带来了平台调度系统 ...
基于MaxCompute的媒体大数据开放平台建设
摘要:随着自媒体的发展,传统媒体面临着巨大的压力和挑战,新华智云运用大数据和人工智能技术,致力于为媒体行业赋能.通过媒体大数据开放平台,将媒体行业全网数据汇总起来,借助平台数据处理能力和算法能力,将有 ...
基于 HTML5 WebGL 与 GIS 的智慧机场大数据可视化分析
前言:大数据,人工智能,工业物联网,5G 已经或者正在潜移默化地改变着我们的生活.在信息技术快速发展的时代,谁能抓住数据的核心,利用有效的方法对数据做数据挖掘和数据分析,从数据中发现趋势,谁就能做到精 ...
基于 HTML5 WebGL 与 GIS 的智慧机场大数据可视化分析【转载】
前言:大数据,人工智能,工业物联网,5G 已经或者正在潜移默化地改变着我们的生活.在信息技术快速发展的时代,谁能抓住数据的核心,利用有效的方法对数据做数据挖掘和数据分析,从数据中发现趋势,谁就能做到精 ...
三：基于Storm的实时处理大数据的平台架构设计
一:元数据管理器==>元数据管理器是系统平台的“大脑”,在任务调度中有着重要的作用[1]什么是元数据?--->中介数据,用于描述数据属性的数据.--->具体类型:描述数据结构,数据的 ...
vue中，基于echarts 地图实现一个人才回流的大数据展示效果
0.引入echarts组件,和中国地图js import eCharts from 'echarts' import 'echarts/map/js/china.js'// 引入中国地图 1. 设置地 ...
实践：由0到1-无线大数据UX团队的成长
背景大数据产品的在项目成立之初,采用的是模仿原有网优工具的方式做UI设计,由BA主导画草图.手绘线框图.excel制作,更有直接打开参考产品做原型的方式,没有统一的设计和规范可言.随着团队逐渐增多. ...
大数据平台迁移实践 | Apache DolphinScheduler 在当贝大数据环境中的应用
大家下午好,我是来自当贝网络科技大数据平台的基础开发工程师王昱翔,感谢社区的邀请来参与这次分享,关于 Apache DolphinScheduler 在当贝网络科技大数据环境中的应用. 本次演讲主要 ...
4 亿用户，7W+ 作业调度难题，Bigo 基于 Apache DolphinScheduler 巧化解
点击上方蓝字关注我们 ✎ 编者按成立于 2014 年的 Bigo,成立以来就聚焦于在全球范围内提供音视频服务.面对 4 亿多用户,Bigo 大数据团队打造的计算平台基于 Apache Dolp ...

随机推荐

Prometheus 监控之 Blackbox_exporter黑盒监测
Prometheus 监控之 Blackbox_exporter黑盒监测 1.blackbox_exporter概述 1.1 Blackbox_exporter 应用场景 2.blackbox_exp ...
Javascript关键字，条件语句，函数及函数相关知识
关键字条件语句作用域回调关键字根据规定,关键字是保留的,不能用作变量名或函数名. 下面是一些ECMAScript关键字的完整列表. break ,case,catch,continue,de ...
Spark日志，及设置日志输出级别
Spark日志,及设置日志输出级别 1.全局应用设置 2.局部应用设置日志输出级别 3.Spark log4j.properties配置详解与实例(摘录于铭霏的记事本) 文章内容来源: 作者:大葱拌豆 ...
go-zero解读与最佳实践（上）
本文有『Go开源说』第三期 go-zero 直播内容修改整理而成,视频内容较长,拆分成上下篇,本文内容有所删减和重构. 大家好,很高兴来到"GO开源说" 跟大家分享开源项目背后的一 ...
Python基础随笔①（MOOC）
@ 目录前言概述主体 1.基本语法元素 ①实例:温度转换要求分析代码部分运行结果 ②作业:Hello World的条件输出要求分析代码运行结果 ③作业:数值运算要求分析代码 ...
Qt update刷新之源码分析(一)
在做GUI开发时,要让控件刷新,会调用update函数:那么在调用了update函数后,Qt究竟基于什么原理.执行了什么代码使得屏幕上有变化?本文就带大家来探究探究其内部源码. Qt手册中关于QWid ...
手动合并hadoop namenode editlog
一. 基本概念 1.NN恢复实际上是由fsimage开始(这个相当于数据的base),如果有多个fsimage,会自动选择最大的fsimage,然后按照editlog序列日志开始执行日志 2.seen ...
D - D (最短路解决源点到多点，多点到源点的和（有向图）)
问从1号点到各个点的距离+各个点到1号点之间的距离和的最小值 In the age of television, not many people attend theater performances ...
牛客练习赛64 D【容斥+背包】
牛客练习赛64 D.宝石装箱 Description \(n\)颗宝石装进\(n\)个箱子使得每个箱子中都有一颗宝石.第\(i\)颗宝石不能装入第\(a_i\)个箱子.求合法的装箱方案对\(99824 ...
UVALive 7276 Wooden Signs
详细题目见:http://7xjob4.com1.z0.glb.clouddn.com/0f10204481da21e62f8c145939e5828e 思路:记dp[i][j]表示第i个木板尾部在j ...

联童科技基于incubator-dolphinscheduler从0到1构建大数据调度平台之路

联童科技基于incubator-dolphinscheduler从0到1构建大数据调度平台之路的更多相关文章

随机推荐

热门专题