yarn application -kill application_id: a script for killing timed-out YARN jobs

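Requirement: kill jobs on YARN that have been running too long, with a different timeout for each queue and a whitelist of job names that are never killed.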

The script is written in Python and can be scheduled with crontab.
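As a quick orientation before the full script, the rule in the requirement can be sketched as a small pure function. The names `QUEUE_TIMEOUT_HOURS`, `WHITELIST`, `elapsed_hours` and `should_kill` are hypothetical and do not appear in the script; the queue names, limits and whitelist entry mirror the script's constants.

```python
# Hypothetical names for illustration; the values mirror the script's constants.
QUEUE_TIMEOUT_HOURS = {'hue': 1, 'etl-test': 1, 'default': 3}
WHITELIST = {'org.apache.spark.sql.hive.thriftserver.HiveThriftServer2'}


def elapsed_hours(start_ms, now_s):
    """Hours between a start time in epoch milliseconds and 'now' in epoch seconds."""
    return round((now_s - start_ms / 1000.0) / 3600.0, 2)


def should_kill(app_name, queue, hours):
    """True if the job is not whitelisted and has exceeded its queue's limit."""
    if app_name in WHITELIST:
        return False
    limit = QUEUE_TIMEOUT_HOURS.get(queue)
    # Queues without a configured limit are never touched.
    return limit is not None and hours > limit
```

Keeping the limits in one dictionary means a new queue only needs a new entry rather than another branch in the condition.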

#!/usr/bin/python
# -*- coding: utf-8 -*-

import time

try:
    import commands  # Python 2
except ImportError:
    import subprocess as commands  # Python 3: subprocess also provides getstatusoutput

run_app_arr = []
timeout_app_arr = []

# Timeout limits, in hours
ONE_HOUR = 1
THREE_HOUR = 3

TEST_QUEUE_NAME = ['hue', 'etl-test']  # jobs here are killed after ONE_HOUR
ONLINE_QUEUE_NAME = ['default']        # jobs here are killed after THREE_HOUR

# Application names that must never be killed
KILL_WHITE_LIST = ['org.apache.spark.sql.hive.thriftserver.HiveThriftServer2']

DINGDING_URL = 'xxx'

# '\\n' keeps the newline escaped inside the JSON string sent to DingTalk
ding_cmd = """ curl %s -H 'Content-Type: application/json' -d '{"msgtype": "text", "text": {"content": "== YARN OVERTIME JOB KILL ALERT ==\\n\\n Current time: %s \\n kill_app_id: %s \\n kill_app_name: %s \\n kill_app_queue: %s "}}' """

f = None
try:
    f = open('/home/hadoop/autokillhadoopjob/check_timeout_job.log', 'a')

    # Collect the id, name and queue of every running job into run_app_arr
    command = '. /etc/profile && yarn application -list | grep "http://" | grep "RUNNING" | cut -f1,2,5'
    status, output = commands.getstatusoutput(command)
    f.write('#' * 50 + '\n')
    f.write('=> start_time: %s \n' % time.strftime('%Y-%m-%d %H:%M:%S'))
    if status == 0:
        for line in output.split('\n'):
            if line.startswith('application_'):
                app_line = line.split('\t')
                running_app_id = app_line[0].strip()
                running_app_name = app_line[1].strip()
                app_queue = app_line[2].strip()
                # Keep only the jobs that run in one of the monitored queues
                if app_queue in TEST_QUEUE_NAME or app_queue in ONLINE_QUEUE_NAME:
                    run_app_arr.append((running_app_id, running_app_name, app_queue))
    else:
        f.write('yarn -list failed. status: %s.\n' % status)

    # Check every running job in the monitored queues; timed-out jobs go into timeout_app_arr
    for run_app in run_app_arr:
        running_app_id = run_app[0]
        running_app_name = run_app[1]
        running_app_queue = run_app[2]
        command = ". /etc/profile && yarn application -status " + running_app_id + " | grep 'Start-Time' | awk -F ':' '{print $2}'"
        status, output = commands.getstatusoutput(command)
        if status == 0:
            for line in output.split('\n'):
                start_timestamp = line.strip()
                if start_timestamp.isdigit():
                    # Start-Time is in milliseconds; compute the elapsed time in hours
                    elapsed_time = time.time() - int(start_timestamp) / 1000.0
                    cost_time = round(elapsed_time / 60 / 60, 2)
                    f.write('=> cost_time: %sh \n' % cost_time)
                    # Skip whitelisted jobs, then apply the limit of the job's queue
                    if running_app_name not in KILL_WHITE_LIST:
                        if ((running_app_queue in TEST_QUEUE_NAME and cost_time > ONE_HOUR)
                                or (running_app_queue in ONLINE_QUEUE_NAME and cost_time > THREE_HOUR)):
                            f.write('=> timeout app => %s # %s # %s\n' % (running_app_id, running_app_name, running_app_queue))
                            timeout_app_arr.append((running_app_id, running_app_name, running_app_queue))
        else:
            f.write('yarn -status failed. status: %s.\n' % status)

    if len(timeout_app_arr) == 0:
        f.write('=> no timeout job.\n')

    # Kill the timed-out jobs and send a DingTalk alert for each one
    for kill_app in timeout_app_arr:
        kill_app_id = kill_app[0]
        kill_app_name = kill_app[1]
        kill_app_queue = kill_app[2]
        command = '. /etc/profile && yarn application -kill ' + kill_app_id
        status, output = commands.getstatusoutput(command)
        if status == 0:
            f.write('=> kill app successfully: %s # %s # %s.\n' % (kill_app_id, kill_app_name, kill_app_queue))
            current_time = time.strftime('%Y-%m-%d %H:%M:%S')
            cmd = ding_cmd % (DINGDING_URL, current_time, kill_app_id, kill_app_name, kill_app_queue)
            commands.getstatusoutput(cmd)
        else:
            f.write('=> kill app failed: %s # %s # %s.\n' % (kill_app_id, kill_app_name, kill_app_queue))

    f.write('=> stop_time: %s \n' % time.strftime('%Y-%m-%d %H:%M:%S'))
except Exception as e:
    # f is still None if opening the log file itself failed
    if f:
        f.write('=> Exception: %s \n' % e)
finally:
    if f:
        f.close()
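The parsing and alert-building steps of the script can also be pulled out into small functions that can be exercised without a cluster. The following is only a sketch under that assumption: `parse_running_apps` and `build_alert_payload` are hypothetical helper names, and the input is assumed to be the tab-separated id, name and queue columns that the script obtains with `cut -f1,2,5`. Building the alert body with `json.dumps` leaves the escaping of newlines and quotes to the standard library.

```python
import json


def parse_running_apps(output, queues):
    """Extract (app_id, app_name, queue) tuples from tab-separated list output.

    Only lines that start with 'application_' and belong to one of the
    given queues are kept; malformed lines are skipped.
    """
    apps = []
    for line in output.split('\n'):
        if not line.startswith('application_'):
            continue
        fields = line.split('\t')
        if len(fields) < 3:
            continue
        app_id, app_name, queue = [field.strip() for field in fields[:3]]
        if queue in queues:
            apps.append((app_id, app_name, queue))
    return apps


def build_alert_payload(current_time, app_id, app_name, queue):
    """Return the DingTalk text message as a JSON string."""
    content = ('== YARN OVERTIME JOB KILL ALERT ==\n\n'
               ' Current time: %s \n kill_app_id: %s \n'
               ' kill_app_name: %s \n kill_app_queue: %s '
               % (current_time, app_id, app_name, queue))
    return json.dumps({'msgtype': 'text', 'text': {'content': content}})
```

A call such as `parse_running_apps(output, TEST_QUEUE_NAME + ONLINE_QUEUE_NAME)` would then stand in for the first loop of the script.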
