Upgraded a service from Spark 2.3.0 / Hadoop 2.8 to Spark 2.4.0 / Hadoop 3.0

A day later, the Spark Streaming Kafka consumers began to fall behind and data backed up.

The services are not deployed on YARN in the traditional way but on Kubernetes (1.13.2): https://spark.apache.org/docs/latest/running-on-kubernetes.html

Since we had recently done some heavy maintenance on the cluster, I assumed the backlog was caused by an I/O bottleneck and made several I/O optimizations, to no effect.

I kept watching the service logs and the server load.

Then something odd caught my eye: CPU usage of the Spark services stayed between 100% and 200%, sitting at 100% most of the time.

The machines have 32 cores, so 100% CPU usage effectively means only a single core is in use; something was clearly wrong.

So the backlog was probably not an I/O bottleneck but a compute bottleneck (the service runs compute-intensive work such as word segmentation, classification, and clustering).

The program tunes its own parallelism based on the number of CPU cores.

How the core count is obtained inside the service:

def GetCpuCoreNum(): Int = {
  // Delegates to the JVM; this is exactly what JDK container awareness changes.
  Runtime.getRuntime.availableProcessors
}
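Downstream, that number typically feeds thread-pool sizing, which is why a wrong value serializes the whole pipeline. A minimal Java sketch of the pattern (illustrative only, not the service's actual code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolSizing {
    public static void main(String[] args) throws InterruptedException {
        // Same call the Scala helper above wraps.
        int cores = Runtime.getRuntime().availableProcessors();

        // Size the worker pool from the reported core count:
        // if the JVM reports 1, every task below runs sequentially.
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        for (int i = 0; i < 4; i++) {
            final int task = i;
            pool.submit(() -> System.out.println("task " + task
                    + " on " + Thread.currentThread().getName()));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

With `availableProcessors` returning 1, such a pool has a single worker thread, which matches the 100% (single-core) CPU usage observed above.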

Printing the core count:

spark 2.4.0

root@consume-topic-qk-nwd-7d84585f5-kh7z5:/usr/spark-2.4.# java -version
java version "1.8.0_202"
Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)

[cuidapeng@wx-k8s- ~]$ kb logs consume-topic-qk-nwd-7d84585f5-kh7z5 |more
-- :: WARN NativeCodeLoader: - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Cpu core Num 1
-- :: INFO SparkContext: - Running Spark version 2.4.0
-- :: INFO SparkContext: - Submitted application: topic-quick
-- :: INFO SecurityManager: - Changing view acls to: root
-- :: INFO SecurityManager: - Changing modify acls to: root
-- :: INFO SecurityManager: - Changing view acls groups to:
-- :: INFO SecurityManager: - Changing modify acls groups to:
-- :: INFO SecurityManager: - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
-- :: INFO Utils: - Successfully started service 'sparkDriver' on port .
-- :: INFO SparkEnv: - Registering MapOutputTracker
-- :: INFO SparkEnv: - Registering BlockManagerMaster
-- :: INFO BlockManagerMasterEndpoint: - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
-- :: INFO BlockManagerMasterEndpoint: - BlockManagerMasterEndpoint up
-- :: INFO DiskBlockManager: - Created local directory at /tmp/blockmgr-dc0c496e-e5ab-4d07-a518-440f2336f65c
-- :: INFO MemoryStore: - MemoryStore started with capacity 4.5 GB
-- :: INFO SparkEnv: - Registering OutputCommitCoordinator
-- :: INFO log: - Logging initialized @2888ms

Cpu core Num 1: the service had been reduced to single-core computation, and there is the cause of the backlog.

The guess was right. I rolled the version back to 2.3.0.

Back on spark 2.3.0:

root@consume-topic-dt-nwd-67b7fd6dd5-jztpb:/usr/spark-2.3.# java -version
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

-- :: WARN NativeCodeLoader: - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Cpu core Num 32
-- :: INFO SparkContext: - Running Spark version 2.3.0
-- :: INFO SparkContext: - Submitted application: topic-dt
-- :: INFO SecurityManager: - Changing view acls to: root
-- :: INFO SecurityManager: - Changing modify acls to: root
-- :: INFO SecurityManager: - Changing view acls groups to:
-- :: INFO SecurityManager: - Changing modify acls groups to:
-- :: INFO SecurityManager: - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
-- :: INFO Utils: - Successfully started service 'sparkDriver' on port .
-- :: INFO SparkEnv: - Registering MapOutputTracker
-- :: INFO SparkEnv: - Registering BlockManagerMaster
-- :: INFO BlockManagerMasterEndpoint: - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
-- :: INFO BlockManagerMasterEndpoint: - BlockManagerMasterEndpoint up
-- :: INFO DiskBlockManager: - Created local directory at /tmp/blockmgr-5dbf1194-477a---3da01b5a3f01
-- :: INFO MemoryStore: - MemoryStore started with capacity 6.2 GB
-- :: INFO SparkEnv: - Registering OutputCommitCoordinator
-- :: INFO log: - Logging initialized @2867ms

Cpu core Num 32: 32 is the physical machine's core count.

So the blockage was not caused by I/O at all; the number of cores visible to the runtime had shrunk. After the upgrade to Spark 2.4.0 the service went from 32-core parallel execution to single-core execution.

Strictly speaking this is not a Spark problem but a JDK one.

A long time ago we wanted to cap core resources inside Docker, which required the JDK to report the Docker-limited core count rather than the host's. As I recalled, that capability had been requested of the JDK and was expected in JDK 9 or 10 but could not be done on JDK 8, so we rejected the Docker core-limiting plan and capped compute resources by spreading services across the scheduler instead.

Having written off JDK 8 container awareness back then, I stepped right into this pit now.

Docker's CPU-related options:

Usage:  docker run [OPTIONS] IMAGE [COMMAND] [ARG...]

Run a command in a new container

Options:
      --cpu-period int       Limit CPU CFS (Completely Fair Scheduler) period
      --cpu-quota int        Limit CPU CFS (Completely Fair Scheduler) quota
      --cpu-rt-period int    Limit CPU real-time period in microseconds
      --cpu-rt-runtime int   Limit CPU real-time runtime in microseconds
  -c, --cpu-shares int       CPU shares (relative weight)
      --cpus decimal         Number of CPUs
      --cpuset-cpus string   CPUs in which to allow execution (0-3, 0,1)
      --cpuset-mems string   MEMs in which to allow execution (0-3, 0,1)
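The CFS settings above are what a container-aware JDK reads back: it divides `cpu.cfs_quota_us` by `cpu.cfs_period_us` to decide how many processors to report. A sketch of that arithmetic with assumed values (the file names and numbers are illustrative, not taken from this cluster):

```java
public class CfsQuotaMath {
    public static void main(String[] args) {
        // Assumed cgroup v1 values, e.g. read from
        // /sys/fs/cgroup/cpu/cpu.cfs_quota_us and cpu.cfs_period_us;
        // `docker run --cpus=12` would produce this pair.
        long quotaUs = 1_200_000L;
        long periodUs = 100_000L;

        // quota == -1 means "unlimited": fall back to the host's core count.
        long reported = quotaUs < 0
                ? Runtime.getRuntime().availableProcessors()
                : quotaUs / periodUs;
        System.out.println("reported cores = " + reported); // reported cores = 12
    }
}
```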

One more point: the service is scheduled by Kubernetes, which layers its own resource management on top of Docker.

Kubernetes has two ways to control CPU:
one based on whole cores (the CPU Manager): https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
one based on shares/percentages: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/

Assigning CPU resources manually:

resources:
  requests:
    cpu: 12        # 12 cores, matching the "Cpu core Num 12" printed after redeploy
    memory: "24Gi"
  limits:
    cpu: 12
    memory: "24Gi"

Redeploy the service:

-- :: WARN NativeCodeLoader: - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Cpu core Num 12
-- :: INFO SparkContext: - Running Spark version 2.4.0
-- :: INFO SparkContext: - Submitted application: topic-dt
-- :: INFO SecurityManager: - Changing view acls to: root
-- :: INFO SecurityManager: - Changing modify acls to: root
-- :: INFO SecurityManager: - Changing view acls groups to:
-- :: INFO SecurityManager: - Changing modify acls groups to:
-- :: INFO SecurityManager: - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
-- :: INFO Utils: - Successfully started service 'sparkDriver' on port .
-- :: INFO SparkEnv: - Registering MapOutputTracker
-- :: INFO SparkEnv: - Registering BlockManagerMaster
-- :: INFO BlockManagerMasterEndpoint: - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
-- :: INFO BlockManagerMasterEndpoint: - BlockManagerMasterEndpoint up
-- :: INFO DiskBlockManager: - Created local directory at /tmp/blockmgr-764f35a8-ea7f---22cbbe2d9a39
-- :: INFO MemoryStore: - MemoryStore started with capacity 6.2 GB
-- :: INFO SparkEnv: - Registering OutputCommitCoordinator
-- :: INFO log: - Logging initialized @2855ms

Cpu core Num 12: the limit took effect.

So there is a compatibility issue around core counts between Kubernetes (Docker) and Spark (the JDK):

jdk 1.8.0_131 inside Docker reports the host machine's core count.

jdk 1.8.0_202 inside Docker reports the container's limited core count, and when Kubernetes specifies no resources the default limit is 1.
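A quick way to check which behavior a given image exhibits is a one-line probe; on container-aware builds (8u191 and later, to my knowledge) the reported value can also be pinned with `-XX:ActiveProcessorCount=<n>`, or container detection disabled with `-XX:-UseContainerSupport`:

```java
public class CoreProbe {
    public static void main(String[] args) {
        // On 1.8.0_131 in Docker: the host's cores (32 here).
        // On 1.8.0_202 under a cgroup CPU limit: the limited count (1 by default).
        System.out.println("Cpu core Num "
                + Runtime.getRuntime().availableProcessors());
    }
}
```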

The fix: stay on spark2.4.0-hadoop3.0 (jdk 1.8.0_202) and have Kubernetes specify the core count explicitly. Switching the JDK back to an older version would also work, but requires rebuilding the Docker image.

The node after the core count is specified:

Name:               wx-k8s-
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
Annotations:        flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl:
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, Jan :: +
Taints:             <none>
Unschedulable:      false
Conditions:
  Type            Status  LastHeartbeatTime  LastTransitionTime  Reason                      Message
  ----            ------  -----------------  ------------------  ------                      -------
  MemoryPressure  False   Mon, Mar :: +      Thu, Jan :: +       KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Mon, Mar :: +      Thu, Jan :: +       KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure     False   Mon, Mar :: +      Thu, Jan :: +       KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready           True    Mon, Mar :: +      Thu, Jan :: +       KubeletReady                kubelet is posting ready status
Addresses:
Capacity:
  cpu:
  ephemeral-storage:  1951511544Ki
  hugepages-1Gi:
  hugepages-2Mi:
  memory:             65758072Ki
  pods:
Allocatable:
  cpu:
  ephemeral-storage:
  hugepages-1Gi:
  hugepages-2Mi:
  memory:             65655672Ki
  pods:
System Info:
  Container Runtime Version:  docker://17.3.2
  Kubelet Version:            v1.13.2
  Kube-Proxy Version:         v1.13.2
PodCIDR:             10.244.7.0/
Non-terminated Pods: ( in total)
  Namespace    Name                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------    ----                   ------------  ----------  ---------------  -------------  ---
  kube-system  kube-flannel-ds-l594f  100m (%)      100m (%)    50Mi (%)         50Mi (%)       11d
  kube-system  kube-proxy-vckxf       (%)           (%)         (%)              (%)            39d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                20100m (%)   100m (%)
  memory             45106Mi (%)  61490Mi (%)
  ephemeral-storage  (%)          (%)
Events:             <none>
