Upgraded the service from spark2.3.0-hadoop2.8 to spark2.4.0-hadoop3.0

A day later, the Spark Streaming Kafka consumers started to back up, with unconsumed data piling up.

The service is not deployed on YARN in the traditional way, but on Kubernetes (1.13.2): https://spark.apache.org/docs/latest/running-on-kubernetes.html

Because the cluster had recently undergone some major operations, I first assumed the backlog was caused by a cluster I/O bottleneck and made several I/O-oriented optimizations, but they had little effect.

I kept watching the service logs and the server load.

Then something odd stood out: the CPU usage of the Spark-related services stayed between 100% and 200%, and sat at 100% for long stretches.

The machines in the cluster have 32 cores, so 100% CPU usage effectively means only a single core is in use. Something was clearly wrong.

So the backlog was probably not an I/O bottleneck but a compute bottleneck (the service does compute-intensive work internally: word segmentation, classification, clustering, and so on).

The program tunes its parallelism according to the number of CPU cores.

The method used to get the core count of the environment:
def GetCpuCoreNum(): Int = {
  Runtime.getRuntime.availableProcessors
}
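
That value then drives the internal parallelism. Below is a minimal sketch of the idea (illustrative only, not the actual service code; the object and method names are made up) showing why a detected core count of 1 serializes all the compute-heavy stages:

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object Parallelism {
  // Detected core count: 32 on the host with jdk 1.8.0_131,
  // but 1 inside an unconstrained container with jdk 1.8.0_202.
  val coreNum: Int = Runtime.getRuntime.availableProcessors

  // Thread pool for CPU-bound work (segmentation, classification, clustering),
  // sized from the detected core count.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(coreNum))

  // With coreNum == 1 every document ends up processed on a single thread.
  def segmentAll(docs: Seq[String])(segment: String => Seq[String]): Future[Seq[Seq[String]]] =
    Future.traverse(docs)(doc => Future(segment(doc)))
}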

Print the core count:

spark 2.4.0

root@consume-topic-qk-nwd-7d84585f5-kh7z5:/usr/spark-2.4.# java -version
java version "1.8.0_202"
Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)
[cuidapeng@wx-k8s- ~]$ kb logs consume-topic-qk-nwd-7d84585f5-kh7z5 |more
-- :: WARN NativeCodeLoader: - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Cpu core Num 1
-- :: INFO SparkContext: - Running Spark version 2.4.0
-- :: INFO SparkContext: - Submitted application: topic-quick
-- :: INFO SecurityManager: - Changing view acls to: root
-- :: INFO SecurityManager: - Changing modify acls to: root
-- :: INFO SecurityManager: - Changing view acls groups to:
-- :: INFO SecurityManager: - Changing modify acls groups to:
-- :: INFO SecurityManager: - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
-- :: INFO Utils: - Successfully started service 'sparkDriver' on port .
-- :: INFO SparkEnv: - Registering MapOutputTracker
-- :: INFO SparkEnv: - Registering BlockManagerMaster
-- :: INFO BlockManagerMasterEndpoint: - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
-- :: INFO BlockManagerMasterEndpoint: - BlockManagerMasterEndpoint up
-- :: INFO DiskBlockManager: - Created local directory at /tmp/blockmgr-dc0c496e-e5ab-4d07-a518-440f2336f65c
-- :: INFO MemoryStore: - MemoryStore started with capacity 4.5 GB
-- :: INFO SparkEnv: - Registering OutputCommitCoordinator
-- :: INFO log: - Logging initialized @2888ms

Cpu core Num 1: the service has fallen back to single-core computation, and this is where the backlog comes from.

The guess was right. Roll the version back to 2.3.0.

Rolled back to spark 2.3.0

root@consume-topic-dt-nwd-67b7fd6dd5-jztpb:/usr/spark-2.3.# java -version
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
-- :: WARN NativeCodeLoader: - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Cpu core Num 32
-- :: INFO SparkContext: - Running Spark version 2.3.0
-- :: INFO SparkContext: - Submitted application: topic-dt
-- :: INFO SecurityManager: - Changing view acls to: root
-- :: INFO SecurityManager: - Changing modify acls to: root
-- :: INFO SecurityManager: - Changing view acls groups to:
-- :: INFO SecurityManager: - Changing modify acls groups to:
-- :: INFO SecurityManager: - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
-- :: INFO Utils: - Successfully started service 'sparkDriver' on port .
-- :: INFO SparkEnv: - Registering MapOutputTracker
-- :: INFO SparkEnv: - Registering BlockManagerMaster
-- :: INFO BlockManagerMasterEndpoint: - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
-- :: INFO BlockManagerMasterEndpoint: - BlockManagerMasterEndpoint up
-- :: INFO DiskBlockManager: - Created local directory at /tmp/blockmgr-5dbf1194-477a---3da01b5a3f01
-- :: INFO MemoryStore: - MemoryStore started with capacity 6.2 GB
-- :: INFO SparkEnv: - Registering OutputCommitCoordinator
-- :: INFO log: - Logging initialized @2867ms

Cpu core Num 32: 32 is the core count of the physical machine.

So the backlog was not caused by I/O at all; it was caused by the runtime seeing fewer available cores. After upgrading Spark to 2.4.0, the service went from 32-core parallel execution to single-core execution.

Strictly speaking, this is not a Spark problem but a JDK problem.

Long ago there was a requirement to cap core resources inside Docker and have the JDK report the Docker-limited core count. My recollection is that this had been raised against the JDK and would only land in a future JDK 9/10, with no way to do it on JDK 8, so the plan of limiting cores through Docker was dropped, and compute resources were instead constrained by spreading services across the scheduler.

I never expected JDK 8 to pick this behavior up later, and that is exactly the pit I stepped into here.

Docker options related to controlling CPU:

Usage:    docker run [OPTIONS] IMAGE [COMMAND] [ARG...]
Run a command in a new container
Options:
      --cpu-period int          Limit CPU CFS (Completely Fair Scheduler) period
      --cpu-quota int           Limit CPU CFS (Completely Fair Scheduler) quota
      --cpu-rt-period int       Limit CPU real-time period in microseconds
      --cpu-rt-runtime int      Limit CPU real-time runtime in microseconds
  -c, --cpu-shares int          CPU shares (relative weight)
      --cpus decimal            Number of CPUs
      --cpuset-cpus string      CPUs in which to allow execution (0-3, 0,1)
      --cpuset-mems string      MEMs in which to allow execution (0-3, 0,1)
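
To see what these flags actually translate to inside a container, here is a hedged check (assumptions: cgroup v1, which is what this Docker version used, and a stock openjdk:8 image): --cpus is written as a CFS quota over a period, and it is this quota/period pair that the newer JDK reads when computing availableProcessors.

# Illustration only: --cpus=2 shows up as quota 200000 over period 100000, i.e. 2 cores
docker run --rm --cpus=2 openjdk:8 sh -c \
  'cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us /sys/fs/cgroup/cpu/cpu.cfs_period_us'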

One more thing: the services are scheduled by Kubernetes, which adds another layer of resource management on top of Docker.

Kubernetes controls CPU in two ways:
one based on dedicated cores (the CPU Manager): https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
one based on percentages (requests/limits): https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
Allocate the CPU resources manually:

        resources:
          requests:
            cpu: 12
            memory: "24Gi"
          limits:
            cpu: 12
            memory: "24Gi"

Update the service:

-- :: WARN NativeCodeLoader: - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Cpu core Num 12
-- :: INFO SparkContext: - Running Spark version 2.4.0
-- :: INFO SparkContext: - Submitted application: topic-dt
-- :: INFO SecurityManager: - Changing view acls to: root
-- :: INFO SecurityManager: - Changing modify acls to: root
-- :: INFO SecurityManager: - Changing view acls groups to:
-- :: INFO SecurityManager: - Changing modify acls groups to:
-- :: INFO SecurityManager: - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
-- :: INFO Utils: - Successfully started service 'sparkDriver' on port .
-- :: INFO SparkEnv: - Registering MapOutputTracker
-- :: INFO SparkEnv: - Registering BlockManagerMaster
-- :: INFO BlockManagerMasterEndpoint: - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
-- :: INFO BlockManagerMasterEndpoint: - BlockManagerMasterEndpoint up
-- :: INFO DiskBlockManager: - Created local directory at /tmp/blockmgr-764f35a8-ea7f---22cbbe2d9a39
-- :: INFO MemoryStore: - MemoryStore started with capacity 6.2 GB
-- :: INFO SparkEnv: - Registering OutputCommitCoordinator
-- :: INFO log: - Logging initialized @2855ms

Cpu core Num 12: the setting took effect.

So there is a compatibility issue around core detection between Kubernetes (Docker) and Spark (the JDK):

jdk 1.8.0_131 inside Docker reports the host's core count.

jdk 1.8.0_202 inside Docker reports the core count Docker is limited to; when Kubernetes does not specify resources, that defaults to 1 (the newer JDK reads the container's cgroup CPU limits, and an unconstrained pod ends up reported as a single core).

The fix: keep the upgrade to spark2.4.0-hadoop3.0 (jdk 1.8.0_202) and have Kubernetes specify the core count explicitly; switching the JDK back to an older version would also work, but that means rebuilding the Docker image.
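
Another option I did not take (listed only as a hedged alternative; it needs a JDK 8 build that already carries the container-support backport, 8u191 or later, so it should apply to 1.8.0_202): pin the JVM's view of the core count with -XX:ActiveProcessorCount and pass it through Spark's extraJavaOptions, for example:

# Hedged alternative, not what was done here: force the JVM to report 12 cores
--conf spark.driver.extraJavaOptions=-XX:ActiveProcessorCount=12 \
--conf spark.executor.extraJavaOptions=-XX:ActiveProcessorCount=12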

Node state after specifying the core counts:

Name:               wx-k8s-
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl:
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, Jan :: +
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Mon, Mar :: + Thu, Jan :: + KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, Mar :: + Thu, Jan :: + KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, Mar :: + Thu, Jan :: + KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, Mar :: + Thu, Jan :: + KubeletReady kubelet is posting ready status
Addresses:
Capacity:
cpu:
ephemeral-storage: 1951511544Ki
hugepages-1Gi:
hugepages-2Mi:
memory: 65758072Ki
pods:
Allocatable:
cpu:
ephemeral-storage:
hugepages-1Gi:
hugepages-2Mi:
memory: 65655672Ki
pods:
System Info:
Container Runtime Version: docker://17.3.2
Kubelet Version: v1.13.2
Kube-Proxy Version: v1.13.2
PodCIDR: 10.244.7.0/
Non-terminated Pods: ( in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system kube-flannel-ds-l594f 100m (%) 100m (%) 50Mi (%) 50Mi (%) 11d
kube-system kube-proxy-vckxf (%) (%) (%) (%) 39d
Allocated resources:
(Total limits may be over percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 20100m (%) 100m (%)
memory 45106Mi (%) 61490Mi (%)
ephemeral-storage (%) (%)
Events: <none>
