mds0: Many clients (191) failing to respond to cache pressure

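CephFS is the main distributed file system our product depends on, but it has not been kind to us: it frequently runs into problems under stress testing.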

Background
Follow-up efforts
Temporary workaround

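Background

The problem seen in the cluster: mds0: Many clients (191) failing to respond to cache pressure

Setup: three nodes, more than 100 clients with the file system mounted, and only about 100 MB of free memory left on the server. Ceph reported the following: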

[root@node1 ceph]# ceph -s
    cluster 1338affa-2d3d-416e-9251-4aa6e9c20eef
     health HEALTH_WARN
            mds0: Many clients (191) failing to respond to cache pressure
     monmap e1: 3 mons at {node1=192.168.0.1:6789/0,node2=192.168.0.2:6789/0,node3=192.168.0.3:6789/0}
            election epoch 22, quorum 0,1,2 node1,node2,node3
      fsmap e924: 1/1/1 up {0=node1=up:active}, 2 up:standby
     osdmap e71: 3 osds: 3 up, 3 in
            flags sortbitwise,require_jewel_osds
      pgmap v48336: 576 pgs, 3 pools, 82382 MB data, 176 kobjects
            162 GB used, 5963 GB / 6126 GB avail
                 576 active+clean
  client io 0 B/s rd, 977 kB/s wr, 19 op/s rd, 116 op/s wr

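The problem remains unsolved to this day. (By that I mean I still have not worked out how the capability mechanism works. If you subscribe to the idea that when you cannot solve a problem you get rid of whoever raised it, see the third section.)

The MDS log shows: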

2019-11-12 16:00:17.679876 7fa6a5040700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 34.236623 secs
2019-11-12 16:00:17.679914 7fa6a5040700  0 log_channel(cluster) log [WRN] : slow request 34.236623 seconds old, received at 2019-11-12 15:59:43.326917: client_request(client.154893:13683 open #1000005cb77 2019-11-12 15:59:43.293037) currently failed to xlock, waiting
2019-11-12 16:03:27.614474 7fa6a5040700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 34.350555 secs
2019-11-12 16:03:27.614523 7fa6a5040700  0 log_channel(cluster) log [WRN] : slow request 34.350555 seconds old, received at 2019-11-12 16:02:53.263857: client_request(client.155079:5446 open #1000003e360 2019-11-12 16:02:54.011037) currently failed to xlock, waiting
2019-11-12 16:03:57.615297 7fa6a5040700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 64.351379 secs
2019-11-12 16:03:57.615322 7fa6a5040700  0 log_channel(cluster) log [WRN] : slow request 64.351379 seconds old, received at 2019-11-12 16:02:53.263857: client_request(client.155079:5446 open #1000003e360 2019-11-12 16:02:54.011037) currently failed to xlock, waiting
2019-11-12 16:03:58.181330 7fa6a5040700  0 log_channel(cluster) log [WRN] : client.155079 isn't responding to mclientcaps(revoke), ino 1000003e360 pending pAsxLsXsxFcb issued pAsxLsXsxFsxcrwb, sent 64.458260 seconds ago
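The last log line above names the client that is not answering a caps revoke, together with the "pending" and "issued" capability strings. As a rough sketch (not an official Ceph tool), the Python below pulls those fields out of MDS log lines and lists the cap bits that appear in "issued" but not in "pending". It assumes my reading of the cap notation is right: uppercase letters (A, L, X, F) start a group, and the lowercase letters after them are that group's bits. The function names `split_caps`, `revoked_bits` and `find_stuck_revokes` are made up for this example.

```python
import re

# Matches the "isn't responding to mclientcaps(revoke)" warning in the MDS log.
REVOKE_RE = re.compile(
    r"client\.(?P<client>\d+) isn't responding to mclientcaps\(revoke\), "
    r"ino (?P<ino>[0-9a-f]+) pending (?P<pending>\S+) issued (?P<issued>\S+), "
    r"sent (?P<age>[\d.]+) seconds ago"
)


def split_caps(caps):
    """Split a cap string such as 'pAsxFcb' into {'p': set(), 'A': {'s','x'}, 'F': {'c','b'}}.

    Assumption: an uppercase letter starts a group and the lowercase letters
    that follow are the bits of that group; a leading 'p' stands alone.
    """
    groups = {}
    current = None
    for ch in caps:
        if ch == "p" and current is None:
            groups["p"] = set()
        elif ch.isupper():
            current = ch
            groups[current] = set()
        elif current is not None:
            groups[current].add(ch)
    return groups


def revoked_bits(issued, pending):
    """Return, per group, the bits present in `issued` but absent from `pending`."""
    have, keep = split_caps(issued), split_caps(pending)
    diff = {}
    for group, bits in have.items():
        lost = bits - keep.get(group, set())
        if lost:
            diff[group] = "".join(sorted(lost))
    return diff


def find_stuck_revokes(lines):
    """Collect one record per log line that reports an unanswered caps revoke."""
    records = []
    for line in lines:
        m = REVOKE_RE.search(line)
        if m:
            records.append({
                "client": int(m["client"]),
                "ino": m["ino"],
                "revoking": revoked_bits(m["issued"], m["pending"]),
                "age_secs": float(m["age"]),
            })
    return records


# The warning line from the log above.
sample = (
    "2019-11-12 16:03:58.181330 7fa6a5040700  0 log_channel(cluster) log [WRN] : "
    "client.155079 isn't responding to mclientcaps(revoke), ino 1000003e360 "
    "pending pAsxLsXsxFcb issued pAsxLsXsxFsxcrwb, sent 64.458260 seconds ago"
)
for rec in find_stuck_revokes([sample]):
    print(rec)
```

For the line above, the only difference between the two strings is in the F group: s, x, r and w are in "issued" but not in "pending", which fits the earlier "failed to xlock, waiting" messages for the same inode.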

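Follow-up efforts

I set up my own environment to reproduce the issue: a test server with Ubuntu installed, used as the client. To my pleasant surprise, no matter how many directories I mounted from the same client, it only ever held the same two connections to the backend.

Even so, a similar problem showed up during the reproduction:

mds0: Client ubuntu:guest failing to respond to capability release

After leaving the system idle for a while, the following errors appeared: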

[root@ceph741 ~]# ceph -s
    cluster 1338affa-2d3d-416e-9251-4aa6e9c20eef
     health HEALTH_WARN
            mds0: Client ubuntu:guest failing to respond to capability release
            mds0: Client ubuntu:guest failing to advance its oldest client/flush tid
     monmap e2: 3 mons at {ceph741=192.168.15.112:6789/0,ceph742=192.168.15.113:6789/0,ceph743=192.168.15.114:6789/0}
            election epoch 38, quorum 0,1,2 ceph741,ceph742,ceph743
      fsmap e8989: 1/1/1 up {0=ceph743=up:active}, 2 up:standby
     osdmap e67: 3 osds: 3 up, 3 in
            flags sortbitwise,require_jewel_osds
      pgmap v847657: 576 pgs, 3 pools, 20803 MB data, 100907 objects
            44454 MB used, 241 GB / 284 GB avail
                 576 active+clean
  client io 59739 B/s rd, 3926 kB/s wr, 58 op/s rd, 770 op/s wr

Temporary workaround

The temporary workaround is to evict the misbehaving client.

The main commands are:

ceph tell mds.0 session ls
ceph tell mds.0 session evict id=249632

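Here id is the id of the problem client. So how does the problem client differ from the other clients? Honestly, I do not know; the references below may help.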
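One thing that may be worth comparing between sessions is how many capabilities each one holds. The sketch below is only a starting point: it assumes `session ls` prints a JSON array whose entries carry `id`, `num_caps` and `client_metadata.hostname`, and the helper name `rank_sessions_by_caps` and the sample data are made up for illustration (they are not output from the cluster above). A high cap count does not prove a client is the culprit.

```python
import json


def rank_sessions_by_caps(session_ls_json, top=5):
    """Return (id, num_caps, hostname) for the sessions holding the most caps.

    `session_ls_json` is the JSON text printed by `ceph tell mds.0 session ls`
    (assumed to be a list of session objects).
    """
    sessions = json.loads(session_ls_json)
    ranked = sorted(sessions, key=lambda s: s.get("num_caps", 0), reverse=True)
    return [
        (
            s["id"],
            s.get("num_caps", 0),
            s.get("client_metadata", {}).get("hostname", "?"),
        )
        for s in ranked[:top]
    ]


# Hypothetical sample data, shaped like the assumed `session ls` output.
sample = json.dumps([
    {"id": 1001, "num_caps": 12, "state": "open",
     "client_metadata": {"hostname": "host-a"}},
    {"id": 1002, "num_caps": 48210, "state": "open",
     "client_metadata": {"hostname": "host-b"}},
    {"id": 1003, "num_caps": 305, "state": "open",
     "client_metadata": {}},
])

for sid, caps, host in rank_sessions_by_caps(sample, top=2):
    # The id printed here is what would be passed as id=... to `session evict`.
    print(f"id={sid} num_caps={caps} hostname={host}")
```

To use it against a real cluster, save the output of `ceph tell mds.0 session ls` to a file and pass the file's contents to the function.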

References:

https://www.jianshu.com/p/d1e0e32346ac

http://www.talkwithtrend.com/Article/242905

https://www.jianshu.com/p/fa49e40f6133
