Background:

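Our project team recently migrated the production Tomcat to Linux. It ran smoothly for a while, but recently an infrequent "too many open files" problem caused the service to hang and stop responding to client requests.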

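I logged in to the server to check the logs and found that Tomcat's logs for 6-7 a.m. were missing, yet the process's listening port was still open.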

root@# lsof -i:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java root 67u IPv4 0t0 TCP *: (LISTEN)

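The archived logs offered some clues: the errors had already started at around 1 a.m. (In the log below, 严重 is the localized SEVERE level and 打开的文件过多 is the localized "Too many open files" message.)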

-Dec- ::12.514 严重 [http-nio--Acceptor-] org.apache.tomcat.util.net.Acceptor.run Socket accept failed
java.io.IOException: 打开的文件过多
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:)
at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:)
at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:)
at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:)
at java.lang.Thread.run(Thread.java:)

The errors continued all the way to 6-7 a.m., still occurring at a fairly high rate:

-Dec- ::06.932 严重 [http-nio--Acceptor-] org.apache.tomcat.util.net.Acceptor.run Socket accept failed
java.io.IOException: 打开的文件过多
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:)
at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:)
at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:)
at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:)
at java.lang.Thread.run(Thread.java:)
-Dec- ::07.692 严重 [http-nio--Acceptor-] org.apache.tomcat.util.net.Acceptor.run Socket accept failed
java.io.IOException: 打开的文件过多
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:)
at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:)
at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:)
at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:)
at java.lang.Thread.run(Thread.java:)
-Dec- ::08.532 严重 [http-nio--Acceptor-] org.apache.tomcat.util.net.Acceptor.run Socket accept failed
java.io.IOException: 打开的文件过多
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:)
at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:)
at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:)
at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:)
at java.lang.Thread.run(Thread.java:)

Troubleshooting steps:

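I first suspected that I had forgotten to tune the kernel parameters while preparing the production environment, so I checked the maximum number of open files: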

root@# ulimit -a
core file size (blocks, -c)
data seg size (kbytes, -d) unlimited
scheduling priority (-e)
file size (blocks, -f) unlimited
pending signals (-i)
max locked memory (kbytes, -l)
max memory size (kbytes, -m) unlimited
open files (-n) 4096
pipe size ( bytes, -p)
POSIX message queues (bytes, -q)
real-time priority (-r)
stack size (kbytes, -s)
cpu time (seconds, -t) unlimited
max user processes (-u)
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

The open files line is the maximum number of handles a single process is currently allowed to open; here it is 4096.

With that limit in mind, even if the class files were not packaged into JARs, the project should not consume that many file handles.
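One caveat worth checking at this step: ulimit reports the limit of the current shell, and a process that was started another way (for example by a service manager) may run under a different limit. On Linux the limit that actually applies to a running process is listed in /proc/<pid>/limits. A minimal sketch, assuming a Linux /proc filesystem; the function name soft_nofile_limit is made up for this example:

```shell
# Extract the soft "Max open files" limit from the text of a
# /proc/<pid>/limits file supplied on stdin.
# Line layout: Max open files  <soft>  <hard>  files
soft_nofile_limit() {
    awk '/^Max open files/ { print $4 }'
}

# Example: the limit of the current shell (skipped where /proc is absent).
if [ -r "/proc/$$/limits" ]; then
    soft_nofile_limit < "/proc/$$/limits"
fi
```

For Tomcat, replace $$ with the PID found by jps or ps.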

I also opened the limits configuration file to double-check:

vim /etc/security/limits.conf

#<domain>      <type>  <item>         <value>
# #* soft core
#root hard core
#* hard rss
#@student hard nproc
#@faculty soft nproc
#@faculty hard nproc
#ftp hard nproc
#ftp - chroot /ftp
#@student - maxlogins
# End of file
* soft nofile
* hard nofile

All of this looked fine; the file-handle limit seemed ample for Tomcat.

Find the Tomcat process:

jps

or

ps -ef | grep -i 'bootstrap.jar' | grep -v grep | awk '{print $2}'

This gives the Tomcat process ID:

root@# ps -ef | grep -i 'bootstrap.jar' | grep -v grep | awk '{print $2}'

I raised the maximum number of open file handles, then checked how many handles the process had open:

lsof -p | wc -l

root@# lsof -p 29511| wc -l
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/108/gvfs
Output information may be incomplete.
265

root@# lsof -p 29511| wc -l
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/108/gvfs
Output information may be incomplete.
268
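Running lsof by hand a couple of times shows the count creeping up; sampling it at a fixed interval makes a leak easier to see. The sketch below is one possible way to do that on Linux by counting entries in /proc/<pid>/fd. The function names count_fds and watch_fds are invented for this example, and the numbers will be lower than those from lsof | wc -l, because lsof also lists the header line, memory-mapped files, and the cwd/rtd/txt entries.

```shell
# Count the file descriptors a process has open, via Linux /proc.
# Needs permission to read /proc/<pid>/fd (same user or root).
count_fds() {
    ls "/proc/$1/fd" | wc -l
}

# Print "<time> <count>" a fixed number of times, pausing between samples.
# Arguments: pid, number of samples, seconds between samples.
watch_fds() {
    wf_pid=$1
    wf_left=$2
    wf_interval=$3
    while [ "$wf_left" -gt 0 ]; do
        printf '%s %s\n' "$(date +%T)" "$(count_fds "$wf_pid")"
        wf_left=$((wf_left - 1))
        if [ "$wf_left" -gt 0 ]; then
            sleep "$wf_interval"
        fi
    done
}

# Example: sample the current shell twice, one second apart.
watch_fds "$$" 2 1
```

A count that rises steadily while traffic is flat points to a descriptor leak rather than to a limit that is simply too low.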

The following command lists the file handles the process has open in detail:

root@# lsof -p
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/108/gvfs
Output information may be incomplete.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java root cwd DIR , /opt/tomcat-9.0./logs
java root rtd DIR , /
java root txt REG , /opt/jdk1..0_181/bin/java
java root mem REG , /var/cache/fontconfig/945677eb7aeaf62f1d50efc3fb3ec7d8-le64.cache-
java root mem REG , /var/cache/fontconfig/2cd17615ca594fa2959ae173292e504c-le64.cache-
java root mem REG , /var/cache/fontconfig/04aabc0a78ac019cf9454389977116d2-le64.cache-
java root mem REG , /var/cache/fontconfig/385c0604a188198f04d133e54aba7fe7-le64.cache-
java root mem REG , /opt/jdk1..0_181/jre/lib/amd64/libjpeg.so
java root mem REG , /opt/jdk1..0_181/jre/lib/amd64/libt2k.so
java root mem REG , /opt/jdk1..0_181/jre/lib/amd64/libfontmanager.so
java root mem REG , /var/cache/fontconfig/0d8c3b2ac0904cb8a57a757ad11a4a08-le64.cache-
java root mem REG , /var/cache/fontconfig/1ac9eb803944fde146138c791f5cc56a-le64.cache-
java root mem REG , /var/cache/fontconfig/dc05db6664285cc2f12bf69c139ae4c3-le64.cache-
java root mem REG , /var/cache/fontconfig/767a8244fc0220cfb567a839d0392e0b-le64.cache-
java root mem REG , /var/cache/fontconfig/4794a0821666d79190d59a36cb4f44b5-le64.cache-
java root mem REG , /var/cache/fontconfig/8801497958630a81b71ace7c5f9b32a8-le64.cache-
java root mem REG , /var/cache/fontconfig/bab58bb527bb656aaa9f116d68a48d89-le64.cache-
java root mem REG , /var/cache/fontconfig/3047814df9a2f067bd2d96a2b9c36e5a-le64.cache-
java root mem REG , /var/cache/fontconfig/56cf4f4769d0f4abc89a4895d7bd3ae1-le64.cache-
java root mem REG , /var/cache/fontconfig/b9d506c9ac06c20b433354fa67a72993-le64.cache-
java root mem REG , /var/cache/fontconfig/b47c4e1ecd0709278f4910c18777a504-le64.cache-
java root mem REG , /var/cache/fontconfig/d52a8644073d54c13679302ca1180695-le64.cache-
java root mem REG , /var/cache/fontconfig/551ecf3b0e8b0bca0f25c0944f561853-le64.cache-
java root mem REG , /var/cache/fontconfig/d589a48862398ed80a3d6066f4f56f4c-le64.cache-
.....
java root 133r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 134r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 135r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 136r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 137r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 138r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 139r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 140r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 141r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 142r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 143r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 144r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 145r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 146r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 147r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 148w REG , /opt/tomcat-9.0./logs/logs_tomcat.log

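The count kept growing, so I inspected the files the process had open. Comparing several snapshots showed that the handles on tomcat-users.xml only increased and were never released: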

java root 171r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 172r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 173r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 174r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 175r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 176r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 177r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 178r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 179r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 180r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 181r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 182r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 183r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 184r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 185r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 186r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 187r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 188r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 189r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 190r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 191r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 192r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 193r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 194r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 195r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 196r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 197r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 198r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 199r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 200r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 201r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 202r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 203r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
java root 204r REG , /opt/tomcat-9.0./conf/tomcat-users.xml
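Spotting the repeated file by eye works for a few hundred lines; grouping the lsof output by file name surfaces the leaking path directly. A rough sketch, assuming the NAME column is the last whitespace-separated field (paths containing spaces, or socket rows ending in a state such as (LISTEN), are not handled correctly); the function name top_open_files is invented for this example:

```shell
# Read `lsof -p <pid>` output on stdin and print the most frequently
# opened names, most frequent first, as "<count> <name>".
# Optional argument: how many names to show (default 10).
top_open_files() {
    awk 'NR > 1 { print $NF }' | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Usage (tomcat_pid is a placeholder for the PID found earlier):
#   lsof -p "$tomcat_pid" 2>/dev/null | top_open_files 5
```

With the output shown above, tomcat-users.xml would sit at the top of the list with a count that grows between runs.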

At this point the root cause was essentially pinned down: it lay in Tomcat itself.

I searched Google for: too many tomcat-users.xml open by tomcat

This confirmed that it is a bug in Tomcat 9.0.13: Tomcat too many files open (tomcat-users.xml)

This was caused by a bug in Tomcat 9.0.13 which has been fixed in Tomcat 9.0.14.

Upgrading Tomcat to 9.0.14 resolved the problem.
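When several servers need checking, the version comparison can be scripted. The sketch below only compares two version strings; it assumes GNU sort with the -V (version sort) option, and the function name version_at_least is invented for this example. It says nothing about other Tomcat release lines, only whether a 9.0.x version is at or past 9.0.14.

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example with literal version strings.
if version_at_least "9.0.13" "9.0.14"; then
    echo "9.0.13: includes the fix"
else
    echo "9.0.13: upgrade needed"
fi
```

The installed version can usually be read from the output of Tomcat's bin/version.sh or catalina.sh version and passed in as the first argument.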

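Author's note, reasoning by analogy from my limited Win32 knowledge: once the number of file handles a process has open reaches the configured maximum, the operating system refuses to hand out any more. The request does not wait; it fails immediately with "Too many open files", as the Acceptor errors in the log show. The process is therefore still alive and its listening port is still open, but every incoming client connection needs a new handle that it cannot get, so the service appears hung.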
