Log4J & ELK Incident Postmortem

At 8 a.m. on a Saturday, the application began suffering widespread login timeouts.

For an application with 150,000 daily active users and more than 7 million registered users, this is a critical failure.

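The only consolation was that it was a weekend, so only companies working overtime were using the app. Even so, the phones of the customer service and product teams were ringing nonstop.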

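Our first suspicion was the release deployed the previous evening. The operations team rolled back every application, but the problem persisted and the service remained unavailable.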

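Operations was out of ideas. By 9 a.m. nearly every development team leader had come in, along with the architecture team: a dozen or so people analyzing the problem while customer service, operations, and product staff reassured customers and posted announcements. The director and a vice president showed up as well and watched the developers scramble to find the cause.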

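We first suspected a failure in the backend infrastructure, which experience told us was the most likely culprit. We checked MongoDB, MySQL, Redis, Memcached, and RabbitMQ one by one, and all of them behaved normally: read/write load on the clusters was low and request latency was short.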

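Having ruled those out, we turned to an HTTP service. It is an old service that uses OAuth 1.0, and its underlying HTTP client cannot be given a timeout (newer services are built on async-http, which has timeout settings). We disabled the service, but the problem remained.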
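For comparison, this is roughly what an explicit timeout looks like with the JDK's own HttpURLConnection. It is only a sketch: the helper name, the URL, and the timeout values are made up for illustration and are not taken from our services.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

class TimeoutHttpExample {
    // Hypothetical helper: prepares a connection with explicit timeouts.
    // openConnection() does not touch the network yet; the timeouts apply
    // once the request is actually sent.
    static HttpURLConnection open(String url, int connectMs, int readMs) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(connectMs); // max time to establish the TCP connection
            conn.setReadTimeout(readMs);       // max time to wait for response data
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder URL, not a real endpoint.
        HttpURLConnection conn = open("http://legacy-service.invalid/api", 2000, 5000);
        System.out.println("connect timeout ms: " + conn.getConnectTimeout());
        System.out.println("read timeout ms: " + conn.getReadTimeout());
    }
}
```

Without both values set, a hung peer can hold the calling thread indefinitely, which is why this service was an early suspect.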

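The analysis then focused on requests to the internal service interfaces. Over the VPN we called the internal interfaces directly, testing several services whose requests all timed out when sent through the gateway. They all responded normally, which narrowed the problem down to the gateway layer.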

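The gateway is a lightweight service with two main responsibilities: (1) authentication (mobile clients, ERP) and (2) routing (by business line). In theory it should not be a source of trouble, but we decided to take a dump and look anyway.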

We took a dump of the gateway application, and the thread stacks revealed the problem:

"qtp1056944384-232" prio=10 tid=0x00007f54900d0800 nid=0x63b3 waiting for monitor entry [0x00007f54492d0000]

   java.lang.Thread.State: BLOCKED (on object monitor)

	at org.apache.log4j.Category.callAppenders(Category.java:205)

	- waiting to lock <0x00000007e81c4830> (a org.apache.log4j.spi.RootLogger)

	at org.apache.log4j.Category.forcedLog(Category.java:391)

	at org.apache.log4j.Category.log(Category.java:856)

	at org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:368)

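In total, more than 200 threads were blocked inside log4j waiting for lock 0x00000007e81c4830. Who held it? Searching the dump for that address turned up the following: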

"qtp1056944384-218" prio=10 tid=0x00007f54800bb800 nid=0x63a5 runnable [0x00007f544a0de000]

   java.lang.Thread.State: RUNNABLE

	at java.net.SocketOutputStream.socketWrite0(Native Method)

	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)

	at java.net.SocketOutputStream.write(SocketOutputStream.java:141)

	at net.logstash.log4j.SocketAppender.append(SocketAppender.java:190)

	at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)

	- locked <0x00000007e8210868> (a net.logstash.log4j.SocketAppender)

	at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)

	at org.apache.log4j.Category.callAppenders(Category.java:206)

	- locked <0x00000007e81c4830> (a org.apache.log4j.spi.RootLogger)

	at org.apache.log4j.Category.forcedLog(Category.java:391)

	at org.apache.log4j.Category.log(Category.java:856)

	at org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:368)
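The search we did by hand can be scripted. The sketch below was written for this write-up and is not part of our tooling: it counts how many threads wait on each monitor address in a HotSpot-style thread dump and finds the thread that holds a given one. The class and method names are hypothetical, and the sample is an abbreviated version of the two stacks above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LockWaiters {
    private static final Pattern WAITING =
            Pattern.compile("- waiting to lock <(0x[0-9a-fA-F]+)>");

    // Abbreviated sample built from the two stacks shown above.
    static final String SAMPLE =
            "\"qtp1056944384-232\" prio=10 waiting for monitor entry\n"
          + "\tat org.apache.log4j.Category.callAppenders(Category.java:205)\n"
          + "\t- waiting to lock <0x00000007e81c4830> (a org.apache.log4j.spi.RootLogger)\n"
          + "\n"
          + "\"qtp1056944384-218\" prio=10 runnable\n"
          + "\tat java.net.SocketOutputStream.socketWrite0(Native Method)\n"
          + "\t- locked <0x00000007e8210868> (a net.logstash.log4j.SocketAppender)\n"
          + "\t- locked <0x00000007e81c4830> (a org.apache.log4j.spi.RootLogger)\n";

    /** Counts how many stack traces are waiting on each monitor address. */
    static Map<String, Integer> countWaiters(String dump) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = WAITING.matcher(dump);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the name of the first thread that holds the given monitor, or null. */
    static String findOwner(String dump, String address) {
        String current = null;
        for (String line : dump.split("\\R")) {
            String t = line.trim();
            if (t.startsWith("\"")) {
                // Thread header: the name is the text between the first pair of quotes.
                int end = t.indexOf('"', 1);
                current = end > 0 ? t.substring(1, end) : null;
            } else if (t.startsWith("- locked <" + address + ">")) {
                return current;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(countWaiters(SAMPLE));
        System.out.println(findOwner(SAMPLE, "0x00000007e81c4830"));
    }
}
```

Run against a full dump, the address with the highest waiter count is usually the one worth chasing first.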

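Thread 218 holds lock 0x00000007e81c4830, and every other thread calling log4j is waiting for it to be released. The holder's stack contains a telling name, net.logstash.log4j, and the thread is sitting in a socket write, so our preliminary diagnosis was that the problem was related to the recently added ELK stack. ELK is Logstash + Elasticsearch + Kibana 4, a real-time log analysis system that the architecture team had introduced shortly before. The recent change was to the log4j.rootLogger setting, which gained an appender that ships logs to Logstash in real time. At 8 a.m. the Logstash log looked like this: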

WARNING: [logstash-elk_0_47-16997-7944] [gc][young][2284683][25218] duration [1.1s], collections [1]/[1.8s], total [1.1s]/[12.2m], memory [1gb]->[915.1mb]/[3.9gb], all_pools {[young] [269.4mb]->[6.4mb]/[532.5mb]}{[survivor] [50.3mb]->[50.3mb]/[66.5mb]}{[old] [742.8mb]->[858.4mb]/[3.3gb]}

{:timestamp=>"2015-12-12T08:01:00.606000+0800", :message=>"retrying failed action with response code: 503", :level=>:warn}

{:timestamp=>"2015-12-12T08:01:00.636000+0800", :message=>"retrying failed action with response code: 503", :level=>:warn}

{:timestamp=>"2015-12-12T08:01:00.637000+0800", :message=>"retrying failed action with response code: 503", :level=>:warn}

{:timestamp=>"2015-12-12T08:01:00.637000+0800", :message=>"retrying failed action with response code: 503", :level=>:warn}

Logstash was receiving a large number of 503 responses. The Elasticsearch log for the same period:

[2015-12-12 08:00:00,244][INFO ][cluster.metadata         ] [Iron Man] [logstash-2015.12.12] creating index, cause [auto(bulk api)], templates [logstash], shards [5]/[0], mappings [_default_]

[2015-12-12 08:00:00,285][INFO ][cluster.metadata         ] [Iron Man] [app-logs-2015.12.12] creating index, cause [auto(bulk api)], templates [], shards [5]/[0], mappings []

[2015-12-12 08:00:18,053][WARN ][cluster.routing.allocation.decider] [Iron Man] high disk watermark [10%] exceeded on [o-oE92qNSyyoqbzUoRTS7Q][Iron Man] free: 34.5gb[7%], shards will be relocated away from this node

[2015-12-12 08:00:48,053][WARN ][cluster.routing.allocation.decider] [Iron Man] high disk watermark [10%] exceeded on [o-oE92qNSyyoqbzUoRTS7Q][Iron Man] free: 34.5gb[7%], shards will be relocated away from this node

[2015-12-12 08:00:48,054][INFO ][cluster.routing.allocation.decider] [Iron Man] high disk watermark exceeded on one or more nodes, rerouting shards

[2015-12-12 08:01:18,087][WARN ][cluster.routing.allocation.decider] [Iron Man] high disk watermark [10%] exceeded on [o-oE92qNSyyoqbzUoRTS7Q][Iron Man] free: 34.5gb[7%], shards will be relocated away from this node

[2015-12-12 08:01:48,054][WARN ][cluster.routing.allocation.decider] [Iron Man] high disk watermark [10%] exceeded on [o-oE92qNSyyoqbzUoRTS7Q][Iron Man] free: 34.5gb[7%], shards will be relocated away from this node

[2015-12-12 08:02:18,054][WARN ][cluster.routing.allocation.decider] [Iron Man] high disk watermark [10%] exceeded on [o-oE92qNSyyoqbzUoRTS7Q][Iron Man] free: 34.5gb[7%], shards will be relocated away from this node

[2015-12-12 08:02:18,054][INFO ][cluster.routing.allocation.decider] [Iron Man] high disk watermark exceeded on one or more nodes, rerouting shards

[2015-12-12 08:02:48,054][WARN ][cluster.routing.allocation.decider] [Iron Man] high disk watermark [10%] exceeded on [o-oE92qNSyyoqbzUoRTS7Q][Iron Man] free: 34.5gb[7%], shards will be relocated away from this node

[2015-12-12 08:03:18,054][WARN ][cluster.routing.allocation.decider] [Iron Man] high disk watermark [10%] exceeded on [o-oE92qNSyyoqbzUoRTS7Q][Iron Man] free: 34.7gb[7%], shards will be relocated away from this node

[2015-12-12 08:03:18,055][INFO ][cluster.routing.allocation.decider] [Iron Man] high disk watermark exceeded on one or more nodes, rerouting shards

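According to operations, ELK was running as a single node. Elasticsearch had run low on disk space and become unavailable (no alerting had been set up for this), so Logstash blocked. With Logstash blocked, the entire gateway cluster became unavailable.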
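The missing alert is simple to approximate. A minimal sketch, assuming a scheduled job on the Elasticsearch host: compare the free space on the data volume against a threshold and raise an alarm before the node reaches its watermark. The 10% threshold mirrors the watermark in the log above; the class name, the default path, and printing as the "alert" are placeholders.

```java
import java.io.File;

class DiskWatermarkCheck {
    /** Free space as a percentage of total capacity; 0 when the total is unknown. */
    static double freePercent(long usableBytes, long totalBytes) {
        if (totalBytes <= 0) {
            return 0.0;
        }
        return 100.0 * usableBytes / totalBytes;
    }

    /** True when free space has fallen below the required minimum percentage. */
    static boolean belowWatermark(long usableBytes, long totalBytes, double minFreePercent) {
        return freePercent(usableBytes, totalBytes) < minFreePercent;
    }

    public static void main(String[] args) {
        // Placeholder path: pass the Elasticsearch data directory as the first argument.
        File dataDir = new File(args.length > 0 ? args[0] : ".");
        long usable = dataDir.getUsableSpace();
        long total = dataDir.getTotalSpace();
        double minFree = 10.0; // assumption: same value as the 10% watermark in the log

        String status = belowWatermark(usable, total, minFree) ? "ALERT" : "OK";
        // Printing stands in for a real notification channel.
        System.out.println(String.format("%s: %.1f%% free on %s",
                status, freePercent(usable, total), dataDir.getAbsolutePath()));
    }
}
```

With the figures from the log (7% free against a 10% watermark), this check would have fired well before Elasticsearch started rejecting writes.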

The immediate fix was to disable the synchronous write to Logstash and temporarily raise the production log level to FATAL to reduce log volume.

Why would log4j end up with hundreds of threads waiting on a single lock? I later went through the source of log4j's Category.callAppenders:

  /**
     Call the appenders in the hierrachy starting at
     <code>this</code>.  If no appenders could be found, emit a
     warning.

     <p>This method calls all the appenders inherited from the
     hierarchy circumventing any evaluation of whether to log or not
     to log the particular log request.

     @param event the event to log.  */
  public void callAppenders(LoggingEvent event) {
    int writes = 0;

    for(Category c = this; c != null; c=c.parent) {
      // Protected against simultaneous call to addAppender, removeAppender,...
      synchronized(c) {
        if(c.aai != null) {
          writes += c.aai.appendLoopOnAppenders(event);
        }
        if(!c.additive) {
          break;
        }
      }
    }

    if(writes == 0) {
      repository.emitNoAppenderWarning(this);
    }
  }

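log4j 1.x relies on a plain synchronized block here: callAppenders locks each Category it visits while walking up the hierarchy (synchronized(c)). Our appenders were attached to the root logger through log4j.rootLogger in log4j.properties, so every logging thread ended up contending for the same RootLogger monitor. Threads logging through the same Category are fully serialized, and if an appender stalls while the lock is held, as the socket write to Logstash did, all of them stall with it. This makes log4j 1.x a poor fit for high-concurrency workloads.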
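The effect is easy to reproduce without log4j. The following is a toy model, not log4j code: a plain object stands in for the RootLogger monitor, and Thread.sleep stands in for the blocked socket write. All names are invented for the demonstration.

```java
class SharedMonitorDemo {
    // Stands in for the single RootLogger instance that every thread locks.
    private final Object rootLogger = new Object();

    // Stands in for an appender whose write is slow or blocked.
    private void slowAppend(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Mirrors the shape of callAppenders: the append runs while the monitor is held.
    void log(long appendMillis) {
        synchronized (rootLogger) {
            slowAppend(appendMillis);
        }
    }

    /** Runs the given number of concurrent log calls and returns the elapsed time in ms. */
    static long run(int threads, long appendMillis) {
        SharedMonitorDemo demo = new SharedMonitorDemo();
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> demo.log(appendMillis));
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Four threads, 50 ms per append: the calls run one after another,
        // so the total is roughly the sum of the appends, not the maximum.
        System.out.println("elapsed ms: " + run(4, 50));
    }
}
```

Replace the 50 ms sleep with a write that never returns and the elapsed time becomes unbounded, which is what the gateway's request threads experienced.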

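To keep this from happening again, we took away two lessons:

1. Cut unnecessary logging. On mature interfaces, turn log output off; this also improves throughput.

2. Replace the underlying logging implementation. Stop using log4j 1.x and move to logback (recommended) or log4j 2.x.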
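Whichever library is chosen, the property that matters is that request threads never wait on the log transport. A rough sketch of that idea using only the JDK follows. It is not how logback or log4j 2.x are implemented, and the class name, the capacity, and the drop-on-full policy are assumptions made for illustration. Events go into a bounded queue through a non-blocking offer, a background thread drains the queue, and when the collector stalls the buffer fills up and new events are dropped instead of stalling callers.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

class DroppingAsyncSink {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    DroppingAsyncSink(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Called by request threads. Never blocks: a full buffer means the event is dropped. */
    boolean submit(String event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    long droppedCount() {
        return dropped.get();
    }

    /** Starts a daemon thread that hands queued events to the (possibly slow) writer. */
    Thread startDrainer(Consumer<String> writer) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    writer.accept(queue.take()); // only this thread ever waits on the writer
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "log-drainer");
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) {
        DroppingAsyncSink sink = new DroppingAsyncSink(2);
        // No drainer is started, which simulates a collector that has stopped accepting data.
        for (int i = 0; i < 5; i++) {
            sink.submit("event-" + i);
        }
        System.out.println("dropped: " + sink.droppedCount()); // prints "dropped: 3"
    }
}
```

Dropping log lines is a trade-off, not a free win: it protects request latency at the cost of completeness, so the drop counter should be monitored to make the loss visible.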

Finally, here are two analyses of logging deadlocks in log4j 1.x, which is what we suspected at first:

https://bz.apache.org/bugzilla/show_bug.cgi?id=50213

http://javaeesupportpatterns.blogspot.com/2012/09/log4j-thread-deadlock-case-study.html
