请求量突增一下，系统有效QPS为何下降很多？

原创：扣钉日记（微信公众号ID：codelogs），欢迎分享，转载请保留出处。

简介

最近我观察到一个现象，当服务的请求量突发的增长一下时，服务的有效QPS会下降很多，有时甚至会降到0，这种现象网上也偶有提到，但少有解释得清楚的，所以这里来分享一下问题成因及解决方案。

队列延迟

目前的Web服务器，如Tomcat，请求处理过程大概都类似如下：

这是Tomcat请求处理的过程，如下：

Acceptor线程：线程名类似http-nio-8080-Acceptor-0，此线程用于接收新的TCP连接，并将TCP连接注册到NIO事件中。
Poller线程：线程名类似http-nio-8080-ClientPoller-0，此线程一般有CPU核数个，用于轮询已连接的Socket，接收新到来的Socket事件（如调用端发请求数据了），并将活跃Socket放入exec线程池的请求队列中。
exec线程：线程名类似http-nio-8080-exec-0，此线程从请求队列中取出活跃Socket，并读出请求数据，最后执行请求的API逻辑。

这里不用太关心Acceptor与Poller线程，这是nio编程时常见的线程模型，我们将重点放在exec线程池上，虽然Tomcat做了一些优化，但它还是从Java原生线程池扩展出来的，即有一个任务队列与一组线程。

当请求量突发增长时，会发生如下的情况：

当请求量不大时，任务队列基本是空的，每个请求都能得到及时的处理。
但当请求量突发时，任务队列中就会有很多请求，这时排在队列后面的请求，就会被处理得越晚，因而请求的整体耗时就会变长，甚至非常长。

可是，exec线程们还是在一刻不停歇的处理着请求的呀，按理说服务QPS是不会减少的呀！

简单想想的确如此，但调用端一般是有超时时间设置的，不会无限等待下去，当客户端等待超时的时候，这个请求实际上Tomcat就不用再处理了，因为就算处理了，客户端也不会再去读响应数据的。

因此，当队列比较长时，队列后面的请求，基本上都是不用再处理的，但exec线程池不知道啊，它还是会一如既往地处理这些请求。

当exec线程执行这些已超时的请求时，若又有新请求进来，它们也会排在队尾，这导致这些新请求也会超时，所以在流量突发的这段时间内，请求的有效QPS会下降很多，甚至会降到0。

这种超时也叫做队列延迟，但队列在软件系统中应用得太广泛了，比如操作系统调度器维护了线程队列，TCP中有backlog连接队列，锁中维护了等待队列等等。

因此，很多系统也会存在这种现象，平时响应时间挺稳定的，但偶尔耗时很高，这种情况有很多都是队列延迟导致的。

优化队列延迟

知道了问题产生的原因，要优化它就比较简单了，我们只需要让队列中那些长时间未处理的请求暂时让路，让线程去执行那些等待时间不长的请求即可，毕竟这些长时间未处理的请求，让它们再等等也无防，因为客户端可能已经超时了而不需要请求结果了，虽然这破坏了队列的公平性，但这是我们需要的。

对于Tomcat，在springboot中，我们可以如下修改：

使用WebServerFactoryCustomizer自定义Tomcat的线程池，如下：

@Component

public class TomcatExecutorCustomizer implements WebServerFactoryCustomizer<TomcatServletWebServerFactory> {

    @Resource

    ServerProperties serverProperties;

    @Override

    public void customize(TomcatServletWebServerFactory factory) {

        TomcatConnectorCustomizer tomcatConnectorCustomizer = connector -> {

            ServerProperties.Tomcat.Threads threads = serverProperties.getTomcat().getThreads();

            TaskQueue taskqueue = new SlowDelayTaskQueue(1000);

            ThreadPoolExecutor executor = new org.apache.tomcat.util.threads.ThreadPoolExecutor(

                    threads.getMinSpare(), threads.getMax(), 60L, TimeUnit.SECONDS,

                    taskqueue, new CustomizableThreadFactory("http-nio-8080-exec-"));

            taskqueue.setParent(executor);

            ProtocolHandler handler = connector.getProtocolHandler();

            if (handler instanceof AbstractProtocol) {

                AbstractProtocol<?> protocol = (AbstractProtocol<?>) handler;

                protocol.setExecutor(executor);

            }

        };

        factory.addConnectorCustomizers(tomcatConnectorCustomizer);

    }

}

注意，这里还是使用的Tomcat实现的线程池，只是将任务队列TaskQueue扩展为了SlowDelayTaskQueue，它的作用是将长时间未处理的任务移到另一个慢队列中，待当前队列中无任务时，再把慢队列中的任务移回来。

为了能记录任务入队列的时间，先封装了一个记录时间的任务类RecordTimeTask，如下：

@Getter

public class RecordTimeTask implements Runnable {

    private Runnable run;

    private long createTime;

    private long putQueueTime;

    public RecordTimeTask(Runnable run){

        this.run = run;

        this.createTime = System.currentTimeMillis();

        this.putQueueTime = this.createTime;

    }

    @Override

    public void run() {

        run.run();

    }

    public void resetPutQueueTime() {

        this.putQueueTime = System.currentTimeMillis();

    }

    public long getPutQueueTime() {

        return this.putQueueTime;

    }

}

然后队列的扩展实现如下：

public class SlowDelayTaskQueue extends TaskQueue {

    private long timeout;

    private BlockingQueue<RecordTimeTask> slowQueue;

    public SlowDelayTaskQueue(long timeout) {

        this.timeout = timeout;

        this.slowQueue = new LinkedBlockingQueue<>();

    }

    @Override

    public boolean offer(Runnable o) {

        // 将任务包装一下，目的是为了记录任务放入队列的时间

        if (o instanceof RecordTimeTask) {

            return super.offer(o);

        } else {

            return super.offer(new RecordTimeTask(o));

        }

    }

    public void pullbackIfEmpty() {

        // 如果队列空了，从慢队列中取回来一个

        if (this.isEmpty()) {

            RecordTimeTask r = slowQueue.poll();

            if (r == null) {

                return;

            }

            r.resetPutQueueTime();

            this.add(r);

        }

    }

    @Override

    public Runnable poll(long timeout, TimeUnit unit) throws InterruptedException {

        pullbackIfEmpty();

        while (true) {

            RecordTimeTask task = (RecordTimeTask) super.poll(timeout, unit);

            if (task == null) {

                return null;

            }

            // 请求在队列中长时间等待，移入慢队列中

            if (System.currentTimeMillis() - task.getPutQueueTime() > this.timeout) {

                this.slowQueue.offer(task);

                continue;

            }

            return task;

        }

    }

    @Override

    public Runnable take() throws InterruptedException {

        pullbackIfEmpty();

        while (true) {

            RecordTimeTask task = (RecordTimeTask) super.take();

            // 请求在队列中长时间等待，移入慢队列中

            if (System.currentTimeMillis() - task.getPutQueueTime() > this.timeout) {

                this.slowQueue.offer(task);

                continue;

            }

            return task;

        }

    }

}

逻辑其实挺简单的，如下：

当任务入队列时，包装一下任务，记录一下入队列的时间。
然后线程从队列中取出任务时，若发现任务等待时间过长，就将其移入慢队列。
而pullbackIfEmpty的逻辑，就是当队列为空时，再将慢队列中的任务移回来执行。

为了将请求的队列延迟记录在access.log中，我又修改了一下Task，并加了一个Filter，如下：

使用ThreadLocal将队列延迟先存起来

@Getter

public class RecordTimeTask implements Runnable {

    private static final ThreadLocal<Long> WAIT_IN_QUEUE_TIME = new ThreadLocal<>();

    private Runnable run;

    private long createTime;

    private long putQueueTime;

    public RecordTimeTask(Runnable run){

        this.run = run;

        this.createTime = System.currentTimeMillis();

        this.putQueueTime = this.createTime;

    }

    @Override

    public void run() {

        try {

            WAIT_IN_QUEUE_TIME.set(System.currentTimeMillis() - this.createTime);

            run.run();

        } finally {

            WAIT_IN_QUEUE_TIME.remove();

        }

    }

    public void resetPutQueueTime() {

        this.putQueueTime = System.currentTimeMillis();

    }

    public long getPutQueueTime() {

        return this.putQueueTime;

    }

    public static long getWaitInQueueTime(){

        return ObjectUtils.defaultIfNull(WAIT_IN_QUEUE_TIME.get(), 0L);

    }

}

再在Filter中将队列延迟取出来，放入Request对象中

@WebFilter

@Component

public class WaitInQueueTimeFilter extends HttpFilter {

    @Override

    public void doFilter(HttpServletRequest request, HttpServletResponse response, FilterChain chain) throws

                                                                                                      IOException,

                                                                                                      ServletException {

        long waitInQueueTime = RecordTimeTask.getWaitInQueueTime();

        // 将等待时间设置到request的attribute中，给access.log使用

        request.setAttribute("waitInQueueTime", waitInQueueTime);

        // 如果请求在队列中等待了太长时间，客户端大概率已超时，就没有必要再执行了

        if (waitInQueueTime > 5000) {

            response.sendError(503, "service is busy");

            return;

        }

        chain.doFilter(request, response);

    }

}

然后在access.log中配置队列延迟

server:

  tomcat:

    accesslog:

      enabled: true

      directory: /home/work/logs/applogs/java-demo

      file-date-format: .yyyy-MM-dd

      pattern: '%h %l %u %t "%r" %s %b %Dms %{waitInQueueTime}rms "%{Referer}i" "%{User-Agent}i" "%{X-Forwarded-For}i"'

注意，在access.log中配置%{xxx}r表示取请求xxx属性的值，所以，%{waitInQueueTime}r就是队列延迟，后面的ms是毫秒单位。

优化效果

我使用接口压测工具wrk压了一个测试接口，此接口执行时间100ms，使用1000个并发去压，1s的超时时间，如下：

wrk -d 10d -T1s --latency http://localhost:8080/sleep -c 1000

然后，用arthas看一下线程池的队列长度，如下：

[arthas@619]$ vmtool --action getInstances \

    --classLoaderClass org.springframework.boot.loader.LaunchedURLClassLoader \

    --className org.apache.tomcat.util.threads.ThreadPoolExecutor \

    --express 'instances.{ #{"ActiveCount":getActiveCount(),"CorePoolSize":getCorePoolSize(),"MaximumPoolSize":getMaximumPoolSize(),"QueueSize":getQueue().size()} }' \

    -x 2

可以看到，队列长度远小于1000，这说明队列中积压得不多。

再看看access.log，如下：

可以发现，虽然队列延迟任然存在，但被控制在了1s以内，这样这些请求就不会超时了，Tomcat的有效QPS保住了。

而最后面那些队列延迟极长的请求，则是被不公平对待的请求，但只能这么做，因为在请求量超出Tomcat处理能力时，只能牺牲掉它们，以保全大局。