Splitting an nginx access log by date, and optimizing performance

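First, the requirement: we have a fairly large nginx access log and want to split it by access date, saving the resulting files under /tmp.

The test machine is a Tencent Cloud instance with a single core and 1 GB of memory. The test log is 80 MB.

Version without multithreading: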

#!/usr/bin/env python3
# coding=utf-8
import re
import datetime

if __name__ == '__main__':
    # Matches the date part of a timestamp such as [27/Dec/2016:
    date_pattern = re.compile(r'\[(\d+)\/(\w+)\/(\d+):')
    with open('./access_all.log-20161227') as f:
        for line in f:
            day, mon, year = re.search(date_pattern, line).groups()
            # Convert the month abbreviation (e.g. Dec) to its number
            mon = datetime.datetime.strptime(mon, '%b').month
            log_file = '/tmp/%s-%s-%s' % (year, mon, day)
            # Append the line to that day's file (reopened for every line)
            with open(log_file, 'a+') as out:
                out.write(line)

Elapsed time:

[root@VM_255_164_centos data_parse]# time python3 log_cut.py
real    0m41.152s
user    0m32.578s
sys    0m6.046s
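To make the per-line work concrete, here is a minimal sketch of how one log line maps to its output path. The regex and the path format follow the script above; the sample line and the function name `target_path` are assumptions made up for illustration.

```python
import datetime
import re

DATE_PATTERN = re.compile(r'\[(\d+)/(\w+)/(\d+):')

def target_path(line, out_dir='/tmp'):
    """Return the per-day file path for one access-log line."""
    day, mon, year = DATE_PATTERN.search(line).groups()
    # strptime turns the month abbreviation into a number (Dec -> 12)
    month = datetime.datetime.strptime(mon, '%b').month
    return '%s/%s-%s-%s' % (out_dir, year, month, day)

# Hypothetical log line in the common nginx format
sample = '203.0.113.7 - - [27/Dec/2016:10:15:32 +0800] "GET / HTTP/1.1" 200 612'
print(target_path(sample))  # prints /tmp/2016-12-27
```

Note that the month is not zero-padded, so a line dated 05/Jan/2017 maps to a name like 2017-1-05.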

Multithreaded version:

#!/usr/bin/env python3
# coding=utf-8
import re
import datetime
import threading

date_pattern = re.compile(r'\[(\d+)\/(\w+)\/(\d+):')

def log_cut(line):
    day, mon, year = re.search(date_pattern, line).groups()
    mon = datetime.datetime.strptime(mon, '%b').month
    log_file = '/tmp/%s-%s-%s' % (year, mon, day)
    with open(log_file, 'a+') as f:
        f.write(line)

if __name__ == '__main__':
    with open('./access_all.log-20161227') as f:
        for line in f:
            # One new thread is started for every log line
            t = threading.Thread(target=log_cut, args=(line,))
            # Daemon threads are not waited for at exit, so a line whose
            # thread is still running when the main thread ends can be lost.
            t.daemon = True
            t.start()

Elapsed time:

# time python3 log_cut.py
real    1m35.905s
user    1m10.292s
sys    0m19.666s

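Surprisingly, the multithreaded version is much slower than the version without threads. For a CPU-bound task, context switching is clearly expensive. This script also starts a new thread for every log line, and CPython's global interpreter lock keeps threads from running Python code in parallel, so the threads add overhead without adding throughput.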
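One rough way to see how much a thread per call costs is to time a trivial function called directly and called through one new thread each time. This is only a sketch and not part of the original scripts: `work`, `run_direct` and `run_threaded` are made-up names, and the absolute numbers will differ from machine to machine.

```python
import threading
import time

def work(counter):
    # Trivial stand-in for the per-line work
    counter[0] += 1

def run_direct(n):
    """Call work() n times in the current thread; return (count, seconds)."""
    counter = [0]
    start = time.perf_counter()
    for _ in range(n):
        work(counter)
    return counter[0], time.perf_counter() - start

def run_threaded(n):
    """Call work() n times, each in its own thread; return (count, seconds)."""
    counter = [0]
    start = time.perf_counter()
    for _ in range(n):
        t = threading.Thread(target=work, args=(counter,))
        t.start()
        t.join()  # joined immediately, so the calls never overlap
    return counter[0], time.perf_counter() - start

if __name__ == '__main__':
    n = 2000
    print('direct:   %.4fs' % run_direct(n)[1])
    print('threaded: %.4fs' % run_threaded(n)[1])
```

Both functions do the same amount of useful work, so any difference between the two timings is thread start-up and switching overhead.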

Thread pool version:

Thread pool class:

#!/usr/bin/env python3
# coding=utf-8
import queue
import threading
import contextlib

# Sentinel object: a worker exits when it takes this from the queue
StopEvent = object()

class ThreadPool(object):
    def __init__(self, max_num, max_task_num=None):
        # Bounded task queue if max_task_num is given, unbounded otherwise
        if max_task_num:
            self.q = queue.Queue(max_task_num)
        else:
            self.q = queue.Queue()
        self.max_num = max_num    # maximum number of worker threads
        self.cancel = False       # set by close(): reject new tasks
        self.terminal = False     # set by terminate(): stop after the current task
        self.generate_list = []   # worker threads that are alive
        self.free_list = []       # worker threads waiting for a task

    def run(self, func, args, callback=None):
        # Queue one task; callback(success, result) is called when it finishes
        if self.cancel:
            return
        # Start another worker only if none is idle and the limit allows it
        if len(self.free_list) == 0 and len(self.generate_list) < self.max_num:
            self.generate_thread()
        w = (func, args, callback,)
        self.q.put(w)

    def generate_thread(self):
        t = threading.Thread(target=self.call)
        t.start()

    def call(self):
        # Worker loop: take tasks from the queue until StopEvent arrives
        current_thread = threading.current_thread()
        self.generate_list.append(current_thread)
        event = self.q.get()
        while event != StopEvent:
            func, arguments, callback = event
            try:
                result = func(*arguments)
                success = True
            except Exception:
                success = False
                result = None
            if callback is not None:
                try:
                    callback(success, result)
                except Exception:
                    pass
            # Count this thread as idle while it waits for the next task
            with self.worker_state(self.free_list, current_thread):
                if self.terminal:
                    event = StopEvent
                else:
                    event = self.q.get()
        else:
            self.generate_list.remove(current_thread)

    def close(self):
        # Stop accepting tasks; workers exit once the queued tasks are done
        self.cancel = True
        full_size = len(self.generate_list)
        while full_size:
            self.q.put(StopEvent)
            full_size -= 1

    def terminate(self):
        # Stop workers after their current task and discard queued tasks
        self.terminal = True
        while self.generate_list:
            self.q.put(StopEvent)
        self.q.queue.clear()

    @contextlib.contextmanager
    def worker_state(self, state_list, worker_thread):
        # Keep worker_thread in state_list for the duration of the with block
        state_list.append(worker_thread)
        try:
            yield
        finally:
            state_list.remove(worker_thread)

The class above is saved as threadingPool.py.

Script that uses it:

#!/usr/bin/env python3
# coding=utf-8
import re
import datetime
from threadingPool import ThreadPool

date_pattern = re.compile(r'\[(\d+)\/(\w+)\/(\d+)\:')

def log_cut(line):
    day, mon, year = date_pattern.search(line).groups()
    mon = datetime.datetime.strptime(mon, '%b').month
    log_file = '/tmp/%s-%s-%s' % (year, mon, day)
    with open(log_file, 'a+') as f:
        f.write(line)

def callback(status, result):
    # Nothing to do after each line
    pass

pool = ThreadPool(1)  # a single worker thread
with open('./access_all.log-20161227') as f:
    for line in f:
        pool.run(log_cut, (line,), callback)
pool.close()

Elapsed time:

# time python3 log_cut2.py
real    0m53.371s
user    0m44.761s
sys    0m5.600s

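The thread pool version is faster than the one-thread-per-line version, so the hand-written thread pool class does pay off: it reduces the time spent on context switching. It is still slower than the version without threads, though (about 53 s versus 41 s).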
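For comparison, the standard library's `concurrent.futures.ThreadPoolExecutor` also provides a fixed set of reusable worker threads. The sketch below swaps it in for the hand-written class; it is not the author's code and has not been timed on the test log. The names `split_with_executor` and `out_dir`, and the demo lines, are assumptions for illustration.

```python
import datetime
import os
import re
import tempfile
from concurrent.futures import ThreadPoolExecutor

DATE_PATTERN = re.compile(r'\[(\d+)/(\w+)/(\d+):')

def log_cut(line, out_dir):
    """Append one log line to the file for its date."""
    day, mon, year = DATE_PATTERN.search(line).groups()
    month = datetime.datetime.strptime(mon, '%b').month
    path = os.path.join(out_dir, '%s-%s-%s' % (year, month, day))
    with open(path, 'a') as out:
        out.write(line)

def split_with_executor(lines, out_dir, workers=1):
    """Run log_cut for every line on a pool of `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # list() drains the results so that worker exceptions surface here
        list(executor.map(lambda line: log_cut(line, out_dir), lines))

if __name__ == '__main__':
    demo_dir = tempfile.mkdtemp()
    demo_lines = [
        'a - - [27/Dec/2016:10:00:00 +0800] "GET / HTTP/1.1" 200 1\n',
        'b - - [28/Dec/2016:10:00:00 +0800] "GET / HTTP/1.1" 200 1\n',
    ]
    split_with_executor(demo_lines, demo_dir)
    print(sorted(os.listdir(demo_dir)))
```

Leaving the `with ThreadPoolExecutor(...)` block waits for all submitted work, so no separate close step is needed. Keep in mind that `executor.map` submits every line up front, which matters for memory on a large log.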

Process pool version:

#!/usr/bin/env python3
# coding=utf-8
import re
import datetime
from multiprocessing import Pool

date_pattern = re.compile(r'\[(\d+)\/(\w+)\/(\d+):')

def log_cut(line):
    day, mon, year = re.search(date_pattern, line).groups()
    mon = datetime.datetime.strptime(mon, '%b').month
    log_file = '/tmp/%s-%s-%s' % (year, mon, day)
    with open(log_file, 'a+') as f:
        f.write(line)

if __name__ == '__main__':
    pool = Pool(1)  # number of worker processes
    with open('./access_all.log-20161227') as f:
        for line in f:
            pool.apply_async(func=log_cut, args=(line,))
    pool.close()
    # Wait for the workers to finish the queued lines; without join() the
    # main process can exit while tasks are still pending.
    pool.join()

Elapsed time with one process:

# time python3 log_cut.py
real    0m28.392s
user    0m23.451s
sys    0m1.888s

Elapsed time with 2 processes:

# time python3 log_cut.py
real    0m40.920s
user    0m33.690s
sys    0m3.206s

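So with multiprocessing, a single-core CPU should get only one worker process. On a multi-core CPU several processes should be faster, but on a single core extra processes slow things down considerably.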
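Following that observation, the pool size can be derived from the number of cores rather than hard-coded. A minimal sketch, assuming that capping the worker count at `os.cpu_count()` is the policy you want; `worker_count` and `make_pool` are hypothetical helper names.

```python
import os
from multiprocessing import Pool

def worker_count(requested=None):
    """Return a pool size between 1 and the number of CPUs."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    if requested is None:
        return cores
    return max(1, min(requested, cores))

def make_pool(requested=None):
    """Create a Pool that never has more workers than cores."""
    return Pool(processes=worker_count(requested))

if __name__ == '__main__':
    print('cores: %d, workers for a request of 4: %d'
          % (os.cpu_count() or 1, worker_count(4)))
```

On the single-core test machine this would give `worker_count(2) == 1`, which matches the faster of the two runs above.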

Shell version:

#!/bin/bash
Usage(){
    echo "Usage: $0 Logfile"
}
# Require the log file as the first argument
if [ $# -eq 0 ] ;then
    Usage
    exit
else
    Log=$1
fi

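date_log=$(mktemp)
# Collect the distinct dates (e.g. 27/Dec/2016) that appear in the log
awk -F'[ :]' '{print $5}' "$Log" | awk -F'[' '{print $2}' | uniq > "$date_log"
mkdir -p /tmp/log
for i in $(cat "$date_log")
do
    # Assumed offsets for a dd/Mon/yyyy date: year, month, day
    grep "$i" "$Log" > "/tmp/log/${i:7:4}-${i:3:3}-${i:0:2}.access"
done
rm -f "$date_log"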

Elapsed time:

# time sh log_cut.sh access_all.log- 
real    0m2.435s
user    0m2.042s
sys    0m0.304s

The shell version performs remarkably well: it finishes in just over 2 seconds.
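One plausible reason for the gap (an assumption, not something measured here) is that every Python version above reopens an output file for each log line, while the shell script opens each output file once per date. A Python sketch that keeps one handle per date open might look like the following. It has not been timed on the test log, and `split_by_date` and the `MONTHS` table are illustrative names.

```python
import os
import re
import tempfile

DATE_PATTERN = re.compile(r'\[(\d+)/(\w+)/(\d+):')

# Month abbreviation -> month number, replacing the per-line strptime call
MONTHS = {name: number for number, name in enumerate(
    ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
     'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], start=1)}

def split_by_date(lines, out_dir):
    """Write each line to out_dir/<year>-<month>-<day>; return the file names."""
    handles = {}  # file name -> open file object, one per date
    try:
        for line in lines:
            match = DATE_PATTERN.search(line)
            if match is None:
                continue  # skip lines without a timestamp
            day, mon, year = match.groups()
            name = '%s-%s-%s' % (year, MONTHS[mon], day)
            out = handles.get(name)
            if out is None:
                # First line for this date: open the file once and reuse it
                out = handles[name] = open(os.path.join(out_dir, name), 'a')
            out.write(line)
    finally:
        for out in handles.values():
            out.close()
    return sorted(handles)

if __name__ == '__main__':
    demo_dir = tempfile.mkdtemp()
    demo = ['a - - [27/Dec/2016:10:00:00 +0800] "GET / HTTP/1.1" 200 1\n']
    print(split_by_date(demo, demo_dir))
```

With the real log this would be called as `split_by_date(open('./access_all.log-20161227'), '/tmp')`, ideally inside a `with` block for the input file.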
