Counting words across a batch of files in a given directory with Python: serial version

Straight to the code.

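Goals of this exercise:

1. Use Python's object-oriented features to encapsulate and express the logic;

2. Use exception handling and the logging API;

3. Use the file and directory read/write APIs;

4. Use three data structures: list, dict (map), and tuple;

5. Use lambda, regular expressions, and a few other things.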

The next post will implement a concurrent version.
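As a warm-up for goals 4 and 5, here is a minimal sketch of the counting idea on its own, written for Python 3 and independent of the script that follows. The names `count_words`, `top_n`, and `sample` are made up for illustration and do not appear in the original script.

```python
import re

# Runs of word characters (letters, digits, underscore) count as words.
WORD_RE = re.compile(r"\w+")

def count_words(lines):
    """Return a dict mapping each word to its number of occurrences."""
    counts = {}
    for line in lines:                        # lines is a list of strings
        for word in WORD_RE.findall(line):
            counts[word] = counts.get(word, 0) + 1
    return counts

def top_n(counts, n):
    """Return up to n (word, count) tuples, most frequent first."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Hypothetical sample input standing in for lines read from .java files.
sample = ["int a = 1;", "int b = a + 1;"]
print(top_n(count_words(sample), 2))
```

The list holds the input lines, the dict accumulates the counts, and each result entry is a `(word, count)` tuple that the lambda sorts by its second element.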

#-------------------------------------------------------------------------------
# Name:        wordstat_serial.py
# Purpose:     count words in the Java files under a given directory (serial)
#
# Author:      qin.shuq
#
# Created:     08/10/2014
# Copyright:   (c) qin.shuq 2014
# Licence:     <your licence>
#-------------------------------------------------------------------------------

import re
import os
import time
import logging

LOG_LEVELS = {
    'DEBUG': logging.DEBUG, 'INFO': logging.INFO,
    'WARN': logging.WARNING, 'ERROR': logging.ERROR,
    'CRITICAL': logging.CRITICAL
}

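def initlog(name, filename):
    # One named logger per log file. logging.getLogger() without a name always
    # returns the root logger, which would send every message to both files.
    logger = logging.getLogger(name)
    hdlr = logging.FileHandler(filename)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    hdlr.setFormatter(formatter)
    logger.addHandler(hdlr)
    logger.setLevel(LOG_LEVELS['INFO'])
    return logger

errlog = initlog("wordstat.error", "error.log")
infolog = initlog("wordstat.info", "info.log")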

class WordReading(object):
    '''Reads every file in fileList and collects all of their lines.'''

    def __init__(self, fileList):
        self.fileList = fileList

    def readFileInternal(self, filename):
        lines = []
        try:
            # 'with' closes the file even if readlines() raises
            with open(filename, 'r') as f:
                lines = f.readlines()
            infolog.info('[successful read file %s]' % filename)
        except IOError as err:
            # IOError covers more than a missing file, so log the actual cause
            errlog.error('failed to read file %s: %s' % (filename, err))
        return lines

    def readFile(self):
        allLines = []
        for filename in self.fileList:
            allLines.extend(self.readFileInternal(filename))
        return allLines

class WordAnalyzing(object):
    '''
    Counts how many times each word occurs in the given lines.
    analyze() returns a dict mapping word -> count.
    '''

    # raw string so that \w reaches the regex engine unchanged
    wordRegex = re.compile(r"\w+")

    def __init__(self, allLines):
        self.allLines = allLines

    def analyze(self):
        result = {}
        # join with '\n' so the last word of a file that lacks a trailing
        # newline cannot fuse with the first word of the next file
        lineContent = '\n'.join(self.allLines)
        for word in WordAnalyzing.wordRegex.findall(lineContent):
            result[word] = result.get(word, 0) + 1
        return result

class FileObtainer(object):
    '''Collects all files under dirpath, optionally filtered by fileFilterFunc.'''

    def __init__(self, dirpath, fileFilterFunc=None):
        self.dirpath = dirpath
        self.fileFilterFunc = fileFilterFunc

    def findAllFilesInDir(self):
        files = []
        for path, dirs, filenames in os.walk(self.dirpath):
            for filename in filenames:
                # os.path.join picks the right separator for the platform
                files.append(os.path.join(path, filename))
        if self.fileFilterFunc is None:
            return files
        # a list comprehension returns a list on both Python 2 and Python 3
        return [f for f in files if self.fileFilterFunc(f)]

class PostProcessing(object):
    '''Sorts the word counts and prints the most frequent words.'''

    def __init__(self, resultMap):
        self.resultMap = resultMap

    def sortByValue(self):
        # list of (word, count) tuples, highest count first
        return sorted(self.resultMap.items(), key=lambda e: e[1], reverse=True)

    def obtainTopN(self, topN):
        sortedResult = self.sortByValue()
        # slicing already stops at the end of the list when topN is too large
        for word, count in sortedResult[:topN]:
            print('%s counts: %d' % (word, count))

if __name__ == "__main__":
    dirpath = "c:\\Users\\qin.shuq\\Desktop\\region_master\\src"

    # Stage 1: collect all .java files under dirpath
    starttime = time.time()
    fileObtainer = FileObtainer(dirpath, lambda f: f.endswith('.java'))
    fileList = fileObtainer.findAllFilesInDir()
    endtime = time.time()
    print('ObtainFile cost: %.3f ms' % ((endtime - starttime) * 1000))

    # Stage 2: read every line of every file
    starttime = time.time()
    wr = WordReading(fileList)
    allLines = wr.readFile()
    endtime = time.time()
    print('WordReading cost: %.3f ms' % ((endtime - starttime) * 1000))

    # Stage 3: count the occurrences of each word
    starttime = time.time()
    wa = WordAnalyzing(allLines)
    resultMap = wa.analyze()
    endtime = time.time()
    print('WordAnalyzing cost: %.3f ms' % ((endtime - starttime) * 1000))

    # Stage 4: sort by count and print the 30 most frequent words
    starttime = time.time()
    postproc = PostProcessing(resultMap)
    postproc.obtainTopN(30)
    endtime = time.time()
    print('PostProcessing cost: %.3f ms' % ((endtime - starttime) * 1000))
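For comparison, the same serial pipeline can be sketched more compactly on Python 3 with `collections.Counter` from the standard library. This is an alternative under stated assumptions, not the script above: the helper names `iter_files` and `count_words_in_dir` are hypothetical, files are assumed to be UTF-8 (undecodable bytes are replaced), and unreadable files are skipped silently rather than logged.

```python
import os
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")

def iter_files(dirpath, suffix):
    """Yield paths under dirpath whose file names end with suffix."""
    for root, _dirs, filenames in os.walk(dirpath):
        for name in filenames:
            if name.endswith(suffix):
                yield os.path.join(root, name)

def count_words_in_dir(dirpath, suffix=".java"):
    """Count words across all matching files, reading one line at a time."""
    counts = Counter()
    for path in iter_files(dirpath, suffix):
        try:
            with open(path, "r", encoding="utf-8", errors="replace") as f:
                for line in f:
                    counts.update(WORD_RE.findall(line))
        except OSError:
            # Assumption: skip unreadable files; the full script logs them.
            continue
    return counts
```

Calling `count_words_in_dir(dirpath).most_common(30)` would give the 30 most frequent words as `(word, count)` tuples, roughly what `obtainTopN(30)` prints. Because it processes one line at a time, this variant never holds all lines of all files in memory at once.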
