Word counting over a batch of files in a given directory with Python: serial version

Straight to the code.

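Practice goals:

1. Use Python's object-oriented features to encapsulate the logic;

2. Use exception handling and the logging API;

3. Use the file and directory I/O APIs;

4. Use three data structures: list, dict (map), and tuple;

5. Use lambda, regular expressions, and a few other features.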

The next post will implement a concurrent version.

#-------------------------------------------------------------------------------
# Name:        wordstat_serial.py
# Purpose:     count words in the Java files of a given directory, serially
#
# Author:      qin.shuq
#
# Created:     08/10/2014
# Copyright:   (c) qin.shuq 2014
# Licence:     <your licence>
#-------------------------------------------------------------------------------

import re
import os
import time
import logging

LOG_LEVELS = {
    'DEBUG': logging.DEBUG, 'INFO': logging.INFO,
    'WARN': logging.WARNING, 'ERROR': logging.ERROR,
    'CRITICAL': logging.CRITICAL
}

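def initlog(name, filename):
    # Use a named logger: logging.getLogger() without a name always returns
    # the root logger, so both log files would receive every message.
    logger = logging.getLogger(name)
    hdlr = logging.FileHandler(filename)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    hdlr.setFormatter(formatter)
    logger.addHandler(hdlr)
    logger.setLevel(LOG_LEVELS['INFO'])
    return logger

# The logger names are arbitrary; they only need to differ from each other.
errlog = initlog("wordstat.error", "error.log")
infolog = initlog("wordstat.info", "info.log")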

class WordReading(object):

    def __init__(self, fileList):
        self.fileList = fileList

    def readFileInternal(self, filename):
        lines = []
        try:
            # The with statement closes the file even if reading fails.
            with open(filename, 'r') as f:
                lines = f.readlines()
            infolog.info('[successful read file %s]' % filename)
        except IOError as err:
            # IOError covers more than a missing file (e.g. permissions),
            # so log the actual error instead of assuming "not found".
            errlog.error('failed to read file %s: %s' % (filename, err))
        return lines

    def readFile(self):
        # Collect the lines of every file into one flat list.
        allLines = []
        for filename in self.fileList:
            allLines.extend(self.readFileInternal(filename))
        return allLines

class WordAnalyzing(object):
    '''
    Counts how many times each word occurs; analyze() returns
    a dict mapping word -> count.
    '''

    wordRegex = re.compile(r"\w+")

    def __init__(self, allLines):
        self.allLines = allLines

    def analyze(self):
        result = {}
        # Match line by line: joining all lines first could glue the last
        # word of one file to the first word of the next when a file does
        # not end with a newline.
        for line in self.allLines:
            for word in WordAnalyzing.wordRegex.findall(line):
                result[word] = result.get(word, 0) + 1
        return result

class FileObtainer(object):

    def __init__(self, dirpath, fileFilterFunc=None):
        self.dirpath = dirpath
        self.fileFilterFunc = fileFilterFunc

    def findAllFilesInDir(self):
        files = []
        for path, dirs, filenames in os.walk(self.dirpath):
            for filename in filenames:
                # os.path.join uses the separator of the current platform.
                files.append(os.path.join(path, filename))
        if self.fileFilterFunc is None:
            return files
        # A list comprehension returns a list under both Python 2 and 3.
        return [f for f in files if self.fileFilterFunc(f)]

class PostProcessing(object):

    def __init__(self, resultMap):
        self.resultMap = resultMap

    def sortByValue(self):
        # Returns a list of (word, count) tuples, highest count first.
        return sorted(self.resultMap.items(), key=lambda e: e[1], reverse=True)

    def obtainTopN(self, topN):
        sortedResult = self.sortByValue()
        # Never ask for more entries than there are distinct words.
        topN = min(topN, len(sortedResult))
        for i in range(topN):
            topi = sortedResult[i]
            # A single formatted string prints the same under Python 2 and 3.
            print('%s counts: %d' % (topi[0], topi[1]))

if __name__ == "__main__":
    dirpath = "c:\\Users\\qin.shuq\\Desktop\\region_master\\src"

    # Step 1: collect all .java files under dirpath.
    starttime = time.time()
    fileObtainer = FileObtainer(dirpath, lambda f: f.endswith('.java'))
    fileList = fileObtainer.findAllFilesInDir()
    endtime = time.time()
    print('ObtainFile cost: %s ms' % ((endtime - starttime) * 1000))

    # Step 2: read every line of every file.
    starttime = time.time()
    wr = WordReading(fileList)
    allLines = wr.readFile()
    endtime = time.time()
    print('WordReading cost: %s ms' % ((endtime - starttime) * 1000))

    # Step 3: count the occurrences of each word.
    starttime = time.time()
    wa = WordAnalyzing(allLines)
    resultMap = wa.analyze()
    endtime = time.time()
    print('WordAnalyzing cost: %s ms' % ((endtime - starttime) * 1000))

    # Step 4: sort by count and print the 30 most frequent words.
    starttime = time.time()
    postproc = PostProcessing(resultMap)
    postproc.obtainTopN(30)
    endtime = time.time()
    print('PostProcessing cost: %s ms' % ((endtime - starttime) * 1000))
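
For comparison, the same pipeline (walk the directory, read the files, count the words) can be written much more compactly with `collections.Counter` from the standard library. This is only a sketch, not the post's implementation: the names `count_words` and `demo` are made up for illustration, the code is written for Python 3, and it assumes the files can be decoded with the platform's default encoding.

```python
import os
import re
import shutil
import tempfile
from collections import Counter

WORD_REGEX = re.compile(r"\w+")


def count_words(dirpath, suffix=".java"):
    """Count word occurrences in all files under dirpath ending with suffix.

    Hypothetical helper for illustration; unreadable files are skipped.
    """
    counts = Counter()
    for path, _dirs, filenames in os.walk(dirpath):
        for filename in filenames:
            if not filename.endswith(suffix):
                continue
            try:
                with open(os.path.join(path, filename), "r") as f:
                    for line in f:
                        # Counter.update adds 1 for every word in the list.
                        counts.update(WORD_REGEX.findall(line))
            except IOError:
                continue
    return counts


def demo():
    """Build a throwaway directory with one small file and count its words."""
    tmpdir = tempfile.mkdtemp()
    try:
        with open(os.path.join(tmpdir, "A.java"), "w") as f:
            f.write("int a = 1;\nint b = a;\n")
        return count_words(tmpdir)
    finally:
        shutil.rmtree(tmpdir)
```

`Counter.most_common(n)` returns the `n` most frequent `(word, count)` tuples, so it covers what `sortByValue` and `obtainTopN` do by hand. Counting line by line also avoids holding every line of every file in memory at once.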

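The main block repeats the same start/stop timing code for each of the four steps. One way to remove that repetition is a small context manager. The sketch below is an assumption about how it could look, not part of the original script; `timed` and `demo` are hypothetical names.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took, in milliseconds.

    If a dict is passed as results, the elapsed time is also stored
    there under label.
    """
    start = time.time()
    try:
        yield
    finally:
        elapsed_ms = (time.time() - start) * 1000
        if results is not None:
            results[label] = elapsed_ms
        print("%s cost: %.3f ms" % (label, elapsed_ms))


def demo():
    """Time a short sleep and return the recorded timings."""
    timings = {}
    with timed("sleep", timings):
        time.sleep(0.01)
    return timings
```

With such a helper, each step of the main block would shrink to a `with timed('WordReading'):` line followed by the two statements it measures.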