A Parallel Summation Example in Python

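Let's start with an example. This code was written to evaluate a prediction model; the evaluation metric is described in detail at

https://www.kaggle.com/c/how-much-did-it-rain/details/evaluation

At its core, it requires computing the following score, as implemented by the code below:

$$C = \frac{1}{70N} \sum_{m=1}^{N} \sum_{n=0}^{69} \left(P_m(y \le n) - H(n - z_m)\right)^2$$

where N is the number of rows, z_m is the actual value for row m, P_m(y \le n) is the predicted probability that the value is at most n, and H(x) is 1 for x >= 0 and 0 otherwise.

In practice N is very large (1126694), so a direct single-process computation takes a long time (14 minutes 10 seconds).

So, borrowing the idea of MapReduce, I tried computing it with multiple processes: each process handles part of the N rows, and the partial results are added up at the end to compute C.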
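As a sanity check on the score itself, it can be computed on a tiny made-up data set. This is only a minimal sketch, not the script used for the timings: the function name `crps` and the two sample rows are assumptions for illustration, and it uses NumPy broadcasting where the script uses explicit loops.

```python
import numpy as np

def crps(pred, actual):
    """Mean squared gap between predicted CDFs and step functions.

    pred   -- array of shape (N, 70); pred[m, n] is P_m(y <= n) for n = 0..69
    actual -- array of shape (N,); actual[m] is the observed value z_m
    """
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    thresholds = np.arange(pred.shape[1])  # n = 0, 1, ..., 69
    # H(n - z_m) for every row m and threshold n, via broadcasting.
    step = (thresholds[None, :] >= actual[:, None]).astype(float)
    # pred.size equals 70 * N, the normalising factor of the score.
    return ((pred - step) ** 2).sum() / pred.size

# Two hypothetical rows: a perfect prediction for z = 3, and an
# all-zero prediction for z = 0 (every one of its 70 terms is off by 1).
perfect = (np.arange(70) >= 3).astype(float)
worst = np.zeros(70)
score = crps([perfect, worst], [3, 0])
print(score)
```

The perfect row contributes nothing and the all-zero row contributes 70, so the score over the two rows is 70 / 140 = 0.5. Lower is better.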

The full script is as follows:

import argparse
import csv
import logging
import multiprocessing
import sys

import numpy as np

# Configure logging: timestamped messages go to stderr.
logger = logging.getLogger("example")
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter(
    '%(asctime)s %(levelname)s %(name)s: %(message)s'))
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

def H(n, z):
    '''Heaviside step function: True (1) when n >= z, else False (0).'''
    return (n - z) >= 0

def evaluate(args, start, end):
    '''Return the squared-error sum over data rows [start, end).'''
    logger.info("Started %d to %d" % (start, end))
    expFile = open('train_exp.csv', 'r')
    predFile = open(args.predict)
    try:
        # Skip the header line plus the first `start` data rows in both files.
        for i in range(start + 1):
            expFile.readline()
            predFile.readline()
        predReader = csv.reader(predFile, delimiter=',')
        squareErrorSum = 0.0
        totalLines = end - start
        for i, row in enumerate(predReader):
            if i == totalLines:
                break
            expId, exp = expFile.readline().strip().split(',')
            exp = float(exp)
            row = np.array(row, dtype='float')
            # Column 0 is the Id; columns 1..70 hold P(y <= n) for n = 0..69.
            for j in range(1, 71):
                n = j - 1
                # (n >= exp) is H(n, exp) written inline.
                squareErrorSum += (row[j] - (n >= exp)) ** 2
    finally:
        expFile.close()
        predFile.close()
    logger.info("Completed %d to %d" % (start, end))
    logger.info('SquareErrorSum %d to %d: %f' % (start, end, squareErrorSum))
    return squareErrorSum

def fileCmp(args):
    '''Check that both files have the same number of data rows.'''
    with open('train_exp.csv') as f:
        expLines = sum(1 for line in f) - 1  # discard header
    with open(args.predict) as f:
        predictLines = sum(1 for line in f) - 1  # discard header
    print('Lines(exp, predict): %d %d' % (expLines, predictLines))
    assert expLines == predictLines
    # Stash the row count on the function object so the main block can read it.
    evaluate.Lines = expLines

if __name__ == "__main__":
    # Parse command-line arguments.
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('--predict',
                        help=("path to a prediction probability file, "
                              "e.g. predict_changeTimePeriod.csv"))
    args = parser.parse_args()
    fileCmp(args)
    # One worker process and one block of rows per CPU core.
    blocks = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=blocks)
    linesABlock = evaluate.Lines // blocks
    result = []
    for i in range(blocks - 1):
        result.append(pool.apply_async(
            evaluate, (args, i * linesABlock, (i + 1) * linesABlock)))
    # The last block also takes the rows left over by the integer division.
    result.append(pool.apply_async(
        evaluate, (args, (blocks - 1) * linesABlock, evaluate.Lines)))
    pool.close()
    pool.join()
    result = [res.get() for res in result]
    print(result)
    print('evaluate.Lines %d' % evaluate.Lines)
    score = sum(result) / (70 * evaluate.Lines)
    print('score: %s' % score)
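The block partitioning in the main section of the script can be isolated and checked on its own. The sketch below is an illustration under assumptions: `block_ranges` and `partial_sum` are hypothetical helper names, the data is made up, and the map step runs serially here, whereas the script hands each block to `pool.apply_async`.

```python
def block_ranges(total, blocks):
    """Split range(total) into `blocks` contiguous half-open (start, end) pairs.

    Every block gets total // blocks rows; the last block also absorbs
    the remainder, so the pairs cover [0, total) exactly once.
    """
    size = total // blocks
    ranges = [(i * size, (i + 1) * size) for i in range(blocks - 1)]
    ranges.append(((blocks - 1) * size, total))
    return ranges

def partial_sum(values, start, end):
    """Stand-in for evaluate(): reduce one block to a single number."""
    return sum(v * v for v in values[start:end])

values = list(range(10))  # hypothetical data
ranges = block_ranges(len(values), 4)
# Map: one partial result per block. Reduce: add the partial results.
partials = [partial_sum(values, s, e) for s, e in ranges]
total = sum(partials)
print(ranges)
print(total == partial_sum(values, 0, len(values)))
```

Because the blocks are disjoint and cover every row, adding the partial sums gives the same total as a single pass, which is why the multi-process result can match the single-process one.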

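Here the work is split into as many processes as there are CPU cores, in the hope of squeezing as much computing power out of the CPU as possible. CPU usage did in fact stay at 100% throughout the run.

Testing showed that the result matches the single-process one, while the running time dropped to 6 minutes 27 seconds, only about twice as fast.

The improvement was smaller than I had expected.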

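I also tried having each process load its own copy of the original file into memory with StringIO before processing it, but that did not speed things up any further. Combined with the 100% CPU usage, this suggests the bottleneck is computation rather than file I/O.

It seems compute-intensive work is still best written in C :)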

The C implementation is far faster than the Python one: a single thread gets it done in just 50 seconds. For details, see:

http://www.cnblogs.com/instant7/p/4313649.html
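For reference, the three timings quoted in this post can be turned into speedup factors with a few lines of arithmetic. The helper name `seconds` is just for illustration; the numbers are the ones reported above.

```python
def seconds(minutes, secs):
    """Convert a minutes/seconds timing to seconds."""
    return minutes * 60 + secs

single = seconds(14, 10)   # single-process Python
parallel = seconds(6, 27)  # multi-process Python
c_single = 50              # single-threaded C

speedup_parallel = single / float(parallel)
speedup_c = single / float(c_single)
print("multi-process Python: %.2fx" % speedup_parallel)
print("single-threaded C: %.1fx" % speedup_c)
```

That is roughly 2.2x for the multi-process Python version and 17x for the single-threaded C version, both relative to single-process Python.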
