MapReduce C++ Library for single-machine, multicore applications

Distributed and scalable computing disciplines have recognized that immutable data, lock-free access, and isolated data processing are not only inevitable across a number of machines, but also have significant benefits for reliability and scalability. These benefits can be fostered in application design to improve reliability and take advantage of the increasingly multi-core processors that are available on end-user devices as well as server machines.

MapReduce is an architecture framework designed for creating and processing large volumes of data using clusters of computers. For background information on MapReduce see Software Scalability with MapReduce.

An important development since Google's original paper is the application of MapReduce to parallel processing within a single machine, on multi-core and multi-processor systems and on graphics processors (GPUs).

The scalability achieved using MapReduce to implement data processing across a large number of machines with low implementation costs motivates the design of this library. By taking the principles that have been proven in a distributed MapReduce system and applying them to a single-machine, multicore implementation, reliability and execution efficiency can be attained in a reusable framework. In scaling down the architecture from multi-machine to multi-CPU or multi-core, threads of execution become analogous to machines in a distributed environment as the unit of process execution.

The MapReduce C++ Library implements a single-machine platform for programming using the Google MapReduce idiom. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the Google paper.

  map (k1,v1) --> list(k2,v2)

  reduce (k2,list(v2)) --> list(v2)
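The two signatures above can be illustrated with a small in-memory sketch. It is not the library's API: the names map_document, reduce_counts and word_count are hypothetical, documents are plain strings rather than files, and a C++11 compiler is assumed.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this only illustrates the map/reduce model.
typedef std::pair<std::string, unsigned> intermediate_pair;

// map (k1,v1) --> list(k2,v2): split one document into (word, 1) pairs.
std::vector<intermediate_pair> map_document(std::string const &/*name*/,
                                            std::string const &text)
{
    std::vector<intermediate_pair> out;
    std::istringstream in(text);
    std::string word;
    while (in >> word)
        out.push_back(intermediate_pair(word, 1));
    return out;
}

// reduce (k2,list(v2)) --> list(v2): sum the counts collected for one word.
unsigned reduce_counts(std::string const &/*word*/,
                       std::vector<unsigned> const &counts)
{
    unsigned total = 0;
    for (unsigned c : counts)
        total += c;
    return total;
}

// Driver: run map over every document, group values by key, then reduce.
std::map<std::string, unsigned>
word_count(std::map<std::string, std::string> const &documents)
{
    std::map<std::string, std::vector<unsigned> > grouped;
    for (auto const &doc : documents)
        for (auto const &kv : map_document(doc.first, doc.second))
            grouped[kv.first].push_back(kv.second);

    std::map<std::string, unsigned> result;
    for (auto const &group : grouped)
        result[group.first] = reduce_counts(group.first, group.second);
    return result;
}
```

The grouping step in the middle of word_count stands in for the work done between the Map and Reduce phases, which in this sketch is just a std::map held in memory.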

Synopsis

namespace mapreduce {

template<typename MapTask,
         typename ReduceTask,
         typename Datasource=datasource::directory_iterator<MapTask>,
         typename Combiner=null_combiner,
         typename IntermediateStore=intermediates::local_disk<MapTask> >
class job;

} // namespace mapreduce

The developer is required to write two classes: MapTask, which implements a mapping function that processes key/value pairs to generate a set of intermediate key/value pairs, and ReduceTask, which implements a reduce function that merges all intermediate values associated with the same intermediate key.

In addition, there are three optional template parameters that can be used to modify the default implementation behavior: Datasource, which implements a mechanism to feed data to the Map Tasks on request of the MapReduce library; Combiner, which can be used to partially consolidate the results of the Map Task before they are passed to the Reduce Tasks; and IntermediateStore, which handles storage, merging and sorting of intermediate results between the Map and Reduce phases.

The MapTask class must define four data types: the key and value types for the inputs to the Map Tasks, and the intermediate key and value types.

class map_task
{
  public:
    typedef std::string   key_type;
    typedef std::ifstream value_type;
    typedef std::string   intermediate_key_type;
    typedef unsigned      intermediate_value_type;

    map_task(job::map_task_runner &runner);

    // value is taken by non-const reference so that a stream can be read from
    void operator()(key_type const &key, value_type &value);
};

The ReduceTask must define the key/value types for the results of the Reduce phase.

class reduce_task
{
  public:
    typedef std::string  key_type;
    typedef size_t       value_type;

    reduce_task(job::reduce_task_runner &runner);

    template<typename It>
    void operator()(typename map_task::intermediate_key_type const &key, It it, It ite);
};

Extensibility

The library is designed to be extensible and configurable through a policy-based mechanism. Default implementations are provided so that the library user can run MapReduce simply by implementing the core Map and Reduce tasks, but each default can be replaced to provide specific features.

Policy	Application	Supplied Implementation(s)
`Datasource`	`mapreduce::job` template parameter	`datasource::directory_iterator<MapTask>`
`Combiner`	`mapreduce::job` template parameter	`null_combiner`
`IntermediateStore`	`mapreduce::job` template parameter	`local_disk<MapTask, SortFn, MergeFn>`
`SortFn`	`local_disk` template parameter	`external_file_sort`
`MergeFn`	`local_disk` template parameter	`external_file_merge`
`SchedulePolicy`	`mapreduce::job::run()` template parameter	`cpu_parallel`, `sequential`
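The policy mechanism in the table can be sketched in miniature as follows. The names toy_store, ascending_sort and descending_sort are hypothetical and chosen only for illustration; the sketch shows the general technique of a defaulted policy template parameter, not how local_disk itself is written.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Hypothetical default policy: sort keys in ascending order.
struct ascending_sort
{
    void operator()(std::vector<std::string> &keys) const
    {
        std::sort(keys.begin(), keys.end());
    }
};

// Hypothetical replacement policy: sort keys in descending order.
struct descending_sort
{
    void operator()(std::vector<std::string> &keys) const
    {
        std::sort(keys.begin(), keys.end(), std::greater<std::string>());
    }
};

// The host class takes its sorting behavior from the policy. Callers who are
// happy with the default never have to name it.
template<typename SortFn = ascending_sort>
class toy_store
{
  public:
    void insert(std::string const &key) { keys_.push_back(key); }

    // Returns a sorted copy of the stored keys, ordered by the policy.
    std::vector<std::string> sorted() const
    {
        std::vector<std::string> copy(keys_);
        SortFn()(copy);
        return copy;
    }

  private:
    std::vector<std::string> keys_;
};
```

With this shape, toy_store<> uses the default and toy_store<descending_sort> swaps the behavior at compile time without any change to the host class.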

Datasource

This policy implements a data provider for Map Tasks. The default implementation iterates over a given directory and feeds each Map Task with a filename and a std::ifstream to the open file as a key/value pair.
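A pull-style data provider of this kind might look like the sketch below. The class name in_memory_datasource and its next() member are assumptions made for illustration and are not the library's Datasource interface; strings held in memory stand in for files, and a mutex is used only because it is the simplest way to make concurrent requests safe.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical data source: map workers call next() to claim one
// (key, value) item each; the mutex makes concurrent calls safe.
class in_memory_datasource
{
  public:
    typedef std::pair<std::string, std::string> item;

    explicit in_memory_datasource(std::vector<item> items)
      : items_(std::move(items)), next_(0)
    {
    }

    // Hands out the next unclaimed item; returns false once all are taken.
    bool next(std::string &key, std::string &value)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (next_ == items_.size())
            return false;
        key   = items_[next_].first;
        value = items_[next_].second;
        ++next_;
        return true;
    }

  private:
    std::vector<item> items_;
    std::size_t       next_;
    std::mutex        mutex_;
};
```

Because each item is handed out exactly once, the workers never share input data and need no further coordination while they process it.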

Combiner

A Combiner is an optimization technique, originally designed to reduce network traffic by applying a local reduction to intermediate key/value pairs in the Map phase before they are passed to the Reduce phase. The combiner is optional, and can actually degrade performance on a single-machine implementation due to the additional file sorting that is required. The default is therefore a null_combiner, which does nothing.
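To make the idea of a local reduction concrete, here is a minimal sketch for the word-count case. The function combine_local is a hypothetical name and does not reflect the library's Combiner interface; it only shows how repeated keys from one map task could be collapsed before the Reduce phase sees them.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, unsigned> kv;

// Collapse one map task's output so that each key appears once, carrying the
// sum of its values. The result comes back in key order because the
// intermediate container is a std::map.
std::vector<kv> combine_local(std::vector<kv> const &emitted)
{
    std::map<std::string, unsigned> sums;
    for (auto const &pair : emitted)
        sums[pair.first] += pair.second;
    return std::vector<kv>(sums.begin(), sums.end());
}
```

The saving is the difference between the number of emitted pairs and the number of distinct keys, which has to be weighed against the extra sorting work mentioned above.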

IntermediateStore

The policy class implements the behavior for storing, sorting and merging intermediate results between the Map and Reduce phases. The default implementation uses temporary files on the local filesystem.

SortFn

Used to sort external intermediate files. The current default implementation uses a system() call to shell out to the operating system SORT process. A Merge Sort implementation is currently in development.

MergeFn

Used to merge external intermediate files. The current default implementation uses a system() call to shell out to the operating system COPY process (Win32 only). A platform-independent in-process implementation is required.
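One possible building block for an in-process replacement is a k-way merge of already-sorted runs. The sketch below is not the library's implementation: merge_runs is a hypothetical name, and each run is held in memory as a vector where a real version would stream records from the intermediate files.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, unsigned> kv;
typedef std::vector<kv> sorted_run;   // stands in for one sorted intermediate file

// Merge already-sorted runs into a single sorted sequence (k-way merge).
std::vector<kv> merge_runs(std::vector<sorted_run> const &runs)
{
    // Heap entry: (next pair of a run, (run index, position in that run)).
    typedef std::pair<kv, std::pair<std::size_t, std::size_t> > entry;
    std::priority_queue<entry, std::vector<entry>, std::greater<entry> > heap;

    // Seed the heap with the first pair of every non-empty run.
    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty())
            heap.push(entry(runs[r][0], std::make_pair(r, std::size_t(0))));

    std::vector<kv> merged;
    while (!heap.empty())
    {
        entry smallest = heap.top();
        heap.pop();
        merged.push_back(smallest.first);

        // Refill the heap from the run that the smallest pair came from.
        std::size_t run  = smallest.second.first;
        std::size_t next = smallest.second.second + 1;
        if (next < runs[run].size())
            heap.push(entry(runs[run][next], std::make_pair(run, next)));
    }
    return merged;
}
```

Only one pair per run sits in the heap at any time, so the same structure would work with file readers in place of vectors.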

SchedulePolicy

This policy is the core of the scheduling algorithm and runs the Map and Reduce Tasks. Two schedule policies are supplied. The cpu_parallel policy uses the maximum available CPU cores to run as many simultaneous map tasks as possible (within a limit given in the mapreduce::specification object). The sequential policy runs one map task followed by one reduce task, which is useful for debugging purposes.
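The scheduling idea can be sketched with standard C++11 threads. This is an illustration under stated assumptions, not the library's scheduler: run_tasks is a hypothetical function, the cap is passed as a plain argument rather than read from a specification object, and tasks are identified by index.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run task(i) for every i in [0, task_count) on at most max_workers threads.
// max_workers == 0 means "use the concurrency reported by the runtime".
// Returns the number of threads actually used.
std::size_t run_tasks(std::size_t task_count,
                      std::size_t max_workers,
                      std::function<void(std::size_t)> const &task)
{
    std::size_t workers = max_workers;
    if (workers == 0)
        workers = std::thread::hardware_concurrency();
    if (workers == 0)
        workers = 1;                      // hardware_concurrency() may return 0
    workers = std::min(workers, task_count);

    std::atomic<std::size_t> next(0);     // index of the next unclaimed task
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w)
        pool.emplace_back([&]() {
            // Each worker keeps claiming task indices until none are left.
            for (std::size_t i = next.fetch_add(1); i < task_count; i = next.fetch_add(1))
                task(i);
        });

    for (std::thread &t : pool)
        t.join();
    return workers;
}
```

If every task writes only to its own slot of a pre-sized results vector, no locking is needed beyond the atomic counter. Passing a cap of 1 runs the tasks one after another on a single worker thread, which is convenient for debugging.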

Example - WordCount

Below is a simplified - but complete - WordCount example using the library. Error checking is removed and a simplified definition of a 'word' is used for brevity.

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <functional>
#include <iostream>
#include <locale>
#include <numeric>
#include <string>
#include <boost/cstdint.hpp>
#include <boost/noncopyable.hpp>
// The MapReduce library's own header must also be included.

namespace wordcount {

class map_task;
class reduce_task;

typedef
mapreduce::job<
  wordcount::map_task,
  wordcount::reduce_task>
job;

class map_task : boost::noncopyable
{
  public:
    typedef std::string   key_type;
    typedef std::ifstream value_type;
    typedef std::string   intermediate_key_type;
    typedef unsigned      intermediate_value_type;

    map_task(job::map_task_runner &runner)
      : runner_(runner)
    {
    }

    // 'value_type' is not a reference to const to enable streams to be passed
    //    key: input filename
    //    value: ifstream of the open file
    void operator()(key_type const &/*key*/, value_type &value)
    {
        // Loop on the extraction itself so that a failed read at end of file
        // does not emit an empty word.
        std::string word;
        while (value >> word)
        {
            // convert the word to lower case using the classic "C" locale
            std::transform(word.begin(), word.end(), word.begin(),
                           std::bind1st(
                               std::mem_fun(&std::ctype<char>::tolower),
                               &std::use_facet<std::ctype<char> >(std::locale::classic())));
            runner_.emit_intermediate(word, 1);
        }
    }

  private:
    job::map_task_runner &runner_;
};

class reduce_task : boost::noncopyable
{
  public:
    typedef std::string  key_type;
    typedef size_t       value_type;

    reduce_task(job::reduce_task_runner &runner)
      : runner_(runner)
    {
    }

    // sum the counts emitted for one word
    template<typename It>
    void operator()(typename map_task::intermediate_key_type const &key, It it, It ite)
    {
        reduce_task::value_type result = 0;
        for (; it!=ite; ++it)
           result += *it;
        runner_.emit(key, result);
    }

  private:
    job::reduce_task_runner &runner_;
};

}   // namespace wordcount

int main(int argc, char **argv)
{
    wordcount::job::datasource_type datasource;
    datasource.set_directory(argv[1]);

    mapreduce::specification  spec;
    mapreduce::results        result;
    wordcount::job            mr(datasource);

    mr.run<mapreduce::schedule_policy::cpu_parallel>(spec, result);

    // output the results
    std::cout << std::endl << "\n" << "MapReduce statistics:";
    std::cout << "\n  " << "MapReduce job runtime                     : " << result.job_runtime << " seconds, of which...";
    std::cout << "\n  " << "  Map phase runtime                       : " << result.map_runtime << " seconds";
    std::cout << "\n  " << "  Reduce phase runtime                    : " << result.reduce_runtime << " seconds";
    std::cout << "\n\n  " << "Map:";
    std::cout << "\n    " << "Total Map keys                          : " << result.counters.map_tasks;
    std::cout << "\n    " << "Map keys processed                      : " << result.counters.map_tasks_completed;
    std::cout << "\n    " << "Map key processing errors               : " << result.counters.map_tasks_error;
    std::cout << "\n    " << "Number of Map Tasks run (in parallel)   : " << result.counters.actual_map_tasks;
    std::cout << "\n    " << "Fastest Map key processed in            : " << *std::min_element(result.map_times.begin(), result.map_times.end()) << " seconds";
    std::cout << "\n    " << "Slowest Map key processed in            : " << *std::max_element(result.map_times.begin(), result.map_times.end()) << " seconds";
    std::cout << "\n    " << "Average time to process Map keys        : " << std::accumulate(result.map_times.begin(), result.map_times.end(), boost::int64_t()) / result.map_times.size() << " seconds";
    std::cout << "\n\n  " << "Reduce:";
    std::cout << "\n    " << "Number of Reduce Tasks run (in parallel): " << result.counters.actual_reduce_tasks;
    std::cout << "\n    " << "Number of Result Files                  : " << result.counters.num_result_files;
    std::cout << "\n    " << "Fastest Reduce key processed in         : " << *std::min_element(result.reduce_times.begin(), result.reduce_times.end()) << " seconds";
    std::cout << "\n    " << "Slowest Reduce key processed in         : " << *std::max_element(result.reduce_times.begin(), result.reduce_times.end()) << " seconds";
    // average over the number of reduce timings, not the number of map timings
    std::cout << "\n    " << "Average time to process Reduce keys     : " << std::accumulate(result.reduce_times.begin(), result.reduce_times.end(), boost::int64_t()) / result.reduce_times.size() << " seconds";

    return 0;
}

Performance

Here are some results from running the WordCount example on the Westbury Lab USENET corpus (2005), which contains 9.92 GB (10,659,287,688 bytes) of data in 23 files.

Sequential MapReduce

The sequential scheduling algorithm gives a baseline timing for the WordCount implementation, running a single Map task followed by a single Reduce task.

MapReduce Wordcount Application
16 CPU cores
class mapreduce::job<class wordcount::map_task,class wordcount::reduce_task,struct mapreduce::null_combiner,class mapreduce::datasource::directory_iterator<class wordcount::map_task>,class mapreduce::intermediates::local_disk<class wordcount::map_task,struct win32::external_file_sort,struct win32::external_file_merge> >

Running Sequential MapReduce...
Finished.

MapReduce statistics:
  MapReduce job runtime                     : 3105 seconds, of which...
    Map phase runtime                       : 699 seconds
    Reduce phase runtime                    : 2406 seconds

  Map:
    Total Map keys                          : 23
    Map keys processed                      : 23
    Map key processing errors               : 0
    Number of Map Tasks run (in parallel)   : 1
    Fastest Map key processed in            : 0 seconds
    Slowest Map key processed in            : 96 seconds
    Average time to process Map keys        : 30 seconds

  Reduce:
    Number of Reduce Tasks run (in parallel): 1
    Number of Result Files                  : 10
    Fastest Reduce key processed in         : 161 seconds
    Slowest Reduce key processed in         : 390 seconds
    Average time to process Reduce keys     : 104 seconds

CPU Parallel

Running on a 16 CPU-core Windows server, using all cores for Map tasks, produces the following results:

MapReduce Wordcount Application
16 CPU cores
class mapreduce::job<class wordcount::map_task,class wordcount::reduce_task,struct mapreduce::null_combiner,class mapreduce::datasource::directory_iterator<class wordcount::map_task>,class mapreduce::intermediates::local_disk<class wordcount::map_task,struct win32::external_file_sort,struct win32::external_file_merge> >

Running CPU Parallel MapReduce...
CPU Parallel MapReduce Finished.

MapReduce statistics:
  MapReduce job runtime                     : 1608 seconds, of which...
    Map phase runtime                       : 842 seconds
    Reduce phase runtime                    : 766 seconds

  Map:
    Total Map keys                          : 23
    Map keys processed                      : 23
    Map key processing errors               : 0
    Number of Map Tasks run (in parallel)   : 16
    Fastest Map key processed in            : 0 seconds
    Slowest Map key processed in            : 842 seconds
    Average time to process Map keys        : 433 seconds

  Reduce:
    Number of Reduce Tasks run (in parallel): 10
    Number of Result Files                  : 10
    Fastest Reduce key processed in         : 384 seconds
    Slowest Reduce key processed in         : 766 seconds
    Average time to process Reduce keys     : 261 seconds

Running on the same server, restricting the number of Map tasks to 8 yields:

MapReduce Wordcount Application
16 CPU cores
class mapreduce::job<class wordcount::map_task,class wordcount::reduce_task,struct mapreduce::null_combiner,class mapreduce::datasource::directory_iterator<class wordcount::map_task>,class mapreduce::intermediates::local_disk<class wordcount::map_task,struct win32::external_file_sort,struct win32::external_file_merge> >

Running CPU Parallel MapReduce...
CPU Parallel MapReduce Finished.

MapReduce statistics:
  MapReduce job runtime                     : 1743 seconds, of which...
    Map phase runtime                       : 950 seconds
    Reduce phase runtime                    : 793 seconds

  Map:
    Total Map keys                          : 23
    Map keys processed                      : 23
    Map key processing errors               : 0
    Number of Map Tasks run (in parallel)   : 8
    Fastest Map key processed in            : 0 seconds
    Slowest Map key processed in            : 934 seconds
    Average time to process Map keys        : 303 seconds

  Reduce:
    Number of Reduce Tasks run (in parallel): 10
    Number of Result Files                  : 10
    Fastest Reduce key processed in         : 396 seconds
    Slowest Reduce key processed in         : 793 seconds
    Average time to process Reduce keys     : 271 seconds
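The three listings can be compared by dividing the sequential timings by the parallel ones. Against the 3105-second baseline, the job runs roughly 1.9 times faster with 16 Map tasks and roughly 1.8 times faster with 8. The gain comes from the Reduce phase (2406 seconds down to 766 and 793), while the Map phase takes longer in both parallel runs (842 and 950 seconds against 699). The helper below is a small sketch of that arithmetic; phase_times, speedup and print_speedups are hypothetical names and not part of the library.

```cpp
#include <cstdio>

// Timings in seconds, to be filled in from the listings above.
struct phase_times
{
    double job;
    double map;
    double reduce;
};

// Speed-up relative to a baseline: values above 1 mean the run was faster,
// values below 1 mean it was slower.
double speedup(double baseline_seconds, double measured_seconds)
{
    return baseline_seconds / measured_seconds;
}

// Print the speed-up of one run against the baseline, phase by phase.
void print_speedups(char const *label, phase_times const &baseline, phase_times const &run)
{
    std::printf("%s: job %.2fx, map %.2fx, reduce %.2fx\n", label,
                speedup(baseline.job, run.job),
                speedup(baseline.map, run.map),
                speedup(baseline.reduce, run.reduce));
}
```

For example, a baseline of {3105, 699, 2406} can be compared with {1608, 842, 766} for the 16-task run and {1743, 950, 793} for the 8-task run.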
