As Ed Essey explained in Partitioning in PLINQ, partitioning is an important step in PLINQ execution. Partitioning splits up a single input sequence into multiple sequences that can be processed in parallel. This post further explains chunk partitioning, the most general partitioning scheme that works on any IEnumerable<T>.

Chunk partitioning appears in two places in Parallel Extensions. First, it is one of the algorithms that PLINQ uses under the hood to execute queries in parallel. Second, chunk partitioning is available as a standalone algorithm through the Partitioner.Create() method.

To explain the design of the chunk partitioning algorithm, let’s walk through the possible ways of processing an IEnumerable<T> with multiple worker threads, finally arriving at the solution used in PLINQ (approach 4).

Approach 1: Load the input sequence into an intermediate array

As a simple solution, we could walk over the input sequence and store all elements into an array. Then, we can split up the array into ranges, and assign each range to a different worker.
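The idea can be sketched as follows (Python used for brevity; the function name and the even-split policy are illustrative, not PLINQ's actual implementation):

```python
def partition_into_ranges(source, worker_count):
    """Approach 1 sketch: materialize the whole sequence,
    then split it into contiguous ranges, one per worker."""
    items = list(source)              # full copy: O(n) extra memory
    n = len(items)
    # Distribute elements as evenly as possible across workers.
    base, extra = divmod(n, worker_count)
    ranges, start = [], 0
    for i in range(worker_count):
        size = base + (1 if i < extra else 0)
        ranges.append(items[start:start + size])
        start += size
    return ranges
```

Note that `list(source)` must run to completion before any range can be handed out, which is exactly the memory and latency cost described above.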

The disadvantage of this approach is that we need to allocate an array large enough to store all input elements. If the input sequence is long, this algorithm leads to unnecessarily large memory consumption. Also, we need to wait until the entire input sequence is ready before the workers can start executing.

Approach 2: Hand out elements to threads on demand

An entirely different approach is to have all worker threads share one input enumerator. When a worker is ready to process the next input element, it takes a shared lock, gets the next element from the input enumerator, and releases the lock.
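A minimal sketch of this shared-enumerator scheme (illustrative names; Python's `iter`/`next` stand in for the .NET enumerator):

```python
import threading

class OnDemandPartitioner:
    """Approach 2 sketch: all workers share one iterator, and
    every single element is handed out under a lock."""

    def __init__(self, source):
        self._it = iter(source)
        self._lock = threading.Lock()

    def take_next(self):
        # One lock acquisition per element: correct, but costly.
        with self._lock:
            try:
                return next(self._it)
            except StopIteration:
                return None  # sentinel: input exhausted
```

Every call to `take_next()` pays the full synchronization cost, which is the per-element overhead criticized below.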

This algorithm has a fairly large overhead because processing every element requires locking. Also, handing out elements individually is prone to poor cache behavior.

This approach does have an interesting advantage over Approach 1, though: since workers receive data on demand, the workers that finish faster will come back to request more work. In contrast, Approach 1 splits up all work ahead of time, and a worker that is done early simply goes away.

Approach 3: Hand out elements in chunks

To mitigate the two drawbacks of Approach 2 (synchronization cost and cache behavior), we can hand out elements to threads in “chunks”. When a thread is ready to process more inputs, it will take, say, 64 elements from the input enumerator at once.

Unfortunately, while this approach nicely amortizes the synchronization cost over multiple elements, it does not work well for short inputs. For example, if the input contains 50 elements and the chunk size is 64, all inputs will go into a single partition. Even if the work per element is large, we will not be able to benefit from parallelism, since one worker gets all the work.

And since IEnumerable<T> in general does not declare its length, we cannot simply tune the chunk size based on the input sequence length.
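Fixed-size chunking, including the short-input behavior described above, can be sketched like this (the chunk size of 64 and the class name are illustrative):

```python
import threading
from itertools import islice

class FixedChunkPartitioner:
    """Approach 3 sketch: hand out fixed-size chunks
    from a shared iterator."""

    def __init__(self, source, chunk_size=64):
        self._it = iter(source)
        self._lock = threading.Lock()
        self._chunk_size = chunk_size

    def take_chunk(self):
        # One lock acquisition amortized over chunk_size elements.
        with self._lock:
            chunk = list(islice(self._it, self._chunk_size))
        return chunk  # an empty list means the input is exhausted
```

With a 50-element input and a chunk size of 64, the first `take_chunk()` call drains the entire input, so only one worker gets any work.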

Approach 4: Hand out elements in chunks of increasing size

A solution to the problem with small inputs is to use chunks of growing size. The first chunk assigned to each thread has size 1, and subsequent chunks are gradually larger, until a specific threshold is reached.

Our solution doubles the chunk size every few chunks. So, each thread first receives a few chunks of size 1, then a few chunks of size 2, then 4, and so forth. Once the chunk size reaches a certain threshold, it remains constant.
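The growing-chunk strategy can be sketched as follows. The specific constants here (doubling every 3 chunks, capping at 512) are illustrative assumptions; the actual PLINQ constants may differ:

```python
import threading
from itertools import islice

class GrowingChunkPartitioner:
    """Approach 4 sketch: each worker's chunks start at size 1
    and double every few chunks, up to a cap."""

    def __init__(self, source, chunks_per_size=3, max_chunk=512):
        self._it = iter(source)
        self._lock = threading.Lock()
        self._chunks_per_size = chunks_per_size
        self._max_chunk = max_chunk
        self._local = threading.local()  # per-worker chunk-size state

    def take_chunk(self):
        state = self._local
        if not hasattr(state, "size"):
            state.size, state.taken = 1, 0
        # The shared iterator is still protected by a lock,
        # but the cost is amortized over the whole chunk.
        with self._lock:
            chunk = list(islice(self._it, state.size))
        state.taken += 1
        if state.taken % self._chunks_per_size == 0 and state.size < self._max_chunk:
            state.size *= 2  # grow after every few chunks
        return chunk  # an empty list means the input is exhausted
```

A single worker thus sees chunks of sizes 1, 1, 1, 2, 2, 2, 4, 4, 4, and so on, so a short input is still spread across workers while a long input pays little per-chunk overhead.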

This chunking strategy ensures that if the input is short, it will still get split up fairly among the cores. But, the chunk size also grows fairly quickly, and the per-chunk overheads are small for large inputs. Also, the algorithm is quite good at load-balancing, so if one worker is taking longer to process its inputs, other workers will process more elements to decrease the overall processing time.

One interesting consequence of the chunk partitioning algorithm is that multiple threads will call MoveNext() on the input enumerator. The worker threads will use a lock to ensure mutual exclusion, but the enumerator must not assume that MoveNext() will be called from a particular thread (e.g., it should not use thread-local storage, manipulate UI, etc).

The current implementation of both PLINQ chunk partitioning and Partitioner.Create() follows Approach 4 fairly closely. Now you know how it behaves and why!

