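I recently read Google's Pregel paper. Graph algorithms have some classic, irreplaceable use cases, such as social networks and citation graphs. However, the amount of computation on a single vertex is often very small; the emphasis is on message propagation and logic handling rather than raw large-scale computation. Although more than a decade has passed, the mechanisms in the paper (message passing, combiners, aggregators, partitioning, the state machine, and so on) are still effective ways to design distributed graph algorithms. Here are a few notes on the design and implementation parts.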

Pregel: A System for Large-Scale Graph Processing

Key points:

running graph tasks on distributed computer clusters

graph algorithms

Introduction

Target:

Many practical computing problems concern large graphs. e.g. the Web graph and various social networks.

Why parallelizing graph algorithms is challenging:

Graph algorithms often exhibit poor locality of memory access,
very little work per vertex,
and a changing degree of parallelism over the course of execution.

We want a scalable, general-purpose system for graph processing. Unlike the existing options, it should be:

efficient
scalable
fault-tolerant

inspired by: Valiant’s Bulk Synchronous Parallel model

Model

data structure

input: a directed graph (both vertices and edges carry a value):

vertex: uniquely identified by a vertex identifier
edge: associated with its source vertex; holds a value and a target vertex identifier

process

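superstep: the unit of work; each vertex acts as a state machine (active / inactive), and supersteps are separated by a global synchronization point

vertices: first-class citizens

output: may differ from the input (the graph structure can change during the computation)

termination: the computation ends when every vertex has voted to halt and no messages are in transit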
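
The superstep loop and the vote-to-halt rule above can be sketched in a single process. This is only a minimal sketch, not Pregel's actual C++ API: the Vertex class, the run function, and the four-vertex ring are hypothetical names made up for illustration. The algorithm is the maximum-value propagation example from the paper.

```python
from collections import defaultdict


class Vertex:
    """A vertex with a mutable value, outgoing edges, and an active flag."""

    def __init__(self, vertex_id, value, out_edges):
        self.id = vertex_id
        self.value = value
        self.out_edges = out_edges  # list of target vertex ids
        self.active = True

    def compute(self, messages, superstep, send):
        """Keep the largest value seen so far and forward it when it changes."""
        new_value = max([self.value, *messages])
        if superstep == 0 or new_value > self.value:
            self.value = new_value
            for target in self.out_edges:
                send(target, self.value)
        self.active = False  # vote to halt; an incoming message reactivates us


def run(vertices):
    """Run supersteps until all vertices have halted and no message is pending."""
    inbox = defaultdict(list)
    superstep = 0
    while True:
        outbox = defaultdict(list)
        for v in vertices.values():
            messages = inbox.get(v.id, [])
            if messages:
                v.active = True  # a message wakes up a halted vertex
            if v.active:
                v.compute(messages, superstep, lambda t, m: outbox[t].append(m))
        superstep += 1  # the barrier: messages become visible only now
        if not outbox and not any(v.active for v in vertices.values()):
            return superstep
        inbox = outbox


# Hypothetical input: a ring 0 -> 1 -> 2 -> 3 -> 0 with values 3, 6, 2, 1.
ring = {i: Vertex(i, val, [(i + 1) % 4]) for i, val in enumerate([3, 6, 2, 1])}
run(ring)
print([v.value for v in ring.values()])  # prints [6, 6, 6, 6]
```

Messages sent in one superstep are delivered only in the next one, which is what makes the model synchronous.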

API

Compute()

GetValue()

MutableValue()

combiner: merges several messages bound for the same vertex into a single message to reduce messaging overhead

aggregator: global monitor (in my opinion, an aggregator is a global combiner plus a global coordinator, carrying more logic than a combiner)

aggregator: collects global information and performs a computation on it, not just a combination of data.

e.g. perform a particular operation when all the vertices meet a particular condition
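
To make the difference between the two concrete, here is a small sketch. The names combine_outbox and Aggregator are hypothetical and do not come from the paper's API. It assumes the combining function is commutative and associative, which the paper requires of combiners.

```python
def combine_outbox(raw_messages, combine):
    """Merge messages bound for the same target vertex into one message.

    raw_messages: iterable of (target_id, value) pairs produced on one worker.
    combine: a commutative, associative binary function such as min.
    """
    combined = {}
    for target, value in raw_messages:
        if target in combined:
            combined[target] = combine(combined[target], value)
        else:
            combined[target] = value
    return combined


class Aggregator:
    """Reduces one contribution per vertex into a single global value."""

    def __init__(self, reduce_fn, initial):
        self.reduce_fn = reduce_fn
        self.initial = initial
        self.value = initial

    def accumulate(self, contribution):
        self.value = self.reduce_fn(self.value, contribution)

    def finish_superstep(self):
        """Return the global result and reset for the next superstep."""
        result, self.value = self.value, self.initial
        return result


# Combiner: three messages shrink to two, one per target vertex.
print(combine_outbox([(1, 5), (2, 7), (1, 3)], min))  # prints {1: 3, 2: 7}

# Aggregator: "have all vertices converged?" as a global logical AND.
all_converged = Aggregator(lambda a, b: a and b, True)
for converged in (True, True, False):
    all_converged.accumulate(converged)
print(all_converged.finish_superstep())  # prints False
```

The combiner only shrinks traffic toward one vertex, while the aggregator produces a value that every vertex (or the master) can act on in the next superstep.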

Implementation

Architecture

partitioning: the default assignment is hash(ID) mod N

user-defined partition assignment: exploit locality
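
Both assignment strategies can be sketched as below. The function names are hypothetical, and the URL-style vertex ids are an assumption made for illustration (the paper mentions keeping pages of the same site together as a typical locality heuristic). CRC32 is used instead of Python's built-in hash() because string hashing in Python is randomized per process, while every machine must compute the same partition.

```python
import zlib


def default_partition(vertex_id, num_partitions):
    """hash(ID) mod N, with a hash that is stable across processes."""
    return zlib.crc32(str(vertex_id).encode("utf-8")) % num_partitions


def locality_partition(vertex_id, num_partitions):
    """Hypothetical custom assignment: pages of one site share a partition.

    Assumes vertex ids look like "example.com/some/page".
    """
    site = vertex_id.split("/", 1)[0]
    return default_partition(site, num_partitions)


for page in ("example.com/a", "example.com/b", "example.org/a"):
    print(page, default_partition(page, 8), locality_partition(page, 8))
```

With the locality version, the two example.com pages always land in the same partition, so messages along intra-site links stay on one machine.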

hierarchy (top to bottom):

user program (copies of the app) = 1 master + N-1 workers
physical machine
partition
vertices and their outgoing edges

Fault Tolerance

checkpoint

worker: saves its partition state to persistent storage

master: saves the aggregator values

failure detection: the master sends regular "ping" messages to the workers
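
A rough sketch of the two mechanisms, under stated assumptions: local files stand in for persistent storage, pickle stands in for the real serialization format, and all function names are hypothetical. The real system writes to a distributed file system.

```python
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(directory, superstep, partition_state):
    """Write one partition's state for the given superstep."""
    path = Path(directory) / f"superstep_{superstep:06d}.pkl"
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(pickle.dumps(partition_state))
    tmp.replace(path)  # rename last, so a crash does not leave a partial .pkl
    return path


def load_latest_checkpoint(directory):
    """Return (superstep, state) for the newest checkpoint, or None."""
    paths = sorted(Path(directory).glob("superstep_*.pkl"))
    if not paths:
        return None
    latest = paths[-1]
    superstep = int(latest.stem.split("_")[1])
    return superstep, pickle.loads(latest.read_bytes())


def failed_workers(last_ping, now, timeout):
    """Workers whose last ping reply is older than the timeout count as failed."""
    return sorted(w for w, t in last_ping.items() if now - t > timeout)


with tempfile.TemporaryDirectory() as d:
    save_checkpoint(d, 3, {"v1": 0.25})
    save_checkpoint(d, 6, {"v1": 0.40})
    print(load_latest_checkpoint(d))  # prints (6, {'v1': 0.4})

print(failed_workers({"w1": 100.0, "w2": 40.0}, now=105.0, timeout=30.0))  # prints ['w2']
```

When a worker is marked as failed, its partitions are reassigned and the computation restarts from the most recent checkpointed superstep.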

