Spark GraphX Programming Guide

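Guiding questions

1. In how many ways can GraphX build a graph from a collection of vertices and edges in an RDD or on disk?
2. What role does the PageRank algorithm play in a graph?
3. What is the triangle counting algorithm used for?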

Pregel API

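Graphs are inherently recursive data structures: the properties of a vertex depend on the properties of its neighbors, which in turn depend on the properties of their own neighbors. As a consequence, many important graph algorithms iteratively recompute the properties of each vertex until a fixed-point condition is reached. A range of graph-parallel abstractions have been proposed to express these iterative algorithms. GraphX exposes a Pregel-like operator that is a fusion of the widely used Pregel and GraphLab abstractions.

At a high level, the Pregel operator in GraphX is a bulk-synchronous parallel messaging abstraction constrained to the topology of the graph. The Pregel operator executes in a series of super steps in which vertices receive the sum of their inbound messages from the previous super step, compute a new value for the vertex property, and then send messages to neighboring vertices in the next super step. Unlike Pregel and more like GraphLab, messages are computed in parallel as a function of the edge triplet, and the message computation has access to both the source and destination vertex attributes. Vertices that do not receive a message are skipped within a super step. The Pregel operator terminates iteration and returns the final graph when there are no messages remaining.

Note that, unlike more standard Pregel implementations, vertices in GraphX can only send messages to neighboring vertices, and message construction is done in parallel using a user-defined messaging function. These constraints allow additional optimization within GraphX.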
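
The super step loop described above can be modeled in plain Scala without Spark. The following is only a sketch for building intuition, not GraphX code: `ModelEdge` and `pregelModel` are hypothetical names, vertex attributes and messages are fixed to `Double`, and messages are sent only along the out-edges of vertices that received a message in the previous super step.

```scala
// Plain-Scala model of the super step loop; no Spark or GraphX involved.
// ModelEdge and pregelModel are hypothetical names chosen for illustration.
final case class ModelEdge(src: Long, dst: Long, attr: Double)

def pregelModel(
    vertices: Map[Long, Double],
    edges: Seq[ModelEdge],
    initialMsg: Double,
    maxIter: Int)(
    vprog: (Long, Double, Double) => Double,
    sendMsg: (ModelEdge, Double, Double) => Iterator[(Long, Double)],
    mergeMsg: (Double, Double) => Double): Map[Long, Double] = {

  // Merge all messages addressed to the same vertex into one value.
  def collect(msgs: Seq[(Long, Double)]): Map[Long, Double] =
    msgs.groupBy(_._1).map { case (id, ms) => id -> ms.map(_._2).reduce(mergeMsg) }

  // Super step 0: every vertex receives the initial message.
  var state = vertices.map { case (id, attr) => id -> vprog(id, attr, initialMsg) }
  var messages = collect(edges.flatMap(e => sendMsg(e, state(e.src), state(e.dst))))
  var i = 0
  while (messages.nonEmpty && i < maxIter) {
    // Only vertices with an inbound message run the vertex program.
    state = state ++ messages.map { case (id, m) => id -> vprog(id, state(id), m) }
    // Only out-edges of vertices that just received a message may send.
    val active = messages.keySet
    messages = collect(
      edges.filter(e => active(e.src)).flatMap(e => sendMsg(e, state(e.src), state(e.dst))))
    i += 1
  }
  state
}

// Shortest distances from vertex 1 on a three-vertex graph.
val modelEdges = Seq(ModelEdge(1L, 2L, 1.0), ModelEdge(2L, 3L, 2.0), ModelEdge(1L, 3L, 5.0))
val start = Map(1L -> 0.0, 2L -> Double.PositiveInfinity, 3L -> Double.PositiveInfinity)

val dist = pregelModel(start, modelEdges, Double.PositiveInfinity, maxIter = 10)(
  (_, d, m) => math.min(d, m),
  (e, srcDist, dstDist) =>
    if (srcDist + e.attr < dstDist) Iterator((e.dst, srcDist + e.attr)) else Iterator.empty,
  (a, b) => math.min(a, b))
```

In this run vertex 3 first receives the direct distance 5.0 and is then improved to 3.0 via vertex 2 in the next super step, after which no messages remain and the loop stops.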

The following is the type signature of the Pregel operator (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.GraphOps) together with a sketch of its implementation (note that some calls to graph.cache have been removed):

class GraphOps[VD, ED] {
  def pregel[A]
      (initialMsg: A,
       maxIter: Int = Int.MaxValue,
       activeDir: EdgeDirection = EdgeDirection.Out)
      (vprog: (VertexId, VD, A) => VD,
       sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
       mergeMsg: (A, A) => A)
    : Graph[VD, ED] = {
    // Receive the initial message at each vertex
    var g = mapVertices( (vid, vdata) => vprog(vid, vdata, initialMsg) ).cache()
    // Compute the messages
    var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
    var activeMessages = messages.count()
    // Loop until no messages remain or maxIter is reached
    var i = 0
    while (activeMessages > 0 && i < maxIter) {
      // Receive the messages: ----------------------------------------------------------------
      // Run the vertex program on all vertices that receive messages
      val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
      // Merge the new vertex values back into the graph
      g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }.cache()
      // Send Messages: -----------------------------------------------------------------------
      // Vertices that didn't receive a message above don't appear in newVerts and therefore don't
      // get to send messages. More precisely the map phase of mapReduceTriplets is only invoked
      // on edges in the activeDir of vertices in newVerts
      messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache()
      activeMessages = messages.count()
      i += 1
    }
    g
  }
}

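Note that pregel takes two argument lists (graph.pregel(list1)(list2)). The first argument list contains the configuration parameters: the initial message, the maximum number of iterations, and the edge direction in which to send messages (by default along out-edges). The second argument list contains the user-defined functions for receiving messages (the vertex program vprog), computing messages (sendMsg), and combining messages (mergeMsg).

We can use the Pregel operator to express a computation such as single source shortest path: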

import org.apache.spark.graphx._
// Import random graph generation library
import org.apache.spark.graphx.util.GraphGenerators
// A graph with edge attributes containing distances
// (logNormalGraph produces Long vertex attributes, hence Graph[Long, Double])
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42 // The ultimate source
// Initialize the graph such that all vertices except the root have distance infinity.
val initialGraph = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist), // Vertex Program
  triplet => {  // Send Message
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b) // Merge Message
)
println(sssp.vertices.collect.mkString("\n"))

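Graph Builders

GraphX provides several ways of building a graph from a collection of vertices and edges in an RDD or on disk. None of the graph builders repartitions the graph's edges by default; instead, edges are left in their default partitions (such as their original blocks in HDFS). Graph.groupEdges requires the graph to be repartitioned because it assumes that identical edges are colocated on the same partition, so you must call Graph.partitionBy before calling groupEdges.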

object GraphLoader {
  def edgeListFile(
      sc: SparkContext,
      path: String,
      canonicalOrientation: Boolean = false,
      minEdgePartitions: Int = 1)
    : Graph[Int, Int]
}

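GraphLoader.edgeListFile provides a way to load a graph from a list of edges on disk. It parses an adjacency list of (source vertex ID, destination vertex ID) pairs of the following form, skipping comment lines that begin with #: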

# This is a comment
2 1
4 1
1 2

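It creates a graph from the specified edges, automatically creating any vertices mentioned by the edges. All vertex and edge attributes default to 1. The canonicalOrientation argument allows reorienting edges in the positive direction (srcId < dstId), which is required by the connected components algorithm. The minEdgePartitions argument specifies the minimum number of edge partitions to generate; there may be more edge partitions than specified if, for example, the HDFS file has more blocks.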
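
The parsing rules above can be sketched in plain Scala. This is an illustration of the described behavior only, not GraphLoader's actual code; `parseEdgeList` is a hypothetical helper that works on in-memory lines instead of a file.

```scala
// Plain-Scala sketch of the edge list parsing rules; not GraphLoader's code.
// parseEdgeList is a hypothetical helper name.
def parseEdgeList(lines: Seq[String], canonicalOrientation: Boolean = false): Seq[(Long, Long)] =
  lines
    .map(_.trim)
    .filter(line => line.nonEmpty && !line.startsWith("#")) // skip blank and comment lines
    .map { line =>
      val fields = line.split("\\s+")
      val (src, dst) = (fields(0).toLong, fields(1).toLong)
      // With canonicalOrientation, flip edges so that srcId < dstId.
      if (canonicalOrientation && src > dst) (dst, src) else (src, dst)
    }

val sampleLines = Seq("# This is a comment", "2 1", "4 1", "1 2")
val asWritten = parseEdgeList(sampleLines)
val canonical = parseEdgeList(sampleLines, canonicalOrientation = true)
// The vertex set is simply every ID mentioned by some edge.
val vertexIds = asWritten.flatMap { case (s, d) => Seq(s, d) }.distinct.sorted
```

With the sample input, `asWritten` keeps the edges (2,1), (4,1), (1,2) as given, while `canonical` becomes (1,2), (1,4), (1,2); the vertex IDs are 1, 2, and 4.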

object Graph {
  def apply[VD, ED](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      defaultVertexAttr: VD = null)
    : Graph[VD, ED]

  def fromEdges[VD, ED](
      edges: RDD[Edge[ED]],
      defaultValue: VD): Graph[VD, ED]

  def fromEdgeTuples[VD](
      rawEdges: RDD[(VertexId, VertexId)],
      defaultValue: VD,
      uniqueEdges: Option[PartitionStrategy] = None): Graph[VD, Int]
}

Graph.apply (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Graph$) allows creating a graph from RDDs of vertices and edges. Duplicate vertices are picked arbitrarily, and vertices found in the edge RDD but not in the vertex RDD are assigned the default attribute.

Graph.fromEdges allows creating a graph from only an RDD of edges. It automatically creates any vertices mentioned by the edges and assigns them the default value.

Graph.fromEdgeTuples allows creating a graph from only an RDD of edge tuples. The edges are assigned the value 1, and any vertices mentioned by the edges are created automatically and assigned the default value. It also supports deduplicating the edges: to deduplicate, pass Some of a PartitionStrategy as the uniqueEdges parameter (for example, uniqueEdges = Some(PartitionStrategy.RandomVertexCut)). A partition strategy is necessary to colocate identical edges on the same partition so that they can be deduplicated.
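
Why identical edges must share a partition before they can be merged is easiest to see in a small model. The following plain-Scala sketch does not use GraphX: `mergeWithinPartitions` and `partitionByEndpoints` are hypothetical helpers, the hash-based routing merely stands in for a partition strategy, and keeping a count per merged edge is an illustrative choice.

```scala
// Plain-Scala model of partition-local edge merging; nothing here is GraphX code.
type RawEdge = (Long, Long)

// Merge identical edges, but only within each partition, counting the copies.
def mergeWithinPartitions(parts: Seq[Seq[RawEdge]]): Seq[Seq[(RawEdge, Int)]] =
  parts.map(_.groupBy(identity).map { case (e, dups) => e -> dups.size }.toSeq)

// Hypothetical stand-in for a partition strategy: route each edge by a hash of
// both endpoints, so identical edges always land in the same partition.
def partitionByEndpoints(edges: Seq[RawEdge], numParts: Int): Seq[Seq[RawEdge]] = {
  val grouped = edges.groupBy(e => Math.floorMod(e.hashCode, numParts))
  (0 until numParts).map(p => grouped.getOrElse(p, Seq.empty))
}

val rawEdges: Seq[RawEdge] = Seq((1L, 2L), (2L, 3L), (1L, 2L))

// Arbitrary placement: the two copies of (1, 2) sit in different partitions,
// so a partition-local merge cannot see that they are duplicates.
val arbitrary: Seq[Seq[RawEdge]] = Seq(Seq((1L, 2L), (2L, 3L)), Seq((1L, 2L)))
val stillDuplicated = mergeWithinPartitions(arbitrary).flatten

// After routing by endpoints, the copies meet and are merged into one entry.
val deduplicated = mergeWithinPartitions(partitionByEndpoints(rawEdges, 2)).flatten
```

Here `stillDuplicated` still holds three entries, while `deduplicated` holds two, with the edge (1, 2) recorded once with a count of 2.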

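Vertex and Edge RDDs

GraphX exposes RDD views of the vertices and edges stored within the graph. However, because GraphX maintains the vertices and edges in optimized data structures, and these data structures provide additional functionality, the vertices and edges are returned as VertexRDD and EdgeRDD respectively. In this section we review some of the additional useful functionality of these types.

VertexRDDs

VertexRDD[A] extends RDD[(VertexId, A)] and adds the additional constraint that each VertexId occurs only once. Moreover, VertexRDD[A] represents a set of vertices, each with an attribute of type A. Internally, this is achieved by storing the vertex attributes in a reusable hash-map data structure. As a consequence, if two VertexRDDs are derived from the same base VertexRDD (for example, by filter or mapValues), they can be joined in constant time without hash evaluations. To leverage this indexed data structure, VertexRDD exposes the following additional functionality: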

class VertexRDD[VD] extends RDD[(VertexId, VD)] {
  // Filter the vertex set but preserve the internal index
  def filter(pred: Tuple2[VertexId, VD] => Boolean): VertexRDD[VD]
  // Transform the values without changing the ids (preserves the internal index)
  def mapValues[VD2](map: VD => VD2): VertexRDD[VD2]
  def mapValues[VD2](map: (VertexId, VD) => VD2): VertexRDD[VD2]
  // Remove vertices from this set that appear in the other set
  def diff(other: VertexRDD[VD]): VertexRDD[VD]
  // Join operators that take advantage of the internal indexing to accelerate joins (substantially)
  def leftJoin[VD2, VD3](other: RDD[(VertexId, VD2)])(f: (VertexId, VD, Option[VD2]) => VD3): VertexRDD[VD3]
  def innerJoin[U, VD2](other: RDD[(VertexId, U)])(f: (VertexId, VD, U) => VD2): VertexRDD[VD2]
  // Use the index on this RDD to accelerate a `reduceByKey` operation on the input RDD.
  def aggregateUsingIndex[VD2](other: RDD[(VertexId, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2]
}

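Notice, for example, how the filter operator returns a VertexRDD. Filter is actually implemented using a BitSet, thereby reusing the index and preserving the ability to do fast joins with other VertexRDDs. Likewise, the mapValues operators do not allow the map function to change the VertexId, so the same HashMap data structure can be reused. Both leftJoin and innerJoin are able to identify when they are joining two VertexRDDs derived from the same HashMap and implement the join by a linear scan rather than by costly point lookups.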
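
The idea of a shared index can be illustrated with a deliberately simplified plain-Scala model. It is not the actual VertexRDD implementation: `IndexedVertices` is a hypothetical class, a `Vector[Boolean]` stands in for the BitSet, and the join only handles the case where both sides share the same index object.

```scala
// Simplified plain-Scala model of index sharing; not the VertexRDD internals.
// IndexedVertices and its members are hypothetical names.
final case class IndexedVertices[A](
    index: Vector[Long],   // vertex IDs; shared between derived sets, never rebuilt
    values: Vector[A],     // values(i) belongs to index(i)
    mask: Vector[Boolean]  // which slots are present (stands in for a BitSet)
) {
  // Filtering only clears mask bits, so the index is preserved.
  def filter(pred: (Long, A) => Boolean): IndexedVertices[A] =
    copy(mask = index.indices.map(i => mask(i) && pred(index(i), values(i))).toVector)

  // Changing values cannot change IDs, so the index is preserved as well.
  def mapValues[B](f: A => B): IndexedVertices[B] =
    IndexedVertices(index, values.map(f), mask)

  // When both sides share the same index object, join slot by slot in a single
  // linear scan; no per-vertex hash lookup is needed.
  def innerJoin[B, C](other: IndexedVertices[B])(f: (Long, A, B) => C): Map[Long, C] = {
    require(index eq other.index, "this sketch only handles the shared-index case")
    index.indices.collect {
      case i if mask(i) && other.mask(i) => index(i) -> f(index(i), values(i), other.values(i))
    }.toMap
  }
}

val base = IndexedVertices(Vector(10L, 20L, 30L), Vector(1, 2, 3), Vector(true, true, true))
val odd = base.filter((_, v) => v % 2 == 1)  // keeps vertices 10 and 30
val scaled = base.mapValues(_ * 10.0)        // values 10.0, 20.0, 30.0
val joined = odd.innerJoin(scaled)((_, a, b) => a + b)
```

Because `odd` and `scaled` are both derived from `base`, they still point at the same index vector, which is what lets the join walk the slots in order.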

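The aggregateUsingIndex operator is useful for efficiently constructing a new VertexRDD from an RDD[(VertexId, A)]. Conceptually, if I have constructed a VertexRDD[B] over a set of vertices that is a superset of the vertices in some RDD[(VertexId, A)], then I can reuse its index to both aggregate and subsequently index the RDD[(VertexId, A)]. For example: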

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 100L).map(id => (id, 1)))
val rddB: RDD[(VertexId, Double)] = sc.parallelize(0L until 100L).flatMap(id => List((id, 1.0), (id, 2.0)))
// There should be 200 entries in rddB
rddB.count
val setB: VertexRDD[Double] = setA.aggregateUsingIndex(rddB, _ + _)
// There should be 100 entries in setB
setB.count
// Joining A and B should now be fast!
val setC: VertexRDD[Double] = setA.innerJoin(setB)((id, a, b) => a + b)

EdgeRDDs

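EdgeRDD[ED], which extends RDD[Edge[ED]], organizes the edges in blocks partitioned using one of the partitioning strategies defined in PartitionStrategy. Within each partition, edge attributes and adjacency structure are stored separately, enabling maximum reuse when attribute values change.

EdgeRDD exposes three additional functions: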

// Transform the edge attributes while preserving the structure
def mapValues[ED2](f: Edge[ED] => ED2): EdgeRDD[ED2]
// Reverse the edges reusing both attributes and structure
def reverse: EdgeRDD[ED]
// Join two `EdgeRDD`s partitioned using the same partitioning strategy.
def innerJoin[ED2, ED3](other: EdgeRDD[ED2])(f: (VertexId, VertexId, ED, ED2) => ED3): EdgeRDD[ED3]

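In most applications we have found that operations on the EdgeRDD are accomplished through the graph operators or rely on operations defined in the base RDD class.

Graph Algorithms

GraphX includes a set of graph algorithms to simplify analytics tasks. The algorithms are contained in the org.apache.spark.graphx.lib package and can be accessed directly.

PageRank

PageRank measures the importance of each vertex in a graph, assuming that an edge from u to v represents an endorsement of v's importance by u. For example, a Twitter user who is followed by many others will be ranked highly. GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank object. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge. GraphOps allows calling these algorithms directly as methods on a graph.

GraphX also includes an example social network dataset on which we can run PageRank. A set of users is given in graphx/data/users.txt, and a set of relationships between users is given in graphx/data/followers.txt. We compute the PageRank of each user as follows: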

import org.apache.spark.graphx.GraphLoader

// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
// Join the ranks with the usernames
val users = sc.textFile("graphx/data/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
// Print the result
println(ranksByUsername.collect().mkString("\n"))
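
The difference between the static and dynamic variants can be sketched in plain Scala on a tiny in-memory graph. This is not GraphX's implementation: the update rule, the reset probability of 0.15, the initial rank of 1.0, and all function names are assumptions made for illustration.

```scala
// Plain-Scala sketch of static vs. dynamic PageRank; not GraphX's implementation.
// The update rule and the 0.15 reset probability are illustrative assumptions.
val resetProb = 0.15

def step(ranks: Map[Long, Double], edges: Seq[(Long, Long)]): Map[Long, Double] = {
  val outDegree = edges.groupBy(_._1).map { case (src, es) => src -> es.size }
  // Each vertex splits its current rank evenly among its out-edges.
  val contribs = edges
    .map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
    .groupBy(_._1)
    .map { case (dst, cs) => dst -> cs.map(_._2).sum }
  ranks.map { case (id, _) => id -> (resetProb + (1 - resetProb) * contribs.getOrElse(id, 0.0)) }
}

// Static: run a fixed number of iterations.
def staticPageRank(vertices: Set[Long], edges: Seq[(Long, Long)], numIter: Int): Map[Long, Double] =
  (1 to numIter).foldLeft(vertices.map(_ -> 1.0).toMap)((ranks, _) => step(ranks, edges))

// Dynamic: iterate until no rank changes by more than `tol`.
def dynamicPageRank(vertices: Set[Long], edges: Seq[(Long, Long)], tol: Double): Map[Long, Double] = {
  var ranks = vertices.map(_ -> 1.0).toMap
  var delta = Double.MaxValue
  while (delta > tol) {
    val next = step(ranks, edges)
    delta = next.map { case (id, r) => math.abs(r - ranks(id)) }.max
    ranks = next
  }
  ranks
}

// Users 2 and 3 follow user 1; user 1 follows user 2.
val followEdges = Seq((2L, 1L), (3L, 1L), (1L, 2L))
val userIds = Set(1L, 2L, 3L)
val fixedRanks = staticPageRank(userIds, followEdges, numIter = 50)
val convergedRanks = dynamicPageRank(userIds, followEdges, tol = 0.0001)
```

In this toy graph user 1, who is followed by both other users, ends up with the highest rank, and user 3, whom nobody follows, keeps only the reset probability.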

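Connected Components

The connected components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex. For example, in a social network, connected components can approximate clusters. GraphX contains an implementation of the algorithm in the ConnectedComponents object, and we compute the connected components of the example social network dataset as follows: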

import org.apache.spark.graphx.GraphLoader

// Load the graph as in the PageRank example
val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
// Find the connected components
val cc = graph.connectedComponents().vertices
// Join the connected components with the usernames
val users = sc.textFile("graphx/data/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ccByUsername = users.join(cc).map {
  case (id, (username, cc)) => (username, cc)
}
// Print the result
println(ccByUsername.collect().mkString("\n"))
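
To make the labeling rule concrete, here is a plain-Scala sketch that propagates the smallest vertex ID through each component of a small in-memory graph. It only models the result described above and is not GraphX's implementation; `connectedComponentsModel` and the sample edges are hypothetical.

```scala
// Plain-Scala sketch of labeling components by their smallest vertex ID.
// This models the result only; it is not GraphX's implementation.
def connectedComponentsModel(vertices: Set[Long], edges: Seq[(Long, Long)]): Map[Long, Long] = {
  // Every vertex starts in its own component, labeled with its own ID.
  var labels = vertices.map(v => v -> v).toMap
  var changed = true
  while (changed) {
    changed = false
    // Edge direction is ignored: both endpoints adopt the smaller label.
    edges.foreach { case (a, b) =>
      val smallest = math.min(labels(a), labels(b))
      if (labels(a) != smallest || labels(b) != smallest) {
        labels = labels + (a -> smallest) + (b -> smallest)
        changed = true
      }
    }
  }
  labels
}

val componentOf = connectedComponentsModel(
  Set(1L, 2L, 3L, 4L, 6L, 7L),
  Seq((2L, 1L), (4L, 1L), (1L, 2L), (6L, 3L), (7L, 3L), (7L, 6L)))
```

Vertices 1, 2, and 4 end up labeled 1, and vertices 3, 6, and 7 end up labeled 3.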

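Triangle Counting

A vertex is part of a triangle when it has two adjacent vertices with an edge between them. GraphX implements a triangle counting algorithm in the TriangleCount object that determines the number of triangles passing through each vertex. We compute the triangle count of the social network dataset below. Note that TriangleCount requires the edges to be in canonical orientation (srcId < dstId) and the graph to be partitioned using Graph.partitionBy.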

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Load the edges in canonical order and partition the graph for triangle count
val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt", true)
  .partitionBy(PartitionStrategy.RandomVertexCut)
// Find the triangle count for each vertex
val triCounts = graph.triangleCount().vertices
// Join the triangle counts with the usernames
val users = sc.textFile("graphx/data/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val triCountByUsername = users.join(triCounts).map { case (id, (username, tc)) =>
  (username, tc)
}
// Print the result
println(triCountByUsername.collect().mkString("\n"))
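
The definition of a triangle given above can be checked on a small in-memory graph with a plain-Scala sketch based on neighbor-set intersection. It is not GraphX's implementation: `triangleCountModel` is a hypothetical name, and the sketch canonicalizes the edges itself instead of relying on canonical input and partitioning.

```scala
// Plain-Scala sketch of per-vertex triangle counting via neighbor-set
// intersection; not GraphX's implementation. Names are hypothetical.
def triangleCountModel(rawEdges: Seq[(Long, Long)]): Map[Long, Int] = {
  // Canonical orientation (src < dst) plus distinct removes self-loops,
  // duplicates, and reverse copies such as (2, 1) next to (1, 2).
  val edges = rawEdges
    .collect { case (a, b) if a != b => (math.min(a, b), math.max(a, b)) }
    .distinct
  val neighbors: Map[Long, Set[Long]] = edges
    .flatMap { case (a, b) => Seq(a -> b, b -> a) }
    .groupBy(_._1)
    .map { case (v, ns) => v -> ns.map(_._2).toSet }
  // For each edge, every common neighbor of its endpoints closes one triangle.
  // A vertex is credited once per incident edge of each of its triangles,
  // that is twice per triangle, so the tally is halved at the end.
  val credits = edges.flatMap { case (a, b) =>
    val common = (neighbors(a) intersect neighbors(b)).size
    Seq(a -> common, b -> common)
  }
  neighbors.keys.map { v =>
    v -> credits.filter(_._1 == v).map(_._2).sum / 2
  }.toMap
}

// One triangle (1, 2, 3), a dangling vertex 4, and a reverse copy of edge (1, 2).
val triangles = triangleCountModel(Seq((1L, 2L), (2L, 3L), (3L, 1L), (3L, 4L), (2L, 1L)))
```

Vertices 1, 2, and 3 each take part in one triangle, while vertex 4 takes part in none.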

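Examples

Suppose we want to build a graph from some text files, restrict the graph to important relationships and users, run PageRank on the subgraph, and finally return the attributes associated with the top users. This can be done as follows: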

import org.apache.spark.SparkContext
import org.apache.spark.graphx._

// Connect to the Spark cluster
val sc = new SparkContext("spark://master.amplab.org", "research")
// Load my user data and parse into tuples of user id and attribute list
val users = (sc.textFile("graphx/data/users.txt")
  .map(line => line.split(",")).map(parts => (parts.head.toLong, parts.tail)))
// Parse the edge data which is already in userId -> userId format
val followerGraph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
// Attach the user attributes
val graph = followerGraph.outerJoinVertices(users) {
  case (uid, deg, Some(attrList)) => attrList
  // Some users may not have attributes so we set them as empty
  case (uid, deg, None) => Array.empty[String]
}
// Restrict the graph to users with usernames and names
val subgraph = graph.subgraph(vpred = (vid, attr) => attr.size == 2)
// Compute the PageRank
val pagerankGraph = subgraph.pageRank(0.001)
// Get the attributes of the top pagerank users
val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices) {
  case (uid, attrList, Some(pr)) => (pr, attrList.toList)
  case (uid, attrList, None) => (0.0, attrList.toList)
}
println(userInfoWithPageRank.vertices.top(5)(Ordering.by(_._2._1)).mkString("\n"))

Related posts:

Spark Chinese Manual 1: Programming Guide
http://www.aboutyun.com/thread-11413-1-1.html

Spark Chinese Manual 2: A Quick Example in Spark
http://www.aboutyun.com/thread-11484-1-1.html

Spark Chinese Manual 3: Spark Basic Concepts
http://www.aboutyun.com/thread-11502-1-1.html

Spark Chinese Manual 4: Spark Basic Concepts (2)
http://www.aboutyun.com/thread-11516-1-1.html

Spark Chinese Manual 5: Spark Basic Concepts (3)
http://www.aboutyun.com/thread-11535-1-1.html

Spark Chinese Manual 6: Spark SQL from Beginner to Expert
http://www.aboutyun.com/thread-11562-1-1.html

Spark Chinese Manual 7: Spark SQL from Beginner to Expert (continued)
http://www.aboutyun.com/thread-11575-1-1.html

Spark Chinese Manual 8: Spark GraphX Programming Guide (1)
http://www.aboutyun.com/thread-11589-1-1.html

Spark Chinese Manual 10: Spark Deployment: Submitting Applications and Standalone Deploy Mode
http://www.aboutyun.com/thread-11615-1-1.html

Spark Chinese Manual 11: Spark Configuration Guide
http://www.aboutyun.com/thread-10652-1-1.html
