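Spark GraphX Programming Guide
1. In how many ways can GraphX build a graph from a collection of vertices and edges held in an RDD or on disk?
2. What role does the PageRank algorithm play in a graph?
3. What is the triangle counting algorithm used for?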

Pregel API
    class GraphOps[VD, ED] {
      def pregel[A]
          (initialMsg: A,
           maxIter: Int = Int.MaxValue,
           activeDir: EdgeDirection = EdgeDirection.Out)
          (vprog: (VertexId, VD, A) => VD,
           sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
           mergeMsg: (A, A) => A)
        : Graph[VD, ED] = {
        // Receive the initial message at each vertex
        var g = mapVertices( (vid, vdata) => vprog(vid, vdata, initialMsg) ).cache()
        // Compute the messages
        var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
        var activeMessages = messages.count()
        // Loop until no messages remain or maxIter is reached
        var i = 0
        while (activeMessages > 0 && i < maxIter) {
          // Receive the messages:
          // Run the vertex program on all vertices that receive messages
          val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
          // Merge the new vertex values back into the graph
          g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }.cache()
          // Send messages:
          // Vertices that didn't receive a message above don't appear in newVerts and therefore don't
          // get to send messages. More precisely, the map phase of mapReduceTriplets is only invoked
          // on edges in the activeDir of vertices in newVerts
          messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache()
          activeMessages = messages.count()
          i += 1
        }
        g
      }
    }
    import org.apache.spark.graphx._
    // Import random graph generation library
    import org.apache.spark.graphx.util.GraphGenerators
    // A graph with edge attributes containing distances
    val graph: Graph[Int, Double] =
      GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
    val sourceId: VertexId = 42 // The ultimate source
    // Initialize the graph such that all vertices except the root have distance infinity.
    val initialGraph = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
    val sssp = initialGraph.pregel(Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist), // Vertex Program
      triplet => {  // Send Message
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        } else {
          Iterator.empty
        }
      },
      (a, b) => math.min(a, b) // Merge Message
    )
    println(sssp.vertices.collect.mkString("\n"))
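The same message-passing loop can be followed without a cluster. Below is a minimal sketch in plain Scala (no Spark dependency) of the three roles used in the shortest-path example: a vertex program that keeps the minimum, a send step that only fires on out-edges of vertices that just received a message, and a merge step that takes the minimum per destination. `PregelSsspSketch`, `Edge` and `shortestPaths` are hypothetical names introduced for illustration; this is not the GraphX implementation. It assumes non-negative edge weights and that every edge endpoint appears in `vertices`.

```scala
object PregelSsspSketch {
  // Hypothetical edge type for this sketch (not org.apache.spark.graphx.Edge)
  final case class Edge(src: Long, dst: Long, weight: Double)

  def shortestPaths(vertices: Set[Long], edges: Seq[Edge], source: Long): Map[Long, Double] = {
    // Initial state: distance 0 at the source, infinity everywhere else
    var dist: Map[Long, Double] =
      vertices.map(v => v -> (if (v == source) 0.0 else Double.PositiveInfinity)).toMap

    // Send + merge: only out-edges of active vertices may send,
    // and messages to the same destination are merged with min
    def send(d: Map[Long, Double], active: Set[Long]): Map[Long, Double] =
      edges
        .filter(e => active.contains(e.src) && d(e.src) + e.weight < d(e.dst))
        .groupBy(_.dst)
        .map { case (dst, es) => dst -> es.map(e => d(e.src) + e.weight).min }

    var messages = send(dist, vertices)
    while (messages.nonEmpty) {
      // Vertex program: keep the smaller of the old distance and the message
      dist = dist ++ messages.map { case (v, m) => v -> math.min(dist(v), m) }
      // Only vertices that just received a message are active in the next round
      messages = send(dist, messages.keySet)
    }
    dist
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0))
    println(shortestPaths(Set(1L, 2L, 3L, 4L), edges, source = 1L))
  }
}
```

The loop stops when a round produces no messages, which mirrors the `activeMessages > 0` condition in the Pregel listing; a vertex with no path from the source keeps distance infinity.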
Graph Builders
    object GraphLoader {
      def edgeListFile(
          sc: SparkContext,
          path: String,
          canonicalOrientation: Boolean = false,
          minEdgePartitions: Int = 1)
        : Graph[Int, Int]
    }
    # This is a comment
    2 1
    4 1
    1 2
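The sample above is the edge-list format read by `GraphLoader.edgeListFile`: lines starting with `#` are comments, and every other line holds a source id and a destination id separated by whitespace. A minimal plain-Scala sketch of that parsing follows. `EdgeListSketch` and `parseEdgeList` are hypothetical helpers written for illustration, not GraphX code, and the handling of `canonicalOrientation` (putting the smaller id first) is an assumption about what the flag means.

```scala
object EdgeListSketch {
  // Hypothetical helper: turns edge-list lines into (srcId, dstId) pairs
  def parseEdgeList(
      lines: Seq[String],
      canonicalOrientation: Boolean = false): Seq[(Long, Long)] =
    lines
      .map(_.trim)
      // Skip blank lines and comment lines
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map { line =>
        val fields = line.split("\\s+")
        require(fields.length >= 2, s"Invalid line: $line")
        val src = fields(0).toLong
        val dst = fields(1).toLong
        // Assumed meaning of canonicalOrientation: smaller id becomes the source
        if (canonicalOrientation && src > dst) (dst, src) else (src, dst)
      }

  def main(args: Array[String]): Unit = {
    val sample = Seq("# This is a comment", "2 1", "4 1", "1 2")
    println(parseEdgeList(sample))
    println(parseEdgeList(sample, canonicalOrientation = true))
  }
}
```

With the flag on, the lines `2 1` and `1 2` yield the same pair, so duplicates become visible and can be removed before the graph is built.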
    object Graph {
      def apply[VD, ED](
          vertices: RDD[(VertexId, VD)],
          edges: RDD[Edge[ED]],
          defaultVertexAttr: VD = null)
        : Graph[VD, ED]
      def fromEdges[VD, ED](
          edges: RDD[Edge[ED]],
          defaultValue: VD): Graph[VD, ED]
      def fromEdgeTuples[VD](
          rawEdges: RDD[(VertexId, VertexId)],
          defaultValue: VD,
          uniqueEdges: Option[PartitionStrategy] = None): Graph[VD, Int]
    }
Vertex and Edge RDDs
VertexRDDs
    class VertexRDD[VD] extends RDD[(VertexId, VD)] {
      // Filter the vertex set but preserve the internal index
      def filter(pred: Tuple2[VertexId, VD] => Boolean): VertexRDD[VD]
      // Transform the values without changing the ids (preserves the internal index)
      def mapValues[VD2](map: VD => VD2): VertexRDD[VD2]
      def mapValues[VD2](map: (VertexId, VD) => VD2): VertexRDD[VD2]
      // Remove vertices from this set that appear in the other set
      def diff(other: VertexRDD[VD]): VertexRDD[VD]
      // Join operators that take advantage of the internal indexing to accelerate joins (substantially)
      def leftJoin[VD2, VD3](other: RDD[(VertexId, VD2)])(f: (VertexId, VD, Option[VD2]) => VD3): VertexRDD[VD3]
      def innerJoin[U, VD2](other: RDD[(VertexId, U)])(f: (VertexId, VD, U) => VD2): VertexRDD[VD2]
      // Use the index on this RDD to accelerate a `reduceByKey` operation on the input RDD.
      def aggregateUsingIndex[VD2](other: RDD[(VertexId, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2]
    }
    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD
    // Assumes an existing SparkContext named sc
    val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 100L).map(id => (id, 1)))
    val rddB: RDD[(VertexId, Double)] = sc.parallelize(0L until 100L).flatMap(id => List((id, 1.0), (id, 2.0)))
    // There should be 200 entries in rddB
    rddB.count
    val setB: VertexRDD[Double] = setA.aggregateUsingIndex(rddB, _ + _)
    // There should be 100 entries in setB
    setB.count
    // Joining A and B should now be fast!
    val setC: VertexRDD[Double] = setA.innerJoin(setB)((id, a, b) => a + b)
EdgeRDDs
    // Transform the edge attributes while preserving the structure
    def mapValues[ED2](f: Edge[ED] => ED2): EdgeRDD[ED2]
    // Reverse the edges, reusing both attributes and structure
    def reverse: EdgeRDD[ED]
    // Join two `EdgeRDD`s partitioned using the same partitioning strategy.
    def innerJoin[ED2, ED3](other: EdgeRDD[ED2])(f: (VertexId, VertexId, ED, ED2) => ED3): EdgeRDD[ED3]
Graph Algorithms
PageRank
    import org.apache.spark.graphx._
    // Assumes an existing SparkContext named sc
    // Load the edges as a graph
    val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
    // Run PageRank
    val ranks = graph.pageRank(0.0001).vertices
    // Join the ranks with the usernames
    val users = sc.textFile("graphx/data/users.txt").map { line =>
      val fields = line.split(",")
      (fields(0).toLong, fields(1))
    }
    val ranksByUsername = users.join(ranks).map {
      case (id, (username, rank)) => (username, rank)
    }
    // Print the result
    println(ranksByUsername.collect().mkString("\n"))
Connected Components
    // Load the graph as in the PageRank example
    val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
    // Find the connected components
    val cc = graph.connectedComponents().vertices
    // Join the connected components with the usernames
    val users = sc.textFile("graphx/data/users.txt").map { line =>
      val fields = line.split(",")
      (fields(0).toLong, fields(1))
    }
    val ccByUsername = users.join(cc).map {
      case (id, (username, cc)) => (username, cc)
    }
    // Print the result
    println(ccByUsername.collect().mkString("\n"))
Triangle Counting
    // Load the edges in canonical order and partition the graph for triangle count
    val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt", true).partitionBy(PartitionStrategy.RandomVertexCut)
    // Find the triangle count for each vertex
    val triCounts = graph.triangleCount().vertices
    // Join the triangle counts with the usernames
    val users = sc.textFile("graphx/data/users.txt").map { line =>
      val fields = line.split(",")
      (fields(0).toLong, fields(1))
    }
    val triCountByUsername = users.join(triCounts).map { case (id, (username, tc)) =>
      (username, tc)
    }
    // Print the result
    println(triCountByUsername.collect().mkString("\n"))
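Triangle counting reports, for each vertex, how many triangles pass through it. To show what the number means, here is a minimal plain-Scala sketch that computes it from neighbour sets on a small in-memory edge list. `TriangleCountSketch` and `triangleCounts` are hypothetical names, and the approach (intersecting neighbour sets and halving the total) is one straightforward way to get the count, not a description of how GraphX computes it internally. Edges are treated as undirected; self-loops and duplicate edges are ignored.

```scala
object TriangleCountSketch {
  // Hypothetical helper: number of triangles through each vertex
  def triangleCounts(edges: Seq[(Long, Long)]): Map[Long, Int] = {
    // Undirected neighbour sets; the Set removes duplicate edges
    val neighbors: Map[Long, Set[Long]] =
      edges
        .filter { case (a, b) => a != b }
        .flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

    neighbors.map { case (v, ns) =>
      // For each neighbour n, count the neighbours shared by v and n.
      // toSeq keeps equal sizes from being collapsed by Set semantics.
      val shared = ns.toSeq.map(n => (ns intersect neighbors(n)).size).sum
      // Every triangle through v is seen twice, once from each of its other two corners
      v -> (shared / 2)
    }
  }

  def main(args: Array[String]): Unit = {
    // One triangle (1, 2, 3) plus a dangling edge 3 -> 4
    println(triangleCounts(Seq((1L, 2L), (2L, 3L), (3L, 1L), (3L, 4L))))
  }
}
```

Vertices 1, 2 and 3 each lie on the single triangle, while vertex 4 only hangs off vertex 3 and gets a count of zero.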
Examples
    import org.apache.spark.SparkContext
    import org.apache.spark.graphx._
    // Connect to the Spark cluster
    val sc = new SparkContext("spark://master.amplab.org", "research")
    // Load my user data and parse into tuples of user id and attribute list
    val users = (sc.textFile("graphx/data/users.txt")
      .map(line => line.split(",")).map( parts => (parts.head.toLong, parts.tail) ))
    // Parse the edge data which is already in userId -> userId format
    val followerGraph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
    // Attach the user attributes
    val graph = followerGraph.outerJoinVertices(users) {
      case (uid, deg, Some(attrList)) => attrList
      // Some users may not have attributes so we set them as empty
      case (uid, deg, None) => Array.empty[String]
    }
    // Restrict the graph to users with usernames and names
    val subgraph = graph.subgraph(vpred = (vid, attr) => attr.size == 2)
    // Compute the PageRank
    val pagerankGraph = subgraph.pageRank(0.001)
    // Get the attributes of the top pagerank users
    val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices) {
      case (uid, attrList, Some(pr)) => (pr, attrList.toList)
      case (uid, attrList, None) => (0.0, attrList.toList)
    }
    println(userInfoWithPageRank.vertices.top(5)(Ordering.by(_._2._1)).mkString("\n"))
Related content:
Spark Chinese Manual 1: Programming Guide
http://www.aboutyun.com/thread-11413-1-1.html
Spark Chinese Manual 2: A Quick Example in Spark
http://www.aboutyun.com/thread-11484-1-1.html
Spark Chinese Manual 3: Spark Basic Concepts
http://www.aboutyun.com/thread-11502-1-1.html
Spark Chinese Manual 4: Spark Basic Concepts (2)
http://www.aboutyun.com/thread-11516-1-1.html
Spark Chinese Manual 5: Spark Basic Concepts (3)
http://www.aboutyun.com/thread-11535-1-1.html
Spark Chinese Manual 6: Spark SQL from Beginner to Expert
http://www.aboutyun.com/thread-11562-1-1.html
Spark Chinese Manual 7: Spark SQL from Beginner to Expert (continued)
http://www.aboutyun.com/thread-11575-1-1.html
Spark Chinese Manual 8: Spark GraphX Programming Guide (1)
http://www.aboutyun.com/thread-11589-1-1.html
Spark Chinese Manual 10: Spark Deployment: Submitting Applications and Standalone Mode
http://www.aboutyun.com/thread-11615-1-1.html
Spark Chinese Manual 11: Spark Configuration Guide
http://www.aboutyun.com/thread-10652-1-1.html