Spark Graphx编程指南
1.GraphX提供了几种方式从RDD或者磁盘上的顶点和边集合构造图?
2.PageRank算法在图中发挥什么作用?
3.三角形计数算法的作用是什么?

Spark中文手册-编程指南
Spark之一个快速的例子Spark之基本概念
Spark之基本概念
Spark之基本概念(2)
Spark之基本概念(3)
Spark-sql由入门到精通
Spark-sql由入门到精通续
spark GraphX编程指南(1)
Pregel API
- class GraphOps[VD, ED] {
- def pregel[A]
- (initialMsg: A,
- maxIter: Int = Int.MaxValue,
- activeDir: EdgeDirection = EdgeDirection.Out)
- (vprog: (VertexId, VD, A) => VD,
- sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
- mergeMsg: (A, A) => A)
- : Graph[VD, ED] = {
- // Receive the initial message at each vertex
- var g = mapVertices( (vid, vdata) => vprog(vid, vdata, initialMsg) ).cache()
- // compute the messages
- var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
- var activeMessages = messages.count()
- // Loop until no messages remain or maxIterations is achieved
- var i = 0
- while (activeMessages > 0 && i < maxIterations) {
- // Receive the messages: -----------------------------------------------------------------------
- // Run the vertex program on all vertices that receive messages
- val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
- // Merge the new vertex values back into the graph
- g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }.cache()
- // Send Messages: ------------------------------------------------------------------------------
- // Vertices that didn't receive a message above don't appear in newVerts and therefore don't
- // get to send messages. More precisely the map phase of mapReduceTriplets is only invoked
- // on edges in the activeDir of vertices in newVerts
- messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache()
- activeMessages = messages.count()
- i += 1
- }
- g
- }
- }
复制代码
- import org.apache.spark.graphx._
- // Import random graph generation library
- import org.apache.spark.graphx.util.GraphGenerators
- // A graph with edge attributes containing distances
- val graph: Graph[Int, Double] =
- GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
- val sourceId: VertexId = 42 // The ultimate source
- // Initialize the graph such that all vertices except the root have distance infinity.
- val initialGraph = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
- val sssp = initialGraph.pregel(Double.PositiveInfinity)(
- (id, dist, newDist) => math.min(dist, newDist), // Vertex Program
- triplet => { // Send Message
- if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
- Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
- } else {
- Iterator.empty
- }
- },
- (a,b) => math.min(a,b) // Merge Message
- )
- println(sssp.vertices.collect.mkString("\n"))
复制代码
图构造者
- object GraphLoader {
- def edgeListFile(
- sc: SparkContext,
- path: String,
- canonicalOrientation: Boolean = false,
- minEdgePartitions: Int = 1)
- : Graph[Int, Int]
- }
复制代码
- # This is a comment
- 2 1
- 4 1
- 1 2
复制代码
- object Graph {
- def apply[VD, ED](
- vertices: RDD[(VertexId, VD)],
- edges: RDD[Edge[ED]],
- defaultVertexAttr: VD = null)
- : Graph[VD, ED]
- def fromEdges[VD, ED](
- edges: RDD[Edge[ED]],
- defaultValue: VD): Graph[VD, ED]
- def fromEdgeTuples[VD](
- rawEdges: RDD[(VertexId, VertexId)],
- defaultValue: VD,
- uniqueEdges: Option[PartitionStrategy] = None): Graph[VD, Int]
- }
复制代码
顶点和边RDDs
VertexRDDs
- class VertexRDD[VD] extends RDD[(VertexID, VD)] {
- // Filter the vertex set but preserves the internal index
- def filter(pred: Tuple2[VertexId, VD] => Boolean): VertexRDD[VD]
- // Transform the values without changing the ids (preserves the internal index)
- def mapValues[VD2](map: VD => VD2): VertexRDD[VD2]
- def mapValues[VD2](map: (VertexId, VD) => VD2): VertexRDD[VD2]
- // Remove vertices from this set that appear in the other set
- def diff(other: VertexRDD[VD]): VertexRDD[VD]
- // Join operators that take advantage of the internal indexing to accelerate joins (substantially)
- def leftJoin[VD2, VD3](other: RDD[(VertexId, VD2)])(f: (VertexId, VD, Option[VD2]) => VD3): VertexRDD[VD3]
- def innerJoin[U, VD2](other: RDD[(VertexId, U)])(f: (VertexId, VD, U) => VD2): VertexRDD[VD2]
- // Use the index on this RDD to accelerate a `reduceByKey` operation on the input RDD.
- def aggregateUsingIndex[VD2](other: RDD[(VertexId, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2]
- }
复制代码
- val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 100L).map(id => (id, 1)))
- val rddB: RDD[(VertexId, Double)] = sc.parallelize(0L until 100L).flatMap(id => List((id, 1.0), (id, 2.0)))
- // There should be 200 entries in rddB
- rddB.count
- val setB: VertexRDD[Double] = setA.aggregateUsingIndex(rddB, _ + _)
- // There should be 100 entries in setB
- setB.count
- // Joining A and B should now be fast!
- val setC: VertexRDD[Double] = setA.innerJoin(setB)((id, a, b) => a + b)
复制代码
EdgeRDDs
- // Transform the edge attributes while preserving the structure
- def mapValues[ED2](f: Edge[ED] => ED2): EdgeRDD[ED2]
- // Revere the edges reusing both attributes and structure
- def reverse: EdgeRDD[ED]
- // Join two `EdgeRDD`s partitioned using the same partitioning strategy.
- def innerJoin[ED2, ED3](other: EdgeRDD[ED2])(f: (VertexId, VertexId, ED, ED2) => ED3): EdgeRDD[ED3]
复制代码
图算法
PageRank算法
- // Load the edges as a graph
- val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
- // Run PageRank
- val ranks = graph.pageRank(0.0001).vertices
- // Join the ranks with the usernames
- val users = sc.textFile("graphx/data/users.txt").map { line =>
- val fields = line.split(",")
- (fields(0).toLong, fields(1))
- }
- val ranksByUsername = users.join(ranks).map {
- case (id, (username, rank)) => (username, rank)
- }
- // Print the result
- println(ranksByUsername.collect().mkString("\n"))
复制代码
连通体算法
- / Load the graph as in the PageRank example
- val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
- // Find the connected components
- val cc = graph.connectedComponents().vertices
- // Join the connected components with the usernames
- val users = sc.textFile("graphx/data/users.txt").map { line =>
- val fields = line.split(",")
- (fields(0).toLong, fields(1))
- }
- val ccByUsername = users.join(cc).map {
- case (id, (username, cc)) => (username, cc)
- }
- // Print the result
- println(ccByUsername.collect().mkString("\n"))
复制代码
三角形计数算法
- // Load the edges in canonical order and partition the graph for triangle count
- val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt", true).partitionBy(PartitionStrategy.RandomVertexCut)
- // Find the triangle count for each vertex
- val triCounts = graph.triangleCount().vertices
- // Join the triangle counts with the usernames
- val users = sc.textFile("graphx/data/users.txt").map { line =>
- val fields = line.split(",")
- (fields(0).toLong, fields(1))
- }
- val triCountByUsername = users.join(triCounts).map { case (id, (username, tc)) =>
- (username, tc)
- }
- // Print the result
- println(triCountByUsername.collect().mkString("\n"))
复制代码
例子
- // Connect to the Spark cluster
- val sc = new SparkContext("spark://master.amplab.org", "research")
- // Load my user data and parse into tuples of user id and attribute list
- val users = (sc.textFile("graphx/data/users.txt")
- .map(line => line.split(",")).map( parts => (parts.head.toLong, parts.tail) ))
- // Parse the edge data which is already in userId -> userId format
- val followerGraph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
- // Attach the user attributes
- val graph = followerGraph.outerJoinVertices(users) {
- case (uid, deg, Some(attrList)) => attrList
- // Some users may not have attributes so we set them as empty
- case (uid, deg, None) => Array.empty[String]
- }
- // Restrict the graph to users with usernames and names
- val subgraph = graph.subgraph(vpred = (vid, attr) => attr.size == 2)
- // Compute the PageRank
- val pagerankGraph = subgraph.pageRank(0.001)
- // Get the attributes of the top pagerank users
- val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices) {
- case (uid, attrList, Some(pr)) => (pr, attrList.toList)
- case (uid, attrList, None) => (0.0, attrList.toList)
- }
- println(userInfoWithPageRank.vertices.top(5)(Ordering.by(_._2._1)).mkString("\n"))
复制代码
相关内容:
Spark中文手册1-编程指南
http://www.aboutyun.com/thread-11413-1-1.html
Spark中文手册2:Spark之一个快速的例子
http://www.aboutyun.com/thread-11484-1-1.html
Spark中文手册3:Spark之基本概念
http://www.aboutyun.com/thread-11502-1-1.html
Spark中文手册4:Spark之基本概念(2)
http://www.aboutyun.com/thread-11516-1-1.html
Spark中文手册5:Spark之基本概念(3)
http://www.aboutyun.com/thread-11535-1-1.html
Spark中文手册6:Spark-sql由入门到精通
http://www.aboutyun.com/thread-11562-1-1.html
Spark中文手册7:Spark-sql由入门到精通【续】
http://www.aboutyun.com/thread-11575-1-1.html
Spark中文手册8:spark GraphX编程指南(1)
http://www.aboutyun.com/thread-11589-1-1.html
Spark中文手册10:spark部署:提交应用程序及独立部署模式
http://www.aboutyun.com/thread-11615-1-1.html
Spark中文手册11:Spark 配置指南
http://www.aboutyun.com/thread-10652-1-1.html
Spark Graphx编程指南的更多相关文章
- Spark—GraphX编程指南
Spark系列面试题 Spark面试题(一) Spark面试题(二) Spark面试题(三) Spark面试题(四) Spark面试题(五)--数据倾斜调优 Spark面试题(六)--Spark资源调 ...
- GraphX编程指南
GraphX编程指南 概述 入门 属性图 属性图示例 图算子 算子摘要列表 属性算子 结构化算子 Join算子 最近邻聚集 汇总消息(aggregateMessages) Map Reduce三元 ...
- Apache Spark 2.2.0 中文文档 - Spark Streaming 编程指南 | ApacheCN
Spark Streaming 编程指南 概述 一个入门示例 基础概念 依赖 初始化 StreamingContext Discretized Streams (DStreams)(离散化流) Inp ...
- <译>Spark Sreaming 编程指南
Spark Streaming 编程指南 Overview A Quick Example Basic Concepts Linking Initializing StreamingContext D ...
- Apache Spark 2.2.0 中文文档 - Spark Streaming 编程指南
Spark Streaming 编程指南 概述 一个入门示例 基础概念 依赖 初始化 StreamingContext Discretized Streams (DStreams)(离散化流) Inp ...
- Spark Streaming编程指南
Overview A Quick Example Basic Concepts Linking Initializing StreamingContext Discretized Streams (D ...
- Spark SQL编程指南(Python)
前言 Spark SQL允许我们在Spark环境中使用SQL或者Hive SQL执行关系型查询.它的核心是一个特殊类型的Spark RDD:SchemaRDD. SchemaRDD类似于传统关 ...
- Spark SQL编程指南(Python)【转】
转自:http://www.cnblogs.com/yurunmiao/p/4685310.html 前言 Spark SQL允许我们在Spark环境中使用SQL或者Hive SQL执行关系型查询 ...
- Spark官方3 ---------Spark Streaming编程指南(1.5.0)
Design Patterns for using foreachRDD dstream.foreachRDD是一个强大的原语,允许将数据发送到外部系统.然而,了解如何正确有效地使用该原语很重要.避免 ...
随机推荐
- SD卡FAT32获得高速的文件格式(图文介绍)
说明: MBR :Master Boot Record ( 主引导记录) DBR :DOS Boot Record ( 引导扇区) FAT :File Allocation Table ( 文件分配表 ...
- SQL Server中如何备份存储过程(SP)和函数(Fun)
考虑到安全因素,我们经常需要对数据库的存储过程(SP)和函数(Fun)进行备份 下面提供了一种简单的方式, 存储过程(SP)SQL代码如下: select p.name as SpName,m.def ...
- Java 多线程编程之九:使用 Executors 和 ThreadPoolExecutor 实现的 Java 线程池的例子
线程池用来管理工作线程的数量,它持有一个等待被执行的线程的队列. java.util.concurrent.Executors 提供了 java.util.concurrent.Exe ...
- Moq 测试 属性,常用方法
RhinoMock入门(7)——Do,With和Record-playback 摘要: (一)Do(delegate)有时候在测试过程中只返回一个静态的值是不够的,在这种情况下,Do()方法可以用来在 ...
- [转]解决MySQL出现大量unauthenticated user的问题
最近发现两台MySQL server在中午的时候忽然(很突然的那种)发飙,不断的挂掉.重启mysql也尽是失败,看mysql的errorlog,只能看到类似如下的信息: Forcing close o ...
- android Fragment 用法小结
Fragment 是android 3.0引入的新API,是作为Activity的子模块,必须嵌入Activity才能使用. Activity 与 Fragment的关系: 一.依附性: 1. Fra ...
- Linux环境进程间通信(五): 共享内存(上)
linux下进程间通信的几种主要手段: 管道(Pipe)及有名管道(named pipe):管道可用于具有亲缘关系进程间的通信,有名管道克服了管道没有名字的限制,因此,除具有管道所具有的功能外,它还允 ...
- AppBox_v3.0
AppBox_v2.0完整版免费下载,暨AppBox_v3.0正式发布! AppBox 是基于 FineUI 的通用权限管理框架,包括用户管理.职称管理.部门管理.角色管理.角色权限管理等模块. Ap ...
- ASP.NET Web安装程序
键发布ASP.NET Web安装程序,搞WebForm的童鞋看过来... 前言:最近公司有个Web要发布,但是以前都是由实施到甲方去发布,配置,这几天有点闲,同事让我搞一个一键发布,就和安装软件那样的 ...
- zabbix实现对磁盘动态监控
zabbix实现对磁盘动态监控 前言 zabbix一直是小规模互联网公司服务器性能监控首选,首先是免费,其次,有专门的公司和社区开发维护,使其稳定性和功能都在不断地增强和完善.zabbix拥有详细的U ...