Spark Graphx编程指南
1.GraphX提供了几种方式从RDD或者磁盘上的顶点和边集合构造图?
2.PageRank算法在图中发挥什么作用?
3.三角形计数算法的作用是什么?

Spark中文手册-编程指南
Spark之一个快速的例子Spark之基本概念
Spark之基本概念
Spark之基本概念(2)
Spark之基本概念(3)
Spark-sql由入门到精通
Spark-sql由入门到精通续
spark GraphX编程指南(1)
Pregel API
- class GraphOps[VD, ED] {
- def pregel[A]
- (initialMsg: A,
- maxIter: Int = Int.MaxValue,
- activeDir: EdgeDirection = EdgeDirection.Out)
- (vprog: (VertexId, VD, A) => VD,
- sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
- mergeMsg: (A, A) => A)
- : Graph[VD, ED] = {
- // Receive the initial message at each vertex
- var g = mapVertices( (vid, vdata) => vprog(vid, vdata, initialMsg) ).cache()
- // compute the messages
- var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
- var activeMessages = messages.count()
- // Loop until no messages remain or maxIterations is achieved
- var i = 0
- while (activeMessages > 0 && i < maxIterations) {
- // Receive the messages: -----------------------------------------------------------------------
- // Run the vertex program on all vertices that receive messages
- val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
- // Merge the new vertex values back into the graph
- g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }.cache()
- // Send Messages: ------------------------------------------------------------------------------
- // Vertices that didn't receive a message above don't appear in newVerts and therefore don't
- // get to send messages. More precisely the map phase of mapReduceTriplets is only invoked
- // on edges in the activeDir of vertices in newVerts
- messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache()
- activeMessages = messages.count()
- i += 1
- }
- g
- }
- }
复制代码
- import org.apache.spark.graphx._
- // Import random graph generation library
- import org.apache.spark.graphx.util.GraphGenerators
- // A graph with edge attributes containing distances
- val graph: Graph[Int, Double] =
- GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
- val sourceId: VertexId = 42 // The ultimate source
- // Initialize the graph such that all vertices except the root have distance infinity.
- val initialGraph = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
- val sssp = initialGraph.pregel(Double.PositiveInfinity)(
- (id, dist, newDist) => math.min(dist, newDist), // Vertex Program
- triplet => { // Send Message
- if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
- Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
- } else {
- Iterator.empty
- }
- },
- (a,b) => math.min(a,b) // Merge Message
- )
- println(sssp.vertices.collect.mkString("\n"))
复制代码
图构造者
- object GraphLoader {
- def edgeListFile(
- sc: SparkContext,
- path: String,
- canonicalOrientation: Boolean = false,
- minEdgePartitions: Int = 1)
- : Graph[Int, Int]
- }
复制代码
- # This is a comment
- 2 1
- 4 1
- 1 2
复制代码
- object Graph {
- def apply[VD, ED](
- vertices: RDD[(VertexId, VD)],
- edges: RDD[Edge[ED]],
- defaultVertexAttr: VD = null)
- : Graph[VD, ED]
- def fromEdges[VD, ED](
- edges: RDD[Edge[ED]],
- defaultValue: VD): Graph[VD, ED]
- def fromEdgeTuples[VD](
- rawEdges: RDD[(VertexId, VertexId)],
- defaultValue: VD,
- uniqueEdges: Option[PartitionStrategy] = None): Graph[VD, Int]
- }
复制代码
顶点和边RDDs
VertexRDDs
- class VertexRDD[VD] extends RDD[(VertexID, VD)] {
- // Filter the vertex set but preserves the internal index
- def filter(pred: Tuple2[VertexId, VD] => Boolean): VertexRDD[VD]
- // Transform the values without changing the ids (preserves the internal index)
- def mapValues[VD2](map: VD => VD2): VertexRDD[VD2]
- def mapValues[VD2](map: (VertexId, VD) => VD2): VertexRDD[VD2]
- // Remove vertices from this set that appear in the other set
- def diff(other: VertexRDD[VD]): VertexRDD[VD]
- // Join operators that take advantage of the internal indexing to accelerate joins (substantially)
- def leftJoin[VD2, VD3](other: RDD[(VertexId, VD2)])(f: (VertexId, VD, Option[VD2]) => VD3): VertexRDD[VD3]
- def innerJoin[U, VD2](other: RDD[(VertexId, U)])(f: (VertexId, VD, U) => VD2): VertexRDD[VD2]
- // Use the index on this RDD to accelerate a `reduceByKey` operation on the input RDD.
- def aggregateUsingIndex[VD2](other: RDD[(VertexId, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2]
- }
复制代码
- val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 100L).map(id => (id, 1)))
- val rddB: RDD[(VertexId, Double)] = sc.parallelize(0L until 100L).flatMap(id => List((id, 1.0), (id, 2.0)))
- // There should be 200 entries in rddB
- rddB.count
- val setB: VertexRDD[Double] = setA.aggregateUsingIndex(rddB, _ + _)
- // There should be 100 entries in setB
- setB.count
- // Joining A and B should now be fast!
- val setC: VertexRDD[Double] = setA.innerJoin(setB)((id, a, b) => a + b)
复制代码
EdgeRDDs
- // Transform the edge attributes while preserving the structure
- def mapValues[ED2](f: Edge[ED] => ED2): EdgeRDD[ED2]
- // Revere the edges reusing both attributes and structure
- def reverse: EdgeRDD[ED]
- // Join two `EdgeRDD`s partitioned using the same partitioning strategy.
- def innerJoin[ED2, ED3](other: EdgeRDD[ED2])(f: (VertexId, VertexId, ED, ED2) => ED3): EdgeRDD[ED3]
复制代码
图算法
PageRank算法
- // Load the edges as a graph
- val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
- // Run PageRank
- val ranks = graph.pageRank(0.0001).vertices
- // Join the ranks with the usernames
- val users = sc.textFile("graphx/data/users.txt").map { line =>
- val fields = line.split(",")
- (fields(0).toLong, fields(1))
- }
- val ranksByUsername = users.join(ranks).map {
- case (id, (username, rank)) => (username, rank)
- }
- // Print the result
- println(ranksByUsername.collect().mkString("\n"))
复制代码
连通体算法
- / Load the graph as in the PageRank example
- val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
- // Find the connected components
- val cc = graph.connectedComponents().vertices
- // Join the connected components with the usernames
- val users = sc.textFile("graphx/data/users.txt").map { line =>
- val fields = line.split(",")
- (fields(0).toLong, fields(1))
- }
- val ccByUsername = users.join(cc).map {
- case (id, (username, cc)) => (username, cc)
- }
- // Print the result
- println(ccByUsername.collect().mkString("\n"))
复制代码
三角形计数算法
- // Load the edges in canonical order and partition the graph for triangle count
- val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt", true).partitionBy(PartitionStrategy.RandomVertexCut)
- // Find the triangle count for each vertex
- val triCounts = graph.triangleCount().vertices
- // Join the triangle counts with the usernames
- val users = sc.textFile("graphx/data/users.txt").map { line =>
- val fields = line.split(",")
- (fields(0).toLong, fields(1))
- }
- val triCountByUsername = users.join(triCounts).map { case (id, (username, tc)) =>
- (username, tc)
- }
- // Print the result
- println(triCountByUsername.collect().mkString("\n"))
复制代码
例子
- // Connect to the Spark cluster
- val sc = new SparkContext("spark://master.amplab.org", "research")
- // Load my user data and parse into tuples of user id and attribute list
- val users = (sc.textFile("graphx/data/users.txt")
- .map(line => line.split(",")).map( parts => (parts.head.toLong, parts.tail) ))
- // Parse the edge data which is already in userId -> userId format
- val followerGraph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
- // Attach the user attributes
- val graph = followerGraph.outerJoinVertices(users) {
- case (uid, deg, Some(attrList)) => attrList
- // Some users may not have attributes so we set them as empty
- case (uid, deg, None) => Array.empty[String]
- }
- // Restrict the graph to users with usernames and names
- val subgraph = graph.subgraph(vpred = (vid, attr) => attr.size == 2)
- // Compute the PageRank
- val pagerankGraph = subgraph.pageRank(0.001)
- // Get the attributes of the top pagerank users
- val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices) {
- case (uid, attrList, Some(pr)) => (pr, attrList.toList)
- case (uid, attrList, None) => (0.0, attrList.toList)
- }
- println(userInfoWithPageRank.vertices.top(5)(Ordering.by(_._2._1)).mkString("\n"))
复制代码
相关内容:
Spark中文手册1-编程指南
http://www.aboutyun.com/thread-11413-1-1.html
Spark中文手册2:Spark之一个快速的例子
http://www.aboutyun.com/thread-11484-1-1.html
Spark中文手册3:Spark之基本概念
http://www.aboutyun.com/thread-11502-1-1.html
Spark中文手册4:Spark之基本概念(2)
http://www.aboutyun.com/thread-11516-1-1.html
Spark中文手册5:Spark之基本概念(3)
http://www.aboutyun.com/thread-11535-1-1.html
Spark中文手册6:Spark-sql由入门到精通
http://www.aboutyun.com/thread-11562-1-1.html
Spark中文手册7:Spark-sql由入门到精通【续】
http://www.aboutyun.com/thread-11575-1-1.html
Spark中文手册8:spark GraphX编程指南(1)
http://www.aboutyun.com/thread-11589-1-1.html
Spark中文手册10:spark部署:提交应用程序及独立部署模式
http://www.aboutyun.com/thread-11615-1-1.html
Spark中文手册11:Spark 配置指南
http://www.aboutyun.com/thread-10652-1-1.html
Spark Graphx编程指南的更多相关文章
- Spark—GraphX编程指南
Spark系列面试题 Spark面试题(一) Spark面试题(二) Spark面试题(三) Spark面试题(四) Spark面试题(五)--数据倾斜调优 Spark面试题(六)--Spark资源调 ...
- GraphX编程指南
GraphX编程指南 概述 入门 属性图 属性图示例 图算子 算子摘要列表 属性算子 结构化算子 Join算子 最近邻聚集 汇总消息(aggregateMessages) Map Reduce三元 ...
- Apache Spark 2.2.0 中文文档 - Spark Streaming 编程指南 | ApacheCN
Spark Streaming 编程指南 概述 一个入门示例 基础概念 依赖 初始化 StreamingContext Discretized Streams (DStreams)(离散化流) Inp ...
- <译>Spark Sreaming 编程指南
Spark Streaming 编程指南 Overview A Quick Example Basic Concepts Linking Initializing StreamingContext D ...
- Apache Spark 2.2.0 中文文档 - Spark Streaming 编程指南
Spark Streaming 编程指南 概述 一个入门示例 基础概念 依赖 初始化 StreamingContext Discretized Streams (DStreams)(离散化流) Inp ...
- Spark Streaming编程指南
Overview A Quick Example Basic Concepts Linking Initializing StreamingContext Discretized Streams (D ...
- Spark SQL编程指南(Python)
前言 Spark SQL允许我们在Spark环境中使用SQL或者Hive SQL执行关系型查询.它的核心是一个特殊类型的Spark RDD:SchemaRDD. SchemaRDD类似于传统关 ...
- Spark SQL编程指南(Python)【转】
转自:http://www.cnblogs.com/yurunmiao/p/4685310.html 前言 Spark SQL允许我们在Spark环境中使用SQL或者Hive SQL执行关系型查询 ...
- Spark官方3 ---------Spark Streaming编程指南(1.5.0)
Design Patterns for using foreachRDD dstream.foreachRDD是一个强大的原语,允许将数据发送到外部系统.然而,了解如何正确有效地使用该原语很重要.避免 ...
随机推荐
- C#动态表达式计算
C#动态表达式计算 应该有不少人开发过程中遇到过这样的需求,我们直接看图说话: 如上图所示,其中Entity为实体类,其中包括五个属性,该五个属性的值分别来自于数据库查询结果: 用户通过可视化界面进行 ...
- [转载]LVS快速搭建教程
LVS配置教程 作者:oldjiang 一.前言 相信专程来读此文的读者对LVS必然有一定的了解,首先看图: 毋庸置疑,Load Balancer是负载调度器,由它将网络请求无缝隙调度到真实服务器,至 ...
- boost------ref的使用(Boost程序库完全开发指南)读书笔记
STL和Boost中的算法和函数大量使用了函数对象作为判断式或谓词参数,而这些参数都是传值语义,算法或函数在内部保修函数对象的拷贝并使用,例如: #include "stdafx.h&quo ...
- net破解一(反编译,反混淆-剥壳,工具推荐)
net破解一(反编译,反混淆-剥壳,工具推荐) 大家好,前段时间做数据分析,需要解析对方数据,而数据文件是对方公司内部的生成方式,完全不知道它是怎么生成的. 不过还好能拿到客户端(正好是C#开发)所以 ...
- Helper Method
ASP.NET MVC 小牛之路]13 - Helper Method 我们平时编程写一些辅助类的时候习惯用“XxxHelper”来命名.同样,在 MVC 中用于生成 Html 元素的辅助类是 Sys ...
- VMware NAT方式 CentOS 6.8配置静态IP
一.打开虚拟机设置,配置网络连接,如下图 二.编辑 /etc/sysconfig/network,以配置网关 vim /etc/sysconfig/network NETWORKING=yes HOS ...
- 如何本地测试例如QQ登录等第三方接口
前言:现在基本是个网站就会集成第三方的一些接口,比如QQ登录.分享等等.但是在开发的时候,尤其是没有这方面经验的开发人员来说,调试流程时会显得迷茫,不知道怎么调试.这里就个人的这方面学习摸索做一个总结 ...
- General Structure of Quartz.NET and How To Implement It
General Structure of Quartz.NET and How To Implement It General Structure of Quartz.NET and How To ...
- ie8下下拉菜单文字为空
<html> <head> <title></title> <script type="text/javascript"> ...
- 线性回归,logistic回归和一般回归
1 摘要 本报告是在学习斯坦福大学机器学习课程前四节加上配套的讲义后的总结与认识.前四节主要讲述了回归问题,回归属于有监督学习中的一种方法.该方法的核心思想是从连续型统计数据中得到数学模型,然后将该数 ...