The official demo:

from numpy import array
from math import sqrt

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans, KMeansModel

sc = SparkContext(appName="clusteringExample")

# Load and parse the data
data = sc.textFile("/root/spark-2.1.1-bin-hadoop2.6/data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

# Save and load model
#clusters.save(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")
#sameModel = KMeansModel.load(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")
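A small caveat about this official demo: despite the name, the error function sums plain Euclidean distances, so the printed quantity is Σ‖x − c(x)‖ over all points x, where c(x) is the center of the cluster that x is assigned to. A literal Within Set Sum of Squared Errors would drop the sqrt and sum ‖x − c(x)‖² instead; either variant works as a relative measure when comparing different values of k.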

An example with normalization:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.functions.{col, udf}

case class DataRow(label: Double, x1: Double, x2: Double)

val data = sqlContext.createDataFrame(sc.parallelize(Seq(
  DataRow(3, 1, 2),
  DataRow(5, 3, 4),
  DataRow(7, 5, 6),
  DataRow(6, 0, 0)
)))

// Pull the two feature columns out as vectors and cluster them
val parsedData = data.rdd.map(s => Vectors.dense(s.getDouble(1), s.getDouble(2))).cache()
val clusters = KMeans.train(parsedData, 3, 20)
val t = udf { (x1: Double, x2: Double) => clusters.predict(Vectors.dense(x1, x2)) }
val result = data.select(col("label"), t(col("x1"), col("x2")))

The important part is the last two lines. The first creates a UDF (user-defined function) that can be applied directly to DataFrame columns (in this case, the two columns x1 and x2). The second selects the label column along with the UDF applied to the x1 and x2 columns. Since the UDF predicts the closest cluster, result is a DataFrame consisting of (label, closestCluster) rows.
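Note that the snippet above doesn't actually normalize anything yet. A minimal sketch of the missing scaling step, assuming the parsedData RDD defined above and using mllib's StandardScaler (variable names here are illustrative):

import org.apache.spark.mllib.feature.StandardScaler

// withMean centers each feature, withStd scales it to unit standard deviation
val scaler = new StandardScaler(withMean = true, withStd = true).fit(parsedData)
val scaledData = scaler.transform(parsedData).cache()

// Train on the scaled features instead of the raw ones
val clusters = KMeans.train(scaledData, 3, 20)

If you train on scaled features, remember to run the same scaler.transform on every point you hand to clusters.predict (for example inside the UDF), so that queries live in the same scaled space as the training data.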

Reference: https://stackoverflow.com/questions/31447141/spark-mllib-kmeans-from-dataframe-and-back-again

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering._

// Note: toDF needs import sqlContext.implicits._ when not running in the spark-shell
val rows = data.rdd.map(r => (r.getDouble(1), r.getDouble(2))).cache()
val vectors = rows.map(r => Vectors.dense(r._1, r._2))
val kMeansModel = KMeans.train(vectors, 3, 20)
val predictions = rows.map { r => (r._1, kMeansModel.predict(Vectors.dense(r._1, r._2))) }
val df = predictions.toDF("id", "cluster")
df.show

Create column from RDD

It's very easy to obtain pairs of ids and clusters in the form of an RDD:

val idPointRDD = data.rdd.map(s => (s.getInt(0), Vectors.dense(s.getDouble(1),s.getDouble(2)))).cache()
val clusters = KMeans.train(idPointRDD.map(_._2), 3, 20)
val clustersRDD = clusters.predict(idPointRDD.map(_._2))
val idClusterRDD = idPointRDD.map(_._1).zip(clustersRDD)

Then you create a DataFrame from that:

val idCluster = idClusterRDD.toDF("id", "cluster")

This works because map doesn't change the order of the data in an RDD, which is why you can simply zip the ids with the results of the prediction.
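To make the ordering claim concrete, here is a minimal sketch (toy values, illustrative names) of why zipping an RDD with a map-derived RDD pairs each element with its own result:

val ids = sc.parallelize(Seq(1, 2, 3), 2)
// map preserves partitioning and per-partition order, which is exactly
// what zip requires (same partition count, same element count per partition)
val doubled = ids.map(_ * 10)
ids.zip(doubled).collect() // Array((1,10), (2,20), (3,30))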

Use UDF (User Defined Function)

The second method involves using the clusters.predict method as a UDF:

val bcClusters = sc.broadcast(clusters)

def predict(x: Double, y: Double): Int = {
  bcClusters.value.predict(Vectors.dense(x, y))
}

sqlContext.udf.register("predict", predict _)

Now we can use it to add predictions to data:

val idCluster = data.selectExpr("id", "predict(x, y) as cluster")

Keep in mind that the Spark API doesn't allow UDF deregistration, which means the data captured in the closure will be kept in memory.
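A partial mitigation, going beyond what the quoted answer covers: the model lives in a broadcast variable here, so its executor-side copies can at least be released once the UDF is no longer needed:

// Drop the cached broadcast copies on the executors; any later call to the
// UDF would re-fetch the model from the driver
bcClusters.unpersist()
// destroy() also frees the driver-side copy, after which the UDF must not be used
// bcClusters.destroy()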
