官方的demo

from numpy import array
from math import sqrt from pyspark import SparkContext from pyspark.mllib.clustering import KMeans, KMeansModel sc = SparkContext(appName="clusteringExample")
# Load and parse the data
data = sc.textFile("/root/spark-2.1.1-bin-hadoop2.6/data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])) # Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random") # Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point - center)])) WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE)) # Save and load model
#clusters.save(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")
#sameModel = KMeansModel.load(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")

带归一化的例子:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.functions.{col, udf} case class DataRow(label: Double, x1: Double, x2: Double)
val data = sqlContext.createDataFrame(sc.parallelize(Seq(
DataRow(3, 1, 2),
DataRow(5, 3, 4),
DataRow(7, 5, 6),
DataRow(6, 0, 0)
))) val parsedData = data.rdd.map(s => Vectors.dense(s.getDouble(1),s.getDouble(2))).cache()
val clusters = KMeans.train(parsedData, 3, 20)
val t = udf { (x1: Double, x2: Double) => clusters.predict(Vectors.dense(x1, x2)) }
val result = data.select(col("label"), t(col("x1"), col("x2"))) The important part are the last two lines. Creates a UDF (user-defined function) which can be directly applied to Dataframe columns (in this case, the two columns x1 and x2). Selects the label column along with the UDF applied to the x1 and x2 columns. Since the UDF will predict closestCluster, after this result will be a Dataframe consisting of (label, closestCluster)

参考:https://stackoverflow.com/questions/31447141/spark-mllib-kmeans-from-dataframe-and-back-again

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering._ val rows = data.rdd.map(r => (r.getDouble(1),r.getDouble(2))).cache()
val vectors = rows.map(r => Vectors.dense(r._1, r._2))
val kMeansModel = KMeans.train(vectors, 3, 20)
val predictions = rows.map{r => (r._1, kMeansModel.predict(Vectors.dense(r._1, r._2)))}
val df = predictions.toDF("id", "cluster")
df.show

Create column from RDD

It's very easy to obtain pairs of ids and clusters in form of RDD:

val idPointRDD = data.rdd.map(s => (s.getInt(0), Vectors.dense(s.getDouble(1),s.getDouble(2)))).cache()
val clusters = KMeans.train(idPointRDD.map(_._2), 3, 20)
val clustersRDD = clusters.predict(idPointRDD.map(_._2))
val idClusterRDD = idPointRDD.map(_._1).zip(clustersRDD)

Then you create DataFrame from that

val idCluster = idClusterRDD.toDF("id", "cluster")

It works because map doesn't change order of the data in RDD, which is why you can just zip ids with results of prediction.

Use UDF (User Defined Function)

Second method involves using clusters.predict method as UDF:

val bcClusters = sc.broadcast(clusters)
def predict(x: Double, y: Double): Int = {
bcClusters.value.predict(Vectors.dense(x, y))
}
sqlContext.udf.register("predict", predict _)

Now we can use it to add predictions to data:

val idCluster = data.selectExpr("id", "predict(x, y) as cluster")

Keep in mind that Spark API doesn't allow UDF deregistration. This means that closure data will be kept in the memory.

python spark kmeans demo的更多相关文章

  1. Python实现kMeans(k均值聚类)

    Python实现kMeans(k均值聚类) 运行环境 Pyhton3 numpy(科学计算包) matplotlib(画图所需,不画图可不必) 计算过程 st=>start: 开始 e=> ...

  2. RPi 2B python opencv camera demo example

    /************************************************************************************** * RPi 2B pyt ...

  3. [Spark][Python]spark 从 avro 文件获取 Dataframe 的例子

    [Spark][Python]spark 从 avro 文件获取 Dataframe 的例子 从如下地址获取文件: https://github.com/databricks/spark-avro/r ...

  4. [Spark][Python]Spark 访问 mysql , 生成 dataframe 的例子:

    [Spark][Python]Spark 访问 mysql , 生成 dataframe 的例子: mydf001=sqlContext.read.format("jdbc").o ...

  5. 【Python学习笔记】使用python进行kmeans聚类

    使用python进行kmeans聚类 假设我们要解决一个这样的问题. 以下是一些同学,大萌是一个学霸,而我们想要找到这些人中的潜在学霸,所以我们要把这些人分为两类--学霸与非学霸. 高数 英语 Pyt ...

  6. 数据挖掘-聚类分析(Python实现K-Means算法)

    概念: 聚类分析(cluster analysis ):是一组将研究对象分为相对同质的群组(clusters)的统计分析技术.聚类分析也叫分类分析,或者数值分类.聚类的输入是一组未被标记的样本,聚类根 ...

  7. python spark 随机森林入门demo

    class pyspark.mllib.tree.RandomForest[source] Learning algorithm for a random forest model for class ...

  8. python spark 决策树 入门demo

    Refer to the DecisionTree Python docs and DecisionTreeModel Python docs for more details on the API. ...

  9. 随机森林算法demo python spark

    关键参数 最重要的,常常需要调试以提高算法效果的有两个参数:numTrees,maxDepth. numTrees(决策树的个数):增加决策树的个数会降低预测结果的方差,这样在测试时会有更高的accu ...

随机推荐

  1. 运维自动化-Ansible

    前言 天天说运维,究竟是干什么的?先看看工作流程呗.一般来说,运维工程师在一家企业里属于个位数的岗位,甚至只有一个.面对生产中NNN台服务器,NN个人员,工作量也是非常大的.所以嘛,图中的我好歹也会配 ...

  2. 状态模式(state)C++实现

    状态模式 当一个对象的内在状态改变时允许改变其行为,这个对象看起来像是改变了其类. 状态模式主要解决的是当控制一个对象状态转换的条件表达式过于复杂时的情况.把状态的判断逻辑转移到表示不同状态的一系列类 ...

  3. SurfaceView加载长图

    1:SurfaceView加载长图,移到.可以充当背景 效果截图 2:View (淡入淡出动画没实现:记录下) package com.guoxw.surfaceviewimage; import a ...

  4. 在APP开发设计过程中:如何设计启动页面?

    心理学上有一个“7秒理论”,说的是,一个人对另一个人的印象,在初次见面的七秒内就会形成,最近更有研究表明,这个时间可能更短——不到1秒.所以初次见面所展示的形象真的很重要.同理,用户在使用APP时,每 ...

  5. 数据仓库模型建设基础及kimball建模方法总结

    观察数据的角度称之为维.决策数据市多为数据,多维数据分析是决策分析的组要内容. OLAP是在OLTP的基础上发展起来的,OLTP是以数据库为基础的,面对的是操作人员和底层管理人员,对基本数据进行查询和 ...

  6. Asp.net Core 源码-PagedList<T>

    using System.Collections.Generic; using System.Linq; using System; using System.Linq.Expressions; us ...

  7. docker批量删除容器、镜像

    1.删除所有容器 docker rm `docker ps -a -q` docker rm $(docker ps -aq) 2.删除所有镜像 docker rmi `docker images - ...

  8. JDBC连接MySQL数据库(一)——数据库的基本连接

    JDBC的概念在使用之前我们先了解一下JDBC的概念, JDBC的全称是数据库连接(Java Database Connectivity),它是一套用于执行SQL语句时的API,应用程序可以通过这套A ...

  9. USACO 2008 Mar Silver 3.River Crossing 动态规划水题

    Code: #include<cstring> #include<algorithm> #include<cstdio> using namespace std; ...

  10. mysql数据库增量恢复

    mysqldump -uroot -p -B discuzx -F -x --master-data=2 --events|gzip >/root/discuzx.sql.gz 写入数据 删除数 ...