python spark kmeans demo
The official demo:
from numpy import array
from math import sqrt

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans, KMeansModel

sc = SparkContext(appName="clusteringExample")
# Load and parse the data
data = sc.textFile("/root/spark-2.1.1-bin-hadoop2.6/data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

# Save and load model
#clusters.save(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")
#sameModel = KMeansModel.load(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")
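For reference, the bundled kmeans_data.txt is a tiny whitespace-separated file; in the stock Spark distribution it contains six 3-D points forming two obvious clusters:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

so with k = 2 the trained model should recover exactly those two groups.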
An example with normalization:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.functions.{col, udf}

case class DataRow(label: Double, x1: Double, x2: Double)
val data = sqlContext.createDataFrame(sc.parallelize(Seq(
DataRow(3, 1, 2),
DataRow(5, 3, 4),
DataRow(7, 5, 6),
DataRow(6, 0, 0)
)))

val parsedData = data.rdd.map(s => Vectors.dense(s.getDouble(1), s.getDouble(2))).cache()
val clusters = KMeans.train(parsedData, 3, 20)
val t = udf { (x1: Double, x2: Double) => clusters.predict(Vectors.dense(x1, x2)) }
val result = data.select(col("label"), t(col("x1"), col("x2")))

The important parts are the last two lines. The first creates a UDF (user-defined function) that can be applied directly to DataFrame columns (here, the two columns x1 and x2). The second selects the label column along with the UDF applied to the x1 and x2 columns. Since the UDF predicts the closest cluster, result ends up as a DataFrame of (label, closestCluster) rows.
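Note that the snippet above never actually scales the features. A minimal sketch of normalizing before training, using MLlib's StandardScaler from org.apache.spark.mllib.feature, might look like this:

import org.apache.spark.mllib.feature.StandardScaler

// Fit mean/variance statistics on the feature vectors.
// withMean = true requires dense vectors, which is what we build above.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(parsedData)

// Standardize each vector and train on the scaled data instead.
val scaledData = parsedData.map(v => scaler.transform(v)).cache()
val scaledClusters = KMeans.train(scaledData, 3, 20)

If you train on scaled vectors, remember that the prediction UDF must apply the same scaler.transform to incoming points before calling predict.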
Reference: https://stackoverflow.com/questions/31447141/spark-mllib-kmeans-from-dataframe-and-back-again

A condensed version of the same approach:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering._
val rows = data.rdd.map(r => (r.getDouble(1),r.getDouble(2))).cache()
val vectors = rows.map(r => Vectors.dense(r._1, r._2))
val kMeansModel = KMeans.train(vectors, 3, 20)
val predictions = rows.map{r => (r._1, kMeansModel.predict(Vectors.dense(r._1, r._2)))}
val df = predictions.toDF("id", "cluster")
df.show
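One caveat: calling toDF on an RDD of tuples relies on the SQL implicit conversions, which spark-shell imports automatically; in a standalone application you would need to add:

import sqlContext.implicits._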
Create column from RDD
It's very easy to obtain pairs of ids and clusters in the form of an RDD (here data is assumed to have columns id: Int, x: Double, y: Double):
val idPointRDD = data.rdd.map(s => (s.getInt(0), Vectors.dense(s.getDouble(1),s.getDouble(2)))).cache()
val clusters = KMeans.train(idPointRDD.map(_._2), 3, 20)
val clustersRDD = clusters.predict(idPointRDD.map(_._2))
val idClusterRDD = idPointRDD.map(_._1).zip(clustersRDD)
Then you create a DataFrame from that:
val idCluster = idClusterRDD.toDF("id", "cluster")
It works because map doesn't change the order of the data in the RDD, which is why you can just zip the ids with the results of the prediction.
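If you would rather not depend on that ordering at all, a sketch of an equivalent one-pass version keeps each id next to its point, so no zip is needed:

// mapValues keeps the keys, so every id stays paired with its own prediction.
val idClusterRDD2 = idPointRDD.mapValues(point => clusters.predict(point))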
Use UDF (User Defined Function)
The second method involves using the clusters.predict method as a UDF:
val bcClusters = sc.broadcast(clusters)
def predict(x: Double, y: Double): Int = {
bcClusters.value.predict(Vectors.dense(x, y))
}
sqlContext.udf.register("predict", predict _)
Now we can use it to add predictions to data:
val idCluster = data.selectExpr("id", "predict(x, y) as cluster")
Keep in mind that the Spark API doesn't allow UDF deregistration, which means the closure data (here, the broadcast model) will be kept in memory.
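You can at least release the broadcast copy of the model once the UDF is no longer needed; unpersist and destroy are standard methods on Broadcast:

// Drop the cached copies on the executors (the value is re-broadcast if used again).
bcClusters.unpersist()

// Or remove it everywhere, including from the driver; the UDF must not be called after this.
// bcClusters.destroy()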