Implementing the k-means algorithm in Scala

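I won't explain the concepts behind the algorithm in much detail; a quick Google search turns up plenty of material. Here is the code, which is commented in some detail.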

Main program:

 import scala.io.Source
 import scala.util.Random

 /**
  * @author vincent
  */
 object LocalKMeans {
     def main(args: Array[String]): Unit = {
         val fileName = "/home/vincent/kmeans_data.txt"
         val knumbers = 3
         val rand = new Random()

         // Read the text data: one point per line, two whitespace-separated coordinates
         val source = Source.fromFile(fileName)
         val lines = try source.getLines().filter(_.trim.nonEmpty).toVector finally source.close()
         val points = lines.map(line => {
             val parts = line.trim.split("\\s+").map(_.toDouble)
             new Point(parts(0), parts(1))
         })

         // Randomly pick k data points as the initial centroids.
         // Shuffling and taking k avoids picking the same point twice.
         val centroids = rand.shuffle(points).take(knumbers)

         val startTime = System.currentTimeMillis()
         println("initialize centroids:\n" + centroids.mkString("\n") + "\n")
         println("test points: \n" + points.mkString("\n") + "\n")
         val resultCentroids = kmeans(points, centroids, 0.001)
         val endTime = System.currentTimeMillis()
         val runTime = endTime - startTime
         println("run Time: " + runTime + "\nFinal centroids: \n" + resultCentroids.mkString("\n"))
     }

     // Core function of the algorithm
     def kmeans(points: Seq[Point], centroids: Seq[Point], epsilon: Double): Seq[Point] = {
         // Partition the data set into clusters, keyed by each point's closest centroid
         val clusters = points.groupBy(closestCentroid(centroids, _))
         println("clusters: \n" + clusters.mkString("\n") + "\n")

         // Average the points of each cluster to get that cluster's new centroid;
         // a centroid with no assigned points stays where it is
         val newCentroids = centroids.map(oldCentroid => {
             clusters.get(oldCentroid) match {
                 case Some(pointsInCluster) => pointsInCluster.reduceLeft(_ + _) / pointsInCluster.length
                 case None => oldCentroid
             }
         })

         // Compute how far each new centroid moved relative to the old one
         val movement = (centroids zip newCentroids).map({ case (a, b) => a distance b })
         println("Centroids changed by\n" + movement.map(d => "%.3f".format(d)).mkString("(", ", ", ")") +
             "\nto\n" + newCentroids.mkString(", ") + "\n")

         // Keep iterating while any centroid moved more than epsilon, the movement threshold
         if (movement.exists(_ > epsilon))
             kmeans(points, newCentroids, epsilon)
         else
             newCentroids
     }

     // Find the centroid closest to the given point
     def closestCentroid(centroids: Seq[Point], point: Point): Point = {
         centroids.reduceLeft((a, b) => if ((point distance a) < (point distance b)) a else b)
     }
 }
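To make the assign-then-update loop easier to follow, here is a minimal, self-contained sketch of a single iteration and of the epsilon stopping rule, run on four hand-picked points. The names `KMeansStepSketch`, `P`, `step`, `run` and the `maxIterations` cap are assumptions made up for this illustration; they are not part of the program above.

```scala
import scala.annotation.tailrec

object KMeansStepSketch {
  // Hypothetical minimal 2-D point, standing in for the article's Point class.
  final case class P(x: Double, y: Double) {
    def +(that: P): P = P(x + that.x, y + that.y)
    def /(d: Double): P = P(x / d, y / d)
    def distance(that: P): Double = math.hypot(x - that.x, y - that.y)
  }

  // One iteration: assign every point to its nearest centroid, then
  // replace each centroid by the mean of the points assigned to it.
  def step(points: Seq[P], centroids: Seq[P]): Seq[P] = {
    val clusters = points.groupBy(p => centroids.minBy(_.distance(p)))
    centroids.map { c =>
      clusters.get(c) match {
        case Some(members) => members.reduce(_ + _) / members.length
        case None          => c // empty cluster: keep the old centroid
      }
    }
  }

  // Repeat `step` until no centroid moves more than epsilon,
  // or until maxIterations steps have been taken (an added safety cap).
  @tailrec
  def run(points: Seq[P], centroids: Seq[P], epsilon: Double, maxIterations: Int): Seq[P] = {
    val next = step(points, centroids)
    val moved = centroids.zip(next).exists { case (a, b) => a.distance(b) > epsilon }
    if (moved && maxIterations > 1) run(points, next, epsilon, maxIterations - 1) else next
  }

  def main(args: Array[String]): Unit = {
    val points = Seq(P(0, 0), P(0, 2), P(10, 0), P(10, 2))
    val start  = Seq(P(1, 1), P(9, 1))
    println(step(points, start))           // List(P(0.0,1.0), P(10.0,1.0))
    println(run(points, start, 0.001, 10)) // List(P(0.0,1.0), P(10.0,1.0))
  }
}
```

After the first step the two centroids sit at the means of the left and right pairs; a second step moves neither of them, so the loop stops.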

The custom Point class:

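 /**
  * @author vincent
  */
 object Point {
     // Random point in the square [0, 50) x [0, 50)
     def random() = {
         new Point(scala.util.Random.nextDouble() * 50, scala.util.Random.nextDouble() * 50)
     }
 }

 case class Point(x: Double, y: Double) {
     // Component-wise vector addition and subtraction
     def +(that: Point) = new Point(this.x + that.x, this.y + that.y)
     def -(that: Point) = new Point(this.x - that.x, this.y - that.y)
     // Division by a scalar, used when averaging the points of a cluster
     def /(d: Double) = new Point(this.x / d, this.y / d)
     // Euclidean length of the vector from the origin to this point
     def pointLength = math.sqrt(x * x + y * y)
     // Euclidean distance between two points
     def distance(that: Point) = (this - that).pointLength
     override def toString = "(%.3f, %.3f)".format(x, y)
 }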
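As a quick check on how these operators fit together, the sketch below recomputes a distance and a cluster mean by hand. `PointOpsSketch`, `Vec2` and `mean` are hypothetical names used only here; `Vec2` simply mirrors the operators of the `Point` class so that the example runs on its own.

```scala
object PointOpsSketch {
  // Hypothetical stand-in that mirrors the operators of the article's Point class.
  final case class Vec2(x: Double, y: Double) {
    def +(that: Vec2): Vec2 = Vec2(x + that.x, y + that.y)
    def -(that: Vec2): Vec2 = Vec2(x - that.x, y - that.y)
    def /(d: Double): Vec2 = Vec2(x / d, y / d)
    def length: Double = math.sqrt(x * x + y * y)
    def distance(that: Vec2): Double = (this - that).length
  }

  // Mean of a non-empty cluster: sum the points, then divide by their count.
  def mean(ps: Seq[Vec2]): Vec2 = ps.reduceLeft(_ + _) / ps.length

  def main(args: Array[String]): Unit = {
    println(Vec2(0, 0).distance(Vec2(3, 4)))                         // 5.0
    println(mean(Seq(Vec2(0, 0), Vec2(2, 0), Vec2(2, 4), Vec2(0, 4)))) // Vec2(1.0,2.0)
  }
}
```

The same `reduceLeft(_ + _) / length` expression is what `kmeans` uses to compute each new centroid, which is why the class needs `+` and `/` in addition to `distance`.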

Test data set:

12.044996    36.412378
31.881257    33.677009
41.703139    46.170517
43.244406    6.991669
19.319000    27.926669
3.556824    40.935215
29.328655    33.303675
43.702858    22.305344
28.978940    28.905725
10.426760    40.311507
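Depending on how the file was saved, the two columns may be separated by a tab or by several spaces. The sketch below parses such text in a separator-tolerant way by splitting on any run of whitespace; `ParsePointsSketch` and `parse` are hypothetical names, and the input is an in-memory string rather than the file used by the main program.

```scala
object ParsePointsSketch {
  // Parse lines of "x<whitespace>y" into coordinate pairs, skipping blank lines.
  def parse(text: String): Seq[(Double, Double)] =
    text.split("\n").toSeq.map(_.trim).filter(_.nonEmpty).map { line =>
      val parts = line.split("\\s+").map(_.toDouble)
      (parts(0), parts(1))
    }

  def main(args: Array[String]): Unit = {
    // First line uses a tab, second line uses spaces; both parse the same way.
    val sample = "12.044996\t36.412378\n\n31.881257    33.677009\n"
    parse(sample).foreach(println)
  }
}
```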
