K-means Clustering with MapReduce

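I recently looked at a K-means implementation built on MapReduce that I found online. It is a good example: http://blog.csdn.net/jshayzf/article/details/22739063

However, it has very few comments and takes a lot of parameters, which makes it hard for a beginner to follow. So I wrote a simpler example based on my own understanding and commented it in detail.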

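The rough steps are:

1. For each record it reads, the map function compares the record with every center, finds the center the record belongs to, and emits the record with the center's ID as the key and the record itself as the value.

2. The shuffle before reduce groups records with the same key, so each reduce call receives all the records assigned to one center. It computes the mean of those records and outputs it as the new center.

3. Compare the means produced by reduce with the original centers. If they differ, clear the original center file and write the reduce results into it. (The centers are stored in a file on HDFS.)

Delete the reduce output directory so the next run can write to it.

Run the job again.

4. If the means produced by reduce are the same as the original centers, delete the reduce output directory and run one more job without the reduce step, which outputs each record together with the ID of its center.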
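Before reading the Hadoop code, it may help to see the same four steps on a single machine. The following is only a minimal sketch, not the job below: the class name `LocalKMeansSketch`, the `maxIterations` guard, and the use of the squared Euclidean distance are my own assumptions for illustration (the mapper below uses a different distance).

```java
import java.util.Arrays;

final class LocalKMeansSketch {

    // Squared Euclidean distance (an assumption; the Hadoop job uses its own measure).
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int j = 0; j < a.length; j++) {
            double d = a[j] - b[j];
            sum += d * d;
        }
        return sum;
    }

    // Step 1 (map): index of the center closest to the point.
    static int nearest(double[][] centers, double[] point) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double d = squaredDistance(centers[i], point);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    // Step 2 (reduce): average the points assigned to each center.
    static double[][] step(double[][] centers, double[][] points) {
        int k = centers.length;
        int dim = centers[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int c = nearest(centers, p);
            counts[c]++;
            for (int j = 0; j < dim; j++) {
                sums[c][j] += p[j];
            }
        }
        double[][] next = new double[k][];
        for (int i = 0; i < k; i++) {
            if (counts[i] == 0) {
                next[i] = centers[i].clone(); // an empty cluster keeps its old center
                continue;
            }
            next[i] = new double[dim];
            for (int j = 0; j < dim; j++) {
                next[i][j] = sums[i][j] / counts[i];
            }
        }
        return next;
    }

    // Steps 3 and 4 (driver): repeat until the centers stop changing.
    static double[][] run(double[][] centers, double[][] points, int maxIterations) {
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[][] next = step(centers, points);
            if (Arrays.deepEquals(centers, next)) {
                return next;
            }
            centers = next;
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {10, 10}, {10, 12}};
        double[][] centers = run(new double[][]{{0, 0}, {9, 9}}, points, 100);
        System.out.println(Arrays.deepToString(centers));
    }
}
```

In the MapReduce version, `nearest` corresponds to the work done in the mapper, the averaging in `step` to the reducer, and the loop in `run` to the `while(true)` loop in `main`.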

 package MyKmeans;

 import java.io.IOException;

 import java.util.ArrayList;

 import org.apache.hadoop.conf.Configuration;

 import org.apache.hadoop.fs.Path;

 import org.apache.hadoop.io.Text;

 import java.util.Arrays;

 import java.util.Iterator;

 import org.apache.hadoop.io.IntWritable;

 import org.apache.hadoop.io.LongWritable;

 import org.apache.hadoop.mapreduce.Job;

 import org.apache.hadoop.mapreduce.Mapper;

 import org.apache.hadoop.mapreduce.Reducer;

 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class MapReduce {

     public static class Map extends Mapper<LongWritable, Text, IntWritable, Text>{

         // the set of cluster centers

         ArrayList<ArrayList<Double>> centers = null;

         // number of centers (k)

         int k = 0;

         // load the centers from HDFS once, before any record is mapped

         protected void setup(Context context) throws IOException,

                 InterruptedException {

             centers = Utils.getCentersFromHDFS(context.getConfiguration().get("centersPath"),false);

             k = centers.size();

         }

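         /**

          * 1. Compare each record with every center and assign it to the closest one.

          * 2. Emit the center ID as the key and the record as the value (for example "1 0.2": 1 is the ID of the cluster center and 0.2 is a value close to that center).

          */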

         protected void map(LongWritable key, Text value, Context context)

                 throws IOException, InterruptedException {

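             // parse one line of input into its numeric fields

             ArrayList<Double> fileds = Utils.textToArray(value);

             int sizeOfFileds = fileds.size();

             // start from the largest double so that any real distance is smaller

             double minDistance = Double.MAX_VALUE;

             int centerIndex = 0;

             // compare the current record with each of the k centers in turn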

             for(int i=0;i<k;i++){

                 double currentDistance = 0;

                 for(int j=0;j<sizeOfFileds;j++){

                     double centerPoint = Math.abs(centers.get(i).get(j));

                     double filed = Math.abs(fileds.get(j));

                     currentDistance += Math.pow((centerPoint - filed) / (centerPoint + filed), 2);

                 }

                 // keep the index of the center closest to this record

                 if(currentDistance<minDistance){

                     minDistance = currentDistance;

                     centerIndex = i;

                 }

             }

             // emit the record unchanged, keyed by the ID of its center (IDs start at 1)

             context.write(new IntWritable(centerIndex+1), value);

         }

     }

     // the shuffle groups records by center ID, so each reduce call sees the records of one cluster

     public static class Reduce extends Reducer<IntWritable, Text, Text, Text>{

         /**

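          * 1. The key is the ID of a cluster center; the values are the records assigned to it.

          * 2. Compute the mean of every field over those records to get the new center.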

          */

         protected void reduce(IntWritable key, Iterable<Text> value,Context context)

                 throws IOException, InterruptedException {

             ArrayList<ArrayList<Double>> filedsList = new ArrayList<ArrayList<Double>>();

             // read the records one by one; each line becomes an ArrayList<Double>

             for(Iterator<Text> it =value.iterator();it.hasNext();){

                 ArrayList<Double> tempList = Utils.textToArray(it.next());

                 filedsList.add(tempList);

             }

             // compute the new center

             // number of fields per line

             int filedSize = filedsList.get(0).size();

             double[] avg = new double[filedSize];

             for(int i=0;i<filedSize;i++){

                 // compute the mean of each column

                 double sum = 0;

                 int size = filedsList.size();

                 for(int j=0;j<size;j++){

                     sum += filedsList.get(j).get(i);

                 }

                 avg[i] = sum / size;

             }

             context.write(new Text("") , new Text(Arrays.toString(avg).replace("[", "").replace("]", "")));

         }

     }

     @SuppressWarnings("deprecation")

     public static void run(String centerPath,String dataPath,String newCenterPath,boolean runReduce) throws IOException, ClassNotFoundException, InterruptedException{

         Configuration conf = new Configuration();

         conf.set("centersPath", centerPath);

         Job job = new Job(conf, "mykmeans");

         job.setJarByClass(MapReduce.class);

         job.setMapperClass(Map.class);

         job.setMapOutputKeyClass(IntWritable.class);

         job.setMapOutputValueClass(Text.class);

         if(runReduce){

             // the final output pass (runReduce == false) does not need the reducer

             job.setReducerClass(Reduce.class);

             job.setOutputKeyClass(Text.class);

             job.setOutputValueClass(Text.class);

         }

         FileInputFormat.addInputPath(job, new Path(dataPath));

         FileOutputFormat.setOutputPath(job, new Path(newCenterPath));

         System.out.println(job.waitForCompletion(true));

     }

     public static void main(String[] args) throws ClassNotFoundException, IOException, InterruptedException {

         String centerPath = "hdfs://localhost:9000/input/centers.txt";

         String dataPath = "hdfs://localhost:9000/input/wine.txt";

         String newCenterPath = "hdfs://localhost:9000/out/kmean";

         int count = 0;

         while(true){

             run(centerPath,dataPath,newCenterPath,true);

             System.out.println("Iteration " + ++count + " finished");

             if(Utils.compareCenters(centerPath,newCenterPath )){

                 run(centerPath,dataPath,newCenterPath,false);

                 break;

             }

         }

     }

 }
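The distance that `Map.map` accumulates is not the usual Euclidean distance. Each field contributes ((|c| - |x|) / (|c| + |x|))^2, a value between 0 and 1, so a column with large values cannot dominate the others. The standalone sketch below reproduces that formula so it can be tried without Hadoop. The class name `RelativeDistanceSketch` and the guard for a zero denominator are my own additions: in the mapper, a field where both the center and the record are 0 evaluates 0/0, which is NaN, and a NaN distance never passes the `currentDistance<minDistance` test.

```java
final class RelativeDistanceSketch {

    // Sum over all fields of ((|c| - |x|) / (|c| + |x|))^2, as in Map.map.
    // Assumption: a field where both values are 0 contributes 0 instead of NaN.
    static double distance(double[] center, double[] record) {
        double sum = 0;
        for (int j = 0; j < record.length; j++) {
            double c = Math.abs(center[j]);
            double x = Math.abs(record[j]);
            double denominator = c + x;
            if (denominator == 0) {
                continue; // 0/0 would be NaN and would poison the whole sum
            }
            double ratio = (c - x) / denominator;
            sum += ratio * ratio;
        }
        return sum;
    }

    public static void main(String[] args) {
        // (1 - 3) / 4 and (3 - 1) / 4, each squared: 0.25 + 0.25
        System.out.println(distance(new double[]{1, 3}, new double[]{3, 1}));
        // identical vectors, including a 0 field that the guard skips
        System.out.println(distance(new double[]{0, 2}, new double[]{0, 2}));
    }
}
```

Because of the `Math.abs` calls, the measure also treats -3 and 3 as the same value, which is harmless for non-negative data but worth knowing for other datasets.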

 package MyKmeans;

 import java.io.IOException;

 import java.util.ArrayList;

 import java.util.List;

 import org.apache.hadoop.conf.Configuration;

 import org.apache.hadoop.fs.FSDataInputStream;

 import org.apache.hadoop.fs.FSDataOutputStream;

 import org.apache.hadoop.fs.FileStatus;

 import org.apache.hadoop.fs.FileSystem;

 import org.apache.hadoop.fs.Path;

 import org.apache.hadoop.io.IOUtils;

 import org.apache.hadoop.io.Text;

 import org.apache.hadoop.util.LineReader;

 public class Utils {

     // read the centers from a file on HDFS (or from every file in a directory when isDirectory is true)

     public static ArrayList<ArrayList<Double>> getCentersFromHDFS(String centersPath,boolean isDirectory) throws IOException{

         ArrayList<ArrayList<Double>> result = new ArrayList<ArrayList<Double>>();

         Path path = new Path(centersPath);

         Configuration conf = new Configuration();

         FileSystem fileSystem = path.getFileSystem(conf);

         if(isDirectory){

             FileStatus[] listFile = fileSystem.listStatus(path);

             for (int i = 0; i < listFile.length; i++) {

                 result.addAll(getCentersFromHDFS(listFile[i].getPath().toString(),false));

             }

             return result;

         }

         FSDataInputStream fsis = fileSystem.open(path);

         LineReader lineReader = new LineReader(fsis, conf);

         Text line = new Text();

         while(lineReader.readLine(line) > 0){

             ArrayList<Double> tempList = textToArray(line);

             result.add(tempList);

         }

         lineReader.close();

         return result;

     }

     // delete a file or directory (recursively)

     public static void deletePath(String pathStr) throws IOException{

         Configuration conf = new Configuration();

         Path path = new Path(pathStr);

         FileSystem hdfs = path.getFileSystem(conf);

         hdfs.delete(path ,true);

     }

     public static ArrayList<Double> textToArray(Text text){

         ArrayList<Double> list = new ArrayList<Double>();

         String[] fileds = text.toString().split(",");

         for(int i=0;i<fileds.length;i++){

             list.add(Double.parseDouble(fileds[i]));

         }

         return list;

     }

     public static boolean compareCenters(String centerPath,String newPath) throws IOException{

         List<ArrayList<Double>> oldCenters = Utils.getCentersFromHDFS(centerPath,false);

         List<ArrayList<Double>> newCenters = Utils.getCentersFromHDFS(newPath,true);

         int size = oldCenters.size();

         int fildSize = oldCenters.get(0).size();

         double distance = 0;

         for(int i=0;i<size;i++){

             for(int j=0;j<fildSize;j++){

                 double t1 = Math.abs(oldCenters.get(i).get(j));

                 double t2 = Math.abs(newCenters.get(i).get(j));

                 distance += Math.pow((t1 - t2) / (t1 + t2), 2);

             }

         }

         if(distance == 0.0){

             // delete the new center output so the final assignment pass can write to it

             Utils.deletePath(newPath);

             return true;

         }else{

             // clear the center file, copy the new centers into it, then delete the new center output

             Configuration conf = new Configuration();

             Path outPath = new Path(centerPath);

             FileSystem fileSystem = outPath.getFileSystem(conf);

             FSDataOutputStream overWrite = fileSystem.create(outPath,true);

             overWrite.writeChars("");

             overWrite.close();

             Path inPath = new Path(newPath);

             FileStatus[] listFiles = fileSystem.listStatus(inPath);

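             // open the center file once; calling create() inside the loop would overwrite it for every part file

             FSDataOutputStream out = fileSystem.create(outPath, true);

             for (int i = 0; i < listFiles.length; i++) {

                 FSDataInputStream in = fileSystem.open(listFiles[i].getPath());

                 // 'false' keeps 'out' open so every part file ends up in the same center file

                 IOUtils.copyBytes(in, out, 4096, false);

                 in.close();

             }

             out.close();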

             // delete the new center output so the next run can write to it

             Utils.deletePath(newPath);

         }

         return false;

     }

 }
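One detail that is easy to miss: the reducer writes an empty `Text` key, so, as far as I can tell, each line of the reduce output starts with a tab (the default key/value separator of the text output format), and `Arrays.toString` separates the values with a comma followed by a space. `Utils.textToArray` still parses such a line because `Double.parseDouble` ignores leading and trailing whitespace. The sketch below shows the round trip on plain strings; the class name `CenterLineSketch` and its two methods are hypothetical stand-ins for the reducer's formatting and for `textToArray`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CenterLineSketch {

    // Same formatting as the reducer: "[13.5, 2.0]" becomes "13.5, 2.0".
    static String format(double[] avg) {
        return Arrays.toString(avg).replace("[", "").replace("]", "");
    }

    // Same parsing as Utils.textToArray, but on a String instead of a Text.
    static List<Double> parse(String line) {
        List<Double> fields = new ArrayList<>();
        for (String field : line.split(",")) {
            // parseDouble trims whitespace, so "\t13.5" and " 2.0" are both accepted
            fields.add(Double.parseDouble(field));
        }
        return fields;
    }

    public static void main(String[] args) {
        // empty key + tab separator + formatted value (assumed output line layout)
        String line = "\t" + format(new double[]{13.5, 2.0});
        System.out.println(parse(line));
    }
}
```

This also means the initial center file can use either "a,b,c" or "a, b, c"; both forms parse to the same values.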

Dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The results can be compared with those from http://blog.csdn.net/jshayzf/article/details/22739063 (provided the initial centers are the same).
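A note on preparing the input: in the wine.data file linked above, the first column of each line is the class label (1, 2 or 3), followed by the 13 measured attributes. The job above treats every column as a feature, so you may want to drop the label before uploading the file as wine.txt. The sketch below is one possible way to do that and to take the first k rows as initial centers. The class name `WineInputSketch`, both method names, and the sample values are made up for illustration, and picking the first k rows is just one simple choice, not necessarily what the referenced post does.

```java
import java.util.ArrayList;
import java.util.List;

final class WineInputSketch {

    // Remove everything up to and including the first comma (the class label).
    static String dropLabel(String line) {
        int comma = line.indexOf(',');
        return line.substring(comma + 1); // a line without a comma is returned unchanged
    }

    // Take the first k feature lines as initial centers (a simple, hypothetical choice).
    static List<String> firstK(List<String> featureLines, int k) {
        return new ArrayList<>(featureLines.subList(0, Math.min(k, featureLines.size())));
    }

    public static void main(String[] args) {
        // made-up rows in the same "label,feature,feature,..." layout
        List<String> raw = List.of("1,14.2,1.7", "2,12.4,0.9", "3,12.9,1.4");
        List<String> features = new ArrayList<>();
        for (String line : raw) {
            features.add(dropLabel(line));
        }
        System.out.println(features);            // goes into wine.txt
        System.out.println(firstK(features, 2)); // goes into centers.txt
    }
}
```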
