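I. Using the Mahout command

The synthetic control dataset synthetic_control.data is available for download. It consists of 600 rows × 60 columns of double values: 600 tuples, each of which is a time series.

1. Copy the data to the cluster, into the kmeans/ directory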

hadoop fs -put synthetic_control.data kmeans/synthetic_control.data

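2. Run the mahout command to perform the KMeans clustering analysis

When the command includes the --numClusters option (the number of clusters in the result), KMeans clustering is used directly. When the option is absent, Canopy clustering runs first to produce the initial clusters; -t1 and -t2 are the configuration parameters for Canopy clustering.

II. Studying the source code

The Mahout source shows that a KMeans clustering run goes through four steps, in this order:

Preprocess the data, converting it into a normalized form
Randomly pick several records from that data to serve as the cluster centers
Iterate, adjusting the centroids
Assign the data to the clusters

The first two steps are the preparation for the KMeans algorithm itself.

The overall flow can be seen in the org.apache.mahout.clustering.syntheticcontrol.kmeans.Job#run() method.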

  public static void run(Configuration conf, Path input, Path output, DistanceMeasure measure, int k,

      double convergenceDelta, int maxIterations) throws Exception {

    //1. Convert synthetic_control.data from its text format to key/value format and store it under output/data. The key is a Text holding an Integer; the value is a VectorWritable.

    Path directoryContainingConvertedInput = new Path(output, DIRECTORY_CONTAINING_CONVERTED_INPUT);

    log.info("Preparing Input");

    InputDriver.runJob(input, directoryContainingConvertedInput, "org.apache.mahout.math.RandomAccessSparseVector");

    //2. Randomly generate k clusters and store them in output/clusters-0/part-randomSeed. The key is a Text; the value is a ClusterWritable.

    log.info("Running random seed to get initial clusters");

    Path clusters = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);

    clusters = RandomSeedGenerator.buildRandom(conf, directoryContainingConvertedInput, clusters, k, measure);

    //3. Run the clustering iterations, recomputing the centroid of every cluster

    log.info("Running KMeans");

    KMeansDriver.run(conf, directoryContainingConvertedInput, clusters, output, measure, convergenceDelta,

        maxIterations, true, 0.0, false);

    //4. Using the centers chosen above, assign every record in output/data to a cluster. Then print the result, converting the SequenceFile format to text for display

    // run ClusterDumper

    ClusterDumper clusterDumper = new ClusterDumper(new Path(output, "clusters-*-final"), new Path(output,

        "clusteredPoints"));

    clusterDumper.printClusters(null);

  }

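RandomAccessSparseVector is a Vector implementation backed by an OpenIntDoubleHashMap field. That map does not extend java.util.HashMap; it is Mahout's own hash map, which keeps its data in primitive arrays (an int array for the keys and a double array for the values), so it cannot be traversed with a java.util.Map-style Iterator.

RandomSeedGenerator#buildRandom() randomly samples k vectors from the converted input to serve as the initial clusters (Kluster objects). It uses reservoir sampling: the first k items go straight into the reservoir; the (k+1)-th item is swapped in with probability k/(k+1), and in general the i-th item (i > k) with probability k/i, so that each of the n items ends up in the reservoir with the same probability k/n. The random numbers come from the MersenneTwister pseudo-random generator, which has a period of 2^19937 - 1 and is equidistributed in up to 623 dimensions to 32-bit accuracy.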
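
The reservoir sampling idea can be sketched in a few lines of plain Java. This is an illustration only, not Mahout's code: the class name ReservoirSample, the method sample, and the use of java.util.Random in place of MersenneTwister are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout: uniform sampling of k items in one pass.
final class ReservoirSample {

  /** Returns k items chosen uniformly at random (all items if there are fewer than k). */
  static <T> List<T> sample(Iterable<T> items, int k, Random random) {
    List<T> reservoir = new ArrayList<>();
    int seen = 0;
    for (T item : items) {
      seen++;
      if (reservoir.size() < k) {
        // The first k items fill the reservoir unconditionally.
        reservoir.add(item);
      } else {
        // Item number `seen` is kept with probability k / seen,
        // replacing a uniformly chosen resident of the reservoir.
        int slot = random.nextInt(seen);
        if (slot < k) {
          reservoir.set(slot, item);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<Integer> rowIds = new ArrayList<>();
    for (int i = 0; i < 600; i++) {
      rowIds.add(i);
    }
    // Pick 6 of the 600 row ids as initial cluster seeds.
    System.out.println(sample(rowIds, 6, new Random()));
  }
}
```

Because the input is read exactly once and only k items are held in memory, the same pattern works when the input is a large SequenceFile rather than an in-memory list.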

1. Preparing for the iterations

The code that actually performs the KMeans clustering is:

  public static Path buildClusters(Configuration conf, Path input, Path clustersIn, Path output,

      DistanceMeasure measure, int maxIterations, String delta, boolean runSequential) throws IOException,

      InterruptedException, ClassNotFoundException {

    double convergenceDelta = Double.parseDouble(delta);

    //Read the cluster data from output/clusters-0/part-randomSeed into the clusters variable.

    List<Cluster> clusters = Lists.newArrayList();

    KMeansUtil.configureWithClusterInfo(conf, clustersIn, clusters);

    if (clusters.isEmpty()) {

      throw new IllegalStateException("No input clusters found in " + clustersIn + ". Check your -c argument.");

    }

    //Write the clustering policy (which controls the convergence threshold) to output/clusters-0/_policy

    //At the same time, write one part-000xx file per cluster under output/clusters-0/

    Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);

    ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);

    ClusterClassifier prior = new ClusterClassifier(clusters, policy);

    prior.writeToSeqFiles(priorClustersPath);

    //Run up to maxIterations rounds of Map/Reduce

    if (runSequential) {

      ClusterIterator.iterateSeq(conf, input, priorClustersPath, output, maxIterations);

    } else {

      ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);

    }

    return output;

  }

2. Iterative computation

The code of the job that adjusts the cluster centers is:

  public static void iterateMR(Configuration conf, Path inPath, Path priorPath, Path outPath, int numIterations)

    throws IOException, InterruptedException, ClassNotFoundException {

    ClusteringPolicy policy = ClusterClassifier.readPolicy(priorPath);

    Path clustersOut = null;

    int iteration = 1;

    while (iteration <= numIterations) {

      conf.set(PRIOR_PATH_KEY, priorPath.toString());

      String jobName = "Cluster Iterator running iteration " + iteration + " over priorPath: " + priorPath;

      Job job = new Job(conf, jobName);

      job.setMapOutputKeyClass(IntWritable.class);

      job.setMapOutputValueClass(ClusterWritable.class);

      job.setOutputKeyClass(IntWritable.class);

      job.setOutputValueClass(ClusterWritable.class);

      job.setInputFormatClass(SequenceFileInputFormat.class);

      job.setOutputFormatClass(SequenceFileOutputFormat.class);

      //The core algorithm lives in CIMapper and CIReducer

      job.setMapperClass(CIMapper.class);

      job.setReducerClass(CIReducer.class);

      FileInputFormat.addInputPath(job, inPath);

      clustersOut = new Path(outPath, Cluster.CLUSTERS_DIR + iteration);

      priorPath = clustersOut;

      FileOutputFormat.setOutputPath(job, clustersOut);

      job.setJarByClass(ClusterIterator.class);

      if (!job.waitForCompletion(true)) {

        throw new InterruptedException("Cluster Iteration " + iteration + " failed processing " + priorPath);

      }

      ClusterClassifier.writePolicy(policy, clustersOut);

      FileSystem fs = FileSystem.get(outPath.toUri(), conf);

      iteration++;

      if (isConverged(clustersOut, conf, fs)) {

        break;

      }

    }

    //Rename the output directory of the last iteration, appending the final suffix

    Path finalClustersIn = new Path(outPath, Cluster.CLUSTERS_DIR + (iteration - 1) + Cluster.FINAL_ITERATION_SUFFIX);

    FileSystem.get(clustersOut.toUri(), conf).rename(clustersOut, finalClustersIn);

  }
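
The control flow of that loop (run a pass, increment the counter, stop early once the centers have stopped moving) can be sketched without Hadoop. The example below is a simplified, single-machine stand-in on one-dimensional data; the class IterateUntilConverged and its iterate method are hypothetical names, and the convergence test (every center moved by at most convergenceDelta) is an assumption about what the policy checks.

```java
// Hypothetical single-machine sketch of the iterate-until-converged loop; not Mahout code.
final class IterateUntilConverged {

  /** Updates centers in place and returns the number of passes actually run. */
  static int iterate(double[] points, double[] centers, double convergenceDelta, int maxIterations) {
    int iteration = 1;
    while (iteration <= maxIterations) {
      double[] sum = new double[centers.length];
      int[] count = new int[centers.length];
      // "Map" side: assign every point to its nearest center and accumulate.
      for (double p : points) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
          if (Math.abs(p - centers[c]) < Math.abs(p - centers[best])) {
            best = c;
          }
        }
        sum[best] += p;
        count[best]++;
      }
      // "Reduce" side: recompute each center and record whether any of them moved.
      boolean converged = true;
      for (int c = 0; c < centers.length; c++) {
        if (count[c] == 0) {
          continue; // an empty cluster keeps its previous center
        }
        double updated = sum[c] / count[c];
        if (Math.abs(updated - centers[c]) > convergenceDelta) {
          converged = false;
        }
        centers[c] = updated;
      }
      iteration++;
      if (converged) {
        break;
      }
    }
    return iteration - 1;
  }

  public static void main(String[] args) {
    double[] points = {1, 2, 3, 10, 11, 12};
    double[] centers = {1, 10};
    int passes = iterate(points, centers, 0.001, 10);
    System.out.println(passes + " passes, centers " + centers[0] + " and " + centers[1]);
  }
}
```

As in iterateMR, the counter is incremented before the convergence check, so the index of the last completed pass is iteration - 1, which is why the final directory name uses that value.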

2.1. Map phase

The CIMapper code is:

 @Override

  protected void map(WritableComparable<?> key, VectorWritable value, Context context) throws IOException,

      InterruptedException {

    Vector probabilities = classifier.classify(value.get());

    Vector selections = policy.select(probabilities);

    for (Iterator<Element> it = selections.iterateNonZero(); it.hasNext();) {

      Element el = it.next();

      classifier.train(el.index(), value.get(), el.get());

    }

  }

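Two classes need to be told apart here: org.apache.mahout.clustering.iterator.KMeansClusteringPolicy and org.apache.mahout.clustering.classify.ClusterClassifier.

The former is the clustering policy; it can be said to supply the core of the clustering algorithm.

The latter is the cluster classifier; its job is to classify the data according to the clustering policy.

2.1.1. ClusterClassifier: distance from a point to the cluster centroids

ClusterClassifier.classify() scores one point against every cluster center. The result is a vector with one entry per cluster, each entry derived from the point's distance to that cluster's center.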

@Override

  public Vector classify(Vector data, ClusterClassifier prior) {

    List<Cluster> models = prior.getModels();

    int i = 0;

    Vector pdfs = new DenseVector(models.size());

    for (Cluster model : models) {

      pdfs.set(i++, model.pdf(new VectorWritable(data)));

    }

    return pdfs.assign(new TimesFunction(), 1.0 / pdfs.zSum());

  }

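In the code above, org.apache.mahout.clustering.iterator.DistanceMeasureCluster.pdf(VectorWritable) takes the distance d from the point to the cluster centroid and turns it into a similarity 1 / (1 + d), so a smaller distance gives a larger value. The code is: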

@Override

  public double pdf(VectorWritable vw) {

    return 1 / (1 + measure.distance(vw.get(), getCenter()));

  }

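After each iteration the centroid is recomputed by AbstractCluster.computeParameters.

pdfs.zSum() is the sum of the entries of pdfs; multiplying every entry by 1.0 / pdfs.zSum() normalizes the vector so that its entries add up to 1.

Finally, select() picks the index of the cluster with the highest similarity and gives it a weight of 1.0: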

@Override

  public Vector select(Vector probabilities) {

    int maxValueIndex = probabilities.maxValueIndex();

    Vector weights = new SequentialAccessSparseVector(probabilities.size());

    weights.set(maxValueIndex, 1.0);

    return weights;

  }
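
To make the classify-then-select chain concrete, here is a small self-contained sketch on plain double arrays. It is not Mahout's implementation: the class NearestClusterDemo is a hypothetical name, and Euclidean distance is assumed as the distance measure.

```java
// Hypothetical sketch of pdf -> normalize -> select on plain arrays; not Mahout code.
final class NearestClusterDemo {

  /** Similarity in (0, 1]: exactly 1 at distance 0, falling toward 0 as distance grows. */
  static double pdf(double[] point, double[] center) {
    double sumOfSquares = 0.0;
    for (int i = 0; i < point.length; i++) {
      double diff = point[i] - center[i];
      sumOfSquares += diff * diff;
    }
    return 1.0 / (1.0 + Math.sqrt(sumOfSquares)); // Euclidean distance is an assumption
  }

  /** One similarity per cluster, normalized so that the entries sum to 1. */
  static double[] classify(double[] point, double[][] centers) {
    double[] pdfs = new double[centers.length];
    double sum = 0.0;
    for (int i = 0; i < centers.length; i++) {
      pdfs[i] = pdf(point, centers[i]);
      sum += pdfs[i];
    }
    for (int i = 0; i < pdfs.length; i++) {
      pdfs[i] /= sum;
    }
    return pdfs;
  }

  /** Index of the largest entry, i.e. the nearest cluster. */
  static int select(double[] probabilities) {
    int best = 0;
    for (int i = 1; i < probabilities.length; i++) {
      if (probabilities[i] > probabilities[best]) {
        best = i;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    double[][] centers = {{0, 0}, {1, 2}, {5, 5}};
    double[] point = {1, 1};
    System.out.println("nearest cluster index: " + select(classify(point, centers)));
  }
}
```

Normalization does not change which entry is largest, so for KMeans the selected cluster is simply the one whose center is closest to the point.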

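2.1.2. ClusterClassifier: preparing for the new cluster centroids

Next, in order to obtain the new centers, org.apache.mahout.clustering.classify.ClusterClassifier.train(int, Vector, double) feeds the point to the selected cluster as training data. The actual accumulation happens in AbstractCluster, where S0 collects the total weight, S1 the weighted sum of the points, and S2 the weighted sum of their squares: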

public void observe(Vector x, double weight) {

    if (weight == 1.0) {

      observe(x);

    } else {

      setS0(getS0() + weight);

      Vector weightedX = x.times(weight);

      if (getS1() == null) {

        setS1(weightedX);

      } else {

        getS1().assign(weightedX, Functions.PLUS);

      }

      Vector x2 = x.times(x).times(weight);

      if (getS2() == null) {

        setS2(x2);

      } else {

        getS2().assign(x2, Functions.PLUS);

      }

    }

  }

2.2. Reduce phase

In CIReducer, the partial results that belong to the same cluster are merged, and the centroid is computed.

@Override

  protected void reduce(IntWritable key, Iterable<ClusterWritable> values, Context context) throws IOException,

      InterruptedException {

    Iterator<ClusterWritable> iter = values.iterator();

    Cluster first = iter.next().getValue(); // there must always be at least one

    while (iter.hasNext()) {

      Cluster cluster = iter.next().getValue();

      first.observe(cluster);

    }

    List<Cluster> models = Lists.newArrayList();

    models.add(first);

    classifier = new ClusterClassifier(models, policy);

    classifier.close();

    context.write(key, new ClusterWritable(first));

  }

2.2.1. The centroid computation in the Reduce phase

The code that computes the centroid is:

@Override

  public void computeParameters() {

    if (getS0() == 0) {

      return;

    }

    setNumObservations((long) getS0());

    setTotalObservations(getTotalObservations() + getNumObservations());

    setCenter(getS1().divide(getS0()));

    // compute the component stds

    if (getS0() > 1) {

      setRadius(getS2().times(getS0()).minus(getS1().times(getS1())).assign(new SquareRootFunction()).divide(getS0()));

    }

    setS0(0);

    setS1(center.like());

    setS2(center.like());

  }
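
The running sums are enough to recover both the centroid and the per-dimension radius: the center is S1 / S0, and sqrt(S2 * S0 - S1 * S1) / S0 equals sqrt(S2 / S0 - (S1 / S0)^2), the per-dimension standard deviation. The sketch below mirrors that arithmetic on plain arrays; RunningStats is a hypothetical class, not Mahout's AbstractCluster, and the clamp to zero under the square root is an addition to guard against rounding.

```java
// Hypothetical sketch of the S0/S1/S2 bookkeeping; not Mahout's AbstractCluster.
final class RunningStats {
  private double s0;          // total weight observed
  private final double[] s1;  // weighted sum of x, per dimension
  private final double[] s2;  // weighted sum of x squared, per dimension

  RunningStats(int dimensions) {
    s1 = new double[dimensions];
    s2 = new double[dimensions];
  }

  /** Map side: fold one weighted point into the sums. */
  void observe(double[] x, double weight) {
    s0 += weight;
    for (int i = 0; i < x.length; i++) {
      s1[i] += weight * x[i];
      s2[i] += weight * x[i] * x[i];
    }
  }

  /** Reduce side: merge the partial sums of another instance of the same cluster. */
  void observe(RunningStats other) {
    s0 += other.s0;
    for (int i = 0; i < s1.length; i++) {
      s1[i] += other.s1[i];
      s2[i] += other.s2[i];
    }
  }

  /** Centroid S1 / S0; callers should check that some weight was observed first. */
  double[] center() {
    double[] center = new double[s1.length];
    for (int i = 0; i < s1.length; i++) {
      center[i] = s1[i] / s0;
    }
    return center;
  }

  /** Per-dimension standard deviation, sqrt(S2 * S0 - S1 * S1) / S0. */
  double[] radius() {
    double[] radius = new double[s1.length];
    for (int i = 0; i < s1.length; i++) {
      double scaledVariance = s2[i] * s0 - s1[i] * s1[i];
      radius[i] = Math.sqrt(Math.max(scaledVariance, 0.0)) / s0; // clamp is an addition
    }
    return radius;
  }

  public static void main(String[] args) {
    RunningStats partA = new RunningStats(2);
    partA.observe(new double[] {1, 2}, 1.0);
    partA.observe(new double[] {3, 4}, 1.0);
    RunningStats partB = new RunningStats(2);
    partB.observe(new double[] {5, 6}, 1.0);
    partA.observe(partB); // what the reducer does with partial clusters
    System.out.println(java.util.Arrays.toString(partA.center()));
    System.out.println(java.util.Arrays.toString(partA.radius()));
  }
}
```

Because the three sums are additive, mappers can accumulate them independently and the reducer only has to add them up, which is what makes the centroid update fit the Map/Reduce model.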

3. Clustering the data

The code that actually assigns the records in output/data to the clusters is:

 private static void classifyClusterMR(Configuration conf, Path input, Path clustersIn, Path output,

      Double clusterClassificationThreshold, boolean emitMostLikely) throws IOException, InterruptedException,

      ClassNotFoundException {

    conf.setFloat(ClusterClassificationConfigKeys.OUTLIER_REMOVAL_THRESHOLD,

                  clusterClassificationThreshold.floatValue());

    conf.setBoolean(ClusterClassificationConfigKeys.EMIT_MOST_LIKELY, emitMostLikely);

    conf.set(ClusterClassificationConfigKeys.CLUSTERS_IN, clustersIn.toUri().toString());

    Job job = new Job(conf, "Cluster Classification Driver running over input: " + input);

    job.setJarByClass(ClusterClassificationDriver.class);

    job.setInputFormatClass(SequenceFileInputFormat.class);

    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    //Assign each record to a cluster

    job.setMapperClass(ClusterClassificationMapper.class);

    job.setNumReduceTasks(0);

    job.setOutputKeyClass(IntWritable.class);

    job.setOutputValueClass(WeightedVectorWritable.class);

    FileInputFormat.addInputPath(job, input);

    FileOutputFormat.setOutputPath(job, output);

    if (!job.waitForCompletion(true)) {

      throw new InterruptedException("Cluster Classification Driver Job failed processing " + input);

    }

  }

Excerpted from: http://zcdeng.iteye.com/blog/1859711
