不多说,直接上干货!

  首先,别在windows下搭建什么,安装什么Cygwin啊!直接在linux,对于企业里推荐用CentOS6.5,在学校里用Ubuntu。

Mahout安装所需软件清单:

软件        版本          说明

操作系统    CentOS6.5        64位

JDK      jdk1.7.0_79        

Hadoop      2.6.0          

Mahout     mahout-distribution-0.8

  为什么采用这个版本,而不是0.9及其以后的版本,是因为差别有点大,比如fpg关联规则算法。以及网上参考资料少

  说在前面的话,

  关于Mahout的安装配置,这里介绍两种方式:其一,下载源码(直接下载源码或者通过svn下载源码都可以),然后使用Maven进行编译;其二,下载完整包进行解压缩。这里我使用的是完整包进行解压缩安装。

一、 mahout-distribution-0.8.tar.gz的下载

http://archive.apache.org/dist/mahout/0.8/

  我这里,以稳定版本mahout-0.9为例

  当然,这里也可以使用wget命令在线下载,很简单,不多说。

二、 mahout-distribution-0.8.tar.gz的安装

  1、先新建好目录

  我一般喜欢在/usr/loca/下新建

[root@djt002 local]# pwd
/usr/local
[root@djt002 local]# ll
total
drwxr-xr-x. root root Sep bin
drwxr-xr-x. hadoop hadoop Mar : data
drwxr-xr-x. hadoop hadoop Feb : elasticsearch
drwxr-xr-x. root root Sep etc
drwxr-xr-x. hadoop hadoop Jan : flume
drwxr-xr-x. root root Sep games
drwxr-xr-x. hadoop hadoop Jan : hadoop
drwxr-xr-x. hadoop hadoop Mar : hbase
drwxr-xr-x. hadoop hadoop Mar : hive
drwxr-xr-x. root root Sep include
drwxr-xr-x. hadoop hadoop Jan : jdk
drwxr-xr-x. root root Sep lib
drwxr-xr-x. root root Sep lib64
drwxr-xr-x. root root Sep libexec
drwxr-xr-x. root root Sep sbin
drwxr-xr-x. root root Jan : share
drwxr-xr-x. hadoop hadoop Mar : sqoop
drwxr-xr-x. root root Sep src
[root@djt002 local]# mkdir mahout
[root@djt002 local]# ll
total
drwxr-xr-x. root root Sep bin
drwxr-xr-x. hadoop hadoop Mar : data
drwxr-xr-x. hadoop hadoop Feb : elasticsearch
drwxr-xr-x. root root Sep etc
drwxr-xr-x. hadoop hadoop Jan : flume
drwxr-xr-x. root root Sep games
drwxr-xr-x. hadoop hadoop Jan : hadoop
drwxr-xr-x. hadoop hadoop Mar : hbase
drwxr-xr-x. hadoop hadoop Mar : hive
drwxr-xr-x. root root Sep include
drwxr-xr-x. hadoop hadoop Jan : jdk
drwxr-xr-x. root root Sep lib
drwxr-xr-x. root root Sep lib64
drwxr-xr-x. root root Sep libexec
drwxr-xr-x root root Apr : mahout
drwxr-xr-x. root root Sep sbin
drwxr-xr-x. root root Jan : share
drwxr-xr-x. hadoop hadoop Mar : sqoop
drwxr-xr-x. root root Sep src
[root@djt002 local]# chown -R hadoop:hadoop mahout
[root@djt002 local]# ll
total
drwxr-xr-x. root root Sep bin
drwxr-xr-x. hadoop hadoop Mar : data
drwxr-xr-x. hadoop hadoop Feb : elasticsearch
drwxr-xr-x. root root Sep etc
drwxr-xr-x. hadoop hadoop Jan : flume
drwxr-xr-x. root root Sep games
drwxr-xr-x. hadoop hadoop Jan : hadoop
drwxr-xr-x. hadoop hadoop Mar : hbase
drwxr-xr-x. hadoop hadoop Mar : hive
drwxr-xr-x. root root Sep include
drwxr-xr-x. hadoop hadoop Jan : jdk
drwxr-xr-x. root root Sep lib
drwxr-xr-x. root root Sep lib64
drwxr-xr-x. root root Sep libexec
drwxr-xr-x hadoop hadoop Apr : mahout
drwxr-xr-x. root root Sep sbin
drwxr-xr-x. root root Jan : share
drwxr-xr-x. hadoop hadoop Mar : sqoop
drwxr-xr-x. root root Sep src
[root@djt002 local]#

  2、上传mahout压缩包

[root@djt002 local]# su hadoop
[hadoop@djt002 local]$ cd mahout/
[hadoop@djt002 mahout]$ pwd
/usr/local/mahout
[hadoop@djt002 mahout]$ ll
total
[hadoop@djt002 mahout]$ rz [hadoop@djt002 mahout]$ ll
total
-rw-r--r-- hadoop hadoop Apr : mahout-distribution-0.8.tar.gz
[hadoop@djt002 mahout]$

  3、解压

[hadoop@djt002 mahout]$ pwd
/usr/local/mahout
[hadoop@djt002 mahout]$ ll
total
-rw-r--r-- hadoop hadoop Apr : mahout-distribution-0.8.tar.gz
[hadoop@djt002 mahout]$ tar -zxvf mahout-distribution-0.9.tar.gz

  4、删除压缩包和赋予用户组

[hadoop@djt002 mahout]$ pwd
/usr/local/mahout
[hadoop@djt002 mahout]$ ll
total
drwxrwxr-x hadoop hadoop Apr : mahout-distribution-0.8
-rw-r--r-- hadoop hadoop Apr : mahout-distribution-0.8.tar.gz
[hadoop@djt002 mahout]$ rm mahout-distribution-0.9.tar.gz
[hadoop@djt002 mahout]$ ll
total
drwxrwxr-x hadoop hadoop Apr : mahout-distribution-0.8
[hadoop@djt002 mahout]$

  5、mahout的配置

[root@djt002 mahout-distribution-0.8]# pwd
/usr/local/mahout/mahout-distribution-0.8
[root@djt002 mahout-distribution-0.8]# vim /etc/profile

#mahout
export MAHOUT_HOME=/usr/local/mahout/mahout-distribution-0.8
export MAHOUT_HOME_CONF_DIR=/usr/local/mahout/mahout-distribution-0.8/conf
export PATH=$PATH:$MAHOUT_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib:$MAHOUT_HOME/lib:$JRE_HOME/lib:$CLASSPATH

[root@djt002 mahout-distribution-0.9]# source /etc/profile

  认识下mahout的目录结构

[hadoop@djt002 mahout-distribution-0.8]$ pwd
/usr/local/mahout/mahout-distribution-0.8
[hadoop@djt002 mahout-distribution-0.8]$ ll
total
drwxrwxr-x hadoop hadoop Apr : bin
drwxrwxr-x hadoop hadoop Apr : buildtools
drwxr-xr-x hadoop hadoop Jul conf
drwxrwxr-x hadoop hadoop Apr : core
drwxrwxr-x hadoop hadoop Apr : distribution
drwxrwxr-x hadoop hadoop Apr : docs
drwxrwxr-x hadoop hadoop Apr : examples
drwxrwxr-x hadoop hadoop Apr : integration
drwxrwxr-x hadoop hadoop Apr : lib
-rw-r--r-- hadoop hadoop Jul LICENSE.txt
-rw-r--r-- hadoop hadoop Jul mahout-core-0.8.jar
-rw-r--r-- hadoop hadoop Jul mahout-core-0.8-job.jar
-rw-r--r-- hadoop hadoop Jul mahout-examples-0.8.jar
-rw-r--r-- hadoop hadoop Jul mahout-examples-0.8-job.jar
-rw-r--r-- hadoop hadoop Jul mahout-integration-0.8.jar
-rw-r--r-- hadoop hadoop Jul mahout-math-0.8.jar
drwxrwxr-x hadoop hadoop Apr : math
-rw-r--r-- hadoop hadoop Jul NOTICE.txt
-rw-r--r-- hadoop hadoop Jul README.txt
[hadoop@djt002 mahout-distribution-0.8]$

三、验证mahout是否安装成功

[hadoop@djt002 mahout-distribution-0.8]$ bin/mahout --help
Running on hadoop, using /usr/local/hadoop/hadoop-2.6./bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Unknown program '--help' chosen.
Valid program names are:
arff.vector: : Generate Vectors from an ARFF file or directory
baumwelch: : Baum-Welch algorithm for unsupervised HMM training
canopy: : Canopy clustering
cat: : Print a file or resource as the logistic regression models would see it
cleansvd: : Cleanup and verification of SVD output
clusterdump: : Dump cluster output to text
clusterpp: : Groups Clustering Output In Clusters
cmdump: : Dump confusion matrix in HTML or text formats
concatmatrices: : Concatenates matrices of same cardinality into a single matrix
cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
fkmeans: : Fuzzy K-means clustering
hmmpredict: : Generate random sequence of observations by given HMM
itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
kmeans: : K-means clustering
lucene.vector: : Generate Vectors from a Lucene index
lucene2seq: : Generate Text SequenceFiles from a Lucene index
matrixdump: : Dump matrix in CSV format
matrixmult: : Take the product of two matrices
parallelALS: : ALS-WR factorization of a rating matrix
qualcluster: : Runs clustering experiments and summarizes results in a CSV
recommendfactorized: : Compute recommendations using the factorization of a rating matrix
recommenditembased: : Compute recommendations using item-based collaborative filtering
regexconverter: : Convert text files on a per line basis based on regular expressions
resplit: : Splits a set of SequenceFiles into a number of equal splits
rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
runlogistic: : Run a logistic regression model against CSV data
seq2encoded: : Encoded Sparse Vector generation from Text sequence files
seq2sparse: : Sparse Vector generation from Text sequence files
seqdirectory: : Generate sequence files (of Text) from a directory
seqdumper: : Generic Sequence File dumper
seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
seqwiki: : Wikipedia xml dump to sequence file
spectralkmeans: : Spectral k-means clustering
split: : Split Input data into test and train sets
splitDataset: : split a rating dataset into training and probe parts
ssvd: : Stochastic SVD
streamingkmeans: : Streaming k-means clustering
svd: : Lanczos Singular Value Decomposition
testnb: : Test the Vector-based Bayes classifier
trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
trainlogistic: : Train a logistic regression using stochastic gradient descent
trainnb: : Train the Vector-based Bayes classifier
transpose: : Take the transpose of a matrix
validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
vectordump: : Dump vectors from a sequence file to text
viterbi: : Viterbi decoding of hidden states from given output states sequence
[hadoop@djt002 mahout-distribution-0.9]$

  出现上述的界面,说明mahout安装成功,因为,自动列出mahout已经实现的所有命令。

运行mahout自带的示例(确保hadoop集群已开启)

mahout中的算法大致可以分为三大类:

  聚类,协同过滤和分类

其中

  常用聚类算法有:canopy聚类,k均值算法(kmeans),模糊k均值,层次聚类,LDA聚类等

  常用分类算法有:贝叶斯,逻辑回归,支持向量机,感知器,神经网络等

  因为,我的版本是mahout-0.8,所以mahout-examples-0.8-job.jar。

  以下是运行mahout自带的keans算法
$HADOOP_HOME/bin/hadoop jar /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job   或者   以下是运行mahout自带的cnopy算法
$HADOOP_HOME/bin/hadoop  jar  /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar   org.apache.mahout.clustering.syntheticcontrol.canopy.Job

[hadoop@djt002 mahout-distribution-0.9]$ $HADOOP_HOME/bin/hadoop  jar  /usr/local/mahout/mahout-distribution-0.9/mahout-examples-0.9-job.jar   org.apache.mahout.clustering.syntheticcontrol.canopy.Job
// :: INFO canopy.Job: Running with default arguments
// :: INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:
// :: WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
// :: INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1493332712225_0001
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://djt002:9000/user/hadoop/testdata
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:)
at org.apache.hadoop.mapreduce.Job$.run(Job.java:)
at org.apache.hadoop.mapreduce.Job$.run(Job.java:)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:)
at org.apache.mahout.clustering.syntheticcontrol.canopy.Job.run(Job.java:)
at org.apache.mahout.clustering.syntheticcontrol.canopy.Job.main(Job.java:)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:)
at java.lang.reflect.Method.invoke(Method.java:)
at org.apache.hadoop.util.RunJar.run(RunJar.java:)
at org.apache.hadoop.util.RunJar.main(RunJar.java:)
[hadoop@djt002 mahout-distribution-0.9]$

  准备测试数据

练习数据下载地址:

http://download.csdn.net/detail/qq1010885678/8582941

  上面的练习数据是用来检测kmeans聚类算法的数据。

  将练习数据(data.txt)上传到hdfs中对应的hdfs://djt002:9000/user/hadoop/testdata目录下即可。(这是样本数据集,可以适用各种算法)

     我这里,上传测试数据。到我本地linux自己写的一个路径。(这里为了自己所需哈)

[hadoop@djt002 mahout]$ pwd
/usr/local/mahout
[hadoop@djt002 mahout]$ ll
total
drwxrwxr-x hadoop hadoop Apr : mahout-distribution-0.8
[hadoop@djt002 mahout]$ mkdir mahoutData
[hadoop@djt002 mahout]$ ll
total
drwxrwxr-x hadoop hadoop Apr : mahoutData
drwxrwxr-x hadoop hadoop Apr : mahout-distribution-0.8
[hadoop@djt002 mahout]$ cd mahoutData/
[hadoop@djt002 mahoutData]$ pwd
/usr/local/mahout/mahoutData
[hadoop@djt002 mahoutData]$ ll
total
[hadoop@djt002 mahoutData]$ rz
CC[hadoop@djt002 mahoutData]$ ll
total
[hadoop@djt002 mahoutData]$ rz [hadoop@djt002 mahoutData]$ ll
total
-rw-r--r-- hadoop hadoop Apr : data.txt
[hadoop@djt002 mahoutData]$

  然后,将/usr/local/mahout/mahoutData/下的测试数据,上传到hdfs://djt002:9000/user/hadoop/testdata下

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -put /usr/local/mahout/mahoutData/data.txt  hdfs://djt002:9000/user/hadoop/testdata

或者 [hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -copyFromLocal  /usr/local/mahout/mahoutData/data.txt  hdfs://djt002:9000/user/hadoop/testdata/ [hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -ls hdfs://djt002:9000/user/hadoop/testdata/
-rw-r--r-- hadoop supergroup -- : hdfs://djt002:9000/user/hadoop/testdata

  也许中间会出现,这个数据集,你会上传不了。解决方案如下

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -put /usr/local/mahout/mahoutData/data.txt  hdfs://djt002:9000/user/hadoop/testdata/
put: `hdfs://djt002:9000/user/hadoop/testdata': File exists
[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -rm hdfs://djt002:9000/user/hadoop/testdata/
// :: INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = minutes, Emptier interval = minutes.
Deleted hdfs://djt002:9000/user/hadoop/testdata
[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -mkdir hdfs://djt002:9000/user/hadoop/testdata/
[hadoop@djt002 mahoutData]$

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -put /usr/local/mahout/mahoutData/data.txt  hdfs://djt002:9000/user/hadoop/testdata/
[hadoop@djt002 mahoutData]$

使用kmeans算法

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop  jar  /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar   org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

  注意,是不需输入路径和输出路径的啊!(自带的jar包里都已经写死了的)

  (注意:如果你是选择用mahout压缩包里自带的kmeans算法的话,则它的输入路径是testdata是固定死的,

        即hdfs:djt002://9000/user/hadoop/testdata/  )

  并且每次运行hadoop都要删掉原来的output目录!

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -rm -r hdfs://djt002:9000/user/hadoop/output/*

  ....

  由于聚类算法是一种迭代的过程(之后会讲解)

  所以,它会一直重复的执行mr任务到符合要求(这其中的过程可能有点久。。。)

Kmeans运行结果如下:

, 7.311, 10.611, 6.924, 3.440, 9.465, 4.764, 2.838, 8.807, 1.960, 2.864, 6.728, 0.369, 1.374, -0.167, 2.125, 8.306, 4.908, -0.432]
1.0 : [distance=29.095866076790845]: = [30.817, 28.079, 24.628, 23.933, 28.660, 25.704, 27.501, 23.513, 30.377, 27.595, 22.938, 26.684, 25.208, 26.834, 22.931, 17.732, 17.544, 24.167, 25.602, 19.269, 14.978, 17.223, 18.962, 22.281, 17.035, 23.789, 14.878, 18.113, 10.981, 11.661, 14.331, 19.942, 11.175, 10.714, 15.675, 15.468, 16.010, 14.972, 15.101, 15.131, 15.154, 10.492, 14.754, 5.222, 5.393, 13.606, 11.775, 6.307, 3.370, 10.107, 7.779, 10.209, 1.493, 4.822, 0.019, 8.019, -0.279, -0.049, 5.757, 2.718]
1.0 : [distance=24.674726284993667]: = [31.347, 28.245, 34.275, 29.885, 30.573, 32.373, 24.031, 24.057, 24.099, 23.777, 28.993, 29.853, 26.485, 29.245, 28.145, 22.528, 20.390, 20.570, 27.921, 18.786, 22.144, 20.163, 17.616, 19.541, 20.342, 22.061, 21.358, 23.951, 13.447, 12.974, 18.406, 17.349, 17.425, 11.041, 14.912, 10.147, 16.731, 9.845, 14.840, 18.283, 18.426, 10.059, 16.760, 14.187, 14.301, 14.277, 12.823, 15.574, 10.789, 10.957, 8.361, 4.116, 3.732, 3.508, 2.288, 9.768, 9.661, 2.183, 6.933, 4.670]
1.0 : [distance=31.366016794511612]: = [35.439, 24.104, 27.345, 28.982, 34.488, 27.952, 32.550, 25.255, 29.188, 24.766, 29.235, 20.520, 19.745, 27.306, 29.226, 27.510, 21.879, 25.199, 19.470, 19.373, 19.371, 26.519, 19.270, 18.184, 24.926, 15.082, 17.402, 14.351, 22.618, 22.343, 22.627, 15.136, 16.385, 13.479, 21.914, 21.072, 18.025, 15.178, 19.715, 11.919, 18.650, 16.242, 12.783, 17.710, 17.715, 8.372, 13.702, 7.537, 9.190, 11.098, 13.714, 8.595, 11.006, 15.031, 10.061, 7.613, 13.295, 12.292, 12.478, 11.095]
1.0 : [distance=26.598263851474357]: = [26.273, 31.229, 29.741, 34.208, 33.329, 33.610, 31.072, 22.530, 28.587, 21.130, 23.557, 28.078, 27.546, 25.825, 18.454, 25.903, 24.448, 24.003, 23.199, 22.158, 17.711, 23.922, 20.550, 15.913, 17.699, 13.883, 17.494, 16.360, 20.679, 11.790, 18.424, 10.493, 11.001, 17.994, 11.673, 11.014, 11.437, 16.197, 16.435, 7.331, 15.089, 16.779, 14.449, 9.551, 11.331, 10.564, 5.992, 8.369, 11.402, 7.865, 2.526, 4.632, 9.335, 6.772, 3.018, 3.675, 0.455, 5.362, 6.945, 7.901]
1.0 : [distance=27.50313693276032]: = [26.148, 30.828, 27.122, 31.797, 26.812, 24.681, 31.379, 22.047, 22.034, 24.293, 30.875, 22.493, 30.889, 19.167, 19.199, 27.696, 17.370, 27.648, 23.842, 26.493, 23.635, 23.577, 20.884, 18.786, 18.898, 18.091, 22.021, 20.674, 23.890, 12.646, 18.448, 17.732, 17.897, 14.679, 13.598, 12.689, 19.832, 12.489, 9.745, 18.990, 18.820, 16.517, 12.024, 14.131, 13.394, 15.473, 11.140, 5.094, 15.265, 14.651, 8.299, 3.163, 12.039, 4.893, 7.552, 12.315, 9.581, 5.462, 2.984, 8.981]
1.0 : [distance=41.63476648186727]: = [30.822, 26.592, 32.747, 31.626, 31.853, 32.258, 34.720, 25.605, 24.215, 29.830, 28.270, 30.519, 27.139, 32.953, 29.208, 27.265, 31.003, 24.601, 27.746, 29.257, 25.375, 9.397, 11.854, 18.179, 11.058, 12.507, 14.945, 19.796, 9.565, 19.152, 11.940, 16.022, 17.441, 10.963, 10.996, 8.929, 15.033, 8.991, 20.548, 17.140, 13.223, 14.981, 10.412, 19.554, 19.192, 13.297, 15.799, 11.817, 12.925, 12.827, 13.102, 13.449, 11.540, 17.939, 8.543, 13.994, 15.765, 16.096, 16.662, 8.968]
1.0 : [distance=47.92825575495409]: = [35.675, 32.252, 33.359, 31.057, 24.062, 29.028, 24.791, 27.460, 25.859, 28.450, 30.435, 27.962, 28.948, 27.236, 28.649, 29.507, 35.871, 31.607, 25.408, 30.508, 32.454, 26.580, 27.593, 34.277, 27.145, 33.938, 27.016, 12.593, 10.910, 4.930, 4.463, 5.002, 11.772, 15.086, 10.525, 13.935, 10.900, 15.151, 8.885, 14.374, 13.364, 13.354, 6.827, 14.907, 4.364, 15.200, 14.254, 8.839, 13.155, 7.695, 8.300, 15.678, 14.164, 10.802, 9.084, 5.791, 10.142, 16.019, 12.784, 12.437]
1.0 : [distance=48.93716831670561]: = [31.775, 33.510, 25.615, 27.700, 24.828, 33.067, 34.310, 28.609, 34.490, 35.751, 25.563, 26.692, 34.970, 30.595, 26.545, 35.828, 29.338, 24.678, 33.323, 33.962, 34.928, 16.294, 8.878, 12.901, 7.906, 6.083, 6.624, 11.364, 9.335, 11.368, 10.111, 15.291, 13.921, 10.583, 15.977, 16.325, 11.815, 11.675, 11.011, 16.201, 9.244, 15.829, 10.276, 16.145, 13.675, 9.326, 10.849, 6.772, 17.498, 7.973, 16.450, 9.991, 6.178, 16.111, 17.548, 13.860, 10.801, 8.851, 10.028, 8.332]
1.0 : [distance=45.830951493743164]: = [28.636, 35.554, 28.989, 26.883, 30.280, 35.294, 33.550, 32.722, 30.094, 32.951, 34.356, 33.583, 27.756, 33.049, 25.218, 31.894, 34.318, 25.636, 32.570, 24.817, 27.464, 12.408, 9.314, 12.147, 8.343, 7.502, 11.223, 12.910, 10.207, 14.853, 6.479, 11.333, 14.162, 5.533, 14.142, 15.040, 13.506, 5.263, 6.361, 13.789, 13.502, 8.490, 11.222, 15.391, 9.330, 15.925, 13.675, 13.507, 12.027, 12.400, 11.421, 8.011, 12.951, 8.780, 11.031, 12.124, 12.020, 12.910, 8.291, 10.597]
1.0 : [distance=48.07002341109426]: = [34.335, 30.938, 31.953, 31.146, 24.519, 24.393, 27.696, 29.874, 26.767, 33.089, 31.371, 26.233, 26.383, 35.661, 32.663, 27.685, 29.277, 31.761, 34.650, 24.940, 33.434, 26.849, 28.714, 26.581, 34.825, 34.026, 8.823, 12.634, 12.694, 6.279, 13.644, 16.651, 18.078, 7.975, 9.274, 9.208, 12.879, 12.729, 6.976, 17.832, 13.330, 6.326, 12.131, 11.842, 16.716, 10.425, 9.445, 14.400, 15.696, 11.028, 10.608, 15.190, 9.076, 17.909, 9.846, 15.013, 13.913, 11.743, 11.699, 10.152]
// :: INFO clustering.ClusterDumper: Wrote clusters
[hadoop@djt002 mahoutData]$

  mahout无异常!!!

  注意:执行完这个kmeans算法之后产生的文件按普通方式是查看不了的,看到的只是一堆莫名其妙的数据!!!

  查看聚类分析的结果:

  需要用mahout的seqdumper命令来下载到本地linux上才能查看正常结果。

[hadoop@djt002 ~]$ $MAHOUT_HOME/bin/mahout seqdumper -i /user/hadoop/output/data/part-m-00000 -o ~/res.txt

[hadoop@djt002 ~]$ $MAHOUT_HOME/bin/mahout seqdumper -i /user/hadoop/output/data/part-m- -o ~/res.txt
Running on hadoop, using /usr/local/hadoop/hadoop-2.6./bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-distribution-0.9/mahout-examples-0.9-job.jar
// :: INFO common.AbstractJob: Command line arguments: {--endPhase=[], --input=[/user/hadoop/output/data/part-m-], --output=[/home/hadoop/res.txt], --startPhase=[], --tempDir=[temp]}
// :: INFO driver.MahoutDriver: Program took ms (Minutes: 0.14583333333333334)
[hadoop@djt002 ~]$ ll
total
-rw-r--r--. hadoop hadoop Feb : anagram.jar
drwxrwxr-x. hadoop hadoop Mar : app
drwxr-xr-x. hadoop hadoop Jan : Desktop
drwxrwxr-x. hadoop hadoop Feb : djt
drwxr-xr-x. hadoop hadoop Jan : Documents
drwxr-xr-x. hadoop hadoop Jan : Downloads
drwxrwxr-x. hadoop hadoop Jan : flume
drwxr-xr-x. hadoop hadoop Jan : Music
drwxr-xr-x. hadoop hadoop Jan : Pictures
drwxr-xr-x. hadoop hadoop Jan : Public
-rw-rw-r-- hadoop hadoop Apr : res.txt
drwxr-xr-x. hadoop hadoop Jan : Templates
drwxrwxr-x. hadoop hadoop Mar : tvdata
drwxr-xr-x. hadoop hadoop Jan : Videos
[hadoop@djt002 ~]$ sz res.txt

Input Path: /user/hadoop/output/data/part-m-
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
Key: : Value: {:28.7812,:26.6311,:29.1495,:28.9207,:35.6541,:33.7596,:35.2479,:25.3969,:25.0293,:33.0292,:34.9424,:26.5235,:24.5556,:26.1927,:36.0253,:29.5054,:25.4652,:29.27,:29.2171,:32.8717,:32.8717,:27.7849,:26.1203,:28.0721,:28.4353,:34.9879,:34.9318,:25.04,:31.2834,:29.747,:26.2353,:34.4632,:28.9167,:31.0558,:33.3182,:32.4721,:28.9964,:24.3437,:31.4333,:34.1173,:35.5344,:35.4973,:27.0443,:27.1159,:33.7431,:32.337,:32.0036,:26.3693,:25.8717,:31.3381,:25.7744,:27.6623,:30.7326,:28.1584,:33.3759,:34.2553,:30.9772,:28.9402,:34.5249,:25.0466}
Key: : Value: {:24.8923,:32.5981,:26.9414,:27.8789,:28.3038,:31.5926,:27.9516,:31.4861,:34.0765,:31.9874,:25.0701,:35.6273,:31.0205,:33.1089,:27.4867,:30.4719,:32.1005,:24.1311,:31.1887,:27.5415,:24.488,:35.5469,:33.6472,:26.3458,:26.1471,:26.4244,:33.6564,:33.6615,:32.8217,:29.4047,:26.5301,:25.741,:25.5511,:32.8357,:24.1491,:28.4661,:24.8578,:30.4686,:32.5577,:27.5918,:35.9519,:28.9861,:25.7906,:31.6595,:26.6418,:31.391,:25.9562,:31.4167,:26.691,:27.5532,:30.7447,:35.4102,:35.1422,:31.5203,:34.2484,:28.5322,:28.5157,:30.6213,:27.811,:28.4331}
Key: : Value: {:31.3987,:24.246,:31.6114,:27.8613,:26.9631,:28.5491,:25.2239,:24.9717,:27.3086,:24.3323,:28.8778,:32.5614,:26.5966,:27.4809,:28.2572,:32.3851,:29.5446,:31.4781,:27.2587,:31.8387,:35.0625,:32.4358,:31.5137,:29.6082,:25.2919,:29.9897,:25.5772,:30.2001,:24.2905,:27.1717,:31.0561,:30.6316,:31.2452,:31.4391,:24.2075,:31.351,:26.3583,:26.6814,:33.6318,:31.5717,:32.6293,:34.1444,:35.1253,:27.3068,:25.5387,:26.5819,:28.0861,:34.1202,:29.343,:26.3983,:26.9337,:31.0308,:35.0173,:24.7131,:33.9002,:27.3057,:26.8059,:35.9725,:24.0455,:32.5434}
Key: : Value: {:25.774,:28.3714,:35.9346,:27.97,:32.3667,:25.2702,:31.4549,:28.132,:27.5587,:29.2806,:24.824,:35.0966,:28.7261,:24.3749,:29.9578,:31.6264,:27.3659,:25.0102,:28.9916,:28.9564,:24.3037,:29.4268,:25.5265,:35.769,:26.9752,:32.5492,:34.6156,:34.2021,:25.6033,:31.156,:26.8908,:30.5262,:26.5077,:34.3336,:27.6083,:30.9827,:31.3209,:32.2279,:34.6292,:24.314,:32.4185,:34.2054,:29.8557,:27.32,:28.2979,:30.2773,:29.3849,:32.0968,:25.3069,:35.4209,:33.3303,:25.3679,:35.3155,:35.1146,:24.8938,:24.7381,:27.8433,:31.8725,:30.4447,:31.5787}
Key: : Value: {:27.1798,:33.4129,:29.6526,:24.6555,:26.9245,:28.9446,:24.5596,:35.798,:33.1247,:24.6081,:28.0295,:31.1274,:27.9601,:24.5119,:35.4154,:33.0321,:31.1057,:31.6565,:25.3216,:27.9634,:29.4686,:34.9446,:35.8773,:29.1348,:30.2123,:29.9993,:35.3375,:33.2025,:25.6264,:34.9244,:27.9072,:29.2498,:27.4335,:33.833,:33.9931,:34.2149,:35.111,:32.6355,:27.7218,:33.1739,:31.2651,:32.3223,:33.204,:34.2366,:35.7198,:34.862,:35.0757,:26.5173,:31.0179,:33.6928,:28.6486,:31.3701,:35.9497,:30.8644,:33.1276,:25.9481,:33.3094,:24.2875,:25.1472,:27.576}
....
....

  当然,你可以去看输出目录下/user/hadoop/output的其他的,比如clusters-0、clusters-1等,我这里仅仅是

看的是/user/hadoop/output/data/下的。

使用canopy算法

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop  jar  /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar   org.apache.mahout.clustering.syntheticcontrol.canopy.Job

  这里不多赘述。

使用dirichlet 算法

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop  jar  /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar   org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job

  这里不多赘述。

使用meanshift算法

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop  jar  /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar   org.apache.mahout.clustering.syntheticcontrol.meanshift.Job

  这里不多赘述。

 总结

  mahout压缩包,给我们的默认输入路径是/user/hadoop/testdata  和  输出路径是 /user/hadoop/output

  其实,我们是自己可以跟上自定义的输入路径和自定义输出路径的。

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop  jar  /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar   org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop  jar  /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar   org.apache.mahout.clustering.syntheticcontrol.kmeans.Job   -i   /user/hadoop/mahoutData/data.txt   -o  /user/hadoop/output
												

mahout-distribution-0.9.tar.gz的安装的与配置、启动与运行自带的mahout算法的更多相关文章

  1. Linux下编译安装mysql-5.0.45.tar.gz

    安装环境:VMware9(桥接模式) + Linux bogon 2.6.32-642.3.1.el6.x86_64(查看linux版本信息:uname -a) 先给出MySQL For Linux ...

  2. Centos6.5 安装 MariaDB-10.0.20-linux-x86_64.tar.gz

    下载mariadb :https://downloads.mariadb.org/  我选择mariadb-10.0.20-linux-x86_64.tar.gz这个版本 复制安装文件 /opt 目录 ...

  3. 手动安装mysql-5.0.45.tar.gz

    Linux下编译安装 安装环境:VMware9(桥接模式) + Linux bogon 2.6.32-642.3.1.el6.x86_64(查看linux版本信息:uname -a) 先给出MySQL ...

  4. Apache-kylin-2.0.0-bin-hbase1x.tar.gz的下载与安装(图文详解)

    首先,对于Apache Kylin的安装,我有话要说. 由于Apache Kylin本身只是一个Server,所以安装部署还是比较简单的.但是它的前提要求是Hadoop.Hive.HBase必须已经安 ...

  5. redis-5.0.5.tar.gz 安装

    参考5.0安装,地址:https://my.oschina.net/u/3367404/blog/2979102 前言 安装Redis需要知道自己需要哪个版本,有针对性的安装. 比如如果需要redis ...

  6. 编译安装 keepalived-2.0.16.tar.gz

    一.下载安装包 wget https://www.keepalived.org/software/keepalived-2.0.16.tar.gz 安装相关依赖 把所有的rpm包放在一个目录下. rp ...

  7. linux安装 redis(redis-3.0.2.tar.gz) 和 mongodb(mongodb-linux-x86_64-rhel62-4.0.0)

    1:首先 要下载 这两个 压缩包 注意:liunx是否已经安装过 gcc没安装的话 先安装:yum install gcc-c++ 2:安装 redis:redis-3.0.2.tar.gz (1): ...

  8. linux下安装nginx(nginx(nginx-1.8.0.tar.gz),openssl(openssl-fips-2.0.9.tar.gz) ,zlib(zlib-1.2.11.tar.gz),pcre(pcre-8.39.tar.gz))

    :要按顺序安装: 1:先检查是否安装 gcc ,没有先安装:通过yum install gcc-c++完成安 2:openssl : tar -zxf  openssl-fips-2.0.9.tar. ...

  9. 在Foreda上安装apache-tomcat-7.0.42.tar.gz

    开发环境JDK和Tomcat应该和部署环境一致,要不容易出现奇奇怪怪的问题.所以Aspire机器上的Tomcat要装一个新版本了. 装Tomcat基本等于一个解压和移动的过程,确实简单. 第一步:解压 ...

随机推荐

  1. glEnable(GL_DEPTH_TEST)作用

    glEnable(GL_DEPTH_TEST): 用来开启更新深度缓冲区的功能,也就是,如果通过比较后深度值发生变化了,会进行更新深度缓冲区的操作.启动它,OpenGL就可以跟踪再Z轴上的像素,这样, ...

  2. android动画-拖动

    先上图看效果 实质上说是动画有点不妥,确切的说应该是手势的处理,废话不多说看代码 SimpleDragSample.java public class SimpleDragSample extends ...

  3. linux(debian/ubuntu)下连接安卓手机--小米4为例

    更改:如今小米连接Ubuntu等Linux系统,直接改动手机上的连接方式就可以. --------------------------------------------- 因为安卓手机底层就是lin ...

  4. centos 7.3 配置vnc 服务 图形界面登录

    1.检查系统是否有安装tigervnc-server软件包 rpm -qa |grep vnc 默认的系统未装tigervnc-server软件包 2.安装tigervnc-server软件包 yum ...

  5. axure母版使用实例之百度门户

    1.首先构建页面基本结构 2.新建母板 3.将母板应用于各个页面 4.在母板中隐藏聚焦背景及下拉二级菜单 5.在母板中添加事件:打开相应界面.显示/隐藏二级菜单 5.设置页面加载效果:给点击的一级菜单 ...

  6. Coderfroces 862 C. Mahmoud and Ehab and the xor

    C. Mahmoud and Ehab and the xor Mahmoud and Ehab are on the third stage of their adventures now. As ...

  7. Linux桌面词典 星际译王(StarDict)

    星际译王(StarDict)是利用GTK(GIMP TOOLKIT)开发的国际化的.跨平台的自由的桌面字典软件.它并不包含字典档,使用者须自行下载配合使用.它可以运行于多种不同的平台,如Linux, ...

  8. 2017-2018年红头发新版Cisco认证网络工程师(CCNA-R&S)全新讲解分享

    网名"红头发",多年授课经验,业内资深思科认证讲师,其所写的CISCO认证原创技术文章风靡各大网站与培训机构.精通CISCO各类路由交换产品,熟悉JUNIPER M/T系列路由产品 ...

  9. 谈谈 .NET Reflector

    著名的 .NET Reflector 如今要收费了,价格还不低: .NET Reflector Standard: $95 .NET Reflector VS: $195 .NET Reflector ...

  10. zoj 2778 - Triangular N-Queens Problem

    题目:在三角形的棋盘上放n皇后问题. 分析:找规律题目.依照题目的输出,能够看出构造法则: 先填奇数,后填偶数.以下我们仅仅要证明这样的构造的存在性就可以. 解法:先给出集体构造方法,从(1.n-f( ...