Mahout Bayes Algorithm Development Approach (Extended Series), Part 1

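First, a note on scope: this post addresses how to apply Mahout's Bayes algorithm to the data below. (This question was left open at the end of the previous post, the "final part" of the earlier series. If you just want to use the tool directly, it is available from the "Mahout Bayes algorithm extension" download.)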

0.2	0.3	0.4:1
0.32	0.43	0.45:1
0.23	0.33	0.54:1
2.4	2.5	2.6:2
2.3	2.2	2.1:2
5.4	7.2	7.2:3
5.6	7	6:3
5.8	7.1	6.3:3
6	6	5.4:3
11	12	13:4

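Compared with the previous post, the last whitespace on each line has been replaced by a colon. The sample vector and the label need different separators; if both used the same one, parsing would break. Each line says that a sample, for example [0.2,0.3,0.4], carries a label, here 1, and so on for the other lines. This data is used to train the Bayes model, and the same data is then classified to check the model's accuracy.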
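To make the line format concrete, here is a minimal plain-Java sketch of how one such line splits into a feature vector and a label. It is not part of the original tool: the class `LabeledLineParser` and its `Sample` holder are hypothetical names, and it assumes tab-separated features and an ASCII colon before the label, as in the sample above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical helper: splits one training line into features and a label. */
class LabeledLineParser {

    /** Simple holder for one parsed sample. */
    static final class Sample {
        final double[] features;
        final String label;

        Sample(double[] features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    /**
     * Parses a line such as "0.2\t0.3\t0.4:1".
     * Pattern.quote makes the separators literal, because String.split
     * would otherwise interpret them as regular expressions.
     */
    static Sample parse(String line, String featureSep, String labelSep) {
        String[] parts = line.split(Pattern.quote(labelSep));
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected exactly one label separator: " + line);
        }
        String[] tokens = parts[0].split(Pattern.quote(featureSep));
        double[] features = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            features[i] = Double.parseDouble(tokens[i].trim());
        }
        return new Sample(features, parts[1].trim());
    }

    public static void main(String[] args) {
        Sample s = parse("0.2\t0.3\t0.4:1", "\t", ":");
        System.out.println(Arrays.toString(s.features) + " -> " + s.label); // [0.2, 0.3, 0.4] -> 1
    }
}
```

Using two different separators is what lets the label be cut off first and the remaining text be split into numbers afterwards.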

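The processing can be divided into roughly 7 steps: (1) convert the raw data into a format the Bayes algorithm can consume; (2) convert all labels into numeric form; (3) process the raw data to obtain the first set of Bayes model parameters; (4) process the raw data to obtain the second set of Bayes model parameters; (5) write the Bayes model to a file using the results of steps 3 and 4; (6) classify the raw data itself with the model; (7) evaluate the Bayes model using the result of step 6.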

Each step is described below. This post covers steps 1 to 3.

1. Data format conversion:

The code is as follows:

package mahout.fansy.bayes.transform;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class TFText2VectorWritable extends AbstractJob {

    /**
     * Converts text lines such as
     * [2.1,3.2,1.2:a
     *  2.1,3.2,1.3:b]
     * into sequence-file records whose key is new Text("a") and whose value
     * is a VectorWritable wrapping the named vector (2.1,3.2,1.2), name "a".
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new TFText2VectorWritable(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        addInputOption();
        addOutputOption();
        // Separator between vector elements; defaults to a comma.
        addOption("splitCharacterVector", "scv", "Vector split character,default is ','", ",");
        // Separator between the vector and its label; defaults to a colon.
        addOption("splitCharacterLabel", "scl", "Vector and Label split character,default is ':'", ":");
        if (parseArguments(args) == null) {
            return -1;
        }
        Path input = getInputPath();
        Path output = getOutputPath();
        String scv = getOption("splitCharacterVector");
        String scl = getOption("splitCharacterLabel");
        Configuration conf = getConf();
        // Delete the output directory if it already exists.
        HadoopUtil.delete(conf, output);
        // Pass the separators to the mapper through the job configuration.
        conf.set("SCV", scv);
        conf.set("SCL", scl);
        Job job = new Job(conf);
        job.setJobName("transform text to vector by input:" + input.getName());
        job.setJarByClass(TFText2VectorWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        job.setMapperClass(TFMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(VectorWritable.class);
        // Map-only job: no reducer is needed.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(VectorWritable.class);
        TextInputFormat.setInputPaths(job, input);
        SequenceFileOutputFormat.setOutputPath(job, output);

        if (job.waitForCompletion(true)) {
            return 0;
        }
        return -1;
    }

    public static class TFMapper extends Mapper<LongWritable, Text, Text, VectorWritable> {
        private String SCV;
        private String SCL;

        /**
         * Reads the separator settings from the job configuration.
         */
        @Override
        public void setup(Context ctx) {
            SCV = ctx.getConfiguration().get("SCV");
            SCL = ctx.getConfiguration().get("SCL");
        }

        /**
         * Parses one input line and emits <label, vector>.
         * Note that String.split treats the separators as regular expressions.
         * @throws InterruptedException
         * @throws IOException
         */
        @Override
        public void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
            String[] valueStr = value.toString().split(SCL);
            if (valueStr.length != 2) {
                return;  // Not exactly two parts: the line is malformed, skip it.
            }
            String name = valueStr[1];
            String[] vector = valueStr[0].split(SCV);
            Vector v = new RandomAccessSparseVector(vector.length);
            for (int i = 0; i < vector.length; i++) {
                double item = 0;
                try {
                    item = Double.parseDouble(vector[i]);
                } catch (Exception e) {
                    return; // A non-numeric element means the input line is bad, skip it.
                }
                v.setQuick(i, item);
            }
            NamedVector nv = new NamedVector(v, name);
            VectorWritable vw = new VectorWritable(nv);
            ctx.write(new Text(name), vw);
        }
    }
}

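The code above needs only a Mapper. It splits each raw Text line with the separators and outputs <Text, VectorWritable> pairs, that is, <label, sample vector>. The Bayes algorithm works on VectorWritable data, which is why the conversion is required. The separators are set from the arguments passed in. To run this class on its own, the arguments are as follows: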

usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>              comma separated archives to be unarchived
                                on the compute machines.
 -conf <configuration file>     specify an application configuration file
 -D <property=value>            use value for given property
 -files <paths>                 comma separated files to be copied to the
                                map reduce cluster
 -fs <local|namenode:port>      specify a namenode
 -jt <local|jobtracker:port>    specify a job tracker
 -libjars <paths>               comma separated jar files to include in
                                the classpath.
 -tokenCacheFile <tokensFile>   name of the file with the tokens
Job-Specific Options:
  --input (-i) input                                    Path to job input
                                                        directory.
  --output (-o) output                                  The directory pathname
                                                        for output.
  --splitCharacterVector (-scv) splitCharacterVector    Vector split
                                                        character,default is
                                                        ','
  --splitCharacterLabel (-scl) splitCharacterLabel      Vector and Label split
                                                        character,default is
                                                        ':'
  --help (-h)                                           Print out help
  --tempDir tempDir                                     Intermediate output
                                                        directory
  --startPhase startPhase                               First phase to run
  --endPhase endPhase                                   Last phase to run

The -scv and -scl options are the ones I added; the rest are the default options of Mahout's AbstractJob.
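As a rough illustration of how these options fit together, the sketch below assembles an argument array for the job. The helper class `TransformArgs` and the paths are hypothetical, not part of the original tool; only the flag names -i, -o, -scv and -scl come from the usage text. It assumes the sample data is tab-separated, so the vector separator has to be overridden while the label separator keeps its default.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that assembles the argument array for TFText2VectorWritable. */
class TransformArgs {

    /** Pass null for a separator to fall back on the job's default (',' or ':'). */
    static String[] build(String input, String output, String vectorSep, String labelSep) {
        List<String> args = new ArrayList<>(Arrays.asList("-i", input, "-o", output));
        if (vectorSep != null) {
            args.add("-scv");
            args.add(vectorSep);
        }
        if (labelSep != null) {
            args.add("-scl");
            args.add(labelSep);
        }
        return args.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Placeholder paths, chosen only for this illustration.
        String[] jobArgs = build("input/data.txt", "output/vectors", "\t", null);
        System.out.println(jobArgs.length + " arguments");
        // With Hadoop and Mahout on the classpath, the array would be handed to the job:
        // ToolRunner.run(new Configuration(), new TFText2VectorWritable(), jobArgs);
    }
}
```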

2. Converting the labels

This step reads all labels from the input file and converts them into numeric values. The code is as follows:

package mahout.fansy.bayes;

import java.io.IOException;
import java.util.Collection;
import java.util.HashSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.VectorWritable;

import com.google.common.io.Closeables;

public class WriteIndexLabel {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        String inputPath = "hdfs://ubuntu:9000/user/mahout/output_bayes/part-m-00000";
        String labPath = "hdfs://ubuntu:9000/user/mahout/output_bayes/index.bin";
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "ubuntu:9001");
        long t = writeLabelIndex(inputPath, labPath, conf);
        System.out.println(t);
    }

    /**
     * Reads every label from the input file, maps each distinct label to a
     * number, and writes the mapping to a file.
     * @param inputPath output of step 1 (<Text, VectorWritable> records)
     * @param labPath   where to write the label index
     * @param conf
     * @return the number of distinct labels
     * @throws IOException
     */
    public static long writeLabelIndex(String inputPath, String labPath, Configuration conf) throws IOException {
        long labelSize = 0;
        Path p = new Path(inputPath);
        Path lPath = new Path(labPath);
        // The step-1 output holds <Text, VectorWritable> records; only the keys are used here.
        SequenceFileDirIterable<Text, VectorWritable> iterable =
            new SequenceFileDirIterable<Text, VectorWritable>(p, PathType.LIST, PathFilters.logsCRCFilter(), conf);
        labelSize = writeLabel(conf, lPath, iterable);
        return labelSize;
    }

    /**
     * Writes the mapping from label to number into a sequence file.
     * @param conf
     * @param indexPath
     * @param labels
     * @return the number of distinct labels
     * @throws IOException
     */
    public static long writeLabel(Configuration conf, Path indexPath, Iterable<Pair<Text, VectorWritable>> labels) throws IOException {
        FileSystem fs = FileSystem.get(indexPath.toUri(), conf);
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, indexPath, Text.class, IntWritable.class);
        Collection<String> seen = new HashSet<String>();
        int i = 0;
        try {
            for (Pair<Text, VectorWritable> label : labels) {
                String theLabel = label.getFirst().toString();
                // Each distinct label gets the next integer, in first-seen order.
                if (!seen.contains(theLabel)) {
                    writer.append(new Text(theLabel), new IntWritable(i++));
                    seen.add(theLabel);
                }
            }
        } finally {
            Closeables.closeQuietly(writer);
        }
        System.out.println("labels number is : " + i);
        return i;
    }
}

This step returns one value, the total number of distinct labels, which later steps need.
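The numbering logic of this step can be sketched without Hadoop. The class `LabelIndexSketch` below is a hypothetical stand-in that keeps the mapping in memory instead of writing a sequence file; it assumes, like the code above, that each distinct label receives the next integer in first-seen order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory version of the label-to-number mapping. */
class LabelIndexSketch {

    /** Assigns 0, 1, 2, ... to labels in the order they are first seen. */
    static Map<String, Integer> buildIndex(Iterable<String> labels) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String label : labels) {
            if (!index.containsKey(label)) {
                index.put(label, index.size());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Labels of the ten sample lines, in file order.
        List<String> labels = Arrays.asList("1", "1", "1", "2", "2", "3", "3", "3", "3", "4");
        Map<String, Integer> index = buildIndex(labels);
        System.out.println(index);        // {1=0, 2=1, 3=2, 4=3}
        System.out.println(index.size()); // 4
    }
}
```

For the sample data the returned count would be 4, since there are four distinct labels.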

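3. Obtaining Bayes model parameters, part 1:

This corresponds to the first prepareJob in TrainNaiveBayesJob. Mahout's own mapper and reducer could in principle be used directly, but Mahout's mapper parses the key differently from the format used here. The mapper in this step is therefore essentially the mapper of the first prepareJob in TrainNaiveBayesJob with a small change to the key handling. The code for this step is as follows: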

package mahout.fansy.bayes;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.mapreduce.VectorSumReducer;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.map.OpenObjectIntHashMap;

/**
 * First job of the Bayes algorithm, equivalent to the first prepareJob of
 * TrainNaiveBayesJob. Only the Mapper is changed; the Reducer is Mahout's original.
 */
public class BayesJob1 extends AbstractJob {

    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new BayesJob1(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        addInputOption();
        addOutputOption();
        addOption("labelIndex", "li", "The path to store the label index in");
        if (parseArguments(args) == null) {
            return -1;
        }
        Path input = getInputPath();
        Path output = getOutputPath();
        String labelPath = getOption("labelIndex");
        Configuration conf = getConf();
        // Ship the label index file to the mappers through the distributed cache.
        HadoopUtil.cacheFiles(new Path(labelPath), conf);
        HadoopUtil.delete(conf, output);
        Job job = new Job(conf);
        job.setJobName("job1 get scoreFeatureAndLabel by input:" + input.getName());
        job.setJarByClass(BayesJob1.class);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        job.setMapperClass(BJMapper.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(VectorWritable.class);
        // Vectors with the same label index are summed, both in the combiner and the reducer.
        job.setCombinerClass(VectorSumReducer.class);
        job.setReducerClass(VectorSumReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(VectorWritable.class);
        SequenceFileInputFormat.setInputPaths(job, input);
        SequenceFileOutputFormat.setOutputPath(job, output);

        if (job.waitForCompletion(true)) {
            return 0;
        }
        return -1;
    }

    /**
     * Custom Mapper; only the key handling differs from Mahout's version.
     * The whole key is used as the label.
     */
    public static class BJMapper extends Mapper<Text, VectorWritable, IntWritable, VectorWritable> {
        public enum Counter { SKIPPED_INSTANCES }

        private OpenObjectIntHashMap<String> labelIndex;

        @Override
        protected void setup(Context ctx) throws IOException, InterruptedException {
            super.setup(ctx);
            // Load the label-to-number mapping written in step 2.
            labelIndex = BayesUtils.readIndexFromCache(ctx.getConfiguration());
        }

        @Override
        protected void map(Text labelText, VectorWritable instance, Context ctx) throws IOException, InterruptedException {
            String label = labelText.toString();
            if (labelIndex.containsKey(label)) {
                ctx.write(new IntWritable(labelIndex.get(label)), instance);
            } else {
                // Unknown label: count the instance as skipped.
                ctx.getCounter(Counter.SKIPPED_INSTANCES).increment(1);
            }
        }
    }
}

To use this class on its own, refer to the following usage:

usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>              comma separated archives to be unarchived
                                on the compute machines.
 -conf <configuration file>     specify an application configuration file
 -D <property=value>            use value for given property
 -files <paths>                 comma separated files to be copied to the
                                map reduce cluster
 -fs <local|namenode:port>      specify a namenode
 -jt <local|jobtracker:port>    specify a job tracker
 -libjars <paths>               comma separated jar files to include in
                                the classpath.
 -tokenCacheFile <tokensFile>   name of the file with the tokens
Job-Specific Options:
  --input (-i) input               Path to job input directory.
  --output (-o) output             The directory pathname for output.
  --labelIndex (-li) labelIndex    The path to store the label index in
  --help (-h)                      Print out help
  --tempDir tempDir                Intermediate output directory
  --startPhase startPhase          First phase to run
  --endPhase endPhase              Last phase to run

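The -li option is the one I added. It is the path of the label index file written in step 2, which the job distributes to the mappers through the cache (it is not the label count itself). The rest are the default AbstractJob options.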
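What this job computes can be sketched in plain Java: with VectorSumReducer as both combiner and reducer, the output holds one vector per label index, namely the element-wise sum of all sample vectors carrying that label. The class `LabelVectorSum` below is a hypothetical in-memory illustration, not the MapReduce code, and it assumes all vectors have the same length.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory illustration of the per-label vector sum. */
class LabelVectorSum {

    /** Adds each feature vector to the running total of its label index. */
    static Map<Integer, double[]> sumByLabel(int[] labelIds, double[][] vectors) {
        Map<Integer, double[]> sums = new TreeMap<>();
        for (int row = 0; row < vectors.length; row++) {
            int dim = vectors[row].length;
            double[] total = sums.computeIfAbsent(labelIds[row], k -> new double[dim]);
            for (int col = 0; col < dim; col++) {
                total[col] += vectors[row][col];
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        // The first five sample lines; labels "1" and "2" map to indices 0 and 1.
        int[] labelIds = {0, 0, 0, 1, 1};
        double[][] vectors = {
            {0.2, 0.3, 0.4}, {0.32, 0.43, 0.45}, {0.23, 0.33, 0.54},
            {2.4, 2.5, 2.6}, {2.3, 2.2, 2.1}
        };
        for (Map.Entry<Integer, double[]> e : sumByLabel(labelIds, vectors).entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        }
    }
}
```

For label index 0 the summed vector is approximately [0.75, 1.06, 1.39], up to floating-point rounding.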

Share, grow, enjoy.

When reposting, please cite the blog address: http://blog.csdn.net/fansy1990
