Naive Bayes is a commonly used classifier because the idea behind it is simple. It is called "naive" because it assumes that the features used for classification are conditionally independent given the class; this assumption makes classification very easy, but costs some accuracy.
The full derivation can be found in《统计学习方法》.
The derivation gives y = argmax_ck P(Y=ck) · P(X=x | Y=ck), so we need to estimate the prior probability P(Y=ck) and the conditional probability P(X=x | Y=ck).
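Spelled out, the conditional-independence assumption is exactly what lets the joint probability P(X=x|Y=ck) factor over the individual features (a brief restatement of the standard result, written here in the notation of《统计学习方法》):

$$ y = \arg\max_{c_k} P(Y=c_k)\prod_{j=1}^{n} P\bigl(X^{(j)}=x^{(j)} \mid Y=c_k\bigr) $$

so only the class priors and the per-feature conditional probabilities need to be estimated.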
The concrete example used below comes from: http://blog.163.com/jiayouweijiewj@126/blog/static/1712321772010102802635243/.
I use four MapReduce jobs in total. Because the multinomial model is adopted, the prior is P(c) = (total number of words in class c) / (total number of words in the whole training set), and the class-conditional probability is P(tk|c) = (number of occurrences of word tk across the documents of class c + 1) / (total number of words in class c + |V|), where |V| is the number of distinct words. The input is:
1:Chinese Beijing Chinese
1:Chinese Chinese Shanghai
1:Chinese Macao
0:Tokyo Japan Chinese
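To make the formulas concrete on this training set (a quick sanity check against the job outputs listed below): class 1 contains 8 words in total and class 0 contains 3, so the priors are P(1) = 8/11 and P(0) = 3/11; the vocabulary has |V| = 6 distinct words (Chinese, Beijing, Shanghai, Macao, Tokyo, Japan); and, for example, P(Chinese|1) = (5+1)/(8+6) = 6/14 ≈ 0.4286 while P(Chinese|0) = (1+1)/(3+6) = 2/9 ≈ 0.2222.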
1. One MapReduce job counts the number of words in each class; this is needed later for the prior probabilities.
Its output is:
0              3
1              8
 
2. One MapReduce job computes the conditional probabilities; its output is:
0:Chinese    0.2222222222222222
0:Japan      0.2222222222222222
0:Tokyo      0.2222222222222222
1:Beijing    0.14285714285714285
1:Chinese    0.42857142857142855
1:Macao      0.14285714285714285
1:Shanghai   0.14285714285714285
 
3. One MapReduce job counts the number of distinct words; its output is:
num is 6
4. The last MapReduce job performs the prediction.
 
The implementation of each MapReduce job is described below:
1. Counting the words in each class is straightforward: emit the class label as the key and sum up the word counts in the reducer.
The code:
package hadoop.MachineLearning.Bayes.Pro;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PriorProbability {// counts the words in each class, used later for the prior probabilities
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String input = "hdfs://10.107.8.110:9000/Bayes/Bayes_input/";
        String output = "hdfs://10.107.8.110:9000/Bayes/Bayes_output/Pro/";
        Job job = Job.getInstance(conf, "PriorProbability");
        job.setJarByClass(hadoop.MachineLearning.Bayes.Pro.PriorProbability.class);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));
        if (!job.waitForCompletion(true))
            return;
    }
}

package hadoop.MachineLearning.Bayes.Pro;// (separate file)

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        String[] line = ivalue.toString().split(":| ");// line[0] is the class label, the rest are words
        int size = line.length - 1;// number of words on this line
        context.write(new Text(line[0]), new Text(String.valueOf(size)));
    }
}

package hadoop.MachineLearning.Bayes.Pro;// (separate file)

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, Text, IntWritable> {

    public void reduce(Text _key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;// total number of words in this class
        for (Text val : values) {
            sum += Integer.parseInt(val.toString());
        }
        context.write(_key, new IntWritable(sum));
    }
}
 
 
2. The second job counts the number of distinct words in the corpus. My own approach here is not great: every input word is emitted under the same key, a combiner uses a set to drop duplicates on each node, and the reducer uses another set to merge everything and output the number of distinct words. A better approach would probably be to aggregate by the word itself first and then count the distinct words (a minimal sketch of that idea is included after the code below). But I have only just learned MapReduce and do not yet know how to chain jobs, and this Bayes implementation already uses four of them, so I decided not to complicate it further.

package hadoop.MachineLearning.Bayes.Count;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Count {// counts the number of distinct words in the corpus
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Count");
        String input = "hdfs://10.107.8.110:9000/Bayes/Bayes_input";
        String output = "hdfs://10.107.8.110:9000/Bayes/Bayes_output/Count";
        job.setJarByClass(hadoop.MachineLearning.Bayes.Count.Count.class);
        job.setMapperClass(MyMapper.class);
        job.setCombinerClass(MyCombiner.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));
        if (!job.waitForCompletion(true))
            return;
    }
}

package hadoop.MachineLearning.Bayes.Count;// (separate file)

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        String[] line = ivalue.toString().split(":| ");// line[0] is the class label
        String key = "1";
        for (int i = 1; i < line.length; i++) {
            context.write(new Text(key), new Text(line[i]));// emit every word under the same key so they all reach a single reduce call
        }
    }
}

package hadoop.MachineLearning.Bayes.Count;// (separate file)

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyCombiner extends Reducer<Text, Text, Text, Text> {// first drop duplicate words locally on each node with a set

    public void reduce(Text _key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Set<String> set = new HashSet<String>();
        for (Text val : values) {
            set.add(val.toString());
        }
        for (Iterator<String> it = set.iterator(); it.hasNext();) {
            context.write(new Text("1"), new Text(it.next()));
        }
    }
}

package hadoop.MachineLearning.Bayes.Count;// (separate file)

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, Text, Text> {// after the combiner, deduplicate with a set again and output the number of distinct words

    public void reduce(Text _key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Set<String> set = new HashSet<String>();
        for (Text val : values) {
            set.add(val.toString());
        }
        context.write(new Text("num is "), new Text(String.valueOf(set.size())));
    }
}
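For reference, here is a minimal sketch of the alternative mentioned above: key on the word itself so that duplicates collapse during the shuffle, then count distinct keys in a single reducer. This is untested and only illustrates the idea; the class names are made up, the driver is omitted, and it relies on job.setNumReduceTasks(1) so that every word reaches the same reducer.

package hadoop.MachineLearning.Bayes.Count;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DistinctWordMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        String[] line = ivalue.toString().split(":| ");// line[0] is the class label, skip it
        for (int i = 1; i < line.length; i++) {
            context.write(new Text(line[i]), NullWritable.get());// the word itself is the key, so duplicates merge in the shuffle
        }
    }
}

package hadoop.MachineLearning.Bayes.Count;// (separate file)

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DistinctWordReducer extends Reducer<Text, NullWritable, Text, IntWritable> {

    private int distinct = 0;

    public void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        distinct++;// with a single reducer there is exactly one reduce() call per distinct word
    }

    protected void cleanup(Context context) throws IOException, InterruptedException {
        context.write(new Text("num is "), new IntWritable(distinct));
    }
}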
 
 
3. Computing the conditional probabilities. This needs the per-class count of each word, the total number of words in each class, and the number of distinct words in the corpus; all of these are available from the previous outputs, and I load them into maps here. Some words never occur under a class, for example P(Japan | yes) and P(Tokyo | yes); treating them as 0 would force the whole conditional product to 0, so add-one smoothing is used (see the link above). One more detail: a lot of conditional probabilities are produced, and they really need an efficient storage and lookup structure. Because my skills are limited I simply keep them in an in-memory map, which would be very inefficient for large data sets.
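Concretely, Tokyo and Japan never occur under class 1, so without smoothing P(Tokyo|1) = P(Japan|1) = 0 and any document containing either word would get probability 0 for class 1; with add-one smoothing they become (0+1)/(8+6) = 1/14 ≈ 0.0714, which is exactly the noFind fallback value used later in the prediction reducer.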

package hadoop.MachineLearning.Bayes.Cond;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CondiPro {// computes the conditional probabilities
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String input = "hdfs://10.107.8.110:9000/Bayes/Bayes_input";
        String output = "hdfs://10.107.8.110:9000/Bayes/Bayes_output/Con";
        String proPath = "hdfs://10.107.8.110:9000/Bayes/Bayes_output/Pro";// output of the earlier per-class word-count job
        String countPath = "hdfs://10.107.8.110:9000/Bayes/Bayes_output/Count";// output of the earlier distinct-word count
        conf.set("propath", proPath);
        conf.set("countPath", countPath);
        Job job = Job.getInstance(conf, "ConditionPro");
        job.setJarByClass(hadoop.MachineLearning.Bayes.Cond.CondiPro.class);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);// the reducer writes DoubleWritable values
        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));
        if (!job.waitForCompletion(true))
            return;
    }
}

package hadoop.MachineLearning.Bayes.Cond;// (separate file)

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        String[] line = ivalue.toString().split(":| ");
        for (int i = 1; i < line.length; i++) {
            String key = line[0] + ":" + line[i];// the key is "class:word"
            context.write(new Text(key), new IntWritable(1));
        }
    }
}

package hadoop.MachineLearning.Bayes.Cond;// (separate file)

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {

    public Map<String, Integer> map;
    public int count = 0;

    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        String proPath = conf.get("propath");
        String countPath = conf.get("countPath");
        map = Utils.getMapFormHDFS(proPath);// word counts per class
        count = Utils.getCountFromHDFS(countPath);// number of distinct words
    }

    public void reduce(Text _key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;// occurrences of this "class:word" pair
        for (IntWritable val : values) {
            sum += val.get();
        }
        int type = Integer.parseInt(_key.toString().split(":")[0]);
        double probability = 0.0;
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            if (type == Integer.parseInt(entry.getKey())) {
                probability = (sum + 1) * 1.0 / (entry.getValue() + count);// add-one smoothed conditional probability
            }
        }
        context.write(_key, new DoubleWritable(probability));
    }
}

package hadoop.MachineLearning.Bayes.Cond;// (separate file)

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

public class Utils {

    // reads "key value" lines from every file under input into a map with integer values
    // (note: the split below assumes key and value are separated by a space; TextOutputFormat's default separator is a tab, so adjust the pattern if necessary)
    public static Map<String, Integer> getMapFormHDFS(String input) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(input);
        FileSystem fs = path.getFileSystem(conf);
        FileStatus[] stats = fs.listStatus(path);
        Map<String, Integer> map = new HashMap<String, Integer>();
        for (int i = 0; i < stats.length; i++) {
            if (stats[i].isFile()) {
                FSDataInputStream infs = fs.open(stats[i].getPath());
                LineReader reader = new LineReader(infs, conf);
                Text line = new Text();
                while (reader.readLine(line) > 0) {
                    String[] temp = line.toString().split(" ");
                    map.put(temp[0], Integer.parseInt(temp[1]));
                }
                reader.close();
            }
        }
        return map;
    }

    // same as above but parses the values as doubles (used for the conditional probabilities)
    public static Map<String, Double> getMapFormHDFS(String input, boolean j) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(input);
        FileSystem fs = path.getFileSystem(conf);
        FileStatus[] stats = fs.listStatus(path);
        Map<String, Double> map = new HashMap<String, Double>();
        for (int i = 0; i < stats.length; i++) {
            if (stats[i].isFile()) {
                FSDataInputStream infs = fs.open(stats[i].getPath());
                LineReader reader = new LineReader(infs, conf);
                Text line = new Text();
                while (reader.readLine(line) > 0) {
                    String[] temp = line.toString().split(" ");
                    map.put(temp[0], Double.parseDouble(temp[1]));
                }
                reader.close();
            }
        }
        return map;
    }

    // reads the "num is <n>" line produced by the Count job
    public static int getCountFromHDFS(String input) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(input);
        FileSystem fs = path.getFileSystem(conf);
        FileStatus[] stats = fs.listStatus(path);
        int count = 0;
        for (int i = 0; i < stats.length; i++) {
            if (stats[i].isFile()) {
                FSDataInputStream infs = fs.open(stats[i].getPath());
                LineReader reader = new LineReader(infs, conf);
                Text line = new Text();
                while (reader.readLine(line) > 0) {
                    String[] temp = line.toString().split(" ");
                    count = Integer.parseInt(temp[1]);
                }
                reader.close();
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {// small manual test of the helpers
        String proPath = "hdfs://10.107.8.110:9000/Bayes/Bayes_output/Pro";
        String countPath = "hdfs://10.107.8.110:9000/Bayes/Bayes_output/Count/";
        Map<String, Integer> map = Utils.getMapFormHDFS(proPath);
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            System.out.println(entry.getKey() + "->" + entry.getValue());
        }
        int count = Utils.getCountFromHDFS(countPath);
        System.out.println("count is " + count);
    }
}
 
4. Prediction. For an input such as Chinese, Chinese, Chinese, Tokyo, Japan, the mapper emits the line once for each class (0 and 1) as type:words; the reducer then looks each word up in the conditional probabilities and simply multiplies them together, along with the prior.
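Plugging in the numbers from the training set above, the two products the reducer computes for Chinese Chinese Chinese Tokyo Japan work out to roughly (8/11)·(6/14)^3·(1/14)·(1/14) ≈ 2.9×10^-4 for class 1 and (3/11)·(2/9)^3·(2/9)·(2/9) ≈ 1.5×10^-4 for class 0, so the document is assigned to class 1.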

package hadoop.MachineLearning.Bayes.Predict;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Predict {// prediction
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String input = "hdfs://10.107.8.110:9000/Bayes/Predict_input";
        String output = "hdfs://10.107.8.110:9000/Bayes/Bayes_output/Predict";
        String condiProPath = "hdfs://10.107.8.110:9000/Bayes/Bayes_output/Con";
        String proPath = "hdfs://10.107.8.110:9000/Bayes/Bayes_output/Pro";
        String countPath = "hdfs://10.107.8.110:9000/Bayes/Bayes_output/Count";
        conf.set("condiProPath", condiProPath);
        conf.set("proPath", proPath);
        conf.set("countPath", countPath);
        Job job = Job.getInstance(conf, "Predict");
        job.setJarByClass(hadoop.MachineLearning.Bayes.Predict.Predict.class);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));
        if (!job.waitForCompletion(true))
            return;
    }
}

package hadoop.MachineLearning.Bayes.Predict;// (separate file)

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    public Map<String, Integer> map = new HashMap<String, Integer>();

    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        String proPath = conf.get("proPath");
        map = Utils.getMapFormHDFS(proPath);// the keys of this map are the class labels
    }

    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            context.write(new Text(entry.getKey()), ivalue);// tag every input line with every class label so the reducer can score each class
        }
    }
}

package hadoop.MachineLearning.Bayes.Predict;// (separate file)

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, Text, DoubleWritable> {

    public Map<String, Double> mapDouble = new HashMap<String, Double>();// the conditional probabilities
    public Map<String, Integer> mapInteger = new HashMap<String, Integer>();// the word counts per class
    public Map<String, Double> noFind = new HashMap<String, Double>();// fallback probability for words never seen in a class
    public Map<String, Double> prePro = new HashMap<String, Double>();// the computed prior probabilities

    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        String condiProPath = conf.get("condiProPath");
        String proPath = conf.get("proPath");
        String countPath = conf.get("countPath");
        mapDouble = Utils.getMapFormHDFS(condiProPath, true);
        mapInteger = Utils.getMapFormHDFS(proPath);
        int count = Utils.getCountFromHDFS(countPath);
        for (Map.Entry<String, Integer> entry : mapInteger.entrySet()) {
            noFind.put(entry.getKey(), (1.0 / (count + entry.getValue())));// add-one smoothing with a word count of 0
        }
        int sum = 0;
        for (Map.Entry<String, Integer> entry : mapInteger.entrySet()) {
            sum += entry.getValue();
        }
        for (Map.Entry<String, Integer> entry : mapInteger.entrySet()) {
            prePro.put(entry.getKey(), (entry.getValue() * 1.0 / sum));// prior = words in the class / total words
        }
    }

    public void reduce(Text _key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String type = _key.toString();
        double pro = 1.0;
        for (Text val : values) {
            String[] words = val.toString().split(" ");
            for (int i = 0; i < words.length; i++) {
                String condi = type + ":" + words[i];
                if (mapDouble.get(condi) != null) {// the word occurs in this class, so a learned conditional probability exists
                    pro = pro * mapDouble.get(condi);
                } else {// otherwise fall back to the smoothed default
                    pro = pro * noFind.get(type);
                }
            }
        }
        pro = pro * prePro.get(type);
        context.write(new Text(type), new DoubleWritable(pro));
    }
}

package hadoop.MachineLearning.Bayes.Predict;// (separate file)

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

public class Utils {// identical to the Utils class in the Cond package

    // reads "key value" lines from every file under input into a map with integer values
    public static Map<String, Integer> getMapFormHDFS(String input) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(input);
        FileSystem fs = path.getFileSystem(conf);
        FileStatus[] stats = fs.listStatus(path);
        Map<String, Integer> map = new HashMap<String, Integer>();
        for (int i = 0; i < stats.length; i++) {
            if (stats[i].isFile()) {
                FSDataInputStream infs = fs.open(stats[i].getPath());
                LineReader reader = new LineReader(infs, conf);
                Text line = new Text();
                while (reader.readLine(line) > 0) {
                    String[] temp = line.toString().split(" ");
                    map.put(temp[0], Integer.parseInt(temp[1]));
                }
                reader.close();
            }
        }
        return map;
    }

    // same as above but parses the values as doubles (used for the conditional probabilities)
    public static Map<String, Double> getMapFormHDFS(String input, boolean j) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(input);
        FileSystem fs = path.getFileSystem(conf);
        FileStatus[] stats = fs.listStatus(path);
        Map<String, Double> map = new HashMap<String, Double>();
        for (int i = 0; i < stats.length; i++) {
            if (stats[i].isFile()) {
                FSDataInputStream infs = fs.open(stats[i].getPath());
                LineReader reader = new LineReader(infs, conf);
                Text line = new Text();
                while (reader.readLine(line) > 0) {
                    String[] temp = line.toString().split(" ");
                    map.put(temp[0], Double.parseDouble(temp[1]));
                }
                reader.close();
            }
        }
        return map;
    }

    // reads the "num is <n>" line produced by the Count job
    public static int getCountFromHDFS(String input) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(input);
        FileSystem fs = path.getFileSystem(conf);
        FileStatus[] stats = fs.listStatus(path);
        int count = 0;
        for (int i = 0; i < stats.length; i++) {
            if (stats[i].isFile()) {
                FSDataInputStream infs = fs.open(stats[i].getPath());
                LineReader reader = new LineReader(infs, conf);
                Text line = new Text();
                while (reader.readLine(line) > 0) {
                    String[] temp = line.toString().split(" ");
                    count = Integer.parseInt(temp[1]);
                }
                reader.close();
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {// small manual test of the helpers
        String proPath = "hdfs://10.107.8.110:9000/Bayes/Bayes_output/Pro";
        String countPath = "hdfs://10.107.8.110:9000/Bayes/Bayes_output/Count/";
        Map<String, Integer> map = Utils.getMapFormHDFS(proPath);
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            System.out.println(entry.getKey() + "->" + entry.getValue());
        }
        int count = Utils.getCountFromHDFS(countPath);
        System.out.println("count is " + count);
    }
}
 
 
 
 
 
 
 
