Linear Regression with MapReduce

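1. Software versions:

Hadoop 2.6.0 (the source is compiled in IDEA against CDH 5.7.3, which corresponds to Hadoop 2.6.0); the cluster runs vanilla Hadoop 2.6.4. JDK 1.8, IntelliJ IDEA 14.

The source code can be downloaded from https://github.com/fansy1990/linear_regression .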

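2. Approach:

This post implements univariate, first-degree linear regression, which is about the simplest linear model there is. The parameters are updated with the method from the large-scale machine learning part of the Coursera Machine Learning course, i.e. stochastic gradient descent. (Batch gradient descent would work as well; only the implementation in LinearRegressionJob would differ.) If you are not familiar with stochastic or batch gradient descent, read up on them first. The approach is as follows: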

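2.1 Shuffle Data:

Stochastic gradient descent needs the input records to be in random order, so the first step is to randomly shuffle the original data.

The idea: the Mapper emits a random value as the key and the current record as the value; the Reducer simply iterates over all values of each key and writes out each value together with NullWritable.get().

An extra parameter, randN, is added here. It specifies how many input records share the same random key in the Mapper. If randN is 1, every record gets its own random key; if randN is 2, every two records share one random key; if randN is 0 or negative, all records share a single random key. (Note that even then the values seen by the Reducer are in fact out of order. Why? This is left for the reader to think about.)

The core of the Mapper's map method is shown below: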

 protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        if (randN <= 0) { // randN <= 0: all records share one key, no new random key is drawn
            context.write(randFloatKey, value);
            return;
        }
        if (++countI >= randN) { // draw a new key every randN records; with randN == 1 every record gets its own key
            randFloatKey.set(random.nextFloat());
            countI = 0;
        }
        context.write(randFloatKey, value);
    }
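The effect of randN can be tried out without a cluster. The following is a minimal single-process sketch, not the project's code: the class ShuffleSketch, the method shuffle and the fixed seed are names and choices assumed for illustration, and a TreeMap stands in for the sort-by-key that Hadoop performs between map and reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

class ShuffleSketch {

    /** Mimics the mapper and reducer: every randN consecutive records share one random key. */
    static List<String> shuffle(List<String> records, int randN, long seed) {
        Random random = new Random(seed);
        float key = random.nextFloat();
        int count = 0;
        // TreeMap plays the role of the framework's sort by key (an assumption of this sketch).
        Map<Float, List<String>> grouped = new TreeMap<>();
        for (String record : records) {
            if (randN > 0 && ++count >= randN) {
                key = random.nextFloat(); // new key after randN records
                count = 0;
            }
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
        }
        // Reducer side: emit every value of every key and drop the key.
        List<String> out = new ArrayList<>();
        for (List<String> values : grouped.values()) {
            out.addAll(values);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("6.1101,17.592", "5.5277,9.1302", "8.5186,13.662");
        System.out.println("randN=1: " + shuffle(data, 1, 42L));
        System.out.println("randN=0: " + shuffle(data, 0, 42L));
    }
}
```

In this sketch randN <= 0 leaves the input order unchanged, because a single process feeds a single key. On a real cluster the values of that one key arrive from several mappers, so the order seen by the reducer is not the input order.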

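2.2 Linear Regression:

Linear regression updates theta0 and theta1 by stochastic gradient descent (only the univariate case is implemented, so there are just two parameters). Every Mapper starts from the same initial parameters (theta0=1 and theta1=0) and updates theta0 and theta1 using only its own share of the data. The update rules are:

theta0 = theta0 - alpha * (h(x) - y)

theta1 = theta1 - alpha * (h(x) - y) * x

where h(x) = theta0 + theta1 * x. Note that the update for theta0 carries no factor x, and that the two parameters must be updated simultaneously: both updates use the values of theta0 and theta1 from before the step. The core code is shown below:

protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        float[] xy = Utils.str2float(value.toString().split(splitter));
        float x = xy[0];
        float y = xy[1];
        // simultaneous update: compute h(x) - y once, from the old theta0 and theta1
        float error = theta0 + theta1 * x - y;
        theta0 -= alpha * error;     // gradient w.r.t. theta0 has no factor x
        theta1 -= alpha * error * x; // gradient w.r.t. theta1
    }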
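To see the simultaneous update in isolation, here is a small stand-alone sketch of a single stochastic gradient step. The class SgdStepSketch and the method step are hypothetical names introduced for illustration; the starting values theta0=1, theta1=0 and alpha=0.01 are the ones used in this post, and the three records are the sample rows of linear_regression.txt quoted in section 3.1.

```java
class SgdStepSketch {

    /** One stochastic gradient step for h(x) = theta0 + theta1 * x; returns {theta0, theta1}. */
    static float[] step(float theta0, float theta1, float alpha, float x, float y) {
        // h(x) - y is evaluated once with the OLD parameters, so both updates are simultaneous.
        float error = theta0 + theta1 * x - y;
        return new float[]{theta0 - alpha * error, theta1 - alpha * error * x};
    }

    public static void main(String[] args) {
        float[] theta = {1f, 0f};
        float[][] data = {{6.1101f, 17.592f}, {5.5277f, 9.1302f}, {8.5186f, 13.662f}};
        for (float[] xy : data) {
            theta = step(theta[0], theta[1], 0.01f, xy[0], xy[1]);
            System.out.println(theta[0] + "," + theta[1]);
        }
    }
}
```

Returning both new values from one call makes it impossible to use an already updated theta0 when computing theta1, which is the mistake the simultaneous-update rule guards against.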

Each Mapper then simply writes out its theta values in the cleanup function:

protected void cleanup(Context context) throws IOException, InterruptedException {
        theta0_1.set(theta0 + splitter + theta1);
        context.write(theta0_1, NullWritable.get());
    }

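Since each mapper has already updated the theta parameters, no reducer is needed. Also, because the test data set is small, mapreduce.input.fileinputformat.split.maxsize is set to a small value so that more than one mapper runs; set it according to the size of your own data. The core of the Driver class is shown below:

conf.setLong("mapreduce.input.fileinputformat.split.maxsize",700L);// small split size so that several mappers run

job.setNumReduceTasks(0);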
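How many mappers a given maxsize produces can be estimated with a little arithmetic. The sketch below mirrors, in simplified form, the split rule of Hadoop's FileInputFormat as far as I know it: split size = max(minSize, min(maxSize, blockSize)), with the last split allowed to be up to 10% larger. The class SplitCountSketch, its method names and the 1400-byte file length are assumptions made for illustration; they are not taken from the project.

```java
class SplitCountSketch {

    // Slack factor: a remainder of up to 1.1 split sizes is kept as one split (assumed to match Hadoop).
    static final double SPLIT_SLOP = 1.1;

    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Estimated number of splits (and therefore mappers) for a single file. */
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the last, possibly slightly larger or smaller, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long size = splitSize(128L * 1024 * 1024, 1L, 700L); // block size and min size are assumed defaults
        long hypotheticalFileLength = 1400L;                  // made-up length, for illustration only
        System.out.println("split size = " + size);
        System.out.println("splits = " + countSplits(hypotheticalFileLength, size));
    }
}
```

For real data sets the default split size (one HDFS block) is usually the right choice; forcing tiny splits as done here only makes sense for exercising the multi-mapper code path on a small test file.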

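2.3 Combine Theta (merging the parameter values):

Step 2.2 yields one pair of theta values per mapper. How should these be merged? Can we simply average them? For univariate linear regression the average can be used directly as the merged theta. For other regression problems (specifically, those with multiple local minima), merging theta values that were found separately becomes problematic.

If a plain average were all we wanted, adding a Reducer to step 2.2 would be enough. Here a different way of merging is proposed: weight each theta by its global error. To do so, the Mapper reads the theta values output by step 2.2 in its setup method, the map function accumulates each theta's squared error over the original records, and what is sent to the reducer is each theta together with its error. The core code is shown below: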

protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        float[] xy = Utils.str2float(value.toString().split(splitter));
        for (int i = 0; i < thetas.size(); i++) {
            // error = (theta0 + theta1 * x - y) ^2
            float diff = thetas.get(i)[0] + thetas.get(i)[1] * xy[0] - xy[1];
            thetaErrors[i] += diff * diff;
            thetaNumbers[i] += 1;
        }
    }

protected void cleanup(Context context) throws IOException, InterruptedException {
        for (int i = 0; i < thetas.size(); i++) {
            theta.set(thetas.get(i));
            floatAndLong.set(thetaErrors[i], thetaNumbers[i]);
            context.write(theta, floatAndLong);
        }
    }

On the Reducer side, the errors are summed for each key (i.e. each theta), and the cleanup function merges the theta values by weighting. The core code is shown below:

protected void reduce(FloatAndFloat key, Iterable<FloatAndLong> values, Context context) throws IOException, InterruptedException {
        float sumF = 0.0f;
        long sumL = 0L;
        for (FloatAndLong value : values) {
            sumF += value.getSumFloat();
            sumL += value.getSumLong();
        }
        // root mean squared error of this theta over all records
        float rmse = (float) Math.sqrt((double) sumF / sumL);
        theta_error.add(new float[]{key.getTheta0(), key.getTheta1(), rmse});
        logger.info("theta:{}, error:{}", new Object[]{key.toString(), rmse});
    }

protected void cleanup(Context context) throws IOException, InterruptedException {
        // How to weight?
        // Option 1: the smaller the error, the larger the weight
        // Option 2: plain average
        float[] theta_all = new float[2];
        if ("average".equals(method)) {
            for (int i = 0; i < theta_error.size(); i++) {
                theta_all[0] += theta_error.get(i)[0];
                theta_all[1] += theta_error.get(i)[1];
            }
            theta_all[0] /= theta_error.size();
            theta_all[1] /= theta_error.size();
        } else {
            // weight of each theta = (1 / error) / sum of all (1 / error)
            float sumErrors = 0.0f;
            for (float[] d : theta_error) {
                sumErrors += 1 / d[2];
            }
            for (float[] d : theta_error) {
                theta_all[0] += d[0] * 1 / d[2] / sumErrors;
                theta_all[1] += d[1] * 1 / d[2] / sumErrors;
            }
        }
        context.write(new FloatAndFloat(theta_all), NullWritable.get());
    }
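The two merging modes can be compared on a toy input. The sketch below is a stand-alone illustration, not the project's reducer: the class CombineThetaSketch, the method combine and the numbers in main are made up for the example. Each row is {theta0, theta1, error}, and the weighted mode uses weights proportional to 1/error, as in the cleanup code above.

```java
class CombineThetaSketch {

    /** Each row is {theta0, theta1, error}; returns the merged {theta0, theta1}. */
    static float[] combine(float[][] thetaErrors, boolean average) {
        float[] merged = new float[2];
        float weightSum = 0f;
        for (float[] t : thetaErrors) {
            // plain average: equal weights; otherwise the weight is the inverse of the error
            float w = average ? 1f : 1f / t[2];
            merged[0] += w * t[0];
            merged[1] += w * t[1];
            weightSum += w;
        }
        merged[0] /= weightSum;
        merged[1] /= weightSum;
        return merged;
    }

    public static void main(String[] args) {
        // Two made-up theta pairs; the first has the smaller error and should dominate when weighting.
        float[][] thetas = {{0f, 0f, 1f}, {3f, 6f, 2f}};
        float[] weighted = combine(thetas, false);
        float[] averaged = combine(thetas, true);
        System.out.println("weighted: " + weighted[0] + "," + weighted[1]);
        System.out.println("average:  " + averaged[0] + "," + averaged[1]);
    }
}
```

Note that inverse-error weighting breaks down if some theta has an error of exactly zero; a production version would need to guard against that division.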

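2.4 Validation

Validation here means computing the global error of the merged theta obtained in step 2.3. Since step 2.3 also computed the global error of each individual theta, the errors can be compared to see which theta is best. The Mapper from step 2.3 can be reused as is, and the reducer is similar to the one in step 2.3, except that the merging in cleanup is not needed for the final output.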
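The comparison described above comes down to computing a root mean squared error per candidate theta and picking the smallest. A minimal in-memory sketch, assuming the data fits in an array (the class RmseSketch, its methods and the candidate theta values in main are hypothetical; the three records are the sample rows quoted in section 3.1):

```java
class RmseSketch {

    /** Root mean squared error of h(x) = theta0 + theta1 * x over rows of {x, y}. */
    static double rmse(float theta0, float theta1, float[][] data) {
        double sum = 0.0;
        for (float[] xy : data) {
            double diff = theta0 + theta1 * xy[0] - xy[1];
            sum += diff * diff;
        }
        return Math.sqrt(sum / data.length);
    }

    /** Index of the candidate {theta0, theta1} with the smallest RMSE. */
    static int best(float[][] candidates, float[][] data) {
        int bestIndex = 0;
        for (int i = 1; i < candidates.length; i++) {
            if (rmse(candidates[i][0], candidates[i][1], data)
                    < rmse(candidates[bestIndex][0], candidates[bestIndex][1], data)) {
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        float[][] data = {{6.1101f, 17.592f}, {5.5277f, 9.1302f}, {8.5186f, 13.662f}};
        float[][] candidates = {{1f, 0f}, {0f, 2f}}; // made-up candidates for illustration
        System.out.println("best candidate index: " + best(candidates, data));
    }
}
```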

3. Results:

3.1 Shuffle Job

Test class:

public static void main(String[] args) throws Exception {
        args = new String[]{
                "hdfs://master:8020/user/fanzhe/linear_regression.txt",
                "hdfs://master:8020/user/fanzhe/shuffle_out",
                "1"
        };
        ToolRunner.run(Utils.getConf(), new ShuffleDataJob(), args);
    }

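Original data (linear_regression.txt can be downloaded from the resource folder of the source code):

6.1101,17.592
5.5277,9.1302
8.5186,13.662
...

Shuffle output:

The output should differ on every run (random numbers are used); the data is indeed randomized.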

3.2 Linear Regression

Test class:

public static void main(String[] args) throws Exception {
//        <input> <output> <theta0;theta1;alpha> <splitter> // note: the third argument is separated by semicolons
        args = new String[]{
                "hdfs://master:8020/user/fanzhe/shuffle_out",
                "hdfs://master:8020/user/fanzhe/linear_regression",
                "1;0;0.01",
                ","
        };
        ToolRunner.run(Utils.getConf(), new LinearRegressionJob(), args);
    }

Output:


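The output shows that the two results differ considerably, mainly because the test data set is small. With a larger data set that has been shuffled well, the two values should be close to each other.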

3.3 Combine Theta

Test class:

public static void main(String[] args) throws Exception {
//        <input> <output> <theta_path> <splitter> <average|weight>
        args = new String[]{
                "hdfs://master:8020/user/fanzhe/shuffle_out",
                "hdfs://master:8020/user/fanzhe/single_linear_regression_error",
                "hdfs://master:8020/user/fanzhe/linear_regression",
                ",",
                "weight"
        };
        ToolRunner.run(Utils.getConf(), new SingleLinearRegressionError(), args);
    }

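Here the theta values are merged by weighting; set the last argument to average to use the plain average instead.

Result:

The log shows that the lower of the two theta pairs has the smaller error. The merged parameter value is:

The result lies between the two theta pairs.

With averaging, the output is: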


3.4 Validation

Validation test class:

public static void main(String[] args) throws Exception {
//        <input> <output> <theta_path> <splitter>
        args = new String[]{
                "hdfs://master:8020/user/fanzhe/shuffle_out",
                "hdfs://master:8020/user/fanzhe/last_linear_regression_error",
                "hdfs://master:8020/user/fanzhe/single_linear_regression_error",
                ","
        };
        ToolRunner.run(Utils.getConf(), new LastLinearRegressionError(), args);
    }

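The output is:

The result shows that the merged theta does not perform as well as one of the original theta pairs, although this may be related to the amount of data. Based on the output, the merged theta can also be compared with the pre-merge ones, and the best theta can then be used as the final output.

With averaging, the output is:

The results above show that the weighted combination performs slightly better than the averaged one.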

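4. Summary

1. This algorithm only applies when there is a single local optimum (which is then the global optimum); otherwise the merging stage is problematic.

2. In tests on a small data set, the merged theta did not perform as well as the best pre-merge theta. This may be due to the data and remains to be verified.

3. Intuitively, the weighted combination should usually perform better than the averaged one.

Share, grow, enjoy.

When reposting, please credit the blog: http://blog.csdn.net/fansy1990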