Reducejoin sample

The sample files are the same as in the "sample join analysis" post.

The previous example used a map-side join. This one uses a reduce-side join.

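Write a different mapper for each kind of source, so that each mapper handles its own file. Every mapper emits studentno as the key; the value is the remaining fields plus a tag that identifies the source.

Then use MultipleInputs to register a different mapper for each input path.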
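As a rough sketch of the tagging step, here are the two source-specific mappers written as plain Java functions, with no Hadoop types. The tag strings and the sample records are assumptions made for illustration; the only thing taken from the post is that each line starts with studentno followed by a comma. In a real job each function would be the body of its own Mapper, registered for its path with MultipleInputs.addInputPath.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java stand-in for two source-specific mappers (hypothetical names).
class TaggingMappersSketch {
    // Hypothetical tags; any two distinct markers would work.
    static final String STUDENT_TAG = "student";
    static final String SCORE_TAG = "score";

    // "Mapper" for the student file.
    static Map.Entry<String, String> mapStudent(String line) {
        return tag(STUDENT_TAG, line);
    }

    // "Mapper" for the score file.
    static Map.Entry<String, String> mapScore(String line) {
        return tag(SCORE_TAG, line);
    }

    // Splits "studentno,rest" and returns (studentno, "tag,rest").
    private static Map.Entry<String, String> tag(String tag, String line) {
        int comma = line.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("no studentno field: " + line);
        }
        String studentNo = line.substring(0, comma);
        String rest = line.substring(comma + 1);
        return new SimpleEntry<>(studentNo, tag + "," + rest);
    }

    public static void main(String[] args) {
        // Made-up sample records.
        System.out.println(mapStudent("1,Tom,male"));
        System.out.println(mapScore("1,math,90"));
    }
}
```

Because both functions emit the same key for the same student, the shuffle can bring the two kinds of record together, and the tag tells the reducer which is which.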

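On the reduce side, the student records and exam scores that share a studentno are sent to the same reduce call, and the values carry all of that information. Extract the two kinds of record from the values, then take their Cartesian product.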
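The reduce step can be sketched without Hadoop as well. The class below is a minimal illustration, assuming the values arrive tagged as "student,..." or "score,..." (the format the mapper further down produces); the class name, method name and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for one reduce() call of a reduce-side join.
class ReduceSideCrossSketch {
    static List<String> join(String studentNo, List<String> taggedValues) {
        List<String> students = new ArrayList<>();
        List<String> scores = new ArrayList<>();
        // Split the mixed values back into the two sources, dropping the tag.
        for (String v : taggedValues) {
            if (v.startsWith("student,")) {
                students.add(v.substring("student,".length()));
            } else if (v.startsWith("score,")) {
                scores.add(v.substring("score,".length()));
            }
        }
        // Cartesian product: every student record with every score record.
        List<String> out = new ArrayList<>();
        for (String student : students) {
            for (String score : scores) {
                out.add(studentNo + "\t" + student + "," + score);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up values for studentno 1.
        List<String> values = Arrays.asList("student,Tom", "score,math,90", "score,english,85");
        for (String row : join("1", values)) {
            System.out.println(row);
        }
    }
}
```

A key that has values from only one source produces no output, so this behaves as an inner join. Both lists are held in memory for each key, which is fine for small groups but is one of the places the approach could be optimized.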

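In the sample code below I did not use MultipleInputs. Instead I modified TextInputFormat a little so that it returns the file name together with the current line.

Based on the file name, the single mapper handles records from the two different files and emits them with different tags.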
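The file-name dispatch can be tried out in isolation. This is a small sketch, not the mapper itself: the file names student.txt and score.txt and the sample lines are assumptions, and returning null for a line without a comma is my own choice for handling malformed input.

```java
// Plain-Java sketch of choosing the tag from the source file name.
class FileNameTagSketch {
    // Any file whose name contains "student" is treated as the student file.
    static String tagFor(String fileName) {
        return fileName.contains("student") ? "student" : "score";
    }

    // Returns {studentno, "tag,rest"}, or null when the line has no comma.
    static String[] toKeyValue(String fileName, String line) {
        int comma = line.indexOf(',');
        if (comma < 0) {
            return null; // malformed line: nothing to join on
        }
        // substring(comma) keeps the comma, so the value reads "tag,rest".
        return new String[] { line.substring(0, comma), tagFor(fileName) + line.substring(comma) };
    }

    public static void main(String[] args) {
        String[] kv = toKeyValue("student.txt", "1,Tom");
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```

This keeps a single mapper class, at the cost of tying the logic to how the input files are named.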

There is still a lot in the code below that could be optimized; I will update it later.

package myexamples;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.LineReader;

public class reducejoin {

    // Input format whose key is the source file name and whose value is one line
    public static class MyTextInputFormat extends FileInputFormat<Text, Text> {
        @Override
        public MyLineRecordReader createRecordReader(InputSplit split,
                TaskAttemptContext context) {
            return new MyLineRecordReader();
        }

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Compressed files cannot be split
            CompressionCodec codec = new CompressionCodecFactory(
                    context.getConfiguration()).getCodec(file);
            return codec == null;
        }
    }

    // Adapted from LineRecordReader: the key is the file name instead of the byte offset
    public static class MyLineRecordReader extends RecordReader<Text, Text> {
        private static final Log LOG = LogFactory
                .getLog(MyLineRecordReader.class);

        private CompressionCodecFactory compressionCodecs = null;
        private long start;
        private long pos;
        private long end;
        private LineReader in;
        private int maxLineLength;
        private Text key = null;
        private Text value = null;

        @Override
        public void initialize(InputSplit genericSplit,
                TaskAttemptContext context) throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            Configuration job = context.getConfiguration();
            this.maxLineLength = job.getInt(
                    "mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
            start = split.getStart();
            end = start + split.getLength();
            final Path file = split.getPath();
            // The key stays the same for every record of this split: the file name
            key = new Text(file.getName());
            compressionCodecs = new CompressionCodecFactory(job);
            final CompressionCodec codec = compressionCodecs.getCodec(file);

            // open the file and seek to the start of the split
            FileSystem fs = file.getFileSystem(job);
            FSDataInputStream fileIn = fs.open(split.getPath());
            boolean skipFirstLine = false;
            if (codec != null) {
                in = new LineReader(codec.createInputStream(fileIn), job);
                end = Long.MAX_VALUE;
            } else {
                if (start != 0) {
                    skipFirstLine = true;
                    --start;
                    fileIn.seek(start);
                }
                in = new LineReader(fileIn, job);
            }
            if (skipFirstLine) { // skip first line and re-establish "start".
                start += in.readLine(new Text(), 0,
                        (int) Math.min((long) Integer.MAX_VALUE, end - start));
            }
            this.pos = start;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            // key was set to the file name in initialize(), so only value needs creating
            if (value == null) {
                value = new Text();
            }
            int newSize = 0;
            while (pos < end) {
                newSize = in.readLine(value, maxLineLength, Math.max(
                        (int) Math.min(Integer.MAX_VALUE, end - pos),
                        maxLineLength));
                if (newSize == 0) {
                    break;
                }
                pos += newSize;
                if (newSize < maxLineLength) {
                    break;
                }
                // line too long. try again
                LOG.info("Skipped line of size " + newSize + " at pos "
                        + (pos - newSize));
            }
            if (newSize == 0) {
                // end of split
                key = null;
                value = null;
                return false;
            } else {
                return true;
            }
        }

        @Override
        public Text getCurrentKey() {
            return key;
        }

        @Override
        public Text getCurrentValue() {
            return value;
        }

        /**
         * Get the progress within the split
         */
        @Override
        public float getProgress() {
            if (start == end) {
                return 0.0f;
            } else {
                return Math.min(1.0f, (pos - start) / (float) (end - start));
            }
        }

        @Override
        public synchronized void close() throws IOException {
            if (in != null) {
                in.close();
            }
        }
    }

    public static class studentMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        public void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            // key is the source file name, value is one line: "studentno,field,..."
            String line = value.toString();
            int comma = line.indexOf(',');
            if (comma < 0) {
                return; // empty or malformed line: no studentno to join on
            }
            // Keep the leading comma so the value becomes "student,..." or "score,..."
            String rest = line.substring(comma);
            String tag = key.toString().contains("student") ? "student" : "score";
            context.write(new Text(line.substring(0, comma)), new Text(tag + rest));
        }
    }

    public static class studentReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> students = new ArrayList<String>();
            List<String> scores = new ArrayList<String>();
            // Separate the tagged values by source, dropping the tag and its comma
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("student")) {
                    students.add(v.substring(8)); // skip "student,"
                } else {
                    scores.add(v.substring(6)); // skip "score,"
                }
            }
            // Cartesian product of the student records and the score records
            for (String student : students) {
                for (String score : scores) {
                    context.write(key, new Text(student + "," + score));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Fall back to the sample HDFS paths only when no arguments are given
        if (args.length == 0) {
            args = new String[] {
                    "hdfs://namenode:9000/user/hadoop/student/",
                    "hdfs://namenode:9000/user/hadoop/reducejoinout" };
        }
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: reducejoin <in> <out>");
            System.exit(2);
        }
        // myUtils is my own helper class (not shown here); it clears the output folder
        myUtils.myUtils.DeleteFolder(conf, otherArgs[1]);
        conf.set("io.sort.mb", "10");
        Job job = new Job(conf, "reduce join");
        job.setInputFormatClass(MyTextInputFormat.class);
        // job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setJarByClass(reducejoin.class);
        job.setMapperClass(studentMapper.class);
        job.setReducerClass(studentReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}
