Reducejoin sample

The sample files are the same as in "sample join analysis".

The previous example used a map-side join. This one uses a reduce-side join.

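Write a separate mapper for each source category, each handling its own file. Every mapper emits studentno as the key; the value is the remaining fields plus a category tag.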
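The tagging step can be pictured without Hadoop. What follows is a minimal plain-Java sketch, not the author's mapper: the class name `TagSketch`, the method `tag`, and the sample rows are hypothetical, and it assumes comma-separated lines whose first field is studentno.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper: turns one input line into a (studentno, tagged value) pair.
class TagSketch {

    // category is "student" or "score"; line is assumed to look like "studentno,field,...".
    static Map.Entry<String, String> tag(String category, String line) {
        int comma = line.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("no comma in line: " + line);
        }
        String studentNo = line.substring(0, comma);
        String rest = line.substring(comma + 1);
        // The category prefix lets the reducer tell the two sources apart.
        return new SimpleEntry<>(studentNo, category + "," + rest);
    }

    public static void main(String[] args) {
        // Sample rows are made up for illustration.
        System.out.println(tag("student", "1001,Alice,CS"));
        System.out.println(tag("score", "1001,math,90"));
    }
}
```

Because both sources end up with the same key type and value type, a single reducer can receive them side by side and rely on the prefix alone to sort them out.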

Then use MultipleInputs to register a different mapper for each input path.

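On the reduce side, the student information and exam scores that share a studentno are delivered to the same reduce call, and the values carry both kinds of information. Extract the two kinds of records from the values, then take their Cartesian product.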
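The grouping and Cartesian product can be simulated in plain Java. This is only a sketch under stated assumptions: `ReduceJoinSketch` and `join` are hypothetical names, a `TreeMap` stands in for Hadoop's shuffle, and the rows are made up. It assumes tagged values of the form "student,..." and "score,...".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of the shuffle and reduce steps of a reduce-side join.
class ReduceJoinSketch {

    // Each input pair is {studentno, taggedValue}, e.g. {"1001", "student,Alice,CS"}.
    static List<String> join(List<String[]> mapOutput) {
        // "Shuffle": collect all values that share a key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : mapOutput) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            List<String> students = new ArrayList<>();
            List<String> scores = new ArrayList<>();
            // Separate the values by their category tag and strip the tag.
            for (String value : group.getValue()) {
                if (value.startsWith("student,")) {
                    students.add(value.substring("student,".length()));
                } else if (value.startsWith("score,")) {
                    scores.add(value.substring("score,".length()));
                }
            }
            // Cartesian product of the two sides for this key.
            for (String student : students) {
                for (String score : scores) {
                    result.add(group.getKey() + "\t" + student + "," + score);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = new ArrayList<>();
        mapOutput.add(new String[] {"1001", "student,Alice,CS"});
        mapOutput.add(new String[] {"1001", "score,math,90"});
        mapOutput.add(new String[] {"1001", "score,physics,85"});
        mapOutput.add(new String[] {"1002", "score,math,70"}); // no student row
        for (String line : join(mapOutput)) {
            System.out.println(line);
        }
    }
}
```

A key that appears on only one side produces no output, because one of the two lists is empty. In other words, the Cartesian product per key behaves as an inner join.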

In the sample code below I did not use MultipleInputs. Instead, I modified parts of TextInputFormat so that it returns the file name together with the current line.

Based on the file name, a single mapper handles the records of the two different files and emits them with different category tags.
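As a rough picture of what the mapper receives, here is a plain-Java sketch that pairs every line with the name of the file it came from and then picks a tag from that name. It is separate from the Hadoop listing in this post and only imitates the idea with `java.nio.file`. The names `FileNameLineSketch`, `readWithFileName`, and `categoryFor`, as well as the file names and rows, are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical stand-in for an input format that yields (file name, line) pairs.
class FileNameLineSketch {

    // Returns {fileName, line} for every line of every regular file directly under dir.
    static List<String[]> readWithFileName(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (Stream<Path> listing = Files.list(dir)) {
            listing.filter(Files::isRegularFile).sorted().forEach(files::add);
        }
        List<String[]> records = new ArrayList<>();
        for (Path file : files) {
            String name = file.getFileName().toString();
            for (String line : Files.readAllLines(file)) {
                records.add(new String[] {name, line});
            }
        }
        return records;
    }

    // Picks the category tag from the file name, mirroring the check used by the mapper.
    static String categoryFor(String fileName) {
        return fileName.contains("student") ? "student" : "score";
    }

    public static void main(String[] args) throws IOException {
        // Temporary files with made-up rows, so the sketch runs anywhere.
        Path dir = Files.createTempDirectory("reducejoin-sketch");
        Files.write(dir.resolve("student.txt"), Arrays.asList("1001,Alice,CS"));
        Files.write(dir.resolve("score.txt"), Arrays.asList("1001,math,90"));
        for (String[] r : readWithFileName(dir)) {
            System.out.println(categoryFor(r[0]) + " <- " + r[0] + ": " + r[1]);
        }
    }
}
```

Deciding the category by a substring of the file name is simple but fragile: a score file whose name happens to contain "student" would be tagged incorrectly, so the file names need to be chosen with that in mind.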

The listing below still has plenty of room for optimization; I will update it later.

package myexamples;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.LineReader;

public class reducejoin {

    // Like TextInputFormat, but each record's key is the file name and the value is the line.
    public static class MyTextInputFormat extends FileInputFormat<Text, Text> {

        @Override
        public MyLineRecordReader createRecordReader(InputSplit split,
                TaskAttemptContext context) {
            return new MyLineRecordReader();
        }

        // A file is treated as splittable only when no compression codec matches it.
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            CompressionCodec codec = new CompressionCodecFactory(
                    context.getConfiguration()).getCodec(file);
            return codec == null;
        }
    }

    public static class MyLineRecordReader extends RecordReader<Text, Text> {

        private static final Log LOG = LogFactory
                .getLog(MyLineRecordReader.class);

        private CompressionCodecFactory compressionCodecs = null;
        private long start;
        private long pos;
        private long end;
        private LineReader in;
        private int maxLineLength;
        private Text key = null;   // name of the file this split belongs to
        private Text value = null; // current line

        @Override
        public void initialize(InputSplit genericSplit,
                TaskAttemptContext context) throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            Configuration job = context.getConfiguration();
            this.maxLineLength = job.getInt(
                    "mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
            start = split.getStart();
            end = start + split.getLength();
            final Path file = split.getPath();
            // every record in this split gets the file name as its key
            key = new Text(file.getName());
            compressionCodecs = new CompressionCodecFactory(job);
            final CompressionCodec codec = compressionCodecs.getCodec(file);

            // open the file and seek to the start of the split
            FileSystem fs = file.getFileSystem(job);
            FSDataInputStream fileIn = fs.open(split.getPath());
            boolean skipFirstLine = false;
            if (codec != null) {
                in = new LineReader(codec.createInputStream(fileIn), job);
                end = Long.MAX_VALUE;
            } else {
                if (start != 0) {
                    skipFirstLine = true;
                    --start;
                    fileIn.seek(start);
                }
                in = new LineReader(fileIn, job);
            }
            if (skipFirstLine) { // skip first line and re-establish "start".
                start += in.readLine(new Text(), 0,
                        (int) Math.min((long) Integer.MAX_VALUE, end - start));
            }
            this.pos = start;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            // key was set in initialize(); only the value buffer is created lazily
            if (value == null) {
                value = new Text();
            }
            int newSize = 0;
            while (pos < end) {
                newSize = in.readLine(value, maxLineLength, Math.max(
                        (int) Math.min(Integer.MAX_VALUE, end - pos),
                        maxLineLength));
                if (newSize == 0) {
                    break;
                }
                pos += newSize;
                if (newSize < maxLineLength) {
                    break;
                }
                // line too long. try again
                LOG.info("Skipped line of size " + newSize + " at pos "
                        + (pos - newSize));
            }
            if (newSize == 0) {
                // end of split: no more records
                key = null;
                value = null;
                return false;
            } else {
                return true;
            }
        }

        @Override
        public Text getCurrentKey() {
            return key;
        }

        @Override
        public Text getCurrentValue() {
            return value;
        }

        /**
         * Get the progress within the split
         */
        @Override
        public float getProgress() {
            if (start == end) {
                return 0.0f;
            } else {
                return Math.min(1.0f, (pos - start) / (float) (end - start));
            }
        }

        @Override
        public synchronized void close() throws IOException {
            if (in != null) {
                in.close();
            }
        }
    }

    public static class studentMapper extends Mapper<Text, Text, Text, Text> {

        // key is the input file name; value is one line of the form "studentno,field,..."
        @Override
        public void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            int comma = line.indexOf(",");
            if (comma < 0) {
                return; // skip malformed lines that have no comma after studentno
            }
            // strv keeps the leading comma, so the value becomes "student,..." or "score,..."
            String strv = line.substring(comma);
            Text newvalue;
            if (key.toString().contains("student")) { // student file
                newvalue = new Text("student" + strv);
            } else {
                newvalue = new Text("score" + strv);
            }
            Text newkey = new Text(line.substring(0, comma));
            context.write(newkey, newvalue);
        }
    }

    public static class studentReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> students = new ArrayList<String>();
            List<String> scores = new ArrayList<String>();
            // separate the values by tag, stripping "student," (8 chars) or "score," (6 chars)
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("student")) {
                    students.add(v.substring(8));
                } else {
                    scores.add(v.substring(6));
                }
            }
            // emit the Cartesian product of the student records and the score records
            for (String student : students) {
                for (String score : scores) {
                    context.write(key, new Text(student + "," + score));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // fall back to the sample paths when no arguments are given
        if (args.length == 0) {
            args = "hdfs://namenode:9000/user/hadoop/student/ hdfs://namenode:9000/user/hadoop/reducejoinout"
                    .split(" ");
        }
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: reducejoin <in> <out>");
            System.exit(2);
        }
        // myUtils is the author's own helper (not shown here); presumably it deletes an existing output folder
        myUtils.myUtils.DeleteFolder(conf, otherArgs[1]);
        conf.set("io.sort.mb", "10");
        Job job = new Job(conf, "reduce join");
        job.setInputFormatClass(MyTextInputFormat.class);
        // job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setJarByClass(reducejoin.class);
        job.setMapperClass(studentMapper.class);
        job.setReducerClass(studentReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}
