The previous example wrote the SequenceFile directly with SequenceFile's createWriter; this example does the same merge with a MapReduce job.
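
For reference, the direct-write approach looks roughly like the sketch below. This is not the exact code from the previous post; the paths /demo and /demo.seq are placeholders. It lists the small files, reads each one fully, and appends it to a SequenceFile.Writer with the file path as the key.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DirectSequenceFileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("/demo");      // placeholder: directory of small files
        Path output = new Path("/demo.seq"); // placeholder: output SequenceFile
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(output),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));
            for (FileStatus status : fs.listStatus(input)) {
                byte[] contents = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    IOUtils.readFully(in, contents, 0, contents.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                // key = original file path, value = whole file contents
                writer.append(new Text(status.getPath().toString()), new BytesWritable(contents));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}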

1. Reading each small file as a whole requires a custom InputFormat, and the custom InputFormat in turn needs its own RecordReader. To get whole-file reads, the RecordReader loads all of a file's bytes in a single call (so each file must fit in memory, which is fine for small files).

1.1 Extend the RecordReader generic class and override its methods.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private Configuration conf;
    private BytesWritable value = new BytesWritable();
    private boolean processed = false;

    /**
     * Called once at initialization.
     *
     * @param split the split that defines the range of records to read
     * @param context the information about the task
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        this.fileSplit = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    /**
     * Read the next key, value pair.
     *
     * @return true if a key/value pair was read
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!processed) {
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length); // read the entire file in one call
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }
        return false;
    }

    /**
     * Get the current key.
     *
     * @return the current key or null if there is no current key
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    /**
     * Get the current value.
     *
     * @return the object that was read
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    /**
     * The current progress of the record reader through its data.
     *
     * @return a number between 0.0 and 1.0 that is the fraction of the data read
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return processed ? 1.0f : 0.0f;
    }

    /**
     * Close the record reader.
     */
    @Override
    public void close() throws IOException {
    }
}

1.2 Extend the FileInputFormat generic class and override the file input format.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    /**
     * Is the given filename splittable? Usually true, but if the file is
     * stream compressed, it will not be.
     * <p>
     * The default implementation in <code>FileInputFormat</code> always returns
     * true. Implementations that may deal with non-splittable files <i>must</i>
     * override this method.
     * <p>
     * <code>FileInputFormat</code> implementations can override this and return
     * <code>false</code> to ensure that individual input files are never split up,
     * so that mappers process entire files.
     *
     * @param context the job context
     * @param filename the file name to check
     * @return is this file splitable?
     */
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false; // never split, so each file is read as a whole
    }

    /**
     * Create a record reader for a given split. The framework will call
     * {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
     * the split is used.
     *
     * @param split the split to be read
     * @param context the information about the task
     * @return a new record reader
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        WholeFileRecordReader recordReader = new WholeFileRecordReader();
        recordReader.initialize(split, context);
        return recordReader;
    }
}

2. The Mapper. No reducer needs to be written; this example only merges files.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class SequenceFileMapper extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {

    enum FileCounter {
        FILENUM
    }

    private Text filenameKey;

    /**
     * Called once at the beginning of the task.
     *
     * @param context
     */
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        InputSplit split = context.getInputSplit();
        Path path = ((FileSplit) split).getPath();
        filenameKey = new Text(path.toString());
    }

    /**
     * Called once for each key/value pair in the input split. Most applications
     * should override this, but the default is the identity function.
     *
     * @param key
     * @param value
     * @param context
     */
    @Override
    protected void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        context.write(filenameKey, value);
        // user-defined counter
        context.getCounter(FileCounter.FILENUM).increment(1);
        // dynamic counter, grouped by file name
        context.getCounter("FileNameList", filenameKey.toString()).increment(1);
    }
}

3. Run the job. The Tool helper class is used here; it is optional, and you could configure and submit the Job directly instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SmallFilesToSequenceFileConverter extends Configured implements Tool {

    /**
     * Execute the command with the given arguments.
     *
     * @param args command specific arguments.
     * @return exit code.
     * @throws Exception
     */
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        if (conf == null) {
            return -1;
        }

        // delete the output path if it already exists
        Path outPath = new Path(args[1]);
        FileSystem fileSystem = outPath.getFileSystem(conf);
        if (fileSystem.exists(outPath)) {
            fileSystem.delete(outPath, true);
        }

        Job job = Job.getInstance(conf, "SmallFilesToSequenceFile");
        job.setJarByClass(SmallFilesToSequenceFileConverter.class);
        job.setMapperClass(SequenceFileMapper.class);
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        long startTime = System.currentTimeMillis();
        int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(), args);
        long endTime = System.currentTimeMillis();
        long timeSpan = endTime - startTime;
        System.out.println("Elapsed time: " + timeSpan + " ms.");
        System.exit(exitCode);
    }
}
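
If you also want to report the counters defined in SequenceFileMapper after the job finishes, the end of run() could be extended as in the sketch below. This is only an illustration of the Counters API, not part of the original driver; it assumes the driver and mapper are in the same package.

// Sketch: an alternative ending for run() that also prints the counters.
// Extra imports needed:
//   import org.apache.hadoop.mapreduce.Counter;
//   import org.apache.hadoop.mapreduce.Counters;
boolean success = job.waitForCompletion(true);
if (success) {
    Counters counters = job.getCounters();
    // user-defined counter: total number of files merged
    long files = counters.findCounter(SequenceFileMapper.FileCounter.FILENUM).getValue();
    System.out.println("Files merged: " + files);
    // dynamic counters: one entry per input file name
    for (Counter counter : counters.getGroup("FileNameList")) {
        System.out.println(counter.getName() + " = " + counter.getValue());
    }
}
return success ? 0 : 1;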

4. Upload to the cluster and run. When building the jar, put the META-INF directory at the same level as the src directory so the main-class entry point can be found.

# Manually set the number of reducers to 2; the job then produces two part files
[hadoop@bigdata-senior01 ~]$ hadoop jar SmallFilesToSequenceFileConverter.jar -D mapreduce.job.reduces=2 /demo /output3 ...
[hadoop@bigdata-senior01 ~]$ hadoop fs -ls /output3
Found 3 items
-rw-r--r--   1 hadoop supergroup          0 2019-02-18 16:17 /output3/_SUCCESS
-rw-r--r--   1 hadoop supergroup      60072 2019-02-18 16:17 /output3/part-r-00000
-rw-r--r--   1 hadoop supergroup      28520 2019-02-18 16:17 /output3/part-r-00001
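
To spot-check the result, the part files can be dumped with hadoop fs -text /output3/part-r-00000, or read back in code. Below is a minimal sketch (the part path is just the example from the listing above) that prints each stored file path and its size.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path part = new Path("/output3/part-r-00000"); // example path from the listing above
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(part));
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            // each record is one original small file: key = its path, value = its bytes
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value.getLength() + " bytes");
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}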
