Hadoop: Reading zip archives on HDFS and extracting them to HDFS (implementation code)

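Background:

At work we recently had to handle a large batch of data. Uploading it to the FTP server uncompressed would exhaust the FTP storage, so the only option was to compress it before uploading and then download the archives on a Linux machine once the upload finished. However, the Linux client has only about 20 GB of disk space, and a single archive takes up about 16 GB once extracted, so extracting directly on Linux is too much trouble (we have about 30 such archives to process in total).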

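Solution:

First upload the zip archives downloaded on Linux to HDFS. Once all of them have been uploaded, a program reads each archive directly from HDFS, extracts it to HDFS, and then compresses the extracted data as gzip. The implementation is as follows (reference: http://www.cnblogs.com/juefan/articles/2935163.html):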

import java.io.File;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;

/**
 * Created by Administrator on 12/10/2017.
 */

public class ConvertHdfsZipFileToGzipFile {
    public static boolean isRecur = false;

    public static void main(String[] args) throws IOException {
        String usage = "Usage: ConvertHdfsZipFileToGzipFile [-r|-R] <hdfsZipDir> <hdfsOutputDir> <fileNameKeyword>";
        if (args.length == 0)
            errorMessage(usage);
        if (args[0].matches("^-[rR]$")) {
            isRecur = true;
        }
        if ((isRecur && args.length != 4) || (!isRecur && args.length != 3)) {
            errorMessage(usage);
        }
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path inputDir;
        Path hdfsFile;
        Text pcgroupText;
        // Recursive: hadoop jar myjar.jar ConvertHdfsZipFileToGzipFile -r /zip/ /user/j/pconline/ pconline
        //   /zip/              HDFS directory that holds the zip files to convert
        //   /user/j/pconline/  HDFS directory where the converted files are stored
        //   pconline           string that the names of the files to convert must contain
        if (isRecur) {
            inputDir = new Path(args[1]);
            hdfsFile = new Path(args[2]);
            pcgroupText = new Text(args[3]);
        }
        // Non-recursive: hadoop jar myjar.jar ConvertHdfsZipFileToGzipFile /zip/ /user/j/pconline/ pconline
        // (same three arguments, without the -r flag)
        else {
            inputDir = new Path(args[0]);
            hdfsFile = new Path(args[1]);
            pcgroupText = new Text(args[2]);
        }
        if (!hdfs.exists(inputDir)) {
            errorMessage("Input directory does not exist: " + inputDir);
        }
        if (hdfs.exists(hdfsFile)) {
            errorMessage("Output path already exists: " + hdfsFile);
        }
        merge(inputDir, hdfsFile, hdfs, pcgroupText);
        System.exit(0);
    }

    /**
     * Unzips every matching zip archive under inputDir into hdfsFile and
     * re-compresses the extracted data as gzip.
     *
     * @param inputDir    HDFS directory that holds the zip files
     * @param hdfsFile    HDFS directory that receives the extracted results
     * @param hdfs        handle to the distributed file system
     * @param pcgroupText keyword that a file name must contain to be processed
     */
    public static void merge(Path inputDir, Path hdfsFile,
                             FileSystem hdfs, Text pcgroupText) {
        try {
            // FileStatus entries directly under inputDir
            FileStatus[] inputFiles = hdfs.listStatus(inputDir);
            for (int i = 0; i < inputFiles.length; i++) {
                Path zipPath = inputFiles[i].getPath();
                String zipName = zipPath.getName();
                if (!hdfs.isFile(zipPath)) {
                    if (isRecur) {
                        // Descend into the subdirectory, then carry on with the remaining
                        // entries (the original returned here and skipped them)
                        merge(zipPath, hdfsFile, hdfs, pcgroupText);
                    } else {
                        System.out.println(zipName
                                + " is not a file and recursion is not enabled, skip!");
                    }
                    continue;
                }
                // Only process files whose name contains the keyword
                if (!zipName.contains(pcgroupText.toString())) {
                    continue;
                }
                // Print the name of the archive being processed
                System.out.println(zipName);
                // Output name: the archive name up to its first '.'
                int dot = zipName.indexOf(".");
                String baseName = dot > 0 ? zipName.substring(0, dot) : zipName;
                Path tempFile = new Path(hdfsFile, baseName);
                Path gzipFile = new Path(hdfsFile, baseName + ".gz");
                byte[] buffer = new byte[2 * 1024 * 1024];

                // Step 1: unzip. If the archive holds several entries, they are all
                // extracted and concatenated into one temporary file.
                try (ZipInputStream zipInputStream = new ZipInputStream(hdfs.open(zipPath));
                     FSDataOutputStream mergerout = hdfs.create(tempFile)) {
                    ZipEntry entry;
                    while ((entry = zipInputStream.getNextEntry()) != null) {
                        int nNumber;
                        while ((nNumber = zipInputStream.read(buffer, 0, buffer.length)) != -1) {
                            mergerout.write(buffer, 0, nNumber);
                        }
                    }
                } catch (IOException e) {
                    // Report the failure instead of skipping the archive silently
                    e.printStackTrace();
                    continue;
                }

                // Step 2: compress the merged temporary file to gzip
                try (FSDataInputStream inputStream = hdfs.open(tempFile);
                     FSDataOutputStream outputStream = hdfs.create(gzipFile);
                     GZIPOutputStream gzipOutputStream = new GZIPOutputStream(outputStream)) {
                    int len;
                    while ((len = inputStream.read(buffer)) != -1) {
                        gzipOutputStream.write(buffer, 0, len);
                    }
                } catch (IOException exception) {
                    exception.printStackTrace();
                }

                // Step 3: delete the temporary file left over from the unzip step
                try {
                    if (hdfs.exists(tempFile)) {
                        hdfs.delete(tempFile, true);
                    }
                } catch (IOException ie) {
                    ie.printStackTrace();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void errorMessage(String str) {
        System.out.println("Error Message: " + str);
        System.exit(1);
    }
}
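The listing writes the extracted data to a temporary HDFS file and then reads it back to gzip it. As a possible alternative, the sketch below streams the zip entries straight into a gzip stream, so no temporary file is needed. It is a minimal illustration, not the program above: the class name `ZipToGzipSketch` and the method `zipToGzip` are hypothetical, and it works on plain `InputStream`/`OutputStream` (with an in-memory zip as stand-in data) so that it runs without a Hadoop cluster.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not part of the original program.
final class ZipToGzipSketch {

    /**
     * Concatenates every entry of the zip stream and writes the result as one
     * gzip stream. Returns the number of uncompressed bytes copied.
     */
    static long zipToGzip(InputStream zipSource, OutputStream gzipTarget) throws IOException {
        long total = 0;
        byte[] buffer = new byte[64 * 1024];
        try (ZipInputStream zin = new ZipInputStream(zipSource);
             GZIPOutputStream gzout = new GZIPOutputStream(gzipTarget)) {
            while (zin.getNextEntry() != null) {
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zin.read(buffer, 0, buffer.length)) != -1) {
                    gzout.write(buffer, 0, n);
                    total += n;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Build a small two-entry zip in memory to stand in for an archive on HDFS.
        ByteArrayOutputStream zipBytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(zipBytes)) {
            zout.putNextEntry(new ZipEntry("part1.txt"));
            zout.write("hello\n".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
            zout.putNextEntry(new ZipEntry("part2.txt"));
            zout.write("world\n".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        }

        ByteArrayOutputStream gzBytes = new ByteArrayOutputStream();
        long copied = zipToGzip(new ByteArrayInputStream(zipBytes.toByteArray()), gzBytes);

        // Read the gzip back to confirm it holds the concatenated entries.
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (GZIPInputStream gin =
                 new GZIPInputStream(new ByteArrayInputStream(gzBytes.toByteArray()))) {
            byte[] buf = new byte[1024];
            int n;
            while ((n = gin.read(buf)) != -1) {
                plain.write(buf, 0, n);
            }
        }
        System.out.println(copied + " bytes copied:");
        System.out.print(plain.toString("UTF-8"));
    }
}
```

On HDFS the same method could presumably be called with `hdfs.open(zipPath)` and `hdfs.create(gzipPath)`, since `FSDataInputStream` is an `InputStream` and `FSDataOutputStream` is an `OutputStream`. The trade-off is that the uncompressed data never lands on HDFS, which saves space and one full read/write pass but also means there is no intermediate file to inspect if something goes wrong.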

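Invocation:

[c@v09823]# hadoop jar myjar.jar [ConvertHdfsZipFileToGzipFile (main class name; whether it is needed depends on how the jar was packaged)] /zip/ (HDFS directory that holds the zip files to convert) /user/j/pconline/ (HDFS directory where the converted files are stored) pconline (string that the names of the files to convert must contain)

To process subdirectories recursively, add -r as the first program argument, before the input directory.

Snapshot during execution: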

[c@v09823 ~]$ hadoop fs -ls /user/c/df/myzip
// :: INFO hdfs.PeerCache: SocketCache disabled.
Found  items
-rw-r--r--+   c hadoop  -- : user/c/df/myzip/myzip_0.zip
-rw-r--r--+   c hadoop  -- : user/c/df/myzip/myzip_12.zip
-rw-r--r--+   c hadoop  -- : user/c/df/myzip/myzip_15.zip
...

[c@v09823 ~]$ yarn jar My_ConvertHdfsZipFileToGzipFile.jar /user/c/df/myzip user/c/df/mygzip .zip
// :: INFO hdfs.PeerCache: SocketCache disabled.
myzip_0.zip
myzip_12.zip
myzip_15.zip
...

[catt@vq20skjh01 ~]$ hadoop fs -ls -h user/c/df/mygzip
// :: INFO hdfs.PeerCache: SocketCache disabled.
Found  items
-rw-r--r--+   c hadoop      14.9 G -- : user/c/df/mygzip/myzip_0
-rw-r--r--+   c hadoop      14.9 G -- : user/c/df/mygzip/myzip_12
-rw-r--r--+   c hadoop          G -- : user/c/df/mygzip/myzip_15
....
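The output names in the listing above (myzip_0 for myzip_0.zip) follow from how the program derives them: it keeps the part of the archive name before the first dot. The sketch below isolates that rule so its edge cases are easier to see. The class `OutputNameSketch` and the method `baseName` are hypothetical names used only for illustration, and the guard for names without a dot is a defensive choice made in this sketch.

```java
// Hypothetical helper that mirrors the name derivation used by the program.
final class OutputNameSketch {

    /** Returns the part of fileName before its first '.', or the whole name if there is none. */
    static String baseName(String fileName) {
        int dot = fileName.indexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }

    public static void main(String[] args) {
        String[] names = {"myzip_0.zip", "data.2017.zip", "data.2018.zip", "noextension"};
        for (String name : names) {
            // The gzip output is written under the base name plus ".gz"
            System.out.println(name + " -> " + baseName(name) + ".gz");
        }
    }
}
```

Because only the text before the first dot is kept, archives such as data.2017.zip and data.2018.zip would map to the same output name, so it seems safest to give the archives names that differ before the first dot.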
