Hadoop Study Notes (Chapter 3: The Hadoop Distributed File System)

map->shuffle->reduce

map(k1,v1)--->(k2,v2)

reduce(k2,List<v2>)--->(k2,v3)
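To make the three signatures above concrete, here is a minimal word-count sketch in plain Java. It uses no Hadoop classes; the class name MapShuffleReduceSketch and its helper methods are made up for illustration, and the shuffle is just an in-memory group-by-key, which is an assumption of the sketch rather than how Hadoop moves data between nodes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of map -> shuffle -> reduce; no Hadoop classes are used.
class MapShuffleReduceSketch {

    // map(k1, v1) -> (k2, v2) pairs: here (line offset, line) -> (word, 1)
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // shuffle: collect every v2 under its k2, with keys kept in sorted order
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // reduce(k2, List<v2>) -> (k2, v3): here v3 is the sum of the counts
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Runs the three phases in order over a list of input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            mapped.addAll(map(offset, line));
            offset += line.length() + 1;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(mapped).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hdfs=1, hello=2}
        System.out.println(wordCount(Arrays.asList("hello hadoop", "hello hdfs")));
    }
}
```

The point to notice is that reduce never sees individual pairs, only a key together with all of its values, which is exactly what the List<v2> in the signature expresses.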

Types used for data transfer (Writable keys and values): org.apache.hadoop.io

Accessing the HDFS file system

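1. The setURLStreamHandlerFactory() method of java.net.URL. It can be called only once per JVM, so it is usually called in a static initializer block. If a third-party component you depend on has already called it, calling it again throws an error.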

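import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class App
{
    static {
        // Teach java.net.URL the hdfs:// scheme; allowed only once per JVM.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception
    {
        InputStream inputStream = null;
        try {
            inputStream = new URL(args[0]).openStream();
            // 4096-byte buffer; false = leave the streams open after copying
            IOUtils.copyBytes(inputStream, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(inputStream);
        }
    }
}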
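The once-per-JVM restriction can be observed with the JDK alone. The sketch below registers a do-nothing factory twice; UrlFactoryOnceSketch and tryInstall are hypothetical names, and the no-op factory stands in for FsUrlStreamHandlerFactory so that no Hadoop jar is needed.

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

// Demonstrates the once-per-JVM rule with a do-nothing factory (no Hadoop needed).
class UrlFactoryOnceSketch {

    // Tries to install a factory; returns false if the JDK rejects the registration.
    static boolean tryInstall(URLStreamHandlerFactory factory) {
        try {
            URL.setURLStreamHandlerFactory(factory);
            return true;
        } catch (Error alreadyDefined) {
            // java.net.URL reports a repeated registration with java.lang.Error.
            // Catching Error this broadly is acceptable only in a demo.
            return false;
        }
    }

    public static void main(String[] args) {
        // Returning null tells the JDK to fall back to its built-in handlers.
        URLStreamHandlerFactory noop = protocol -> null;
        System.out.println("first call installed:  " + tryInstall(noop));
        System.out.println("second call installed: " + tryInstall(noop));
    }
}
```

In a fresh JVM the first call should succeed and the second should be rejected. Because the failure is an Error rather than a checked exception, a library that registers its own factory early can break this approach without any compile-time warning, which is one reason the FileSystem API in the next section is usually the safer choice.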

2. Reading data with the FileSystem API

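import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class App {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration configuration = new Configuration();
        // The URI scheme (for example hdfs://) selects the FileSystem implementation
        FileSystem fs = FileSystem.get(new URI(uri), configuration);
        InputStream inputStream = null;
        try {
            inputStream = fs.open(new Path(uri));
            IOUtils.copyBytes(inputStream, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(inputStream);
        }
    }
}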

// In fact, the open() method of FileSystem returns an FSDataInputStream, which implements the Seekable and PositionedReadable interfaces.

public class FSDataInputStream extends DataInputStream implements Seekable, PositionedReadable, ByteBufferReadable, HasFileDescriptor, CanSetDropBehind, CanSetReadahead, HasEnhancedByteBufferAccess, CanUnbuffer {

}

public interface Seekable {
  /**
   * Seek to the given offset from the start of the file.
   * The next read() will be from that location.  Can't
   * seek past the end of the file.
   */
  void seek(long pos) throws IOException;

  /**
   * Return the current offset from the start of the file
   */
  long getPos() throws IOException;

  /**
   * Seeks a different copy of the data.  Returns true if
   * found a new source, false otherwise.
   */
  @InterfaceAudience.Private
  boolean seekToNewSource(long targetPos) throws IOException;
}

public interface PositionedReadable {
  /**
   * Read upto the specified number of bytes, from a given
   * position within a file, and return the number of bytes read. This does not
   * change the current offset of a file, and is thread-safe.
   */
  public int read(long position, byte[] buffer, int offset, int length)
    throws IOException;

  /**
   * Read the specified number of bytes, from a given
   * position within a file. This does not
   * change the current offset of a file, and is thread-safe.
   */
  public void readFully(long position, byte[] buffer, int offset, int length)
    throws IOException;

  /**
   * Read number of bytes equal to the length of the buffer, from a given
   * position within a file. This does not
   * change the current offset of a file, and is thread-safe.
   */
  public void readFully(long position, byte[] buffer) throws IOException;
}

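The difference between read() and readFully(): readFully() blocks until length bytes have been read (and throws an EOFException if the file ends first), whereas read() may read fewer than length bytes and returns the number it actually read.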
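The same contrast exists in java.io.DataInputStream, the class that FSDataInputStream extends, so it can be tried without a cluster. The sketch below asks for 8 bytes from a 5-byte source; ReadVsReadFullySketch and its two helpers are hypothetical names, and the stream-based read()/readFully() shown here are the inherited variants, not the positioned ones from PositionedReadable.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Contrasts read() and readFully() on an in-memory stream (no Hadoop needed).
class ReadVsReadFullySketch {

    // read(): returns however many bytes were available, up to the requested length.
    static int plainRead(byte[] data, int length) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.read(new byte[length], 0, length);
        }
    }

    // readFully(): succeeds only if exactly `length` bytes can be read.
    static boolean fullRead(byte[] data, int length) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readFully(new byte[length], 0, length);
            return true;
        } catch (EOFException e) {
            return false; // the stream ended before `length` bytes arrived
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] fiveBytes = {1, 2, 3, 4, 5};
        System.out.println(plainRead(fiveBytes, 8)); // prints 5
        System.out.println(fullRead(fiveBytes, 8));  // prints false
        System.out.println(fullRead(fiveBytes, 5));  // prints true
    }
}
```

In practice this means a caller of read() has to check the return value and loop, while a caller of readFully() has to be prepared for an EOFException instead.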

seek() is a relatively expensive operation and should be used sparingly.
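The difference between moving the offset with seek() and reading at an explicit position can be tried on a local file with java.nio. This is only an analogy: FileChannel.position(long) plays the role of seek(), FileChannel.position() the role of getPos(), and FileChannel.read(ByteBuffer, long) the role of the positioned read(). The class name SeekVsPositionedReadSketch and the temp-file contents are made up for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Local-file analogy for Seekable and PositionedReadable; uses java.nio, not Hadoop.
class SeekVsPositionedReadSketch {

    // Returns {offset after seek + relative read, offset after a positioned read}.
    static long[] demo() throws IOException {
        Path tmp = Files.createTempFile("seek-sketch", ".txt");
        try {
            Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
                ByteBuffer buffer = ByteBuffer.allocate(3);

                channel.position(4);   // analogous to seek(4)
                channel.read(buffer);  // relative read: advances the offset
                long afterSeekRead = channel.position();

                buffer.clear();
                channel.read(buffer, 0); // positioned read: the offset is left alone
                long afterPositionedRead = channel.position();

                return new long[] {afterSeekRead, afterPositionedRead};
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(Arrays.toString(demo()));
    }
}
```

The two values printed are equal, because the positioned read does not move the channel's offset. That property is what lets several threads share one stream through the positioned methods, whereas seek() followed by read() changes shared state.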

3. Writing data

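import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class App {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];
        InputStream inputStream = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration configuration = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), configuration);
        OutputStream outputStream = fs.create(new Path(dst), new Progressable() {
            @Override
            public void progress() {
                // Called back by Hadoop as the write makes progress
                System.out.print(".");
            }
        });
        // true = close both streams once the copy finishes
        IOUtils.copyBytes(inputStream, outputStream, 4096, true);
    }
}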
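The shape of the listing above, a buffered copy loop plus a progress callback, can be sketched with the JDK alone. CopyWithProgressSketch.copy is a hypothetical helper, not Hadoop's IOUtils.copyBytes, and calling the callback once per buffer is an assumption of the sketch: with the real API it is Hadoop that decides when progress() fires.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Stand-alone copy loop with a progress callback; uses only java.io.
class CopyWithProgressSketch {

    // Copies in to out, invoking progress after each chunk written; returns bytes copied.
    static long copy(InputStream in, OutputStream out, int bufferSize, Runnable progress)
            throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
            progress.run();
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink, 4096,
                () -> System.out.print("."));
        System.out.println(" " + copied + " bytes"); // prints "... 10000 bytes"
    }
}
```

With a 4096-byte buffer the 10,000 bytes arrive in three chunks (4096, 4096, 1808), so three dots are printed before the byte count.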

4. Directories and querying

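FileSystem provides the mkdirs() method for creating directories. You usually do not need to call it explicitly, because create() automatically creates any missing parent directories when it writes a file.

public boolean mkdirs(Path f) throws IOException;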

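FileSystem provides the getFileStatus() method, which returns a file's metadata as a FileStatus object. The metadata includes the file's path, length, permissions, and so on.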

public abstract FileStatus getFileStatus(Path f) throws IOException;

FileSystem provides the listStatus() method for listing the files in a directory.

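import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class App {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration configuration = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), configuration);
        // Every command-line argument is treated as a path to list
        Path[] paths = new Path[args.length];
        for (int i = 0; i < paths.length; i++) {
            paths[i] = new Path(args[i]);
        }
        // listStatus(Path[]) returns the listings of all given paths in one array
        FileStatus[] status = fs.listStatus(paths);
        Path[] listPaths = FileUtil.stat2Paths(status);
        for (Path path : listPaths) {
            System.out.println(path);
        }
    }
}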

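FileSystem also provides the globStatus() method, which returns every FileStatus whose path matches the given pattern. The PathFilter argument can narrow the matches further.

public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException;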
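The division of labour between the pattern and the filter can be sketched locally with java.nio's glob matcher. This is a stand-in, not Hadoop's matcher: GlobSketch.glob is a hypothetical helper, a Predicate<String> plays the role of PathFilter, the date-style directory names are invented sample data, and java.nio's glob syntax is assumed to be close enough to Hadoop's for the wildcards used here.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Local stand-in for globStatus(pattern, filter); uses java.nio, not Hadoop.
class GlobSketch {

    // Keeps the candidates that match the glob AND are accepted by the filter.
    static List<String> glob(String pattern, Predicate<String> filter, List<String> candidates) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        List<String> hits = new ArrayList<>();
        for (String candidate : candidates) {
            if (matcher.matches(Paths.get(candidate)) && filter.test(candidate)) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList("2007/12/30", "2007/12/31", "2008/01/01");
        // Glob only: every day under 2007
        System.out.println(glob("2007/*/*", path -> true, dirs));
        // Glob plus filter: drop one path that the glob alone cannot exclude
        System.out.println(glob("2007/*/*", path -> !path.endsWith("12/31"), dirs));
    }
}
```

The second call shows why a filter is useful at all: a glob can say "everything under 2007" but has no way to say "everything under 2007 except one day".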

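FileSystem provides the delete() method for permanently removing a file or directory. If f is a file or an empty directory, recursive is ignored. If f is a non-empty directory, it is deleted (along with its contents) only when recursive is true; otherwise an IOException is thrown.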

public abstract boolean delete(Path f, boolean recursive) throws IOException;
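The recursive rule can be mirrored on the local file system with java.nio. The sketch below is not Hadoop's implementation: DeleteSketch.delete is a hypothetical helper, and returning false for a path that does not exist is an assumption made for the sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Local-file mirror of the delete(f, recursive) rule; uses java.nio, not Hadoop.
class DeleteSketch {

    static boolean delete(Path f, boolean recursive) throws IOException {
        if (!Files.exists(f)) {
            return false; // nothing to delete (assumed behaviour for this sketch)
        }
        if (Files.isDirectory(f)) {
            boolean empty;
            try (Stream<Path> entries = Files.list(f)) {
                empty = !entries.findAny().isPresent();
            }
            if (!empty) {
                if (!recursive) {
                    throw new IOException(f + " is a non-empty directory");
                }
                // Reverse order puts children before their parents
                List<Path> toDelete;
                try (Stream<Path> walk = Files.walk(f)) {
                    toDelete = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
                }
                for (Path p : toDelete) {
                    Files.delete(p);
                }
                return true;
            }
        }
        // A file or an empty directory: the recursive flag plays no role
        Files.delete(f);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("delete-sketch");
        Files.write(dir.resolve("a.txt"), new byte[] {1});
        try {
            delete(dir, false);
        } catch (IOException expected) {
            System.out.println("refused: " + expected.getMessage());
        }
        System.out.println("deleted recursively: " + delete(dir, true));
    }
}
```

The first call is refused because the directory still holds a.txt; the second removes the file and then the directory.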
