原文地址:http://www.baeldung.com/java-read-lines-large-file

1. Overview

This tutorial will show how to read all the lines from a large file in Java in an efficient manner.

This article is part of the “Java – Back to Basic” tutorial here on Baeldung.

2. Reading In Memory

The standard way of reading the lines of the file is in-memory – both Guava and Apache Commons IO provide a quick way to do just that:

1
Files.readLines(new File(path), Charsets.UTF_8);
1
FileUtils.readLines(new File(path));

The problem with this approach is that all the file lines are kept in memory – which will quickly lead to OutOfMemoryError if the File is large enough.

For example – reading a ~1Gb file:

1
2
3
4
5
@Test
public void givenUsingGuava_whenIteratingAFile_thenWorks() throws IOException {
    String path = ...
    Files.readLines(new File(path), Charsets.UTF_8);
}

This starts off with a small amount of memory being consumed: (~0 Mb consumed)

1
2
[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 128 Mb
[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 116 Mb

However, after the full file has been processed, we have at the end: (~2 Gb consumed)

1
2
[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 2666 Mb
[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 490 Mb

Which means that about 2.1 Gb of memory are consumed by the process – the reason is simple – the lines of the file are all being stored in memory now.

It should be obvious by this point that keeping in-memory the contents of the file will quickly exhaust the available memory – regardless of how much that actually is.

What’s more, we usually don’t need all of the lines in the file in memory at once – instead, we just need to be able to iterate through each one, do some processing and throw it away. So, this is exactly what we’re going to do – iterate through the lines without holding the in memory.

3. Streaming Through the File

Let’s now look at a solution – we’re going to use a java.util.Scanner to run through the contents of the file and retrieve lines serially, one by one:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
FileInputStream inputStream = null;
Scanner sc = null;
try {
    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // System.out.println(line);
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
} finally {
    if (inputStream != null) {
        inputStream.close();
    }
    if (sc != null) {
        sc.close();
    }
}

This solution will iterate through all the lines in the file – allowing for processing of each line – without keeping references to them – and in conclusion, without keeping them in memory(~150 Mb consumed)

1
2
[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 763 Mb
[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 605 Mb

4. Streaming with Apache Commons IO

The same can be achieved using the Commons IO library as well, by using the customLineIterator provided by the library:

1
2
3
4
5
6
7
8
9
LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    LineIterator.closeQuietly(it);
}

Since the entire file is not fully in memory – this will also result in pretty conservative memory consumption numbers(~150 Mb consumed)

1
2
[main] INFO  o.b.java.CoreJavaIoIntegrationTest - Total Memory: 752 Mb
[main] INFO  o.b.java.CoreJavaIoIntegrationTest - Free Memory: 564 Mb

5. Conclusion

This quick article shows how to process lines in a large file without iteratively, without exhausting the available memory – which proves quite useful when working with these large files.

The implementation of all these examples and code snippets can be found in my github project – this is an Eclipse based project, so it should be easy to import and run as it is.

Java – Reading a Large File Efficiently--转的更多相关文章

  1. Loading Large Bitmaps Efficiently

    有效地加载大位图文件-Loading Large Bitmaps Efficiently 图像有各种不同的形状和大小.在许多情况下,他们往往比一个典型应用程序的用户界面(UI)所需要的资源更大.例如, ...

  2. java之io之file类的常用操作

    java io 中,file类是必须掌握的.它的常用api用法见实例. package com.westward.io; import java.io.File; import java.io.IOE ...

  3. linux出现bash: ./java: cannot execute binary file 问题的解决办法

    问题现象描述: 到orcal官网上下载了两个jdk: (1)jdk-7u9-linux-i586.tar.gz ------------>32位 (2)jdk-7u9-linux-x64.tar ...

  4. java: cannot execute binary file

    转自:http://jxwpx.blog.51cto.com/15242/222572 java: cannot execute binary file 如果遇到这个错,一般是操作系统位数出问题了. ...

  5. -bash: /tyrone/jdk/jdk1.8.0_91/bin/java: cannot execute binary file

    问题描述:今天在linux环境下安装了一下JDK,安装成功后,打算输入java -version去测试一下,结果却出错了. 错误信息:-bash: /tyrone/jdk/jdk1.8.0_91/bi ...

  6. Github Upload Large File 上传超大文件

    Github中单个文件的大小限制是100MB,为了能突破这个限制,我们需要使用Git Large File Storage这个工具,参见这个官方帖子,但是按照其给的步骤,博主未能成功上传超大文件,那么 ...

  7. Reading Lines from File in C++

    Reading Lines from File in C++ In C++, istringstream has been used to read lines from a file. code: ...

  8. 使用JAVA API 解析ORC File

    使用JAVA API 解析ORC File orc File 的解析过程中,使用FileInputFormat的getSplits(conf, 1)函数, 然后使用 RecordReaderreade ...

  9. java.lang.IllegalStateException: Zip File is closed

    最近在研究利用sax读取excel大文件时,出现了以下的错误: java.lang.IllegalStateException: Zip File is closed at org.apache.po ...

随机推荐

  1. json.js

    由于json官网被强,现保存源码一份以备不时之需,直接保存成js文件即可. /* json.js 2007-08-05 Public Domain This file adds these metho ...

  2. c# 读取导入的excel文件,循环批量处理数据

    dt = FM_HR_ShiftMaintenanceManager.GetCsvToDataTable(strConn, excelName,"XJSQMonthlyImportExcel ...

  3. ThinkPad X260 UEFI安装 win7 64位 方法

    ThinkPad X260   UEFI安装 win7 64位 方法 1.使用DG重新格式化硬盘,格式为GPT 2.使用CGI  安装 WIM文件 (image不知是否可以,下次测试) 3.改BIOS ...

  4. windows下搭建hadoop-2.6.0本地idea开发环境

    概述 本文记录windows下hadoop本地开发环境的搭建: OS:windows hadoop执行模式:独立模式 安装包结构: Hadoop-2.6.0-Windows.zip - cygwinI ...

  5. js---14公有私有成员方法

    var ns1 = {}; //命名空间 ns1.ns11 = {};//子命名空间 ns1.module1 = {name:"a",m:function(){}}; consol ...

  6. Toeplitz matrix 与 Circulant matrix

    之所以专门定义两个新的概念,在于它们特殊的形式,带来的特别的形式. 1. Toeplitz matrix 对角为常数: n×n 的矩阵 A 是 Toepliz 矩阵当且仅当,对于 Ai,j 有: Ai ...

  7. Ajax缓存原理

    一.什么是Ajax缓存原理? Ajax在发送的数据成功后,会把请求的URL和返回的响应结果保存在缓存内,当下一次调用Ajax发送相同的请求时,它会直接从缓存中把数据取出来,这是为了提高页面的响应速度和 ...

  8. 43.可变参数实现printf

    #include <stdio.h> #include <stdio.h> #include <Windows.h> #include <stdarg.h&g ...

  9. java 参数

    -Xmx:size java最大堆内存 -Xms:size 初始化内存 -Xmn:size 年轻带堆大小 -XX:NewSize=size 年轻带的大小 -XX:NewRatio=ratio 年轻带和 ...

  10. STM32上使用JSON

    一.STM32工程中添加JSON 最近在一网2串项目,串口和网口之间可能需要定义一下简单的通信协议,而通信协议上则需要去定义一下通信的数据格式,上次听剑锋说要用Json来定义,目前查了下资料具体如何去 ...