How would you remove duplicate lines from a file that is  much too large to fit in memory? The duplicate lines are not necessarily adjacent, and say the file is 10 times bigger than RAM.

A better solution is to use HashSet to store each line of input.txt. As set ignores duplicate values, so while storing a line, check if it already present in hashset. Write it to output.txt only if not present in hashset.

Java:

// Efficient Java program to remove
// duplicates from input.txt and
// save output to output.txt import java.io.*;
import java.util.HashSet; public class FileOperation
{
public static void main(String[] args) throws IOException
{
// PrintWriter object for output.txt
PrintWriter pw = new PrintWriter("output.txt"); // BufferedReader object for input.txt
BufferedReader br = new BufferedReader(new FileReader("input.txt")); String line = br.readLine(); // set store unique values
HashSet<String> hs = new HashSet<String>(); // loop for each line of input.txt
while(line != null)
{
// write only if not
// present in hashset
if(hs.add(line))
pw.println(line); line = br.readLine(); } pw.flush(); // closing resources
br.close();
pw.close(); System.out.println("File operation performed successfully");
}
}

  

How to remove duplicate lines in a large text file?的更多相关文章

  1. notepad++ remove duplicate line

    To remove duplicate lines just press Ctrl + F, select the “Replace” tab and in the “Find” field, pla ...

  2. Compare, sort, and delete duplicate lines in Notepad ++

    Compare, sort, and delete duplicate lines in Notepad ++ Organize Lines: Since version 6.5.2 the app ...

  3. [LeetCode] Remove Duplicate Letters 移除重复字母

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

  4. 316. Remove Duplicate Letters

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

  5. Remove Duplicate Letters I & II

    Remove Duplicate Letters I Given a string which contains only lowercase letters, remove duplicate le ...

  6. LeetCode Remove Duplicate Letters

    原题链接在这里:https://leetcode.com/problems/remove-duplicate-letters/ 题目: Given a string which contains on ...

  7. leetcode@ [316] Remove Duplicate Letters (Stack & Greedy)

    https://leetcode.com/problems/remove-duplicate-letters/ Given a string which contains only lowercase ...

  8. Remove Duplicate Letters

    316. Remove Duplicate Letters Total Accepted: 2367 Total Submissions: 12388 Difficulty: Medium Given ...

  9. [Swift]LeetCode316. 去除重复字母 | Remove Duplicate Letters

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

随机推荐

  1. Python3学习笔记36-PEP8代码规范

    在使用PyCharm时,最右边会有波浪线警告提示代码不符合PEP8代码规范.记录一下犯的错和解决方式 PEP8是风格错误,而不是编码错误.只是为了让代码更具有阅读性. 1)block comment ...

  2. Appium&Java自动化实现移动端几种典型动作

    一.Appium4.0 Pinch&Zoom /* * @FileName Pinch_Zoom: Pinch_Zoom * @author davieyang * @create 2018- ...

  3. GDB调试器教程

    启动和退出GDBGDB(GNU Project Debugger)几乎适用于所有类Unix系统,小巧方便且不失功能强大,Linux/Unix程序员经常用它来调试程序. 总的来说有几下几种方法启动GDB ...

  4. 鼠标点击自定义文字展现特效JS代码

    JS特效使用方法 只需将如下JS代码放到</body>之前就好了 var a_idx = 0; jQuery(document).ready(function($) { $("b ...

  5. Spring实战(第4版)

    第1部分 Spring的核心 Spring的两个核心:依赖注入(dependency injection,DI)和面向切面编程(aspec-oriented programming,AOP) POJO ...

  6. linux 没有界面内容显示不全解决办法

    1.管道 管道简单理解就是,使用管道意味着第一个命令的输出会作为第二个命令的输入,第二个命令的输出又会作为第三个命令的输入,依此类推.利用Linux所提供的管道符“|”将两个命令隔开,管道符左边命令的 ...

  7. 【2-sat】8.14B. 黑心老板

    2-sat 只写过板子 题目大意 有一个长度为$k$取值只有01的序列,现在$n$个人每人下注三个位置,请构造一个序列使每个人最多猜对一个位置 $k\le 5000,n \le 10000$ 题目分析 ...

  8. Windows下安装配置Apache+PHP+Mysql环境

    1.下载相关安装包 Apache下载: http://archive.apache.org/dist/httpd/binaries/win32/ ,选择httpd-2.2.25-win32-x86-n ...

  9. json注解及序列化

    一.json框架 市面上的json框架常用的有 jackson.gson.fastjson.大家比较推崇的是fastjson,但是springmvc默认集成的是 jackson. 在一个项目中建议一个 ...

  10. Hbuilder + MUI 的简单案例

    话不多说 直接上代码 项目结构: index.html 的代码 <!DOCTYPE html><html>    <head>        <meta ch ...