How would you remove duplicate lines from a file that is much too large to fit in memory? The duplicate lines are not necessarily adjacent; assume the file is 10 times larger than RAM.

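One solution is to use a HashSet to store each line of input.txt. Because a set rejects duplicate values, adding a line also tells us whether it has been seen before: HashSet.add returns false when the line is already present. Write a line to output.txt only if it was not already in the set. Note that the set ends up holding every distinct line, so this approach works only when the unique lines (not the whole file) fit in memory.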

Java:

    // Efficient Java program to remove
    // duplicates from input.txt and
    // save output to output.txt
    import java.io.*;
    import java.util.HashSet;

    public class FileOperation
    {
        public static void main(String[] args) throws IOException
        {
            // PrintWriter object for output.txt
            PrintWriter pw = new PrintWriter("output.txt");

            // BufferedReader object for input.txt
            BufferedReader br = new BufferedReader(new FileReader("input.txt"));
            String line = br.readLine();

            // set store unique values
            HashSet<String> hs = new HashSet<String>();

            // loop for each line of input.txt
            while(line != null)
            {
                // write only if not
                // present in hashset
                if(hs.add(line))
                    pw.println(line);
                line = br.readLine();
            }
            pw.flush();

            // closing resources
            br.close();
            pw.close();
            System.out.println("File operation performed successfully");
        }
    }
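
When even the distinct lines do not fit in memory, one common workaround is to partition the input by line hash first and then apply the same HashSet idea to each partition separately. The following is a minimal sketch of that variant, not part of the program above. The class name PartitionedDedup, the dedupe helper, the bucket count of 64, and the UTF-8 encoding are all assumptions made for illustration. It also assumes that each bucket's distinct lines fit in memory and that the process may keep one open file per bucket.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class PartitionedDedup {

    // Hypothetical helper: splits the input into `buckets` temporary files
    // by line hash, then de-duplicates each bucket with its own HashSet.
    public static void dedupe(Path input, Path output, int buckets) throws IOException {
        Path tmpDir = Files.createTempDirectory("dedup");
        Path[] parts = new Path[buckets];
        BufferedWriter[] writers = new BufferedWriter[buckets];
        try {
            for (int i = 0; i < buckets; i++) {
                parts[i] = tmpDir.resolve("part-" + i + ".txt");
                writers[i] = Files.newBufferedWriter(parts[i], StandardCharsets.UTF_8);
            }
            // Pass 1: equal lines have equal hash codes, so every copy of a
            // line is written to the same bucket file.
            try (BufferedReader br = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                while ((line = br.readLine()) != null) {
                    BufferedWriter w = writers[Math.floorMod(line.hashCode(), buckets)];
                    w.write(line);
                    w.newLine();
                }
            }
        } finally {
            for (BufferedWriter w : writers) {
                if (w != null) {
                    w.close();
                }
            }
        }

        // Pass 2: only one bucket's distinct lines are held in memory at a time.
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path part : parts) {
                Set<String> seen = new HashSet<>();
                try (BufferedReader br = Files.newBufferedReader(part, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        if (seen.add(line)) {
                            out.write(line);
                            out.newLine();
                        }
                    }
                }
                Files.delete(part);
            }
        }
        Files.delete(tmpDir);
    }

    public static void main(String[] args) throws IOException {
        // 64 buckets is an arbitrary choice for this sketch.
        dedupe(Paths.get("input.txt"), Paths.get("output.txt"), 64);
    }
}
```

Unlike the single-HashSet program, this sketch does not preserve the original line order: the output is grouped by bucket, with first-occurrence order kept only within each bucket. It also needs temporary disk space roughly equal to the size of the input.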

  
