How would you remove duplicate lines from a file that is much too large to fit in memory? The duplicate lines are not necessarily adjacent; assume the file is 10 times bigger than RAM.

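One solution is to use a HashSet to store each line of input.txt. Because a set rejects duplicate values, add each line to the set as it is read and write it to output.txt only if it was not already present. The set holds one copy of each distinct line rather than the whole file, so this works only as long as the distinct lines fit in memory.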

Java:

// Efficient Java program to remove
// duplicates from input.txt and
// save output to output.txt
import java.io.*;
import java.util.HashSet;

public class FileOperation
{
    public static void main(String[] args) throws IOException
    {
        // PrintWriter object for output.txt
        PrintWriter pw = new PrintWriter("output.txt");

        // BufferedReader object for input.txt
        BufferedReader br = new BufferedReader(new FileReader("input.txt"));

        String line = br.readLine();

        // set to store unique values
        HashSet<String> hs = new HashSet<String>();

        // loop over each line of input.txt
        while (line != null)
        {
            // add() returns true only if the line
            // was not already present in the set
            if (hs.add(line))
                pw.println(line);

            line = br.readLine();
        }

        pw.flush();

        // closing resources
        br.close();
        pw.close();

        System.out.println("File operation performed successfully");
    }
}
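
If the distinct lines themselves do not fit in memory, a single HashSet is not enough. One common workaround, sketched below, is to first split the input into several temporary bucket files by line hash, so that identical lines always land in the same bucket, and then deduplicate each bucket on its own with a HashSet. This is only a rough sketch and not part of the original solution: the class name `PartitionedDedup`, the method `dedup`, and the bucket count of 64 are assumptions made for illustration, and the bucket count would need tuning so that each bucket's distinct lines fit in RAM.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical class: two-pass deduplication using hash-partitioned temp files.
public class PartitionedDedup {

    // Pass 1: route every line to one of `buckets` temp files by its hash.
    // Pass 2: deduplicate each bucket file independently and append to output.
    public static void dedup(Path input, Path output, int buckets) throws IOException {
        Path tmpDir = Files.createTempDirectory("dedup");
        Path[] parts = new Path[buckets];
        BufferedWriter[] writers = new BufferedWriter[buckets];
        try {
            for (int i = 0; i < buckets; i++) {
                parts[i] = tmpDir.resolve("part-" + i + ".txt");
                writers[i] = Files.newBufferedWriter(parts[i], StandardCharsets.UTF_8);
            }
            try (BufferedReader br = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                while ((line = br.readLine()) != null) {
                    // Equal lines have equal hash codes, so they share a bucket.
                    int b = Math.floorMod(line.hashCode(), buckets);
                    writers[b].write(line);
                    writers[b].newLine();
                }
            }
        } finally {
            for (BufferedWriter w : writers) {
                if (w != null) {
                    w.close();
                }
            }
        }

        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path part : parts) {
                // Only one bucket's distinct lines are held in memory at a time.
                Set<String> seen = new HashSet<>();
                try (BufferedReader br = Files.newBufferedReader(part, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        if (seen.add(line)) {
                            out.write(line);
                            out.newLine();
                        }
                    }
                }
                Files.delete(part);
            }
        }
        Files.delete(tmpDir);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: PartitionedDedup <input> <output>");
            return;
        }
        // 64 buckets is an arbitrary illustrative choice.
        dedup(Paths.get(args[0]), Paths.get(args[1]), 64);
    }
}
```

Unlike the single-HashSet program, this sketch does not preserve the original line order: the output is grouped by bucket. It also needs roughly as much free disk space as the input for the temporary files.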

  
