How would you remove duplicate lines from a file that is  much too large to fit in memory? The duplicate lines are not necessarily adjacent, and say the file is 10 times bigger than RAM.

A better solution is to use HashSet to store each line of input.txt. As set ignores duplicate values, so while storing a line, check if it already present in hashset. Write it to output.txt only if not present in hashset.

Java:

// Efficient Java program to remove
// duplicates from input.txt and
// save output to output.txt import java.io.*;
import java.util.HashSet; public class FileOperation
{
public static void main(String[] args) throws IOException
{
// PrintWriter object for output.txt
PrintWriter pw = new PrintWriter("output.txt"); // BufferedReader object for input.txt
BufferedReader br = new BufferedReader(new FileReader("input.txt")); String line = br.readLine(); // set store unique values
HashSet<String> hs = new HashSet<String>(); // loop for each line of input.txt
while(line != null)
{
// write only if not
// present in hashset
if(hs.add(line))
pw.println(line); line = br.readLine(); } pw.flush(); // closing resources
br.close();
pw.close(); System.out.println("File operation performed successfully");
}
}

  

How to remove duplicate lines in a large text file?的更多相关文章

  1. notepad++ remove duplicate line

    To remove duplicate lines just press Ctrl + F, select the “Replace” tab and in the “Find” field, pla ...

  2. Compare, sort, and delete duplicate lines in Notepad ++

    Compare, sort, and delete duplicate lines in Notepad ++ Organize Lines: Since version 6.5.2 the app ...

  3. [LeetCode] Remove Duplicate Letters 移除重复字母

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

  4. 316. Remove Duplicate Letters

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

  5. Remove Duplicate Letters I & II

    Remove Duplicate Letters I Given a string which contains only lowercase letters, remove duplicate le ...

  6. LeetCode Remove Duplicate Letters

    原题链接在这里:https://leetcode.com/problems/remove-duplicate-letters/ 题目: Given a string which contains on ...

  7. leetcode@ [316] Remove Duplicate Letters (Stack & Greedy)

    https://leetcode.com/problems/remove-duplicate-letters/ Given a string which contains only lowercase ...

  8. Remove Duplicate Letters

    316. Remove Duplicate Letters Total Accepted: 2367 Total Submissions: 12388 Difficulty: Medium Given ...

  9. [Swift]LeetCode316. 去除重复字母 | Remove Duplicate Letters

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

随机推荐

  1. Launcher类源码分析

    基于上一次获取系统类加载器这块进行分析: 关于这个方法的javadoc在之前已经阅读过了,不过这里再来仔细阅读一下加深印象: 这里有一个非常重要的概念:上下文类加载器: 它的作用非常之大,在后面会详细 ...

  2. Airtest 支持的手机,系统等环境

    据个人经验,Airtest 支持的以下设备会跑的比较666 Android 平台 华为荣耀9青春版 版本:8.0.0 型号:LLD-AL10 评价:自动化运行最6 华为 荣耀10青春版 版本:9.0. ...

  3. CentOS下更改yum源

    centos下下载工具为yum,对应的源在/etc/yum.repos.d/CentOS-Base.repo文件下,修改其URI中前面的网络地址即可

  4. string::crbegin string::crend

    const_reverse_iterator crbegin() const noexcept;功能:crbegin是最后一个字符,crend第一个字符的前一个.迭代器向左移动是“+”,向右移动是“- ...

  5. P1052 过河 题解

    复习dp(迪皮)的时候刷到了一道简单路径压缩的题目(一点不会qwq) 题目描述链接. 正解: 首先呢,我们看到题目,自然而然的会想到这种思路: 设状态变量dp[i]表示从第一个格子开始经过一些跳跃跳到 ...

  6. quartz (从原理到应用)详解篇(转)

    一.Quartz 基本介绍 1.1 Quartz 概述 1.2 Quartz特点 1.3 Quartz 集群配置 二.Quartz 原理及流程 2.1 quartz基本原理 2.2 quartz启动流 ...

  7. Luogu P4109 [HEOI2015]定价 贪心

    思路:找规律?$or$贪心. 提交:1次 题解: 发现:若可以构成$X0000$,答案绝对不会再在数字最后把$0$改成其他数: 若可以构成$XX50...0$更优. 所以左端点增加的步长是增加的($i ...

  8. 实体类,bean文件,pojo文件夹,model文件夹都一样

    实体类,bean文件,pojo文件夹,model文件夹都一样,这些都是编写实体类,这是我暂时看到的项目文件

  9. 【原创】洛谷 LUOGU P3372 【模板】线段树1

    P3372 [模板]线段树 1 题目描述 如题,已知一个数列,你需要进行下面两种操作: 1.将某区间每一个数加上x 2.求出某区间每一个数的和 输入输出格式 输入格式: 第一行包含两个整数N.M,分别 ...

  10. 标准输入输出(C++)

    输入输出流函数(模板) #include<iostream> #include<iomanip> using namespace std; int main() { cout ...