How to remove duplicate lines in a large text file?
How would you remove duplicate lines from a file that is much too large to fit in memory? The duplicate lines are not necessarily adjacent, and say the file is 10 times bigger than RAM.
A better solution is to use HashSet to store each line of input.txt. As set ignores duplicate values, so while storing a line, check if it already present in hashset. Write it to output.txt only if not present in hashset.
Java:
// Efficient Java program to remove
// duplicates from input.txt and
// save output to output.txt import java.io.*;
import java.util.HashSet; public class FileOperation
{
public static void main(String[] args) throws IOException
{
// PrintWriter object for output.txt
PrintWriter pw = new PrintWriter("output.txt"); // BufferedReader object for input.txt
BufferedReader br = new BufferedReader(new FileReader("input.txt")); String line = br.readLine(); // set store unique values
HashSet<String> hs = new HashSet<String>(); // loop for each line of input.txt
while(line != null)
{
// write only if not
// present in hashset
if(hs.add(line))
pw.println(line); line = br.readLine(); } pw.flush(); // closing resources
br.close();
pw.close(); System.out.println("File operation performed successfully");
}
}
How to remove duplicate lines in a large text file?的更多相关文章
- notepad++ remove duplicate line
To remove duplicate lines just press Ctrl + F, select the “Replace” tab and in the “Find” field, pla ...
- Compare, sort, and delete duplicate lines in Notepad ++
Compare, sort, and delete duplicate lines in Notepad ++ Organize Lines: Since version 6.5.2 the app ...
- [LeetCode] Remove Duplicate Letters 移除重复字母
Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...
- 316. Remove Duplicate Letters
Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...
- Remove Duplicate Letters I & II
Remove Duplicate Letters I Given a string which contains only lowercase letters, remove duplicate le ...
- LeetCode Remove Duplicate Letters
原题链接在这里:https://leetcode.com/problems/remove-duplicate-letters/ 题目: Given a string which contains on ...
- leetcode@ [316] Remove Duplicate Letters (Stack & Greedy)
https://leetcode.com/problems/remove-duplicate-letters/ Given a string which contains only lowercase ...
- Remove Duplicate Letters
316. Remove Duplicate Letters Total Accepted: 2367 Total Submissions: 12388 Difficulty: Medium Given ...
- [Swift]LeetCode316. 去除重复字母 | Remove Duplicate Letters
Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...
随机推荐
- linux程序编译过程
大家肯定都知道计算机程序设计语言通常分为机器语言.汇编语言和高级语言三类.高级语言需要通过翻译成机器语言才能执行,而翻译的方式分为两种,一种是编译型,另一种是解释型,因此我们基本上将高级语言分为两大类 ...
- 2018多校第九场 HDU 6416 (DP+前缀和优化)
转自:https://blog.csdn.net/CatDsy/article/details/81876341 #include <bits/stdc++.h> using namesp ...
- vs2015下载
VS2015 专业版下载链接http://download.microsoft.com/download/B/8/9/B898E46E-CBAE-4045-A8E2-2D33DD36F3C4/vs20 ...
- EasyLogging++学习笔记(1)—— 简要介绍
对于有开发经验的程序员来说,记录程序执行日志是一件必不可少的事情.通过查看和分析日志信息,不仅可以有效地帮助我们调试程序,而且当程序正式发布运行之后,更是可以帮助我们快速.准确地定位问题.在现在这个开 ...
- Luogu P2114_[NOI2014]起床困难综合症 贪心
思路:按位贪心. 提交:1次 题解: 可以先处理出对于全$0$串和全$1$串最后每一位的结果.(每一位 从 $0$ $or$ $1$ 变成 $0$ $or$ $1$) 对于每一位,若不能变成$1$,则 ...
- 洛谷P2135 方块消除
洛谷题目链接 动态规划(真毒瘤!) 变量声明: $val[i]$:表示第$i$块颜色 $num[i]$:表示第$i$块颜色数量 $sum[i]$:表示$num$的前缀和 我们设计状态$f[l][r][ ...
- 彩色模型,CIE XYZ,CIE RGB
学习DIP第8天 转载请标明出处:http://blog.csdn.net/tonyshengtan,欢迎大家转载,发现博客被某些论坛转载后,图像无法正常显示,无法正常表达本人观点,对此表示很不满意. ...
- 2017 ZSTU寒假排位赛 #6
题目链接:https://vjudge.net/contest/149212#overview. A题,水题,略过. B题,水题,读清题意即可. C题,数学题,如果把x表示成x=nb+m,则k=n/m ...
- mysql:unknown variable 'default-character-set=utf8'
1.修改my.cnf后,执行 service mysql restart 重启数据库失败 service mysql restart Shutting down MySQL.. SUCCESS! St ...
- HTML容器标签和文本标签
html中的容器级标签和文本级标签,css中的块级元素和行内元素是我们常常拿来比较的四个名词(行内块级暂时先不考虑).注:如果标签嵌套错误,可能会发生浏览器解析错误的情况,只是针对嵌套做的这个. 容器 ...