How to remove duplicate lines in a large text file?
How would you remove duplicate lines from a file that is much too large to fit in memory? The duplicate lines are not necessarily adjacent; assume the file is 10 times bigger than RAM.
One straightforward solution is to use a HashSet to store each line of input.txt. Because a set ignores duplicate values, check whether each line is already present in the set as you read it, and write the line to output.txt only if it is not. Note that the set holds every distinct line in memory, so this approach works only when the unique lines (not the whole file) fit in RAM.
Java:
// Efficient Java program to remove
// duplicates from input.txt and
// save output to output.txt
import java.io.*;
import java.util.HashSet;

public class FileOperation
{
    public static void main(String[] args) throws IOException
    {
        // PrintWriter object for output.txt
        PrintWriter pw = new PrintWriter("output.txt");

        // BufferedReader object for input.txt
        BufferedReader br = new BufferedReader(new FileReader("input.txt"));

        String line = br.readLine();

        // set store unique values
        HashSet<String> hs = new HashSet<String>();

        // loop for each line of input.txt
        while (line != null)
        {
            // write only if not
            // present in hashset
            if (hs.add(line))
                pw.println(line);

            line = br.readLine();
        }

        pw.flush();

        // closing resources
        br.close();
        pw.close();

        System.out.println("File operation performed successfully");
    }
}
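When even the distinct lines do not fit in memory, one common workaround is to partition the file first. The following is a minimal sketch of that idea, not the original author's method: a first pass scatters lines into temporary bucket files by hash, so identical lines always land in the same bucket, and a second pass applies the same HashSet check to one bucket at a time. The class name PartitionedDedup, the method dedup, the bucket file names and the bucket count of 16 are all assumptions made for illustration. It also assumes each bucket's distinct lines fit in memory, and it does not preserve the original line order across buckets.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; the names here are illustrative only.
class PartitionedDedup {

    // Removes duplicate lines from input and writes the unique lines to output,
    // holding the distinct lines of only one bucket in memory at a time.
    static void dedup(Path input, Path output, int buckets) throws IOException {
        Path tmpDir = Files.createTempDirectory("dedup");
        BufferedWriter[] writers = new BufferedWriter[buckets];

        // Pass 1: scatter lines into bucket files by hash.
        // Equal lines have equal hash codes, so they share a bucket.
        try {
            for (int i = 0; i < buckets; i++) {
                writers[i] = Files.newBufferedWriter(
                        tmpDir.resolve("bucket" + i), StandardCharsets.UTF_8);
            }
            try (BufferedReader br =
                         Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                while ((line = br.readLine()) != null) {
                    // floorMod keeps the index non-negative for negative hashes
                    int b = Math.floorMod(line.hashCode(), buckets);
                    writers[b].write(line);
                    writers[b].newLine();
                }
            }
        } finally {
            for (BufferedWriter w : writers) {
                if (w != null) {
                    w.close();
                }
            }
        }

        // Pass 2: deduplicate each bucket with a fresh in-memory set.
        try (BufferedWriter out =
                     Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (int i = 0; i < buckets; i++) {
                Path bucket = tmpDir.resolve("bucket" + i);
                Set<String> seen = new HashSet<>();
                try (BufferedReader br =
                             Files.newBufferedReader(bucket, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        // add() returns true only for a line not seen before
                        if (seen.add(line)) {
                            out.write(line);
                            out.newLine();
                        }
                    }
                }
                Files.delete(bucket);
            }
        }
        Files.delete(tmpDir);
    }

    public static void main(String[] args) throws IOException {
        Path input = Paths.get("input.txt");
        if (!Files.exists(input)) {
            System.err.println("input.txt not found");
            return;
        }
        // 16 buckets is an arbitrary choice; pick enough that one bucket fits in RAM.
        dedup(input, Paths.get("output.txt"), 16);
    }
}
```

The bucket count trades memory for open file handles: more buckets mean smaller sets in the second pass but more temporary files open during the first. If the original order matters, a different technique (for example an external sort that carries line numbers) would be needed.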