How to remove duplicate lines in a large text file?
How would you remove duplicate lines from a file that is much too large to fit in memory? The duplicate lines are not necessarily adjacent, and say the file is 10 times bigger than RAM.
A better solution is to use HashSet to store each line of input.txt. As set ignores duplicate values, so while storing a line, check if it already present in hashset. Write it to output.txt only if not present in hashset.
Java:
// Efficient Java program to remove
// duplicates from input.txt and
// save output to output.txt import java.io.*;
import java.util.HashSet; public class FileOperation
{
public static void main(String[] args) throws IOException
{
// PrintWriter object for output.txt
PrintWriter pw = new PrintWriter("output.txt"); // BufferedReader object for input.txt
BufferedReader br = new BufferedReader(new FileReader("input.txt")); String line = br.readLine(); // set store unique values
HashSet<String> hs = new HashSet<String>(); // loop for each line of input.txt
while(line != null)
{
// write only if not
// present in hashset
if(hs.add(line))
pw.println(line); line = br.readLine(); } pw.flush(); // closing resources
br.close();
pw.close(); System.out.println("File operation performed successfully");
}
}
How to remove duplicate lines in a large text file?的更多相关文章
- notepad++ remove duplicate line
To remove duplicate lines just press Ctrl + F, select the “Replace” tab and in the “Find” field, pla ...
- Compare, sort, and delete duplicate lines in Notepad ++
Compare, sort, and delete duplicate lines in Notepad ++ Organize Lines: Since version 6.5.2 the app ...
- [LeetCode] Remove Duplicate Letters 移除重复字母
Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...
- 316. Remove Duplicate Letters
Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...
- Remove Duplicate Letters I & II
Remove Duplicate Letters I Given a string which contains only lowercase letters, remove duplicate le ...
- LeetCode Remove Duplicate Letters
原题链接在这里:https://leetcode.com/problems/remove-duplicate-letters/ 题目: Given a string which contains on ...
- leetcode@ [316] Remove Duplicate Letters (Stack & Greedy)
https://leetcode.com/problems/remove-duplicate-letters/ Given a string which contains only lowercase ...
- Remove Duplicate Letters
316. Remove Duplicate Letters Total Accepted: 2367 Total Submissions: 12388 Difficulty: Medium Given ...
- [Swift]LeetCode316. 去除重复字母 | Remove Duplicate Letters
Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...
随机推荐
- kotlin函数加强
在之前已经接触过了kotlin的函数了,这里再次加强学习下它,下面开始吧! Kotlin函数编写规则: 对照函数来理解其写法: 演练巩固: ①.编写函数, 实现四则运算. 接着来实现其它三个运算: 然 ...
- C3的坑之inline-block
最近开始复习css一直在踩坑,今天分享一个inline-block 关于inline-block可能很多人都不熟悉,布局这方面很多人用的都是flex或者浮动,flex很强大毋庸置疑的可是关于兼容性就不 ...
- [MySQL优化] -- 如何使用SQL Profiler 性能分析器
mysql 的 sql 性能分析器主要用途是显示 sql 执行的整个过程中各项资源的使用情况.分析器可以更好的展示出不良 SQL 的性能问题所在. 下面我们举例介绍一下 MySQL SQL Profi ...
- cookie的使用和设置
cookie就是服务端通过浏览器端的存储机制,把一些会话相关数据存储在浏览器中.优点:分担服务端的压力,提高了效率,缺点:不安全 生成和请求原理 cookie的生命周期设定以后,哪怕是关闭浏览器,那么 ...
- @Transactional(转)
概述@Transactional 是声明式事务管理 编程中使用的注解 添加位置 接口实现类或接口实现方法上,而不是接口类中访问权限:public 的方法才起作用 @Transactional 注解应该 ...
- 【转】pe结构详解
(一)基本概念 PE(Portable Execute)文件是Windows下可执行文件的总称,常见的有DLL,EXE,OCX,SYS等, 事实上,一个文件是否是PE文件与其扩展名无关,PE文件可以是 ...
- mysql查询字段中含有中文
查询mysql数据库中字段中含有中文使用正则表达式: 例如: select create_time,nickname from eb_engineer where not(nickname regex ...
- 栈的数组和链表实现(Java实现)
我以前用JavaScript写过栈和队列,这里初学Java,于是想来实现栈,基于数组和链表. 下面上代码: import java.io.*; //用接口来存放需要的所有操作 interface st ...
- JS各循环的差别
1.最普通的for循环: for(var i=0;i<arr.length;i++){ } 特点:只能针对数组循环,不能引用于非数组对象 2.for(var i in obj){ } 特点:用于 ...
- HTTP之缓存技术
1. 缓存简介 缓存是位于服务器和客户端的中间单元,主要根据用户代理发送过来的请求,向服务器请求相关内容后提供给用户,并保存内容副本,例如 HTML 页面.图片.文本文件或者流媒体文件.然后,当下一个 ...