Given a list of directory info including directory path, and all the files with contents in this directory, you need to find out all the groups of duplicate files in the file system in terms of their paths.

A group of duplicate files consists of at least two files that have exactly the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt with content f1_content, f2_content ... fn_content, respectively) in directory root/d1/d2/.../dm. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of group of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:

"directory_path/file_name.txt"

Example 1:

Input:

["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]

Output:

[["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

 import java.util.*;
 import java.util.stream.Collectors;

 class Solution {

     public static List<List<String>> findDuplicate(String[] paths) {
         // Maps file content to the full paths of all files with that content.
         Map<String, List<String>> map = new HashMap<>();

         for (String path : paths) {
             // tokens[0] is the directory; every other token is "name(content)".
             String[] tokens = path.split(" ");

             for (int i = 1; i < tokens.length; i++) {
                 int open = tokens[i].indexOf('(');
                 String file = tokens[i].substring(0, open);
                 String content = tokens[i].substring(open + 1, tokens[i].indexOf(')'));

                 map.putIfAbsent(content, new ArrayList<>());
                 map.get(content).add(tokens[0] + "/" + file);
             }
         }

         // Keep only contents shared by at least two files.
         return map.values().stream().filter(e -> e.size() > 1).collect(Collectors.toList());
     }

 }

Follow Up questions:

Imagine you are given a real file system. How will you search the files, DFS or BFS?

The answer depends on the tree structure. If the branching factor (n) and depth (d) are high, then BFS will take up a lot of memory, O(n^d), because its queue can hold an entire level of the tree. For DFS, the space complexity is generally proportional to the height of the tree, O(d).

If the file content is very large (GB level), how will you modify your solution?
If you can only read the file by 1kb each time, how will you modify your solution?
What is the time complexity of your modified solution? What is the most time consuming part and memory consuming part of it? How to optimize?
How to make sure the duplicated files you find are not false positive?

Question 1:

core idea: DFS

Reason: as long as the directory tree is not too deep, DFS is a better fit than BFS because it needs less memory.
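
A minimal sketch of such a DFS over a real file system, assuming the standard java.nio.file API. The names FileWalker and listFilesDfs are made up for illustration. It uses an explicit stack instead of recursion and does not follow symbolic links, so a link cycle cannot trap the traversal.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class FileWalker {
    // Hypothetical helper: returns every regular file under root (root must be a directory).
    static List<Path> listFilesDfs(Path root) throws IOException {
        List<Path> files = new ArrayList<>();
        Deque<Path> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Path dir = stack.pop();
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                        stack.push(entry);   // visit this subdirectory next
                    } else if (Files.isRegularFile(entry, LinkOption.NOFOLLOW_LINKS)) {
                        files.add(entry);
                    }
                }
            }
        }
        return files;
    }
}
```

Swapping the stack for a queue (pollFirst / addLast) would turn the same loop into BFS, which makes the memory trade-off easy to compare.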

Question-2:

If the file content is very large (GB level), how will you modify your solution?

Answer:

core idea: make use of metadata, such as file size, before actually reading the large content.

Two steps:

1. DFS to map each size to the set of paths that have that size: Map<Long, Set<String>> (sizes at the GB level do not fit in an Integer).
2. For each size, if there are 2 or more files there, compute the MD5 hash of every file. If any files with the same size have the same hash, treat them as identical files: Map<String, Set<String>>, mapping each hash to the set of file paths + filenames. These hash values are very large (an MD5 digest is 128 bits), so we use the Java library class BigInteger to hold them.

To optimize Step 2: GFS stores a large file in multiple "chunks" (one chunk is 64 MB). We have metadata, including the file size, file name and index of the different chunks, along with each chunk's checkSum (the xor of the content). For Step 2, we just compare each file's checkSums.

Disadvantage: there might be false positive duplicates, because two different files might share the same checkSum.
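
The two steps can be sketched roughly as follows. This is an illustration under assumptions, not a reference implementation: the class name SizeThenHash and its method names are invented here, the list of files is assumed to come from a prior DFS, and MD5 with BigInteger is used only because the answer above mentions them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SizeThenHash {
    static List<List<Path>> findDuplicates(List<Path> files)
            throws IOException, NoSuchAlgorithmException {
        // Step 1: group by size, which only touches metadata.
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path f : files) {
            bySize.computeIfAbsent(Files.size(f), k -> new ArrayList<>()).add(f);
        }
        // Step 2: hash only the files that share their size with another file.
        List<List<Path>> result = new ArrayList<>();
        for (List<Path> sameSize : bySize.values()) {
            if (sameSize.size() < 2) continue;
            Map<String, List<Path>> byHash = new HashMap<>();
            for (Path f : sameSize) {
                byHash.computeIfAbsent(md5Hex(f), k -> new ArrayList<>()).add(f);
            }
            for (List<Path> group : byHash.values()) {
                if (group.size() > 1) result.add(group);
            }
        }
        return result;
    }

    // Streams the file through MD5 so the whole content is never held in memory.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return new BigInteger(1, md.digest()).toString(16);
    }
}
```

Files with a unique size are never opened at all, which is where the saving comes from.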

Question-3:

If you can only read the file by 1kb each time, how will you modify your solution?

Answer:

The makeHashQuick function is quick but memory hungry. It will likely need to run with java -Xmx2G or a similar option to increase the heap space, if RAM is available.
We might need to tune the size defined by "buffSize" to make it memory efficient.
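
The code containing makeHashQuick is not shown in this post. As a stand-in, here is a minimal sketch of hashing a file while reading only buffSize bytes at a time, for example 1024 for the 1 KB constraint. The names ChunkedHasher and hashInChunks, and the choice of SHA-256, are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ChunkedHasher {
    // Feeds the file to the digest in pieces of at most buffSize bytes,
    // so memory use stays constant regardless of the file size.
    static byte[] hashInChunks(Path file, int buffSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[buffSize];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);   // only the n bytes actually read
            }
        }
        return md.digest();
    }
}
```

The resulting digest does not depend on buffSize, so the buffer can be tuned purely for speed versus memory.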

Question-4:

What is the time complexity of your modified solution? What is the most time-consuming part and memory consuming part of it? How to optimize?

Answer:

The hashing part is the most time-consuming and memory-consuming.
Optimize as mentioned above, though this also introduces the false positive issue.

Question-5:

How to make sure the duplicated files you find are not false positive?

Answer:

Question-2-Answer-1 will avoid it.
We need to compare the content chunk by chunk when we find two "duplicates" using checkSum.
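
A minimal sketch of that final chunk-by-chunk comparison, assuming Java 9 or later (for InputStream.readNBytes and the ranged Arrays.equals). The names ContentComparer and sameContent are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class ContentComparer {
    // Returns true only if both files are byte-for-byte identical.
    static boolean sameContent(Path a, Path b, int chunkSize) throws IOException {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        if (Files.size(a) != Files.size(b)) return false;   // cheap metadata check first
        byte[] bufA = new byte[chunkSize];
        byte[] bufB = new byte[chunkSize];
        try (InputStream inA = Files.newInputStream(a);
             InputStream inB = Files.newInputStream(b)) {
            while (true) {
                // readNBytes fills the buffer unless end of file is reached.
                int nA = inA.readNBytes(bufA, 0, chunkSize);
                int nB = inB.readNBytes(bufB, 0, chunkSize);
                if (nA != nB || !Arrays.equals(bufA, 0, nA, bufB, 0, nB)) return false;
                if (nA < chunkSize) return true;             // both streams are exhausted
            }
        }
    }
}
```

Because it stops at the first differing chunk, this check is only expensive for files that really are identical.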

In preparing for my Dropbox interview, I came across this problem and really wanted to find the ideas behind the follow-up questions (as these were the questions that the interviewer was most interested in, not the code itself). Since this is the only post with the follow-up discussion, I'll comment here! @yujun gave a great solution above and I just wanted to add a bit more to help future interviewees.

Finding duplicate files when the input is a String array is quite easy. Loop through each String and keep a HashMap from String to a Set/Collection of Strings, mapping the contents of each file to the set of paths with the filename concatenated.

For me, instead of being given a list of paths, I was given a Directory and asked to return a List of Lists of duplicate files for everything under it. I chose to represent a Directory like:

class Directory{

     List<Directory> subDirectories;

     List<File> files;

}

Given a directory, you are asked how you can find duplicate files when the files are very large. The idea here is that you cannot hold the contents in memory, so the file contents have to stay on disk. Instead, you can hash each file's content and store the hash as a metadata field for each file. Then, as you perform your search, keep the hash in memory instead of the file's contents. So the idea is that you do a DFS from the root directory and build a HashMap<String, Set<String>> mapping each hash to the Set of filepaths + filenames that correspond to that hash's content.

(Note: You can choose BFS / DFS to traverse the Path. I chose DFS as it is more memory efficient and quicker to code up.)

Follow Up: This is great, but it requires you to compute the hash for every single file once, which can be expensive for large files. Is there any way you can avoid computing the hash for a file?

One approach is to also maintain a metadata field for each file's size on disk. Then you can take a 2 pass approach:

1. DFS to map each size to the set of paths that have that size.
2. For each size, if there are 2 or more files there, compute the hash of every file. If any files with the same size have the same hash, then they are identical files.

This way, you only compute hashes if you have multiple files with the same size. So when you do the DFS, you can create a HashMap<Long, Set<String>>, mapping each file size to the set of file paths that have that size. Loop through each String in each set, get its hash and check whether it already exists in your set of seen hashes. If so, add the path to your List<String> res; otherwise add the hash to the set. In between keys (when switching file sizes), you can add your res to your List<List<String>>.

Just want to share my humble opinions for discussion:
If anyone has a better solution, I would appreciate it if you'd like to correct and enlighten me:-)
Question 2:
In a real-world file system, we usually store a large file in multiple "chunks" (in GFS, one chunk is 64 MB), so we have metadata recording the file size, file name and index of the different chunks, along with each chunk's checkSum (the xor of the content).
So when we upload a file, we record the metadata as mentioned above.
When we need to check for duplicates, we could simply check the metadata:
1. Check if the files are of the same size;
2. If step 1 passes, compare the first chunk's checkSum;
3. If step 2 passes, check the second checkSum;
...
and so on.
There might be false positive duplicates, because two different files might share the same checkSum.
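
The metadata check above might look like the following sketch. Everything here is hypothetical: FileMeta is an invented class standing in for whatever metadata the file system records, and java.util.zip.CRC32 is used as the per-chunk checksum purely for illustration (the comment above describes an xor-based checksum).

```java
import java.util.zip.CRC32;

class FileMeta {
    final long size;
    final long[] chunkChecksums;   // one checksum per chunk, in order

    FileMeta(long size, long[] chunkChecksums) {
        this.size = size;
        this.chunkChecksums = chunkChecksums;
    }

    // Demo-only builder: a real system would record this at upload time.
    static FileMeta of(byte[] content, int chunkSize) {
        int chunks = (content.length + chunkSize - 1) / chunkSize;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int from = i * chunkSize;
            CRC32 crc = new CRC32();
            crc.update(content, from, Math.min(chunkSize, content.length - from));
            sums[i] = crc.getValue();
        }
        return new FileMeta(content.length, sums);
    }

    // True means "possible duplicate": matching checksums can still be a false positive.
    static boolean maybeDuplicate(FileMeta a, FileMeta b) {
        if (a.size != b.size) return false;                      // step 1
        if (a.chunkChecksums.length != b.chunkChecksums.length) return false;
        for (int i = 0; i < a.chunkChecksums.length; i++) {      // steps 2, 3, ...
            if (a.chunkChecksums[i] != b.chunkChecksums[i]) return false;
        }
        return true;
    }
}
```

A false result is definitive, while a true result only nominates the pair for a full content comparison.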

Question 3:
In the way mentioned above, we could read the meta data instead of the entire file, and compare the information KB by KB.

Question 5:
Using checkSum, we could quickly and accurately find out the non-duplicated files. But to totally avoid getting the false positive, we need to compare the content chunk by chunk when we find two "duplicates" using checkSum.
