609. Find Duplicate File in System
Given a list of directory info including directory path, and all the files with contents in this directory, you need to find out all the groups of duplicate files in the file system in terms of their paths.
A group of duplicate files consists of at least two files that have exactly the same content.
A single directory info string in the input list has the following format:
"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"
It means there are n files (f1.txt, f2.txt ... fn.txt with content f1_content, f2_content ... fn_content, respectively) in directory root/d1/d2/.../dm. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.
The output is a list of groups of duplicate file paths. Each group contains the paths of all files that have the same content. A file path is a string that has the following format:
"directory_path/file_name.txt"
Example 1:
Input:
["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
Output:
[["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Note:
- No order is required for the final output.
- You may assume the directory name, file name and file content contain only letters and digits, and the length of file content is in the range of [1,50].
- The number of files given is in the range of [1,20000].
- You may assume no files or directories share the same name in the same directory.
- You may assume each given directory info represents a unique directory. Directory path and file info are separated by a single blank space.
Follow-up beyond contest:
- Imagine you are given a real file system, how will you search files? DFS or BFS?
- If the file content is very large (GB level), how will you modify your solution?
- If you can only read the file by 1kb each time, how will you modify your solution?
- What is the time complexity of your modified solution? What is the most time-consuming part and memory consuming part of it? How to optimize?
- How to make sure the duplicated files you find are not false positive?
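One possible way to approach these follow-up questions on a real file system is sketched below. It is not part of the contest solution, and every name in it (`CHUNK`, `chunked_digest`, `same_bytes`, `find_duplicates_on_disk`) is hypothetical. The sketch assumes that all files under `root` are readable regular files. It buckets files by size first, hashes only same-size files by reading 1 KB at a time, and finally compares candidates byte by byte so that a hash collision cannot produce a false positive.

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 1024  # bytes per read, mirroring the "1kb each time" constraint


def chunked_digest(path):
    """Hash a file by feeding it to SHA-256 one CHUNK at a time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()


def same_bytes(a, b):
    """Compare two files chunk by chunk; True only if every byte matches."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ca, cb = fa.read(CHUNK), fb.read(CHUNK)
            if ca != cb:
                return False
            if not ca:  # both files ended at the same point
                return True


def find_duplicates_on_disk(root):
    # Step 1: walk the tree and bucket paths by file size.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        # Step 2: hash only the files that share a size.
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[chunked_digest(path)].append(path)
        # Step 3: split each hash bucket into classes of byte-identical files.
        for candidates in by_hash.values():
            classes = []
            for path in candidates:
                for cls in classes:
                    if same_bytes(cls[0], path):
                        cls.append(path)
                        break
                else:
                    classes.append([path])
            groups.extend(sorted(c) for c in classes if len(c) > 1)
    return groups
```

In this sketch, memory holds only paths, sizes and digests rather than file contents. Reading file data for hashing and for the final comparison is likely the most time-consuming part, which is why the cheap size check runs first.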
Approach #1: C++.
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    vector<vector<string>> findDuplicate(vector<string>& paths) {
        // Map each file content to the full paths of the files holding it.
        unordered_map<string, vector<string>> groups;
        for (const string& info : paths) {
            // The directory path is everything before the first space.
            size_t found = info.find(' ');
            const string dir = info.substr(0, found);
            while (found != string::npos) {
                size_t last = found;
                found = info.find(' ', last + 1);
                // One "name(content)" token; a length of npos means "to the end".
                size_t len = (found == string::npos) ? string::npos : found - last - 1;
                const string token = info.substr(last + 1, len);
                size_t open = token.find('(');
                size_t close = token.find(')');
                groups[token.substr(open + 1, close - open - 1)]
                    .push_back(dir + "/" + token.substr(0, open));
            }
        }
        // Keep only the contents shared by at least two files.
        vector<vector<string>> ans;
        for (auto& entry : groups) {
            if (entry.second.size() > 1)
                ans.push_back(entry.second);
        }
        return ans;
    }
};
Approach #2: Java.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        // Map each file content to the full paths of the files holding it.
        Map<String, List<String>> map = new HashMap<>();
        for (String path : paths) {
            // values[0] is the directory; the rest are "name(content)" tokens.
            String[] values = path.split(" ");
            for (int i = 1; i < values.length; ++i) {
                String[] nameCont = values[i].split("\\(");
                String content = nameCont[1].replace(")", "");
                map.computeIfAbsent(content, k -> new ArrayList<>())
                   .add(values[0] + "/" + nameCont[0]);
            }
        }
        // Keep only the contents shared by at least two files.
        List<List<String>> res = new ArrayList<>();
        for (List<String> group : map.values()) {
            if (group.size() > 1)
                res.add(group);
        }
        return res;
    }
}
Approach #3: Python.
import collections

class Solution(object):
    def findDuplicate(self, paths):
        """
        :type paths: List[str]
        :rtype: List[List[str]]
        """
        # Map each file content to the full paths of the files holding it.
        M = collections.defaultdict(list)
        for line in paths:
            data = line.split()
            root = data[0]
            for file in data[1:]:
                # "name(content)" splits into name, "(", and "content)".
                name, _, content = file.partition('(')
                M[content[:-1]].append(root + '/' + name)
        return [x for x in M.values() if len(x) > 1]
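Analysis:
This problem comes down to splitting and recombining strings. Split each directory info string into the directory path and its "name(content)" tokens, split each token into the file name and the content, then join the directory and the file name into a full path and group the paths by content. If you are familiar with these string operations, the problem is easy to solve.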
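The split-and-combine step can be traced on a single directory info string. The helper below is only an illustrative sketch, and its name `parse_directory_info` is hypothetical. It assumes the input follows the format in the problem statement: tokens separated by single spaces, and content made of letters and digits only.

```python
def parse_directory_info(info):
    """Split one directory info string into (full_path, content) pairs."""
    # The first token is the directory; the rest are "name(content)" tokens.
    directory, *tokens = info.split(" ")
    pairs = []
    for token in tokens:
        # "1.txt(abcd)" -> name "1.txt", rest "abcd)"
        name, _, rest = token.partition("(")
        # Drop the trailing ")" and join directory and name with "/".
        pairs.append((directory + "/" + name, rest[:-1]))
    return pairs


print(parse_directory_info("root/a 1.txt(abcd) 2.txt(efgh)"))
# [('root/a/1.txt', 'abcd'), ('root/a/2.txt', 'efgh')]
```

Grouping these pairs by their second element in a hash map gives the duplicate groups that all three approaches build.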
C++ -----> string::assign
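| Overload | Signature |
|---|---|
| string (1) | string& assign (const string& str); |
| substring (2) | string& assign (const string& str, size_t subpos, size_t sublen); |
| c-string (3) | string& assign (const char* s); |
| buffer (4) | string& assign (const char* s, size_t n); |
| fill (5) | string& assign (size_t n, char c); |
| range (6) | template <class InputIterator> string& assign (InputIterator first, InputIterator last); |
| initializer list (7) | string& assign (initializer_list<char> il); |
| move (8) | string& assign (string&& str) noexcept; |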
Assigns a new value to the string, replacing its current contents.
- (1) string: Copies str.
- (2) substring: Copies the portion of str that begins at the character position subpos and spans sublen characters (or until the end of str, if either str is too short or if sublen is string::npos).
- (3) c-string: Copies the null-terminated character sequence (C-string) pointed by s.
- (4) buffer: Copies the first n characters from the array of characters pointed by s.
- (5) fill: Replaces the current value by n consecutive copies of character c.
- (6) range: Copies the sequence of characters in the range [first,last), in the same order.
- (7) initializer list: Copies each of the characters in il, in the same order.
- (8) move: Acquires the contents of str. str is left in an unspecified but valid state.
// string::assign
#include <iostream>
#include <string>

int main ()
{
  std::string str;
  std::string base="The quick brown fox jumps over a lazy dog.";
  // used in the same order as described above:
  str.assign(base);
  std::cout << str << '\n';         // the whole base string
  str.assign(base,10,9);
  std::cout << str << '\n';         // "brown fox"
  str.assign("pangrams are cool",7);
  std::cout << str << '\n';         // "pangram"
  str.assign("c-string");
  std::cout << str << '\n';         // "c-string"
  str.assign(10,'*');
  std::cout << str << '\n';         // "**********"
  // 0x2D is '-'; the C++98 form str.assign<int>(10,0x2D) does not compile since C++11
  str.assign(10,static_cast<char>(0x2D));
  std::cout << str << '\n';         // "----------"
  str.assign(base.begin()+16,base.end()-12);
  std::cout << str << '\n';         // "fox jumps over"
  return 0;
}
C++ -----> string::substr
string substr (size_t pos = 0, size_t len = npos) const;
Returns a newly constructed string object with its value initialized to a copy of a substring of this object.
The substring is the portion of the object that starts at character position pos and spans len characters (or until the end of the string, whichever comes first).
Parameters
- pos: Position of the first character to be copied as a substring. If this is equal to the string length, the function returns an empty string. If this is greater than the string length, it throws out_of_range. Note: The first character is denoted by a value of 0 (not 1).
- len: Number of characters to include in the substring (if the string is shorter, as many characters as possible are used). A value of string::npos indicates all characters until the end of the string.
size_t is an unsigned integral type (the same as member type string::size_type).