Find and delete duplicate files
Purpose: search one or more specified directories (and their subdirectories) for duplicate files, list them in groups, and delete the redundant copies either by manual selection or automatically at random, keeping exactly one file per group. (File names containing spaces, e.g. "file name", are supported.)
Implementation: find walks the specified directories to collect every file, an MD5 checksum is computed for each one, and files with identical MD5 values are grouped and handled as duplicates.
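As a rough illustration of that idea (this is not the script below; /some/dir is only a placeholder and GNU uniq is assumed for -w/--all-repeated), the duplicate groups can already be surfaced with a short pipeline:
#Checksum every file, sort so identical checksums sit next to each other, then
#keep only lines whose first 32 characters (the MD5 digest) repeat.
find /some/dir -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate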
Shortcomings: walking the directory tree with find takes time;
MD5-checksumming large files is slow;
checksumming and comparing every file is expensive (a first pass that compares file sizes could prune the candidates, which pays off noticeably for directories holding many large files; this script does not use it, but a sketch follows right after this list).
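A rough sketch of that size-based first pass (not part of this script; GNU find and xargs are assumed, and /some/dir plus the /tmp paths are placeholders): only files whose size occurs more than once are handed to md5sum.
#List every file as "size<TAB>path".
find /some/dir -type f -printf '%s\t%p\n' | sort -n > /tmp/sizes
#Sizes that occur at least twice are the only candidates for duplicates.
awk -F'\t' '{count[$1]++} END{for (s in count) if (count[s] > 1) print s}' /tmp/sizes > /tmp/dup_sizes
#Checksum only the files that share their size with at least one other file.
awk -F'\t' 'NR==FNR{dup[$1]=1; next} dup[$1]{print $2}' /tmp/dup_sizes /tmp/sizes | xargs -d '\n' md5sum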
Demo:

Notes:
While it runs, the script prints the MD5 checksumming progress; when it finishes, the statistics are reported as follows:
Files: total number of files checksummed
Groups: number of duplicate-file groups
Size: the total size of the redundant copies, i.e. the duplicate files that are about to be deleted; in other words, the disk space that will be reclaimed once the duplicates are removed. (A small worked example follows below.)
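For example (the numbers are made up, only to show how the Size line is derived): a group of 3 identical 10 MiB files keeps one copy, so two copies are redundant.
copies=3; file_size=10485760                       #3 identical files of 10 MiB each
redundant=$(( (copies - 1) * file_size ))          #20971520 bytes would be reclaimed
echo "${redundant}B $((redundant/1024))K $((redundant/1024/1024))M"   #prints: 20971520B 20480K 20M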
At the "Show detailed information ?" prompt, press "y" to review the duplicate-file groups for confirmation, or skip straight to the menu that chooses how files are deleted:
There are two deletion modes. The first is manual selection (the default): the duplicate groups are listed one at a time, you pick the file to keep and the rest of the group is deleted; if nothing is chosen, the first file in the list is kept by default. Demo:

The other is automatic selection: the first file of every group is kept and the remaining duplicates are deleted automatically. (To avoid deleting important files, the manual mode is recommended.) Demo:

File names containing spaces are supported as well. Demo:

Code:
#!/bin/bash
#Author: LingYi
#Date:
#Func: Delete duplicate files
#EG  : $ ./<script.sh> [ DIR1 DIR2 ... DIRn ]
#Define the temporary log file; make sure you have write permission for it.
md5sum_result_log="/tmp/$(date +%Y%m%d%H%M%S)"
echo -e "\033[1;31mMd5suming ...\033[0m"
#Checksum every regular file under the directories passed on the command line.
find "$@" -type f | xargs -I {} md5sum {} | tee -a $md5sum_result_log
files_sum=$(cat $md5sum_result_log | wc -l)
# Define an associative array, using the md5 value as index and the file names as element.
# It must be declared in advance so that bash treats it as an associative array (requires bash 4+).
declare -A md5sum_value_arry
while read md5sum_value md5sum_filename
do
#To support spaces in file names, '+' is used as the separator between the names of one group.
#So if '+' itself appears in a file name there will be problems; in that case the user should choose the manual mode to delete redundant files.
md5sum_value_arry[$md5sum_value]="${md5sum_value_arry[$md5sum_value]}+$md5sum_filename"
(( _${md5sum_value}+=1 ))    #count how many files share this MD5 value in a dynamically named variable
done <$md5sum_result_log
# Count the duplicate file groups and the total size of the redundant files in this loop.
groups_sum=0
repfiles_size=0
for md5sum_value_index in ${!md5sum_value_arry[@]}
do
#Only MD5 values shared by more than one file form a duplicate group.
if [[ $(eval echo \$_$md5sum_value_index) -gt 1 ]]; then
let groups_sum++
need_print_indexes="$need_print_indexes $md5sum_value_index"
eval repfile_sum=\$\(\( \$_$md5sum_value_index - 1 \)\)    #redundant copies = total copies - 1
repfile_size=$( ls -lS "`echo ${md5sum_value_arry[$md5sum_value_index]}|awk -F'+' '{print $2}'`" | awk '{print $5}')
repfiles_size=$(( repfiles_size + repfile_sum*repfile_size ))
fi
done
#Output the statistical information.
echo -e "\033[1;31mFiles: $files_sum Groups: $groups_sum \
Size: ${repfiles_size}B $((repfiles_size/1024))K $((repfiles_size/1024/1024))M\033[0m"
[[ $groups_sum -eq 0 ]] && exit
#The user chooses whether to check the file grouping or not.
#The read arguments are assumed values: one character, silent, 30-second timeout.
read -n 1 -s -t 30 -p 'Show detailed information ?' user_ch
[[ $user_ch == 'n' ]] && echo || {
[[ $user_ch == 'q' ]] && exit
for print_value_index in $need_print_indexes
do
echo -ne "\n\033[1;35m$((++i)) \033[0m"
eval echo -ne "\\\033[1\;34m$print_value_index [ \$_${print_value_index} ]:\\\033[0m"
echo ${md5sum_value_arry[$print_value_index]} | tr '+' '\n'
done | more
}
#The user chooses the way of deleting files here.
echo -e "\n\nManual Selection by default !"
echo -e " 1 Manual selection\n 2 Random selection"
echo -ne "\033[1;31m"
read -t 30 USER_CH    #the 30-second timeout is an assumed value
echo -ne "\033[0m"
[[ $USER_CH == 'q' ]] && exit
[[ $USER_CH -ne 2 ]] && USER_CH=1 || {
echo -ne "\033[31mWARNING: you have chosen the Random Selection mode, files will be deleted at random !\nAre you sure ?\033[0m"
read -t 30 yn
[[ $yn == 'q' ]] && exit
#Fall back to the manual mode unless the user explicitly confirms with 'y'.
[[ $yn != 'y' ]] && USER_CH=1
}
#Handle files according to the user's selection
echo -e "\033[31m\nWarn: keep the first file by default.\033[0m"
for exec_value_index in $need_print_indexes
do
#Build the array of files in the current duplicate group; the '+'-separated file names start at awk field 2.
file_choices_arry=()    #reset so entries from a previous, larger group do not linger
for ((i=0,j=2; i<$(echo ${md5sum_value_arry[$exec_value_index]} | grep -o '+' | wc -l); i++,j++))
do
file_choices_arry[i]="$(echo ${md5sum_value_arry[$exec_value_index]} | awk -F'+' -v J=$j '{print $J}')"
done
eval file_sum=\$_$exec_value_index
if [[ $USER_CH -eq 1 ]]; then
#If the user selects a manual mode, handle the duplicate file group one by one in a loop.
echo -e "\033[1;34m$exec_value_index\033[0m"
for ((j=0; j<${#file_choices_arry[@]}; j++))
do
echo "[ $j ] ${file_choices_arry[j]}"
done
read -p "Number of the file you want to keep: " num_ch
[[ $num_ch == 'q' ]] && exit
#Keep the first file when the answer is empty or out of range (valid numbers are 0 .. number_of_files-1).
[[ -z $num_ch || $num_ch -gt $((${#file_choices_arry[@]}-1)) ]] && num_ch=0
else
#Automatic deletion mode: keep the first file of each group.
num_ch=0
fi
#In both modes, delete every file in the group except the one chosen to be kept.
for ((n=0; n<${#file_choices_arry[@]}; n++))
do
[[ $n -ne $num_ch ]] && {
echo -ne "\033[1mDeleting file \" ${file_choices_arry[n]} \" ... \033[0m"
rm -f "${file_choices_arry[n]}"
[[ $? -eq 0 ]] && echo -e "\033[1;32mOK" || echo -e "\033[1;31mFAIL"
echo -ne "\033[0m"
}
done
done
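A possible way to run it (the script name delete_dup.sh and the two directories are only placeholder examples):
chmod +x delete_dup.sh
./delete_dup.sh /data/photos /backup/photos
#The script then prints the md5sum progress, the Files/Groups/Size summary, and walks through the prompts described above.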