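Purpose: find all duplicate files under one or more specified directories (including their subdirectories), list them in groups, and delete the redundant copies either by manual selection or automatically (the script calls this "random" selection), keeping exactly one file per group. File names containing spaces (for example, "file  name") are supported.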

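Implementation: find traverses the specified directories to collect every file, each file is checksummed with MD5, and files with identical MD5 values are grouped and handled as duplicates.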
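As a rough illustration of the approach described above, the sketch below hashes every file and groups identical checksums with sort and uniq instead of a bash array. The function name list_duplicate_groups is hypothetical and not part of the script, and the sketch assumes GNU findutils and coreutils (xargs -r, uniq -w and --all-repeated).

```bash
#!/bin/bash
# list_duplicate_groups DIR...   (hypothetical helper, not part of the original script)
# Prints "checksum  path" lines only for files whose content matches at least one
# other file, with a blank line between groups.
list_duplicate_groups() {
    find "$@" -type f -print0 |
        xargs -0 -r md5sum |                  # -r (GNU): run nothing if no files were found
        sort |                                # identical checksums become adjacent lines
        uniq -w32 --all-repeated=separate     # compare only the 32-character MD5 prefix
}
```

Calling `list_duplicate_groups DIR1 DIR2` only lists the groups; it deletes nothing.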

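Limitations: traversing the files with find is slow;

computing MD5 checksums of large files is slow;

checksumming and comparing every file is slow (a first-pass filter that compares file sizes could be used; it helps most for directories holding many large files, but this script does not use it).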
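The size-based first pass mentioned above could look roughly like the following. This is only a sketch: same_size_candidates is a hypothetical name, and it assumes bash 4 or later plus GNU find (for -printf). Files whose size is unique cannot have a duplicate, so they are dropped before any checksum is computed.

```bash
#!/bin/bash
# same_size_candidates DIR...   (hypothetical helper, not part of the original script)
# Prints, NUL-terminated, only the files whose byte size is shared with another file.
same_size_candidates() {
    local -A count_by_size=()
    local -a entries=()
    local entry size
    # GNU find -printf: %s is the size in bytes, %p the path; records end with NUL.
    while IFS= read -r -d '' entry; do
        entries+=("$entry")
        size=${entry%% *}                      # text before the first space
        count_by_size[$size]=$(( ${count_by_size[$size]:-0} + 1 ))
    done < <(find "$@" -type f -printf '%s %p\0')

    for entry in "${entries[@]}"; do
        size=${entry%% *}
        if [[ ${count_by_size[$size]} -gt 1 ]]; then
            printf '%s\0' "${entry#* }"        # path: text after the first space
        fi
    done
}
```

The survivors could then be checksummed with, for example, `same_size_candidates DIR | xargs -0 -r md5sum`.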


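Notes:

While running, the script prints the MD5 checksums as they are computed. When it finishes, it reports the following statistics: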

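Files: the total number of files checksummed

Groups: the number of duplicate-file groups

Size: the total size of the redundant copies, that is, of the duplicate files about to be deleted; in other words, the disk space that will be freed once the duplicates are removed.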
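To make the Size figure concrete: a group of N identical files contributes (N - 1) times the file size, because one copy is kept. The sketch below works through two made-up groups. reclaimable_bytes is a hypothetical helper, and the divisor 1024 for the K and M figures is an assumption, since the divisors are missing from the listing.

```bash
#!/bin/bash
# reclaimable_bytes COPIES SIZE   (hypothetical helper)
# Bytes freed by keeping one of COPIES identical files of SIZE bytes each.
reclaimable_bytes() {
    local copies=$1 size=$2
    echo $(( (copies - 1) * size ))
}

total=0
total=$(( total + $(reclaimable_bytes 3 2048) ))      # three copies of a 2048-byte file
total=$(( total + $(reclaimable_bytes 2 1048576) ))   # two copies of a 1 MiB file
# Integer division, so the K and M figures are rounded down.
echo "Size: ${total}B $((total / 1024))K $((total / 1024 / 1024))M"
# prints: Size: 1052672B 1028K 1M
```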

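At the "Show detailed information ?" prompt, press "y" to review the duplicate-file groups before deleting anything, or press "n" to skip straight to the menu for choosing how files are deleted.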

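There are two ways to delete files. The first is manual selection (the default): each group of duplicates is listed in turn and you choose the file to keep; the others are deleted. If you make no choice, the first file in the list is kept.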

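The second is automatic selection: the first file of each group is kept and the other duplicates are deleted automatically. To avoid deleting important files, the manual mode is recommended.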
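Both modes come down to the same rule: keep the file at one index of the group and delete the rest. The sketch below only prints what would be deleted, which may be useful as a dry run before trusting the automatic mode. files_to_delete is a hypothetical helper, not a function of the script.

```bash
#!/bin/bash
# files_to_delete KEEP_INDEX FILE...   (hypothetical helper)
# Prints every FILE except the one at zero-based position KEEP_INDEX.
files_to_delete() {
    local keep=$1 n=0 f
    shift
    for f in "$@"; do
        if [ "$n" -ne "$keep" ]; then
            printf '%s\n' "$f"
        fi
        n=$((n + 1))
    done
}

# Automatic mode corresponds to KEEP_INDEX 0 (keep the first file of the group).
files_to_delete 0 'docs/report.txt' 'backup/report copy.txt' 'tmp/report.txt'
```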

File names containing spaces are supported.

Code:

#!/bin/bash
#Author: LingYi
#Date:
#Func: Delete duplicate files
#EG  : $0 [ DIR1 DIR2 ... DIRn ]

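# Define the temporary log file; make sure you have write permission for it.
md5sum_result_log="/tmp/$(date +%Y%m%d%H%M%S)"

echo -e "\033[1;31mMd5suming ...\033[0m"

# Checksum every regular file under the given directories (the current directory if none are given).
# NOTE: the start of this pipeline was lost in the source; the find/xargs part is a reconstruction.
find "${@:-.}" -type f -print0 | xargs -0 -I {} md5sum {} | tee -a "$md5sum_result_log"
files_sum=$(wc -l < "$md5sum_result_log")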

# Define an associative array that uses the MD5 value as the index and the file names as the element.
# It must be declared in advance, and associative arrays require bash 4 or later.
declare -A md5sum_value_arry

while read -r md5sum_value md5sum_filename
do
    # File names may contain spaces, so '+' is used as the separator between names.
    # A '+' inside a file name will therefore cause problems; in that case the user should choose the manual mode to delete redundant files.
    md5sum_value_arry[$md5sum_value]="${md5sum_value_arry[$md5sum_value]}+$md5sum_filename"
    # Count the files per checksum in a dynamically named variable, _<md5>.
    (( _${md5sum_value}+=1 ))
done <"$md5sum_result_log"

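# Count the duplicate-file groups and the total size of the redundant files in this loop.
groups_sum=0
repfiles_size=0
for md5sum_value_index in "${!md5sum_value_arry[@]}"
do
    # Read the per-checksum counter _<md5> through indirect expansion.
    # NOTE: the original test on this line was lost; "more than one file" is a reconstruction.
    count_var="_$md5sum_value_index"
    if [[ ${!count_var} -gt 1 ]]; then
        groups_sum=$((groups_sum + 1))
        need_print_indexes="$need_print_indexes $md5sum_value_index"
        repfile_sum=$(( ${!count_var} - 1 ))
        # Size in bytes of the first file in the group (field 2 of the '+'-separated list).
        repfile_size=$(wc -c < "$(echo "${md5sum_value_arry[$md5sum_value_index]}" | awk -F'+' '{print $2}')")
        repfiles_size=$(( repfiles_size + repfile_sum*repfile_size ))
    fi
done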

# Output the statistics.
# NOTE: the divisors were lost in the source; 1024 is assumed for the K and M figures.
echo -e "\033[1;31mFiles: $files_sum    Groups: $groups_sum    \
Size: ${repfiles_size}B $((repfiles_size/1024))K $((repfiles_size/1024/1024))M\033[0m"
[[ $groups_sum -eq 0 ]] && exit

# The user chooses whether to view the duplicate-file groups.
# NOTE: the -n and -t values were lost in the source; one key and 60 seconds are assumed here.
read -n 1 -s -t 60 -p 'Show detailed information ?' user_ch
[[ $user_ch == 'n' ]] && echo || {
    [[ $user_ch == 'q' ]] && exit
    for print_value_index in $need_print_indexes
    do
        echo -ne "\n\033[1;35m$((++i)) \033[0m"
        # Print the checksum and its file count, then one file name per line.
        count_var="_$print_value_index"
        echo -ne "\033[1;34m$print_value_index [ ${!count_var} ]:\033[0m"
        echo "${md5sum_value_arry[$print_value_index]}" | tr '+' '\n'
    done | more
}

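# The user chooses how files are deleted.
# NOTE: the timeout values and the numeric constants were lost in the source; 60 seconds, 1 and 2 are assumed from the menu below.
echo -e "\n\nManual Selection by default !"
echo -e " 1 Manual selection\n 2 Random selection"
echo -ne "\033[1;31m"
read -r -t 60 USER_CH
echo -ne "\033[0m"
[[ $USER_CH == 'q' ]] && exit
if [[ $USER_CH != 2 ]]; then
    USER_CH=1
else
    echo -ne "\033[31mWARNING: you have chosen the Random Selection mode, files will be deleted at random !\nAre you sure ?\033[0m"
    read -r -t 60 yn
    [[ $yn == 'q' ]] && exit
    # Fall back to manual mode unless the user confirms with 'y' (reconstructed; the original line is truncated).
    [[ $yn != 'y' ]] && USER_CH=1
fi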

# Handle the files according to the user's selection.
echo -e "\033[31m\nWarn: keep the first file by default.\033[0m"
for exec_value_index in $need_print_indexes
do
    # Build the array of files in this duplicate group by splitting the '+'-separated list
    # (the leading '+' is stripped first). Assigning the whole array also clears the previous group.
    IFS='+' read -r -a file_choices_arry <<< "${md5sum_value_arry[$exec_value_index]#+}"

    if [[ $USER_CH -eq 1 ]]; then
        # Manual mode: handle the duplicate-file groups one by one.
        echo -e "\033[1;34m$exec_value_index\033[0m"
        for ((j=0; j<${#file_choices_arry[@]}; j++))
        do
            echo "[ $j ]  ${file_choices_arry[j]}"
        done
        read -r -p "Number of the file you want to keep: " num_ch
        [[ $num_ch == 'q' ]] && exit
        # NOTE: the original validation line was truncated; this check is a reconstruction.
        # Anything that is not a valid index falls back to 0 (keep the first file).
        if [[ $num_ch =~ ^[0-9]+$ ]] && (( 10#$num_ch < ${#file_choices_arry[@]} )); then
            num_ch=$((10#$num_ch))
        else
            num_ch=0
        fi
    else
        # Automatic mode: always keep the first file of the group.
        num_ch=0
    fi
    # In both modes, delete every file in the group except the one being kept.
    for ((n=0; n<${#file_choices_arry[@]}; n++))
    do
        [[ $n -ne $num_ch ]] && {
            echo -ne "\033[1mDeleting file \" ${file_choices_arry[n]} \" ... \033[0m"
            if rm -f -- "${file_choices_arry[n]}"; then
                echo -e "\033[1;32mOK"
            else
                echo -e "\033[1;31mFAIL"
            fi
            echo -ne "\033[0m"
        }
    done
done
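
The script joins file names with '+', so a name that itself contains '+' is split in the wrong place. One possible alternative, sketched here and not taken from the script, is to separate names with newlines and split them with mapfile. It assumes bash 4.2 or later and file names without newline characters; files_by_md5, add_file and load_group are hypothetical names, and the checksum key and paths are made up.

```bash
#!/bin/bash
# Hypothetical alternative to the '+'-joined strings; all names here are illustrative.
declare -gA files_by_md5=()

# add_file CHECKSUM PATH: append PATH to the newline-separated list kept for CHECKSUM.
add_file() {
    files_by_md5[$1]+="$2"$'\n'
}

# load_group CHECKSUM: split the list for CHECKSUM into the global array "group".
load_group() {
    mapfile -t group < <(printf '%s' "${files_by_md5[$1]}")
}

add_file d41d8cd9 'photos/a+b.jpg'
add_file d41d8cd9 'photos/file  name.jpg'
load_group d41d8cd9
printf '[ %s ]\n' "${group[@]}"
```

With this layout both '+' and runs of spaces inside a name survive unchanged, at the cost of not supporting names that contain a newline.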
