Data deduplication is an approach to storing data that eliminates duplicate data at the chunk level.
A typical data deduplication workflow works as follows.

File metadata describes how to restore a file from its unique chunks.

The chunk-level deduplication approach has five key stages.

  1. Chunking
  2. Fingerprinting
  3. Indexing fingerprints
  4. Further compression
  5. Storage management

Each stage has its own challenges, any of which may become a bottleneck when compressing or restoring files.
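
To make the five stages concrete, here is a minimal sketch in Python. It is illustrative only: the function names, the in-memory dict used as the fingerprint index, and the fixed 4 KiB chunk size are assumptions, not any real system's API.

    import hashlib

    CHUNK_SIZE = 4096  # fixed-size chunking, for simplicity

    def chunk_stream(data):
        """Stage 1: split the byte stream into chunks."""
        for i in range(0, len(data), CHUNK_SIZE):
            yield data[i:i + CHUNK_SIZE]

    def deduplicate(data):
        """Stages 2-3: fingerprint each chunk and index the fingerprints."""
        store = {}    # index: fingerprint -> unique chunk
        recipe = []   # file metadata: ordered fingerprints to restore the file
        for chunk in chunk_stream(data):
            fp = hashlib.sha256(chunk).hexdigest()  # stage 2: fingerprinting
            if fp not in store:                     # stage 3: index lookup
                store[fp] = chunk                   # store only unique chunks
            recipe.append(fp)
        return store, recipe

    def restore(store, recipe):
        """Rebuild the original file from its recipe and the chunk store."""
        return b"".join(store[fp] for fp in recipe)

Calling deduplicate(data) and then restore(store, recipe) returns the original bytes; the space saved is the size of every duplicate chunk that was never stored a second time.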

Chunking

At the chunking stage, the data stream is split into chunks, each of which is later represented by its fingerprint.
Different splitting methods produce different deduplication results and differ in efficiency.

Splitting methods fall into two categories:
Fixed-Size Chunking, which simply splits the data stream into fixed-size chunks.
Content-Defined Chunking, which splits the data stream into variable-size chunks, depending on the content.

Although fixed-size chunking is simple and fast, its biggest problem is boundary shift: when a small part of the data stream is modified, all the subsequent chunks change because their boundaries are shifted, so they are no longer detected as duplicates.
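
A tiny demonstration of the boundary shift, with an unrealistically small 8-byte chunk size chosen only to keep the output short:

    def fixed_chunks(data, size=8):
        return [data[i:i + size] for i in range(0, len(data), size)]

    original = b"ABCDEFGHIJKLMNOPQRSTUVWX"
    modified = b"x" + original          # insert a single byte at the front

    before = fixed_chunks(original)     # [b'ABCDEFGH', b'IJKLMNOP', b'QRSTUVWX']
    after = fixed_chunks(modified)      # [b'xABCDEFG', b'HIJKLMNO', b'PQRSTUVW', b'X']

    # Every boundary after the edit has shifted, so not a single chunk matches
    # and no duplicates are detected, although the data is almost identical.
    print(set(before) & set(after))     # set()
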
Content-defined chunking slides a window over the content of the data stream and computes a hash value of the window. Whenever the hash value satisfies a predefined condition, it generates a chunk boundary at that position.
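
Here is a minimal content-defined chunking sketch using a Gear-style rolling hash. The random GEAR table, the 13-bit mask (giving an expected average chunk size of about 8 KiB), and resetting the hash at each boundary are simplifying assumptions, not the exact design of any specific algorithm:

    import random

    random.seed(42)
    GEAR = [random.getrandbits(32) for _ in range(256)]  # per-byte random values
    MASK = (1 << 13) - 1  # boundary condition: expected average chunk ~ 8 KiB

    def cdc_chunks(data):
        """Yield variable-size chunks whose boundaries depend only on content."""
        h, start = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF  # rolling hash update
            if (h & MASK) == 0:                       # predefined condition met
                yield data[start:i + 1]               # declare a chunk boundary
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]                        # trailing chunk

Because each boundary depends only on the bytes near it, inserting data early in the stream leaves later boundaries, and thus later chunks, unchanged, which is exactly what fixed-size chunking cannot guarantee.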

Chunk size can also be constrained. With CDC (content-defined chunking), the chunk size cannot be controlled directly, and in extreme cases it can produce chunks that are far too large or far too small. If a chunk is too large, the deduplication ratio decreases, because a large chunk can hide duplicates from being detected. If a chunk is too small, the file metadata grows, and the larger number of fingerprints makes indexing harder. A common remedy is to define maximum and minimum chunk sizes, as sketched below.
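
Extending the cdc_chunks sketch above (reusing its GEAR table and MASK), the bounds can be enforced inside the chunking loop; the MIN_SIZE and MAX_SIZE values here are illustrative assumptions:

    MIN_SIZE = 2 * 1024    # skip boundary checks until the chunk is large enough
    MAX_SIZE = 64 * 1024   # force a boundary before a chunk grows too large

    def bounded_cdc_chunks(data):
        h, start = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
            length = i + 1 - start
            if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]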

Chunking still has open problems, such as how to detect duplicates accurately and how to reduce the computational cost of chunking itself.
