Data Deduplication Workflow Part 1
Data deduplication provides an approach to storing data that eliminates duplicate data at the chunk level.
A typical data deduplication workflow can be explained as follows.

File metadata describes how to restore the file from its unique chunks.
The chunk-level deduplication approach has five key stages:
- Chunking
- Fingerprinting
- Indexing fingerprints
- Further compression
- Storage management
Each stage has its own challenges, any of which may become the bottleneck when compressing or restoring files.
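The five stages can be sketched end to end in a few lines. This is only a toy illustration, not a real deduplication system: the names `dedup_store` and `restore`, the 4-byte fixed-size chunks, SHA-256 as the fingerprint, and an in-memory dict standing in for both the index and the storage layer are all assumptions made for the example.

```python
import hashlib
import zlib


def dedup_store(data: bytes, chunk_size: int = 4):
    """Toy pass over the five stages; returns (index, recipe)."""
    index = {}   # fingerprint -> compressed unique chunk (index + storage)
    recipe = []  # file metadata: ordered list of fingerprints
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]      # 1. chunking
        fp = hashlib.sha256(chunk).hexdigest()      # 2. fingerprinting
        if fp not in index:                         # 3. indexing fingerprints
            index[fp] = zlib.compress(chunk)        # 4. further compression, 5. storage
        recipe.append(fp)
    return index, recipe


def restore(index, recipe) -> bytes:
    """Rebuild the file by following the recipe through the unique chunks."""
    return b"".join(zlib.decompress(index[fp]) for fp in recipe)


data = b"abcdabcdabcdxyz"
index, recipe = dedup_store(data)
print(len(recipe), len(index))          # 4 chunks referenced, 2 stored
print(restore(index, recipe) == data)   # True
```

The recipe plays the role of the file metadata: it lists four fingerprints, while only two distinct chunks are actually stored.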
Chunking
At the chunking stage, the data stream is split into chunks, each of which can then be represented by a fingerprint.
Different methods of splitting the data stream produce different chunks and differ in efficiency.
Splitting methods fall into two categories:
- Fixed Size Chunking, which simply splits the data stream into chunks of a fixed size.
- Content Defined Chunking, which splits the data into variable-size chunks, depending on the content.
Although fixed size chunking is simple and quick, its biggest problem is boundary shift. The boundary shift problem arises when a small part of the data stream is modified by inserting or deleting bytes: all the subsequent chunks change, because their boundaries have shifted.
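The boundary shift problem is easy to reproduce. The sketch below uses hypothetical helpers (`fixed_chunks`, `fingerprints`) and a 4-byte chunk size chosen only to keep the data readable; it compares how many fingerprints survive a one-byte insertion versus a one-byte overwrite.

```python
import hashlib


def fixed_chunks(data: bytes, size: int):
    """Split data into consecutive chunks of exactly `size` bytes (last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprints(chunks):
    """Set of SHA-256 fingerprints, one per chunk."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}


original = b"AAAABBBBCCCCDDDD"
inserted = b"X" + original        # one byte inserted at the front
overwritten = b"X" + original[1:]  # one byte replaced, length unchanged

old = fingerprints(fixed_chunks(original, 4))
shared_after_insert = len(old & fingerprints(fixed_chunks(inserted, 4)))
shared_after_overwrite = len(old & fingerprints(fixed_chunks(overwritten, 4)))

print(shared_after_insert)     # 0: every boundary moved, no chunk matches
print(shared_after_overwrite)  # 3: only the touched chunk changed
```

A single inserted byte turns `AAAA, BBBB, CCCC, DDDD` into `XAAA, ABBB, BCCC, CDDD, D`, so none of the previously stored chunks can be reused.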
Content defined chunking slides a window over the content of the data stream and computes a hash value of the window. If the hash value satisfies some predefined condition, a chunk boundary is placed at that position.
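A minimal sketch of this idea follows. The polynomial rolling hash, the 16-byte window, and the condition "low 6 bits of the hash are zero" are assumptions picked for illustration; real systems use their own hash functions and parameters. The name `cdc_chunks` is hypothetical.

```python
import random


def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F):
    """Cut a chunk wherever the hash of the last `window` bytes has its masked bits all zero."""
    base, mod = 257, (1 << 31) - 1
    top = pow(base, window - 1, mod)  # weight of the byte that leaves the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod  # drop the oldest byte
        h = (h * base + byte) % mod                 # take in the newest byte
        if i + 1 >= window and (h & mask) == 0:     # predefined condition met
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                 # trailing remainder
    return chunks


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(4096))
before = cdc_chunks(data)
after = cdc_chunks(b"X" + data)  # same one-byte insertion that breaks fixed-size chunking

# Every chunk except the first one is found again after the insertion.
print(set(before[1:]) <= set(after))
```

Because the hash depends only on the bytes inside the window, a boundary moves together with the content around it. An insertion therefore disturbs only the chunk it lands in, and later boundaries fall on the same content as before.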
Chunk size can also be tuned. With CDC (content defined chunking), the chunk size is not under our direct control, and under extreme conditions the chunks it generates can be too large or too small. If a chunk is too large, the compression ratio decreases, because a large chunk can hide duplicates from being detected. If chunks are too small, the file metadata grows and, what's more, fingerprint indexing becomes a problem. So we can define a maximum and a minimum chunk size.
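One way to apply such limits is sketched below, under stated assumptions: the function name `bounded_cdc`, the 32-byte minimum, the 256-byte maximum, and the use of `zlib.crc32` as the window hash are all illustrative choices. For brevity the hash is recomputed at every position instead of being rolled.

```python
import zlib


def bounded_cdc(data: bytes, min_size: int = 32, max_size: int = 256,
                window: int = 16, mask: int = 0x3F):
    """Content defined chunking with a minimum and a maximum chunk size."""
    chunks, start, n = [], 0, len(data)
    while start < n:
        end = min(start + max_size, n)  # hard upper bound on this chunk
        cut = end                       # forced cut if no boundary is found
        # Start searching only after min_size bytes, so short chunks are never produced.
        for i in range(start + min_size, end):
            win = data[max(i - window, 0):i]
            if zlib.crc32(win) & mask == 0:  # content-defined boundary
                cut = i
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks


# Extreme input: constant data makes the window hash identical at every position,
# so the condition is either always met or never met.
zeros = bytes(1000)
sizes = [len(c) for c in bounded_cdc(zeros)]
print(min(sizes[:-1]) >= 32, max(sizes) <= 256)  # True True
```

Only the final chunk of a stream may be shorter than the minimum, since the data simply runs out. The limits trade a little boundary stability for a bounded amount of metadata per file.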
Chunking still has open problems, such as how to detect duplicates accurately and how to reduce the computation time.