Data Deduplication Workflow Part 1
Data deduplication is an approach to storing data that eliminates duplicate data at the chunk level.
A typical data deduplication workflow can be described as follows.

File metadata describes how to restore a file from its unique chunks.
A chunk-level deduplication approach has five key stages:
- Chunking
- Fingerprinting
- Indexing fingerprints
- Further compression
- Storage management
Each stage has its own challenges, and any of them may become the bottleneck when restoring or compressing files.
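As a concrete reference for the rest of this post, here is a minimal sketch of how the stages fit together, assuming fixed-size chunking, SHA-256 fingerprints, and an in-memory dictionary as the index. The names and sizes (deduplicate, chunk_store, CHUNK_SIZE) are illustrative, not a real API, and the further-compression stage is omitted for brevity.

```python
# Minimal sketch of the deduplication workflow (illustrative assumptions only).
import hashlib

CHUNK_SIZE = 4096          # fixed chunk size in bytes (assumption)
chunk_store = {}           # storage management: fingerprint -> chunk data
fingerprint_index = set()  # indexing fingerprints: chunks we already hold

def deduplicate(data: bytes) -> list[str]:
    """Split data into chunks, store only unseen chunks, and return the
    file recipe: an ordered list of fingerprints used to restore the file."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]      # chunking
        fp = hashlib.sha256(chunk).hexdigest()        # fingerprinting
        if fp not in fingerprint_index:               # indexing fingerprints
            fingerprint_index.add(fp)
            chunk_store[fp] = chunk                   # storage management
        recipe.append(fp)
    return recipe

def restore(recipe: list[str]) -> bytes:
    """Rebuild the original file from its recipe and the chunk store."""
    return b"".join(chunk_store[fp] for fp in recipe)
```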
Chunking
At the chunking stage, we split the data stream into chunks, each of which can be represented by a fingerprint.
Different methods of splitting the data stream produce different results and have different efficiency.
Splitting methods fall into two categories:
- Fixed-Size Chunking, which simply splits the data stream into fixed-size chunks.
- Content-Defined Chunking (CDC), which splits the data into variable-size chunks, with boundaries determined by the content.
Although fixed-size chunking is simple and fast, its biggest problem is boundary shift: when a small part of the data stream is modified, all subsequent chunk boundaries shift, so every chunk after the modification changes.
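A small sketch makes the boundary shift concrete: inserting a single byte near the start of a stream changes every fixed-size chunk after the insertion point. The helper below is hypothetical and reuses SHA-256 fingerprints for comparison.

```python
# Hypothetical demonstration of the boundary shift problem.
import hashlib, os

def fixed_chunks(data: bytes, size: int = 4096) -> list[str]:
    """Return the SHA-256 fingerprint of each fixed-size chunk."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

original = os.urandom(100_000)
modified = original[:10] + b"X" + original[10:]   # insert one byte near the start

a, b = fixed_chunks(original), fixed_chunks(modified)
shared = len(set(a) & set(b))
print(f"{shared} of {len(a)} original chunks still match")  # typically prints 0
```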
Content-defined chunking slides a window over the content of the data stream and computes a hash of the window; whenever the hash satisfies a predefined condition, it declares a chunk boundary.
Chunk size can also be optimized. With CDC, the chunk size cannot be controlled directly, and in extreme cases it may produce chunks that are too large or too small. If a chunk is too large, the compression ratio decreases, because a large chunk can hide duplicates from being detected. If a chunk is too small, the file metadata grows, and the fingerprint index becomes larger and harder to manage. Therefore we can enforce a minimum and maximum chunk size, as in the sketch below.
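The following is one possible shape for a content-defined chunker, using a Gear-style rolling hash together with the min/max limits discussed above. The table, mask, and size bounds are assumptions for illustration, not parameters from any specific system.

```python
# Simplified Gear-style content-defined chunker (illustrative only).
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]   # one random value per byte value

MIN_SIZE = 2048                  # skip boundary checks below this size (assumption)
MAX_SIZE = 65536                 # force a boundary above this size (assumption)
AVG_MASK = (1 << 13) - 1         # ~8 KB average chunk beyond the minimum (assumption)

def cdc_chunks(data: bytes):
    """Yield variable-size chunks whose boundaries depend on the content."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF   # rolling hash update
        length = i - start + 1
        if length < MIN_SIZE:
            continue                        # avoid chunks that are too small
        if (h & AVG_MASK) == 0 or length >= MAX_SIZE:
            yield data[start:i + 1]         # hash condition met or chunk too large
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]                  # trailing bytes form the last chunk
```

Run the same file through cdc_chunks twice, once with a few bytes inserted, and most chunks after the insertion point are usually identical again, because the boundaries re-synchronize with the content instead of with fixed offsets.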
Chunking still has open problems, such as how to detect duplicates accurately and how to reduce the computational cost.