Data Deduplication Workflow Part 1

Data deduplication is an approach to storing data that eliminates duplicate data at the chunk level.
A typical data deduplication workflow can be explained as follows.

The file metadata describes how to restore a file from its unique chunks.
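
As a rough illustration of what such metadata can look like, here is a minimal sketch in which the metadata is simply an ordered list of chunk fingerprints (often called a file recipe). The function names `dedup_store` and `restore`, the in-memory `dict` used as the chunk store, the fixed chunk size, and the choice of SHA-1 are all assumptions made for this example, not something the workflow prescribes.

```python
import hashlib

def dedup_store(data: bytes, chunk_size: int, store: dict) -> list:
    """Split data into fixed-size chunks, keep each unique chunk once in
    `store`, and return the file recipe (ordered list of fingerprints)."""
    recipe = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fp = hashlib.sha1(chunk).hexdigest()  # fingerprint of the chunk
        store.setdefault(fp, chunk)           # only the first copy is stored
        recipe.append(fp)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the file by looking up each fingerprint in order."""
    return b"".join(store[fp] for fp in recipe)

store = {}
data = b"ABCDABCDABCDXYZ!"
recipe = dedup_store(data, 4, store)
print(len(recipe), len(store))  # 4 recipe entries, 2 unique chunks
print(restore(recipe, store) == data)
```

The recipe has one entry per chunk of the original file, while the store holds each distinct chunk only once, which is where the space saving comes from.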

Chunk-level deduplication has five key stages.

Chunking
Fingerprinting
Indexing fingerprints
Further compression
Storage management

Each stage has its own challenges, any of which may become the bottleneck when compressing or restoring files.

Chunking

At the chunking stage, we split the data stream into chunks, each of which can then be represented by a fingerprint.
Different methods of splitting the data stream give different results and different efficiency.

The splitting methods can be divided into two categories:
Fixed Size Chunking, which simply splits the data stream into fixed-size chunks.
Content Defined Chunking, which splits the data into variable-size chunks, depending on the content.

Although fixed size chunking is simple and quick, its biggest problem is Boundary Shift. The Boundary Shift Problem occurs when a small part of the data stream is modified, for example by inserting or deleting a few bytes: every subsequent chunk changes, because all the later boundaries are shifted.
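
The boundary shift effect is easy to reproduce. The sketch below is only an illustration: the helper names `fixed_chunks` and `fingerprints`, the 64-byte chunk size, and the pseudo-random test data built from SHA-256 digests are assumptions chosen for the example.

```python
import hashlib

def fixed_chunks(data: bytes, size: int) -> list:
    """Cut the stream every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def fingerprints(chunks: list) -> set:
    """Set of SHA-1 fingerprints, one per distinct chunk."""
    return {hashlib.sha1(c).hexdigest() for c in chunks}

# 1024 bytes of deterministic pseudo-random test data (32 digests of 32 bytes).
original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(32))
modified = b"X" + original  # a single byte inserted at the front

a = fingerprints(fixed_chunks(original, 64))
b = fingerprints(fixed_chunks(modified, 64))
shared = len(a & b)
print(len(a), len(b), shared)  # no chunk of the original is found again
```

Although the two streams differ by one byte, every fixed-size chunk after the insertion point has different content, so none of the fingerprints match.
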
Content defined chunking slides a window over the content of the data stream and computes a hash value of the window. If the hash value satisfies some predefined condition, a chunk boundary is placed at that position.
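
A minimal sketch of this idea follows. It uses a simple polynomial rolling hash and the condition "the low bits of the hash are all zero"; real systems typically use other hash functions, and the function name `cdc_chunks`, the window size, and the mask are assumptions picked for the example.

```python
import hashlib

def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Cut after every position where the rolling hash of the last
    `window` bytes has all bits under `mask` equal to zero."""
    BASE, MOD = 257, (1 << 31) - 1
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i + 1 >= window and (h & mask) == 0:     # predefined condition
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                 # trailing remainder
    return chunks

# 4096 bytes of deterministic pseudo-random test data.
original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
modified = b"X" + original  # a single byte inserted at the front

orig_chunks = cdc_chunks(original)
mod_chunks = cdc_chunks(modified)
reused = set(orig_chunks[1:]) <= set(mod_chunks)
print(len(orig_chunks), len(mod_chunks), reused)
```

Because each cut decision depends only on the bytes inside the window, inserting a byte at the front disturbs only the first chunk: every later chunk of the original appears unchanged in the modified stream.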

Chunk size can also be optimized. With CDC (content defined chunking), the chunk size cannot be controlled directly, and in some extreme cases it generates chunks that are too large or too small. If a chunk is too large, the compression ratio decreases, because a large chunk can hide duplicates from being detected. If chunks are too small, the file metadata grows. What's more, the larger number of fingerprints can cause problems at the fingerprint indexing stage. So we can define a maximum and a minimum chunk size.
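
One common way to apply such limits, sketched below under assumptions, is to ignore hash hits until the current chunk has reached the minimum size and to force a cut once it reaches the maximum size. The function name `bounded_cdc`, the polynomial rolling hash, and the default parameter values are illustrative choices, not a reference implementation.

```python
def bounded_cdc(data: bytes, min_size: int = 32, max_size: int = 256,
                window: int = 16, mask: int = 0x3F) -> list:
    """Content defined chunking with a minimum and maximum chunk size."""
    BASE, MOD = 257, (1 << 31) - 1
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + byte) % MOD
        length = i + 1 - start
        hit = i + 1 >= window and (h & mask) == 0
        # Accept a hash hit only past min_size; force a cut at max_size.
        if (hit and length >= min_size) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # the last chunk may be shorter than min_size
    return chunks

# An all-zero stream makes this hash satisfy the condition at every position,
# which without a minimum size would produce a flood of tiny chunks.
zeros = bytes(1000)
chunks = bounded_cdc(zeros)
print(len(chunks), len(chunks[0]), len(chunks[-1]))  # 32 32 8
```

With the limits in place, every chunk except possibly the last one has a length between `min_size` and `max_size`, at the cost of some boundaries no longer being purely content defined.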

Chunking still has some open problems, such as how to detect duplicates accurately and how to reduce the computation time.
