Data compression is an approach to shrinking an original dataset in order to save space. According to reports in the Economist, the amount of digital data in the world is growing explosively, increasing from 1.2 zettabytes in 2010 to 1.8 zettabytes in 2011. Compressing data and managing storage cost-effectively is therefore a challenging and important task.

Traditionally, we use compression algorithms to achieve data reduction. The main idea of data compression is to use the fewest bits to represent the information as accurately as possible. Because the goal is accuracy rather than perfection, we may be allowed to discard some useless information when decoding the compressed data back into its represented form. Classical compression approaches can be classified into lossless compression and lossy compression; the difference between them is whether unnecessary information is lost.

Lossless compression reduces data by identifying and eliminating statistical redundancy in a reversible fashion. To remove redundant information, it can use statistical properties to build a new encoding, as Huffman coding does, or it can use a dictionary model, replacing repeated strings via a sliding-window algorithm. What matters is that with lossless compression, restoring the data yields the original without losing any information, as the sketch below illustrates.
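As a concrete illustration, here is a minimal Python sketch of Huffman coding (a toy, not a production codec): it builds a prefix-free code from symbol frequencies, so frequent symbols get shorter codes, and the round trip recovers the input exactly.

```python
import heapq
from collections import Counter

def huffman_table(data: bytes) -> dict[int, str]:
    """Build a prefix-free code table: frequent symbols get shorter codes."""
    freq = Counter(data)
    # Heap entries are (frequency, tie_breaker, {symbol: partial_code}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate input: one distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:                    # repeatedly merge the two rarest subtrees
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(data: bytes, table: dict[int, str]) -> str:
    return "".join(table[b] for b in data)

def decode(bits: str, table: dict[int, str]) -> bytes:
    inverse = {code: sym for sym, code in table.items()}
    out, buf = bytearray(), ""
    for bit in bits:                        # prefix-free codes decode unambiguously
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return bytes(out)

data = b"abracadabra"
table = huffman_table(data)
bits = encode(data, table)
assert decode(bits, table) == data          # lossless: the original is recovered exactly
print(f"{len(bits)} bits compressed vs {8 * len(data)} bits raw")
```

A real compressor would pack the bit string into bytes and ship the code table alongside the payload; the sketch keeps codes as Python strings purely for clarity.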

Lossy compression reduces data by identifying unnecessary information and irretrievably removing it. The removed information does carry content, but that content may not be useful in a particular domain. In fields where we need only the useful information and can ignore the rest, such as image, audio, and video, lossy compression works well. The price is that we cannot recover the original data after applying a lossy algorithm.
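To make "irretrievable removal" concrete, here is a minimal sketch of scalar quantization, the mechanism behind many image and audio codecs. The sample values are hypothetical and this mimics no specific codec; it only shows that the low-order detail is gone for good.

```python
# Hypothetical 8-bit sensor samples (illustrative values only).
samples = [12, 201, 57, 58, 130, 131]

# Quantize: keep only which of 16 coarse buckets each value falls in (4 bits each).
quantized = [round(s / 16) for s in samples]
restored = [q * 16 for q in quantized]

print(quantized)   # [1, 13, 4, 4, 8, 8]  -- nearby values collapse into one bucket
print(restored)    # [16, 208, 64, 64, 128, 128]  -- close to, but not equal to, the input
```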

For a lossless approach, eliminating statistical redundancy becomes expensive as the data grows, because the approach needs statistical information gathered by counting over all the data. For a large dataset, it must therefore trade off speed against compression ratio, as the sketch below shows.
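A minimal sketch of that tradeoff, assuming Python's built-in zlib (DEFLATE) as the lossless compressor: higher levels search harder for redundancy, spending more time for a smaller output.

```python
import time
import zlib

# Highly redundant input so the effect is visible; 50,000 repetitions of 45 bytes.
data = b"the quick brown fox jumps over the lazy dog " * 50_000

for level in (1, 6, 9):
    start = time.perf_counter()
    packed = zlib.compress(data, level)   # level 1 = fastest, level 9 = best ratio
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(packed):>8} bytes in {elapsed:.4f}s")
```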

Beyond classical compression algorithms, there are two further methods for reducing data: delta compression and data deduplication.

Delta compression offers a different perspective: it compresses two very similar files. It compares files A and B and calculates the delta between them, so that file B can be expressed as file A plus the delta, which saves space. Delta compression is commonly used in source-code version control and file synchronization.
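Here is a hypothetical sketch of the idea using Python's difflib, not a real delta format such as VCDIFF or xdelta: the "delta" stores copy instructions that reference ranges of A plus the literal bytes that are new in B, and replaying it reconstructs B from A.

```python
import difflib

a = "line one\nline two\nline three\n"
b = "line one\nline 2\nline three\nline four\n"

# Build the delta: copy ranges shared with A, insert text unique to B.
delta = []
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
    if tag == "equal":
        delta.append(("copy", i1, i2))       # shared content: store only A's range
    else:                                     # replace / insert / delete
        delta.append(("insert", b[j1:j2]))    # new content: store it literally

# Reconstruction: B = A + delta(A, B)
rebuilt = ""
for instr in delta:
    if instr[0] == "copy":
        rebuilt += a[instr[1]:instr[2]]
    else:
        rebuilt += instr[1]
assert rebuilt == b
```

The delta only pays for the parts of B that A lacks, which is why this works so well when the two files are nearly identical.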

Data deduplication targets large-scale systems and operates at a coarse granularity (file level, or chunk level with chunks of roughly 8 KB); chunk-level deduplication is preferred over file-level because it achieves better compression performance. In general, data deduplication splits the backup data into chunks and identifies each chunk by its own cryptographically secure hash signature (e.g., SHA-1). When chunks are identical, it removes the duplicates and stores only one copy, thereby saving space. Only the unique chunks and the file metadata needed to reconstruct the original files are stored.
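A minimal sketch of that pipeline, assuming fixed-size 8 KB chunks and an in-memory dict as the chunk store (real systems typically use content-defined chunking and persistent indexes): each chunk is identified by its SHA-1 fingerprint, the store keeps one copy per unique fingerprint, and a file becomes a list of fingerprints from which it can be rebuilt.

```python
import hashlib

CHUNK_SIZE = 8 * 1024                      # fixed-size 8 KB chunks
chunk_store: dict[str, bytes] = {}         # fingerprint -> unique chunk data

def backup(data: bytes) -> list[str]:
    """Split data into chunks; return the file's recipe (list of fingerprints)."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()
        chunk_store.setdefault(fp, chunk)  # store the chunk only if it is new
        recipe.append(fp)
    return recipe

def restore(recipe: list[str]) -> bytes:
    """Reconstruct the original file from its recipe."""
    return b"".join(chunk_store[fp] for fp in recipe)

# Two backups that share most of their content store the shared chunks only once.
v1 = b"A" * CHUNK_SIZE * 3
v2 = b"A" * CHUNK_SIZE * 2 + b"B" * CHUNK_SIZE
r1, r2 = backup(v1), backup(v2)
assert restore(r1) == v1 and restore(r2) == v2
print(len(chunk_store), "unique chunks for", len(r1) + len(r2), "logical chunks")  # 2 vs 6
```

The recipe plus the chunk store is exactly the "unique chunks and file metadata" described above: six logical chunks across the two backups collapse to two stored copies.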
