[IR] Huffman Coding

Static: Huffman coding

Dynamic: Adaptive Huffman

Some concepts

Compression metrics

• Compress a 10MB file to 2MB
• Compression ratio = 5 or 5:1
• Space savings = 0.8 or 80%
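The two metrics above can be checked with a few lines of Python. This is a minimal sketch; the function names `compression_ratio` and `space_savings` are illustrative and not taken from any library.

```python
def compression_ratio(original_size, compressed_size):
    """Uncompressed size divided by compressed size."""
    return original_size / compressed_size


def space_savings(original_size, compressed_size):
    """Fraction of the original size that was removed."""
    return 1 - compressed_size / original_size


# The 10MB -> 2MB example from the list above (sizes in MB).
ratio = compression_ratio(10, 2)
savings = space_savings(10, 2)
print(ratio, savings)
```

Both functions only need the two sizes in the same unit, so bytes work just as well as MB.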

Symmetric vs. asymmetric

• Symmetric compression
　　– requires same time for encoding and decoding
　　– used for live mode applications (teleconference)
• Asymmetric compression: slow to compress, fast to decompress
　　– performed once when enough time is available
　　– decompression performed frequently, must be fast
　　– used for retrieval mode applications (e.g., an interactive CD-ROM)

Uniquely decodable

This is a basic requirement for compression: every encoded bit string must decode to exactly one original message.
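One common way to guarantee unique decodability is to use a prefix-free code, where no codeword is a prefix of another (Huffman codes have this property). Being prefix-free is sufficient but not necessary for unique decodability. The sketch below checks that property; the codeword sets are made-up examples, not taken from these notes.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of another codeword."""
    words = sorted(codewords)
    # After sorting, if any word is a prefix of another, it is also a
    # prefix of its immediate successor, so adjacent pairs are enough.
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))


good = is_prefix_free(["00", "01", "10", "110", "111"])
bad = is_prefix_free(["0", "01", "10"])  # "0" is a prefix of "01"
print(good, bad)
```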

Static or Dynamic codes

Static:

Huffman coding. The code for each character must be known.

Dynamic:

Adaptive Huffman.

Shannon’s Result

Huffman coding

Prerequisite: the symbol frequencies (Freq) must be known.

Average code length:

L = ( 30*2 + 30*2 + 20*2 + 10*3 + 10*3 ) / 100
= 220 / 100
= 2.2
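The average length of 2.2 bits can be reproduced by building the Huffman tree for frequencies 30, 30, 20, 10, 10. The following is a minimal sketch using the standard `heapq` module; the symbol names A to E are hypothetical, since the notes only give the frequencies, and only code lengths (not the actual bit patterns) are computed.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Map each symbol to its Huffman code length, given {symbol: frequency}."""
    tie = count()  # tie-breaker so the heap never compares the symbol lists
    heap = [(f, next(tie), [s]) for s, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        # Merge the two least frequent subtrees.
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1  # every symbol under the merged node moves one level deeper
        heapq.heappush(heap, (f1 + f2, next(tie), syms1 + syms2))
    return lengths


# Hypothetical symbol names; the frequencies are the ones used above.
freqs = {"A": 30, "B": 30, "C": 20, "D": 10, "E": 10}
lengths = huffman_code_lengths(freqs)
avg = sum(freqs[s] * lengths[s] for s in freqs) / sum(freqs.values())
print(lengths, avg)
```

The sketch does not handle the single-symbol case, where a real encoder would still need to assign at least one bit.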

Problem 1

The Shannon theoretical limit (the entropy, with log taken base 2) is:

H
= -0.3 * log 0.3 + -0.3 * log 0.3 + -0.2 * log 0.2 + -0.1 * log 0.1 + -0.1 * log 0.1
= -0.3*(-1.737) + -0.3*(-1.737) + -0.2 * (-2.322) + -0.1 * (-3.322) + -0.1 * (-3.322)
= 0.3 log 10/3 + 0.3 log 10/3 + 0.2 log 5 + 0.1 log 10 + 0.1 log 10
= 0.3*1.737 + 0.3*1.737 + 0.2* 2.322 + 0.1*3.322 + 0.1*3.322
= 2.17 < 2.2 　　// the limit has not been reached; there is still room for compression
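The entropy value can be verified numerically. This is a small sketch using `math.log2`; the helper name `entropy_bits` is illustrative.

```python
import math


def entropy_bits(freqs):
    """Shannon entropy in bits per symbol for a list of frequencies."""
    total = sum(freqs)
    return -sum(f / total * math.log2(f / total) for f in freqs if f > 0)


H = entropy_bits([30, 30, 20, 10, 10])
print(round(H, 2))  # 2.17
```

Comparing this with the average Huffman code length of 2.2 shows the gap of roughly 0.03 bits per symbol.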

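Problem 2

The more skewed the frequencies, the worse Huffman coding performs relative to the entropy bound, because every symbol still costs at least 1 bit.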

L = (100000*1 + ...)/100010
≈ 1

H = 0.9999 log 1.0001 + 0.00006 log 16668.333
+ ... + 1/100010 log 100010
≈ 0.00

Adaptive Huffman

Problems of Static coding
• Needs statistics, and they are static: e.g., an extra pass over the data just to collect the statistics, which then stay unchanged during encoding
• To decode, the statistics table needs to be transmitted. The table size can be significant for small messages.
　=> Adaptive compression, e.g., adaptive Huffman

Two stages:

• FGK Algorithm
• Vitter's Invariant

FGK Algorithm.

Video: https://www.youtube.com/watch?v=N5pw_Z-oP-4

Rule:

Within the same block, the index of an internal node is always greater than the index of a leaf node. <-- Vitter's Invariant

A node with a larger weight also has a larger index.
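The two parts of the Rule can be expressed as a small check over the node list. This is only a sketch under assumptions: each node is modeled as a hypothetical `(index, weight, is_leaf)` tuple, a block is taken to mean the nodes sharing one weight, and the example indexes are made up rather than taken from the walkthrough.

```python
def satisfies_rule(nodes):
    """nodes: iterable of (index, weight, is_leaf) tuples with unique indexes."""
    ordered = sorted(nodes)  # ascending index
    for (_, w1, leaf1), (_, w2, leaf2) in zip(ordered, ordered[1:]):
        if w1 > w2:
            return False  # a larger weight must have a larger index
        if w1 == w2 and not leaf1 and leaf2:
            return False  # same block: an internal node sits before a leaf
    return True


ok = satisfies_rule([(1, 1, True), (2, 1, False), (3, 2, True)])
leaf_after_internal = satisfies_rule([(1, 1, False), (2, 1, True)])
weight_out_of_order = satisfies_rule([(1, 2, True), (2, 1, False)])
print(ok, leaf_after_internal, weight_out_of_order)
```

Checking adjacent pairs is enough here: once weights are non-decreasing, each block is contiguous, and any internal node placed before a leaf in a block produces at least one adjacent internal-then-leaf pair.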

Operation:

Operation 1: Leaf node: move first, then update.

Operation 2: Internal node: update first, then move.

NYT (Not Yet Transmitted) node = null node.

Stream: abcbaaa

a = 0110 0001

b = 0110 0010

c = 0110 0011
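The 8-bit codes above are simply the ASCII values of the characters written in binary. A quick sketch to reproduce them (the helper name `ascii_bits` is illustrative):

```python
def ascii_bits(ch):
    """8-bit ASCII code of a character, grouped as two nibbles."""
    bits = format(ord(ch), "08b")
    return bits[:4] + " " + bits[4:]


for ch in "abc":
    print(ch, "=", ascii_bits(ch))
```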

Step 1

[0110 0001]

This indicates that the insertion position is the left branch.

Step 2

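After inserting b, the tree looks as follows.

Next, "Operation 2" has to be applied.

The former NYT node now has weight 1 (the sum of its children's values), and the numbering is filled in. (update)

Now consider the move operation: draw the block, i.e., all nodes with weight 1.

The goal: within a block, all leaves come before all internal nodes.

Is this requirement met right now? Clearly not: the internal node 1 and the leaf a below are out of order.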

252	253	254	255	256
NYT	b		a	1

So how do we move? Swap node 254 in the figure above, together with its subtree, with the conflicting node 255.

As can be seen, the Rule is satisfied again.

0110 0001]

Step 3

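Next, insert c, again at the NYT position.

The result of the insertion:

Next, again update the internal node first.

The goal: within a block, all leaves come before all internal nodes.

Is this requirement met right now? Clearly not: the internal node 1 and the leaves b and a below are out of order.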

251	252	253	254	255
c		b	a	1

Start the move: shift 253 and 254 to the left to make room for the internal node 1 currently at 252.

As can be seen, the Rule is satisfied again.

0110 0001] 0110 0011]

Step 4

The 4th symbol is b. b already exists, so it is added to the existing node b.

0110 0001]

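This time, update the leaf node first, which means incrementing the count of node b.

The first part of the Rule is satisfied here: leaf nodes have the smaller indexes.

But the second part of the Rule is not: nodes with larger weights must have larger indexes.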


c(1)	b(2)	a(1)	(1)

So move b to the last position, 254, as follows.

As can be seen, the Rule is satisfied again.

0110 0001] ]

Step 5

The 5th symbol is a. a already exists, so it is added to the existing node a.

0110 0001] ] 10

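The first part of the Rule is satisfied here: leaf nodes have the smaller indexes.

But the second part of the Rule is not: nodes with larger weights must have larger indexes.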

251	252	253
c(1)	a(2)	(1)

After swapping 252 with 253 and its subtree:

As can be seen, the Rule is satisfied again.

0110 0001] 0110 0011]] ]

Step 6

The 6th symbol is a. a already exists, so it is added to the existing node a.

0110 0001] 0110 0011] ] ]

Now the weight of a at 253 becomes 3, and by the Rule a larger weight needs a larger index.

So 253:a should come after 254:b. After the swap:

0110 0001] 0110 0011] ] ] ]

Step 7

The 7th symbol is a. a already exists, so it is added to the existing node a.

0110 0001] 0110 0011] ] ]

Now the weight of a at 254 becomes 4, and by the Rule a larger weight needs a larger index.

So 254:a should come after 255. After the swap:

0110 0001] 0110 0011] ] ] ]