Summary:

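1. ASCII and GBK are not variable-length codes in the sense discussed here: ASCII is a fixed-length encoding, and GBK is a variable-width character encoding (one or two bytes per character) whose code lengths are not chosen from symbol probabilities;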

2. Data compression;

Example:

aabacdab → 00100110111010 → |0|0|10|0|110|111|0|10| → aabacdab

3. Variable-length coding:

A mapping from source symbols to codewords of varying bit length;

https://en.wikipedia.org/wiki/Variable-length_code

https://baike.baidu.com/item/变长编码表

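A variable-length code table is built by estimating how likely each source symbol is to occur: symbols with a high probability get shorter codewords, and symbols with a low probability get longer ones. This lowers the average (expected) length of the encoded string, which is how lossless data compression is achieved.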
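
The source does not say which method it uses to build such a table. One common choice is Huffman coding, which the article below also names. The following is a minimal sketch under that assumption; the function name `huffman_code_lengths` is hypothetical, and symbol probabilities are estimated from the counts in the input text.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: codeword length in bits} for a Huffman code built from text."""
    freq = Counter(text)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freq}
    # Heap entries: (weight, tie_breaker, {symbol: depth in the tree so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

lengths = huffman_code_lengths("aabacdab")
total_bits = sum(lengths[ch] for ch in "aabacdab")
print(lengths, total_bits)
```

For the string `aabacdab` used in the summary, the frequent symbol `a` ends up with the shortest codeword and the rare symbols `c` and `d` with the longest.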

Variable-length code

From Wikipedia, the free encyclopedia


This article is about the transmission of data across noisy channels. For the storage of text in computers, see Variable-width encoding.

In coding theory a variable-length code is a code which maps source symbols to a variable number of bits.

Variable-length codes can allow sources to be compressed and decompressed with zero error (lossless data compression) and still be read back symbol by symbol. With the right coding strategy an independent and identically-distributed source may be compressed almost arbitrarily close to its entropy. This is in contrast to fixed length coding methods, for which data compression is only possible for large blocks of data, and any compression beyond the logarithm of the total number of possibilities comes with a finite (though perhaps arbitrarily small) probability of failure.

Some examples of well-known variable-length coding strategies are Huffman coding, Lempel–Ziv coding and arithmetic coding.

Codes and their extensions

The extension of a code is the mapping of finite length source sequences to finite length bit strings, that is obtained by concatenating for each symbol of the source sequence the corresponding codeword produced by the original code.

Using terms from formal language theory, the precise mathematical definition is as follows: Let $S$ and $T$ be two finite sets, called the source and target alphabets, respectively. A code $C: S \to T^{*}$ is a total function mapping each symbol from $S$ to a sequence of symbols over $T$, and the extension of $C$ to a homomorphism of $S^{*}$ into $T^{*}$, which naturally maps each sequence of source symbols to a sequence of target symbols, is referred to as its extension.
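
The definition above can be illustrated with a short sketch. Here a code is assumed to be a Python dict from source symbols to bit strings, and `extend` is a hypothetical helper name; the example code table is the prefix code used elsewhere in this article.

```python
def extend(code, source_sequence):
    """Extension of a code: concatenate the codeword of every source symbol."""
    return "".join(code[symbol] for symbol in source_sequence)

# Example code C: S -> T*, with S = {a, b, c, d} and T = {0, 1}.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

encoded = extend(code, "aabacdab")
# Homomorphism property: encoding a concatenation equals
# concatenating the encodings of the parts.
left, right = "aaba", "cdab"
is_homomorphic = extend(code, left + right) == extend(code, left) + extend(code, right)
print(encoded, is_homomorphic)
```

The empty source sequence maps to the empty bit string, which is what a homomorphism requires.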

Classes of variable-length codes

Variable-length codes can be strictly nested in order of decreasing generality as non-singular codes, uniquely decodable codes and prefix codes. Prefix codes are always uniquely decodable, and these in turn are always non-singular.

Non-singular codes

A code is non-singular if each source symbol is mapped to a different non-empty bit string, i.e. the mapping from source symbols to bit strings is injective.

For example, the mapping $M_{1} = \{\, a \mapsto 0,\ b \mapsto 0,\ c \mapsto 1 \,\}$ is not non-singular because both "a" and "b" map to the same bit string "0"; any extension of this mapping will generate a lossy (non-lossless) coding. Such singular coding may still be useful when some loss of information is acceptable (for example when such code is used in audio or video compression, where a lossy coding becomes equivalent to source quantization).
However, the mapping $M_{2} = \{\, a \mapsto 1,\ b \mapsto 011,\ c \mapsto 01110,\ d \mapsto 1110,\ e \mapsto 10011,\ f \mapsto 0 \,\}$ is non-singular; its extension will generate a lossless coding, which will be useful for general data transmission (but this feature is not always required). Note that it is not necessary for the non-singular code to be more compact than the source (and in many applications, a larger code is useful, for example as a way to detect and/or recover from encoding or transmission errors, or in security applications to protect a source from undetectable tampering).
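
As a small sketch, non-singularity can be checked by testing that the mapping is injective and that no codeword is empty. The dict representation and the function name `is_non_singular` are assumptions made for illustration.

```python
def is_non_singular(code):
    """True if every symbol maps to a distinct, non-empty bit string."""
    words = list(code.values())
    return all(words) and len(set(words)) == len(words)

M1 = {"a": "0", "b": "0", "c": "1"}
M2 = {"a": "1", "b": "011", "c": "01110", "d": "1110", "e": "10011", "f": "0"}

print(is_non_singular(M1))  # "a" and "b" share the codeword "0"
print(is_non_singular(M2))
```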

Uniquely decodable codes

A code is uniquely decodable if its extension is non-singular (see above). Whether a given code is uniquely decodable can be decided with the Sardinas–Patterson algorithm.

The mapping $M_{3} = \{\, a \mapsto 0,\ b \mapsto 01,\ c \mapsto 011 \,\}$ is uniquely decodable (this can be demonstrated by looking at the follow-set after each target bit string in the map, because each bitstring is terminated as soon as we see a 0 bit which cannot follow any existing code to create a longer valid code in the map, but unambiguously starts a new code).
Consider again the code $M_{2}$ from the previous section. This code, which is based on an example found in Berstel et al. [1], is not uniquely decodable, since the string 011101110011 can be interpreted as the sequence of codewords 01110 – 1110 – 011, but also as the sequence of codewords 011 – 1 – 011 – 10011. Two possible decodings of this encoded string are thus given by cdb and babe. However, such a code is useful when the set of all possible source symbols is completely known and finite, or when there are restrictions (for example a formal syntax) that determine if source elements of this extension are acceptable. Such restrictions permit the decoding of the original message by checking which of the possible source symbols mapped to the same symbol are valid under those restrictions.
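
The Sardinas–Patterson test mentioned above can be sketched as follows. This is an illustrative implementation, not taken from the source: it repeatedly computes sets of "dangling suffixes" and reports the code as not uniquely decodable as soon as one of those suffixes is itself a codeword. The helper names are hypothetical.

```python
def left_quotient(prefixes, words):
    """Return {y : x + y is in words for some x in prefixes}."""
    return {w[len(p):] for p in prefixes for w in words if w.startswith(p)}

def is_uniquely_decodable(codewords):
    words = list(codewords)
    C = set(words)
    if "" in C or len(C) != len(words):
        return False  # singular codes cannot be uniquely decodable
    # S_1: non-empty suffixes left over when one codeword is a prefix of another.
    current = left_quotient(C, C) - {""}
    seen = set()
    while current:
        if current & C:
            return False  # a dangling suffix is a codeword: two parsings exist
        frozen = frozenset(current)
        if frozen in seen:
            return True   # the sets started repeating without a conflict
        seen.add(frozen)
        current = left_quotient(C, current) | left_quotient(current, C)
    return True           # no dangling suffixes remain

M2 = ["1", "011", "01110", "1110", "10011", "0"]
M3 = ["0", "01", "011"]
print(is_uniquely_decodable(M2), is_uniquely_decodable(M3))
```

Every set consists of suffixes of codewords, so only finitely many sets are possible and the loop always terminates.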

Prefix codes

Main article: Prefix code

A code is a prefix code if no target bit string in the mapping is a prefix of the target bit string of a different source symbol in the same mapping. This means that symbols can be decoded instantaneously after their entire codeword is received. Other commonly used names for this concept are prefix-free code, instantaneous code, or context-free code.

The example mapping $M_{3}$ in the previous paragraph is not a prefix code because we don't know after reading the bit string "0" if it encodes an "a" source symbol, or if it is the prefix of the encodings of the "b" or "c" symbols.
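
The prefix condition is simple to test directly. The sketch below uses a plain pairwise comparison and a hypothetical function name `is_prefix_code`; it is meant as an illustration rather than an efficient implementation.

```python
def is_prefix_code(codewords):
    """True if no codeword is a prefix of a different codeword."""
    words = set(codewords)
    return not any(a != b and b.startswith(a) for a in words for b in words)

M3 = ["0", "01", "011"]
print(is_prefix_code(M3))                        # "0" is a prefix of "01" and "011"
print(is_prefix_code(["0", "10", "110", "111"]))
```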
An example of a prefix code is shown below.

Symbol	Codeword
a	0
b	10
c	110
d	111

Example of encoding and decoding:

aabacdab → 00100110111010 → |0|0|10|0|110|111|0|10| → aabacdab
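
The round trip above can be reproduced with a short sketch. Because the code is a prefix code, the decoder can emit a symbol as soon as the bits read so far match a codeword. The names `encode` and `decode` are hypothetical, and bits are represented as a string of "0" and "1" characters for readability.

```python
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}

def encode(text, code=CODE):
    """Concatenate the codeword of each symbol."""
    return "".join(code[symbol] for symbol in text)

def decode(bits, code=CODE):
    """Decode a bit string produced by a prefix code, symbol by symbol."""
    inverse = {word: symbol for symbol, word in code.items()}
    symbols, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:          # a complete codeword has been read
            symbols.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(symbols)

bits = encode("aabacdab")
print(bits, decode(bits))
```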

A special case of prefix codes are block codes. Here all codewords must have the same length. The latter are not very useful in the context of source coding, but often serve as error correcting codes in the context of channel coding.

Another special case of prefix codes are variable-length quantity codes, which encode arbitrarily large integers as a sequence of octets, i.e., the length of every codeword is a multiple of 8 bits.
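
The text does not spell out a byte layout for variable-length quantities. The sketch below assumes the common convention of 7 payload bits per octet, most significant group first, with the high bit set on every octet except the last. The function names are hypothetical.

```python
def vlq_encode(n):
    """Encode a non-negative integer as a variable-length quantity."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    groups = [n & 0x7F]                    # last octet: continuation bit clear
    n >>= 7
    while n:
        groups.append((n & 0x7F) | 0x80)   # earlier octets: continuation bit set
        n >>= 7
    return bytes(reversed(groups))

def vlq_decode(data):
    """Decode bytes that hold exactly one variable-length quantity."""
    n = 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
    return n

for value in (0, 127, 128, 16383, 2**40):
    print(value, vlq_encode(value).hex(), vlq_decode(vlq_encode(value)))
```

Under this layout no encoded value is a prefix of another, since only the final octet of a codeword has its high bit clear.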

Advantages

The advantage of a variable-length code is that unlikely source symbols can be assigned longer codewords and likely source symbols can be assigned shorter codewords, thus giving a low expected codeword length. For the above example, if the probabilities of (a, b, c, d) were $\left(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{8}\right)$, the expected number of bits used to represent a source symbol using the code above would be:

$$1 \times \frac{1}{2} + 2 \times \frac{1}{4} + 3 \times \frac{1}{8} + 3 \times \frac{1}{8} = \frac{7}{4}$$

As the entropy of this source is also 1.75 bits per symbol, this code compresses the source as much as possible while still allowing the source to be recovered with zero error.
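
The arithmetic can be checked with a few lines. The probabilities and codewords are the ones from the example; the variable names are chosen only for illustration.

```python
from math import log2

probs = {"a": 1 / 2, "b": 1 / 4, "c": 1 / 8, "d": 1 / 8}
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Expected codeword length: sum over symbols of p(s) * len(codeword(s)).
expected_length = sum(p * len(code[s]) for s, p in probs.items())

# Shannon entropy of the source in bits per symbol: -sum p(s) * log2 p(s).
entropy = -sum(p * log2(p) for p in probs.values())

print(expected_length, entropy)
```

The two values coincide here because every probability is a power of 1/2, so each codeword length equals $-\log_2 p$ exactly.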

Notes

[1] Berstel et al. (2009), Example 2.3.1, p. 63

References

Berstel, Jean; Perrin, Dominique; Reutenauer, Christophe (2010). Codes and automata. Encyclopedia of Mathematics and its Applications. 129. Cambridge: Cambridge University Press. ISBN 978-0-521-88831-8. Zbl 1187.94001.

