前言

  最近项目上有大量的字符串数据需要存储到内存,并且需要储存至一定时间,于是自然而然的想到了使用字符串压缩算法对“源串”进行压缩存储。由此触发了对一些优秀压缩算法的调研。
  字符串压缩,我们通常的需求有几个,一是高压缩率,二是压缩速率高,三是解压速率高。不过高压缩率与高压缩速率是鱼和熊掌的关系,不可皆得,优秀的算法一般也是采用压缩率与性能折中的方案。从压缩率、压缩速率、解压速率考虑,zstd与lz4有较好的压缩与解压性能,最终选取zstd与lz4进行调研。

  zstd是facebook开源的提供高压缩比的快速压缩算法(参考https://github.com/facebook/zstd),很想了解一下它在压缩与解压方面的实际表现。

一、zstd压缩与解压

  ZSTD_compress属于ZSTD的Simple API范畴,只有压缩级别可以设置。

  ZSTD_compress函数原型如下:

  size_t ZSTD_compress(void* dst, size_t dstCapacity, const void* src, size_t srcSize, int compressionLevel)

  ZSTD_decompress函数原型如下:
  size_t ZSTD_decompress( void* dst, size_t dstCapacity, const void* src, size_t compressedSize);

  我们先来看看zstd的压缩与解压缩示例。

 1 #include <stdio.h>
2 #include <string.h>
3 #include <sys/time.h>
4 #include <malloc.h>
5 #include <zstd.h>
6 #include <iostream>
7
8 using namespace std;
9
10 int main()
11 {
12 // compress
13 size_t com_space_size;
14 size_t peppa_pig_text_size;
15
16 char *com_ptr = NULL;
17 char peppa_pig_buf[2048] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining. Can we go out to play?Daddy: Alright, run along you two.Narrator: Peppa loves jumping in muddy puddles.Peppa: I love muddy puddles.Mummy: Peppa. If you jumping in muddy puddles, you must wear your boots.Peppa: Sorry, Mummy.Narrator: George likes to jump in muddy puddles, too.Peppa: George. If you jump in muddy puddles, you must wear your boots.Narrator: Peppa likes to look after her little brother, George.Peppa: George, let's find some more pud dles.Narrator: Peppa and George are having a lot of fun. Peppa has found a lttle puddle. George hasfound a big puddle.Peppa: Look, George. There's a really big puddle.Narrator: George wants to jump into the big puddle first.Peppa: Stop, George. | must check if it's safe for you. Good. It is safe for you. Sorry, George. It'sonly mud.Narrator: Peppa and George love jumping in muddy puddles.Peppa: Come on, George. Let's go and show Daddy.Daddy: Goodness me.Peppa: Daddy. Daddy. Guess what we' ve been doing.Daddy: Let me think... Have you been wa tching television?Peppa: No. No. Daddy.Daddy: Have you just had a bath?Peppa: No. No.Daddy: | know. You've been jumping in muddy puddles.Peppa: Yes. Yes. Daddy. We've been jumping in muddy puddles.Daddy: Ho. Ho. And look at the mess you're in.Peppa: Oooh....Daddy: Oh, well, it's only mud. Let's clean up quickly before Mummy sees the mess.Peppa: Daddy, when we've cleaned up, will you and Mummy Come and play, too?Daddy: Yes, we can all play in the garden.Narrator: Peppa and George are wearing their boots. Mummy and Daddy are wearing their boots.Peppa loves jumping up and down in muddy puddles. Everyone loves jumping up and down inmuddy puddles.Mummy: Oh, Daddy pig, look at the mess you're in. .Peppa: It's only mud.";
18
19 peppa_pig_text_size = strlen(peppa_pig_buf);
20 com_space_size= ZSTD_compressBound(peppa_pig_text_size);
21 com_ptr = (char *)malloc(com_space_size);
22 if(NULL == com_ptr) {
23 cout << "compress malloc failed" << endl;
24 return -1;
25 }
26
27 size_t com_size;
28 com_size = ZSTD_compress(com_ptr, com_space_size, peppa_pig_buf, peppa_pig_text_size, ZSTD_fast);
29 cout << "peppa pig text size:" << peppa_pig_text_size << endl;
30 cout << "compress text size:" << com_size << endl;
31 cout << "compress ratio:" << (float)peppa_pig_text_size / (float)com_size << endl << endl;
32
33
34 // decompress
35 char* decom_ptr = NULL;
36 unsigned long long decom_buf_size;
37 decom_buf_size = ZSTD_getFrameContentSize(com_ptr, com_size);
38
39 decom_ptr = (char *)malloc((size_t)decom_buf_size);
40 if(NULL == decom_ptr) {
41 cout << "decompress malloc failed" << endl;
42 return -1;
43 }
44
45 size_t decom_size;
46 decom_size = ZSTD_decompress(decom_ptr, decom_buf_size, com_ptr, com_size);
47 cout << "decompress text size:" << decom_size << endl;
48
49 if(strncmp(peppa_pig_buf, decom_ptr, peppa_pig_text_size)) {
50 cout << "decompress text is not equal peppa pig text" << endl;
51 }
52
53 free(com_ptr);
54 free(decom_ptr);
55 return 0;
56 }

执行结果:

  从结果可以发现,压缩之前的peppa pig文本长度为1827,压缩后的文本长度为759,压缩率为2.4,解压后的长度与压缩前相等。

  另外,上文提到可以调整ZSTD_compress函数的压缩级别,zstd的默认级别为ZSTD_CLEVEL_DEFAULT = 3,最小值为0,最大值为ZSTD_MAX_CLEVEL = 22。另外也提供一些策略设置,例如 ZSTD_fast, ZSTD_greedy, ZSTD_lazy, ZSTD_lazy2, ZSTD_btlazy2。压缩级别越高,压缩率越高,但是压缩速率越低。

二、ZSTD压缩与解压性能探索

  上面探索了zstd的基础压缩与解压方法,接下来再摸索一下zstd的压缩与解压缩性能。

  测试方法是,使用ZSTD_compress连续压缩同一段文本并持续10秒,最后得到每一秒的平均压缩速率。测试压缩性能的代码示例如下:

 1 #include <stdio.h>
2 #include <string.h>
3 #include <sys/time.h>
4 #include <malloc.h>
5 #include <zstd.h>
6 #include <iostream>
7
8 using namespace std;
9
10 int main()
11 {
12 int cnt = 0;
13
14 size_t com_size;
15 size_t com_space_size;
16 size_t peppa_pig_text_size;
17
18 char *com_ptr = NULL;
19 char peppa_pig_buf[2048] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining. Can we go out to play?Daddy: Alright, run along you two.Narrator: Peppa loves jumping in muddy puddles.Peppa: I love muddy puddles.Mummy: Peppa. If you jumping in muddy puddles, you must wear your boots.Peppa: Sorry, Mummy.Narrator: George likes to jump in muddy puddles, too.Peppa: George. If you jump in muddy puddles, you must wear your boots.Narrator: Peppa likes to look after her little brother, George.Peppa: George, let's find some more pud dles.Narrator: Peppa and George are having a lot of fun. Peppa has found a lttle puddle. George hasfound a big puddle.Peppa: Look, George. There's a really big puddle.Narrator: George wants to jump into the big puddle first.Peppa: Stop, George. | must check if it's safe for you. Good. It is safe for you. Sorry, George. It'sonly mud.Narrator: Peppa and George love jumping in muddy puddles.Peppa: Come on, George. Let's go and show Daddy.Daddy: Goodness me.Peppa: Daddy. Daddy. Guess what we' ve been doing.Daddy: Let me think... Have you been wa tching television?Peppa: No. No. Daddy.Daddy: Have you just had a bath?Peppa: No. No.Daddy: | know. You've been jumping in muddy puddles.Peppa: Yes. Yes. Daddy. We've been jumping in muddy puddles.Daddy: Ho. Ho. And look at the mess you're in.Peppa: Oooh....Daddy: Oh, well, it's only mud. Let's clean up quickly before Mummy sees the mess.Peppa: Daddy, when we've cleaned up, will you and Mummy Come and play, too?Daddy: Yes, we can all play in the garden.Narrator: Peppa and George are wearing their boots. Mummy and Daddy are wearing their boots.Peppa loves jumping up and down in muddy puddles. Everyone loves jumping up and down inmuddy puddles.Mummy: Oh, Daddy pig, look at the mess you're in. .Peppa: It's only mud.";
20
21 timeval st, et;
22
23 peppa_pig_text_size = strlen(peppa_pig_buf);
24 com_space_size= ZSTD_compressBound(peppa_pig_text_size);
25
26 gettimeofday(&st, NULL);
27 while(1) {
28
29 com_ptr = (char *)malloc(com_space_size);
30 com_size = ZSTD_compress(com_ptr, com_space_size, peppa_pig_buf, peppa_pig_text_size, ZSTD_fast);
31
32 free(com_ptr);
33 cnt++;
34
35 gettimeofday(&et, NULL);
36 if(et.tv_sec - st.tv_sec >= 10) {
37 break;
38 }
39 }
40
41 cout << "comparess per second:" << cnt/10 << " times" << endl;
42 return 0;
43 }

执行结果:

  结果显示ZSTD的压缩性能大概在每秒6-7万次左右,这个结果其实并不是太理想。需要说明的是压缩性能与待压缩文本的长度、字符内容也是有关系的。

  我们再来探索一下ZSTD的解压缩性能。与上面的测试方法类似,先对本文进行压缩,然后连续解压同一段被压缩过的数据并持续10秒,最后得到每一秒的平均解压速率。测试解压性能的代码示例如下:

 1 #include <stdio.h>
2 #include <string.h>
3 #include <sys/time.h>
4 #include <malloc.h>
5 #include <zstd.h>
6 #include <iostream>
7
8 using namespace std;
9
10 int main()
11 {
12 int cnt = 0;
13
14 size_t com_size;
15 size_t com_space_size;
16 size_t peppa_pig_text_size;
17
18 timeval st, et;
19
20 char *com_ptr = NULL;
21 char peppa_pig_buf[2048] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining. Can we go out to play?Daddy: Alright, run along you two.Narrator: Peppa loves jumping in muddy puddles.Peppa: I love muddy puddles.Mummy: Peppa. If you jumping in muddy puddles, you must wear your boots.Peppa: Sorry, Mummy.Narrator: George likes to jump in muddy puddles, too.Peppa: George. If you jump in muddy puddles, you must wear your boots.Narrator: Peppa likes to look after her little brother, George.Peppa: George, let's find some more pud dles.Narrator: Peppa and George are having a lot of fun. Peppa has found a lttle puddle. George hasfound a big puddle.Peppa: Look, George. There's a really big puddle.Narrator: George wants to jump into the big puddle first.Peppa: Stop, George. | must check if it's safe for you. Good. It is safe for you. Sorry, George. It'sonly mud.Narrator: Peppa and George love jumping in muddy puddles.Peppa: Come on, George. Let's go and show Daddy.Daddy: Goodness me.Peppa: Daddy. Daddy. Guess what we' ve been doing.Daddy: Let me think... Have you been wa tching television?Peppa: No. No. Daddy.Daddy: Have you just had a bath?Peppa: No. No.Daddy: | know. You've been jumping in muddy puddles.Peppa: Yes. Yes. Daddy. We've been jumping in muddy puddles.Daddy: Ho. Ho. And look at the mess you're in.Peppa: Oooh....Daddy: Oh, well, it's only mud. Let's clean up quickly before Mummy sees the mess.Peppa: Daddy, when we've cleaned up, will you and Mummy Come and play, too?Daddy: Yes, we can all play in the garden.Narrator: Peppa and George are wearing their boots. Mummy and Daddy are wearing their boots.Peppa loves jumping up and down in muddy puddles. Everyone loves jumping up and down inmuddy puddles.Mummy: Oh, Daddy pig, look at the mess you're in. .Peppa: It's only mud.";
22
23 size_t decom_size;
24 char* decom_ptr = NULL;
25 unsigned long long decom_buf_size;
26
27 peppa_pig_text_size = strlen(peppa_pig_buf);
28 com_space_size= ZSTD_compressBound(peppa_pig_text_size);
29 com_ptr = (char *)malloc(com_space_size);
30
31 com_size = ZSTD_compress(com_ptr, com_space_size, peppa_pig_buf, peppa_pig_text_size, 1);
32
33 gettimeofday(&st, NULL);
34 decom_buf_size = ZSTD_getFrameContentSize(com_ptr, com_size);
35
36 while(1) {
37
38 decom_ptr = (char *)malloc((size_t)decom_buf_size);
39
40 decom_size = ZSTD_decompress(decom_ptr, decom_buf_size, com_ptr, com_size);
41 if(decom_size != peppa_pig_text_size) {
42
43 cout << "decompress error" << endl;
44 break;
45 }
46
47 free(decom_ptr);
48
49 cnt++;
50 gettimeofday(&et, NULL);
51 if(et.tv_sec - st.tv_sec >= 10) {
52 break;
53 }
54 }
55
56 cout << "decompress per second:" << cnt/10 << " times" << endl;
57
58 free(com_ptr);
59 return 0;
60 }

执行结果:

  结果显示ZSTD的解压缩性能大概在每秒12万次左右,解压性能比压缩性能高。

三、zstd的高级用法

  zstd提供了一个名为PZSTD的压缩和解压工具。PZSTD(parallel zstd),并行压缩的zstd,是一个使用多线程对待压缩文本进行切片分段,且进行并行压缩的命令行工具。

  其实高版本(v1.4.0及以上)的zstd也提供了指定多线程对文本进行并行压缩的相关API接口,也就是本小节要介绍的zstd高级API用法。下面我们再来探索一下zstd的多线程压缩使用方法。

  多线程并行压缩的两个关键API,一个是参数设置API,另一个是压缩API。

  参数设置API的原型是:

  size_t ZSTD_CCtx_setParameter(ZSTD_CCtx* cctx, ZSTD_cParameter param, int value)

  压缩API的原型是:

  size_t ZSTD_compress2(ZSTD_CCtx* cctx, void* dst, size_t dstCapacity, const void* src, size_t srcSize)

  下面给出zstd并行压缩的示例demo,通过ZSTD_CCtx_setParameter设置线程数为3,即指定宏ZSTD_c_nbWorkers为3,通过ZSTD_compress2压缩相关文本。另外,为了展示zstd确实使用了多线程,需要先读取一个非常大的文件,作为zstd的压缩文本源,尽量使zstd运行较长时间。

 1 #include <stdio.h>
2 #include <string.h>
3 #include <sys/time.h>
4 #include <malloc.h>
5 #include <zstd.h>
6 #include <iostream>
7
8 using namespace std;
9
10 int main()
11 {
12 size_t com_size;
13 size_t com_space_size;
14
15 FILE *fp = NULL;
16 unsigned int file_len;
17
18 char *com_ptr = NULL;
19 char *file_text_ptr = NULL;
20
21 fp = fopen("xxxxxx", "r");
22 if(NULL == fp){
23 cout << "file open failed" << endl;
24 return -1;
25 }
26
27 fseek(fp, 0, SEEK_END);
28 file_len = ftell(fp);
29 fseek(fp, 0, SEEK_SET);
30 cout << "file length:" << file_len << endl;
31
32 // malloc space for file content
33 file_text_ptr = (char *)malloc(file_len);
34 if(NULL == file_text_ptr) {
35 cout << "malloc failed" << endl;
36 return -1;
37 }
38
39 // malloc space for compress space
40 com_space_size = ZSTD_compressBound(file_len);
41 com_ptr = (char *)malloc(com_space_size);
42 if(NULL == com_ptr) {
43 cout << "malloc failed" << endl;
44 return -1;
45 }
46
47 // read text from source file
48 fread(file_text_ptr, 1, file_len, fp);
49 fclose(fp);
50
51 ZSTD_CCtx* cctx;
52 cctx = ZSTD_createCCtx();
53
54 // set multi-thread parameter
55 ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 3);
56 ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, ZSTD_btlazy2);
57
58 com_size = ZSTD_compress2(cctx, com_ptr, com_space_size, file_text_ptr, file_len);
59
60 free(com_ptr);
61 free(file_text_ptr);
62 return 0;
63 }

  运行上述demo,可见zstd确实启动了3个线程对文本进行了并行压缩。且设置的线程数越多,压缩时间越短,这里就不详细展示了,读者可以自行实验。

  需要说明的是,zstd当前默认编译单线程的库文件,要实现多线程的API调用,需要在make的时候指定编译参数ZSTD_MULTITHREAD。

  

  另外,zstd还支持线程池的方式,线程池的函数原型:

  POOL_ctx* ZSTD_createThreadPool(size_t numThreads)

  线程池可以避免在多次、连续压缩场景时频繁的去创建线程、撤销线程产生的非必要开销,使得算力主要开销在文本压缩方面。

四、总结

  本篇分享了zstd压缩与解压缩使用的基本方法,对压缩与解压的性能进行了摸底,最后探索了zstd多线程压缩的使用方法。

  从压缩测试来看,zstd的压缩比其实已经比较好了,比原文所占用空间缩小了一半以上,当然压缩比也跟待压缩文本的内容有关。

  从性能执行结果来看,zstd的压缩与解压性能表现比较勉强,我认为zstd在鱼(性能)和熊掌(压缩比)之间更偏向熊掌一些,不过对一些性能要求不太高的,但是要高压缩比的场景是比较符合的。

  多线程并行压缩,在有大文本需要连续多次压缩的场景下,结合线程池可以很好的提升压缩速率。

  下一篇介绍调研的LZ4压缩算法。码字不易,还请各位技术爱好者登录点个赞呀!

字符串压缩(一)之ZSTD的更多相关文章

  1. 字符串压缩(二)之LZ4

    本文来自博客园,作者:T-BARBARIANS,转载请注明原文链接:https://www.cnblogs.com/t-bar/p/16451185.html 谢谢! 上一篇对google精品ZSTD ...

  2. 基于Zlib算法的流压缩、字符串压缩源码

    原文:基于Zlib算法的流压缩.字符串压缩源码 Zlib.net官方源码demo中提供了压缩文件的源码算法.处于项目研发的需要,我需要对内存流进行压缩,由于zlib.net并无相关文字帮助只能自己看源 ...

  3. php字符串压缩

    在PHP中偶尔遇到字符串的压缩,比如一个长字符串,数据库开始设计的字段存不下,但是又不想改数据库字段存储长度,就可以用压缩的方式降低数据字段字符串的长度数量级,把几百个字符的字符串压缩到几十个字符.总 ...

  4. 字符串压缩 stringZip

    1,题目描述 通过键盘输入一串小写字母(a~z)组成的字符串.请编写一个字符串压缩程序,将字符串中连续出席的重复字母进行压缩,并输出压缩后的字符串.压缩规则:1. 仅压缩连续重复出现的字符.比如字符串 ...

  5. ruby直接字符串压缩与解压缩

    ruby2.1.3的核心类中包含了Zlib库,其中的Zlib模块包含了对字符串压缩和解压的方法: irb(main):180:0> Zlib.class => Module irb(mai ...

  6. 记录新项目中遇到的技术及自己忘记的技术点【DES加密解密,MD5加密,字符串压缩、解压,字符串截取等操作】

    一.DES加密.解密 #region DES加密解密 /// <summary> /// 进行DES加密 /// </summary> /// <param name=& ...

  7. C# 使用GZip对字符串压缩和解压

    using System; using System.Collections.Generic; using System.Linq; using System.Text; using System.I ...

  8. PAT 1078 字符串压缩与解压(20)(代码+思路)

    1078 字符串压缩与解压(20 分) 文本压缩有很多种方法,这里我们只考虑最简单的一种:把由相同字符组成的一个连续的片段用这个字符和片段中含有这个字符的个数来表示.例如 ccccc 就用 5c 来表 ...

  9. CC05:基本字符串压缩

    题目 利用字符重复出现的次数,编写一个方法,实现基本的字符串压缩功能.比如,字符串"aabcccccaaa"经压缩会变成"a2b1c5a3".若压缩后的字符串没 ...

随机推荐

  1. 根据SVG Arc求出其开始角、摆动角和椭圆圆心

    SVG Arc 目前Svg的Arc的参数字符串如下: a rx ry x-axis-rotation large-arc-flag sweep-flag x y 除了a表示标识为Arc之外,其余参数说 ...

  2. Python实现查询12306火车票信息

    例子来源于马哥的公众号,看了几遍,有些地方存在些疑问,然后就自己查找些资料,重写的一下,但是对于获取到的信息,并不能有效的解析出来,而且对于中文字符处理,并不是很好,请大神指教下!谢过! 1.接口设置 ...

  3. Druid未授权访问实战利用

    Druid未授权访问实战利用 ​ 最近身边的同学都开始挖src了,而且身边接触到的挖src的网友也是越来越多.作者也是在前几天开始了挖src之路.惊喜又遗憾的是第一次挖src就挖到了一家互联网公司的R ...

  4. Clang-Format 个人常用配置

    Clang-Format 个人常用配置 本文记录 Clang-Format 个人常用配置. 欲了解更多配置选项,可查阅 官方文档. BasedOnStyle: Google AccessModifie ...

  5. ☕【Java深层系列】「并发编程系列」深入分析和研究MappedByteBuffer的实现原理和开发指南

    前言介绍 在Java编程语言中,操作文件IO的时候,通常采用BufferedReader,BufferedInputStream等带缓冲的IO类处理大文件,不过java nio中引入了一种基于Mapp ...

  6. oracle中的常用函数、字符串函数、数值类型函数、日期函数,聚合函数。

    一.字符串的常用函数. --一.oracle 字符串常用函数 --1. concat 连接字符串的函数,只能连接[两个]字符串. 字符写在括号中,并用逗号隔开! --2."||"符 ...

  7. Java枚举类在生产环境中的使用方式

    前言   Java枚举在项目中使用非常普遍,许多人在做项目时,一定会遇到要维护某些业务场景状态的时候,往往会定义一个常量类,然后添加业务场景相关的状态常量.但实际上,生产环境的项目中业务状态的定义大部 ...

  8. Lesson6——Pandas Pandas描述性统计

    1 简介 描述统计学(descriptive statistics)是一门统计学领域的学科,主要研究如何取得反映客观现象的数据,并以图表形式对所搜集的数据进行处理和显示,最终对数据的规律.特征做出综合 ...

  9. Lesson4——NumPy 数组属性

    NumPy 教程目录 NumPy 数组的维数称为秩(rank),秩就是轴的数量,即数组的维度,一维数组的秩为 1,二维数组的秩为 2,以此类推. 在 NumPy中,每一个线性的数组称为是一个轴(axi ...

  10. idea创建WebService服务端和客户端

    创建服务端 1.file–>new–>project 2.点击next后输入服务端名,点击finish,生成目录如下 3.在 HelloWorld.Java 文件中右击,选 WebServ ...