Redis Data Structures: HyperLogLog

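I. HyperLogLog

HyperLogLog is used for cardinality estimation.

It can count things using very little memory, for example the number of registered IPs, daily visiting IPs, real-time page UV (PV can simply be handled with a string counter), or online users, in scenarios where exact accuracy is not critical.

The advantage of HyperLogLog:

Even when the number or size of the input elements is extremely large, the space needed to compute the cardinality is always fixed and very small.

The disadvantage of HyperLogLog:

It is an algorithm that estimates the cardinality, so the result carries some error; the standard error is 0.81%.

Each HyperLogLog key needs only about 12 KB of memory and can estimate the cardinality of close to 2^64 distinct elements. This is in sharp contrast to a set, which consumes more memory the more elements it holds.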
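The 12 KB and 0.81% figures both follow from the register layout Redis uses: 16384 registers of 6 bits each. The short Python check below works through the arithmetic, assuming the usual HyperLogLog standard-error approximation of 1.04 / sqrt(m) from the original paper. The variable names are only illustrative.

```python
import math

REGISTERS = 16384        # number of registers (m) per HyperLogLog key
BITS_PER_REGISTER = 6    # each register is a 6-bit counter

# Dense storage: registers packed back to back, excluding the 16-byte header.
dense_bytes = REGISTERS * BITS_PER_REGISTER // 8

# The standard error of the HyperLogLog estimator is roughly 1.04 / sqrt(m).
standard_error = 1.04 / math.sqrt(REGISTERS)

print(dense_bytes)               # 12288
print(f"{standard_error:.4%}")   # 0.8125%
```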

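However, because HyperLogLog only uses the input elements to compute the cardinality and does not store the elements themselves, it cannot return the individual elements the way a set can. In other words, there is no way to see the details of what was counted.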

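II. Cardinality and Estimates

1. Cardinality

Cardinality is the number of distinct elements in a set.

For example, for the data set {1, 3, 5, 7, 5, 7, 8}, the set of distinct elements is {1, 3, 5, 7, 8}, so the cardinality (the number of non-repeating elements) is 5.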
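For comparison, the exact cardinality of this data set can be computed by keeping every distinct element, which is what a set does. A small Python illustration (plain Python, not Redis code):

```python
data = [1, 3, 5, 7, 5, 7, 8]

# A set keeps one copy of every distinct element, so its size is the exact
# cardinality. Its memory use grows with the number of distinct elements.
distinct = set(data)
cardinality = len(distinct)

print(sorted(distinct))  # [1, 3, 5, 7, 8]
print(cardinality)       # 5
```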

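Cardinality estimation means computing the cardinality quickly, within an acceptable margin of error.

2. Estimates

The cardinality returned by the algorithm is not exact; it may be slightly higher or slightly lower than the true value, but it stays within a reasonable range.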
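To see where the approximation comes from, here is a simplified HyperLogLog written in Python. It is a sketch under stated assumptions, not the Redis implementation: `ToyHyperLogLog` is a made-up name, SHA-1 truncated to 64 bits stands in for the hash function, and registers are kept in a plain list rather than packed into 6 bits. It does show the key properties: only registers are stored, duplicates have no effect, and the result is close to the true count without being exact.

```python
import hashlib
import math


class ToyHyperLogLog:
    """Simplified HyperLogLog: keeps only m small registers, never the items."""

    def __init__(self, p=14):
        self.p = p                      # index bits; m = 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Assumption: SHA-1 truncated to 64 bits stands in for the hash function.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the first 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)            # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting) from the original paper.
            return round(m * math.log(m / zeros))
        return round(raw)


hll = ToyHyperLogLog()
for i in range(100_000):
    hll.add(f"user:{i}")
for i in range(1_000):                  # duplicates leave the registers unchanged
    hll.add(f"user:{i}")

estimate = hll.count()
relative_error = abs(estimate - 100_000) / 100_000
print(estimate, f"{relative_error:.2%}")
```

The estimate should land within a few standard errors of 100,000, while the object never holds more than 16384 small integers regardless of how many items are added.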

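III. Basic HyperLogLog Commands

The basic Redis HyperLogLog commands are:

1. PFADD key element [element ...]

Adds the specified elements to a HyperLogLog.

2. PFCOUNT key [key ...]

Returns the estimated cardinality of the given HyperLogLog(s).

3. PFMERGE destkey sourcekey [sourcekey ...]

Merges multiple HyperLogLogs into a single HyperLogLog.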

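PFADD

Adds any number of elements to the specified HyperLogLog. After the command runs, the internal structure of the HyperLogLog is updated and the result is reported back:

If the estimated cardinality of the HyperLogLog changed as a result, the command returns 1; otherwise (the elements are considered already present) it returns 0.

One curious feature of this command is that it can be called with only a key and no elements, which simply creates an empty key without adding anything.

In that case, if the key already exists, nothing happens and 0 is returned; if it does not exist, it is created and 1 is returned.

The time complexity of this command is O(1) for each element added, so feel free to use it.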

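PFCOUNT

When called with a single key, returns the estimated cardinality of that key. If the key does not exist, returns 0.

When called with multiple keys, returns the approximate cardinality of the union of all the given HyperLogLogs, computed by merging them into a temporary HyperLogLog.

With a single key, the time complexity is O(1) with a very small average constant time. With N keys, it is O(N), and the constant time is much larger because the HyperLogLogs have to be merged first.

The cardinality of the observed set returned by the command is not exact; it is an approximation with a standard error of 0.81%.

For example, to record how many distinct search queries are performed in a day, a program can call PFADD each time it runs a search query and then call PFCOUNT to obtain the approximate result.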

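PFMERGE

Merges multiple HyperLogLogs into one. The cardinality of the merged HyperLogLog approximates the cardinality of the union of the observed sets of all the input HyperLogLogs.

The first argument is the destination key, and the remaining arguments are the HyperLogLogs to merge.

The merged HyperLogLog is stored in the destkey key. If that key does not exist, the command first creates an empty HyperLogLog for it and then performs the merge.

The time complexity of this command is O(N), where N is the number of HyperLogLogs to merge, although its constant time is fairly high.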
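Merging works because a HyperLogLog is essentially an array of registers, and the union of several HyperLogLogs is obtained by taking the largest value at each register position. The Python sketch below illustrates the idea on tiny made-up register arrays (real Redis HyperLogLogs have 16384 registers). `merge_registers` is a hypothetical helper, not a Redis API.

```python
def merge_registers(*register_arrays):
    """Register-wise maximum of equally sized HyperLogLog register arrays."""
    length = len(register_arrays[0])
    if any(len(regs) != length for regs in register_arrays):
        raise ValueError("all register arrays must have the same length")
    return [max(values) for values in zip(*register_arrays)]


day1 = [0, 3, 1, 0]
day2 = [2, 1, 1, 0]
merged = merge_registers(day1, day2)
print(merged)  # [2, 3, 1, 0]
```

Because taking the maximum is idempotent, elements that appear in more than one input are not counted twice, and merging the same HyperLogLog again leaves the result unchanged.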

redis> PFADD ip:20170626 "192.168.0.10" "192.168.0.20" "192.168.0.30"
(integer) 1
redis> PFADD ip:20170626 "192.168.0.20" "192.168.0.40" "192.168.0.50" # existing elements are ignored; only new ones are added
(integer) 1
redis> PFCOUNT ip:20170626 # five distinct IPs so far
(integer) 5
redis> PFADD ip:20170626 "192.168.0.20" # already present, so nothing changes
(integer) 0
redis> PFMERGE ip:201706 ip:20170626 ip:20170627 ip:20170628 # source keys that do not exist are treated as empty
OK
redis> PFCOUNT ip:201706
(integer) 5

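IV. The HyperLogLog Description in the Source

Because HyperLogLog is not used in that many real-world scenarios, we will not discuss it in more detail here.

Let's look at how the hyperloglog.c file describes HyperLogLog: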

/* The Redis HyperLogLog implementation is based on the following ideas:
 *
 * * The use of a 64 bit hash function as proposed in [1], in order to don't
 *   limited to cardinalities up to 10^9, at the cost of just 1 additional
 *   bit per register.
 * * The use of 16384 6-bit registers for a great level of accuracy, using
 *   a total of 12k per key.
 * * The use of the Redis string data type. No new type is introduced.
 * * No attempt is made to compress the data structure as in [1]. Also the
 *   algorithm used is the original HyperLogLog Algorithm as in [2], with
 *   the only difference that a 64 bit hash function is used, so no correction
 *   is performed for values near 2^32 as in [1].
 *
 * [1] Heule, Nunkesser, Hall: HyperLogLog in Practice: Algorithmic
 *     Engineering of a State of The Art Cardinality Estimation Algorithm.
 *
 * [2] P. Flajolet, Éric Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The
 *     analysis of a near-optimal cardinality estimation algorithm.
 *
 * Redis uses two representations:
 *
 * 1) A "dense" representation where every entry is represented by
 *    a 6-bit integer.
 * 2) A "sparse" representation using run length compression suitable
 *    for representing HyperLogLogs with many registers set to 0 in
 *    a memory efficient way.
 *
 *
 * HLL header
 * ===
 *
 * Both the dense and sparse representation have a 16 byte header as follows:
 *
 * +------+---+-----+----------+
 * | HYLL | E | N/U | Cardin.  |
 * +------+---+-----+----------+
 *
 * The first 4 bytes are a magic string set to the bytes "HYLL".
 * "E" is one byte encoding, currently set to HLL_DENSE or
 * HLL_SPARSE. N/U are three not used bytes.
 *
 * The "Cardin." field is a 64 bit integer stored in little endian format
 * with the latest cardinality computed that can be reused if the data
 * structure was not modified since the last computation (this is useful
 * because there are high probabilities that HLLADD operations don't
 * modify the actual data structure and hence the approximated cardinality).
 *
 * When the most significant bit in the most significant byte of the cached
 * cardinality is set, it means that the data structure was modified and
 * we can't reuse the cached value that must be recomputed.
 *
 * Dense representation
 * ===
 *
 * The dense representation used by Redis is the following:
 *
 * +--------+--------+--------+------//      //--+
 * |11000000|22221111|33333322|55444444 ....     |
 * +--------+--------+--------+------//      //--+
 *
 * The 6 bits counters are encoded one after the other starting from the
 * LSB to the MSB, and using the next bytes as needed.
 *
 * Sparse representation
 * ===
 *
 * The sparse representation encodes registers using a run length
 * encoding composed of three opcodes, two using one byte, and one using
 * of two bytes. The opcodes are called ZERO, XZERO and VAL.
 *
 * ZERO opcode is represented as 00xxxxxx. The 6-bit integer represented
 * by the six bits 'xxxxxx', plus 1, means that there are N registers set
 * to 0. This opcode can represent from 1 to 64 contiguous registers set
 * to the value of 0.
 *
 * XZERO opcode is represented by two bytes 01xxxxxx yyyyyyyy. The 14-bit
 * integer represented by the bits 'xxxxxx' as most significant bits and
 * 'yyyyyyyy' as least significant bits, plus 1, means that there are N
 * registers set to 0. This opcode can represent from 0 to 16384 contiguous
 * registers set to the value of 0.
 *
 * VAL opcode is represented as 1vvvvvxx. It contains a 5-bit integer
 * representing the value of a register, and a 2-bit integer representing
 * the number of contiguous registers set to that value 'vvvvv'.
 * To obtain the value and run length, the integers vvvvv and xx must be
 * incremented by one. This opcode can represent values from 1 to 32,
 * repeated from 1 to 4 times.
 *
 * The sparse representation can't represent registers with a value greater
 * than 32, however it is very unlikely that we find such a register in an
 * HLL with a cardinality where the sparse representation is still more
 * memory efficient than the dense representation. When this happens the
 * HLL is converted to the dense representation.
 *
 * The sparse representation is purely positional. For example a sparse
 * representation of an empty HLL is just: XZERO:16384.
 *
 * An HLL having only 3 non-zero registers at position 1000, 1020, 1021
 * respectively set to 2, 3, 3, is represented by the following three
 * opcodes:
 *
 * XZERO:1000 (Registers 0-999 are set to 0)
 * VAL:2,1    (1 register set to value 2, that is register 1000)
 * ZERO:19    (Registers 1001-1019 set to 0)
 * VAL:3,2    (2 registers set to value 3, that is registers 1020,1021)
 * XZERO:15362 (Registers 1022-16383 set to 0)
 *
 * In the example the sparse representation used just 7 bytes instead
 * of 12k in order to represent the HLL registers. In general for low
 * cardinality there is a big win in terms of space efficiency, traded
 * with CPU time since the sparse representation is slower to access:
 *
 * The following table shows average cardinality vs bytes used, 100
 * samples per cardinality (when the set was not representable because
 * of registers with too big value, the dense representation size was used
 * as a sample).
 *
 * 100 267
 * 200 485
 * 300 678
 * 400 859
 * 500 1033
 * 600 1205
 * 700 1375
 * 800 1544
 * 900 1713
 * 1000 1882
 * 2000 3480
 * 3000 4879
 * 4000 6089
 * 5000 7138
 * 6000 8042
 * 7000 8823
 * 8000 9500
 * 9000 10088
 * 10000 10591
 *
 * The dense representation uses 12288 bytes, so there is a big win up to
 * a cardinality of ~2000-3000. For bigger cardinalities the constant times
 * involved in updating the sparse representation is not justified by the
 * memory savings. The exact maximum length of the sparse representation
 * when this implementation switches to the dense representation is
 * configured via the define server.hll_sparse_max_bytes.
 */
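To make the sparse encoding concrete, the following Python sketch decodes the three opcodes described in the comment and checks the 7-byte example against it. `decode_sparse` and the hand-assembled byte string are illustrative only; this is not the decoder from hyperloglog.c, and it ignores the 16-byte header.

```python
def decode_sparse(data):
    """Expand sparse opcodes (ZERO, XZERO, VAL) into a flat register list."""
    registers = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte & 0x80:
            # VAL: 1vvvvvxx -> value vvvvv + 1, repeated xx + 1 times.
            value = ((byte >> 2) & 0x1F) + 1
            run = (byte & 0x03) + 1
            registers.extend([value] * run)
            i += 1
        elif byte & 0x40:
            # XZERO: 01xxxxxx yyyyyyyy -> 14-bit run length, plus 1, of zeros.
            run = (((byte & 0x3F) << 8) | data[i + 1]) + 1
            registers.extend([0] * run)
            i += 2
        else:
            # ZERO: 00xxxxxx -> xxxxxx + 1 zero registers.
            run = (byte & 0x3F) + 1
            registers.extend([0] * run)
            i += 1
    return registers


# XZERO:1000, VAL:2,1, ZERO:19, VAL:3,2, XZERO:15362 (assembled by hand)
encoded = bytes([0x43, 0xE7, 0x84, 0x12, 0x89, 0x7C, 0x01])
registers = decode_sparse(encoded)
nonzero = {pos: val for pos, val in enumerate(registers) if val}

print(len(encoded))    # 7
print(len(registers))  # 16384
print(nonzero)         # {1000: 2, 1020: 3, 1021: 3}
```

Each run length and value is stored minus one, which is why the decoder adds 1 after masking the bits.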
