Skip List | Set 1 (Introduction)

 

Can we search in a sorted linked list in better than O(n) time?
The worst case search time for a sorted linked list is O(n) as we can only linearly traverse the list and cannot skip nodes while searching. For a Balanced Binary Search Tree, we skip almost half of the nodes after one comparison with root. For a sorted array, we have random access and we can apply Binary Search on arrays.

Can we augment sorted linked lists to make the search faster? The answer is Skip List. The idea is simple, we create multiple layers so that we can skip some nodes. See the following example list with 16 nodes and two layers. The upper layer works as an “express lane” which connects only main outer stations, and the lower layer works as a “normal lane” which connects every station. Suppose we want to search for 50, we start from first node of “express lane” and keep moving on “express lane” till we find a node whose next is greater than 50. Once we find such a node (30 is the node in following example) on “express lane”, we move to “normal lane” using pointer from this node, and linearly search for 50 on “normal lane”. In following example, we start from 30 on “normal lane” and with linear search, we find 50.

What is the time complexity with two layers? The worst case time complexity is number of nodes on “express lane” plus number of nodes in a segment (A segment is number of “normal lane” nodes between two “express lane” nodes) of “normal lane”. So if we have n nodes on “normal lane”,  nodes on “express lane” and we equally divide the “normal lane”, then there will be  nodes in every segment of “normal lane” .  is actually optimal division with two layers. With this arrangement, the number of nodes traversed for a search will be . Therefore, with  extra space, we are able to reduce the time complexity to .

Can we do better?
The time complexity of skip lists can be reduced further by adding more layers. In fact, the time complexity of search, insert and delete can become O(Logn) in average case. We will soon be publishing more posts on Skip Lists.

References
MIT Video Lecture on Skip Lists
http://en.wikipedia.org/wiki/Skip_list

==========================================================

http://www.geeksforgeeks.org/skip-list/

====================================================================================================================

Bloom Filters by Example

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.

The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.

The base data structure of a Bloom filter is a Bit Vector. Here's a small one we'll use to demonstrate:

                             
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Each empty cell in that table represents a bit, and the number below it its index. To add an element to the Bloom filter, we simply hash it a few times and set the bits in the bit vector at the index of those hashes to 1.

It's easier to see what that means than explain it, so enter some strings and see how the bit vector changes. Fnv and Murmur are two simple hash functions:

Enter a string:

fnv: 
murmur:

Your set: []

When you add a string, you can see that the bits at the index given by the hashes are set to 1. I've used the color green to show the newly added ones, but any colored cell is simply a 1.

To test for membership, you simply hash the string with the same hash functions, then see if those values are set in the bit vector. If they aren't, you know that the element isn't in the set. If they are, you only know that it might be, because another element or some combination of other elements could have set the same bits. Again, let's demonstrate:

Test an element for membership:

fnv: 
murmur:

Is the element in the set? no

Probability of a false positive: 0%

And that's the basics of a bloom filter!

Advanced Topics

Before I write a bit more about Bloom filters, a disclaimer: I've never used them in production. Don't take my word for it. All I intend to do is give you general ideas and pointers to where you can find out more.

In the following text, we will refer to a Bloom filter with k hashes, m bits in the filter, and n elements that have been inserted.

Hash Functions

The hash functions used in a Bloom filter should be independent and uniformly distributed. They should also be as fast as possible (cryptographic hashes such as sha1, though widely used therefore are not very good choices).

Examples of fast, simple hashes that are independent enough3 include murmur, thefnv series of hashes, and Jenkins Hashes.

To see the difference that a faster-than-cryptographic hash function can make, check out this story of a ~800% speedup when switching a bloom filter implementation from md5 to murmur.

In a short survey of bloom filter implementations:

    • Cassandra uses Murmur hashes
    • Hadoop includes default implementations of Jenkins and Murmur hashes
    • python-bloomfilter uses cryptographic hashes
    • Plan9 uses a simple hash as proposed in Mitzenmacher 2005
    • Sdroege Bloom filter uses fnv1a (included just because I wanted to show one that uses fnv.)
    • Squid uses MD5

      How big should I make my Bloom filter?

      It's a nice property of Bloom filters that you can modify the false positive rate of your filter. A larger filter will have less false positives, and a smaller one more.

      Your false positive rate will be approximately (1-e-kn/m)k, so you can just plug the number n of elements you expect to insert, and try various values of k and m to configure your filter for your application.2

      This leads to an obvious question:

      How many hash functions should I use?

      The more hash functions you have, the slower your bloom filter, and the quicker it fills up. If you have too few, however, you may suffer too many false positives.

      Since you have to pick k when you create the filter, you'll have to ballpark what range you expect n to be in. Once you have that, you still have to choose a potential m (the number of bits) and k (the number of hash functions).

      It seems a difficult optimization problem, but fortunately, given an m and an n, we have a function to choose the optimal value of k(m/n)ln(2) 23

      So, to choose the size of a bloom filter, we:

      1. Choose a ballpark value for n
      2. Choose a value for m
      3. Calculate the optimal value of k
      4. Calculate the error rate for our chosen values of nm, and k. If it's unacceptable, return to step 2 and change m; otherwise we're done.

      How fast and space efficient is a Bloom filter?

      Given a Bloom filter with m bits and k hashing functions, both insertion and membership testing are O(k). That is, each time you want to add an element to the set or check set membership, you just need to run the element through the k hash functions and add it to the set or check those bits.

      The space advantages are more difficult to sum up; again it depends on the error rate you're willing to tolerate. It also depends on the potential range of the elements to be inserted; if it is very limited, a deterministic bit vector can do better. If you can't even ballpark estimate the number of elements to be inserted, you may be better off with a hash table or a scalable Bloom filter4.

      What can I use them for?

      I'll link you to wiki instead of copying what they say. C. Titus Brown also has an excellent talk on an application of Bloom filters to bioinformatics.

      References

      1: Network Applications of Bloom Filters: A Survey, Broder and Mitzenmacher. An excellent overview.

      2: Wikipedia, which has an excellent and comprehensive page on Bloom filters

      3: Less Hashing, Same Performance, Kirsch and Mitzenmacher

      4: Scalable Bloom Filters, Almeida et al

===========================================================================================================

http://billmill.org/bloomfilter-tutorial/

Skip List & Bloom Filter的更多相关文章

  1. Bloom Filter:海量数据的HashSet

    Bloom Filter一般用于数据的去重计算,近似于HashSet的功能:但是不同于Bitmap(用于精确计算),其为一种估算的数据结构,存在误判(false positive)的情况. 1. 基本 ...

  2. 探索C#之布隆过滤器(Bloom filter)

    阅读目录: 背景介绍 算法原理 误判率 BF改进 总结 背景介绍 Bloom filter(后面简称BF)是Bloom在1970年提出的二进制向量数据结构.通俗来说就是在大数据集合下高效判断某个成员是 ...

  3. Bloom Filter 布隆过滤器

    Bloom Filter 是由伯顿.布隆(Burton Bloom)在1970年提出的一种多hash函数映射的快速查找算法.它实际上是一个很长的二进制向量和一些列随机映射函数.应用在数据量很大的情况下 ...

  4. Bloom Filter学习

    参考文献: Bloom Filters - the math    http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html    B ...

  5. 【转】探索C#之布隆过滤器(Bloom filter)

    原文:蘑菇先生,http://www.cnblogs.com/mushroom/p/4556801.html 背景介绍 Bloom filter(后面简称BF)是Bloom在1970年提出的二进制向量 ...

  6. bloom filter

    Bloom filter 是由 Howard Bloom 在 1970 年提出的二进制向量数据结构,它具有很好的空间和时间效率,被用来检测一个元素是不是集合中的一个成员. 结    构 二进制 召回率 ...

  7. Bloom Filter 概念和原理

    Bloom filter 是由 Howard Bloom 在 1970 年提出的二进制向量数据结构,它具有很好的空间和时间效率,被用来检测一个元素是不是集合中的一个成员.如果检测结果为是,该元素不一定 ...

  8. 【转】Bloom Filter布隆过滤器的概念和原理

    转自:http://blog.csdn.net/jiaomeng/article/details/1495500 之前看数学之美丽,里面有提到布隆过滤器的过滤垃圾邮件,感觉到何其的牛,竟然有这么高效的 ...

  9. [爬虫学习笔记]基于Bloom Filter的url去重模块UrlSeen

            Url Seen用来做url去重.对于一个大的爬虫系统,它可能已经有百亿或者千亿的url,新来一个url如何能快速的判断url是否已经出现过非常关键.因为大的爬虫系统可能一秒钟就会下载 ...

随机推荐

  1. PZ73H-PZ73X刀闸阀厂家,PZ73H-PZ73X刀闸阀价格 - 专题栏目 - 无极资讯网

    无极资讯网 首页 最新资讯 最新图集 最新标签   搜索 PZ73H-PZ73X刀闸阀 无极资讯网精心为您挑选了(PZ73H-PZ73X刀闸阀)信息,其中包含了(PZ73H-PZ73X刀闸阀)厂家,( ...

  2. 一个数字键盘引发的血案——移动端H5输入框、光标、数字键盘全假套件实现

    https://juejin.im/post/5a44c5eef265da432d2868f6 为啥要写假键盘? 还是输入框.光标全假的假键盘? 手机自带的不用非得写个假的,吃饱没事干吧? 装逼?炫技 ...

  3. 初探flow.js

    第一部分:前言 我们知道JS是弱类型语言,在声明变量时不论是什么类型的变量我们都用var即可,所以js是非常灵活的,但是同时问题就是弱类型语言有可能会出错,比如在调用函数时,且往往在运行起来时才可以检 ...

  4. JetBrains PyCharm(Professional版本)的下载、安装和初步使用

    不多说,直接上干货! 首先谈及这款软件,博主我用的理由:搞机器学习和深度学习! 想学习Python的同学们,在这里隆重介绍一款 Python 的开发工具 pyCharm IDE.这是我最喜欢的 Pyt ...

  5. 【随笔】Linux主机简单判断CC攻击的命令

    今天看到一个很有意思的命令tcpdump,在这里记录下. 如果想要看tcpdump的详细用法,可以点击这里. 什么是CC攻击? 关于CC攻击,这里引用百度的解释: CC攻击的原理就是攻击者控制某些主机 ...

  6. Android规划周期任务

    问题:应用总要周期性的执行某项任务,例如检查服务器上的更新或者提醒用户做某些事情. 解决方案:用AlarmManager来管理和执行任务.AlarmManager可用于计划未来的单次或重复操作,甚至在 ...

  7. orcale 之 SQL 数据查询

    从数据库中检索行,并允许从一个或多个表中选择一个或多个行或列.虽然 SELECT 语句的完整语法较复杂,但是其主要的子句可归纳如下: SELECT select_list [ INTO new_tab ...

  8. 快速部署简单私有云CloudStack(上)

    前言: 亲身用了大半年,没出过重大毛病,也就是服务挂了,跟服务器也没啥关系.如果想更深入学习cloudstack可以试试高级网络,我是一直用的简单网络(扁平网络). 由来:CloudStack的前身是 ...

  9. 我Java学习时的模样(三)

    读Java源码 平常使用Java的时候,那些集合类使用起来很顺手,但是有没有想过这些集合内部的实现原理是怎样的,它的添加移除都有哪些操作? 有了一些工作经验之后,必须要读一读Java包中的源码,需要知 ...

  10. Scrapy框架学习(三)Spider、Downloader Middleware、Spider Middleware、Item Pipeline的用法

    Spider有以下属性: Spider属性 name 爬虫名称,定义Spider名字的字符串,必须是唯一的.常见的命名方法是以爬取网站的域名来命名,比如爬取baidu.com,那就将Spider的名字 ...