What Is It?

A hash table, or associative array, is a well known key-value data structure. In R there is no equivalent, but you do have some options. You can use a vector of any type, a list, or an environment.

But as you’ll see with all of these options their performance is compromised in some way. In the average case a lookupash tabl for a key should perform in constant time, or O(1), while in the worst case it will perform in O(n) time,n being the number of elements in the hash table.

For the tests below, we’ll implement a hash table with a few R data structures and make some comparisons. We’ll create hash tables with only unique keys and then perform a search for every key in the table. Here’s our unique random string creator function:

library(plyr)
library(ggplot2) # Create n unique strings. We use upper and lower case letters only. Create
# length 1 strings first, and if we haven't satisfied the number of strings
# requested, we'll create length 2 strings, and so on until we've created n
# strings.
#
# Returns them in random order
#
unique_strings <- function(n){
string_i <- 1
string_len <- 1
ans <- character(n)
chars <- c(letters,LETTERS)
new_strings <- function(len,pfx){
for(i in 1:length(chars)){
if (len == 1){
ans[string_i] <<- paste(pfx,chars[i],sep='')
string_i <<- string_i + 1
} else {
new_strings(len-1,pfx=paste(pfx,chars[i],sep=''))
}
if (string_i > n) return ()
}
}
while(string_i <= n){
new_strings(string_len,'')
string_len <- string_len + 1
}
sample(ans)
}

Using a Vector

All vectors in R can be named, so let’s see how looking up a vector element by name compares to looking up an element by index.

Here’s our fake hash table creator:

# Create a named integer vector size n. Note that elements are initialized to 0
#
fake_hash_table <- function(n){
ans <- integer(n)
names(ans) <- unique_strings(n)
ans
}

Now let’s create hash tables size 210 thru 215 and search each element:

timings1 <- adply(2^(10:15),.mar=1,.fun=function(i){
ht <- fake_hash_table(i)
data.frame(
size=c(i,i),
seconds=c(
system.time(for (j in 1:i)ht[names(ht)[j]]==0L)[3],
system.time(for (k in 1:i)ht[k]==0L)[3]),
index = c('1_string','2_numeric')
)
})

We perform element lookup by using an equality test, so

ht[names(ht)[j]]==0L

performs the lookup using a named string, while

ht[k]==0L

performs the lookup by index.

p1 <- ggplot(timings1,aes(x=size,y=seconds,group=index)) +
geom_point(aes(color=index)) + geom_line(aes(color=index)) +
scale_x_log10(
breaks=2^(10:15),
labels=c(expression(2^10), expression(2^11),expression(2^12),
expression(2^13), expression(2^14), expression(2^15))
) +
theme(
axis.text = element_text(colour = "black")
) +
ggtitle('Search Time for Integer Vectors')
p1

So we see that as the hash size increases the time it takes to search all the keys by name is exponential, whereas searching by index is constant.

And note that we’re talking over 30 seconds to search the 215 size hash table, and I’ve got a pretty performant laptop with 8Gb memory, 2.7Ghz Intel processor with SSD drives.

Using a List

List elements can be named, right? They’re also more flexible than integer vectors since they can store any R object. What does their performance look like comparing name lookup to index lookup?

timings2 <- adply(2^(10:15),.mar=1,.fun=function(i){
strings <- unique_strings(i)
ht <- list()
lapply(strings, function(s) ht[[s]] <<- 0L)
data.frame(
size=c(i,i),
seconds=c(
system.time(for (j in 1:i) ht[[strings[j]]]==0L)[3],
system.time(for (k in 1:i) ht[[k]]==0L)[3]),
index = c('1_string','2_numeric')
)
})
p2 <- ggplot(timings2,aes(x=size,y=seconds,group=index)) +
geom_point(aes(color=index)) + geom_line(aes(color=index)) +
scale_x_log10(
breaks=2^(10:15),
labels=c(expression(2^10), expression(2^11),expression(2^12),
expression(2^13), expression(2^14), expression(2^15))
) +
theme(
axis.text = element_text(colour = "black")
) +
ggtitle('Search Time for Lists')
p2

A little better but still exponential growth with named indexing. We cut our 30 seconds down to over 6 seconds for the large hash table, though.

Let’s see if we can do better.

Using an Environment

R environments store bindings of variables to values. In fact they are well suited to implementing a hash table since internally that’s how they are implemented!

e <- new.env()

# Assigning in the environment the usual way
with(e, foo <- 'bar') e
## <environment: 0x3e81d48>
ls.str(e)
## foo : chr "bar"

You can also use environments the same way one would use a list:

e$bar <- 'baz'
ls.str(e)
## bar : chr "baz"
## foo : chr "bar"

By default R environments are hashed, but you can also create them without hashing. Let’s evaluate the two:

timings3 <- adply(2^(10:15),.mar=1,.fun=function(i){
strings <- unique_strings(i)
ht1 <- new.env(hash=TRUE)
ht2 <- new.env(hash=FALSE)
lapply(strings, function(s){ ht1[[s]] <<- 0L; ht2[[s]] <<- 0L;})
data.frame(
size=c(i,i),
seconds=c(
system.time(for (j in 1:i) ht1[[strings[j]]]==0L)[3],
system.time(for (k in 1:i) ht2[[strings[k]]]==0L)[3]),
envir = c('2_hashed','1_unhashed')
)
})

Note that we’re performing the lookup by named string in both cases. The only difference is that ht1 is hashed and ht2 is not.

p3 <- ggplot(timings3,aes(x=size,y=seconds,group=envir)) +
geom_point(aes(color=envir)) + geom_line(aes(color=envir)) +
scale_x_log10(
breaks=2^(10:15),
labels=c(expression(2^10), expression(2^11),expression(2^12),
expression(2^13), expression(2^14), expression(2^15))
) +
theme(
axis.text = element_text(colour = "black")
) +
ggtitle('Search Time for Environments')
p3

Bam! See that blue line? That’s near constant time for searching the entire 215 size hash table!

An interesting note about un-hashed environments: they are implemented with lists underneath the hood! You’d expect the red line of the above plot to mirror the red line of the list plot, but they differ ever so slightly in performance, with lists being faster.

In Conclusion

When you need a true associative array indexed by string, then you definitely want to use R hashed environments. They are more flexible than vectors as they can store any R object, and they are faster than lists since they are hashed. Plus you can work with them just like lists.

But using environments comes with caveats. I’ll explain in Part II.

转自:http://jeffreyhorner.tumblr.com/post/114524915928/hash-table-performance-in-r-part-i

Hash Table Performance in R: Part I(转)的更多相关文章

  1. [转载] 散列表(Hash Table)从理论到实用(上)

    转载自:白话算法(6) 散列表(Hash Table)从理论到实用(上) 处理实际问题的一般数学方法是,首先提炼出问题的本质元素,然后把它看作一个比现实无限宽广的可能性系统,这个系统中的实质关系可以通 ...

  2. DHT(Distributed Hash Table) Translator

    DHT(Distributed Hash Table) Translator What is DHT? DHT is the real core of how GlusterFS aggregates ...

  3. 数据结构 : Hash Table

    http://www.cnblogs.com/lucifer1982/archive/2008/06/18/1224319.html 作者:Angel Lucifer 引子 这篇仍然不讲并行/并发. ...

  4. Hash Map (Hash Table)

    Reference: Wiki  PrincetonAlgorithm What is Hash Table Hash table (hash map) is a data structure use ...

  5. 数据结构基础-Hash Table详解(转)

    理解Hash 哈希表(hash table)是从一个集合A到另一个集合B的映射(mapping). 映射是一种对应关系,而且集合A的某个元素只能对应集合B中的一个元素.但反过来,集合B中的一个元素可能 ...

  6. 散列表(hash table)——算法导论(13)

    1. 引言 许多应用都需要动态集合结构,它至少需要支持Insert,search和delete字典操作.散列表(hash table)是实现字典操作的一种有效的数据结构. 2. 直接寻址表 在介绍散列 ...

  7. 哈希表(Hash Table)

    参考: Hash table - Wiki Hash table_百度百科 从头到尾彻底解析Hash表算法 谈谈 Hash Table 我们身边的哈希,最常见的就是perl和python里面的字典了, ...

  8. Berkeley DB的数据存储结构——哈希表(Hash Table)、B树(BTree)、队列(Queue)、记录号(Recno)

    Berkeley DB的数据存储结构 BDB支持四种数据存储结构及相应算法,官方称为访问方法(Access Method),分别是哈希表(Hash Table).B树(BTree).队列(Queue) ...

  9. 几种常见 容器 比较和分析 hashmap, map, vector, list ...hash table

    list支持快速的插入和删除,但是查找费时; vector支持快速的查找,但是插入费时. map查找的时间复杂度是对数的,这几乎是最快的,hash也是对数的.  如果我自己写,我也会用二叉检索树,它在 ...

随机推荐

  1. SystemVerilog搭建验证平台使用DPI时遇到的问题及解决方案

    本文目的在于分享一下把DPI稿能用了的过程,主要说一下平台其他部分搭建好之后,在完成DPI相关工作阶段遇到的问题,以及解决的办法. 工作环境:win10 64bit, Questasim 10.1b ...

  2. 这个demo是为解决IQKeyboardManager和Masonry同时使用时,导航栏上移和make.right失效的问题

    原文链接在我的个人博客主页 (一).引言: 在 IQKeyboardManager 和 Masonry 同时使用时,导航栏上移和make.right失效等问题多多. 其实我们完美的效果应该是这样的:* ...

  3. php数组--2017-04-16

    一.定义数组 (1)索引数组 $arr=array(1,2,3,3); (2)关联数组  类似于集合 $arr1=array("one"=>"111",& ...

  4. git工具使用的简单介绍

    百度百科 写道 Git是一款免费.开源的分布式版本控制系统,用于敏捷高效地处理任何或小或大的项目. Git的读音为/gɪt/. Git是一个开源的分布式版本控制系统,用以有效.高速的处理从很小到非常大 ...

  5. 用SourceTree轻松Git项目图解

    这篇文档的目的是:让使用Git更轻松. 看完这篇文档你能做到的是: 1.简单的用Git管理项目. 2.怎样既要开发又要处理发布出去的版本bug情况. SourceTree是一个免费的Git图形化管理工 ...

  6. Java多线程学习笔记(一)——Thread类中方法介绍

    currentThread():返回代码正在被哪个线程调用. public class CurrentThreadWay { public static void main(String[] args ...

  7. Maven(二)之Maven项目构建演练

    从上一篇的讲解中我们知道了什么是Maven,然后它的安装配置,到修改本地仓库,这篇我们用一个实际的例子,带领大家走进我们的Maven之旅.让我们一起来体验一下Maven的高度自动化构建项目的过程. 一 ...

  8. 对百度WebUploader的二次封装,精简前端代码之图片预览上传(两句代码搞定上传)

    前言 本篇文章上一篇: 对百度WebUploader开源上传控件的二次封装,精简前端代码(两句代码搞定上传) 此篇是在上面的基础上扩展出来专门上传图片的控件封装. 首先我们看看效果: 正文 使用方式同 ...

  9. SIP DB33标准笔记 注册/目录发送/心跳

    SIP协议扩展中: 在 RFC 3261 基础上定义了一个新方法 DO.方法 DO 的功能包括:控制对方动作.更新对方信息.查询对方状态.历史监控资料查询和回放等.发送方法 DO 的请求报文时,不会创 ...

  10. JavaScript 小函数积累及性能优化

    获取值的类型: var toString = Object.prototype.toString; function getType(o) { return toString.call(o).slic ...