Hash Table Performance in R: Part I(转)
What Is It?
A hash table, or associative array, is a well known key-value data structure. In R there is no equivalent, but you do have some options. You can use a vector of any type, a list, or an environment.
But as you’ll see with all of these options their performance is compromised in some way. In the average case a lookupash tabl for a key should perform in constant time, or O(1), while in the worst case it will perform in O(n) time,n being the number of elements in the hash table.
For the tests below, we’ll implement a hash table with a few R data structures and make some comparisons. We’ll create hash tables with only unique keys and then perform a search for every key in the table. Here’s our unique random string creator function:
library(plyr)
library(ggplot2) # Create n unique strings. We use upper and lower case letters only. Create
# length 1 strings first, and if we haven't satisfied the number of strings
# requested, we'll create length 2 strings, and so on until we've created n
# strings.
#
# Returns them in random order
#
unique_strings <- function(n){
string_i <- 1
string_len <- 1
ans <- character(n)
chars <- c(letters,LETTERS)
new_strings <- function(len,pfx){
for(i in 1:length(chars)){
if (len == 1){
ans[string_i] <<- paste(pfx,chars[i],sep='')
string_i <<- string_i + 1
} else {
new_strings(len-1,pfx=paste(pfx,chars[i],sep=''))
}
if (string_i > n) return ()
}
}
while(string_i <= n){
new_strings(string_len,'')
string_len <- string_len + 1
}
sample(ans)
}
Using a Vector
All vectors in R can be named, so let’s see how looking up a vector element by name compares to looking up an element by index.
Here’s our fake hash table creator:
# Create a named integer vector size n. Note that elements are initialized to 0
#
fake_hash_table <- function(n){
ans <- integer(n)
names(ans) <- unique_strings(n)
ans
}
Now let’s create hash tables size 210 thru 215 and search each element:
timings1 <- adply(2^(10:15),.mar=1,.fun=function(i){
ht <- fake_hash_table(i)
data.frame(
size=c(i,i),
seconds=c(
system.time(for (j in 1:i)ht[names(ht)[j]]==0L)[3],
system.time(for (k in 1:i)ht[k]==0L)[3]),
index = c('1_string','2_numeric')
)
})
We perform element lookup by using an equality test, so
ht[names(ht)[j]]==0L
performs the lookup using a named string, while
ht[k]==0L
performs the lookup by index.
p1 <- ggplot(timings1,aes(x=size,y=seconds,group=index)) +
geom_point(aes(color=index)) + geom_line(aes(color=index)) +
scale_x_log10(
breaks=2^(10:15),
labels=c(expression(2^10), expression(2^11),expression(2^12),
expression(2^13), expression(2^14), expression(2^15))
) +
theme(
axis.text = element_text(colour = "black")
) +
ggtitle('Search Time for Integer Vectors')
p1

So we see that as the hash size increases the time it takes to search all the keys by name is exponential, whereas searching by index is constant.
And note that we’re talking over 30 seconds to search the 215 size hash table, and I’ve got a pretty performant laptop with 8Gb memory, 2.7Ghz Intel processor with SSD drives.
Using a List
List elements can be named, right? They’re also more flexible than integer vectors since they can store any R object. What does their performance look like comparing name lookup to index lookup?
timings2 <- adply(2^(10:15),.mar=1,.fun=function(i){
strings <- unique_strings(i)
ht <- list()
lapply(strings, function(s) ht[[s]] <<- 0L)
data.frame(
size=c(i,i),
seconds=c(
system.time(for (j in 1:i) ht[[strings[j]]]==0L)[3],
system.time(for (k in 1:i) ht[[k]]==0L)[3]),
index = c('1_string','2_numeric')
)
})
p2 <- ggplot(timings2,aes(x=size,y=seconds,group=index)) +
geom_point(aes(color=index)) + geom_line(aes(color=index)) +
scale_x_log10(
breaks=2^(10:15),
labels=c(expression(2^10), expression(2^11),expression(2^12),
expression(2^13), expression(2^14), expression(2^15))
) +
theme(
axis.text = element_text(colour = "black")
) +
ggtitle('Search Time for Lists')
p2

A little better but still exponential growth with named indexing. We cut our 30 seconds down to over 6 seconds for the large hash table, though.
Let’s see if we can do better.
Using an Environment
R environments store bindings of variables to values. In fact they are well suited to implementing a hash table since internally that’s how they are implemented!
e <- new.env() # Assigning in the environment the usual way
with(e, foo <- 'bar') e
## <environment: 0x3e81d48>
ls.str(e)
## foo : chr "bar"
You can also use environments the same way one would use a list:
e$bar <- 'baz'
ls.str(e)
## bar : chr "baz"
## foo : chr "bar"
By default R environments are hashed, but you can also create them without hashing. Let’s evaluate the two:
timings3 <- adply(2^(10:15),.mar=1,.fun=function(i){
strings <- unique_strings(i)
ht1 <- new.env(hash=TRUE)
ht2 <- new.env(hash=FALSE)
lapply(strings, function(s){ ht1[[s]] <<- 0L; ht2[[s]] <<- 0L;})
data.frame(
size=c(i,i),
seconds=c(
system.time(for (j in 1:i) ht1[[strings[j]]]==0L)[3],
system.time(for (k in 1:i) ht2[[strings[k]]]==0L)[3]),
envir = c('2_hashed','1_unhashed')
)
})
Note that we’re performing the lookup by named string in both cases. The only difference is that ht1 is hashed and ht2 is not.
p3 <- ggplot(timings3,aes(x=size,y=seconds,group=envir)) +
geom_point(aes(color=envir)) + geom_line(aes(color=envir)) +
scale_x_log10(
breaks=2^(10:15),
labels=c(expression(2^10), expression(2^11),expression(2^12),
expression(2^13), expression(2^14), expression(2^15))
) +
theme(
axis.text = element_text(colour = "black")
) +
ggtitle('Search Time for Environments')
p3

Bam! See that blue line? That’s near constant time for searching the entire 215 size hash table!
An interesting note about un-hashed environments: they are implemented with lists underneath the hood! You’d expect the red line of the above plot to mirror the red line of the list plot, but they differ ever so slightly in performance, with lists being faster.
In Conclusion
When you need a true associative array indexed by string, then you definitely want to use R hashed environments. They are more flexible than vectors as they can store any R object, and they are faster than lists since they are hashed. Plus you can work with them just like lists.
But using environments comes with caveats. I’ll explain in Part II.
转自:http://jeffreyhorner.tumblr.com/post/114524915928/hash-table-performance-in-r-part-i
Hash Table Performance in R: Part I(转)的更多相关文章
- [转载] 散列表(Hash Table)从理论到实用(上)
转载自:白话算法(6) 散列表(Hash Table)从理论到实用(上) 处理实际问题的一般数学方法是,首先提炼出问题的本质元素,然后把它看作一个比现实无限宽广的可能性系统,这个系统中的实质关系可以通 ...
- DHT(Distributed Hash Table) Translator
DHT(Distributed Hash Table) Translator What is DHT? DHT is the real core of how GlusterFS aggregates ...
- 数据结构 : Hash Table
http://www.cnblogs.com/lucifer1982/archive/2008/06/18/1224319.html 作者:Angel Lucifer 引子 这篇仍然不讲并行/并发. ...
- Hash Map (Hash Table)
Reference: Wiki PrincetonAlgorithm What is Hash Table Hash table (hash map) is a data structure use ...
- 数据结构基础-Hash Table详解(转)
理解Hash 哈希表(hash table)是从一个集合A到另一个集合B的映射(mapping). 映射是一种对应关系,而且集合A的某个元素只能对应集合B中的一个元素.但反过来,集合B中的一个元素可能 ...
- 散列表(hash table)——算法导论(13)
1. 引言 许多应用都需要动态集合结构,它至少需要支持Insert,search和delete字典操作.散列表(hash table)是实现字典操作的一种有效的数据结构. 2. 直接寻址表 在介绍散列 ...
- 哈希表(Hash Table)
参考: Hash table - Wiki Hash table_百度百科 从头到尾彻底解析Hash表算法 谈谈 Hash Table 我们身边的哈希,最常见的就是perl和python里面的字典了, ...
- Berkeley DB的数据存储结构——哈希表(Hash Table)、B树(BTree)、队列(Queue)、记录号(Recno)
Berkeley DB的数据存储结构 BDB支持四种数据存储结构及相应算法,官方称为访问方法(Access Method),分别是哈希表(Hash Table).B树(BTree).队列(Queue) ...
- 几种常见 容器 比较和分析 hashmap, map, vector, list ...hash table
list支持快速的插入和删除,但是查找费时; vector支持快速的查找,但是插入费时. map查找的时间复杂度是对数的,这几乎是最快的,hash也是对数的. 如果我自己写,我也会用二叉检索树,它在 ...
随机推荐
- ECP系统J2EE架构开发平台
一 体系结构 ECP平台是一个基于J2EE架构设计的大型分布式企业协同管理平台,通过采用成熟的J2EE的多层企业架构体系,充分保证了系统的健壮性.开放性和扩展性.可选择部署于多种系统环境,满足不同类型 ...
- C#各个版本中的新增特性详解
序言 自从2000年初期发布以来,c#编程语言不断的得到改进,使我们能够更加清晰的编写代码,也更加容易维护我们的代码,增强的功能已经从1.0搞到啦7.0甚至7.1,每一次改过都伴随着.NET Fram ...
- 一文搞定FastDFS分布式文件系统配置与部署
Ubuntu下FastDFS分布式文件系统配置与部署 白宁超 2017年4月15日09:11:52 摘要: FastDFS是一个开源的轻量级分布式文件系统,功能包括:文件存储.文件同步.文件访问(文件 ...
- Java --- JSP2新特性
自从03年发布了jsp2.0之后,新增了一些额外的特性,这些特性使得动态网页设计变得更加容易.jsp2.0以后的版本统称jsp2.主要的新增特性有如下几个: 直接配置jsp属性 表达式语言(EL) 标 ...
- Java 比较(==, equals, compareTo, compare)
在Java中,有 ==, equals(), compareTo(), compare() 等方法可以比较两个值或对象,比较容易混淆.画了个简单的思维导图总结一下 Java Compares 我经常记 ...
- 【2017-04-24】winform基础、登录窗口、窗口属性
一.winform基础 客户端应用程序:C/S 客户端应用程序可以操作用户电脑中的文件,代码要在用户电脑上执行,吃用户电脑配置. 窗体是由控件和属性做出来的 控件:窗体里所放的东西."视图 ...
- 读APUE分析散列表的使用
最近学习APUE读到避免线程死锁的部分,看到部分源码涉及到避免死锁部分,源码使用了散列表来实现对结构(struct)的存储与查找. 本文不讨论代码中的互斥量部分. #include <stdli ...
- 如何做一个导航栏————浮动跟伪类(hover)事件的应用
我们先说一下伪类选择器的写法: 写法:选择器名称:伪类状态{}4 常见伪类状态: 未访问:link 鼠标移上去:hover 激活选定:active 已访问:visited 获得焦点的时候触发:focu ...
- jQuery修炼心得-DOM节点的插入
1. 内部插入append()与appendTo() append:这个操作与对指定的元素执行原生的appendChild方法,将它们添加到文档中的情况类似. appendTo:实际上,使用这个方法是 ...
- Hadoop集群
你可以用以下三种支持的模式中的一种启动Hadoop集群: 单机模式 伪分布式模式 完全分布式模式 单机模式的操作方法 默认情况下,Hadoop被配置成以非分布式模式运行的一个独立Java进程.这对调试 ...