Hash Table Performance in R: Part I（转）

What Is It?

A hash table, or associative array, is a well known key-value data structure. In R there is no equivalent, but you do have some options. You can use a vector of any type, a list, or an environment.

But as you’ll see with all of these options their performance is compromised in some way. In the average case a lookupash tabl for a key should perform in constant time, or O(1), while in the worst case it will perform in O(n) time,n being the number of elements in the hash table.

For the tests below, we’ll implement a hash table with a few R data structures and make some comparisons. We’ll create hash tables with only unique keys and then perform a search for every key in the table. Here’s our unique random string creator function:

library(plyr)

library(ggplot2)

# Create n unique strings. We use upper and lower case letters only. Create

# length 1 strings first, and if we haven't satisfied the number of strings

# requested, we'll create length 2 strings, and so on until we've created n

# strings.

#

# Returns them in random order

#

unique_strings <- function(n){

    string_i <- 1

    string_len <- 1

    ans <- character(n)

    chars <- c(letters,LETTERS)

    new_strings <- function(len,pfx){

    for(i in 1:length(chars)){

        if (len == 1){

        ans[string_i] <<- paste(pfx,chars[i],sep='')

        string_i <<- string_i + 1

        } else {

        new_strings(len-1,pfx=paste(pfx,chars[i],sep=''))

        }

        if (string_i > n) return ()

    }

    }

    while(string_i <= n){

    new_strings(string_len,'')

    string_len <- string_len + 1

    }

    sample(ans)

}

Using a Vector

All vectors in R can be named, so let’s see how looking up a vector element by name compares to looking up an element by index.

Here’s our fake hash table creator:

# Create a named integer vector size n. Note that elements are initialized to 0

#

fake_hash_table <- function(n){

    ans <- integer(n)

    names(ans) <- unique_strings(n)

    ans

}

Now let’s create hash tables size 2¹⁰ thru 2¹⁵ and search each element:

timings1 <- adply(2^(10:15),.mar=1,.fun=function(i){

    ht <- fake_hash_table(i)

    data.frame(

    size=c(i,i),

    seconds=c(

        system.time(for (j in 1:i)ht[names(ht)[j]]==0L)[3],

        system.time(for (k in 1:i)ht[k]==0L)[3]),

    index = c('1_string','2_numeric')

    )

})

We perform element lookup by using an equality test, so

ht[names(ht)[j]]==0L

performs the lookup using a named string, while

ht[k]==0L

performs the lookup by index.

p1 <- ggplot(timings1,aes(x=size,y=seconds,group=index)) +

    geom_point(aes(color=index)) + geom_line(aes(color=index)) +

    scale_x_log10(

    breaks=2^(10:15),

    labels=c(expression(2^10), expression(2^11),expression(2^12),

        expression(2^13), expression(2^14), expression(2^15))

    ) +

    theme(

    axis.text = element_text(colour = "black")

    ) +

    ggtitle('Search Time for Integer Vectors')

p1

So we see that as the hash size increases the time it takes to search all the keys by name is exponential, whereas searching by index is constant.

And note that we’re talking over 30 seconds to search the 2¹⁵ size hash table, and I’ve got a pretty performant laptop with 8Gb memory, 2.7Ghz Intel processor with SSD drives.

Using a List

List elements can be named, right? They’re also more flexible than integer vectors since they can store any R object. What does their performance look like comparing name lookup to index lookup?

timings2 <- adply(2^(10:15),.mar=1,.fun=function(i){

    strings <- unique_strings(i)

    ht <- list()

    lapply(strings, function(s) ht[[s]] <<- 0L)

    data.frame(

    size=c(i,i),

    seconds=c(

        system.time(for (j in 1:i) ht[[strings[j]]]==0L)[3],

        system.time(for (k in 1:i) ht[[k]]==0L)[3]),

    index = c('1_string','2_numeric')

    )

})

p2 <- ggplot(timings2,aes(x=size,y=seconds,group=index)) +

    geom_point(aes(color=index)) + geom_line(aes(color=index)) +

    scale_x_log10(

    breaks=2^(10:15),

    labels=c(expression(2^10), expression(2^11),expression(2^12),

        expression(2^13), expression(2^14), expression(2^15))

    ) +

    theme(

    axis.text = element_text(colour = "black")

    ) +

    ggtitle('Search Time for Lists')

p2

A little better but still exponential growth with named indexing. We cut our 30 seconds down to over 6 seconds for the large hash table, though.

Let’s see if we can do better.

Using an Environment

R environments store bindings of variables to values. In fact they are well suited to implementing a hash table since internally that’s how they are implemented!

e <- new.env()

# Assigning in the environment the usual way

with(e, foo <- 'bar')

e

## <environment: 0x3e81d48>

ls.str(e)

## foo :  chr "bar"

You can also use environments the same way one would use a list:

e$bar <- 'baz'

ls.str(e)

## bar :  chr "baz"

## foo :  chr "bar"

By default R environments are hashed, but you can also create them without hashing. Let’s evaluate the two:

timings3 <- adply(2^(10:15),.mar=1,.fun=function(i){

    strings <- unique_strings(i)

    ht1 <- new.env(hash=TRUE)

    ht2 <- new.env(hash=FALSE)

    lapply(strings, function(s){ ht1[[s]] <<- 0L; ht2[[s]] <<- 0L;})

    data.frame(

    size=c(i,i),

    seconds=c(

        system.time(for (j in 1:i) ht1[[strings[j]]]==0L)[3],

        system.time(for (k in 1:i) ht2[[strings[k]]]==0L)[3]),

    envir = c('2_hashed','1_unhashed')

    )

})

Note that we’re performing the lookup by named string in both cases. The only difference is that ht1 is hashed and ht2 is not.

p3 <- ggplot(timings3,aes(x=size,y=seconds,group=envir)) +

    geom_point(aes(color=envir)) + geom_line(aes(color=envir)) +

    scale_x_log10(

    breaks=2^(10:15),

    labels=c(expression(2^10), expression(2^11),expression(2^12),

        expression(2^13), expression(2^14), expression(2^15))

    ) +

    theme(

    axis.text = element_text(colour = "black")

    ) +

    ggtitle('Search Time for Environments')

p3

Bam! See that blue line? That’s near constant time for searching the entire 2¹⁵ size hash table!

An interesting note about un-hashed environments: they are implemented with lists underneath the hood! You’d expect the red line of the above plot to mirror the red line of the list plot, but they differ ever so slightly in performance, with lists being faster.

In Conclusion

When you need a true associative array indexed by string, then you definitely want to use R hashed environments. They are more flexible than vectors as they can store any R object, and they are faster than lists since they are hashed. Plus you can work with them just like lists.

But using environments comes with caveats. I’ll explain in Part II.

转自：http://jeffreyhorner.tumblr.com/post/114524915928/hash-table-performance-in-r-part-i