This is a re-implementation of R's hashed environments, the global
variable cache, the global string cache, and the symbol table with
cache-conscious array hash tables. The results are quite encouraging.
However, the implementation is a big departure from R's API:
"An array hash is a cache-conscious data structure that takes
advantage of hardware prefetchers for improved performance on large
hash tables: those small enough to fit in main memory but too large
for the fast, fixed-size CPU caches.
However, their implementation is a radical departure from standard
chained hash tables. Rather than using chains of hash buckets for
collision resolution, array hashes use segments of contiguous memory
called dynamic arrays to store keys and values. Adding and deleting
items from the hash involve copying the entire segment to new areas in
memory. While this may seem wasteful and slow, it's surprisingly
efficient in both time and space.
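The copy-on-insert layout described above can be sketched roughly as
follows. This is a toy Python stand-in for illustration only (the
`ArrayHash` name and the two-byte length prefix are my assumptions,
not the R-Array-Hash code): each slot is one contiguous buffer of
length-prefixed keys, a lookup is a linear scan over adjacent bytes
rather than a pointer chase down a chain, and an insertion copies the
whole segment into fresh storage.

```python
# Toy array hash: each slot holds ONE contiguous segment of
# length-prefixed keys. (Illustrative sketch, not R-Array-Hash.)

class ArrayHash:
    def __init__(self, nslots=8):
        self.slots = [b""] * nslots   # one contiguous segment per slot

    def _slot(self, key: bytes) -> int:
        return hash(key) % len(self.slots)

    def insert(self, key: bytes) -> None:
        if self.contains(key):
            return
        i = self._slot(key)
        # Copy the entire segment plus the new entry into fresh
        # contiguous memory -- the "copy the entire segment" step.
        self.slots[i] = (self.slots[i]
                         + len(key).to_bytes(2, "little") + key)

    def contains(self, key: bytes) -> bool:
        seg, pos = self.slots[self._slot(key)], 0
        while pos < len(seg):       # linear scan over adjacent memory
            n = int.from_bytes(seg[pos:pos + 2], "little")
            if seg[pos + 2:pos + 2 + n] == key:
                return True
            pos += 2 + n
        return False
```

The scan looks wasteful next to a chained table, but in C every probe
touches memory the hardware prefetcher has already pulled in, which is
the point of the design.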
In R, hashed environments are implemented using lists with each list
element (a CONS cell) acting as the hash bucket. The CONS cell is the
binding agent for a symbol and value. Hashed environments are searched
using the pointer address of the symbol rather than the symbol's name.
R-Array-Hash takes advantage of this by implementing an integer array
hash to store addresses of symbols and their associated values. Care
is also taken to account for whether or not a binding is locked.
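The pointer-keyed binding idea can be illustrated like this (a
hypothetical Python sketch, with `id()` standing in for a C pointer
address; the `SymbolTable` name and its methods are my invention, not
the R-Array-Hash API):

```python
# Bindings keyed by the symbol's memory address rather than its name,
# with a per-binding locked flag. (Illustrative sketch only.)

class SymbolTable:
    def __init__(self):
        self._bind = {}              # address -> (value, locked)

    def set(self, sym, value):
        addr = id(sym)               # the pointer address is the key
        _, locked = self._bind.get(addr, (None, False))
        if locked:
            raise ValueError("cannot change value of locked binding")
        self._bind[addr] = (value, False)

    def get(self, sym):
        return self._bind[id(sym)][0]

    def lock(self, sym):
        value, _ = self._bind[id(sym)]
        self._bind[id(sym)] = (value, True)
```

Keying on the address means a lookup is a single integer hash and
compare, with no string comparison at all.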
Similarly, R-Array-Hash re-implements R's string cache using a string
array hash. This introduces the most radical change to R's API: CHAR()
no longer returns an address that points to the area at the end of the
SEXP (containing the string value). Rather it returns an address
located in one of the contiguous dynamic arrays of the string hash
table. Therefore, care must be taken in C code to use the address
immediately since additions and deletions to the string hash could
render the result of CHAR() useless. There are many areas of the code
that sidestep this by calling translateChar(), which has been changed
to always copy the string pointed to by CHAR()."
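The CHAR() hazard can be modeled with a small Python stand-in (a
hypothetical sketch, not the real R API or R-Array-Hash code): a
handle is resolved against the cache's *current* backing segment, and
every insertion rebuilds that segment in fresh memory. In Python the
old object survives, but in C it would be freed, so the safe pattern
is to copy the characters immediately, as the modified
translateChar() does.

```python
# Python stand-in for the string-cache relocation hazard.
# (Illustrative sketch only; names are invented.)

class StringCache:
    def __init__(self):
        self._segment = b""          # one contiguous segment of strings
        self._offsets = {}           # string -> (start, length) handle

    def intern(self, s: bytes):
        """Add s, rebuilding the segment in fresh memory; return a handle."""
        if s not in self._offsets:
            self._offsets[s] = (len(self._segment), len(s))
            self._segment = self._segment + s   # copy to new storage
        return self._offsets[s]

    def char_at(self, handle):
        """Analogue of CHAR(): a view into the *current* segment."""
        start, n = handle
        return memoryview(self._segment)[start:start + n]

cache = StringCache()
h = cache.intern(b"hello")
safe = bytes(cache.char_at(h))   # copy immediately: always valid
raw = cache.char_at(h)           # raw view tied to today's segment
cache.intern(b"world")           # rebuilds the segment in new memory
# `raw` still reads the old, now-replaced segment; in C this would be
# a dangling pointer into freed memory.
```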
Nice writeup and promising idea. From the "gimme numbers" department:
- do you pass the R regression tests?
- what sort of speedups do you see on which type of benchmarks?
When you asked about benchmark code on Twitter, I shared the somewhat
well-known (but no R ...) http://benchmarksgame.alioth.debian.org/
Did you write new benchmarks? Did you try the ones once assembled by Simon?
> When you asked about benchmark code on Twitter, I shared the somewhat
> well-known (but no R ...) http://benchmarksgame.alioth.debian.org/
> Did you write new benchmarks? Did you try the ones once assembled by Simon?
I decided to design the benchmark very close to the one I found in:
Askitis, Nikolas, and Justin Zobel. "Redesigning the string hash
table, burst trie, and bst to exploit cache." Journal of Experimental
Algorithmics (JEA) 15 (2010): 1-7.
It's a synthetic benchmark that just measures aspects of constructing
and searching an R environment.
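The shape of that kind of benchmark, as I read Askitis and Zobel, is
roughly the following (a minimal Python sketch of the style of
measurement, not the author's actual harness; key count, key length,
and the plain dict are all my assumptions):

```python
# Askitis-Zobel style synthetic benchmark: time a bulk construction
# phase, then a search phase over the same keys. (Sketch only.)

import random
import string
import time

def make_keys(n, length=8, seed=1):
    """Generate up to n distinct random lowercase keys."""
    rng = random.Random(seed)
    return list({"".join(rng.choices(string.ascii_lowercase, k=length))
                 for _ in range(n)})

def bench(keys):
    t0 = time.perf_counter()
    table = {k: True for k in keys}            # construction phase
    t1 = time.perf_counter()
    hits = sum(k in table for k in keys)       # search phase
    t2 = time.perf_counter()
    return hits, t1 - t0, t2 - t1

keys = make_keys(100_000)
hits, build_s, search_s = bench(keys)
```

Swapping the dict for the structure under test and varying the key
count is what exposes the cache effects the paper is about.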
On Fri, Mar 6, 2015 at 10:03 AM, Jeffrey Horner
<[hidden email]> wrote:
> On Fri, Mar 6, 2015 at 9:36 AM, Dirk Eddelbuettel <[hidden email]> wrote:
>> When you asked about benchmark code on Twitter, I shared the somewhat
>> well-known (but no R ...) http://benchmarksgame.alioth.debian.org/
>> Did you write new benchmarks? Did you try the ones once assembled by Simon?
I added some runs of that benchmark, which you can view here: