uniform sampling without replacement algorithm

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

uniform sampling without replacement algorithm

Pavel S. Ruzankin
Let us consider the current uniform sampling without replacement
algorithm. It resides in function do_sample in
https://svn.r-project.org/R/trunk/src/main/random.c
Its complexity is obviously O(n), where the sample is selected from
1...n, since the algorithm has to create a vector of length n. So when
the sample size is much lesser than n, the algorithm is not effective.
Algorithms with average complexity O(s log s), were s is the sample
size, were described long ago. E.g. see
https://www.degruyter.com/view/j/mcma.1999.5.issue-1/mcma.1999.5.1.39/mcma.1999.5.1.39.xml
Here the Tree algorithm has complexity O(s log s). I suppose that there
may be algorithms with complexity close to s. Is somebody planning to
implement some more effective algorithm?

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: uniform sampling without replacement algorithm

Pavel S. Ruzankin
If somebody is interested I can write the code. But somebody else has to
add the code for handling int / long int / double cases, since I do not
have enough experience in that.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: uniform sampling without replacement algorithm

Pavel S. Ruzankin
In reply to this post by Pavel S. Ruzankin
See also:
P. Gupta, G. P. Bhattacharjee. (1984) An efficient algorithm for random
sampling without replacement. International Journal of Computer
Mathematics 16:4, pages 201-209.
http://dx.doi.org/10.1080/00207168408803438

Teuhola, J. and Nevalainen, O. 1982. Two efficient algorithms for random
sampling without replacement. /IJCM/, 11(2): 127–140.
http://dx.doi.org/10.1080/00207168208803304

In the latter paper the authors claim that their algorithms have O(s)
complexity. I doubt that this statement is correct. Is it?


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: uniform sampling without replacement algorithm

Radford Neal
In reply to this post by Pavel S. Ruzankin
> From: "Pavel S. Ruzankin" <[hidden email]>

> Let us consider the current uniform sampling without replacement
> algorithm. It resides in function do_sample in
> https://svn.r-project.org/R/trunk/src/main/random.c
> Its complexity is obviously O(n), where the sample is selected from
> 1...n, since the algorithm has to create a vector of length n. So when
> the sample size is much lesser than n, the algorithm is not effective.


Recent R Core implementations have a faster hashing scheme, but it is
enabled by default only for very large n, since it produces a
different result from the old, simple algorithm, and so (for example)
re-running an old analysis might give different results if it were
enabled.

The development version of pqR (at https://github.com/radfordneal/pqR
with the latest development branch being "73") implements a fast
hashing scheme that produces the same result as the old method, and so
is always enabled (when it would seem advantageous).  Here is the core
part of the code used (which calls a few pqR internal routines not in
R Core versions):


    /******  Copyright 2017 by Radford M. Neal, GPL 2 or later  ******/

/* Equal probability sampling; without-replacement case.

   This version is written to produce the same result as earlier versions,
   in which the algorithm was as follows (with x being temporary storage,
   and y being the result):

        for (i = 0; i < n; i++)
            x[i] = i;
        for (i = 0; i < k; i++) {
            j = n * unif_rand();
            y[i] = x[j] + 1;
            x[j] = x[--n];
        }

   When k <= 2, special code is used, for speed.

   When n is small or k is not much smaller than n, a modification of the
   above algorithm is used, which avoids the need for temporary storage - the
   result is allocated as of length n, and then has its length reduced to k
   (usually with no copy being done).

   When k is much smaller than n, and n is not small, a hashing scheme is
   used, in which hash entries record which elements of x in the above
   algorithm would have been modified from their original setting in which
   x[i] == i.
 */

static SEXP SampleNoReplace (int k, int n)
{
    SEXP r;

    if (k <= 2) {

        /* Special code for k = 0, 1, or 2, mimicing effect of previous code. */

        if (k == 0)
            return allocVector(INTSXP,0);

        int i1 = 1 + (int) (n * unif_rand());
        if (k == 1)
            return ScalarInteger (i1);

        int i2 = 1 + (int) ((n-1) * unif_rand());
        if (i2 == i1) i2 = n;
        r = allocVector(INTSXP,2);
        INTEGER(r)[0] = i1;
        INTEGER(r)[1] = i2;
    }
    else if (n < 100 || k > 0.6*n) {

        /* Code similar to previous method, but with temporary storage avoided.
           This reqires storing the initial sequence in decreasing rather than
           increasing order, and picking elements from the tail rather than the
           head, so that the space no longer used after each choice can hold the
           result, at the front of the vector.  Note:  Unlike the previous
           code, the indexes in the sequences are from 1 to n, not 0 to n-1. */

        r = allocVector(INTSXP,n);
        int *y = INTEGER(r);
        int i;
        for (i = 0; i < n; i++)
            y[i] = n-i;
        for (i = 0; i < k; i++) {
            int j = n - 1 - (int) ((n-i) * unif_rand());
            int t = y[j];
            y[j] = y[i];
            y[i] = t;
        }
        if (k < n)
            r = reallocVector(r,k,1);
    }

    else {

        /* Hash table implementation, producing same result as previous code.
           Mimics previous code by using a hash table to record how 'x' would
           have been changed.  At each iteration, it looks up x[j] in the
           hash table (j from 1 up), taking its value to be j if it is not in
           the table, and using this value as the next sampled value.  Also
           lookups up x[n-i], which is taken to be n-i if not present, and
           replaces/creates the entry for x[j] as having value x[n-i].  The
           hash table is non-chaining, with linear search. */

        /* Decide on the size of the hash table. */

        unsigned tblsize, mintblsize;
        mintblsize = 1.5 * k;
        tblsize = 32;
        while (tblsize < 0x80000000U && tblsize < mintblsize)
            tblsize <<= 1;
        unsigned tblmask = tblsize - 1;

        /* Allocate hash table, as auto variable if small, else with R_alloc. */

        struct tblentry { int pos, val; } *tbl;
        struct tblentry local [ tblsize < 1000 ? tblsize : 1 ];
        void *vmax = VMAXGET();
        tbl = tblsize < 1000 ? local
                             : (struct tblentry *) R_alloc(tblsize,sizeof *tbl);

        /* Clear all entries to zero.  Non-empty pos values start at 1. */

        memset (tbl, 0, tblsize * sizeof *tbl);

        /* Allocate vector to hold result. */

        r = allocVector(INTSXP,k);
        int *y = INTEGER(r);

        /* Do the sampling as described above. */

        int i;
        for (i = 0; i < k; i++) {
            int j = 1 + (int) ((n-i) * unif_rand());
            unsigned h;
            for (h = j & tblmask; ; h = (h+1) & tblmask) {
                if (tbl[h].pos == 0) {
                    y[i] = j;
                    break;
                }
                if (tbl[h].pos == j) {
                    y[i] = tbl[h].val;
                    break;
                }
            }
            unsigned h2;
            for (h2 = (n-i) & tblmask; ; h2 = (h2+1) & tblmask) {
                if (tbl[h2].pos == 0) {
                    tbl[h].val = n-i;
                    break;
                }
                if (tbl[h2].pos == n-i) {
                    tbl[h].val = tbl[h2].val;
                    break;
                }
            }
            tbl[h].pos = j;  /* don't set until after search for entry n-i */
        }

        VMAXSET(vmax);
    }

    return r;
}


This will be in a new version of pqR that will have many other performance
improvements as well, which I expect to release in a few weeks.

   Radford Neal

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: uniform sampling without replacement algorithm

R devel mailing list
In reply to this post by Pavel S. Ruzankin
Splus used a similar method for sampling from "bigdata" objects.  One
problem was that sample() is used both for creating a sample and for
scrambling the order of a vector.  Scrambling the order of a big vector
wastes time.  It would be nice to be able to tell sample() that we don't
care about the order.


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Oct 17, 2017 at 10:55 AM, Pavel S. Ruzankin <[hidden email]>
wrote:

> Let us consider the current uniform sampling without replacement
> algorithm. It resides in function do_sample in
> https://svn.r-project.org/R/trunk/src/main/random.c
> Its complexity is obviously O(n), where the sample is selected from 1...n,
> since the algorithm has to create a vector of length n. So when the sample
> size is much lesser than n, the algorithm is not effective. Algorithms with
> average complexity O(s log s), were s is the sample size, were described
> long ago. E.g. see
> https://www.degruyter.com/view/j/mcma.1999.5.issue-1/mcma.
> 1999.5.1.39/mcma.1999.5.1.39.xml
> Here the Tree algorithm has complexity O(s log s). I suppose that there
> may be algorithms with complexity close to s. Is somebody planning to
> implement some more effective algorithm?
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: uniform sampling without replacement algorithm

Pavel S. Ruzankin
In reply to this post by Radford Neal
Thank you for your answer. Certainly, hash table must be faster than
binary tree.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: uniform sampling without replacement algorithm

Pavel S. Ruzankin
In reply to this post by R devel mailing list
The binary tree algorithm does not need additional scrambling. I have
written the R code for the algorithm in the last answer at:
https://stackoverflow.com/questions/311703/algorithm-for-sampling-without-replacement/46807110#46807110

However, the algorithm will probably be outperformed by hash table
algorithms for relatively large sample sizes.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel