millions of comparisons, speed wanted


millions of comparisons, speed wanted

dusadrian

Dear all,

I have a 10-column matrix with 2^10 = 1024 unique rows, containing values 0
and 1.
What I would like to do is a little complicated; in a simple statement: for a
subset (say 1,000 rows) I want to perform pairwise comparisons between all
rows, find the pairs of rows that differ in exactly one value, replace that
value with "x", discard the pairs that differ in more than one value, and
repeat the algorithm until no further minimization is possible. For instance,
the rows 0 1 0 1 1 and 0 0 0 1 1 differ only in their second value, so the
pair is minimized to the single row 0 x 0 1 1. Any row that has not been
minimized is kept for the next iteration.

For 1,000 rows there are almost 500,000 pairs, but in later iterations I
could get some 5,000 rows, which generates about 12.5 million pairs, and that
takes a _very_ long time.

The code I have written (10 lines, below) is very fast thanks to
vectorization, but only for a moderate number of rows. I am searching for:
- ways to improve my code (if possible)
- ideas: write C code for the slow parts? use MySQL? other approaches?

As a toy example, given an input matrix called "input", my algorithm looks
like this:

## code start
ncolumns <- 6
input <- bincombinations(ncolumns)  # from package e1071
# subset, let's say 97% of the rows
input <- input[sample(2^ncolumns, round(2^ncolumns * 0.97, 0)), ]
minimized <- 1  # non-zero, so the loop is entered at least once

while (sum(minimized) > 0) {

   minimized <- logical(nrow(input))

   to.be.compared <- combn2(1:nrow(input))  # from package combinat

   # the following line takes _a lot_ of time for millions of comparisons
   logical.result <- apply(to.be.compared, 1,
                           function(idx) input[idx[1], ] == input[idx[2], ])

   # pairs of rows that differ in exactly one position
   compare.minimized <- which(colSums(!logical.result) == 1)

   logical.result <- logical.result[, compare.minimized, drop = FALSE]

   result <- sapply(compare.minimized,
                    function(idx) input[to.be.compared[idx, 1], ])

   # mark the single differing position with "x"
   result[!logical.result] <- "x"

   minimized[unique(as.vector(to.be.compared[compare.minimized, ]))] <- TRUE

   if (sum(minimized) > 0) {
      input <- rbind(input[!minimized, ], unique(t(result)))
   }
}
## code end

Any suggestion is welcome; thank you very much in advance.
Adrian

--
Adrian DUSA
Romanian Social Data Archive
1, Schitu Magureanu Bd
050025 Bucharest sector 5
Romania
Tel./Fax: +40 21 3126618 \
          +40 21 3120210 / int.101


Re: millions of comparisons, speed wanted

Liaw, Andy
Just an untested idea:

If the data are all 0/1, you could use dist(input, method = "manhattan") and
then check which entries equal 1.  This should be much faster than creating
all pairs of rows and checking position by position.
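Something like this (untested; input as in your code):

d <- as.matrix(dist(input, method = "manhattan"))  # for 0/1 rows, this is
                                                   # the count of differing values
pairs <- which(d == 1, arr.ind = TRUE)   # row pairs differing in exactly one value
pairs <- pairs[pairs[, 1] < pairs[, 2], , drop = FALSE]   # keep each pair once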

HTH,
Andy



Re: millions of comparisons, speed wanted

Adrian Dusa-2
Dear Andy,

On Thursday 15 December 2005 20:57, Liaw, Andy wrote:
> Just an untested idea:
> If the data are all 0/1, you could use dist(input, method = "manhattan") and
> then check which entries equal 1.  This should be much faster than creating
> all pairs of rows and checking position by position.

Thanks for the idea; I played a little with it. At the beginning, yes, the
data are all 0/1, but during the minimization iterations there are also "x"
values; for example, comparing:
0 1 0 1 1
0 0 0 1 1
should return
0 "x" 0 1 1

whereas
0 "x" 0 1 1
0 0 0 1 1
shouldn't even be compared (they contain different numbers of 0/1 values).

Replacing "x" with NA in dist is not yielding results either, as with
NA 0 0 1 1
0 0 0 1 1
dist returns 0.
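A small snippet reproduces this (dist() excludes the NA column and rescales
the remaining sum, so the distance stays 0):

m <- rbind(c(NA, 0, 0, 1, 1),
           c( 0, 0, 0, 1, 1))
dist(m, method = "manhattan")   # 0 -- the NA position is simply dropped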

I even wanted to see whether I could tweak the dist() code, but it calls a C
routine and I gave up.

Nice idea anyhow, maybe I'll find a way to use it further.
Best,
Adrian



Re: millions of comparisons, speed wanted

Martin Maechler
I have not taken the time to look into this example,
but
        daisy()
from the (recommended, hence part of R) package 'cluster'
is more flexible than dist(), particularly in the case of NAs
and for (a mixture of continuous and) categorical variables.

It uses a version of Gower's formula in order to deal with NAs
and asymmetric binary variables.  The example above looks like
a very good match for this problem.

Regards,
Martin Maechler, ETH Zurich




Re: millions of comparisons, speed wanted

dusadrian
The daisy() function is _very_ good!
I have been able to use it for nominal variables as well, simply by:
daisy(input) * ncol(input)
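As a sketch (here I treat "x" as just another factor level, and assume that
rows are first grouped by their number of "x" values, as the algorithm
requires):

library(cluster)
# each column becomes a nominal (factor) variable, with "x" as a level
input.f <- as.data.frame(lapply(as.data.frame(input), factor))
# Gower's coefficient for factors is the proportion of mismatching columns,
# so multiplying by ncol gives the number of differing values
d <- as.matrix(daisy(input.f)) * ncol(input.f)
pairs <- which(d == 1, arr.ind = TRUE)   # pairs differing in exactly one value
pairs <- pairs[pairs[, 1] < pairs[, 2], , drop = FALSE]   # keep each pair once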

Now, for a very large number of rows (say 5,000), daisy() runs for about 3
minutes, using swap space; I probably need more RAM (only 512 MB on my
computer). But at least I get a result... :)

For relatively small input matrices, it increased the speed by a factor of 3.
Way to go!

Best,
Adrian


