# millions of comparisons, speed wanted

## millions of comparisons, speed wanted

Dear all,

I have a 10-column matrix which has 2^10 = 1024 unique rows, with values 0 and 1. What I would like to do is a little complicated; in a simple statement: for a subset (say 1000 rows) I want to perform pairwise comparisons between all rows, find out which rows differ by only one value, replace that value by "x", discard the comparisons that differ by more than one value, and repeat the algorithm until no further minimization is possible. Any row that hasn't been minimized is kept for the next iteration.

For 1,000 rows there are almost 500,000 pairs, but in the next iterations I could get some 5,000 rows, which generates something like 12.5 million pairs, and that takes a _very_ long time.

The code I have created (10 lines, below) is super fast (using vectorization), but only for a reasonable number of rows. I am searching for:

- ways to improve my code (if possible)
- ideas: create C code for the slow parts of the code? use MySQL? other ways?

As a toy example, having an input matrix called "input", my algorithm looks like this:

```r
## code start
ncolumns <- 6
input <- bincombinations(ncolumns)  # from package e1071
# subset, let's say 97% of rows
input <- input[sample(2^ncolumns, round(2^ncolumns * 0.97, 0)), ]
minimized <- 1
while (sum(minimized) > 0) {
   minimized <- logical(nrow(input))
   to.be.compared <- combn2(1:nrow(input))  # from package combinat
   # the following line takes _a lot_ of time, for millions of comparisons
   logical.result <- apply(to.be.compared, 1,
                           function(idx) input[idx[1], ] == input[idx[2], ])
   compare.minimized <- which(colSums(!logical.result) == 1)
   logical.result <- logical.result[, compare.minimized]
   result <- sapply(compare.minimized,
                    function(idx) input[to.be.compared[idx, 1], ])
   result[!logical.result] <- "x"
   minimized[unique(as.vector(to.be.compared[compare.minimized, ]))] <- TRUE
   if (sum(minimized) > 0) {
      input <- rbind(input[!minimized, ], unique(t(result)))
   }
}
## code end
```

Any suggestion is welcome; thank you very much in advance.
Adrian

--
Adrian DUSA
Romanian Social Data Archive
1, Schitu Magureanu Bd
050025 Bucharest sector 5
Romania
Tel./Fax: +40 21 3126618 \
          +40 21 3120210 / int.101

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
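One direction the per-pair `apply()` loop could be avoided entirely is with matrix products: for a 0/1 matrix, the number of agreeing positions between rows i and j is `A %*% t(A) + (1-A) %*% t(1-A)`, so the number of differing positions falls out in one shot. This is a hypothetical sketch, not code from the thread; the names `agree`, `differ`, and `one.off` are my own:

```r
# Hypothetical vectorized alternative to the per-pair apply() loop.
# For a 0/1 matrix A with k columns, rows i and j agree in a position
# when both are 1 (counted by tcrossprod(A)) or both are 0 (counted by
# tcrossprod(1 - A)); the number of differing positions is k minus that.
set.seed(1)
k <- 6
A <- matrix(rbinom(20 * k, 1, 0.5), ncol = k)
agree  <- tcrossprod(A) + tcrossprod(1 - A)  # positions where rows agree
differ <- k - agree                          # positions where rows differ
# row pairs that differ in exactly one position (upper triangle only,
# so each pair appears once)
one.off <- which(differ == 1 & upper.tri(differ), arr.ind = TRUE)
```

This computes all pairwise difference counts with two matrix multiplications instead of millions of element-by-element comparisons, which is where most of the time in the posted code goes.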

## Re: millions of comparisons, speed wanted

Just some untested idea: if the data are all 0/1, you could use dist(input, method = "manhattan") and then check which entries equal 1. This should be much faster than creating all pairs of rows and checking position by position.

HTH,
Andy

From: Adrian DUSA
> Dear all,
>
> I have a 10-column matrix which has 2^10 = 1024 unique rows, with
> values 0 and 1. What I would like to do is a little complicated [...]
>
> Any suggestion is welcome; thank you very much in advance.
> Adrian
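Andy's suggestion can be sketched as follows. This is a minimal illustration of the idea, not code from the thread: for a 0/1 matrix, the Manhattan distance between two rows equals the number of positions in which they differ, so pairs at distance exactly 1 are the candidates for minimization. The names `dm` and `one.off` are my own:

```r
# Sketch of the dist()-based idea: Manhattan distance on a 0/1 matrix
# counts differing positions, so distance 1 == rows differing in one value.
input <- rbind(c(0, 1, 0, 1, 1),
               c(0, 0, 0, 1, 1),
               c(1, 1, 1, 0, 0))
dm <- as.matrix(dist(input, method = "manhattan"))
# upper triangle only, so each pair is reported once, as (row, col) indices
one.off <- which(dm == 1 & upper.tri(dm), arr.ind = TRUE)
```

Here rows 1 and 2 differ only in the second column, so `one.off` picks out that single pair.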

## Re: millions of comparisons, speed wanted

Dear Andy,

On Thursday 15 December 2005 20:57, Liaw, Andy wrote:
> Just some untested idea:
> If the data are all 0/1, you could use dist(input, method="manhattan"), and
> then check which entry equals 1. This should be much faster than creating
> all pairs of rows and check position-by-position.

Thanks for the idea, I played a little with it. At the beginning, yes, the data are all 0/1, but during the minimizing iterations there are also "x" values. For example, comparing

    0 1 0 1 1
    0 0 0 1 1

should return

    0 "x" 0 1 1

whereas

    0 "x" 0 1 1
    0 0 0 1 1

shouldn't even be compared (they contain a different number of "x" values). Replacing "x" with NA doesn't yield results with dist either: for

    NA 0 0 1 1
    0 0 0 1 1

dist returns 0. I even wanted to see if I could tweak the dist code, but it calls a C routine and I gave up. Nice idea anyhow; maybe I'll find a way to use it further.

Best,
Adrian
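The "x" complication above can be handled without touching dist's C code: group rows by how many "x" values they contain, compare only within a group, and count mismatches directly, treating "x" as an ordinary symbol that must match exactly. This is a hypothetical sketch under those assumptions, not from the thread; `mismatches`, `nx`, `pairs`, and `one.off` are names of my own:

```r
# Sketch: compare only rows with the same number of "x" values, counting
# mismatches with "x" treated as a plain symbol that must match exactly.
mismatches <- function(a, b) sum(a != b)
rows <- rbind(c("0", "x", "0", "1", "1"),
              c("0", "x", "0", "1", "0"),
              c("0", "0", "0", "1", "1"))
nx <- rowSums(rows == "x")  # number of "x" values per row
# candidate pairs: same "x" count, each pair listed once (upper triangle)
pairs <- which(outer(nx, nx, "==") & upper.tri(diag(nrow(rows))),
               arr.ind = TRUE)
# keep only the pairs differing in exactly one position
keep <- apply(pairs, 1,
              function(p) mismatches(rows[p[1], ], rows[p[2], ]) == 1)
one.off <- pairs[keep, , drop = FALSE]
```

In this example rows 1 and 2 (one "x" each) differ only in the last column and survive, while row 3 has no "x" and is never compared against them, matching the rule described above.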