Dear all,

I have a 10-column matrix of 0/1 values, which has 2^10 = 1024 unique rows.
What I would like to do is a little complicated; in a simple statement: for a
subset (say 1,000 rows) I want to perform pairwise comparisons between all
rows, find out which rows differ by only one value, replace that value by
"x", get rid of the comparisons that differ by more than one value, and
repeat the algorithm until no further minimization is possible. Any row that
hasn't been minimized is kept for the next iteration.

For 1,000 rows there are almost 500,000 pairs, but in the next iterations I
could get some 5,000 rows, which generates something like 12.5 million
pairs, and that takes a _very_ long time.

The code I have created (10 lines, below) is super fast (using
vectorization), but only for a reasonable number of rows. I am searching for:
- ways to improve my code (if possible)
- ideas: write C code for the slow parts of the code? use MySQL? other ways?

As a toy example, having an input matrix called "input", my algorithm looks
like this:

## code start
ncolumns <- 6
input <- bincombinations(ncolumns)   # from package e1071

# subset, let's say 97% of rows
input <- input[sample(2^ncolumns, round(2^ncolumns * 0.97, 0)), ]

minimized <- 1

while (sum(minimized) > 0) {

    minimized <- logical(nrow(input))

    to.be.compared <- combn2(1:nrow(input))   # from package combinat

    # the following line takes _a lot_ of time, for millions of comparisons
    logical.result <- apply(to.be.compared, 1, function(idx)
        input[idx[1], ] == input[idx[2], ])

    compare.minimized <- which(colSums(!logical.result) == 1)

    logical.result <- logical.result[, compare.minimized]

    result <- sapply(compare.minimized, function(idx)
        input[to.be.compared[idx, 1], ])

    result[!logical.result] <- "x"

    minimized[unique(as.vector(to.be.compared[compare.minimized, ]))] <- TRUE

    if (sum(minimized) > 0) {
        input <- rbind(input[!minimized, ], unique(t(result)))
    }
}
## code end

Any suggestion is welcomed; thank you very much in advance.
Adrian

--
Adrian DUSA
Romanian Social Data Archive
1, Schitu Magureanu Bd
050025 Bucharest sector 5
Romania
Tel./Fax: +40 21 3126618 \
          +40 21 3120210 / int.101

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
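The single-difference rule at the heart of the algorithm can be sketched on one pair of rows (the values here are made-up toy data, not taken from the actual input):

```r
# two toy 0/1 rows that differ in exactly one position
a <- c(0, 1, 0, 1, 1)
b <- c(0, 0, 0, 1, 1)

if (sum(a != b) == 1) {
    merged <- a
    merged[a != b] <- "x"   # replace the single differing value by "x"
}
merged   # "0" "x" "0" "1" "1" -- coerced to character by the "x"
```

The posted algorithm applies exactly this test to every pair, which is why the cost grows quadratically with the number of rows.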
Just an untested idea:

If the data are all 0/1, you could use dist(input, method = "manhattan") and
then check which entries equal 1. This should be much faster than creating
all pairs of rows and checking position by position.

HTH,
Andy

From: Adrian DUSA
> Dear all,
>
> I have a 10-column matrix of 0/1 values, which has 2^10 = 1024 unique
> rows. What I would like to do is a little complicated; in a simple
> statement: for a subset (say 1,000 rows) I want to perform pairwise
> comparisons between all rows, find out which rows differ by only one
> value, replace that value by "x", get rid of the comparisons that
> differ by more than one value, and repeat the algorithm until no
> further minimization is possible. [...]
>
> For 1,000 rows there are almost 500,000 pairs, but in the next
> iterations I could get some 5,000 rows, which generates something like
> 12.5 million pairs, and that takes a _very_ long time. [...]
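A minimal sketch of this idea, assuming a 0/1 matrix named `input` as in the original post:

```r
# pairwise manhattan distances; for 0/1 rows this equals the number of
# differing positions (the Hamming distance)
d <- as.matrix(dist(input, method = "manhattan"))
d[lower.tri(d, diag = TRUE)] <- NA      # keep each unordered pair once

# index pairs of rows that differ in exactly one column
pairs <- which(d == 1, arr.ind = TRUE)
```

This replaces the expensive apply() over all row pairs with a single call into compiled C code.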
Dear Andy,
On Thursday 15 December 2005 20:57, Liaw, Andy wrote:
> Just an untested idea:
> If the data are all 0/1, you could use dist(input, method="manhattan"),
> and then check which entries equal 1. This should be much faster than
> creating all pairs of rows and checking position by position.

Thanks for the idea, I played a little with it. At the beginning, yes, the
data are all 0/1, but during the minimizing iterations there are also "x"
values. For example, comparing

0  1  0 1 1
0  0  0 1 1

should return

0 "x" 0 1 1

whereas

0 "x" 0 1 1
0  0  0 1 1

shouldn't even be compared (they have a different number of "x" values).
Replacing "x" with NA in dist() does not yield results either, as with

NA 0 0 1 1
 0 0 0 1 1

dist() returns 0. I even wanted to see if I could tweak the dist code, but
it calls a C routine and I gave up.

Nice idea anyhow; maybe I'll find a way to use it further.

Best,
Adrian
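The NA behaviour described above is easy to reproduce: dist() excludes missing positions and scales the remaining ones up proportionally, so a difference hidden behind a single NA simply disappears:

```r
# the NA example rows from above
x <- rbind(c(NA, 0, 0, 1, 1),
           c( 0, 0, 0, 1, 1))

# the NA column is dropped; all remaining columns match, so the
# manhattan distance comes out as 0
dist(x, method = "manhattan")
```

This is why NA cannot stand in for "x" with plain dist().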
I have not taken the time to look into this example, but daisy() from the
(recommended, hence part of R) package 'cluster' is more flexible than
dist(), particularly in the case of NAs and for (a mixture of continuous
and) categorical variables.

It uses a version of Gower's formula in order to deal with NAs and
asymmetric binary variables. The example quoted below looks like a very
good match to this problem.

Regards,
Martin Maechler, ETH Zurich

>>>>> "Adrian" == Adrian DUSA <[hidden email]>
>>>>>     on Thu, 15 Dec 2005 22:04:01 +0200 writes:

    Adrian> Thanks for the idea, I played a little with it. At the
    Adrian> beginning, yes, the data are all 0/1, but during the
    Adrian> minimizing iterations there are also "x" values [...]
    Adrian> Replacing "x" with NA in dist() does not yield results
    Adrian> either [...] I even wanted to see if I could tweak the dist
    Adrian> code, but it calls a C routine and I gave up. [...]
The daisy function is _very_ good!
I have been able to use it for nominal variables as well, simply by:

daisy(input) * ncol(input)

Now, for a very large number of rows (say 5,000), daisy() runs for about
3 minutes, using swap space. I probably need more RAM (only 512 MB on my
computer), but at least I get a result... :)

For relatively small input matrices, it increased the speed by a factor
of 3. Way to go!

Best,
Adrian

On 12/16/05, Martin Maechler <[hidden email]> wrote:
> I have not taken the time to look into this example, but daisy() from
> the (recommended, hence part of R) package 'cluster' is more flexible
> than dist(), particularly in the case of NAs and for (a mixture of
> continuous and) categorical variables.
>
> It uses a version of Gower's formula in order to deal with NAs and
> asymmetric binary variables. [...]
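A small sketch of the daisy() trick above, using the two example rows from earlier in the thread, stored as factors so that daisy() treats them as nominal variables and applies Gower's coefficient:

```r
library(cluster)   # daisy()

# the earlier example rows, coded as factors (nominal variables)
input <- data.frame(lapply(as.data.frame(
    rbind(c(0, 1, 0, 1, 1),
          c(0, 0, 0, 1, 1))), factor))

d <- daisy(input)   # Gower dissimilarity, in [0, 1]
d * ncol(input)     # 1: the rows differ in exactly one position
```

Since Gower's coefficient for nominal variables is the mean of per-column mismatch indicators, multiplying by the number of columns recovers the count of differing positions, which is exactly the quantity the minimization algorithm filters on.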