# select rows by criteria

7 messages
Open this post in threaded view
|
Report Content as Inappropriate

## select rows by criteria

 Hello, I am stuck with selecting the right rows from a data frame. I think the problem is rather how to select them then how to implement the R code. Consider the following data frame: df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value = c(34,12,23,25,34,42,48,29,30,27)) What I want to achieve is to select 7 rows (values) so that the mean value of those rows are closest to the value of 35 and the remaining 3 rows (values) are closest to 45. However, each value is only allowed to be sampled once! Any ideas, how to achieve that? Cheers
Open this post in threaded view
|
Report Content as Inappropriate

## Re: select rows by criteria

 Hello, syrvn wrote Hello, I am stuck with selecting the right rows from a data frame. I think the problem is rather how to select them then how to implement the R code. Consider the following data frame: df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value = c(34,12,23,25,34,42,48,29,30,27)) What I want to achieve is to select 7 rows (values) so that the mean value of those rows are closest to the value of 35 and the remaining 3 rows (values) are closest to 45. However, each value is only allowed to be sampled once! Any ideas, how to achieve that? Cheers See ?combn It gives all possible combinations as a matrix (default) or list. Then, 'apply'. #--------------------------- # Name changed to 'DF', # 'df' is the R function for the F distribution density # (and a frequent choice for example data in R-help!) # DF <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),                 value = c(34,12,23,25,34,42,48,29,30,27)) f <- function(j, v, const) abs(mean(v[j]) - const) inxmat <- with(DF, combn(ID, 7)) meansDist1 <- apply(inxmat, 2, function(jnx) f(jnx, DF\$value, 35)) (i1 <- which(meansDist1 == min(meansDist1))) inxmat <- with(DF, combn(ID, 3)) meansDist2 <- apply(inxmat, 2, function(jnx) f(jnx, DF\$value, 45)) (i2 <- which(meansDist2 == min(meansDist2))) meansDist3 <- meansDist1 + meansDist2  # Compromise of both criteria? (i3 <- which(meansDist3 == min(meansDist3))) Maybe it's combn(1:10, 3)[, 101] you want, or maybe there's another way to compromise the two criteria. Hope this helps, Rui Barradas
Open this post in threaded view
|
Report Content as Inappropriate

## Re: select rows by criteria

 Sorry, correction: The second index matrix is the matrix of elements not in the first, not another combination, this time 3 out of 10. Change this in my first post > > inxmat <- with(DF, combn(ID, 3)) > meansDist2 <- apply(inxmat, 2, function(jnx) f(jnx, DF\$value, 45)) > (i2 <- which(meansDist2 == min(meansDist2))) > to this inxmat2 <- with(DF, apply(inxmat, 2, function(x) setdiff(ID, x))) meansDist2 <- apply(inxmat2, 2, function(jnx) f(jnx, DF\$value, 45)) (i2 <- which(meansDist2 == min(meansDist2))) Rui Barradas
Open this post in threaded view
|
Report Content as Inappropriate

## Re: select rows by criteria

 In reply to this post by syrvn On Thu, Mar 01, 2012 at 04:27:45AM -0800, syrvn wrote: > Hello, > > I am stuck with selecting the right rows from a data frame. I think the > problem is rather how to select them > then how to implement the R code. > > Consider the following data frame: > > df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value = > c(34,12,23,25,34,42,48,29,30,27)) > > What I want to achieve is to select 7 rows (values) so that the mean value > of those rows are closest > to the value of 35 and the remaining 3 rows (values) are closest to 45. > However, each value is only > allowed to be sampled once! Hi. If some 3 rows have mean close to 45, then they have sum close to 3*45, so the remaining 7 rows have sum close to   sum(df\$value) - 3*45 # [1] 169 and they have mean close to 169/7 = 24.14286. In other words, the two criteria cannot be optimized together. For this reason, let me choose the criterion on 3 rows. The closest solution may be found as follows.   # generate all triples and compute their means   tripleMeans <- colMeans(combn(df\$value, 3))   # select the index of the triple with mean closest to 35   indClosest <- which.min(abs(tripleMeans - 35))   # generate the indices, which form the closest triple in df\$value   tripleInd <- combn(1:length(df\$value), 3)[, indClosest]   tripleInd # [1] 1 3 7   # check the mean of the triple   mean(df\$value[tripleInd]) # [1] 35 This code constructs all triples. If it is used for k-tuples for a larger k and for a set of n values, its complexity will be proportional to choose(n, k), so it will be large even for moderate n, k. It is hard to provide a significant speed up, since some variants of "knapsack problem", which is NP-complete, may be reduced to your question. Consequently, it is, in general, NP-complete. Hope this helps. Petr Savicky. ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|
Report Content as Inappropriate

## Re: select rows by criteria

 On Thu, Mar 01, 2012 at 05:42:48PM +0100, Petr Savicky wrote: > On Thu, Mar 01, 2012 at 04:27:45AM -0800, syrvn wrote: > > Hello, > > > > I am stuck with selecting the right rows from a data frame. I think the > > problem is rather how to select them > > then how to implement the R code. > > > > Consider the following data frame: > > > > df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value = > > c(34,12,23,25,34,42,48,29,30,27)) > > > > What I want to achieve is to select 7 rows (values) so that the mean value > > of those rows are closest > > to the value of 35 and the remaining 3 rows (values) are closest to 45. > > However, each value is only > > allowed to be sampled once! > > Hi. > > If some 3 rows have mean close to 45, then they have sum close > to 3*45, so the remaining 7 rows have sum close to > >   sum(df\$value) - 3*45 # [1] 169 > > and they have mean close to 169/7 = 24.14286. In other words, > the two criteria cannot be optimized together. > > For this reason, let me choose the criterion on 3 rows. > The closest solution may be found as follows. > >   # generate all triples and compute their means >   tripleMeans <- colMeans(combn(df\$value, 3)) > >   # select the index of the triple with mean closest to 35 >   indClosest <- which.min(abs(tripleMeans - 35)) I am sorry. There should be 45 and not 35.   indClosest <- which.min(abs(tripleMeans - 45))   # generate the indices, which form the closest triple in df\$value   tripleInd <- combn(1:length(df\$value), 3)[, indClosest]   tripleInd # [1] 1 6 7   # check the mean of the triple   mean(df\$value[tripleInd]) # [1] 41.33333 Petr Savicky. ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|
Report Content as Inappropriate

## Re: select rows by criteria

 In reply to this post by Petr Savicky On Thu, Mar 01, 2012 at 05:42:48PM +0100, Petr Savicky wrote: > On Thu, Mar 01, 2012 at 04:27:45AM -0800, syrvn wrote: > > Hello, > > > > I am stuck with selecting the right rows from a data frame. I think the > > problem is rather how to select them > > then how to implement the R code. > > > > Consider the following data frame: > > > > df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value = > > c(34,12,23,25,34,42,48,29,30,27)) > > > > What I want to achieve is to select 7 rows (values) so that the mean value > > of those rows are closest > > to the value of 35 and the remaining 3 rows (values) are closest to 45. > > However, each value is only > > allowed to be sampled once! > > Hi. > > If some 3 rows have mean close to 45, then they have sum close > to 3*45, so the remaining 7 rows have sum close to > >   sum(df\$value) - 3*45 # [1] 169 > > and they have mean close to 169/7 = 24.14286. In other words, > the two criteria cannot be optimized together. > > For this reason, let me choose the criterion on 3 rows. > The closest solution may be found as follows. > >   # generate all triples and compute their means >   tripleMeans <- colMeans(combn(df\$value, 3)) > >   # select the index of the triple with mean closest to 35 >   indClosest <- which.min(abs(tripleMeans - 35)) > >   # generate the indices, which form the closest triple in df\$value >   tripleInd <- combn(1:length(df\$value), 3)[, indClosest] >   tripleInd # [1] 1 3 7 > >   # check the mean of the triple >   mean(df\$value[tripleInd]) # [1] 35 > > This code constructs all triples. If it is used for k-tuples > for a larger k and for a set of n values, its complexity > will be proportional to choose(n, k), so it will be large > even for moderate n, k. It is hard to provide a significant > speed up, since some variants of "knapsack problem", which > is NP-complete, may be reduced to your question. Consequently, > it is, in general, NP-complete. Hi. Also this statement requires a correction. It applies to the search of an exact optimum if the numbers in df\$value are large. There are efficient algorithms, which find an approximate solution. Also, if the numbers in df\$value are integers (or may be rounded to integers after an appropriate scaling), then there is an algorithm, whose complexity is O(k*n*max(df\$value)). This may be significantly less than choose(n, k). CRAN task view Optimization and Mathematical Programming   http://cran.at.r-project.org/web/views/Optimization.htmlmay suggest also other solutions. Petr Savicky. ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.