Hello,
I am stuck with selecting the right rows from a data frame. I think the problem is more about how to select them than how to implement the R code.

Consider the following data frame:

df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
                 value = c(34,12,23,25,34,42,48,29,30,27))

What I want to achieve is to select 7 rows (values) so that the mean of those rows is closest to 35, while the mean of the remaining 3 rows (values) is closest to 45. However, each value may be selected only once! Any ideas how to achieve that?

Cheers
Hello,
See ?combn. It gives all possible combinations as a matrix (default) or list. Then, 'apply'.

#---------------------------
# Name changed to 'DF':
# 'df' is the R function for the F distribution density
# (and a frequent choice for example data in R-help!)
#
DF <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
                 value = c(34,12,23,25,34,42,48,29,30,27))

f <- function(j, v, const) abs(mean(v[j]) - const)

inxmat <- with(DF, combn(ID, 7))
meansDist1 <- apply(inxmat, 2, function(jnx) f(jnx, DF$value, 35))
(i1 <- which(meansDist1 == min(meansDist1)))

inxmat <- with(DF, combn(ID, 3))
meansDist2 <- apply(inxmat, 2, function(jnx) f(jnx, DF$value, 45))
(i2 <- which(meansDist2 == min(meansDist2)))

meansDist3 <- meansDist1 + meansDist2  # Compromise of both criteria?
(i3 <- which(meansDist3 == min(meansDist3)))

Maybe it's combn(1:10, 3)[, 101] you want, or maybe there's another way to compromise the two criteria.

Hope this helps,

Rui Barradas
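As a minimal illustration of the combn-then-apply pattern (my toy example, not from the post): combn returns one combination per column, and apply(..., 2, ...) maps a function over those columns.

```r
combn(4, 2)                  # 2 x 6 matrix; each column is one pair drawn from 1:4
apply(combn(4, 2), 2, sum)   # column-wise sums: 3 4 5 5 6 7
```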
Sorry, correction:
The second index matrix should be the matrix of elements *not* in the first, not another independent combination of 3 out of 10. Change this in my first post:

> inxmat <- with(DF, combn(ID, 3))
> meansDist2 <- apply(inxmat, 2, function(jnx) f(jnx, DF$value, 45))
> (i2 <- which(meansDist2 == min(meansDist2)))

to this:

inxmat2 <- with(DF, apply(inxmat, 2, function(x) setdiff(ID, x)))
meansDist2 <- apply(inxmat2, 2, function(jnx) f(jnx, DF$value, 45))
(i2 <- which(meansDist2 == min(meansDist2)))

Rui Barradas
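Putting the correction together, here is a self-contained sketch (mine, not verbatim from the posts) that scores each 7-subset jointly with its 3-element complement, so every value is used exactly once:

```r
DF <- data.frame(ID = 1:10,
                 value = c(34,12,23,25,34,42,48,29,30,27))

inxmat  <- combn(DF$ID, 7)                                   # all 7-subsets, one per column
inxmat2 <- apply(inxmat, 2, function(x) setdiff(DF$ID, x))   # matching 3-element complements

f <- function(j, v, const) abs(mean(v[j]) - const)
total <- apply(inxmat,  2, f, DF$value, 35) +
         apply(inxmat2, 2, f, DF$value, 45)

best <- which(total == min(total))   # may hold more than one tied optimum
inxmat[, best[1]]                    # IDs of the first optimal 7-subset
mean(DF$value[inxmat[, best[1]]])    # 25.71429
mean(DF$value[inxmat2[, best[1]]])   # 41.33333
```

Because the total sum is fixed, the joint optimum simply drops the three largest values (48, 42, and one of the 34s) into the 3-element group, and the duplicated 34 produces two tied solutions.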
On Thu, Mar 01, 2012 at 04:27:45AM -0800, syrvn wrote:
> Hello,
>
> I am stuck with selecting the right rows from a data frame. I think the
> problem is rather how to select them than how to implement the R code.
>
> Consider the following data frame:
>
> df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
>                  value = c(34,12,23,25,34,42,48,29,30,27))
>
> What I want to achieve is to select 7 rows (values) so that the mean value
> of those rows is closest to the value of 35 and the remaining 3 rows
> (values) are closest to 45. However, each value is only allowed to be
> sampled once!

Hi.

If some 3 rows have mean close to 45, then they have sum close to 3*45, so the remaining 7 rows have sum close to

sum(df$value) - 3*45   # [1] 169

and they have mean close to 169/7 = 24.14286. In other words, the two criteria cannot be optimized together.

For this reason, let me choose the criterion on 3 rows. The closest solution may be found as follows.

# generate all triples and compute their means
tripleMeans <- colMeans(combn(df$value, 3))

# select the index of the triple with mean closest to 35
indClosest <- which.min(abs(tripleMeans - 35))

# generate the indices, which form the closest triple in df$value
tripleInd <- combn(1:length(df$value), 3)[, indClosest]
tripleInd                    # [1] 1 3 7

# check the mean of the triple
mean(df$value[tripleInd])    # [1] 35

This code constructs all triples. If it is used for k-tuples for a larger k and for a set of n values, its complexity will be proportional to choose(n, k), so it will be large even for moderate n, k. It is hard to provide a significant speed-up, since some variants of the "knapsack problem", which is NP-complete, may be reduced to your question. Consequently, it is, in general, NP-complete.

Hope this helps.

Petr Savicky.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
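Petr's incompatibility argument is plain arithmetic on the fixed grand total, which can be checked directly (a small sketch of my own, not part of the original post):

```r
value <- c(34,12,23,25,34,42,48,29,30,27)
sum(value)                  # 304: the grand total is fixed
sum(value) - 3 * 45         # 169: the sum forced onto the 7 remaining rows
(sum(value) - 3 * 45) / 7   # 24.14286: their implied mean, far from the target of 35
```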
On Thu, Mar 01, 2012 at 05:42:48PM +0100, Petr Savicky wrote:
> # select the index of the triple with mean closest to 35
> indClosest <- which.min(abs(tripleMeans - 35))

I am sorry. There should be 45 and not 35.

indClosest <- which.min(abs(tripleMeans - 45))

# generate the indices, which form the closest triple in df$value
tripleInd <- combn(1:length(df$value), 3)[, indClosest]
tripleInd                    # [1] 1 6 7

# check the mean of the triple
mean(df$value[tripleInd])    # [1] 41.33333

Petr Savicky.
On Thu, Mar 01, 2012 at 05:42:48PM +0100, Petr Savicky wrote:
> This code constructs all triples. If it is used for k-tuples
> for a larger k and for a set of n values, its complexity
> will be proportional to choose(n, k), so it will be large
> even for moderate n, k. It is hard to provide a significant
> speed up, since some variants of "knapsack problem", which
> is NP-complete, may be reduced to your question. Consequently,
> it is, in general, NP-complete.

Hi.

Also this statement requires a correction. It applies to the search of an exact optimum if the numbers in df$value are large.
There are efficient algorithms which find an approximate solution. Also, if the numbers in df$value are integers (or may be rounded to integers after an appropriate scaling), then there is an algorithm whose complexity is O(k*n*max(df$value)). This may be significantly less than choose(n, k).

The CRAN task view "Optimization and Mathematical Programming"

http://cran.at.r-project.org/web/views/Optimization.html

may also suggest other solutions.

Petr Savicky.
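The integer-valued algorithm Petr alludes to is a subset-sum style dynamic program. The following is my own sketch of the idea (the function name and details are assumptions, not taken from the thread): track which sums are reachable with exactly j elements, then pick the k-element sum closest to the target.

```r
# Sketch (not from the thread): dynamic programming over achievable sums,
# valid for non-negative integer values. reachable[j+1, s+1] is TRUE when
# some j-element subset of 'values' sums to s.
closest_k_subset_sum <- function(values, k, target_sum) {
  maxsum <- sum(values)
  reachable <- matrix(FALSE, nrow = k + 1, ncol = maxsum + 1)
  reachable[1, 1] <- TRUE                 # zero elements reach sum 0
  for (v in values) {
    for (j in k:1) {                      # descend over j so each value is used at most once
      prev <- which(reachable[j, ])       # indices are sums + 1
      reachable[j + 1, prev + v] <- TRUE  # extend each (j-1)-element sum by v
    }
  }
  sums <- which(reachable[k + 1, ]) - 1   # every sum achievable with exactly k elements
  sums[which.min(abs(sums - target_sum))]
}

values <- c(34,12,23,25,34,42,48,29,30,27)
closest_k_subset_sum(values, 3, 3 * 45)   # 124: the largest 3-subset sum, 48+42+34
```

The table has (k+1)*(sum(values)+1) cells and each value touches each row once, giving the O(k*n*max(value)) flavour of Petr's bound rather than the choose(n, k) cost of full enumeration.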
Hello, again.
There are two solutions for the 3-rows criterion; 'which.min' only finds one, the first in the order given by 'combn'. (And I had corrected my first post, but still with an error.)

# Forgot to change the index matrix
meansDist2 <- apply(inxmat2, 2, function(jnx) f(jnx, DF$value, 45))

# Two solutions
(i2 <- which(meansDist2 == min(meansDist2)))
inxmat2[, i2]
mean(DF$value[inxmat2[, i2][, 1]])
[1] 41.33333

Petr's solution and mine give the same mean value. But use this for small values of (n, k) only.

Rui Barradas
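The difference between which.min and which(x == min(x)) is easy to demonstrate on a toy vector (my example):

```r
x <- c(2, 1, 3, 1)
which.min(x)         # 2: only the first position of the minimum
which(x == min(x))   # 2 4: every tied minimum
```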