Quantcast

select rows by criteria

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

select rows by criteria

syrvn
Hello,

I am stuck with selecting the right rows from a data frame. I think the problem is rather how to select them
then how to implement the R code.

Consider the following data frame:

df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value = c(34,12,23,25,34,42,48,29,30,27))

What I want to achieve is to select 7 rows (values) so that the mean value of those rows are closest
to the value of 35 and the remaining 3 rows (values) are closest to 45. However, each value is only
allowed to be sampled once!

Any ideas, how to achieve that?


Cheers




Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: select rows by criteria

Rui Barradas
Hello,


syrvn wrote
Hello,

I am stuck with selecting the right rows from a data frame. I think the problem is rather how to select them
then how to implement the R code.

Consider the following data frame:

df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value = c(34,12,23,25,34,42,48,29,30,27))

What I want to achieve is to select 7 rows (values) so that the mean value of those rows are closest
to the value of 35 and the remaining 3 rows (values) are closest to 45. However, each value is only
allowed to be sampled once!

Any ideas, how to achieve that?


Cheers
See

?combn

It gives all possible combinations as a matrix (default) or list. Then, 'apply'.


#---------------------------
# Name changed to 'DF',
# 'df' is the R function for the F distribution density
# (and a frequent choice for example data in R-help!)
#
DF <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
                value = c(34,12,23,25,34,42,48,29,30,27))

f <- function(j, v, const) abs(mean(v[j]) - const)

inxmat <- with(DF, combn(ID, 7))
meansDist1 <- apply(inxmat, 2, function(jnx) f(jnx, DF$value, 35))
(i1 <- which(meansDist1 == min(meansDist1)))

inxmat <- with(DF, combn(ID, 3))
meansDist2 <- apply(inxmat, 2, function(jnx) f(jnx, DF$value, 45))
(i2 <- which(meansDist2 == min(meansDist2)))

meansDist3 <- meansDist1 + meansDist2  # Compromise of both criteria?
(i3 <- which(meansDist3 == min(meansDist3)))

Maybe it's combn(1:10, 3)[, 101] you want, or maybe there's another way to compromise the two criteria.

Hope this helps,

Rui Barradas
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: select rows by criteria

Rui Barradas
Sorry, correction:

The second index matrix is the matrix of elements not in the first,
not another combination, this time 3 out of 10.

Change this in my first post

>
> inxmat <- with(DF, combn(ID, 3))
> meansDist2 <- apply(inxmat, 2, function(jnx) f(jnx, DF$value, 45))
> (i2 <- which(meansDist2 == min(meansDist2)))
>

to this

inxmat2 <- with(DF, apply(inxmat, 2, function(x) setdiff(ID, x)))
meansDist2 <- apply(inxmat2, 2, function(jnx) f(jnx, DF$value, 45))
(i2 <- which(meansDist2 == min(meansDist2)))

Rui Barradas

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: select rows by criteria

Petr Savicky
In reply to this post by syrvn
On Thu, Mar 01, 2012 at 04:27:45AM -0800, syrvn wrote:

> Hello,
>
> I am stuck with selecting the right rows from a data frame. I think the
> problem is rather how to select them
> then how to implement the R code.
>
> Consider the following data frame:
>
> df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value =
> c(34,12,23,25,34,42,48,29,30,27))
>
> What I want to achieve is to select 7 rows (values) so that the mean value
> of those rows are closest
> to the value of 35 and the remaining 3 rows (values) are closest to 45.
> However, each value is only
> allowed to be sampled once!

Hi.

If some 3 rows have mean close to 45, then they have sum close
to 3*45, so the remaining 7 rows have sum close to

  sum(df$value) - 3*45 # [1] 169

and they have mean close to 169/7 = 24.14286. In other words,
the two criteria cannot be optimized together.

For this reason, let me choose the criterion on 3 rows.
The closest solution may be found as follows.

  # generate all triples and compute their means
  tripleMeans <- colMeans(combn(df$value, 3))

  # select the index of the triple with mean closest to 35
  indClosest <- which.min(abs(tripleMeans - 35))

  # generate the indices, which form the closest triple in df$value
  tripleInd <- combn(1:length(df$value), 3)[, indClosest]
  tripleInd # [1] 1 3 7

  # check the mean of the triple
  mean(df$value[tripleInd]) # [1] 35

This code constructs all triples. If it is used for k-tuples
for a larger k and for a set of n values, its complexity
will be proportional to choose(n, k), so it will be large
even for moderate n, k. It is hard to provide a significant
speed up, since some variants of "knapsack problem", which
is NP-complete, may be reduced to your question. Consequently,
it is, in general, NP-complete.

Hope this helps.

Petr Savicky.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: select rows by criteria

Petr Savicky
On Thu, Mar 01, 2012 at 05:42:48PM +0100, Petr Savicky wrote:

> On Thu, Mar 01, 2012 at 04:27:45AM -0800, syrvn wrote:
> > Hello,
> >
> > I am stuck with selecting the right rows from a data frame. I think the
> > problem is rather how to select them
> > then how to implement the R code.
> >
> > Consider the following data frame:
> >
> > df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value =
> > c(34,12,23,25,34,42,48,29,30,27))
> >
> > What I want to achieve is to select 7 rows (values) so that the mean value
> > of those rows are closest
> > to the value of 35 and the remaining 3 rows (values) are closest to 45.
> > However, each value is only
> > allowed to be sampled once!
>
> Hi.
>
> If some 3 rows have mean close to 45, then they have sum close
> to 3*45, so the remaining 7 rows have sum close to
>
>   sum(df$value) - 3*45 # [1] 169
>
> and they have mean close to 169/7 = 24.14286. In other words,
> the two criteria cannot be optimized together.
>
> For this reason, let me choose the criterion on 3 rows.
> The closest solution may be found as follows.
>
>   # generate all triples and compute their means
>   tripleMeans <- colMeans(combn(df$value, 3))
>
>   # select the index of the triple with mean closest to 35
>   indClosest <- which.min(abs(tripleMeans - 35))

I am sorry. There should be 45 and not 35.

  indClosest <- which.min(abs(tripleMeans - 45))

  # generate the indices, which form the closest triple in df$value
  tripleInd <- combn(1:length(df$value), 3)[, indClosest]
  tripleInd # [1] 1 6 7

  # check the mean of the triple
  mean(df$value[tripleInd]) # [1] 41.33333

Petr Savicky.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: select rows by criteria

Petr Savicky
In reply to this post by Petr Savicky
On Thu, Mar 01, 2012 at 05:42:48PM +0100, Petr Savicky wrote:

> On Thu, Mar 01, 2012 at 04:27:45AM -0800, syrvn wrote:
> > Hello,
> >
> > I am stuck with selecting the right rows from a data frame. I think the
> > problem is rather how to select them
> > then how to implement the R code.
> >
> > Consider the following data frame:
> >
> > df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value =
> > c(34,12,23,25,34,42,48,29,30,27))
> >
> > What I want to achieve is to select 7 rows (values) so that the mean value
> > of those rows are closest
> > to the value of 35 and the remaining 3 rows (values) are closest to 45.
> > However, each value is only
> > allowed to be sampled once!
>
> Hi.
>
> If some 3 rows have mean close to 45, then they have sum close
> to 3*45, so the remaining 7 rows have sum close to
>
>   sum(df$value) - 3*45 # [1] 169
>
> and they have mean close to 169/7 = 24.14286. In other words,
> the two criteria cannot be optimized together.
>
> For this reason, let me choose the criterion on 3 rows.
> The closest solution may be found as follows.
>
>   # generate all triples and compute their means
>   tripleMeans <- colMeans(combn(df$value, 3))
>
>   # select the index of the triple with mean closest to 35
>   indClosest <- which.min(abs(tripleMeans - 35))
>
>   # generate the indices, which form the closest triple in df$value
>   tripleInd <- combn(1:length(df$value), 3)[, indClosest]
>   tripleInd # [1] 1 3 7
>
>   # check the mean of the triple
>   mean(df$value[tripleInd]) # [1] 35
>
> This code constructs all triples. If it is used for k-tuples
> for a larger k and for a set of n values, its complexity
> will be proportional to choose(n, k), so it will be large
> even for moderate n, k. It is hard to provide a significant
> speed up, since some variants of "knapsack problem", which
> is NP-complete, may be reduced to your question. Consequently,
> it is, in general, NP-complete.

Hi.

Also this statement requires a correction. It applies to the
search of an exact optimum if the numbers in df$value are large.
There are efficient algorithms, which find an approximate solution.
Also, if the numbers in df$value are integers (or may be rounded
to integers after an appropriate scaling), then there is an
algorithm, whose complexity is O(k*n*max(df$value)). This
may be significantly less than choose(n, k).

CRAN task view Optimization and Mathematical Programming

  http://cran.at.r-project.org/web/views/Optimization.html

may suggest also other solutions.

Petr Savicky.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: select rows by criteria

Rui Barradas
In reply to this post by Petr Savicky
Hello, again.

Petr Savicky wrote
On Thu, Mar 01, 2012 at 05:42:48PM +0100, Petr Savicky wrote:
> On Thu, Mar 01, 2012 at 04:27:45AM -0800, syrvn wrote:
> > Hello,
> >
> > I am stuck with selecting the right rows from a data frame. I think the
> > problem is rather how to select them
> > then how to implement the R code.
> >
> > Consider the following data frame:
> >
> > df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value =
> > c(34,12,23,25,34,42,48,29,30,27))
> >
> > What I want to achieve is to select 7 rows (values) so that the mean value
> > of those rows are closest
> > to the value of 35 and the remaining 3 rows (values) are closest to 45.
> > However, each value is only
> > allowed to be sampled once!
>
> Hi.
>
> If some 3 rows have mean close to 45, then they have sum close
> to 3*45, so the remaining 7 rows have sum close to
>
>   sum(df$value) - 3*45 # [1] 169
>
> and they have mean close to 169/7 = 24.14286. In other words,
> the two criteria cannot be optimized together.
>
> For this reason, let me choose the criterion on 3 rows.
> The closest solution may be found as follows.
>
>   # generate all triples and compute their means
>   tripleMeans <- colMeans(combn(df$value, 3))
>
>   # select the index of the triple with mean closest to 35
>   indClosest <- which.min(abs(tripleMeans - 35))

I am sorry. There should be 45 and not 35.

  indClosest <- which.min(abs(tripleMeans - 45))

  # generate the indices, which form the closest triple in df$value
  tripleInd <- combn(1:length(df$value), 3)[, indClosest]
  tripleInd # [1] 1 6 7

  # check the mean of the triple
  mean(df$value[tripleInd]) # [1] 41.33333

Petr Savicky.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
There are two solutions for the 3 rows criterion, 'which.min' only finds one, the first in the order given by 'combn'.
(And I've corrected my first post but still with an error)

# Forgot to change the index matrix
meansDist2 <- apply(inxmat2, 2, function(jnx) f(jnx, DF$value, 45))

# Two solutions
(i2 <- which(meansDist2 == min(meansDist2)))
inxmat2[, i2]

mean(DF$value[inxmat2[, i2][, 1]])
[1] 41.33333

Petr's solution and mine give the same mean value.
But use for small values of (n, k) only.

Rui Barradas

Loading...