Difficult subset challenge

Noah Silverman
Hi,

I'm having difficulty coming up with a good way to subset some data to generate statistics.

My data frame has multiple observations by group.

Here is an overly-simplified toy example of the data
==========================
code v1 v2
G1 1.2 2.3
G1 0 2.4
G1 1.4 3.4
G2 2.9 2.3
G2 4.3 4.4
etc..
=========================

I want to normalize the data *by group* for certain variables.  But, I want to ignore 0 values when calculating the mean and standard deviation.

What I *want* to do is something like this:
=======================
         for (code in unique (d$code) ){
                 mu <- mean( d[which(d[d$code==code,v1] !=0 ), v1] )
                 sig <- sd( d[which(d[d$code==code,v1] !=0 ), v1] )
                 d[which(d[d$code==code,v1] !=0 ), cname] <- (d[which(d[d$code==code,v1] !=0 ), v1] - mu) / sig
         }
=======================

My goal, if it isn't apparent, is to replace values with their normalized value.  (But, the statistics used for normalization are calculated skipping zero values.)

This doesn't work, as the indexing from the which command is relative to the subset (1, 2, 3, etc.) rather than to the rows of the full data frame.
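One way around the relative indexing, offered only as a minimal sketch: build a single logical index with one entry per row of d, so the same index can be used for both reading and writing. It assumes the columns are literally named code and v1, that v1 has no missing values, and that the normalized values overwrite v1 (the cname variable in the loop above is not defined in the post).

```r
# Toy data from the post
d <- data.frame(code = c("G1", "G1", "G1", "G2", "G2"),
                v1   = c(1.2, 0, 1.4, 2.9, 4.3),
                v2   = c(2.3, 2.4, 3.4, 2.3, 4.4))

for (g in unique(d$code)) {
  # Logical vector over ALL rows of d: TRUE where the row is in group g and non-zero
  keep <- d$code == g & d$v1 != 0
  mu  <- mean(d$v1[keep])
  sig <- sd(d$v1[keep])
  # Only the non-zero rows of this group are replaced; zeros are left as 0
  d$v1[keep] <- (d$v1[keep] - mu) / sig
}
```

Because keep has the same length as the data frame, there is no need for which() and no mismatch between subset positions and row positions.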


Suggestions?



--
Noah Silverman
UCLA Department of Statistics
8208 Math Sciences Building
Los Angeles, CA 90095

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: Difficult subset challenge

Joshua Wiley-2
Hi Noah,

I am unclear if the 0s should be standardized or not---I am assuming
since you want them excluded from the calculation of the mean and SD,
you do not want  (0 - M) / sigma.  If that is the case, here is an
example:


## read in your data
## FYI: providing via dput() would be easier next time
d <- read.table(text = "
code    v1      v2
G1              1.2     2.3
G1              0       2.4
G1              1.4     3.4
G2              2.9     2.3
G2              4.3     4.4", header = TRUE)

## temporary data as a matrix
tmp <- as.matrix(d[-1])
## index 0s and set to missing
tmp[index.0 <- which(tmp == 0, arr.ind = TRUE)] <- NA
## scale by column and d$code and pull back to matrix
## (rbind() stacks the groups in group order, so this assumes d is sorted by code)
tmp <- do.call("rbind", by(tmp, d$code, scale))
## NAs back to 0s
tmp[index.0] <- 0
d[, 2:3] <- tmp
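If the rows are not sorted by code, an alternative that keeps the original row order can be sketched with ave(). This is a swapped-in technique rather than the by()/scale() approach above, and scale_nonzero is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper: z-score x within groups g, ignoring zeros
scale_nonzero <- function(x, g) {
  zero <- !is.na(x) & x == 0          # remember where the zeros are
  x[zero] <- NA                       # hide them from mean() and sd()
  z <- ave(x, g, FUN = function(v) (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE))
  z[zero] <- 0                        # put the zeros back untouched
  z
}

# Same toy values as above, deliberately NOT sorted by code
d2 <- data.frame(code = c("G2", "G1", "G1", "G2", "G1"),
                 v1   = c(2.9, 1.2, 0, 4.3, 1.4),
                 v2   = c(2.3, 2.3, 2.4, 4.4, 3.4))

# Apply to every column except code
d2[-1] <- lapply(d2[-1], scale_nonzero, g = d2$code)
```

ave() returns a vector aligned with its input, so no re-binding of groups is needed.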

If you want the zeros standardized, it will take a slightly different
approach.  The other issue that could come up here is speed, but that
can be very dataset dependent (e.g., what is most efficient for
a few levels of code may not be the same as what is efficient for many
columns).  That said, it would not take much work to create a
parallelized version of what by() is doing, and scale() is already
vectorized, so it works pretty fast provided you pass it a matrix.
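For the other reading of the question, where the zeros are also transformed as (0 - M) / sigma but M and sigma are still computed from the non-zero values only, one possible sketch is below. The output column name v1_z is an assumption made for illustration, as is handling a single column.

```r
d3 <- data.frame(code = c("G1", "G1", "G1", "G2", "G2"),
                 v1   = c(1.2, 0, 1.4, 2.9, 4.3))

# Copy of v1 with zeros masked, used only to compute the group statistics
v <- replace(d3$v1, d3$v1 == 0, NA)

# Per-row group mean and SD of the non-zero values (ave() recycles within each group)
mu  <- ave(v, d3$code, FUN = function(x) mean(x, na.rm = TRUE))
sig <- ave(v, d3$code, FUN = function(x) sd(x, na.rm = TRUE))

# Standardize ALL values, zeros included, with those statistics
d3$v1_z <- (d3$v1 - mu) / sig
```

Note that a zero can end up far from the rest of its group on this scale, since it played no part in estimating the mean and SD.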

Cheers,

Josh

On Sat, Dec 10, 2011 at 1:44 PM, Noah Silverman <[hidden email]> wrote:

> [snip]



--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/
