Hi,
I'm having difficulty coming up with a good way to subset some data to generate statistics.

My data frame has multiple observations by group.

Here is an overly-simplified toy example of the data
==========================
code v1 v2
G1 1.2 2.3
G1 0 2.4
G1 1.4 3.4
G2 2.9 2.3
G2 4.3 4.4
etc..
=========================

I want to normalize the data *by group* for certain variables. But, I want to ignore 0 values when calculating the mean and standard deviation.

What I *want* to do is something like this:
=======================
for (code in unique (d$code) ){
    mu <- mean( d[which(d[d$code==code,v1] !=0 ), v1] )
    sig <- sd( d[which(d[d$code==code,v1] !=0 ), v1] )
    d[which(d[d$code==code,v1] !=0 ), cname] <- (d[which(d[d$code==code,v1] !=0 ), v1] - mu) / sig
}
=======================

My goal, if it isn't apparent, is to replace values with their normalized value. (But, the statistics used for normalization are calculated skipping zero values.)

This doesn't work as the indexing from the which command is relative (1,2,3, etc.)

Suggestions?

--
Noah Silverman
UCLA Department of Statistics
8208 Math Sciences Building
Los Angeles, CA 90095

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

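The loop in the question fails for the reason Noah gives: `d[d$code==code, v1] != 0` is evaluated on the group's subset, so `which()` returns positions within that subset, and those positions are then used as row numbers of the full `d`. (The unquoted `v1` is also looked up as a variable, so it would need to be `"v1"` unless `v1` holds the column name.) A minimal sketch of the same loop, assuming the toy data above and using one logical mask over all rows of `d`; the output column name `v1_z` is made up for illustration:

```r
# Toy data from the question
d <- data.frame(
  code = c("G1", "G1", "G1", "G2", "G2"),
  v1   = c(1.2, 0, 1.4, 2.9, 4.3),
  v2   = c(2.3, 2.4, 3.4, 2.3, 4.4)
)

# Hypothetical output column; zeros are carried over unchanged
d$v1_z <- d$v1

for (g in unique(d$code)) {
  # Logical mask with one entry per row of d, so it indexes d directly
  keep <- d$code == g & d$v1 != 0
  mu   <- mean(d$v1[keep])
  sig  <- sd(d$v1[keep])
  d$v1_z[keep] <- (d$v1[keep] - mu) / sig
}

print(d)
```

Because `keep` has the same length as `nrow(d)`, there is no translation between subset positions and row numbers. To overwrite `v1` in place, as the question intends, assign to `d$v1[keep]` instead.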
Hi Noah,
I am unclear whether the 0s should be standardized or not. Since you want them excluded from the calculation of the mean and SD, I am assuming you do not want (0 - M) / sigma. If that is the case, here is an example:

    ## read in your data
    ## FYI: providing via dput() would be easier next time
    d <- read.table(textConnection("
    code v1 v2
    G1 1.2 2.3
    G1 0 2.4
    G1 1.4 3.4
    G2 2.9 2.3
    G2 4.3 4.4"), header = TRUE)
    closeAllConnections()

    ## temporary data as a matrix
    tmp <- as.matrix(d[-1])

    ## index 0s and set to missing
    tmp[index.0 <- which(tmp == 0, arr.ind = TRUE)] <- NA

    ## scale by column and d$code and pull back to matrix
    ## (rbind stacks the groups in level order, so d must be sorted by code)
    tmp <- do.call("rbind", by(tmp, d$code, scale))

    ## NAs back to 0s
    tmp[index.0] <- 0

    d[, 2:3] <- tmp

If you want the zeros standardized, it will take a bit of a different approach. The other issue that could come up here is speed, but that can get to be very dataset dependent (e.g., what is most efficient for a few levels of code may not be the same as what is efficient for many columns, etc.). That said, it would not take much work to create a parallelized version of what by() is doing, and scale is already vectorized, so it works pretty darn fast assuming you pass it a matrix.

Cheers,

Josh

On Sat, Dec 10, 2011 at 1:44 PM, Noah Silverman <[hidden email]> wrote:
> But, I want to ignore 0 values when calculating the mean and standard deviation.

--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/
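One caveat with the by()/rbind() approach above: the scaled pieces are stacked in group order, so it relies on the rows of `d` already being sorted by `code`. A sketch of an alternative that keeps the original row order uses `ave()`. The helper name `scale_nonzero` and the deliberately unsorted toy data are assumptions for illustration, not part of the thread:

```r
# Same toy values as the thread, but with rows deliberately out of group order
d <- data.frame(
  code = c("G2", "G1", "G1", "G2", "G1"),
  v1   = c(2.9, 1.2, 0, 4.3, 1.4),
  v2   = c(2.3, 2.3, 2.4, 4.4, 3.4)
)

# Hypothetical helper: standardize the non-zero entries of x using the
# mean and SD of those entries only; zeros (and NAs) are left as they are
scale_nonzero <- function(x) {
  nz <- !is.na(x) & x != 0
  x[nz] <- (x[nz] - mean(x[nz])) / sd(x[nz])
  x
}

# ave() applies the helper within each level of d$code and returns the
# results in the original row order
vars <- c("v1", "v2")
d[vars] <- lapply(d[vars], function(col) ave(col, d$code, FUN = scale_nonzero))

print(d)
```

A group with only one non-zero value has an undefined SD, so that value comes back as NA. If the zeros should be standardized too, one reading of "(0 - M) / sigma" is to replace the assignment inside the helper with `x <- (x - mean(x[nz])) / sd(x[nz])`, which still computes the mean and SD from the non-zero values only.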
