|
I have a data frame called "data" with 5 records (in rows), each of which has been scored on many variables (in columns).

Five of the variables are named var1, var2, var3, var4, var5 using headers. The other variables are named using other conventions.

I can create a new variable called var6 with the value 15 for each record with this code:

> var6=var1+var2+var3+var4+var5

but this is tedious for my real dataset with dozens of variables. I would rather use a wildcard to add up all the variables that begin with "var", like this pseudocode:

> var6=sum(var*)

Any suggestions for implementing this in R? Thanks! Mark

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html |
|
See:

?rowSums |
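As a minimal sketch of what the ?rowSums pointer amounts to, the lines below sum a hand-picked set of columns row by row. The data frame `scores`, its values, and the column `other1` are hypothetical stand-ins, not data from the thread.

```r
# Toy data frame (hypothetical values); 'other1' stands in for the
# columns that follow a different naming convention.
scores <- data.frame(var1 = c(1, 2), var2 = c(3, 4), var3 = c(5, NA),
                     other1 = c(100, 200))

# rowSums() adds across the selected columns within each row.
scores$total <- rowSums(scores[, c("var1", "var2", "var3")])

# na.rm = TRUE skips missing values instead of returning NA for that row.
scores$total_narm <- rowSums(scores[, c("var1", "var2", "var3")], na.rm = TRUE)
```

Listing the columns by name works for a handful of variables; for dozens, the names would need to be selected by pattern rather than typed out.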
|
In reply to this post by mtb954
data <- data.frame(var1=c(1,2,3), var2=c(3,4,5), var3=c(4,5,6), foo = c(100,200,300))

# row-wise sum of the columns with "var" in their name
rowSums(data[, grep("var", names(data))])
 1  2  3
 8 11 14

--
Simon Blomberg, B.Sc.(Hons.), Ph.D, M.App.Stat.
Centre for Resource and Environmental Studies
The Australian National University
Canberra ACT 0200
Australia
T: +61 2 6125 7800 email: Simon.Blomberg_at_anu.edu.au
F: +61 2 6125 0757
CRICOS Provider # 00120C |
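The grep()/rowSums() combination above could be tightened and stored as a new column along these lines. This is only a sketch: the extra column `Var_note`, the result name `varsum`, and the stricter pattern are assumptions added for illustration.

```r
# Same toy data as above, plus a hypothetical column 'Var_note' (capital V)
# to show that matching is case-sensitive.
data <- data.frame(var1 = c(1, 2, 3), var2 = c(3, 4, 5), var3 = c(4, 5, 6),
                   Var_note = c(0, 0, 0), foo = c(100, 200, 300))

# "^var[0-9]+$" keeps only names that are exactly "var" followed by digits.
# value = TRUE returns the matching names instead of their positions.
keep <- grep("^var[0-9]+$", names(data), value = TRUE)

# drop = FALSE keeps a data frame even when only one column matches,
# so rowSums() always receives a two-dimensional object.
data$varsum <- rowSums(data[, keep, drop = FALSE])
```

One reason to prefer the anchored pattern here: an unanchored grep("var", names(data)) would also pick up the new `varsum` column if the code were run a second time.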
|
In reply to this post by mtb954
Here is one approach using grep().

Given a data frame called MyDF with the following structure:

> str(MyDF)
`data.frame':   10 obs. of  20 variables:
 $ other4 : num  -0.869 0.376 -2.022 0.619 -0.129 ...
 $ var8   : num  -0.380 1.428 -1.075 -0.796 -0.588 ...
 $ var4   : num  -0.0850 -0.7335 -0.5019 -1.1633 -0.0197 ...
 $ other9 : num  0.0210 -0.6455 0.0289 1.2405 -1.3359 ...
 $ var10  : num  0.647 -0.798 0.180 1.135 -0.258 ...
 $ other2 : num  0.1332 -0.2227 0.0423 0.6881 2.0304 ...
 $ other10: num  0.811 2.166 0.569 0.302 0.669 ...
 $ var1   : num  -0.774 -1.812 -1.230 -0.969 0.245 ...
 $ var2   : num  -0.0538 0.3712 0.8222 -0.8025 -0.6914 ...
 $ other6 : num  0.871 0.291 2.079 1.098 1.025 ...
 $ other1 : num  -0.5130 0.1358 0.8744 0.0997 1.7458 ...
 $ var9   : num  0.664 -0.456 0.415 2.090 -0.283 ...
 $ other3 : num  -0.425 -0.283 0.706 -1.879 -0.828 ...
 $ other7 : num  0.100 0.177 0.570 -0.631 -1.009 ...
 $ var3   : num  1.446 -0.862 0.184 1.077 0.146 ...
 $ var5   : num  0.402 -0.498 -0.906 0.641 1.690 ...
 $ var6   : num  0.892 -0.242 0.561 0.530 -0.291 ...
 $ other5 : num  -1.210 0.815 -1.284 -0.152 0.329 ...
 $ other8 : num  -0.265 -1.278 1.152 0.232 -1.189 ...
 $ var7   : num  -0.616 -0.994 -0.263 1.626 -1.372 ...

Note that the column names are either var* or other*.

Using grep() we get the indices of the column names that contain "other" plus at least one following character, where the "other" begins the word:

> grep("\\bother.", names(MyDF))
 [1]  1  4  6  7 10 11 13 14 18 19

See ?regexp for more information. Note that I use "\\b" to begin the search at a starting word boundary and then "." to require that there be a following character. Thus, this would not match a name such as "another1" (no word boundary before "other") or a bare "other" (no following character).

You can then use the following to subset the data frame MyDF and sum the rows for the requested columns:

> rowSums(MyDF[, grep("\\bother.", names(MyDF))])
        1         2         3         4         5         6         7
-1.344893  1.531417  2.715234  1.616971  1.307379  4.655568  4.638446
        8         9        10
-2.640485 -2.226270 -2.158248

You could use grep("other", names(MyDF)), but this would also get "other" if it appears anywhere in the name. For example:

> grep("other", "Thisotherone")
[1] 1

It just depends upon your naming schema and how strict you need to be in the search.

HTH,

Marc Schwartz |
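To make the difference between the patterns discussed above concrete, the following sketch runs three of them over a few made-up column names. The vector `nms` and the result names are hypothetical; the "^other." pattern is an extra variant not used in the thread, shown for the case where the name must start with "other".

```r
# Hypothetical column names chosen to separate the three patterns.
nms <- c("other1", "other", "another1", "my.other2", "Thisotherone")

# Plain substring: matches "other" anywhere in the name.
anywhere <- grepl("other", nms)

# Word boundary, then at least one more character. A "." inside a name
# counts as a non-word character, so "my.other2" still matches.
boundary <- grepl("\\bother.", nms)

# Anchored at the start of the name, then at least one more character.
at_start <- grepl("^other.", nms)

# Side-by-side view of which names each pattern accepts.
print(data.frame(nms, anywhere, boundary, at_start))
```

Which pattern is appropriate depends on the naming scheme; for names that always start with the prefix, the anchored form is probably the least surprising.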
|
Thanks Gabor, Simon and Marc... I got this to work with the grep() and rowSums() examples you provided.

Mark |
|
In reply to this post by Simon Blomberg-2
Hi,

I have a data frame, x, of the following form:

Date      Symbol   A   B   C
20041201  ABC     10  12  15
20041201  DEF      9   5   4
...
20050101  ABC      5   3   1
20050101  GHM     12   4   2
....

Here A, B, C are properties of a set of symbols recorded for a given date. I want to decile the symbols for each date and property and create another set of columns "bucketA", "bucketB", "bucketC" containing the decile rank for each symbol. The following non-vectorized code does what I want:

bucket <- function(data,nBuckets) {
    q <- quantile(data,seq(0,1,len=nBuckets+1),na.rm=T)
    q[1] <- q[1] - 0.1 # need to do this to ensure there are no extra NAs
    cut(data,q,include.lowest=T,labels=F)
}

calcDeciles <- function(x,colNames) {
    nBuckets <- 10
    dates <- unique(x$Date)
    for ( date in dates) {
        iVec <- x$Date == date
        xx <- x[iVec,]
        for (colName in colNames) {
            data <- xx[,colName]
            bColName <- paste("bucket",colName,sep="")
            x[iVec,bColName] <- bucket(data,nBuckets)
        }
    }
    x
}

x <- calcDeciles(x,c("A","B","C"))

I was wondering if it is possible to vectorize the above function to make it more efficient. I tried

rlist <- tapply(x$A,x$Date,bucket,nBuckets=10)

but I am not sure how to assign the contents of "rlist" to their appropriate slots in the original data frame.

Thanks,

Maneesh |
|
It might help to give a simple reproducible example in the future. For example

  df <- cbind.data.frame( date=rep( 1:5, each=100 ), A=rpois(500, 100),
                          B=rpois(500, 50), C=rpois(500, 30) )

might generate something like

      date   A  B  C
  1      1  93 51 32
  2      1  95 51 30
  3      1 102 59 28
  4      1 105 52 32
  5      1 105 53 26
  6      1  99 59 37
  ...    . ... .. ..
  495    5 100 57 19
  496    5  96 47 44
  497    5 111 56 35
  498    5 105 49 23
  499    5 105 61 30
  500    5  92 53 32

Here is my proposed solution. Can you double check against your existing functions to see if the results are correct?

  decile.fn <- function(x, nbreaks=10){
    br     <- quantile( x, seq(0, 1, len=nbreaks+1), na.rm=T )
    br[1]  <- -Inf
    return( cut(x, br, labels=F) )
  }

  out <- apply( df[ ,c("A", "B", "C")], 2,
                function(v) unlist( tapply( v, df$date, decile.fn ) ) )

  rownames(out) <- rownames(df)
  out <- cbind(df$date, out)

Regards, Adai |
|
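A possible variation on the per-date decile idea above: base R's ave() applies a function within groups and hands the results back in the original row order, so it does not rely on the rows being sorted by date. This is a minimal sketch under assumptions: the shuffled toy data, the helper name `decile`, and the use of continuous values (so that the quantile breaks are distinct) are illustrative choices, not part of the thread.

```r
set.seed(1)

# Toy data with the dates deliberately shuffled (hypothetical values).
df <- data.frame(date = sample(rep(1:3, each = 20)), A = rnorm(60))

# Decile rank of each value within the vector it is given.
decile <- function(v, n = 10) {
  br <- quantile(v, seq(0, 1, length.out = n + 1), na.rm = TRUE)
  # include.lowest = TRUE puts the group minimum into the first bucket.
  cut(v, br, include.lowest = TRUE, labels = FALSE)
}

# ave() splits A by date, applies decile() to each piece, and returns
# a vector aligned with the original rows of df.
df$bucketA <- ave(df$A, df$date, FUN = decile)
```

With heavily tied data, quantile() can return duplicate breaks and cut() then stops with an error; this sketch does not handle that case.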
Hi Adai,
I think your solution only works if the rows of the data frame are ordered by "date" and the ordering function is the same one used to order the levels of factor(df$date)?

It turns out (as I implied in my question) that my data is indeed organized in this manner, so my current problem is solved. In the general case, I suppose, one could always order the data frame by date before proceeding?

Thanks,

Maneesh |
|
You might prefer to look at split/unsplit/split<-, i.e. the z-scores by group line:

z <- unsplit(lapply(split(x, g), scale), g)

with "scale" suitably replaced. Presumably (meaning: I didn't quite read your code closely enough)

z <- unsplit(lapply(split(x, g), bucket, 10), g)

could do it.

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([hidden email])                  FAX: (+45) 35327907 |
|
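The split()/unsplit() suggestion above can be sketched for several columns at once. The toy data frame below is hypothetical (random values, two dates in shuffled order), and `bucket` is a simplified version of the function from the question, without the `q[1] - 0.1` adjustment, since include.lowest = TRUE already keeps the group minimum.

```r
set.seed(42)

# Hypothetical data: two dates in shuffled row order, three properties.
x <- data.frame(Date = sample(rep(c(20041201, 20050101), each = 30)),
                A = rnorm(60), B = rnorm(60), C = rnorm(60))

# Simplified bucket(): rank of each value among nBuckets quantile bins.
bucket <- function(data, nBuckets) {
  q <- quantile(data, seq(0, 1, length.out = nBuckets + 1), na.rm = TRUE)
  cut(data, q, include.lowest = TRUE, labels = FALSE)
}

# For each property: split by date, bucket each piece, then unsplit()
# puts the results back in the original row order.
for (col in c("A", "B", "C")) {
  pieces <- lapply(split(x[[col]], x$Date), bucket, 10)
  x[[paste0("bucket", col)]] <- unsplit(pieces, x$Date)
}
```

Because unsplit() is driven by the same grouping vector that split() used, the rows do not need to be sorted by date beforehand.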
Hi Peter,
That did the trick. Thank you very much.

Regards,

Maneesh |
|
In reply to this post by Peter Dalgaard
Thank you! I did not know about the split and unsplit functions. It looks like a very powerful and useful combination to master.

Regards, Adai |
