How to sum values across multiple variables using a wildcard?

mtb954
I have a dataframe called "data" with 5 records (in rows) each of
which has been scored on each of many variables (in columns).

Five of the variables are named var1, var2, var3, var4, var5 using
headers. The other variables are named using other conventions.

I can create a new variable called var6 with the value 15 for each
record with this code:

> var6=var1+var2+var3+var4+var5

but this is tedious for my real dataset with dozens of variables. I
would rather use a wildcard to add up all the variables that begin
with "var", like this pseudocode:

> var6=sum(var*)

Any suggestions for implementing this in R? Thanks! Mark

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: How to sum values across multiple variables using a wildcard?

Gabor Grothendieck
See:

?rowSums




Re: How to sum values across multiple variables using a wildcard?

Simon Blomberg-2
In reply to this post by mtb954
data <- data.frame(var1=c(1,2,3), var2=c(3,4,5), var3=c(4,5,6), foo =
c(100,200,300))
# row sums over the columns with "var" in their names
rowSums(data[, grep("var", names(data))])

 1  2  3
 8 11 14
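
A small variation on the same idea, as a sketch only: anchoring the pattern with "^" restricts the match to names that begin with "var", and drop = FALSE keeps the subset a data frame even if a single column matches. The extra column covar and the result column total are made up for this illustration.

```r
# Same toy data as above, plus a column whose name merely contains "var".
data <- data.frame(var1 = c(1, 2, 3), var2 = c(3, 4, 5), var3 = c(4, 5, 6),
                   foo = c(100, 200, 300), covar = c(10, 20, 30))

# "^var" matches only names that start with "var", so "covar" is skipped.
cols <- grep("^var", names(data))

# drop = FALSE keeps a data frame even when only one column matches.
data$total <- rowSums(data[, cols, drop = FALSE])
data$total
```

With this toy data, total holds the same values as the output above (8, 11, 14), because covar and foo are left out of the sum.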





--
Simon Blomberg, B.Sc.(Hons.), Ph.D, M.App.Stat.
Centre for Resource and Environmental Studies
The Australian National University
Canberra ACT 0200
Australia
T: +61 2 6125 7800 email: Simon.Blomberg_at_anu.edu.au
F: +61 2 6125 0757
CRICOS Provider # 00120C


Re: How to sum values across multiple variables using a wildcard?

Marc Schwartz (via MN)
In reply to this post by mtb954

Here is one approach using grep().

Given a data frame called MyDF with the following structure:

> str(MyDF)
`data.frame': 10 obs. of  20 variables:
 $ other4 : num  -0.869  0.376 -2.022  0.619 -0.129 ...
 $ var8   : num  -0.380  1.428 -1.075 -0.796 -0.588 ...
 $ var4   : num  -0.0850 -0.7335 -0.5019 -1.1633 -0.0197 ...
 $ other9 : num   0.0210 -0.6455  0.0289  1.2405 -1.3359 ...
 $ var10  : num   0.647 -0.798  0.180  1.135 -0.258 ...
 $ other2 : num   0.1332 -0.2227  0.0423  0.6881  2.0304 ...
 $ other10: num  0.811 2.166 0.569 0.302 0.669 ...
 $ var1   : num  -0.774 -1.812 -1.230 -0.969  0.245 ...
 $ var2   : num  -0.0538  0.3712  0.8222 -0.8025 -0.6914 ...
 $ other6 : num  0.871 0.291 2.079 1.098 1.025 ...
 $ other1 : num  -0.5130  0.1358  0.8744  0.0997  1.7458 ...
 $ var9   : num   0.664 -0.456  0.415  2.090 -0.283 ...
 $ other3 : num  -0.425 -0.283  0.706 -1.879 -0.828 ...
 $ other7 : num   0.100  0.177  0.570 -0.631 -1.009 ...
 $ var3   : num   1.446 -0.862  0.184  1.077  0.146 ...
 $ var5   : num   0.402 -0.498 -0.906  0.641  1.690 ...
 $ var6   : num   0.892 -0.242  0.561  0.530 -0.291 ...
 $ other5 : num  -1.210  0.815 -1.284 -0.152  0.329 ...
 $ other8 : num  -0.265 -1.278  1.152  0.232 -1.189 ...
 $ var7   : num  -0.616 -0.994 -0.263  1.626 -1.372 ...


Note that the column names are either var* or other*.

Using grep() we get the indices of the column names that contain "other"
plus one or more following characters, where the "other" begins the
word:

> grep("\\bother.", names(MyDF))
 [1]  1  4  6  7 10 11 13 14 18 19

See ?regexp for more information. Note that I use "\\b" to begin the
search at a word boundary and then "." to require that there be at
least one following character. Thus, this would not match
"Thisotherone" (no word boundary before "other") or a bare "other"
(no following character).

You can then use the following to subset the data frame MyDF and sum the
rows for the requested columns:

> rowSums(MyDF[, grep("\\bother.", names(MyDF))])
        1         2         3         4         5         6         7
-1.344893  1.531417  2.715234  1.616971  1.307379  4.655568  4.638446
        8         9        10
-2.640485 -2.226270 -2.158248


You could use grep("other", names(MyDF)), but this would also get
"other" if it appears anywhere in the name. For example:

> grep("other", "Thisotherone")
[1] 1

It just depends upon your naming schema and how strict you need to be in
the search.
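
To make the difference between the patterns concrete, here is a minimal sketch. The column names in nms are hypothetical and chosen only to separate the three cases; "^" is an additional, stricter option that requires the name itself to start with "other".

```r
# Hypothetical column names chosen to show how the three patterns differ.
nms <- c("other1", "other", "Thisotherone", "my.other2", "var1")

# Substring match: "other" anywhere in the name.
anywhere <- grep("other", nms)

# "other" at a word boundary, followed by at least one more character.
# A "." in a name counts as a boundary, so "my.other2" also matches.
boundary <- grep("\\bother.", nms)

# Name must start with "other" and have at least one more character.
anchored <- grep("^other.", nms)

nms[anywhere]
nms[boundary]
nms[anchored]
```

Under these assumptions the substring pattern picks up the first four names, the word-boundary pattern picks up "other1" and "my.other2", and the anchored pattern picks up only "other1".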

HTH,

Marc Schwartz


Re: How to sum values across multiple variables using a wildcard?

mtb954
Thanks Gabor, Simon and Marc... I got this to work with the grep() and
rowSums() examples you provided.

Mark



Ranking within factor subgroups

maneesh deshpande
In reply to this post by Simon Blomberg-2

Hi,

I have a dataframe, x of the following form:

Date      Symbol   A   B   C
20041201  ABC     10  12  15
20041201  DEF      9   5   4
...
20050101  ABC      5   3   1
20050101  GHM     12   4   2
...

Here A, B and C are properties of a set of symbols recorded on a given date.
I want to decile the symbols for each date and property, and create another
set of columns "bucketA", "bucketB", "bucketC" containing the decile rank
for each symbol. The following non-vectorized code does what I want:

bucket <- function(data, nBuckets) {
    q <- quantile(data, seq(0, 1, length.out = nBuckets + 1), na.rm = TRUE)
    q[1] <- q[1] - 0.1  # need to do this to ensure there are no extra NAs
    cut(data, q, include.lowest = TRUE, labels = FALSE)
}

calcDeciles <- function(x, colNames) {
    nBuckets <- 10
    dates <- unique(x$Date)
    for (date in dates) {
        iVec <- x$Date == date
        xx <- x[iVec, ]
        for (colName in colNames) {
            data <- xx[, colName]
            bColName <- paste("bucket", colName, sep = "")
            x[iVec, bColName] <- bucket(data, nBuckets)
        }
    }
    x
}

x <- calcDeciles(x,c("A","B","C"))


I was wondering if it is possible to vectorize the above function to make it
more efficient. I tried

rlist <- tapply(x$A, x$Date, bucket, 10)

but I am not sure how to assign the contents of "rlist" to their appropriate
slots in the original data frame.

Thanks,

Maneesh


Re: Ranking within factor subgroups

Adaikalavan Ramasamy
It might help to give a simple reproducible example in the future. For
example

 df <- cbind.data.frame( date=rep( 1:5, each=100 ), A=rpois(500, 100),
                         B=rpois(500, 50), C=rpois(500, 30) )

might generate something like

            date   A  B  C
          1    1  93 51 32
          2    1  95 51 30
          3    1 102 59 28
          4    1 105 52 32
          5    1 105 53 26
          6    1  99 59 37
        ...    . ... .. ..
        495    5 100 57 19
        496    5  96 47 44
        497    5 111 56 35
        498    5 105 49 23
        499    5 105 61 30
        500    5  92 53 32

Here is my proposed solution. Could you check it against your existing
functions to confirm that the results agree?

   decile.fn <- function(x, nbreaks=10){
     br     <- quantile( x, seq(0, 1, len=nbreaks+1), na.rm=T )
     br[1]  <- -Inf
     return( cut(x, br, labels=F) )
   }

   out <- apply( df[ ,c("A", "B", "C")], 2,
                 function(v) unlist( tapply( v, df$date, decile.fn ) ) )

   rownames(out) <- rownames(df)
   out <- cbind(df$date, out)

Regards, Adai




Re: Ranking within factor subgroups

maneesh deshpande
Hi Adai,

I think your solution only works if the rows of the data frame are ordered
by "date" and the ordering is the same as that used to order the levels of
factor(df$date)? It turns out (as I implied in my question) that my data is
indeed organized in this manner, so my current problem is solved.
In the general case, I suppose, one could always order the data frame by
date before proceeding ?
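
One way to sidestep the ordering concern, offered as a sketch rather than as something used in this thread: base R's ave() applies a function within groups and returns the results in the original row order, so no sorting is needed. The toy data frame, the seed, and the column name bucketA are assumptions for the illustration; bucket() is the function from the original question.

```r
# bucket() as posted in the original question.
bucket <- function(data, nBuckets) {
  q <- quantile(data, seq(0, 1, length.out = nBuckets + 1), na.rm = TRUE)
  q[1] <- q[1] - 0.1
  cut(data, q, include.lowest = TRUE, labels = FALSE)
}

set.seed(1)  # arbitrary seed, only so the illustration is repeatable
# Toy data with the dates deliberately NOT sorted.
df <- data.frame(date = sample(rep(1:3, each = 20)), A = rnorm(60))

# ave() splits A by date, applies the function to each group, and puts
# every result back in the row it came from.
df$bucketA <- ave(df$A, df$date, FUN = function(v) bucket(v, 10))
head(df)
```

Each row's bucketA is the decile of A among the rows sharing its date, whatever order the rows are in.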

Thanks,

Maneesh


Re: Ranking within factor subgroups

Peter Dalgaard
You might prefer to look at split/unsplit/split<-, i.e. the z-scores
by group line:

     z <- unsplit(lapply(split(x, g), scale), g)

with "scale" suitably replaced. Presumably (meaning: I didn't quite
read your code closely enough)

    z <- unsplit(lapply(split(x, g), bucket, 10), g)

could do it.

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([hidden email])                  FAX: (+45) 35327907


Re: Ranking within factor subgroups

maneesh deshpande
Hi Peter,

That did the trick. Thank you very much.

Regards,

Maneesh


Re: Ranking within factor subgroups

Adaikalavan Ramasamy
In reply to this post by Peter Dalgaard
Thank you! I did not know about the split and unsplit functions. It
looks like a very powerful and useful combination to master.

Regards, Adai
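
For readers who have not met split() and unsplit() before, the suggestion from earlier in the thread can be sketched as follows. The toy data frame and the seed are assumptions for the illustration; bucket() and the bucketA/bucketB/bucketC column names come from the original question.

```r
# bucket() as posted in the original question.
bucket <- function(data, nBuckets) {
  q <- quantile(data, seq(0, 1, length.out = nBuckets + 1), na.rm = TRUE)
  q[1] <- q[1] - 0.1
  cut(data, q, include.lowest = TRUE, labels = FALSE)
}

set.seed(42)  # arbitrary seed, only so the illustration is repeatable
# Toy data: two dates in shuffled order, three numeric properties.
x <- data.frame(Date = sample(rep(c(20041201, 20050101), each = 30)),
                A = rnorm(60), B = rnorm(60), C = rnorm(60))

g <- factor(x$Date)  # grouping factor shared by split() and unsplit()
for (col in c("A", "B", "C")) {
  # split() cuts the column into one vector per date, lapply() buckets
  # each piece, and unsplit() reassembles them in the original row order.
  x[[paste0("bucket", col)]] <- unsplit(lapply(split(x[[col]], g), bucket, 10), g)
}
head(x)
```

Because unsplit() uses the same grouping factor as split(), the rows do not need to be sorted by date beforehand.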



On Thu, 2006-02-23 at 07:28 +0100, Peter Dalgaard wrote:

> "maneesh deshpande" <[hidden email]> writes:
>
> > Hi Adai,
> >
> > I think your solution only works if the rows of the data frame are ordered
> > by "date" and
> > the ordering function is the same used to order the levels of
> > factor(df$date) ?
> > It turns out (as I implied in my question) my data is indeed organized in
> > this manner, so my
> > current problem is solved.
> > In the general case, I suppose, one could always order the data frame by
> > date before proceeding ?
> >
> > Thanks,
> >
> > Maneesh
>
> You might prefer to look at split/unsplit/split<-, i.e. the z-scores
> by group line:
>
>      z <- unsplit(lapply(split(x, g), scale), g)
>
> with "scale" suitably replaced. Presumably (meaning: I didn't quite
> read your code closely enough)
>
>     z <- unsplit(lapply(split(x, g), bucket, 10), g)
>
> could do it.
>  
> >
> > >From: Adaikalavan Ramasamy <[hidden email]>
> > >Reply-To: [hidden email]
> > >To: maneesh deshpande <[hidden email]>
> > >CC: [hidden email]
> > >Subject: Re: [R]  Ranking within factor subgroups
> > >Date: Wed, 22 Feb 2006 03:44:45 +0000
> > >
> > >It might help to give a simple reproducible example in the future. For
> > >example
> > >
> > >  df <- cbind.data.frame( date=rep( 1:5, each=100 ), A=rpois(500, 100),
> > >                          B=rpois(500, 50), C=rpois(500, 30) )
> > >
> > >might generate something like
> > >
> > >    date   A  B  C
> > >  1    1  93 51 32
> > >  2    1  95 51 30
> > >  3    1 102 59 28
> > >  4    1 105 52 32
> > >  5    1 105 53 26
> > >  6    1  99 59 37
> > > ...    . ... .. ..
> > > 495    5 100 57 19
> > > 496    5  96 47 44
> > > 497    5 111 56 35
> > > 498    5 105 49 23
> > > 499    5 105 61 30
> > > 500    5  92 53 32
> > >
> > >Here is my proposed solution. Can you double-check it against your
> > >existing functions to see whether it is correct?
> > >
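> > >    decile.fn <- function(x, nbreaks=10){
> > >      # break points at the 0%, 10%, ..., 100% quantiles of x
> > >      br     <- quantile( x, seq(0, 1, length.out=nbreaks+1), na.rm=TRUE )
> > >      br[1]  <- -Inf   # so that the minimum falls into the first bin
> > >      return( cut(x, br, labels=FALSE) )
> > >    }
> > >
> > >    # note: unlist(tapply()) returns values grouped by date, so this
> > >    # assumes the rows of df are already sorted by date
> > >    out <- apply( df[ ,c("A", "B", "C")], 2,
> > >                  function(v) unlist( tapply( v, df$date, decile.fn ) ) )
> > >
> > >    rownames(out) <- rownames(df)
> > >    out <- cbind(date=df$date, out)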
> > >
> > >Regards, Adai
> > >
> > >
> > >
> > >On Tue, 2006-02-21 at 21:44 -0500, maneesh deshpande wrote:
> > > > Hi,
> > > >
> > > > I have a data frame, x, of the following form:
> > > >
> > > > Date       Symbol   A   B   C
> > > > 20041201   ABC     10  12  15
> > > > 20041201   DEF      9   5   4
> > > > ...
> > > > 20050101   ABC      5   3   1
> > > > 20050101   GHM     12   4   2
> > > > ...
> > > >
> > > > Here A, B, C are properties of a set of symbols recorded for a given date.
> > > > I want to decile the symbols for each date and property, and create
> > > > another set of columns "bucketA", "bucketB", "bucketC" containing the
> > > > decile rank for each symbol. The following non-vectorized code does what
> > > > I want:
> > > >
> > > > bucket <- function(data,nBuckets) {
> > > >      q <- quantile(data,seq(0,1,length.out=nBuckets+1),na.rm=TRUE)
> > > >      # lower the first break so the minimum is binned (no extra NAs)
> > > >      q[1] <- q[1] - 0.1
> > > >      cut(data,q,include.lowest=TRUE,labels=FALSE)
> > > > }
> > > >
> > > > calcDeciles <- function(x,colNames) {
> > > >   nBuckets <- 10
> > > >   dates <- unique(x$Date)
> > > >   for (date in dates) {
> > > >     iVec <- x$Date == date        # rows belonging to this date
> > > >     xx <- x[iVec,]
> > > >     for (colName in colNames) {
> > > >       data <- xx[,colName]
> > > >       bColName <- paste("bucket",colName,sep="")
> > > >       x[iVec,bColName] <- bucket(data,nBuckets)
> > > >     }
> > > >   }
> > > >   x
> > > > }
> > > >
> > > > x <- calcDeciles(x,c("A","B","C"))
> > > >
> > > >
> > > > I was wondering if it is possible to vectorize the above function to
> > > > make it more efficient.
> > > > I tried
> > > > rlist <- tapply(x$A,x$Date,bucket,10)
> > > > but I am not sure how to assign the contents of "rlist" to their
> > > > appropriate slots in the original data frame.
> > > >
> > > > Thanks,
> > > >
> > > > Maneesh
> > > >
>
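
Applied to the original question in the quoted thread, the split/unsplit approach suggested by Peter Dalgaard can write per-date decile ranks straight back into the data frame, without requiring the rows to be sorted by date. The following is only a sketch under stated assumptions: the data are made up (rnorm() rather than real values, so there are no ties; with heavily tied data quantile() can return duplicate breaks and cut() will then stop with an error), and bucket() is the function from the quoted post with TRUE/FALSE spelled out.

```r
# bucket() as in the quoted post: decile rank of each value within one group.
bucket <- function(data, nBuckets) {
  q <- quantile(data, seq(0, 1, length.out = nBuckets + 1), na.rm = TRUE)
  q[1] <- q[1] - 0.1  # lower the first break so the minimum is binned
  cut(data, q, include.lowest = TRUE, labels = FALSE)
}

# Hypothetical data: 3 dates x 50 rows each, rows deliberately shuffled.
set.seed(1)
x <- data.frame(Date = rep(c(20041201, 20050101, 20050201), each = 50),
                A = rnorm(150, 100, 10),
                B = rnorm(150, 50, 7),
                C = rnorm(150, 30, 5))
x <- x[sample(nrow(x)), ]

g <- factor(x$Date)  # grouping factor used for both split() and unsplit()

# One new column per property: split by date, bucket each piece,
# then unsplit() returns the ranks in the data frame's own row order.
for (colName in c("A", "B", "C")) {
  x[[paste0("bucket", colName)]] <-
    unsplit(lapply(split(x[[colName]], g), bucket, 10), g)
}
```

With 50 untied values per date, each of the 10 buckets should receive 5 rows per date. A call such as ave(x$A, x$Date, FUN = function(v) bucket(v, 10)) should give the same column, since ave() is built on the same split-and-reassemble idea.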
