
how to collapse categories or re-categorize variables?

CocaCola
I am sure this is a very basic question:

I have 600,000 categorical variables in a data.frame - each of which is
classified as "0", "1", or "2"

What I would like to do is collapse "1" and "2" and leave "0" by itself,
such that after re-categorizing "0" = "0"; "1" = "1" and "2" = "1" --- in
the end I only want "0" and "1" as categories for each of the variables.

Also, if possible, I would rather not create 600,000 new variables; if I can
replace the existing variables with the new values, that would be great!

What would be the best way to do this?

Thank you!


--
Thanks,
CC

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: how to collapse categories or re-categorize variables?

Wu Gong
Do you want to replace specific values of a data set?

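df <- sample(c(0, 1, 2), 600, replace = TRUE)  # toy vector of codes 0, 1, 2
table(df)                                      # counts before recoding
df[df == 2] <- 1                               # recode every 2 as 1
table(df)                                      # only 0 and 1 remain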
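
The same idea extends from a single vector to a whole data.frame, which is closer to the original question. The following is only a minimal sketch: it assumes every column is numeric (not a factor), and the data frame `dat` with columns `v1` and `v2` is made up for illustration.

```r
# Hypothetical toy data: two numeric columns coded 0/1/2
dat <- data.frame(v1 = c(0, 1, 2, 2, 1),
                  v2 = c(2, 0, 0, 1, 2))

# dat == 2 is a logical matrix; assigning through it recodes every 2 as 1
# in place, so no new variables are created and column names are kept
dat[dat == 2] <- 1

table(unlist(dat))
```

With 600,000 columns the intermediate logical matrix may be large, so a column-by-column approach could be preferable.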
Re: how to collapse categories or re-categorize variables?

djmuseR
In reply to this post by CocaCola
Hi:

See ?levels. Here's a toy example:

> x <- factor(sample(0:2, 10, replace = TRUE))
> x
 [1] 1 2 1 0 2 2 2 2 2 1
Levels: 0 1 2

> levels(x) <- c(0, 1, 1)    # Change level 2 to 1
> x
 [1] 1 1 1 0 1 1 1 1 1 1
Levels: 0 1

HTH,
Dennis
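
A possible variation on the example above: `levels<-` also accepts a named list, which maps old levels to new ones by name rather than by position. The positional form relies on each factor having the levels 0, 1, 2 in that order; a column in which 0 never occurs would be relabelled incorrectly. The sketch below uses made-up vectors `x` and `z` to illustrate the list form.

```r
# All three levels present
x <- factor(c(1, 2, 1, 0, 2))
levels(x) <- list("0" = "0", "1" = c("1", "2"))  # new name = old level(s)
levels(x)

# Level "0" never occurs here; mapping by name still recodes correctly
z <- factor(c(1, 2, 2))
levels(z) <- list("0" = "0", "1" = c("1", "2"))
levels(z)
as.character(z)
```

With many columns, the list form may be the safer choice, because not every column is guaranteed to contain all three values.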


Re: how to collapse categories or re-categorize variables?

Ista Zahn-2
In reply to this post by CocaCola
Hi,
On Fri, Jul 16, 2010 at 5:18 PM, CC <[hidden email]> wrote:
> I am sure this is a very basic question:
>
> I have 600,000 categorical variables in a data.frame - each of which is
> classified as "0", "1", or "2"
>
> What I would like to do is collapse "1" and "2" and leave "0" by itself,
> such that after re-categorizing "0" = "0"; "1" = "1" and "2" = "1" --- in
> the end I only want "0" and "1" as categories for each of the variables.

Something like this should work

for (i in names(dat)) {
  dat[, i] <- factor(dat[, i], levels = c("0", "1", "2"),
                     labels = c("0", "1", "1"))
}

-Ista




--
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org

Re: how to collapse categories or re-categorize variables?

Peter Dalgaard-2
Ista Zahn wrote:

> Something like this should work
>
> for (i in names(dat)) {
> dat[, i]  <- factor(dat[, i], levels = c("0", "1", "2"), labels =
> c("0", "1", "1"))
> }

Unfortunately, it won't:

> d <- 0:2
> factor(d, levels=c(0,1,1))
[1] 0    1    <NA>
Levels: 0 1 1
Warning message:
In `levels<-`(`*tmp*`, value = c("0", "1", "1")) :
  duplicated levels will not be allowed in factors anymore


This effect, I have been told, goes way back to design choices in S
(that you can have repeated level names) plus compatibility ever since.

It would make more sense if it behaved like

d <- factor(d); levels(d) <- c(0,1,1)

and maybe, some time in the future, it will. Meanwhile, the above is the
workaround.

(BTW, if there are 600000 variables, you probably don't want to iterate
over their names, more likely "for(i in seq_along(dat))...")

--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: [hidden email]  Priv: [hidden email]
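
The workaround and the `seq_along()` suggestion above could be combined roughly as follows. This is only a sketch: the toy data frame `dat` and its column names are made up, and `levels = 0:2` is passed explicitly so that each factor has all three levels in a fixed order before they are merged.

```r
# Hypothetical toy data standing in for the 600,000-column data frame
dat <- data.frame(v1 = c(0, 1, 2, 2),
                  v2 = c(2, 0, 1, 0))

for (i in seq_along(dat)) {
  f <- factor(dat[[i]], levels = 0:2)  # unique levels, fixed order
  levels(f) <- c("0", "1", "1")        # merge level 2 into level 1
  dat[[i]] <- f                        # overwrite the column in place
}

str(dat)
```

Indexing by position avoids building and matching 600,000 column names on each pass.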

Re: how to collapse categories or re-categorize variables?

Phil Spector
Please look at Peter Dalgaard's response a little more
carefully.  There's a big difference between the levels=
argument (which must be unique) and the labels= argument
(which need not be).  Here are two ways
to do what you want:

> d = 0:2
> factor(d,levels=0:2,labels=c('0','1','1'))
[1] 0 1 1
> library(car)
> recode(d,"c(1,2)='1'")
[1] 0 1 1


  - Phil Spector
  Statistical Computing Facility
  Department of Statistics
  UC Berkeley
  [hidden email]


Re: how to collapse categories or re-categorize variables?

Ista Zahn-2
In reply to this post by Peter Dalgaard-2
On Sat, Jul 17, 2010 at 9:03 PM, Peter Dalgaard <[hidden email]> wrote:

> Unfortunately, it won't:
>
>> d <- 0:2
>> factor(d, levels=c(0,1,1))
> [1] 0    1    <NA>
> Levels: 0 1 1
> Warning message:
> In `levels<-`(`*tmp*`, value = c("0", "1", "1")) :
>  duplicated levels will not be allowed in factors anymore
>

I stand corrected. Thank you Peter.

--
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org

Re: how to collapse categories or re-categorize variables?

Henric (Nilsson) Winell
In reply to this post by Peter Dalgaard-2
On 2010-07-17 23:03, Peter Dalgaard wrote:

> (BTW, if there are 600000 variables, you probably don't want to iterate
> over their names, more likely "for(i in seq_along(dat))...")

You could also use 'lapply' with 'levels<-':

 > ### Example data
 > set.seed(1)
 > d <- 0:2
 > DF <- data.frame(X1 = factor(sample(d, size = 10, replace = TRUE)),
+                  X2 = factor(sample(d, size = 10, replace = TRUE)))
 > DF
    X1 X2
1   0  0
2   1  0
3   1  2
4   2  1
5   0  2
6   2  1
7   2  2
8   1  2
9   1  1
10  0  2
 >
 > ### Collapse levels 1 and 2, replacing the columns in place
 > DF[] <- lapply(DF, function(x) "levels<-"(x, c("0", "1", "1")))
 > DF
    X1 X2
1   0  0
2   1  0
3   1  1
4   1  1
5   0  1
6   1  1
7   1  1
8   1  1
9   1  1
10  0  1


HTH,
Henric

Re: how to collapse categories or re-categorize variables?

CocaCola
In reply to this post by Phil Spector
Thank you very much to all of you for the responses!

Phil, the following two examples worked well in re-categorizing.  Is there
any way I can retain my original data.frame format?  Once I re-categorize
the data, the format becomes numeric and the original column names are not
retained.  I probably should have mentioned this earlier - I plan on using
the re-categorized data for coxph using the Surv() function.

Here's how I am applying the re-categorization to my data:

*****************************************************************************

library(survival)
library(car)

gg <- read.table("k.csv", header=TRUE, sep = ",")
col = dim(gg)[2]

for(i in 1:col) {
aa<- recode(gg[,i], "c(1,2)='1'")
}

for(i in 1:col) {
dd<-factor(gg[,i],levels=0:2,labels=c('0','1','1'))
}


Thanks,
CC
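
One possible way to keep the data.frame format and the column names: in the loops above, `aa` and `dd` are overwritten on every iteration and never written back into `gg`. Assigning the result of `lapply()` to `gg[]`, as in Henric's reply, replaces the columns while keeping the data frame and its names. The sketch below uses a made-up stand-in for `k.csv`, and the helper name `collapse12` is hypothetical.

```r
# Hypothetical stand-in for read.table("k.csv", header = TRUE, sep = ",")
gg <- data.frame(rs1 = c(0L, 1L, 2L, 1L),
                 rs2 = c(2L, 2L, 0L, 1L))

# Turn one column into a factor and merge level 2 into level 1
collapse12 <- function(x) {
  f <- factor(x, levels = 0:2)   # unique levels, fixed order
  levels(f) <- c("0", "1", "1")  # "2" becomes "1"
  f
}

# gg[] <- ... replaces the columns but keeps the data.frame and its names
gg[] <- lapply(gg, collapse12)

str(gg)
```

If the file also holds columns that should not be recoded (for example survival time and status), the same assignment could presumably be limited to a subset, as in `gg[cols] <- lapply(gg[cols], collapse12)`, where `cols` is a hypothetical vector of the relevant column names.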


