Coercion to character

Coercion to character

Damian Betebenner-2

Data tablers

Does data.table now coerce factors to character variables when doing by summaries?

If so, is there any way to not allow this coercion?

Thanks,

Damian Betebenner
Center for Assessment
PO Box 351
Dover, NH   03821-0351

Phone (office): (603) 516-7900
Phone (cell): (857) 234-2474
Fax: (603) 516-7910

[hidden email]
www.nciea.org


_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: Coercion to character

Matthew Dowle
It shouldn't coerce. What makes you think it does?

> DT = data.table(a=factor(c("a","b","b","c")),b=1:4)
> DT[,sum(b),by=a]
     a V1
[1,] a  1
[2,] b  5
[3,] c  4
> str(DT[,sum(b),by=a])
Classes ‘data.table’ and 'data.frame': 3 obs. of  2 variables:
 $ a : Factor w/ 3 levels "a","b","c": 1 2 3
 $ V1: int  1 5 4



On Thu, 2012-04-12 at 14:57 -0500, Damian Betebenner wrote:

> Data tablers
>
> Does data.table now coerce factors to character variables when doing
> by summaries?
>
> If so, is there any way to not allow this coercion?

Re: Coercion to character

Damian Betebenner-2
I started seeing character vectors pop up in places where I had never had them before, but on further investigation that turned out to be an issue with my own setup, not data.table.

With regard to characters (and data.table's ability to handle them as a key now), I did notice that data.table and data.frame default to using stringsAsFactors differently:

DF <- data.frame(X=letters[1:10], Y=rnorm(10))
sapply(DF, class)

        X         Y
 "factor" "numeric"

DT <- data.table(X=rep(letters[1:10], each=2), Y=rnorm(20))
sapply(DT, class)

          X           Y
"character"   "numeric"


Will this inconsistency cause problems down the road?

Thanks for all your help,

Damian
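
The data.frame() output shown in this message reflects the base R default at the time of the thread; since R 4.0.0, data.frame() also defaults to stringsAsFactors = FALSE. A minimal base R sketch, with illustrative object names, that passes the argument explicitly so the column class does not depend on the R version:

```r
# Pass stringsAsFactors explicitly so the column class is the same
# whatever default the installed R version uses.
df_fct <- data.frame(X = letters[1:10], Y = rnorm(10), stringsAsFactors = TRUE)
df_chr <- data.frame(X = letters[1:10], Y = rnorm(10), stringsAsFactors = FALSE)

class_fct <- sapply(df_fct, class)  # X is "factor",    Y is "numeric"
class_chr <- sapply(df_chr, class)  # X is "character", Y is "numeric"
print(class_fct)
print(class_chr)
```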


-----Original Message-----
From: Matthew Dowle [mailto:[hidden email]] On Behalf Of Matthew Dowle
Sent: Thursday, April 12, 2012 5:50 PM
To: Damian Betebenner
Cc: [hidden email]
Subject: Re: [datatable-help] Coercion to character

It shouldn't coerce. What makes you think it does?


Re: Coercion to character

Matthew Dowle
I thought I'd added something to FAQ 2.17 about that, but it seems not.
Will add, thanks. Maybe I only wrote it up in the comment when closing
the related feature request. It's deliberately different, since my guess
is that most people most of the time (now) want characters left as
characters and keep setting stringsAsFactors to FALSE. I think the default
for data.frame was TRUE as a holdover from old versions of R, before the
global string cache was added.

It's not set in stone though, so it could be changed. In particular there
could be a global default, like we've done for other arguments, so you could
change the default if need be.

It won't cause a compatibility issue (same as the other differences in FAQ
2.17) or any issues down the road as far as I can think, but let me know if
you think of anything.

Matthew

On Sun, 2012-04-15 at 04:40 -0500, Damian Betebenner wrote:

> Will this inconsistency cause problems down the road?

Re: Coercion to character

Joseph Voelkel
Statistical work in R (what many of us do, I'd say) prefers factors over characters:

# factors
DF <- data.frame(X=letters[rep(1:5,2)], Y=rnorm(10))
(DF.lm<-lm(Y~X,DF))
predict(DF.lm)
# all works fine

# characters
DF <- data.frame(X=letters[rep(1:5,2)], Y=rnorm(10), stringsAsFactors=FALSE)
(DF.lm<-lm(Y~X,DF)) # warning
predict(DF.lm) # warning

Not sure if this one will get resolved.

Using factors instead of characters also ensures that a table of months or days of the week can be listed in the natural (not alphabetic) ordering.

Joe
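
The point about natural ordering can be sketched in base R as follows. This is only an illustration with made-up data; month.abb is the built-in vector of abbreviated English month names.

```r
# Character data: table() lists the values in alphabetical order.
m <- c("Mar", "Jan", "Feb", "Jan", "Mar", "Mar")
chr_order <- names(table(m))
print(chr_order)   # "Feb" "Jan" "Mar"

# Factor with explicit levels: table() follows the level order instead.
m_fct <- factor(m, levels = month.abb)
tab <- table(m_fct)
fct_order <- names(tab)[as.vector(tab) > 0]   # drop months with zero count
print(fct_order)   # "Jan" "Feb" "Mar"
```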

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Matthew Dowle
Sent: Sunday, April 15, 2012 1:20 PM
To: Damian Betebenner
Cc: [hidden email]
Subject: Re: [datatable-help] Coercion to character

I thought I'd added something to FAQ 2.17 about that, but seems not.

Re: Coercion to character

caneff
At the same time, there are a lot of uses in R where the vector really
is just an id, not something you would regress on. I often work on
data frames with 10 million rows and 1 million ids; having that as a
factor is just downright wasteful and slow.

I think the global option is the best compromise here.

On Tue, Apr 17, 2012 at 11:37 AM, Joseph Voelkel <[hidden email]> wrote:

> Statistical work in R (what many of us do, I'd say) prefers factors over characters:

Re: Coercion to character

Matthew Dowle
In reply to this post by Joseph Voelkel

Yes, but I envisage a workflow where data is always loaded up as character,
and then any columns you'd like as factor are set to factor afterwards,
using DT[,col:=factor(col)]. Other columns might load up as character that
are actually dates or times (for example), and then have
DT[,col:=as.IDate(col)] applied to them instead. The factor()
operation is quite compute intensive (finding unique values, then matching
all data to those unique values), so making the default TRUE to convert to
factor() [which may need to be converted afterwards to something other
than factor, such as date] doesn't seem the right default from the
efficiency perspective.
Another column type example is comma separated lists, such as in genomics.
Those really do need to be character and then strsplit() afterwards,
rather than a factor, unfactored and then strsplit(). That's what the
dual-delimited fast file loader will handle in data.table, when
implemented.
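
A rough sketch of that workflow, assuming the data.table package is installed. The table and its column names (grp, day, genes) are invented for illustration and are not from the thread.

```r
library(data.table)

# Everything arrives as character, as it would from a text file.
DT <- data.table(
  grp   = c("b", "a", "b"),
  day   = c("2012-04-12", "2012-04-15", "2012-04-17"),
  genes = c("g1,g2", "g3", "g1,g4,g5")
)

DT[, grp := factor(grp)]     # categorical column: convert to factor by reference
DT[, day := as.IDate(day)]   # date column: convert straight to IDate, never a factor

# Delimited lists: split the character column directly, no factor round trip.
gene_list <- strsplit(DT$genes, ",", fixed = TRUE)

print(sapply(DT, function(col) class(col)[1]))
```

Each column is converted once, to the type it actually needs, which is the efficiency argument made in the message.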

> Not sure if this one will get resolved.

We could add the 'stringsAsFactors' argument to data.table(), then you
could change it from FALSE to TRUE. Would that work?  The default would be
getOption("datatable.stringsAsFactors") so you could change it globally,
just like base::default.stringsAsFactors().
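
The proposal is a function argument whose default is read from a global option. A minimal base R sketch of that pattern follows; make_table() is a hypothetical stand-in, not data.table's API, and the option name is simply the one suggested in this message.

```r
# Hypothetical constructor (NOT data.table's API): the argument's default is
# read from a global option, falling back to FALSE when the option is unset.
make_table <- function(...,
                       stringsAsFactors = getOption("datatable.stringsAsFactors", FALSE)) {
  data.frame(..., stringsAsFactors = stringsAsFactors)
}

old <- options(datatable.stringsAsFactors = NULL)    # start from "unset"
res_unset <- make_table(X = c("u", "v"))             # option unset: character
options(datatable.stringsAsFactors = TRUE)
res_global <- make_table(X = c("u", "v"))            # global default: factor
res_override <- make_table(X = c("u", "v"),
                           stringsAsFactors = FALSE) # per-call override: character
options(old)                                         # restore the previous setting
```

Because the default is evaluated at call time, changing the option affects later calls without redefining the function.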

> Using factors instead of characters also ensures that a table of months or
> days of the week can be listed in the natural (not alphabetic) ordering.

But factor() re-orders levels, and so does stringsAsFactors=TRUE, afaik,
doesn't it? You have to work a bit harder to get factor levels in a
non-alphabetical order, by passing levels to factor() "manually". That's very
useful, important and supported in data.table (mentioned in last NEWS) but not
the best default for data.table(), iiuc.
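
A small base R sketch of both points, using made-up weekday data: by default factor() builds its levels from the sorted unique values and matches every element against them, and keeping a natural order takes an explicit levels argument.

```r
x <- c("Wed", "Mon", "Fri", "Mon")

# Default: levels are the sorted unique values, so weekday order is lost.
f_default <- factor(x)
lev <- sort(unique(x))
codes_match <- match(x, lev)   # the "match all data to the unique values" step
same_levels <- identical(levels(f_default), lev)
same_codes  <- identical(as.integer(f_default), codes_match)

# Extra step: pass the levels explicitly to keep the natural order.
f_custom <- factor(x, levels = c("Mon", "Wed", "Fri"))
print(levels(f_default))
print(levels(f_custom))
```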

Matthew


Re: Coercion to character

Joseph Voelkel
Personally, I am fine with having data.table read in character data as character. I don't find 'stringsAsFactors' particularly useful (although I use it sometimes)--I'll usually have a mix of character data, some of which I'd like as factors and some of which I'd like as character (or, to convert to dates, as your example notes). And some I'm just confused about--I don't like working with factors in general so I prefer to keep character data as character, but then all of a sudden I decide I want to run some character data in some models ...

I like the DT[,col:=factor(col)] approach. It also makes the intent clearer.

"But factor() re-orders levels ..." Yes. An extra step is required. But well worth it to have January as the first month of the year!

Joe

-----Original Message-----
From: Matthew Dowle [mailto:[hidden email]]
Sent: Tuesday, April 17, 2012 12:26 PM
To: Joseph Voelkel
Cc: Damian Betebenner; [hidden email]
Subject: RE: [datatable-help] Coercion to character

