Characters and Factor

Characters and Factor

Damian Betebenner-2

All,

 

Not sure how to characterize this (a new feature or a bug), but the behavior is causing problems in code I’ve written that previously worked as I expected. I have integers bigger than 2^32 that I have to encode as factors. After doing some data.table stuff (like below), it reorders the factors as characters and corrupts subsequent merges back to tables where these factors are ordered as “integers”.

 

Remedies?

 

tmp.dt1 <- data.table(X=as.factor(1:10), Y=rnorm(10), key="X")
tmp.dt2 <- data.table(X=as.factor(101:110), Y=rnorm(10), key="X")

 

rbind(tmp.dt1, tmp.dt2)

 

       V1           Y
 [1,]   1  0.47655333
 [2,]   2 -0.43962704
 [3,]   3 -0.78312270
 [4,]   4  1.88935392
 [5,]   5 -0.56413463
 [6,]   6 -0.69177767
 [7,]   7 -0.09942112
 [8,]   8  0.21452552
 [9,]   9 -0.86136222
[10,]  10  0.55623427
[11,] 101  0.02090036
[12,] 102 -0.41816481
[13,] 103  0.04798975
[14,] 104  0.93709966
[15,] 105 -0.95835181
[16,] 106  0.82207890
[17,] 107  0.85902512
[18,] 108  1.33042023
[19,] 109  0.22596849
[20,] 110  0.99209054

 

data.table(rbind(tmp.dt1, tmp.dt2), key="X")

        X           Y
 [1,]   1 -0.16225884
 [2,]  10  0.82979617
 [3,] 101  0.22412653
 [4,] 102 -0.24841475
 [5,] 103 -0.09914182
 [6,] 104 -1.47982574
 [7,] 105 -1.79957210
 [8,] 106 -2.01715940
 [9,] 107 -0.81900855
[10,] 108  0.26357249
[11,] 109 -1.22742679
[12,] 110  0.64773494
[13,]   2 -0.98312948
[14,]   3  0.99937771
[15,]   4 -1.72355977
[16,]   5 -2.02481542
[17,]   6 -0.07222688
[18,]   7  0.17921321
[19,]   8 -0.92102526
[20,]   9 -0.14129584

 

 

 

Damian Betebenner

Center for Assessment

PO Box 351

Dover, NH   03821-0351

 

Phone (office): (603) 516-7900

Phone (cell): (857) 234-2474

Fax: (603) 516-7910

 

[hidden email]

www.nciea.org

 

 

 


_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: Characters and Factor

Matthew Dowle

Damian Betebenner <dbetebenner <at> nciea.org> writes:
> All,
>  
> Not sure how to characterize this (a new feature or a bug), but the behavior
> is causing problems in code I’ve written that previously worked as I expected.
> I have integers bigger than 2^32 that I have to encode as factors. After doing
> some data.table stuff (like below), it reorders the factors as characters and
> corrupts subsequent merges back to tables where these factors are ordered as
> “integers”.
>  
> Remedies?
>  
> tmp.dt1 <- data.table(X=as.factor(1:10), Y=rnorm(10), key="X")
> tmp.dt2 <- data.table(X=as.factor(101:110), Y=rnorm(10), key="X")
>  
> rbind(tmp.dt1, tmp.dt2)
>  
>        V1           Y
>  [1,]   1  0.47655333
>  [2,]   2 -0.43962704
>  [3,]   3 -0.78312270
>  [4,]   4  1.88935392
>  [5,]   5 -0.56413463
>  [6,]   6 -0.69177767
>  [7,]   7 -0.09942112
>  [8,]   8  0.21452552
>  [9,]   9 -0.86136222
> [10,]  10  0.55623427
> [11,] 101  0.02090036
> [12,] 102 -0.41816481
> [13,] 103  0.04798975
> [14,] 104  0.93709966
> [15,] 105 -0.95835181
> [16,] 106  0.82207890
> [17,] 107  0.85902512
> [18,] 108  1.33042023
> [19,] 109  0.22596849
> [20,] 110  0.99209054
>  
> data.table(rbind(tmp.dt1, tmp.dt2), key="X")
>         X           Y
>  [1,]   1 -0.16225884
>  [2,]  10  0.82979617
>  [3,] 101  0.22412653
>  [4,] 102 -0.24841475
>  [5,] 103 -0.09914182
>  [6,] 104 -1.47982574
>  [7,] 105 -1.79957210
>  [8,] 106 -2.01715940
>  [9,] 107 -0.81900855
> [10,] 108  0.26357249
> [11,] 109 -1.22742679
> [12,] 110  0.64773494
> [13,]   2 -0.98312948
> [14,]   3  0.99937771
> [15,]   4 -1.72355977
> [16,]   5 -2.02481542
> [17,]   6 -0.07222688
> [18,]   7  0.17921321
> [19,]   8 -0.92102526
> [20,]   9 -0.14129584  
>  
Hi,

I've had a quick look but can't quite grasp it. rbind.data.table calls
data.table::c.factor() to concatenate factor columns, and that reorders the
levels on the new combined factor. I guess it shouldn't now that unordered
factor levels are allowed and supported in data.table. But if that's it, it
wouldn't have worked before either and it's always been a problem. Also I don't
see how a corruption could occur since joins between two factor columns with
different levels (each possibly unordered) should work fine. Could you provide
some more details before and after showing the change exactly?
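
For reference, here is a minimal base-R sketch of combining two factors while keeping the levels in numeric rather than alphabetical order. It does not use data.table and is not a description of what rbind.data.table or c.factor() do internally; the names f1, f2, lv and combined are illustrative only, and it assumes every level parses as a number.

```r
# Two factors whose levels are digit strings, as in the example above
f1 <- factor(1:10)
f2 <- factor(101:110)

# Take the union of the levels and order it by numeric value, not as strings
lv <- union(levels(f1), levels(f2))
lv <- lv[order(as.numeric(lv))]

# Rebuild the combined factor with the explicitly ordered levels
combined <- factor(c(as.character(f1), as.character(f2)), levels = lv)

# The integer codes now follow numeric order of the labels
codes <- as.integer(combined)
```

Sorting or keying on the integer codes of combined should then give 1, 2, ..., 10, 101, ..., 110 rather than 1, 10, 101, ..., 2, 3.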

I was going to say 'just use character', but had never considered ordered
integers greater than 2^32 as the use case, so character type wouldn't work for
them. It's a new one on me, so either way some new tests are needed.
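
To illustrate the ordering issue with character data: digit strings sort lexicographically, so "10" comes before "9", whereas a factor whose levels are supplied in numeric order sorts as intended. The following base-R sketch uses made-up ids (the values are assumptions, not taken from the original data) and relies on doubles representing integers of this size exactly.

```r
# Hypothetical ids, one of them larger than 2^32
ids <- c(9, 10, 101, 2^32 + 1)

# Convert to digit strings without scientific notation
ids_chr <- sprintf("%.0f", ids)

# Sorting as character is lexicographic (radix method = C-locale ordering)
chr_sorted <- sort(ids_chr, method = "radix")

# A factor with levels given in numeric order sorts by those levels instead
f <- factor(ids_chr, levels = sprintf("%.0f", sort(ids)))
fac_sorted <- as.character(sort(f))
```

Here chr_sorted puts "9" last, while fac_sorted keeps the numeric order.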

Finally, there is this fix in v1.8.1 that might be involved somehow :

o Joining a factor column with unsorted and unused levels to a character
  column now matches properly, fixing #1922. Thanks to Christoph Jäckel for
  the reproducible example. Test added.

Matthew

