|
Hi together,
here is the problem I needed dput for: http://www.fileuploadx.de/287440 (sorry, I know that this filehoster is annoying because you have to wait until you can download the file; I hope you have a coffee machine close by ;-)
In this attachment, I basically load in two data.tables DT1 and DT2 that I want to join, i.e. DT2[DT1], according to the keyed columns "Company_Code" and "intDatum" in DT1 and "DSCD" and "intDatum" in DT2. However, while "DSCD" is formatted as a character-column, "Company_Code" is formatted as a factor-column. As you can see from the long structure-object, there are plenty of levels here (the actual data.tables are very small).
Now, when I try to join those with DT2[DT1], I get:
DSCD intDatum MONTH MV SICClass
[1,] 997859 151 <NA> NA 44
[2,] 997859 152 <NA> NA 44
[3,] 998064 151 <NA> NA 15
[4,] 998064 152 <NA> NA 15
[5,] 142268 151 <NA> NA 53
[6,] 142268 152 <NA> NA 53
[7,] 142859 151 <NA> NA 56
[8,] 142859 152 <NA> NA 56
[9,] 143415 151 <NA> NA 63
[10,] 143415 152 <NA> NA 63
[11,] 307045 151 <NA> NA 15
[12,] 307045 152 <NA> NA 15Basically, data.table finds no values for MV and MONTH for any DSCD and intDatum combination. However, as DT2[DSCD=="142268"] clearly shows, there are values for that DSCD:
_______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
Hi. A quick read suggests it's not intended and that's a bug. Just convert
the columns to character for now, and it should work. Character columns are now preferred going forward, so I'd be suggesting that anyway even if it worked. > So in case this is expected behavior, should data.table > give at least a warning that says something like "You join two > data.tables whereby one keyed column is a factor, one is a character. > That is probably not your intention. Convert the factor column to > character or vice versa."? Yes. It should be converting to character (with a warning) in this case. Thought that's what I coded and tested. Will investigate... > Hi together, > > here is the problem I needed dput for: http://www.fileuploadx.de/287440 > (sorry, > I know that this filehoster is annoying because you have to wait until you > can download the file; I hope you have a coffee machine close by ;-) > > In this attachment, I basically load in two data.tables DT1 and DT2 that I > want to join, i.e. DT2[DT1], according to the keyed columns "Company_Code" > and "intDatum" in DT1 and "DSCD" and "intDatum" in DT2. However, while > "DSCD" is formatted as a character-column, "Company_Code" is formatted as > a > factor-column. As you can see from the long structure-object, there are > plenty of levels here (the actual data.tables are very small). > > Now, when I try to join those with DT2[DT1], I get: > > > DSCD intDatum MONTH MV SICClass > [1,] 997859 151 <NA> NA 44 > [2,] 997859 152 <NA> NA 44 > [3,] 998064 151 <NA> NA 15 > [4,] 998064 152 <NA> NA 15 > [5,] 142268 151 <NA> NA 53 > [6,] 142268 152 <NA> NA 53 > [7,] 142859 151 <NA> NA 56 > [8,] 142859 152 <NA> NA 56 > [9,] 143415 151 <NA> NA 63 > [10,] 143415 152 <NA> NA 63 > [11,] 307045 151 <NA> NA 15 > [12,] 307045 152 <NA> NA 15 > > > > Basically, data.table finds no values for MV and MONTH for any DSCD > and intDatum combination. However, as DT2[DSCD=="142268"] clearly > shows, there are values for that DSCD: > > > > DSCD MONTH MV intDatum > [1,] 142268 1997-08-28 1901.12 151 > [2,] 142268 1997-09-28 1829.00 152 > > > > Those, however, only show up in the join after i get rid of all the > unused levels (equivalently, I can also transform the Company_Code to > a character column): > > > DT1[, Company_Code := factor(Company_Code)] > DT2[DT1] > > > DSCD intDatum MONTH MV SICClass > [1,] 997859 151 <NA> NA 44 > [2,] 997859 152 <NA> NA 44 > [3,] 998064 151 <NA> NA 15 > [4,] 998064 152 <NA> NA 15 > [5,] 142268 151 1997-08-28 1901.12 53 > [6,] 142268 152 1997-09-28 1829.00 53 > [7,] 142859 151 <NA> NA 56 > [8,] 142859 152 <NA> NA 56 > [9,] 143415 151 <NA> NA 63 > [10,] 143415 152 <NA> NA 63 > [11,] 307045 151 <NA> NA 15 > [12,] 307045 152 <NA> NA 15 > > > > I'm pretty sure this behaviour occurred only with version 1.8.0, > probably because data.table coerced every key to factor before (see > the NEWS to 1.8.0). So my question is: Is what happens here intended > behavior? I'm honest with you: I'm working now for a while with R and > factors are one of those things that I never got. I just don't see > their use and every so often they cause me huge problems (as in this > case). So I'm probably making something stupid here. The nasty thing > about this issue here is that mostly, however, the joins just work as > expected (believe me, I tried to produce a simple example with one > column factor and one character that would reproduce this behavior, > but no matter what I did, the joins afterwards always worked as > expected). So in case this is expected behavior, should data.table > give at least a warning that says something like "You join two > data.tables whereby one keyed column is a factor, one is a character. > That is probably not your intention. Convert the factor column to > character or vice versa."? > > > > Thanks for your help! > > > > Christoph > _______________________________________________ > datatable-help mailing list > [hidden email] > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
Thanks for example and data, very clear. Yes, problem looks to be factors with unused levels, when joined to a character column, as you suggested. Work arounds are to drop the unused levels or convert to character, as you found. A fix is a bit more involved and won't have time for a while. Please could you file a bug report so it doesn't get forgotten. Thanks. On Fri, 2012-03-30 at 16:51 +0100, Matthew Dowle wrote: > Hi. A quick read suggests it's not intended and that's a bug. Just convert > the columns to character for now, and it should work. Character columns > are now preferred going forward, so I'd be suggesting that anyway even if > it worked. > > > So in case this is expected behavior, should data.table > > give at least a warning that says something like "You join two > > data.tables whereby one keyed column is a factor, one is a character. > > That is probably not your intention. Convert the factor column to > > character or vice versa."? > > Yes. It should be converting to character (with a warning) in this case. > Thought that's what I coded and tested. Will investigate... > > > Hi together, > > > > here is the problem I needed dput for: http://www.fileuploadx.de/287440 > > (sorry, > > I know that this filehoster is annoying because you have to wait until you > > can download the file; I hope you have a coffee machine close by ;-) > > > > In this attachment, I basically load in two data.tables DT1 and DT2 that I > > want to join, i.e. DT2[DT1], according to the keyed columns "Company_Code" > > and "intDatum" in DT1 and "DSCD" and "intDatum" in DT2. However, while > > "DSCD" is formatted as a character-column, "Company_Code" is formatted as > > a > > factor-column. As you can see from the long structure-object, there are > > plenty of levels here (the actual data.tables are very small). > > > > Now, when I try to join those with DT2[DT1], I get: > > > > > > DSCD intDatum MONTH MV SICClass > > [1,] 997859 151 <NA> NA 44 > > [2,] 997859 152 <NA> NA 44 > > [3,] 998064 151 <NA> NA 15 > > [4,] 998064 152 <NA> NA 15 > > [5,] 142268 151 <NA> NA 53 > > [6,] 142268 152 <NA> NA 53 > > [7,] 142859 151 <NA> NA 56 > > [8,] 142859 152 <NA> NA 56 > > [9,] 143415 151 <NA> NA 63 > > [10,] 143415 152 <NA> NA 63 > > [11,] 307045 151 <NA> NA 15 > > [12,] 307045 152 <NA> NA 15 > > > > > > > > Basically, data.table finds no values for MV and MONTH for any DSCD > > and intDatum combination. However, as DT2[DSCD=="142268"] clearly > > shows, there are values for that DSCD: > > > > > > > > DSCD MONTH MV intDatum > > [1,] 142268 1997-08-28 1901.12 151 > > [2,] 142268 1997-09-28 1829.00 152 > > > > > > > > Those, however, only show up in the join after i get rid of all the > > unused levels (equivalently, I can also transform the Company_Code to > > a character column): > > > > > > DT1[, Company_Code := factor(Company_Code)] > > DT2[DT1] > > > > > > DSCD intDatum MONTH MV SICClass > > [1,] 997859 151 <NA> NA 44 > > [2,] 997859 152 <NA> NA 44 > > [3,] 998064 151 <NA> NA 15 > > [4,] 998064 152 <NA> NA 15 > > [5,] 142268 151 1997-08-28 1901.12 53 > > [6,] 142268 152 1997-09-28 1829.00 53 > > [7,] 142859 151 <NA> NA 56 > > [8,] 142859 152 <NA> NA 56 > > [9,] 143415 151 <NA> NA 63 > > [10,] 143415 152 <NA> NA 63 > > [11,] 307045 151 <NA> NA 15 > > [12,] 307045 152 <NA> NA 15 > > > > > > > > I'm pretty sure this behaviour occurred only with version 1.8.0, > > probably because data.table coerced every key to factor before (see > > the NEWS to 1.8.0). So my question is: Is what happens here intended > > behavior? I'm honest with you: I'm working now for a while with R and > > factors are one of those things that I never got. I just don't see > > their use and every so often they cause me huge problems (as in this > > case). So I'm probably making something stupid here. The nasty thing > > about this issue here is that mostly, however, the joins just work as > > expected (believe me, I tried to produce a simple example with one > > column factor and one character that would reproduce this behavior, > > but no matter what I did, the joins afterwards always worked as > > expected). So in case this is expected behavior, should data.table > > give at least a warning that says something like "You join two > > data.tables whereby one keyed column is a factor, one is a character. > > That is probably not your intention. Convert the factor column to > > character or vice versa."? > > > > > > > > Thanks for your help! > > > > > > > > Christoph > > _______________________________________________ > > datatable-help mailing list > > [hidden email] > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > datatable-help mailing list > [hidden email] > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
I have raised a bug report under, which can be found under #1922.
On Mon, Apr 2, 2012 at 2:17 AM, Matthew Dowle <[hidden email]> wrote:
_______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
Now fixed in v1.8.1.
On Mon, 2012-04-02 at 13:39 +0200, Christoph Jäckel wrote: > I have raised a bug report under, which can be found under #1922. > > On Mon, Apr 2, 2012 at 2:17 AM, Matthew Dowle <[hidden email]> > wrote: > > Thanks for example and data, very clear. > > Yes, problem looks to be factors with unused levels, when > joined to a > character column, as you suggested. Work arounds are to drop > the unused > levels or convert to character, as you found. > > A fix is a bit more involved and won't have time for a while. > Please > could you file a bug report so it doesn't get forgotten. > > Thanks. > > On Fri, 2012-03-30 at 16:51 +0100, Matthew Dowle wrote: > > Hi. A quick read suggests it's not intended and that's a > bug. Just convert > > the columns to character for now, and it should work. > Character columns > > are now preferred going forward, so I'd be suggesting that > anyway even if > > it worked. > > > > > So in case this is expected behavior, should data.table > > > give at least a warning that says something like "You join > two > > > data.tables whereby one keyed column is a factor, one is a > character. > > > That is probably not your intention. Convert the factor > column to > > > character or vice versa."? > > > > Yes. It should be converting to character (with a warning) > in this case. > > Thought that's what I coded and tested. Will investigate... > > > > > Hi together, > > > > > > here is the problem I needed dput for: > http://www.fileuploadx.de/287440 > > > (sorry, > > > I know that this filehoster is annoying because you have > to wait until you > > > can download the file; I hope you have a coffee machine > close by ;-) > > > > > > In this attachment, I basically load in two data.tables > DT1 and DT2 that I > > > want to join, i.e. DT2[DT1], according to the keyed > columns "Company_Code" > > > and "intDatum" in DT1 and "DSCD" and "intDatum" in DT2. > However, while > > > "DSCD" is formatted as a character-column, "Company_Code" > is formatted as > > > a > > > factor-column. As you can see from the long > structure-object, there are > > > plenty of levels here (the actual data.tables are very > small). > > > > > > Now, when I try to join those with DT2[DT1], I get: > > > > > > > > > DSCD intDatum MONTH MV SICClass > > > [1,] 997859 151 <NA> NA 44 > > > [2,] 997859 152 <NA> NA 44 > > > [3,] 998064 151 <NA> NA 15 > > > [4,] 998064 152 <NA> NA 15 > > > [5,] 142268 151 <NA> NA 53 > > > [6,] 142268 152 <NA> NA 53 > > > [7,] 142859 151 <NA> NA 56 > > > [8,] 142859 152 <NA> NA 56 > > > [9,] 143415 151 <NA> NA 63 > > > [10,] 143415 152 <NA> NA 63 > > > [11,] 307045 151 <NA> NA 15 > > > [12,] 307045 152 <NA> NA 15 > > > > > > > > > > > > Basically, data.table finds no values for MV and MONTH for > any DSCD > > > and intDatum combination. However, as DT2[DSCD=="142268"] > clearly > > > shows, there are values for that DSCD: > > > > > > > > > > > > DSCD MONTH MV intDatum > > > [1,] 142268 1997-08-28 1901.12 151 > > > [2,] 142268 1997-09-28 1829.00 152 > > > > > > > > > > > > Those, however, only show up in the join after i get rid > of all the > > > unused levels (equivalently, I can also transform the > Company_Code to > > > a character column): > > > > > > > > > DT1[, Company_Code := factor(Company_Code)] > > > DT2[DT1] > > > > > > > > > DSCD intDatum MONTH MV SICClass > > > [1,] 997859 151 <NA> NA 44 > > > [2,] 997859 152 <NA> NA 44 > > > [3,] 998064 151 <NA> NA 15 > > > [4,] 998064 152 <NA> NA 15 > > > [5,] 142268 151 1997-08-28 1901.12 53 > > > [6,] 142268 152 1997-09-28 1829.00 53 > > > [7,] 142859 151 <NA> NA 56 > > > [8,] 142859 152 <NA> NA 56 > > > [9,] 143415 151 <NA> NA 63 > > > [10,] 143415 152 <NA> NA 63 > > > [11,] 307045 151 <NA> NA 15 > > > [12,] 307045 152 <NA> NA 15 > > > > > > > > > > > > I'm pretty sure this behaviour occurred only with version > 1.8.0, > > > probably because data.table coerced every key to factor > before (see > > > the NEWS to 1.8.0). So my question is: Is what happens > here intended > > > behavior? I'm honest with you: I'm working now for a while > with R and > > > factors are one of those things that I never got. I just > don't see > > > their use and every so often they cause me huge problems > (as in this > > > case). So I'm probably making something stupid here. The > nasty thing > > > about this issue here is that mostly, however, the joins > just work as > > > expected (believe me, I tried to produce a simple example > with one > > > column factor and one character that would reproduce this > behavior, > > > but no matter what I did, the joins afterwards always > worked as > > > expected). So in case this is expected behavior, should > data.table > > > give at least a warning that says something like "You join > two > > > data.tables whereby one keyed column is a factor, one is a > character. > > > That is probably not your intention. Convert the factor > column to > > > character or vice versa."? > > > > > > > > > > > > Thanks for your help! > > > > > > > > > > > > Christoph > > > _______________________________________________ > > > datatable-help mailing list > > > [hidden email] > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > _______________________________________________ > > datatable-help mailing list > > [hidden email] > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
| Powered by Nabble | Edit this page |
