why must a named colClasses in read.table be in correct order

Andreas Leha
Hi all,

Apparently, the colClasses argument to read.table needs to be in the
order of the columns *even when it is named*.  Why is that?  And where
would I find it in the documentation?

Here is a MWE:

--8<---------------cut here---------------start------------->8---
kkk <- c("a\tb",
         "3.14\tx")
read.table(textConnection(kkk),
           sep="\t",
           header = TRUE)

cclasses=c(b="character",
           a="numeric")

read.table(textConnection(kkk),
           sep="\t",
           header = TRUE,
           colClasses = cclasses)              ## <--- error

read.table(textConnection(kkk),
           sep="\t",
           header = TRUE,
           colClasses = cclasses[order(names(cclasses))])
--8<---------------cut here---------------end--------------->8---
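
Note that the order(names(cclasses)) workaround above only succeeds because the columns of this file happen to be in alphabetical order. A minimal sketch of a more general workaround, assuming every column is named in the vector and the header is the first line of the input: take the column names from the header and index the class vector by them. The names hdr and ordered_classes are illustrative, not from the original post.

```r
# Sketch: reorder a fully named colClasses vector to match the file's header.
kkk <- c("a\tb",
         "3.14\tx")
cclasses <- c(b = "character",
              a = "numeric")

# Column names as they appear in the header line (first element of kkk).
hdr <- strsplit(kkk[1], "\t", fixed = TRUE)[[1]]

# Index by name so that each class lands at its column's position.
ordered_classes <- cclasses[hdr]

d <- read.table(text = kkk,
                sep = "\t",
                header = TRUE,
                colClasses = ordered_classes)
str(d)
```

Because the vector ends up in file order, this does not depend on whether read.table() itself matches the names.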


Thanks,
Andreas

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: why must a named colClasses in read.table be in correct order

Andreas Leha
Hi Henrik,

Thanks for your reply.

I am not (yet) convinced, though.  The help page for read.table
mentions named colClasses and if I specify colClasses for not all
columns, the names are taken into account:

--8<---------------cut here---------------start------------->8---
kkk <- c("a\tb",
         "3.14\tx")
str(read.table(textConnection(kkk),
               sep="\t",
               header = TRUE))

str(read.table(textConnection(kkk),
               sep="\t",
               header = TRUE,
               colClasses=c(b="character")))
--8<---------------cut here---------------end--------------->8---
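
The partial case can also be made explicit by hand. The following is only a sketch of that idea, not the code inside read.table(): it expands a partially named vector into a full-length one in file order, using NA for unspecified columns (NA is the documented value that lets read.table() pick the type itself). The names partial, hdr and full are illustrative, and names that do not occur in the header are not handled.

```r
kkk <- c("a\tb",
         "3.14\tx")
partial <- c(b = "character")   # only column b is specified

# Column names taken from the header line.
hdr <- strsplit(kkk[1], "\t", fixed = TRUE)[[1]]

# Start with NA for every column, then fill in the classes given by name.
full <- setNames(rep(NA_character_, length(hdr)), hdr)
full[names(partial)] <- partial

d <- read.table(text = kkk,
                sep = "\t",
                header = TRUE,
                colClasses = full)
sapply(d, class)
```

Here full is c(a = NA, b = "character"), so column a is converted automatically and column b is read as character.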

What am I missing?

Best,
Andreas



On 09/07/2015 02:21, Henrik Bengtsson wrote:

> read.table() does not make use of names(colClasses) - only its values.
> Because of this, ordering is critical, as you noted. It shouldn't be
> too hard to add support for a named `colClasses` argument of
> utils::read.table(), but someone needs to convince the R core team
> that this is a good idea.
>
> As an alternative, see R.filesets::readDataFrame() for a
> read.table()-like function that matches names(colClasses) to column
> names, if they exist.
>
> /Henrik
> (author of R.filesets)

Re: why must a named colClasses in read.table be in correct order

Henrik Bengtsson-4
Thanks for insisting; I was wrong and I'm happy to see that there is
indeed code intended for named 'colClasses', which even goes back to
2004.  But as you report, the names only work when
length(colClasses) < cols (which also explains why I thought it was not
supported).  I'm not sure if that _strictly less than_ test is
intentional or a mistake, but I would propose the following patch:

[HB-X201]{hb}: svn diff src\library\utils\R\readtable.R
Index: src/library/utils/R/readtable.R
===================================================================
--- src/library/utils/R/readtable.R     (revision 68642)
+++ src/library/utils/R/readtable.R     (working copy)
@@ -139,7 +139,7 @@
     if (rlabp) col.names <- c("row.names", col.names)

     nmColClasses <- names(colClasses)
-    if(length(colClasses) < cols)
+    if(length(colClasses) <= cols)
         if(is.null(nmColClasses)) {
             colClasses <- rep_len(colClasses, cols)
         } else {


Your example works with this patch.  I've made it source():able so you
can try it out (if you cannot source() https://, then download the
file and source it locally):

source("https://gist.githubusercontent.com/HenrikBengtsson/ed1eeb41a1b4d6c43b47/raw/ebe58f76e518dd014423bea466a5c93d2efd3c99/readtable-fix.R")

kkk <- c("a\tb",
         "3.14\tx")

colClasses <- c(a="numeric", b="character")
data <- read.table(textConnection(kkk),
                   sep="\t",
                   header = TRUE,
                   colClasses = colClasses)
str(data)
### 'data.frame':   1 obs. of  2 variables:
### $ a: num 3.14
### $ b: chr "x"

## Does not work with unpatched utils::read.table(), but does with the patch
data <- read.table(textConnection(kkk),
                   sep="\t",
                   header = TRUE,
                   colClasses = rev(colClasses))
str(data)
### 'data.frame':   1 obs. of  2 variables:
### $ a: num 3.14
### $ b: chr "x"

Let's hope that the above is a (10-year-old) typo, and that changing a <
to a <= adds support for named 'colClasses', which is really useful
functionality.
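
To spell out the boundary case, here is a small sketch that evaluates both versions of the test for the two calls discussed in this thread. It assumes cols = 2 as in the example file; the names lens and res are illustrative.

```r
cols <- 2L            # number of columns in the example file
lens <- c(1L, 2L)     # length(colClasses): partial vs. fully named vector

res <- data.frame(
  case              = c("partial", "full"),
  length_colClasses = lens,
  strict            = lens <  cols,   # test before the proposed patch
  patched           = lens <= cols    # test after the proposed patch
)
print(res)
```

Only the fully named vector (length 2) changes: it fails the strict test but passes the patched one, which is the case where the names were being ignored.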

/Henrik

Re: why must a named colClasses in read.table be in correct order

Andreas Leha
Hi Henrik,

Thank you very much for looking into this.  And thanks for the patch!

Yes, let's hope this is a typo that gets fixed.

Regards,
Andreas
