Can file size affect how na.strings operates in a read.table call?

Can file size affect how na.strings operates in a read.table call?

R help mailing list-2
Hi,

I have this generic function to read ASCII data files. It is essentially a wrapper around the read.table function. My function is used in a large variety of situations and has no a priori knowledge about the data file it is asked to read. Nothing is known about file size, variable types, variable names, or data table dimensions.

One argument of my function is na.strings which is passed down to read.table.

Recently, a user tried to read a data file of ~80 MB (~93000 rows by ~160 columns) using na.strings = c('-99', '.') with the intention of interpreting the '.' and '-99' strings as the internal missing value NA. Dots were converted to NA appropriately. However, not all -99 values in the data were interpreted as NA. In some variables, -99 was converted to NA, while in others -99 was read as a number. More surprisingly, when the data file was cut into smaller chunks (i.e., by dropping either rows or columns) saved in multiple files, the function calls applied to the new data files resulted in the correct conversion of the -99 values into NAs.

In all cases, the data frames produced by read.table contained the expected number of records.

While, at face value, it appears that file size affects how the na.strings argument operates, I am wondering if there is something else at play here.

Unfortunately, I cannot share the data file for confidentiality reasons, but I was wondering if you could suggest some checks I could perform to get to the bottom of this issue.

Thank you in advance for your help, and sorry for the lack of a reproducible example.


______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Can file size affect how na.strings operates in a read.table call?

Jeff Newmiller
Check for extraneous spaces. You may need more variations of the na.strings.
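
One way such a check could be run is sketched below. It is only an illustration on made-up data (the string `s` and the sentinel value -99 are assumptions standing in for the real file): every field is read as character so that the raw text survives, and the distinct spellings that evaluate to -99 are then listed. Any spelling other than the exact strings given in na.strings would not be converted to NA.

```r
# Made-up sample standing in for the real file: the same sentinel value
# appears in several textual forms, next to a text column with spaces.
s <- "A,B,C\n0,0,x y\n1,-99,-99\n2,-99 ,-99.000\n3, -99,.\n"

# Read every field as character, with no NA conversion and no whitespace
# stripping, so that the raw text of each field is preserved.
raw <- read.csv(text = s, header = TRUE, colClasses = "character",
                na.strings = character(0), strip.white = FALSE)

# Collect the distinct raw strings whose numeric value is -99.
vals <- unlist(raw, use.names = FALSE)
nums <- suppressWarnings(as.numeric(vals))  # non-numeric text becomes NA
variants <- unique(vals[!is.na(nums) & nums == -99])

# Show the spellings with quotes so that stray spaces are visible.
print(variants)
```

On a real file, the `text = s` argument would be replaced by the file path. Printing the result with quotes makes leading spaces, trailing spaces, and other variations easy to spot.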

On November 14, 2019 7:40:42 AM PST, Sebastien Bihorel via R-help <[hidden email]> wrote:

>[...]

Re: Can file size affect how na.strings operates in a read.table call?

R help mailing list-2
The data file is a csv file. Some text variables contain spaces.

"Check for extraneous spaces"
Are there specific locations that would be more critical than others?


________________________________
From: Jeff Newmiller <[hidden email]>
Sent: Thursday, November 14, 2019 10:52
To: Sebastien Bihorel <[hidden email]>; Sebastien Bihorel via R-help <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: [R] Can file size affect how na.strings operates in a read.table call?

Check for extraneous spaces. You may need more variations of the na.strings.


Re: Can file size affect how na.strings operates in a read.table call?

Jeff Newmiller
Consider the following sample:

#####
s <- "A,B,C
0,0,0
1,-99,-99
2,-99 ,-99
3, -99, -99
"

dta_notok <- read.csv( text = s
                      , header=TRUE
                      , na.strings = c( "-99", "" )
                      )

dta_ok <- read.csv( text = s
                   , header=TRUE
                   , na.strings = c( "-99", " -99"
                                   , "-99 ", ""
                                   )
                   )

library(data.table)

fdt_ok <- fread( text = s, na.strings=c( "-99", "" ) )
fdta_ok <- as.data.frame( fdt_ok )
#####

Leading and trailing spaces cause problems. The data.table::fread function
has a strip.white argument that defaults to TRUE, but the resulting object
is a data.table which has different semantics than a data.frame.

On Thu, 14 Nov 2019, Sebastien Bihorel wrote:

> The data file is a csv file. Some text variables contain spaces.
>
> "Check for extraneous spaces"
> Are there specific locations that would be more critical than others?

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------

Re: Can file size affect how na.strings operates in a read.table call?

R help mailing list-2
read.table (and friends) also have the strip.white argument:

> s <- "A,B,C\n0,0,0\n1,-99,-99\n2,-99 ,-99\n3, -99, -99\n"
> read.csv(text=s, header=TRUE, na.strings="-99", strip.white=TRUE)
  A  B  C
1 0  0  0
2 1 NA NA
3 2 NA NA
4 3 NA NA
> read.csv(text=s, header=TRUE, na.strings="-99", strip.white=FALSE)
  A   B   C
1 0   0   0
2 1  NA  NA
3 2 -99  NA
4 3 -99 -99

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Thu, Nov 14, 2019 at 8:35 AM Jeff Newmiller <[hidden email]>
wrote:

> [...]

Re: Can file size affect how na.strings operates in a read.table call?

R help mailing list-2
Thanks Bill and Jeff

strip.white did not change the outcomes.

However, your input led me to compare the raw content of the files (i.e., outside of an IDE), and I found a difference in how the apparent -99 values were stored. In the big file, some -99 values are stored as floats rather than integers and thus include a decimal point and trailing zeros.

The creation of the smaller files resulted in the removal of the decimal point and trailing zeros, which explains why read.table provided the "right" response on these smaller files.

So, it looks like this is the problem and that some additional post-processing may be warranted.

Thanks for the hints.
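
A minimal sketch of the mismatch described above, and of one possible post-processing step, follows. The data are made up rather than taken from the confidential file, and `na_value` is a hypothetical name for the sentinel. The point it illustrates is that na.strings is compared against the text of each field, so "-99.000" is not caught by "-99" even though both parse to the same number.

```r
# Made-up data: column C stores the sentinel both as "-99" and as a float
# with a decimal point and trailing zeros.
s <- "A,B,C\n0,0.5,0\n1,-99,-99.000\n2,.,-99\n3,1.5,-99.0\n"

dta <- read.csv(text = s, header = TRUE, na.strings = c("-99", "."))

# Only the exact strings "-99" and "." became NA; the float spellings
# were read as the number -99.
n_leftover <- sum(dta$C == -99, na.rm = TRUE)

# Possible post-processing: compare numerically after reading, so that the
# spelling in the file no longer matters.
na_value <- -99  # hypothetical sentinel, taken from the na.strings above
is_num <- vapply(dta, is.numeric, logical(1))
dta[is_num] <- lapply(dta[is_num], function(x) {
  replace(x, !is.na(x) & x == na_value, NA)
})

print(n_leftover)
print(dta)
```

This only handles numeric sentinels in columns that were read as numeric. A wrapper that accepts arbitrary na.strings would need to decide which of them should also be compared as numbers.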

________________________________
From: William Dunlap <[hidden email]>
Sent: Thursday, November 14, 2019 11:51
To: Jeff Newmiller <[hidden email]>
Cc: Sebastien Bihorel <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: [R] Can file size affect how na.strings operates in a read.table call?

[...]