Locating data source error in large file


Rich Shepard
   The structure of the dataframe is

str(wy2016)
'data.frame': 8784 obs. of  4 variables:
  $ date  : chr  "2015-10-01" "2015-10-01" "2015-10-01" "2015-10-01" ...
  $ time  : chr  "00:00" "01:00" "02:00" "03:00" ...
  $ elev  : num  90.7 90.7 90.7 90.7 90.7 ...
  $ myDate: Date, format: "2015-10-01" "2015-10-01" ...

   The command and its result on this dataframe are:
wy2016$myTime <- as.POSIXct(paste(wy2016$date, wy2016$time))
Error in as.POSIXlt.character(x, tz, ...) :
   character string is not in a standard unambiguous format

   Data for other water years do not throw this error. How can I identify
which row(s) among the 8784 have a date or time formatting error?

Rich

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Locating data source error in large file

R help mailing list-2
The problem occurs because no commonly used format works on
all your date strings.  If you give as.POSIXlt the format you want to
use then items that don't match the format will be treated as NA's.
Use is.na() to find them.

> d <- c("2017-12-25", "2018-01-01", "10/31/2018")
> as.POSIXlt(d)
Error in as.POSIXlt.character(d) :
  character string is not in a standard unambiguous format
> as.POSIXlt(d, format="%Y-%m-%d")
[1] "2017-12-25 PST" "2018-01-01 PST" NA
> as.POSIXlt(d, format="%m/%d/%Y")
[1] NA               NA               "2018-10-31 PDT"



Bill Dunlap
TIBCO Software
wdunlap tibco.com


Re: Locating data source error in large file

Rich Shepard
On Fri, 20 Jul 2018, William Dunlap wrote:

> The problem occurs because no commonly used format works on all your date
> strings. If you give as.POSIXlt the format you want to use then items that
> don't match the format will be treated as NA's. Use is.na() to find them.

Bill,

   No NAs found using both is.na() and scrolling through the source file.
That's why I asked for help: I saw nothing different in the dates or times.

Regards,

Rich


Re: Locating data source error in large file

Eric Berger
Hi Rich,
This may not be the most efficient but it will identify the offenders.

> foo <- paste(wy2016$date, wy2016$time)
> uu <- sapply(seq_along(foo),
+              function(i) { a <- try(as.POSIXct(foo[i]), silent=TRUE)
+              inherits(a, "POSIXct") })
> which(!uu)   # indices of strings that as.POSIXct() could not parse

HTH,
Eric




Re: Locating data source error in large file

R help mailing list-2
In reply to this post by Rich Shepard
Which format did you use when you used is.na on the output of
   as.POSIXlt(strings, format=someFormat)
and found none?  Did the resulting dates look OK?  Perhaps
all is well.

Note that the common American format month/day/year is not
one that is tested when you don't supply a format - xx/yy/zzzz
is treated as year/month/day (and it changes the time zone,
presumably because US/Pacific time was not used in the year
10 CE).
  > as.POSIXlt("10/7/1962")
  [1] "0010-07-19 LMT"
  > as.POSIXlt("3/17/1962")
  Error in as.POSIXlt.character("3/17/1962") :
    character string is not in a standard unambiguous format


Bill Dunlap
TIBCO Software
wdunlap tibco.com


Re: Locating data source error in large file

David Winsemius
In reply to this post by Rich Shepard

> On Jul 20, 2018, at 11:58 AM, Rich Shepard <[hidden email]> wrote:
>
> On Fri, 20 Jul 2018, William Dunlap wrote:
>
>> The problem occurs because no commonly used format works on all your date
>> strings. If you give as.POSIXlt the format you want to use then items that
>> don't match the format will be treated as NA's. Use is.na() to find them.
>
> Bill,
>
>  No NAs found using both is.na() and scrolling through the source file.
> That's why I asked for help: I saw nothing different in the dates or times.
>
> Regards,
>
> Rich

I don't think you read Bill's message properly. He was not saying that there were NA's; he was telling you to use a format specification in your as.POSIXct call, and that the result of that call would then have NA's wherever a string does not match the format.

wy2016$dt_time <- with( wy2016, as.POSIXct( paste( date, time ) , format= "%Y-%m-%d %H:%M") )
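
A minimal sketch of the follow-up step, locating the rows that came back NA, might look like this. The small data frame wy is a hypothetical stand-in for wy2016 with two deliberately malformed rows, and tz = "UTC" is an assumption added here so that no valid local time can fall into a daylight-saving gap and be mistaken for a typo.

```r
# Hypothetical stand-in for wy2016: rows 3 and 4 contain typos.
wy <- data.frame(
  date = c("2015-10-01", "2015-10-01", "2016/09/29", "2016-09-30"),
  time = c("00:00", "01:00", "12:00", "ab:cd"),
  elev = c(90.6689, 90.6506, 90.6719, 90.6506),
  stringsAsFactors = FALSE
)

# With an explicit format, strings that do not match become NA
# instead of raising the "not in a standard unambiguous format" error.
wy$dt_time <- as.POSIXct(paste(wy$date, wy$time),
                         format = "%Y-%m-%d %H:%M", tz = "UTC")

# Row numbers of the failures, and the raw strings behind them.
bad <- which(is.na(wy$dt_time))
print(bad)
print(wy[bad, c("date", "time")])
```

Printing the original date and time columns for the bad rows is usually enough to spot the typo by eye.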

David.


David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.'   -Gehm's Corollary to Clarke's Third Law


Re: Locating data source error in large file

Rich Shepard
In reply to this post by R help mailing list-2
On Fri, 20 Jul 2018, William Dunlap wrote:

> Which format did you use when you used is.na on the output of
>   as.POSIXlt(strings, format=someFormat)
> and found none?  Did the resulting dates look OK?  Perhaps
> all is well.

Bill,

   All dates here are kept as yyyy-mm-dd.

   And each dataframe row has this format:
2015-10-01,00:00,90.6689
2015-10-01,01:00,90.6506
2015-10-01,02:00,90.6719
2015-10-01,03:00,90.6506

Thanks,

Rich


Re: Locating data source error in large file

Rich Shepard
In reply to this post by David Winsemius
On Fri, 20 Jul 2018, David Winsemius wrote:

> I don't think you read Bill's message properly.

David,

   Obviously not.

> He was not saying that there were NA's; he was telling you to use a format
> specification in your as.POSIXct call and the result of that call
> would have NA's.
>
> wy2016$dt_time <- with( wy2016, as.POSIXct( paste( date, time ) , format= "%Y-%m-%d %H:%M") )

   Thank you. This found 24 TRUEs; now to find them in the file.

Thanks,

Rich


Re: Locating data source error in large file

Rich Shepard
In reply to this post by Eric Berger
On Fri, 20 Jul 2018, Eric Berger wrote:

> This may not be the most efficient but it will identify the offenders.
>
>>  foo <-  paste(wy2016$date, wy2016$time)
>> uu <- sapply(1:length(foo),
>             function(i) { a <- try(as.POSIXct(foo[i]),silent=TRUE)
>             "POSIXct" %in% class(a) })
>> which(!uu)

Eric,

   Thank you. Now that I know there are NAs there, this should help me find them.

Regards,

Rich


Re: Locating data source error in large file

Rich Shepard
In reply to this post by David Winsemius
On Fri, 20 Jul 2018, David Winsemius wrote:

> wy2016$dt_time <- with( wy2016, as.POSIXct( paste( date, time ) , format=
> "%Y-%m-%d %H:%M") )

David/Bill/Eric:

   Thank you all. I found the typos which covered a single day toward the end
of the dataframe.

Carpe weekend,

Rich


Re: Locating data source error in large file

R help mailing list-2
In reply to this post by Rich Shepard
> And each dataframe row has this format:
> 2015-10-01,00:00,90.6689
> 2015-10-01,01:00,90.6506
> 2015-10-01,02:00,90.6719
> 2015-10-01,03:00,90.6506

You mean each line in the file, not row in data.frame, has the form
"year-month-day,hour:min,numericValue".  Try the following, where tfile
names your file:

> df <- read.table(tfile, header=FALSE, sep=",",
+                  col.names=c("dateString", "timeString", "Value"))
> df
  dateString timeString   Value
1 2015-10-01      00:00 90.6689
2 2015-10-01      01:00 90.6506
3 2015-10-01      02:00 90.6719
4 2015-10-01      03:00 90.6506
> transform(df, DateTime = as.POSIXlt(paste(dateString, timeString),
+           format="%Y-%m-%d %H:%M"), dateString=NULL, timeString=NULL)
    Value            DateTime
1 90.6689 2015-10-01 00:00:00
2 90.6506 2015-10-01 01:00:00
3 90.6719 2015-10-01 02:00:00
4 90.6506 2015-10-01 03:00:00
> str(.Last.value)
'data.frame':   4 obs. of  2 variables:
 $ Value   : num  90.7 90.7 90.7 90.7
 $ DateTime: POSIXct, format: "2015-10-01 00:00:00" "2015-10-01 01:00:00" "2015-10-01 02:00:00" "2015-10-01 03:00:00"



Bill Dunlap
TIBCO Software
wdunlap tibco.com


Re: Locating data source error in large file

Rich Shepard
On Fri, 20 Jul 2018, William Dunlap wrote:

> You mean each line in the file, not row in data.frame, has the form
> "year-month-day,hour:min,numericValue". Try the following, where tfile
> names your file:

Bill,

   Yes, I was looking at the data file in one emacs buffer and my R session
in another one.

   The source file was what I posted, the data.frame is different:
  head(wy2016)
         date  time    elev
1 2015-10-01 00:00 90.6689
2 2015-10-01 01:00 90.6506
3 2015-10-01 02:00 90.6719
4 2015-10-01 03:00 90.6506
5 2015-10-01 04:00 90.6597
6 2015-10-01 05:00 90.6841

   It's all fixed now and I've learned how to handle future typos from all of
you.

Thanks,

Rich


Re: Locating data source error in large file

R help mailing list-2
In reply to this post by Rich Shepard
To find the lines in the file, tfile, with bogus dates, try
    readLines(tfile)[ is.na(dataFrame$DateTime) ]
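
The one-liner above can be sketched end to end as follows. The temporary file and its four lines are made up for illustration, and tz = "UTC" is an assumption added here. The approach relies on line i of the file corresponding to row i of the data frame, which should hold when the file has no header and no blank or comment lines for read.table() to skip.

```r
# Write a small hypothetical data file; line 3 has a mistyped date.
tfile <- tempfile(fileext = ".txt")
writeLines(c("2015-10-01,00:00,90.6689",
             "2015-10-01,01:00,90.6506",
             "10/01/2015,02:00,90.6719",
             "2015-10-01,03:00,90.6506"), tfile)

# Read it the same way as in the earlier message.
df <- read.table(tfile, header = FALSE, sep = ",",
                 col.names = c("dateString", "timeString", "Value"),
                 stringsAsFactors = FALSE)
df$DateTime <- as.POSIXct(paste(df$dateString, df$timeString),
                          format = "%Y-%m-%d %H:%M", tz = "UTC")

# Pair each failing row with its line number and raw text in the file.
bad <- is.na(df$DateTime)
print(data.frame(line = which(bad), text = readLines(tfile)[bad]))
```

Having the line number next to the raw text makes it straightforward to jump to the offending line in an editor.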

Bill Dunlap
TIBCO Software
wdunlap tibco.com


Re: Locating data source error in large file

Rich Shepard
On Fri, 20 Jul 2018, William Dunlap wrote:

> To find the lines in the file, tfile, with bogus dates, try
>    readLines(tfile)[ is.na(dataFrame$DateTime) ]

Bill,

   Thanks for another lesson.

Regards,

Rich


Re: Locating data source error in large file

Rich Shepard
In reply to this post by Rich Shepard
On Fri, 20 Jul 2018, Rich Shepard wrote:

>  Thank you all. I found the typos which covered a single day toward the end
> of the dataframe.

   FWIW, all these data came from PDF reports and had to be manually
highlighted and pasted into a text file. Given 29 years of hourly (and
sometimes half-hourly) reports, I'm not surprised that I made errors now and
then.

Regards,

Rich
