Checking for invalid dates: Code works but needs improvement


Paul Miller
Hello Everyone,

Still new to R. Wrote some code that finds and prints invalid dates (see below). This code works but I suspect it's not very good. If someone could show me a better way, I'd greatly appreciate it.

Here is some information about what I'm trying to accomplish. My sense is that the R date functions are best at identifying invalid dates when fed character data in their default format. So my code converts the input dates to character, breaks them apart using strsplit, and then reformats them. It then identifies which dates are "missing", in the sense that the month or year is unknown, and prints out any remaining invalid date values.

As I see it, the code has at least 4 shortcomings.

1. It's too long. My understanding is that skilled programmers can usually or often complete tasks like this in a few lines.

2. It's not vectorized. I started out trying to do something that was vectorized but ran into problems with the strsplit function. I looked at the help file and it appears this function will only accept a single character vector.

3. It prints out the incorrect dates but doesn't indicate which date variable they belong to. I tried various things with paste but never came up with anything that worked. Ideally, I'd like to get something that looks roughly like:

Error: Invalid date values in birthDT

"21931-11-23"
"1933-06-31"

Error: Invalid date values in diagnosisDT

"2010-02-30"

4. There's no way to specify names for input and output data. I imagine it would be fairly easy to specify these in the arguments to a function, but I'm not sure how to incorporate that into a for loop.

Thanks,

Paul  

##########################################
#### Code for detecting invalid dates ####
##########################################

#### Test Data ####

connection <- textConnection("
1 11/23/21931 05/23/2009 un/17/2011
2 06/20/1940  02/30/2010 03/17/2011
3 06/17/1935  12/20/2008 07/un/2011
4 05/31/1937  01/18/2007 04/30/2011
5 06/31/1933  05/16/2009 11/20/un
")

TestDates <- data.frame(scan(connection,
                 list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))

close(connection)

TestDates

class(TestDates$birthDT)
class(TestDates$diagnosisDT)
class(TestDates$metastaticDT)

#### List of Date Variables ####

DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")

#### Read Dates ####

for (i in seq_along(DateNames)){
    # Split each "mm/dd/yyyy" string into its month, day and year parts
    TestDates[DateNames][[i]] <- as.character(TestDates[DateNames][[i]])
    TestDates$ParsedDT <- strsplit(TestDates[DateNames][[i]], "/")
    TestDates$Month <- sapply(TestDates$ParsedDT, function(x) x[1])
    TestDates$Day <- sapply(TestDates$ParsedDT, function(x) x[2])
    TestDates$Year <- sapply(TestDates$ParsedDT, function(x) x[3])
    # Unknown day: assume the 15th; unknown month or year: treat as missing
    TestDates$Day[TestDates$Day == "un"] <- "15"
    TestDates[DateNames][[i]] <- with(TestDates, paste(Year, Month, Day, sep = "-"))
    is.na( TestDates[DateNames][[i]] [TestDates$Month == "un"] ) <- TRUE
    is.na( TestDates[DateNames][[i]] [TestDates$Year == "un"] ) <- TRUE
    # A value is invalid if it is not missing but still fails to convert
    TestDates$Date <- as.Date(TestDates[DateNames][[i]], format = "%Y-%m-%d")
    TestDates$Invalid <- ifelse(is.na(TestDates$Date) & !is.na(TestDates[DateNames][[i]]), 1, 0)
    # Keep the converted column only if every value was valid; otherwise print the bad ones
    if( sum(TestDates$Invalid) == 0 )
        { TestDates[DateNames][[i]] <- TestDates$Date } else
        { print( TestDates[DateNames][[i]][TestDates$Invalid == 1] ) }
    TestDates <- subset(TestDates, select = -c(ParsedDT, Month, Day, Year, Date, Invalid))
}

TestDates

class(TestDates$birthDT)
class(TestDates$diagnosisDT)
class(TestDates$metastaticDT)

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Checking for invalid dates: Code works but needs improvement

Rui Barradas
Hello,

Point 3 is very simple: instead of 'print', use 'cat'.
Unlike 'print', it accepts several arguments and allows (very) simple formatting.

  { cat("Error: Invalid date values in", DateNames[[i]], "\n",
               TestDates[DateNames][[i]][TestDates$Invalid==1], "\n") }
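
To get each value on its own line, as in the layout asked for in point 3, the pieces can be collected in a character vector first and printed with sep = "\n". This is only a sketch: 'values' and 'invalid' are made-up stand-ins for one pass of the loop, not the real TestDates columns.

```r
# Hypothetical stand-ins for one pass of the loop (not the real TestDates)
DateNames <- c("birthDT", "diagnosisDT")
i <- 1
values  <- c("21931-11-23", "1940-06-20", "1933-06-31")
invalid <- c(1, 0, 1)

# Build the message line by line: header, blank line, then the quoted values
msg <- c(paste("Error: Invalid date values in", DateNames[i]),
         "",
         paste0('"', values[invalid == 1], '"'))

# cat() writes each element of msg followed by the separator
cat(msg, sep = "\n")
```

Building the vector before calling cat also makes it easy to send the same lines to a log file with writeLines(msg, con).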

Rui Barradas


Re: Checking for invalid dates: Code works but needs improvement

Gabor Grothendieck
In reply to this post by Paul Miller
On Tue, Jan 24, 2012 at 11:54 AM, Paul Miller <[hidden email]> wrote:

> [...]

If s is a character vector of dates in mm/dd/yyyy form, the code below
produces two things: bad, a logical vector that is TRUE for the bad
inputs and FALSE for the good ones (adjust the bounds if a different
date range is valid), so that s[bad] gives the bad inputs; and d, a
"Date" vector with NA in place of the bad ones:

        x <- gsub("un", 15, s)
        d <- as.Date(x, "%m/%d/%Y")
        bad <- is.na(d) | d < as.Date("1900-01-01") | d > Sys.Date()
        d[bad] <- NA
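
As a sketch of how those four lines might be reused for each date column, they can be wrapped in a small function. The name check_dates and the arguments lower and upper are assumptions made for this illustration, not part of the original suggestion. Note that "un" is replaced wherever it appears, so an unknown month or year ends up flagged as bad rather than treated as missing.

```r
# Hypothetical helper wrapping the four lines above; 'lower' and 'upper'
# are assumed bounds on plausible dates and should be adjusted as needed
check_dates <- function(s, lower = as.Date("1900-01-01"), upper = Sys.Date()) {
    x <- gsub("un", "15", s, fixed = TRUE)    # unknown part: substitute 15
    d <- as.Date(x, "%m/%d/%Y")               # NA when the string does not parse
    bad <- is.na(d) | d < lower | d > upper   # unparsable or out of range
    d[bad] <- NA
    list(date = d, bad = bad)
}

# The birthDT column from the test data in the original post
birthDT <- c("11/23/21931", "06/20/1940", "06/17/1935", "05/31/1937", "06/31/1933")
res <- check_dates(birthDT)

birthDT[res$bad]    # the inputs that failed: "11/23/21931" "06/31/1933"
res$date            # Date vector with NA for the bad ones
```

Returning both pieces in a list keeps the original strings available for an error message while the Date vector goes back into the data frame.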

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


Re: Checking for invalid dates: Code works but needs improvement

Rui Barradas
In reply to this post by Paul Miller
Hello, again.

I now have a more complete answer to your points.

> 1. It's too long. My understanding is that skilled programmers can usually
> or often complete tasks like this in a few lines.

It's not much shorter, but it's more readable. (The programmer is always a suspect.)

> 2. It's not vectorized. I started out trying to do something that was vectorized
> but ran into problems with the strsplit function. I looked at the help file and
> it appears this function will only accept a single character vector.

All but one of the instructions are vectorized, and the one that is not only
loops over a few column names.
'strsplit' does accept a whole character vector; it returns a list, so use
'unlist' on its output to get a plain vector.
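
A short sketch of that point, using three made-up input strings: strsplit splits every element in one call, and the pieces can then be pulled out either by flattening the list or by binding it into a matrix. The flattening step assumes every string has exactly three parts.

```r
s <- c("11/23/21931", "un/17/2011", "07/un/2011")

# One call splits every element; the result is a list with one entry per date
parts <- strsplit(s, "/", fixed = TRUE)

# Flatten, then take every third element (assumes exactly 3 parts per date)
flat <- unlist(parts)
M <- flat[seq(1, length(flat), 3)]
D <- flat[seq(2, length(flat), 3)]
Y <- flat[seq(3, length(flat), 3)]

# Alternative that keeps one row per date
m <- do.call(rbind, parts)
colnames(m) <- c("Month", "Day", "Year")
m
```

The matrix form is a little safer to inspect, because a string with a missing or extra "/" shows up as a recycling warning or a misaligned row instead of silently shifting every later value.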

> 4. There's no way to specify names for input and output data. I imagine this would
> be fairly easy to specify this in the arguments to a function but am not sure how to
> incorporate it into a for loop.

You can now pass in any matrix or data.frame, and the intent is that only the columns
with dates get processed. (Strictly speaking this is not true: it will process anything
with a '/' in it, so pay attention.)
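
One possible guard against that, offered only as a sketch: pick out the date-like columns by pattern before converting anything. The helper looks_like_date, its regular expression, and the small data frame are all assumptions for illustration; the pattern would need adjusting for other date layouts.

```r
# Hypothetical data: an ID column, a date column, and a column with a stray "/"
df <- data.frame(Patient = 1:3,
                 birthDT = c("06/20/1940", "06/un/1935", "05/31/1937"),
                 note    = c("a/b", "none", "c"),
                 stringsAsFactors = FALSE)

# Assumed pattern: month and day are 1-2 digits or "un", year is digits or "un"
looks_like_date <- function(x) {
    x <- as.character(x)
    all(grepl("^([0-9]{1,2}|un)/([0-9]{1,2}|un)/([0-9]+|un)$", x))
}

# Names of the columns in which every value matches the pattern
date_cols <- names(df)[vapply(df, looks_like_date, logical(1))]
date_cols    # "birthDT"
```

Passing only df[, date_cols, drop = FALSE] to the conversion function then leaves columns such as note untouched.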

Near the beginning of your code include the following:


> TestDates <- data.frame(scan(connection,
>                 list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))
>
> close(connection)

TDSaved <- TestDates    # to avoid reopening the connection

And then, after all of it,

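fun <- function(Dat){
    # Parse one column of "mm/dd/yyyy" strings; returns a Date vector (NA = invalid)
    f <- function(jj, DF){
        x <- as.character(DF[, jj])
        # flat vector M, D, Y, M, D, Y, ...; assumes exactly 3 parts per date
        x <- unlist(strsplit(x, "/"))
        n <- length(x)
        M <- x[seq(1, n, 3)]
        D <- x[seq(2, n, 3)]
        Y <- x[seq(3, n, 3)]
        D[D == "un"] <- "15"    # unknown day: assume mid-month
        # compare years as numbers, not as strings; "un" becomes NA
        Ynum <- suppressWarnings(as.numeric(Y))
        Y <- ifelse(nchar(Y) > 4 | is.na(Ynum) | Ynum < 1900, NA, Y)
        x <- as.Date(paste(Y, M, D, sep="-"), format="%Y-%m-%d")
        if(any(is.na(x)))
            cat("Warning: Invalid date values in", jj, "\n",
                as.character(DF[is.na(x), jj]), "\n")
        x
    }
    Dat <- as.data.frame(Dat)
    # lapply (unlike sapply) keeps the "Date" class of each column
    Dat[] <- lapply(colnames(Dat), f, DF = Dat)
    Dat
}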

TD <- TDSaved

TD[, DateNames] <- fun(TD[, DateNames])

TD

Had fun writing it.
Good luck.

Rui Barradas