Consistently printing the name of an object passed to a function; & a data-auditing question

andrewH
Dear folks--
I always seem to find that I spend more than half my time making sure my input data is in the right form, properly aligned, with no bizarre features. You know the drill: five kinds of missing values, three of them documented. An alpha mistype in one numeric field turns 30,000 numbers into factor levels. SPSS conversion turns 250 factors nicely into R factors, except 3 have levels instead of labels. A few columns in some years of a survey have undocumented differences in units. Halfway through a 20-year annual survey, they add two more allowable answers to a question. And so on.

I'm looking for things to make my data auditing go faster. One of them is a dopey little function, testX(), bundling together a variety of R tools to tell me what is in an object. Here it is:

testX <- function(objectX, bar = TRUE) {    # A useful diagnostic function
    # Capture the expression the caller supplied, as text.
    object.name <- deparse(substitute(objectX))
    if (bar) cat("########################\n")  # visual separation between consecutive objects
    cat("testX(", object.name, "): ")
    cat("Class=", class(objectX))
    cat("  Mode=", mode(objectX), "\n")
    cat("Summary:\n"); print(summary(objectX))
    cat("Structure:\n"); str(objectX)
    if (is.factor(objectX)) {
        cat("Levels: ", levels(objectX), "\n")
        cat("Length: ", length(objectX), "\n")
    }
    invisible(object.name)
}

This works well when I give it the name of a single object. My problem is when I try to produce descriptions of a bunch of variables in a row, such as the variables in a list of variables, or all the variables that I have clomped together in a data frame.  The output is all side effects. Some ways of passing multiple variables get the name wrong, but the rest right. For example, if I have a list of variables, and do:

> lapply(varList, testX)

I get an output like this:

##################################
testX( X[[1L]] ): Class= factor  Mode= numeric
Summary:
1994 1997 1999 2002 2003 2007 2009
1009 1165  985 2502 2528 2007 3013
Structure:
 Factor w/ 7 levels "1994","1997",..: 1 1 1 1 1 1 1 1 1 1 ...
Levels:  1994 1997 1999 2002 2003 2007 2009
Length:  13209
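
To make the naming problem reproducible without my data, here is a minimal sketch. showName() is a hypothetical stripped-down stand-in for testX() that only captures the name, and the contents of varList below are made up. Called directly, it reports the expression I typed; called through lapply(), it reports lapply()'s own internal indexing expression, even though the element names survive as the names of the returned list.

```r
# Hypothetical stand-in for testX(): it only captures the argument expression.
showName <- function(objectX) deparse(substitute(objectX))

# Made-up data; this varList is an assumption, not my real list.
varList <- list(year = factor(c(1994, 1997, 1999)), AGE = c(23L, 41L, 67L))

direct  <- showName(varList$year)      # called directly
applied <- lapply(varList, showName)   # called through lapply()

print(direct)           # the expression as typed by the caller
print(unlist(applied))  # lapply()'s internal expression, not "year" or "AGE"
print(names(applied))   # the element names are kept only on the result
```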

If instead, I do it with a loop through the variable names in a data.frame, I get the name wrong _and_ it does not evaluate all the way to an object:

> names(var.df)
 [1] "year"      "YEAR"      "AGE"       "COHORT.5"  "COHORT.10" "ETHNIC"    "EDUC"      "INCOME"    "INTERNET"  "PARTY"     "IDEOL"

> for (sel in 1:length(names(var.df))) testX(names(var.df)[sel])

Gives an output like this:

##################################
testX( names(var.df)[sel] ): Class= character  Mode= character
Summary:
   Length     Class      Mode
        1 character character
Structure:
 chr "year"

Or I can select the column instead of the name of the column. This gives me the right answer on the object description, but not the name, thus:

> for (sel in 1:length(names(var.df))) testX(var.df[[sel]])

##################################
testX( var.df[[sel]] ): Class= integer  Mode= numeric
Summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1994    2002    2003    2003    2007    2009
Structure:
 int [1:13209] 1994 1994 1994 1994 1994 1994 1994 1994 1994 1994 ...

I've tried doing various things to names(var.df)[sel] to get it closer to the object (as.symbol(), eval(substitute()), and several others), but I just get variations on the output above.

So there are actually two questions here:
1.  How can I write this function so that it works when I just give it an object, but I can also use it with an apply-family function and a list (or vector, or whatever) of objects, and still have it both treat the object as an object and print its name correctly?

2.  How can I write the function, or write a loop, or use an apply-family function, to use this function to go through the columns of a data.frame, correctly naming and correctly describing each?

Another way of asking this same question is this: I want to be able to give testX the name of an object, or a reference to a named object, via an apply-family function, indexing, or whatever. (A) How can I get the name I print, object.name, to be the name of the object in both cases? And (B), how can I make sure that objectX is the actual object that the name refers to, and not the name or the reference, in both cases?
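
For what it is worth, the closest I can imagine is passing the name explicitly alongside the object, which feels clumsy. Below is a rough sketch of that idea, not a finished solution: describeX() is a hypothetical cut-down variant of testX() with an extra object.name argument that defaults to the deparsed expression, and var.df here is made-up data. I would still prefer something that does not make the caller supply the name.

```r
# Hypothetical variant of testX(): the caller may supply the label explicitly;
# otherwise it falls back to the deparsed argument expression.
describeX <- function(objectX, object.name = deparse(substitute(objectX))) {
    cat("describeX(", object.name, "): Class=", class(objectX),
        "  Mode=", mode(objectX), "\n")
    invisible(object.name)
}

# Made-up stand-in for var.df (an assumption, not my real data).
var.df <- data.frame(year  = c(1994L, 1997L, 1999L),
                     PARTY = factor(c("D", "R", "I")))

# Direct call: the default label is the expression typed by the caller.
single <- describeX(var.df$year)

# Column by column: loop over the names and pass both the column and its name.
labels <- character(0)
for (nm in names(var.df)) {
    labels <- c(labels, describeX(var.df[[nm]], object.name = nm))
}

# The same idea with an apply-family function: Map() walks the columns and
# their names in parallel.
labels2 <- unlist(Map(describeX, var.df, names(var.df)), use.names = FALSE)
```

In the loop and in the Map() call, objectX is the column itself (so class, mode and so on describe the data) while the printed label is the column name.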

Finally, and this should maybe be another post, I'd love to hear if others have thought through the whole question of efficient data auditing.  Is there a suite of tools, or a standard set of recommendations, that you use and like? I'd love to hear any useful advice about how to accelerate this stage of a project, and get more quickly to its statistical heart.

Most sincerely, andrewH