Bug in plot.lm function (stats package): positioning of labels for extreme points.


Edward
Hi,

====================
Reproducible example:
====================

data(Animals, package="MASS") # interesting dataset

# Run model
lm1 <- lm(log10(body)~log10(brain), data=Animals)

# Setup 2x2 graphics device
par(mfrow=c(2,2))

# Plot diagnostics, label the two most "extreme" points based on magnitude of residuals
plot(lm1, id.n=2)

==============================
Explanation of resulting plots:
==============================
Notice that, in some of the plots, one of the two extreme points (the two largest
dinosaurs) is labelled unintuitively, counter to what is stated in the documentation
for the "label.pos" argument:

?plot.lm
label.pos: positioning of labels, for the left half and right half of the graph
respectively, for plots 1-3.

The default value for this argument is c(4,2), where 4 means "to the right of" and 2
means "to the left of" as stated in the help page for text (see the 'pos' argument).
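
To make the meaning of the two default values concrete, here is a minimal sketch (not taken from plot.lm itself; the coordinates and label strings are made up) that draws one label with each 'pos' value. It draws to a null PDF device, so no file or window is created:

```r
# Null PDF device: nothing is written to disk and no window opens.
pdf(NULL)
plot(c(1, 10), c(1, 1), xlim = c(0, 11), ylim = c(0, 2), xlab = "x", ylab = "y")

label.pos <- c(4, 2)  # the plot.lm default

# A point in the left half gets label.pos[1] = 4: label drawn to its right.
text(1, 1, "left-half point", pos = label.pos[1], offset = 0.25)

# A point in the right half gets label.pos[2] = 2: label drawn to its left.
text(10, 1, "right-half point", pos = label.pos[2], offset = 0.25)

invisible(dev.off())
```

With these defaults the labels point inwards, towards the middle of the plot, which is what keeps them from running off the edge.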

The Q-Q plot positions the label for Dipliodocus "to the right", but it should be placed
"to the left", since the point lies in the right half of the graph. Similarly, in the
Leverage plot, the label for Brachiosaurus is placed "to the left" when it should be
placed "to the right".

====================================
Reason for error and possible patch:
====================================
The fix is hard to explain, because changes are required in many places.

On line 85 (or thereabouts) of the plot.lm function, there is a function called text.id
which does the labelling:

text.id <- function(x, y, ind, adj.x = TRUE) {
      labpos <- if (adj.x)
        label.pos[1 + as.numeric(x > mean(range(x)))]
      else 3
      text(x, y, labels.id[ind], cex = cex.id, xpd = TRUE,
           pos = labpos, offset = 0.25)
    }

This text.id function is called for the plots corresponding to which==1 (lines 126:128),
which==2 (line 145), which==3 (line 163), which==4 (line 180), which==5 (lines 270:272),
and which==6 (line 312). For example, the call for which==2 is:

      text.id(qq$x[show.rs], qq$y[show.rs], show.rs)

I believe the text.id function should be changed to:

text.id <- function(x, y, ind, adj.x = TRUE) {
      # x and y now hold ALL points; ind selects the ones to label
      labpos <- if (adj.x)
        label.pos[1 + as.numeric(x[ind] > mean(range(x)))]
      else 3
      text(x[ind], y[ind], labels.id[ind], cex = cex.id, xpd = TRUE,
           pos = labpos, offset = 0.25)
    }

The calls to this function should then be changed so that the choice of position is
based on whether each extreme point is greater than the mean of the range of ALL the
data points, not just the extreme ones as is currently the case. For example, at line
145 for the Q-Q plot (which==2), the [show.rs] index should be removed from the first
two arguments, so the code should be:

      text.id(qq$x, qq$y, show.rs)

Similar changes are required for plots 3, 4, and 6. For plots 1 and 5, the following
changes are needed:

Lines 126:128 (which==1)
      y.id <- r # delete [show.r]
      y.id[y.id < 0] <- y.id[y.id < 0] - strheight(" ")/3
      text.id(yh, y.id, show.r) # delete [show.r]

Lines 270:272 (which==5)
      y.id <- rsp # delete [show.rsp]
      y.id[y.id < 0] <- y.id[y.id < 0] - strheight(" ")/3
      text.id(xx, y.id, show.rsp) # delete [show.rsp]
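
To illustrate the difference outside of plot.lm, here is a small sketch that isolates just the position choice. The helper names pos_current and pos_proposed and the x values are made up for illustration; they are not part of the stats package:

```r
# Hypothetical helpers that isolate the position choice made inside text.id.
label.pos <- c(4, 2)  # plot.lm default: 4 = right of point, 2 = left of point

# Current behaviour: text.id only ever sees the points to be labelled,
# so the midpoint is taken over the extreme points alone.
pos_current <- function(x_extreme) {
  label.pos[1 + as.numeric(x_extreme > mean(range(x_extreme)))]
}

# Proposed behaviour: text.id sees all points plus the index of those to
# label, so the midpoint is taken over the full range of the data.
pos_proposed <- function(x, ind) {
  label.pos[1 + as.numeric(x[ind] > mean(range(x)))]
}

x   <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)  # made-up x coordinates
ind <- c(9, 10)                          # both labelled points are in the right half

res_current  <- pos_current(x[ind])
res_proposed <- pos_proposed(x, ind)

print(res_current)   # [1] 4 2  (the point at x = 9 is wrongly labelled to the right)
print(res_proposed)  # [1] 2 2  (both labels go to the left, as documented)
```

With only two labelled points, the current code always puts one of them in the "left half" and the other in the "right half" of their own range, wherever they sit on the graph.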

I tested these changes and they seem to work without breaking anything. If you want me
to make a patch, then I can try. But I thought that these changes were quite significant
and better left to the experts.

Hope that all makes sense.
--
Edward McNeil
Assistant Professor,
Epidemiology Unit,
Prince of Songkla University,
Hat Yai,
Thailand

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel