

Hi R core team,
I experienced the following issue with the attached data/code snippet,
where the studentized residual for a single observation appears to be NaN
given finite predictors/responses, which appears to be driven by the
glm.influence method in the stats package. I am curious to whether this is
a consequence of the specific implementation used for computing the
influence, which it would appear is the driving force for the NaN influence
for the point, that I was ultimately able to trace back through the
lm.influence method to this specific line
< https://github.com/SurajGupta/rsource/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67>
which
calls C code which calls iminfl.f
< https://github.com/SurajGupta/rsource/blob/master/src/library/stats/src/lminfl.f>
(I
don't know fortran so I can't debug further). My understanding is that the
specific issue would have to do with the leaveoneout variance estimate
associated with this particular point, which it seems based on my
understanding should be finite given finite predictors/responses. Let me
know. Thanks!
Sincerely,

Eric Bridgeford
ericwb.me
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Nothing was attached. The rhelp server strips most attachments. Include
your code inline.
Also note that
> 0/0
[1] NaN
so maybe something like that occurs in the course of your calculations. But
that's just a guess, so feel free to disregard.
Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
 Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford < [hidden email]> wrote:
> Hi R core team,
>
> I experienced the following issue with the attached data/code snippet,
> where the studentized residual for a single observation appears to be NaN
> given finite predictors/responses, which appears to be driven by the
> glm.influence method in the stats package. I am curious to whether this is
> a consequence of the specific implementation used for computing the
> influence, which it would appear is the driving force for the NaN influence
> for the point, that I was ultimately able to trace back through the
> lm.influence method to this specific line
> <
> https://github.com/SurajGupta/rsource/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67> >
> which
> calls C code which calls iminfl.f
> <
> https://github.com/SurajGupta/rsource/blob/master/src/library/stats/src/lminfl.f> >
> (I
> don't know fortran so I can't debug further). My understanding is that the
> specific issue would have to do with the leaveoneout variance estimate
> associated with this particular point, which it seems based on my
> understanding should be finite given finite predictors/responses. Let me
> know. Thanks!
>
> Sincerely,
>
> 
> Eric Bridgeford
> ericwb.me
> ______________________________________________
> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/rhelp> PLEASE do read the posting guide
> http://www.Rproject.org/postingguide.html> and provide commented, minimal, selfcontained, reproducible code.
>
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


How can I add attachments? The following two files were attached in the
initial message
On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter < [hidden email]> wrote:
> Nothing was attached. The rhelp server strips most attachments. Include
> your code inline.
>
> Also note that
>
> > 0/0
> [1] NaN
>
> so maybe something like that occurs in the course of your calculations.
> But that's just a guess, so feel free to disregard.
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford < [hidden email]>
> wrote:
>
>> Hi R core team,
>>
>> I experienced the following issue with the attached data/code snippet,
>> where the studentized residual for a single observation appears to be NaN
>> given finite predictors/responses, which appears to be driven by the
>> glm.influence method in the stats package. I am curious to whether this is
>> a consequence of the specific implementation used for computing the
>> influence, which it would appear is the driving force for the NaN
>> influence
>> for the point, that I was ultimately able to trace back through the
>> lm.influence method to this specific line
>> <
>> https://github.com/SurajGupta/rsource/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67>> >
>> which
>> calls C code which calls iminfl.f
>> <
>> https://github.com/SurajGupta/rsource/blob/master/src/library/stats/src/lminfl.f>> >
>> (I
>> don't know fortran so I can't debug further). My understanding is that the
>> specific issue would have to do with the leaveoneout variance estimate
>> associated with this particular point, which it seems based on my
>> understanding should be finite given finite predictors/responses. Let me
>> know. Thanks!
>>
>> Sincerely,
>>
>> 
>> Eric Bridgeford
>> ericwb.me
>> ______________________________________________
>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/rhelp>> PLEASE do read the posting guide
>> http://www.Rproject.org/postingguide.html>> and provide commented, minimal, selfcontained, reproducible code.
>>
>

Eric Bridgeford
ericwb.me
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


I told you already: **Include code inline **
See ?dput for how to include a text version of objects, such as data
frames, inline.
Otherwise, I believe .txt text files are not stripped if you insist on
*attaching* data or code. Others may have better advice.
Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
 Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford < [hidden email]> wrote:
> How can I add attachments? The following two files were attached in the
> initial message
>
> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter < [hidden email]> wrote:
>
>> Nothing was attached. The rhelp server strips most attachments. Include
>> your code inline.
>>
>> Also note that
>>
>> > 0/0
>> [1] NaN
>>
>> so maybe something like that occurs in the course of your calculations.
>> But that's just a guess, so feel free to disregard.
>>
>>
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming along
>> and sticking things into it."
>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>
>>
>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford < [hidden email]>
>> wrote:
>>
>>> Hi R core team,
>>>
>>> I experienced the following issue with the attached data/code snippet,
>>> where the studentized residual for a single observation appears to be NaN
>>> given finite predictors/responses, which appears to be driven by the
>>> glm.influence method in the stats package. I am curious to whether this
>>> is
>>> a consequence of the specific implementation used for computing the
>>> influence, which it would appear is the driving force for the NaN
>>> influence
>>> for the point, that I was ultimately able to trace back through the
>>> lm.influence method to this specific line
>>> <
>>> https://github.com/SurajGupta/rsource/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67>>> >
>>> which
>>> calls C code which calls iminfl.f
>>> <
>>> https://github.com/SurajGupta/rsource/blob/master/src/library/stats/src/lminfl.f>>> >
>>> (I
>>> don't know fortran so I can't debug further). My understanding is that
>>> the
>>> specific issue would have to do with the leaveoneout variance estimate
>>> associated with this particular point, which it seems based on my
>>> understanding should be finite given finite predictors/responses. Let me
>>> know. Thanks!
>>>
>>> Sincerely,
>>>
>>> 
>>> Eric Bridgeford
>>> ericwb.me
>>> ______________________________________________
>>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/rhelp>>> PLEASE do read the posting guide
>>> http://www.Rproject.org/postingguide.html>>> and provide commented, minimal, selfcontained, reproducible code.
>>>
>>
>
> 
> Eric Bridgeford
> ericwb.me
>
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Also, I suggest you read ?influence which may explain the source of your
NaN's .
Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
 Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter < [hidden email]> wrote:
> I told you already: **Include code inline **
>
> See ?dput for how to include a text version of objects, such as data
> frames, inline.
>
> Otherwise, I believe .txt text files are not stripped if you insist on
> *attaching* data or code. Others may have better advice.
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford < [hidden email]> wrote:
>
>> How can I add attachments? The following two files were attached in the
>> initial message
>>
>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter < [hidden email]>
>> wrote:
>>
>>> Nothing was attached. The rhelp server strips most attachments. Include
>>> your code inline.
>>>
>>> Also note that
>>>
>>> > 0/0
>>> [1] NaN
>>>
>>> so maybe something like that occurs in the course of your calculations.
>>> But that's just a guess, so feel free to disregard.
>>>
>>>
>>> Bert Gunter
>>>
>>> "The trouble with having an open mind is that people keep coming along
>>> and sticking things into it."
>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>
>>>
>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford < [hidden email]>
>>> wrote:
>>>
>>>> Hi R core team,
>>>>
>>>> I experienced the following issue with the attached data/code snippet,
>>>> where the studentized residual for a single observation appears to be
>>>> NaN
>>>> given finite predictors/responses, which appears to be driven by the
>>>> glm.influence method in the stats package. I am curious to whether this
>>>> is
>>>> a consequence of the specific implementation used for computing the
>>>> influence, which it would appear is the driving force for the NaN
>>>> influence
>>>> for the point, that I was ultimately able to trace back through the
>>>> lm.influence method to this specific line
>>>> <
>>>> https://github.com/SurajGupta/rsource/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67>>>> >
>>>> which
>>>> calls C code which calls iminfl.f
>>>> <
>>>> https://github.com/SurajGupta/rsource/blob/master/src/library/stats/src/lminfl.f>>>> >
>>>> (I
>>>> don't know fortran so I can't debug further). My understanding is that
>>>> the
>>>> specific issue would have to do with the leaveoneout variance estimate
>>>> associated with this particular point, which it seems based on my
>>>> understanding should be finite given finite predictors/responses. Let me
>>>> know. Thanks!
>>>>
>>>> Sincerely,
>>>>
>>>> 
>>>> Eric Bridgeford
>>>> ericwb.me
>>>> ______________________________________________
>>>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/rhelp>>>> PLEASE do read the posting guide
>>>> http://www.Rproject.org/postingguide.html>>>> and provide commented, minimal, selfcontained, reproducible code.
>>>>
>>>
>>
>> 
>> Eric Bridgeford
>> ericwb.me
>>
>
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


I agree the influence documentation suggests NaNs may result; however, as
these can be manually computed and are, indeed, finite/existing (ie,
computing the heldout influence by manually training n models for n points
to obtain n leave one out influence measures), I don't possibly see how the
function SHOULD return NaN, and given that it is returning NaN, that
suggests to me that there should be either a) Providing an alternative
method to compute them that (may be slower) that returns the correct
results in the even that lm.influence does not return a good approximation
(ie, a command line argument for type="approx" that does the approximation
strategy employed currently, or an alternative type="direct" or something
like that that computes them manually), or b) a heuristic to suggest why
NaNs might result from one's particular inputs/what can be done to fix it
(if the approximation strategy is the source of the problem) or what the
issue is with the data that will cause NaNs. Hence I was looking to start a
discussion around the specific strategy employed to compute the elements.
Below is the code:
moon_data < structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L,
11L,
12L, 9L, 10L, 4L, 6L, 3L),
.Label = c("Ceres ", "Earth", "Eris ",
"Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ", "Neptune
",
"Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2,
9.54, 19.22,
30.06, 39.5, 43.35, 45.8, 67.7),
Diameter = c(0.382, 0.949,
1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e04, 317.8,
95.2, 14.6, 17.2, 0.0022, 7e04, 7e04,
0.0025), Moons = c(0L,
0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L), Volume
= c(0.0291869497930152,
0.447504348276571, 0.523598775598299, 0.0788376225681443,
0.000268082573106329, 737.393372232996, 441.729261571372,
33.6865588825666, 30.6549628355953, 0.00305362805928928,
0.00176714586764426, 0.00090477868423386, 0.00359136400182873
)), row.names = c(NA, 13L), class = "data.frame")
fit < glm.nb(Moons ~ Volume, data = moon_data)
rstudent(fit)
fit2 < update(fit, subset = Name != "Jupiter ")
rstudent(fit2)
influence(fit2)$sigma
# 1 2 3 4 5 7 8 9
10 11 12 13
# 1.077945 1.077813 1.165025 1.181685 1.077954 NaN 1.044454 1.152110
1.187586 1.181696 1.077954 1.165147
Sincerely,
Eric
On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter < [hidden email]> wrote:
> Also, I suggest you read ?influence which may explain the source of your
> NaN's .
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter < [hidden email]> wrote:
>
>> I told you already: **Include code inline **
>>
>> See ?dput for how to include a text version of objects, such as data
>> frames, inline.
>>
>> Otherwise, I believe .txt text files are not stripped if you insist on
>> *attaching* data or code. Others may have better advice.
>>
>>
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming along
>> and sticking things into it."
>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>
>>
>> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford < [hidden email]>
>> wrote:
>>
>>> How can I add attachments? The following two files were attached in the
>>> initial message
>>>
>>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter < [hidden email]>
>>> wrote:
>>>
>>>> Nothing was attached. The rhelp server strips most attachments.
>>>> Include your code inline.
>>>>
>>>> Also note that
>>>>
>>>> > 0/0
>>>> [1] NaN
>>>>
>>>> so maybe something like that occurs in the course of your calculations.
>>>> But that's just a guess, so feel free to disregard.
>>>>
>>>>
>>>> Bert Gunter
>>>>
>>>> "The trouble with having an open mind is that people keep coming along
>>>> and sticking things into it."
>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>
>>>>
>>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford < [hidden email]>
>>>> wrote:
>>>>
>>>>> Hi R core team,
>>>>>
>>>>> I experienced the following issue with the attached data/code snippet,
>>>>> where the studentized residual for a single observation appears to be
>>>>> NaN
>>>>> given finite predictors/responses, which appears to be driven by the
>>>>> glm.influence method in the stats package. I am curious to whether
>>>>> this is
>>>>> a consequence of the specific implementation used for computing the
>>>>> influence, which it would appear is the driving force for the NaN
>>>>> influence
>>>>> for the point, that I was ultimately able to trace back through the
>>>>> lm.influence method to this specific line
>>>>> <
>>>>> https://github.com/SurajGupta/rsource/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67>>>>> >
>>>>> which
>>>>> calls C code which calls iminfl.f
>>>>> <
>>>>> https://github.com/SurajGupta/rsource/blob/master/src/library/stats/src/lminfl.f>>>>> >
>>>>> (I
>>>>> don't know fortran so I can't debug further). My understanding is that
>>>>> the
>>>>> specific issue would have to do with the leaveoneout variance
>>>>> estimate
>>>>> associated with this particular point, which it seems based on my
>>>>> understanding should be finite given finite predictors/responses. Let
>>>>> me
>>>>> know. Thanks!
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> 
>>>>> Eric Bridgeford
>>>>> ericwb.me
>>>>> ______________________________________________
>>>>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/rhelp>>>>> PLEASE do read the posting guide
>>>>> http://www.Rproject.org/postingguide.html>>>>> and provide commented, minimal, selfcontained, reproducible code.
>>>>>
>>>>
>>>
>>> 
>>> Eric Bridgeford
>>> ericwb.me
>>>
>>

Eric Bridgeford
ericwb.me
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Hi Eric,
When I run your code (using the MASS library) I find that
rstudent(fit2) also returns NaN in the seventh position. Perhaps the
problem is occurring there and not in the "influence" function.
Jim
On Wed, Apr 3, 2019 at 9:12 AM Eric Bridgeford < [hidden email]> wrote:
>
> I agree the influence documentation suggests NaNs may result; however, as
> these can be manually computed and are, indeed, finite/existing (ie,
> computing the heldout influence by manually training n models for n points
> to obtain n leave one out influence measures), I don't possibly see how the
> function SHOULD return NaN, and given that it is returning NaN, that
> suggests to me that there should be either a) Providing an alternative
> method to compute them that (may be slower) that returns the correct
> results in the even that lm.influence does not return a good approximation
> (ie, a command line argument for type="approx" that does the approximation
> strategy employed currently, or an alternative type="direct" or something
> like that that computes them manually), or b) a heuristic to suggest why
> NaNs might result from one's particular inputs/what can be done to fix it
> (if the approximation strategy is the source of the problem) or what the
> issue is with the data that will cause NaNs. Hence I was looking to start a
> discussion around the specific strategy employed to compute the elements.
>
> Below is the code:
> moon_data < structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L,
> 11L,
> 12L, 9L, 10L, 4L, 6L, 3L),
> .Label = c("Ceres ", "Earth", "Eris ",
>
> "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ", "Neptune
> ",
>
> "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
> Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2,
> 9.54, 19.22,
> 30.06, 39.5, 43.35, 45.8, 67.7),
> Diameter = c(0.382, 0.949,
>
> 1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
>
> 0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e04, 317.8,
>
> 95.2, 14.6, 17.2, 0.0022, 7e04, 7e04,
> 0.0025), Moons = c(0L,
>
>
> 0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L), Volume
> = c(0.0291869497930152,
>
>
>
> 0.447504348276571, 0.523598775598299, 0.0788376225681443,
>
>
>
> 0.000268082573106329, 737.393372232996, 441.729261571372,
>
>
>
> 33.6865588825666, 30.6549628355953, 0.00305362805928928,
>
>
>
> 0.00176714586764426, 0.00090477868423386, 0.00359136400182873
>
>
> )), row.names = c(NA, 13L), class = "data.frame")
>
> fit < glm.nb(Moons ~ Volume, data = moon_data)
> rstudent(fit)
>
> fit2 < update(fit, subset = Name != "Jupiter ")
> rstudent(fit2)
>
> influence(fit2)$sigma
>
> # 1 2 3 4 5 7 8 9
> 10 11 12 13
> # 1.077945 1.077813 1.165025 1.181685 1.077954 NaN 1.044454 1.152110
> 1.187586 1.181696 1.077954 1.165147
>
> Sincerely,
> Eric
>
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


rstudent calls influence, to my knowledge, and all of the results passed by
rstudent are dependent on values returned by influence (other than the
weights, which I can't imagine are NaN), so I believe that influence is the
issue. See the line
https://github.com/SurajGupta/rsource/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L135.
Eric
On Tue, Apr 2, 2019 at 6:36 PM Jim Lemon < [hidden email]> wrote:
> Hi Eric,
> When I run your code (using the MASS library) I find that
> rstudent(fit2) also returns NaN in the seventh position. Perhaps the
> problem is occurring there and not in the "influence" function.
>
> Jim
>
> On Wed, Apr 3, 2019 at 9:12 AM Eric Bridgeford < [hidden email]> wrote:
> >
> > I agree the influence documentation suggests NaNs may result; however, as
> > these can be manually computed and are, indeed, finite/existing (ie,
> > computing the heldout influence by manually training n models for n
> points
> > to obtain n leave one out influence measures), I don't possibly see how
> the
> > function SHOULD return NaN, and given that it is returning NaN, that
> > suggests to me that there should be either a) Providing an alternative
> > method to compute them that (may be slower) that returns the correct
> > results in the even that lm.influence does not return a good
> approximation
> > (ie, a command line argument for type="approx" that does the
> approximation
> > strategy employed currently, or an alternative type="direct" or something
> > like that that computes them manually), or b) a heuristic to suggest why
> > NaNs might result from one's particular inputs/what can be done to fix it
> > (if the approximation strategy is the source of the problem) or what the
> > issue is with the data that will cause NaNs. Hence I was looking to
> start a
> > discussion around the specific strategy employed to compute the elements.
> >
> > Below is the code:
> > moon_data < structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L,
> > 11L,
> > 12L, 9L, 10L, 4L, 6L, 3L),
> > .Label = c("Ceres ", "Earth", "Eris ",
> >
> > "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ",
> "Neptune
> > ",
> >
> > "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
> > Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2,
> > 9.54, 19.22,
> > 30.06, 39.5, 43.35, 45.8, 67.7),
> > Diameter = c(0.382, 0.949,
> >
> > 1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
> >
> > 0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e04, 317.8,
> >
> > 95.2, 14.6, 17.2, 0.0022, 7e04, 7e04,
> > 0.0025), Moons = c(0L,
> >
> >
> > 0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L),
> Volume
> > = c(0.0291869497930152,
> >
> >
> >
> > 0.447504348276571, 0.523598775598299, 0.0788376225681443,
> >
> >
> >
> > 0.000268082573106329, 737.393372232996, 441.729261571372,
> >
> >
> >
> > 33.6865588825666, 30.6549628355953, 0.00305362805928928,
> >
> >
> >
> > 0.00176714586764426, 0.00090477868423386, 0.00359136400182873
> >
> >
> > )), row.names = c(NA, 13L), class = "data.frame")
> >
> > fit < glm.nb(Moons ~ Volume, data = moon_data)
> > rstudent(fit)
> >
> > fit2 < update(fit, subset = Name != "Jupiter ")
> > rstudent(fit2)
> >
> > influence(fit2)$sigma
> >
> > # 1 2 3 4 5 7 8 9
> > 10 11 12 13
> > # 1.077945 1.077813 1.165025 1.181685 1.077954 NaN 1.044454 1.152110
> > 1.187586 1.181696 1.077954 1.165147
> >
> > Sincerely,
> > Eric
> >
>

Eric Bridgeford
ericwb.me
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Dear Eric,
Have you looked at your data?  for example:
plot(log(Moons) ~ Volume, data = moon_data)
text(log(Moons) ~ Volume, data = moon_data, labels=Name, adj=1, subset = Volume > 400)
The negativebinomial model doesn't look reasonable, does it?
After you eliminate Jupiter there's one very high leverage point left, Saturn. Computing studentized residuals entails an approximation to deleting that as well from the model, so try fitting
fit3 < update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
summary(fit3)
which runs into numeric difficulties.
Then look at:
plot(log(Moons) ~ Volume, data = moon_data, subset = Volume < 400)
Finally, try
plot(log(Moons) ~ log(Volume), data = moon_data)
fit4 < update(fit2, . ~ log(Volume))
rstudent(fit4)
I hope this helps,
John

John Fox
Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: https://socialsciences.mcmaster.ca/jfox/> Original Message
> From: Rhelp [mailto: [hidden email]] On Behalf Of Eric
> Bridgeford
> Sent: Tuesday, April 2, 2019 5:01 PM
> To: Bert Gunter < [hidden email]>
> Cc: Rhelp < [hidden email]>
> Subject: Re: [R] Fwd: Potential Issue with lm.influence
>
> I agree the influence documentation suggests NaNs may result; however, as
> these can be manually computed and are, indeed, finite/existing (ie,
> computing the heldout influence by manually training n models for n points
> to obtain n leave one out influence measures), I don't possibly see how the
> function SHOULD return NaN, and given that it is returning NaN, that
> suggests to me that there should be either a) Providing an alternative
> method to compute them that (may be slower) that returns the correct
> results in the even that lm.influence does not return a good approximation
> (ie, a command line argument for type="approx" that does the
> approximation strategy employed currently, or an alternative type="direct"
> or something like that that computes them manually), or b) a heuristic to
> suggest why NaNs might result from one's particular inputs/what can be
> done to fix it (if the approximation strategy is the source of the problem) or
> what the issue is with the data that will cause NaNs. Hence I was looking to
> start a discussion around the specific strategy employed to compute the
> elements.
>
> Below is the code:
> moon_data < structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L, 11L,
> 12L, 9L, 10L, 4L, 6L, 3L), .Label = c("Ceres ", "Earth",
> "Eris ",
>
> "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ", "Neptune ",
>
> "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
> Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2, 9.54, 19.22,
> 30.06, 39.5, 43.35, 45.8, 67.7), Diameter = c(0.382, 0.949,
>
> 1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
>
> 0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e04, 317.8,
>
> 95.2, 14.6, 17.2, 0.0022, 7e04, 7e04, 0.0025), Moons = c(0L,
>
>
> 0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L), Volume =
> c(0.0291869497930152,
>
>
>
> 0.447504348276571, 0.523598775598299, 0.0788376225681443,
>
>
>
> 0.000268082573106329, 737.393372232996, 441.729261571372,
>
>
>
> 33.6865588825666, 30.6549628355953, 0.00305362805928928,
>
>
>
> 0.00176714586764426, 0.00090477868423386, 0.00359136400182873
>
>
> )), row.names = c(NA, 13L), class = "data.frame")
>
> fit < glm.nb(Moons ~ Volume, data = moon_data)
> rstudent(fit)
>
> fit2 < update(fit, subset = Name != "Jupiter ")
> rstudent(fit2)
>
> influence(fit2)$sigma
>
> # 1 2 3 4 5 7 8 9
> 10 11 12 13
> # 1.077945 1.077813 1.165025 1.181685 1.077954 NaN 1.044454 1.152110
> 1.187586 1.181696 1.077954 1.165147
>
> Sincerely,
> Eric
>
> On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter < [hidden email]>
> wrote:
>
> > Also, I suggest you read ?influence which may explain the source of
> > your NaN's .
> >
> > Bert Gunter
> >
> > "The trouble with having an open mind is that people keep coming along
> > and sticking things into it."
> >  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >
> >
> > On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter < [hidden email]>
> wrote:
> >
> >> I told you already: **Include code inline **
> >>
> >> See ?dput for how to include a text version of objects, such as data
> >> frames, inline.
> >>
> >> Otherwise, I believe .txt text files are not stripped if you insist
> >> on
> >> *attaching* data or code. Others may have better advice.
> >>
> >>
> >> Bert Gunter
> >>
> >> "The trouble with having an open mind is that people keep coming
> >> along and sticking things into it."
> >>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>
> >>
> >> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford < [hidden email]>
> >> wrote:
> >>
> >>> How can I add attachments? The following two files were attached in
> >>> the initial message
> >>>
> >>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter < [hidden email]>
> >>> wrote:
> >>>
> >>>> Nothing was attached. The rhelp server strips most attachments.
> >>>> Include your code inline.
> >>>>
> >>>> Also note that
> >>>>
> >>>> > 0/0
> >>>> [1] NaN
> >>>>
> >>>> so maybe something like that occurs in the course of your calculations.
> >>>> But that's just a guess, so feel free to disregard.
> >>>>
> >>>>
> >>>> Bert Gunter
> >>>>
> >>>> "The trouble with having an open mind is that people keep coming
> >>>> along and sticking things into it."
> >>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>>>
> >>>>
> >>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford
> >>>> < [hidden email]>
> >>>> wrote:
> >>>>
> >>>>> Hi R core team,
> >>>>>
> >>>>> I experienced the following issue with the attached data/code
> >>>>> snippet, where the studentized residual for a single observation
> >>>>> appears to be NaN given finite predictors/responses, which appears
> >>>>> to be driven by the glm.influence method in the stats package. I
> >>>>> am curious to whether this is a consequence of the specific
> >>>>> implementation used for computing the influence, which it would
> >>>>> appear is the driving force for the NaN influence for the point,
> >>>>> that I was ultimately able to trace back through the lm.influence
> >>>>> method to this specific line <
> >>>>> https://github.com/SurajGupta/r> source/blob/a28e609e72ed7c47f6ddfb
> >>>>> b86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67
> >>>>> >
> >>>>> which
> >>>>> calls C code which calls iminfl.f
> >>>>> <
> >>>>> https://github.com/SurajGupta/rsource/blob/master/src/library/sta> >>>>> ts/src/lminfl.f
> >>>>> >
> >>>>> (I
> >>>>> don't know fortran so I can't debug further). My understanding is
> >>>>> that the specific issue would have to do with the leaveoneout
> >>>>> variance estimate associated with this particular point, which it
> >>>>> seems based on my understanding should be finite given finite
> >>>>> predictors/responses. Let me know. Thanks!
> >>>>>
> >>>>> Sincerely,
> >>>>>
> >>>>> 
> >>>>> Eric Bridgeford
> >>>>> ericwb.me
> >>>>> ______________________________________________
> >>>>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> >>>>> https://stat.ethz.ch/mailman/listinfo/rhelp> >>>>> PLEASE do read the posting guide
> >>>>> http://www.Rproject.org/postingguide.html> >>>>> and provide commented, minimal, selfcontained, reproducible code.
> >>>>>
> >>>>
> >>>
> >>> 
> >>> Eric Bridgeford
> >>> ericwb.me
> >>>
> >>
>
> 
> Eric Bridgeford
> ericwb.me
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/rhelp> PLEASE do read the posting guide http://www.Rproject.org/posting> guide.html
> and provide commented, minimal, selfcontained, reproducible code.
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Hey John,
I am aware they are high leverage points, and that the model is not the
best for them. The purpose of this dataset was to explore high leverage
points, and diagnostic statistics through which one would identify them.
What I am saying is that the current behavior of the function seems a
little nonspecific to me; the influence for this problem is
finite/computable manually by fitting n models to n1 points (manually
holding out each point individually to obtain the loovariance, and
computing the influence in the nonapproximate way).
I am just suggesting that it seems the function could be improved by, say,
throwing specific warnings when NaNs may arise. Ie, "Your have points that
are very high leverage. The approximation technique is not numerically
stable for these points and the results should be used with caution"
etc...; I am sure there are other also prehoc approaches to diagnose other
ways in which this function could fail). The approximation technique not
behaving well for points that are ultra high leverage just seems peculiar
that that would return an NaN with no other recommendations/advice/specific
warnings, especially since the influence is frequently used to diagnosing
this specific issue.
Alternatively, one could afford an optional argument type="manual" that
computes the heldout variance manually rather than the approximate
fashion, and add a comment to use this in the help menu when you have high
leverage points (this is what I ended up doing to obtain the true influence
and the externally studentized residual).
I just think some more specificity could be of use for future users, to
make the R:stats community even better :) Does that make sense?
Sincerely,
Eric
On Tue, Apr 2, 2019 at 7:53 PM Fox, John < [hidden email]> wrote:
> Dear Eric,
>
> Have you looked at your data?  for example:
>
> plot(log(Moons) ~ Volume, data = moon_data)
> text(log(Moons) ~ Volume, data = moon_data, labels=Name, adj=1,
> subset = Volume > 400)
>
> The negativebinomial model doesn't look reasonable, does it?
>
> After you eliminate Jupiter there's one very high leverage point left,
> Saturn. Computing studentized residuals entails an approximation to
> deleting that as well from the model, so try fitting
>
> fit3 < update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
> summary(fit3)
>
> which runs into numeric difficulties.
>
> Then look at:
>
> plot(log(Moons) ~ Volume, data = moon_data, subset = Volume < 400)
>
> Finally, try
>
> plot(log(Moons) ~ log(Volume), data = moon_data)
> fit4 < update(fit2, . ~ log(Volume))
> rstudent(fit4)
>
> I hope this helps,
> John
>
> 
> John Fox
> Professor Emeritus
> McMaster University
> Hamilton, Ontario, Canada
> Web: https://socialsciences.mcmaster.ca/jfox/>
>
>
>
> > Original Message
> > From: Rhelp [mailto: [hidden email]] On Behalf Of Eric
> > Bridgeford
> > Sent: Tuesday, April 2, 2019 5:01 PM
> > To: Bert Gunter < [hidden email]>
> > Cc: Rhelp < [hidden email]>
> > Subject: Re: [R] Fwd: Potential Issue with lm.influence
> >
> > I agree the influence documentation suggests NaNs may result; however, as
> > these can be manually computed and are, indeed, finite/existing (ie,
> > computing the heldout influence by manually training n models for n
> points
> > to obtain n leave one out influence measures), I don't possibly see how
> the
> > function SHOULD return NaN, and given that it is returning NaN, that
> > suggests to me that there should be either a) Providing an alternative
> > method to compute them that (may be slower) that returns the correct
> > results in the even that lm.influence does not return a good
> approximation
> > (ie, a command line argument for type="approx" that does the
> > approximation strategy employed currently, or an alternative
> type="direct"
> > or something like that that computes them manually), or b) a heuristic to
> > suggest why NaNs might result from one's particular inputs/what can be
> > done to fix it (if the approximation strategy is the source of the
> problem) or
> > what the issue is with the data that will cause NaNs. Hence I was
> looking to
> > start a discussion around the specific strategy employed to compute the
> > elements.
> >
> > Below is the code:
> > moon_data < structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L,
> 11L,
> > 12L, 9L, 10L, 4L, 6L,
> 3L), .Label = c("Ceres ", "Earth",
> > "Eris ",
> >
> > "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ",
> "Neptune ",
> >
> > "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
> > Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2,
> 9.54, 19.22,
> > 30.06, 39.5, 43.35, 45.8,
> 67.7), Diameter = c(0.382, 0.949,
> >
> > 1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
> >
> > 0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e04, 317.8,
> >
> > 95.2, 14.6, 17.2, 0.0022, 7e04, 7e04,
> 0.0025), Moons = c(0L,
> >
> >
> > 0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L),
> Volume =
> > c(0.0291869497930152,
> >
> >
> >
> > 0.447504348276571, 0.523598775598299, 0.0788376225681443,
> >
> >
> >
> > 0.000268082573106329, 737.393372232996, 441.729261571372,
> >
> >
> >
> > 33.6865588825666, 30.6549628355953, 0.00305362805928928,
> >
> >
> >
> > 0.00176714586764426, 0.00090477868423386, 0.00359136400182873
> >
> >
> > )), row.names = c(NA, 13L), class = "data.frame")
> >
> > fit < glm.nb(Moons ~ Volume, data = moon_data)
> > rstudent(fit)
> >
> > fit2 < update(fit, subset = Name != "Jupiter ")
> > rstudent(fit2)
> >
> > influence(fit2)$sigma
> >
> > # 1 2 3 4 5 7 8 9
> > 10 11 12 13
> > # 1.077945 1.077813 1.165025 1.181685 1.077954 NaN 1.044454 1.152110
> > 1.187586 1.181696 1.077954 1.165147
> >
> > Sincerely,
> > Eric
> >
> > On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter < [hidden email]>
> > wrote:
> >
> > > Also, I suggest you read ?influence which may explain the source of
> > > your NaN's .
> > >
> > > Bert Gunter
> > >
> > > "The trouble with having an open mind is that people keep coming along
> > > and sticking things into it."
> > >  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> > >
> > >
> > > On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter < [hidden email]>
> > wrote:
> > >
> > >> I told you already: **Include code inline **
> > >>
> > >> See ?dput for how to include a text version of objects, such as data
> > >> frames, inline.
> > >>
> > >> Otherwise, I believe .txt text files are not stripped if you insist
> > >> on
> > >> *attaching* data or code. Others may have better advice.
> > >>
> > >>
> > >> Bert Gunter
> > >>
> > >> "The trouble with having an open mind is that people keep coming
> > >> along and sticking things into it."
> > >>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> > >>
> > >>
> > >> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford < [hidden email]>
> > >> wrote:
> > >>
> > >>> How can I add attachments? The following two files were attached in
> > >>> the initial message
> > >>>
> > >>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter < [hidden email]>
> > >>> wrote:
> > >>>
> > >>>> Nothing was attached. The rhelp server strips most attachments.
> > >>>> Include your code inline.
> > >>>>
> > >>>> Also note that
> > >>>>
> > >>>> > 0/0
> > >>>> [1] NaN
> > >>>>
> > >>>> so maybe something like that occurs in the course of your
> calculations.
> > >>>> But that's just a guess, so feel free to disregard.
> > >>>>
> > >>>>
> > >>>> Bert Gunter
> > >>>>
> > >>>> "The trouble with having an open mind is that people keep coming
> > >>>> along and sticking things into it."
> > >>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> > >>>>
> > >>>>
> > >>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford
> > >>>> < [hidden email]>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi R core team,
> > >>>>>
> > >>>>> I experienced the following issue with the attached data/code
> > >>>>> snippet, where the studentized residual for a single observation
> > >>>>> appears to be NaN given finite predictors/responses, which appears
> > >>>>> to be driven by the glm.influence method in the stats package. I
> > >>>>> am curious to whether this is a consequence of the specific
> > >>>>> implementation used for computing the influence, which it would
> > >>>>> appear is the driving force for the NaN influence for the point,
> > >>>>> that I was ultimately able to trace back through the lm.influence
> > >>>>> method to this specific line <
> > >>>>> https://github.com/SurajGupta/r> > source/blob/a28e609e72ed7c47f6ddfb
> > >>>>> b86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67
> > >>>>> >
> > >>>>> which
> > >>>>> calls C code which calls iminfl.f
> > >>>>> <
> > >>>>> https://github.com/SurajGupta/rsource/blob/master/src/library/sta> > >>>>> ts/src/lminfl.f
> > >>>>> >
> > >>>>> (I
> > >>>>> don't know fortran so I can't debug further). My understanding is
> > >>>>> that the specific issue would have to do with the leaveoneout
> > >>>>> variance estimate associated with this particular point, which it
> > >>>>> seems based on my understanding should be finite given finite
> > >>>>> predictors/responses. Let me know. Thanks!
> > >>>>>
> > >>>>> Sincerely,
> > >>>>>
> > >>>>> 
> > >>>>> Eric Bridgeford
> > >>>>> ericwb.me
> > >>>>> ______________________________________________
> > >>>>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> > >>>>> https://stat.ethz.ch/mailman/listinfo/rhelp> > >>>>> PLEASE do read the posting guide
> > >>>>> http://www.Rproject.org/postingguide.html> > >>>>> and provide commented, minimal, selfcontained, reproducible code.
> > >>>>>
> > >>>>
> > >>>
> > >>> 
> > >>> Eric Bridgeford
> > >>> ericwb.me
> > >>>
> > >>
> >
> > 
> > Eric Bridgeford
> > ericwb.me
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > [hidden email] mailing list  To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/rhelp> > PLEASE do read the posting guide http://www.Rproject.org/posting> > guide.html
> > and provide commented, minimal, selfcontained, reproducible code.
>

Eric Bridgeford
ericwb.me
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Yes, also notice that
> predict(fit3, new=moon_data, type="resp")
1 2 3 4 5 6
1.060694e+00 1.102008e+00 1.109695e+00 1.065515e+00 1.057896e+00 1.892312e+29
7 8 9 10 11 12
3.531271e+17 2.295015e+01 1.739889e+01 1.058165e+00 1.058041e+00 1.057957e+00
13
1.058217e+00
so the model of fit3 predicts that Jupiter and Saturn should have several bazillions of moons each!
pd
> On 3 Apr 2019, at 01:53 , Fox, John < [hidden email]> wrote:
>
> Dear Eric,
>
> Have you looked at your data?  for example:
>
> plot(log(Moons) ~ Volume, data = moon_data)
> text(log(Moons) ~ Volume, data = moon_data, labels=Name, adj=1, subset = Volume > 400)
>
> The negativebinomial model doesn't look reasonable, does it?
>
> After you eliminate Jupiter there's one very high leverage point left, Saturn. Computing studentized residuals entails an approximation to deleting that as well from the model, so try fitting
>
> fit3 < update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
> summary(fit3)
>
> which runs into numeric difficulties.
>
> Then look at:
>
> plot(log(Moons) ~ Volume, data = moon_data, subset = Volume < 400)
>
> Finally, try
>
> plot(log(Moons) ~ log(Volume), data = moon_data)
> fit4 < update(fit2, . ~ log(Volume))
> rstudent(fit4)
>
> I hope this helps,
> John
>
> 
> John Fox
> Professor Emeritus
> McMaster University
> Hamilton, Ontario, Canada
> Web: https://socialsciences.mcmaster.ca/jfox/>
>
>
>
>> Original Message
>> From: Rhelp [mailto: [hidden email]] On Behalf Of Eric
>> Bridgeford
>> Sent: Tuesday, April 2, 2019 5:01 PM
>> To: Bert Gunter < [hidden email]>
>> Cc: Rhelp < [hidden email]>
>> Subject: Re: [R] Fwd: Potential Issue with lm.influence
>>
>> I agree the influence documentation suggests NaNs may result; however, as
>> these can be manually computed and are, indeed, finite/existing (ie,
>> computing the heldout influence by manually training n models for n points
>> to obtain n leave one out influence measures), I don't possibly see how the
>> function SHOULD return NaN, and given that it is returning NaN, that
>> suggests to me that there should be either a) Providing an alternative
>> method to compute them that (may be slower) that returns the correct
>> results in the even that lm.influence does not return a good approximation
>> (ie, a command line argument for type="approx" that does the
>> approximation strategy employed currently, or an alternative type="direct"
>> or something like that that computes them manually), or b) a heuristic to
>> suggest why NaNs might result from one's particular inputs/what can be
>> done to fix it (if the approximation strategy is the source of the problem) or
>> what the issue is with the data that will cause NaNs. Hence I was looking to
>> start a discussion around the specific strategy employed to compute the
>> elements.
>>
>> Below is the code:
>> moon_data < structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L, 11L,
>> 12L, 9L, 10L, 4L, 6L, 3L), .Label = c("Ceres ", "Earth",
>> "Eris ",
>>
>> "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ", "Neptune ",
>>
>> "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
>> Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2, 9.54, 19.22,
>> 30.06, 39.5, 43.35, 45.8, 67.7), Diameter = c(0.382, 0.949,
>>
>> 1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
>>
>> 0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e04, 317.8,
>>
>> 95.2, 14.6, 17.2, 0.0022, 7e04, 7e04, 0.0025), Moons = c(0L,
>>
>>
>> 0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L), Volume =
>> c(0.0291869497930152,
>>
>>
>>
>> 0.447504348276571, 0.523598775598299, 0.0788376225681443,
>>
>>
>>
>> 0.000268082573106329, 737.393372232996, 441.729261571372,
>>
>>
>>
>> 33.6865588825666, 30.6549628355953, 0.00305362805928928,
>>
>>
>>
>> 0.00176714586764426, 0.00090477868423386, 0.00359136400182873
>>
>>
>> )), row.names = c(NA, 13L), class = "data.frame")
>>
>> fit < glm.nb(Moons ~ Volume, data = moon_data)
>> rstudent(fit)
>>
>> fit2 < update(fit, subset = Name != "Jupiter ")
>> rstudent(fit2)
>>
>> influence(fit2)$sigma
>>
>> # 1 2 3 4 5 7 8 9
>> 10 11 12 13
>> # 1.077945 1.077813 1.165025 1.181685 1.077954 NaN 1.044454 1.152110
>> 1.187586 1.181696 1.077954 1.165147
>>
>> Sincerely,
>> Eric
>>
>> On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter < [hidden email]>
>> wrote:
>>
>>> Also, I suggest you read ?influence which may explain the source of
>>> your NaN's .
>>>
>>> Bert Gunter
>>>
>>> "The trouble with having an open mind is that people keep coming along
>>> and sticking things into it."
>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>
>>>
>>> On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter < [hidden email]>
>> wrote:
>>>
>>>> I told you already: **Include code inline **
>>>>
>>>> See ?dput for how to include a text version of objects, such as data
>>>> frames, inline.
>>>>
>>>> Otherwise, I believe .txt text files are not stripped if you insist
>>>> on
>>>> *attaching* data or code. Others may have better advice.
>>>>
>>>>
>>>> Bert Gunter
>>>>
>>>> "The trouble with having an open mind is that people keep coming
>>>> along and sticking things into it."
>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>
>>>>
>>>> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford < [hidden email]>
>>>> wrote:
>>>>
>>>>> How can I add attachments? The following two files were attached in
>>>>> the initial message
>>>>>
>>>>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter < [hidden email]>
>>>>> wrote:
>>>>>
>>>>>> Nothing was attached. The rhelp server strips most attachments.
>>>>>> Include your code inline.
>>>>>>
>>>>>> Also note that
>>>>>>
>>>>>>> 0/0
>>>>>> [1] NaN
>>>>>>
>>>>>> so maybe something like that occurs in the course of your calculations.
>>>>>> But that's just a guess, so feel free to disregard.
>>>>>>
>>>>>>
>>>>>> Bert Gunter
>>>>>>
>>>>>> "The trouble with having an open mind is that people keep coming
>>>>>> along and sticking things into it."
>>>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford
>>>>>> < [hidden email]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi R core team,
>>>>>>>
>>>>>>> I experienced the following issue with the attached data/code
>>>>>>> snippet, where the studentized residual for a single observation
>>>>>>> appears to be NaN given finite predictors/responses, which appears
>>>>>>> to be driven by the glm.influence method in the stats package. I
>>>>>>> am curious to whether this is a consequence of the specific
>>>>>>> implementation used for computing the influence, which it would
>>>>>>> appear is the driving force for the NaN influence for the point,
>>>>>>> that I was ultimately able to trace back through the lm.influence
>>>>>>> method to this specific line <
>>>>>>> https://github.com/SurajGupta/r>> source/blob/a28e609e72ed7c47f6ddfb
>>>>>>> b86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67
>>>>>>>>
>>>>>>> which
>>>>>>> calls C code which calls iminfl.f
>>>>>>> <
>>>>>>> https://github.com/SurajGupta/rsource/blob/master/src/library/sta>>>>>>> ts/src/lminfl.f
>>>>>>>>
>>>>>>> (I
>>>>>>> don't know fortran so I can't debug further). My understanding is
>>>>>>> that the specific issue would have to do with the leaveoneout
>>>>>>> variance estimate associated with this particular point, which it
>>>>>>> seems based on my understanding should be finite given finite
>>>>>>> predictors/responses. Let me know. Thanks!
>>>>>>>
>>>>>>> Sincerely,
>>>>>>>
>>>>>>> 
>>>>>>> Eric Bridgeford
>>>>>>> ericwb.me
>>>>>>> ______________________________________________
>>>>>>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/rhelp>>>>>>> PLEASE do read the posting guide
>>>>>>> http://www.Rproject.org/postingguide.html>>>>>>> and provide commented, minimal, selfcontained, reproducible code.
>>>>>>>
>>>>>>
>>>>>
>>>>> 
>>>>> Eric Bridgeford
>>>>> ericwb.me
>>>>>
>>>>
>>
>> 
>> Eric Bridgeford
>> ericwb.me
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/rhelp>> PLEASE do read the posting guide http://www.Rproject.org/posting>> guide.html
>> and provide commented, minimal, selfcontained, reproducible code.
>
> ______________________________________________
> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/rhelp> PLEASE do read the posting guide http://www.Rproject.org/postingguide.html> and provide commented, minimal, selfcontained, reproducible code.

Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email] Priv: [hidden email]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Hi Peter,
Yes, that's another reflection of the degree to which Jupiter and Saturn are out of line with the data for the other planet when you fit the very unreasonable negative binomial model with Volume untransformed.
Best,
John
> On Apr 3, 2019, at 5:36 AM, peter dalgaard < [hidden email]> wrote:
>
> Yes, also notice that
>
>> predict(fit3, new=moon_data, type="resp")
> 1 2 3 4 5 6
> 1.060694e+00 1.102008e+00 1.109695e+00 1.065515e+00 1.057896e+00 1.892312e+29
> 7 8 9 10 11 12
> 3.531271e+17 2.295015e+01 1.739889e+01 1.058165e+00 1.058041e+00 1.057957e+00
> 13
> 1.058217e+00
>
>
> so the model of fit3 predicts that Jupiter and Saturn should have several bazillions of moons each!
>
> pd
>
>
>
>> On 3 Apr 2019, at 01:53 , Fox, John < [hidden email]> wrote:
>>
>> Dear Eric,
>>
>> Have you looked at your data?  for example:
>>
>> plot(log(Moons) ~ Volume, data = moon_data)
>> text(log(Moons) ~ Volume, data = moon_data, labels=Name, adj=1, subset = Volume > 400)
>>
>> The negativebinomial model doesn't look reasonable, does it?
>>
>> After you eliminate Jupiter there's one very high leverage point left, Saturn. Computing studentized residuals entails an approximation to deleting that as well from the model, so try fitting
>>
>> fit3 < update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
>> summary(fit3)
>>
>> which runs into numeric difficulties.
>>
>> Then look at:
>>
>> plot(log(Moons) ~ Volume, data = moon_data, subset = Volume < 400)
>>
>> Finally, try
>>
>> plot(log(Moons) ~ log(Volume), data = moon_data)
>> fit4 < update(fit2, . ~ log(Volume))
>> rstudent(fit4)
>>
>> I hope this helps,
>> John
>>
>> 
>> John Fox
>> Professor Emeritus
>> McMaster University
>> Hamilton, Ontario, Canada
>> Web: https://socialsciences.mcmaster.ca/jfox/>>
>>
>>
>>
>>> Original Message
>>> From: Rhelp [mailto: [hidden email]] On Behalf Of Eric
>>> Bridgeford
>>> Sent: Tuesday, April 2, 2019 5:01 PM
>>> To: Bert Gunter < [hidden email]>
>>> Cc: Rhelp < [hidden email]>
>>> Subject: Re: [R] Fwd: Potential Issue with lm.influence
>>>
>>> I agree the influence documentation suggests NaNs may result; however, as
>>> these can be manually computed and are, indeed, finite/existing (ie,
>>> computing the heldout influence by manually training n models for n points
>>> to obtain n leave one out influence measures), I don't possibly see how the
>>> function SHOULD return NaN, and given that it is returning NaN, that
>>> suggests to me that there should be either a) Providing an alternative
>>> method to compute them that (may be slower) that returns the correct
>>> results in the even that lm.influence does not return a good approximation
>>> (ie, a command line argument for type="approx" that does the
>>> approximation strategy employed currently, or an alternative type="direct"
>>> or something like that that computes them manually), or b) a heuristic to
>>> suggest why NaNs might result from one's particular inputs/what can be
>>> done to fix it (if the approximation strategy is the source of the problem) or
>>> what the issue is with the data that will cause NaNs. Hence I was looking to
>>> start a discussion around the specific strategy employed to compute the
>>> elements.
>>>
>>> Below is the code:
>>> moon_data < structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L, 11L,
>>> 12L, 9L, 10L, 4L, 6L, 3L), .Label = c("Ceres ", "Earth",
>>> "Eris ",
>>>
>>> "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ", "Neptune ",
>>>
>>> "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
>>> Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2, 9.54, 19.22,
>>> 30.06, 39.5, 43.35, 45.8, 67.7), Diameter = c(0.382, 0.949,
>>>
>>> 1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
>>>
>>> 0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e04, 317.8,
>>>
>>> 95.2, 14.6, 17.2, 0.0022, 7e04, 7e04, 0.0025), Moons = c(0L,
>>>
>>>
>>> 0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L), Volume =
>>> c(0.0291869497930152,
>>>
>>>
>>>
>>> 0.447504348276571, 0.523598775598299, 0.0788376225681443,
>>>
>>>
>>>
>>> 0.000268082573106329, 737.393372232996, 441.729261571372,
>>>
>>>
>>>
>>> 33.6865588825666, 30.6549628355953, 0.00305362805928928,
>>>
>>>
>>>
>>> 0.00176714586764426, 0.00090477868423386, 0.00359136400182873
>>>
>>>
>>> )), row.names = c(NA, 13L), class = "data.frame")
>>>
>>> fit < glm.nb(Moons ~ Volume, data = moon_data)
>>> rstudent(fit)
>>>
>>> fit2 < update(fit, subset = Name != "Jupiter ")
>>> rstudent(fit2)
>>>
>>> influence(fit2)$sigma
>>>
>>> # 1 2 3 4 5 7 8 9
>>> 10 11 12 13
>>> # 1.077945 1.077813 1.165025 1.181685 1.077954 NaN 1.044454 1.152110
>>> 1.187586 1.181696 1.077954 1.165147
>>>
>>> Sincerely,
>>> Eric
>>>
>>> On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter < [hidden email]>
>>> wrote:
>>>
>>>> Also, I suggest you read ?influence which may explain the source of
>>>> your NaN's .
>>>>
>>>> Bert Gunter
>>>>
>>>> "The trouble with having an open mind is that people keep coming along
>>>> and sticking things into it."
>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>
>>>>
>>>> On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter < [hidden email]>
>>> wrote:
>>>>
>>>>> I told you already: **Include code inline **
>>>>>
>>>>> See ?dput for how to include a text version of objects, such as data
>>>>> frames, inline.
>>>>>
>>>>> Otherwise, I believe .txt text files are not stripped if you insist
>>>>> on
>>>>> *attaching* data or code. Others may have better advice.
>>>>>
>>>>>
>>>>> Bert Gunter
>>>>>
>>>>> "The trouble with having an open mind is that people keep coming
>>>>> along and sticking things into it."
>>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>>
>>>>>
>>>>> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford < [hidden email]>
>>>>> wrote:
>>>>>
>>>>>> How can I add attachments? The following two files were attached in
>>>>>> the initial message
>>>>>>
>>>>>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter < [hidden email]>
>>>>>> wrote:
>>>>>>
>>>>>>> Nothing was attached. The rhelp server strips most attachments.
>>>>>>> Include your code inline.
>>>>>>>
>>>>>>> Also note that
>>>>>>>
>>>>>>>> 0/0
>>>>>>> [1] NaN
>>>>>>>
>>>>>>> so maybe something like that occurs in the course of your calculations.
>>>>>>> But that's just a guess, so feel free to disregard.
>>>>>>>
>>>>>>>
>>>>>>> Bert Gunter
>>>>>>>
>>>>>>> "The trouble with having an open mind is that people keep coming
>>>>>>> along and sticking things into it."
>>>>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford
>>>>>>> < [hidden email]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi R core team,
>>>>>>>>
>>>>>>>> I experienced the following issue with the attached data/code
>>>>>>>> snippet, where the studentized residual for a single observation
>>>>>>>> appears to be NaN given finite predictors/responses, which appears
>>>>>>>> to be driven by the glm.influence method in the stats package. I
>>>>>>>> am curious to whether this is a consequence of the specific
>>>>>>>> implementation used for computing the influence, which it would
>>>>>>>> appear is the driving force for the NaN influence for the point,
>>>>>>>> that I was ultimately able to trace back through the lm.influence
>>>>>>>> method to this specific line <
>>>>>>>> https://github.com/SurajGupta/r>>> source/blob/a28e609e72ed7c47f6ddfb
>>>>>>>> b86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67
>>>>>>>>>
>>>>>>>> which
>>>>>>>> calls C code which calls iminfl.f
>>>>>>>> <
>>>>>>>> https://github.com/SurajGupta/rsource/blob/master/src/library/sta>>>>>>>> ts/src/lminfl.f
>>>>>>>>>
>>>>>>>> (I
>>>>>>>> don't know fortran so I can't debug further). My understanding is
>>>>>>>> that the specific issue would have to do with the leaveoneout
>>>>>>>> variance estimate associated with this particular point, which it
>>>>>>>> seems based on my understanding should be finite given finite
>>>>>>>> predictors/responses. Let me know. Thanks!
>>>>>>>>
>>>>>>>> Sincerely,
>>>>>>>>
>>>>>>>> 
>>>>>>>> Eric Bridgeford
>>>>>>>> ericwb.me
>>>>>>>> ______________________________________________
>>>>>>>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/rhelp>>>>>>>> PLEASE do read the posting guide
>>>>>>>> http://www.Rproject.org/postingguide.html>>>>>>>> and provide commented, minimal, selfcontained, reproducible code.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> 
>>>>>> Eric Bridgeford
>>>>>> ericwb.me
>>>>>>
>>>>>
>>>
>>> 
>>> Eric Bridgeford
>>> ericwb.me
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/rhelp>>> PLEASE do read the posting guide http://www.Rproject.org/posting>>> guide.html
>>> and provide commented, minimal, selfcontained, reproducible code.
>>
>> ______________________________________________
>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/rhelp>> PLEASE do read the posting guide http://www.Rproject.org/postingguide.html>> and provide commented, minimal, selfcontained, reproducible code.
>
> 
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Office: A 4.23
> Email: [hidden email] Priv: [hidden email]
>
>
>
>
>
>
>
>
>
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Hey guys,
I appreciate the replies.
I agree the issue is easy to catch; wouldn't it make sense to make a
warning given that these types of errors (I am sure there are other ways to
make the lm.influence have similar NaN performance, simply due to points
radically not fitting the data) are relatively easy to forecast? Seems like
the output is just a bit vague from lm.influence.
Sincerely,
Eric
On Wed, Apr 3, 2019 at 10:03 AM Fox, John < [hidden email]> wrote:
> Hi Peter,
>
> Yes, that's another reflection of the degree to which Jupiter and Saturn
> are out of line with the data for the other planet when you fit the very
> unreasonable negative binomial model with Volume untransformed.
>
> Best,
> John
>
> > On Apr 3, 2019, at 5:36 AM, peter dalgaard < [hidden email]> wrote:
> >
> > Yes, also notice that
> >
> >> predict(fit3, new=moon_data, type="resp")
> > 1 2 3 4 5
> 6
> > 1.060694e+00 1.102008e+00 1.109695e+00 1.065515e+00 1.057896e+00
> 1.892312e+29
> > 7 8 9 10 11
> 12
> > 3.531271e+17 2.295015e+01 1.739889e+01 1.058165e+00 1.058041e+00
> 1.057957e+00
> > 13
> > 1.058217e+00
> >
> >
> > so the model of fit3 predicts that Jupiter and Saturn should have
> several bazillions of moons each!
> >
> > pd
> >
> >
> >
> >> On 3 Apr 2019, at 01:53 , Fox, John < [hidden email]> wrote:
> >>
> >> Dear Eric,
> >>
> >> Have you looked at your data?  for example:
> >>
> >> plot(log(Moons) ~ Volume, data = moon_data)
> >> text(log(Moons) ~ Volume, data = moon_data, labels=Name, adj=1,
> subset = Volume > 400)
> >>
> >> The negativebinomial model doesn't look reasonable, does it?
> >>
> >> After you eliminate Jupiter there's one very high leverage point left,
> Saturn. Computing studentized residuals entails an approximation to
> deleting that as well from the model, so try fitting
> >>
> >> fit3 < update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
> >> summary(fit3)
> >>
> >> which runs into numeric difficulties.
> >>
> >> Then look at:
> >>
> >> plot(log(Moons) ~ Volume, data = moon_data, subset = Volume < 400)
> >>
> >> Finally, try
> >>
> >> plot(log(Moons) ~ log(Volume), data = moon_data)
> >> fit4 < update(fit2, . ~ log(Volume))
> >> rstudent(fit4)
> >>
> >> I hope this helps,
> >> John
> >>
> >> 
> >> John Fox
> >> Professor Emeritus
> >> McMaster University
> >> Hamilton, Ontario, Canada
> >> Web: https://socialsciences.mcmaster.ca/jfox/> >>
> >>
> >>
> >>
> >>> Original Message
> >>> From: Rhelp [mailto: [hidden email]] On Behalf Of Eric
> >>> Bridgeford
> >>> Sent: Tuesday, April 2, 2019 5:01 PM
> >>> To: Bert Gunter < [hidden email]>
> >>> Cc: Rhelp < [hidden email]>
> >>> Subject: Re: [R] Fwd: Potential Issue with lm.influence
> >>>
> >>> I agree the influence documentation suggests NaNs may result; however,
> as
> >>> these can be manually computed and are, indeed, finite/existing (ie,
> >>> computing the heldout influence by manually training n models for n
> points
> >>> to obtain n leave one out influence measures), I don't possibly see
> how the
> >>> function SHOULD return NaN, and given that it is returning NaN, that
> >>> suggests to me that there should be either a) Providing an alternative
> >>> method to compute them that (may be slower) that returns the correct
> >>> results in the even that lm.influence does not return a good
> approximation
> >>> (ie, a command line argument for type="approx" that does the
> >>> approximation strategy employed currently, or an alternative
> type="direct"
> >>> or something like that that computes them manually), or b) a heuristic
> to
> >>> suggest why NaNs might result from one's particular inputs/what can be
> >>> done to fix it (if the approximation strategy is the source of the
> problem) or
> >>> what the issue is with the data that will cause NaNs. Hence I was
> looking to
> >>> start a discussion around the specific strategy employed to compute the
> >>> elements.
> >>>
> >>> Below is the code:
> >>> moon_data < structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L,
> 5L, 11L,
> >>> 12L, 9L, 10L, 4L, 6L,
> 3L), .Label = c("Ceres ", "Earth",
> >>> "Eris ",
> >>>
> >>> "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ",
> "Neptune ",
> >>>
> >>> "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
> >>> Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2,
> 9.54, 19.22,
> >>> 30.06, 39.5, 43.35, 45.8,
> 67.7), Diameter = c(0.382, 0.949,
> >>>
> >>> 1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
> >>>
> >>> 0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e04, 317.8,
> >>>
> >>> 95.2, 14.6, 17.2, 0.0022, 7e04, 7e04,
> 0.0025), Moons = c(0L,
> >>>
> >>>
> >>> 0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L),
> Volume =
> >>> c(0.0291869497930152,
> >>>
> >>>
> >>>
> >>> 0.447504348276571, 0.523598775598299, 0.0788376225681443,
> >>>
> >>>
> >>>
> >>> 0.000268082573106329, 737.393372232996, 441.729261571372,
> >>>
> >>>
> >>>
> >>> 33.6865588825666, 30.6549628355953, 0.00305362805928928,
> >>>
> >>>
> >>>
> >>> 0.00176714586764426, 0.00090477868423386, 0.00359136400182873
> >>>
> >>>
> >>> )), row.names = c(NA, 13L), class = "data.frame")
> >>>
> >>> fit < glm.nb(Moons ~ Volume, data = moon_data)
> >>> rstudent(fit)
> >>>
> >>> fit2 < update(fit, subset = Name != "Jupiter ")
> >>> rstudent(fit2)
> >>>
> >>> influence(fit2)$sigma
> >>>
> >>> # 1 2 3 4 5 7 8
> 9
> >>> 10 11 12 13
> >>> # 1.077945 1.077813 1.165025 1.181685 1.077954 NaN 1.044454
> 1.152110
> >>> 1.187586 1.181696 1.077954 1.165147
> >>>
> >>> Sincerely,
> >>> Eric
> >>>
> >>> On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter < [hidden email]>
> >>> wrote:
> >>>
> >>>> Also, I suggest you read ?influence which may explain the source of
> >>>> your NaN's .
> >>>>
> >>>> Bert Gunter
> >>>>
> >>>> "The trouble with having an open mind is that people keep coming along
> >>>> and sticking things into it."
> >>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>>>
> >>>>
> >>>> On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter < [hidden email]>
> >>> wrote:
> >>>>
> >>>>> I told you already: **Include code inline **
> >>>>>
> >>>>> See ?dput for how to include a text version of objects, such as data
> >>>>> frames, inline.
> >>>>>
> >>>>> Otherwise, I believe .txt text files are not stripped if you insist
> >>>>> on
> >>>>> *attaching* data or code. Others may have better advice.
> >>>>>
> >>>>>
> >>>>> Bert Gunter
> >>>>>
> >>>>> "The trouble with having an open mind is that people keep coming
> >>>>> along and sticking things into it."
> >>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>>>>
> >>>>>
> >>>>> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford < [hidden email]>
> >>>>> wrote:
> >>>>>
> >>>>>> How can I add attachments? The following two files were attached in
> >>>>>> the initial message
> >>>>>>
> >>>>>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter < [hidden email]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Nothing was attached. The rhelp server strips most attachments.
> >>>>>>> Include your code inline.
> >>>>>>>
> >>>>>>> Also note that
> >>>>>>>
> >>>>>>>> 0/0
> >>>>>>> [1] NaN
> >>>>>>>
> >>>>>>> so maybe something like that occurs in the course of your
> calculations.
> >>>>>>> But that's just a guess, so feel free to disregard.
> >>>>>>>
> >>>>>>>
> >>>>>>> Bert Gunter
> >>>>>>>
> >>>>>>> "The trouble with having an open mind is that people keep coming
> >>>>>>> along and sticking things into it."
> >>>>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford
> >>>>>>> < [hidden email]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi R core team,
> >>>>>>>>
> >>>>>>>> I experienced the following issue with the attached data/code
> >>>>>>>> snippet, where the studentized residual for a single observation
> >>>>>>>> appears to be NaN given finite predictors/responses, which appears
> >>>>>>>> to be driven by the glm.influence method in the stats package. I
> >>>>>>>> am curious to whether this is a consequence of the specific
> >>>>>>>> implementation used for computing the influence, which it would
> >>>>>>>> appear is the driving force for the NaN influence for the point,
> >>>>>>>> that I was ultimately able to trace back through the lm.influence
> >>>>>>>> method to this specific line <
> >>>>>>>> https://github.com/SurajGupta/r> >>> source/blob/a28e609e72ed7c47f6ddfb
> >>>>>>>> b86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67
> >>>>>>>>>
> >>>>>>>> which
> >>>>>>>> calls C code which calls iminfl.f
> >>>>>>>> <
> >>>>>>>>
> https://github.com/SurajGupta/rsource/blob/master/src/library/sta> >>>>>>>> ts/src/lminfl.f
> >>>>>>>>>
> >>>>>>>> (I
> >>>>>>>> don't know fortran so I can't debug further). My understanding is
> >>>>>>>> that the specific issue would have to do with the leaveoneout
> >>>>>>>> variance estimate associated with this particular point, which it
> >>>>>>>> seems based on my understanding should be finite given finite
> >>>>>>>> predictors/responses. Let me know. Thanks!
> >>>>>>>>
> >>>>>>>> Sincerely,
> >>>>>>>>
> >>>>>>>> 
> >>>>>>>> Eric Bridgeford
> >>>>>>>> ericwb.me
> >>>>>>>> ______________________________________________
> >>>>>>>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> >>>>>>>> https://stat.ethz.ch/mailman/listinfo/rhelp> >>>>>>>> PLEASE do read the posting guide
> >>>>>>>> http://www.Rproject.org/postingguide.html> >>>>>>>> and provide commented, minimal, selfcontained, reproducible code.
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> 
> >>>>>> Eric Bridgeford
> >>>>>> ericwb.me
> >>>>>>
> >>>>>
> >>>
> >>> 
> >>> Eric Bridgeford
> >>> ericwb.me
> >>>
> >>> [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/rhelp> >>> PLEASE do read the posting guide http://www.Rproject.org/posting> >>> guide.html
> >>> and provide commented, minimal, selfcontained, reproducible code.
> >>
> >> ______________________________________________
> >> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/rhelp> >> PLEASE do read the posting guide
> http://www.Rproject.org/postingguide.html> >> and provide commented, minimal, selfcontained, reproducible code.
> >
> > 
> > Peter Dalgaard, Professor,
> > Center for Statistics, Copenhagen Business School
> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> > Phone: (+45)38153501
> > Office: A 4.23
> > Email: [hidden email] Priv: [hidden email]
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>

Eric Bridgeford
ericwb.me
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Dear Eric,
I'm afraid that your argument doesn't make sense to me. As you saw when you tried
fit3 < update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
glm.nb() effectively wasn't able to estimate the theta parameter of the negative binomial model. So why would it be better to base deletion diagnostics on actually refitting the model?
The lesson to me here is that if you fit a sufficiently unreasonable model to data, the computations may break down. Other than drawing attention to the NaN with an explicit warning, I don't see what more could usefully be done.
Best,
John
> On Apr 2, 2019, at 9:08 PM, Eric Bridgeford < [hidden email]> wrote:
>
> Hey John,
>
> I am aware they are high leverage points, and that the model is not the
> best for them. The purpose of this dataset was to explore high leverage
> points, and diagnostic statistics through which one would identify them.
>
> What I am saying is that the current behavior of the function seems a
> little nonspecific to me; the influence for this problem is
> finite/computable manually by fitting n models to n1 points (manually
> holding out each point individually to obtain the loovariance, and
> computing the influence in the nonapproximate way).
>
> I am just suggesting that it seems the function could be improved by, say,
> throwing specific warnings when NaNs may arise. Ie, "Your have points that
> are very high leverage. The approximation technique is not numerically
> stable for these points and the results should be used with caution"
> etc...; I am sure there are other also prehoc approaches to diagnose other
> ways in which this function could fail). The approximation technique not
> behaving well for points that are ultra high leverage just seems peculiar
> that that would return an NaN with no other recommendations/advice/specific
> warnings, especially since the influence is frequently used to diagnosing
> this specific issue.
>
> Alternatively, one could afford an optional argument type="manual" that
> computes the heldout variance manually rather than the approximate
> fashion, and add a comment to use this in the help menu when you have high
> leverage points (this is what I ended up doing to obtain the true influence
> and the externally studentized residual).
>
> I just think some more specificity could be of use for future users, to
> make the R:stats community even better :) Does that make sense?
>
> Sincerely,
> Eric
>
> On Tue, Apr 2, 2019 at 7:53 PM Fox, John < [hidden email]> wrote:
>
>> Dear Eric,
>>
>> Have you looked at your data?  for example:
>>
>> plot(log(Moons) ~ Volume, data = moon_data)
>> text(log(Moons) ~ Volume, data = moon_data, labels=Name, adj=1,
>> subset = Volume > 400)
>>
>> The negativebinomial model doesn't look reasonable, does it?
>>
>> After you eliminate Jupiter there's one very high leverage point left,
>> Saturn. Computing studentized residuals entails an approximation to
>> deleting that as well from the model, so try fitting
>>
>> fit3 < update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
>> summary(fit3)
>>
>> which runs into numeric difficulties.
>>
>> Then look at:
>>
>> plot(log(Moons) ~ Volume, data = moon_data, subset = Volume < 400)
>>
>> Finally, try
>>
>> plot(log(Moons) ~ log(Volume), data = moon_data)
>> fit4 < update(fit2, . ~ log(Volume))
>> rstudent(fit4)
>>
>> I hope this helps,
>> John
>>
>> 
>> John Fox
>> Professor Emeritus
>> McMaster University
>> Hamilton, Ontario, Canada
>> Web: https://socialsciences.mcmaster.ca/jfox/>>
>>
>>
>>
>>> Original Message
>>> From: Rhelp [mailto: [hidden email]] On Behalf Of Eric
>>> Bridgeford
>>> Sent: Tuesday, April 2, 2019 5:01 PM
>>> To: Bert Gunter < [hidden email]>
>>> Cc: Rhelp < [hidden email]>
>>> Subject: Re: [R] Fwd: Potential Issue with lm.influence
>>>
>>> I agree the influence documentation suggests NaNs may result; however, as
>>> these can be manually computed and are, indeed, finite/existing (ie,
>>> computing the heldout influence by manually training n models for n
>> points
>>> to obtain n leave one out influence measures), I don't possibly see how
>> the
>>> function SHOULD return NaN, and given that it is returning NaN, that
>>> suggests to me that there should be either a) Providing an alternative
>>> method to compute them that (may be slower) that returns the correct
>>> results in the even that lm.influence does not return a good
>> approximation
>>> (ie, a command line argument for type="approx" that does the
>>> approximation strategy employed currently, or an alternative
>> type="direct"
>>> or something like that that computes them manually), or b) a heuristic to
>>> suggest why NaNs might result from one's particular inputs/what can be
>>> done to fix it (if the approximation strategy is the source of the
>> problem) or
>>> what the issue is with the data that will cause NaNs. Hence I was
>> looking to
>>> start a discussion around the specific strategy employed to compute the
>>> elements.
>>>
>>> Below is the code:
>>> moon_data < structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L,
>> 11L,
>>> 12L, 9L, 10L, 4L, 6L,
>> 3L), .Label = c("Ceres ", "Earth",
>>> "Eris ",
>>>
>>> "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ",
>> "Neptune ",
>>>
>>> "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
>>> Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2,
>> 9.54, 19.22,
>>> 30.06, 39.5, 43.35, 45.8,
>> 67.7), Diameter = c(0.382, 0.949,
>>>
>>> 1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
>>>
>>> 0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e04, 317.8,
>>>
>>> 95.2, 14.6, 17.2, 0.0022, 7e04, 7e04,
>> 0.0025), Moons = c(0L,
>>>
>>>
>>> 0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L),
>> Volume =
>>> c(0.0291869497930152,
>>>
>>>
>>>
>>> 0.447504348276571, 0.523598775598299, 0.0788376225681443,
>>>
>>>
>>>
>>> 0.000268082573106329, 737.393372232996, 441.729261571372,
>>>
>>>
>>>
>>> 33.6865588825666, 30.6549628355953, 0.00305362805928928,
>>>
>>>
>>>
>>> 0.00176714586764426, 0.00090477868423386, 0.00359136400182873
>>>
>>>
>>> )), row.names = c(NA, 13L), class = "data.frame")
>>>
>>> fit < glm.nb(Moons ~ Volume, data = moon_data)
>>> rstudent(fit)
>>>
>>> fit2 < update(fit, subset = Name != "Jupiter ")
>>> rstudent(fit2)
>>>
>>> influence(fit2)$sigma
>>>
>>> # 1 2 3 4 5 7 8 9
>>> 10 11 12 13
>>> # 1.077945 1.077813 1.165025 1.181685 1.077954 NaN 1.044454 1.152110
>>> 1.187586 1.181696 1.077954 1.165147
>>>
>>> Sincerely,
>>> Eric
>>>
>>> On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter < [hidden email]>
>>> wrote:
>>>
>>>> Also, I suggest you read ?influence which may explain the source of
>>>> your NaN's .
>>>>
>>>> Bert Gunter
>>>>
>>>> "The trouble with having an open mind is that people keep coming along
>>>> and sticking things into it."
>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>
>>>>
>>>> On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter < [hidden email]>
>>> wrote:
>>>>
>>>>> I told you already: **Include code inline **
>>>>>
>>>>> See ?dput for how to include a text version of objects, such as data
>>>>> frames, inline.
>>>>>
>>>>> Otherwise, I believe .txt text files are not stripped if you insist
>>>>> on
>>>>> *attaching* data or code. Others may have better advice.
>>>>>
>>>>>
>>>>> Bert Gunter
>>>>>
>>>>> "The trouble with having an open mind is that people keep coming
>>>>> along and sticking things into it."
>>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>>
>>>>>
>>>>> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford < [hidden email]>
>>>>> wrote:
>>>>>
>>>>>> How can I add attachments? The following two files were attached in
>>>>>> the initial message
>>>>>>
>>>>>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter < [hidden email]>
>>>>>> wrote:
>>>>>>
>>>>>>> Nothing was attached. The rhelp server strips most attachments.
>>>>>>> Include your code inline.
>>>>>>>
>>>>>>> Also note that
>>>>>>>
>>>>>>>> 0/0
>>>>>>> [1] NaN
>>>>>>>
>>>>>>> so maybe something like that occurs in the course of your
>> calculations.
>>>>>>> But that's just a guess, so feel free to disregard.
>>>>>>>
>>>>>>>
>>>>>>> Bert Gunter
>>>>>>>
>>>>>>> "The trouble with having an open mind is that people keep coming
>>>>>>> along and sticking things into it."
>>>>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford
>>>>>>> < [hidden email]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi R core team,
>>>>>>>>
>>>>>>>> I experienced the following issue with the attached data/code
>>>>>>>> snippet, where the studentized residual for a single observation
>>>>>>>> appears to be NaN given finite predictors/responses, which appears
>>>>>>>> to be driven by the glm.influence method in the stats package. I
>>>>>>>> am curious to whether this is a consequence of the specific
>>>>>>>> implementation used for computing the influence, which it would
>>>>>>>> appear is the driving force for the NaN influence for the point,
>>>>>>>> that I was ultimately able to trace back through the lm.influence
>>>>>>>> method to this specific line <
>>>>>>>> https://github.com/SurajGupta/r>>> source/blob/a28e609e72ed7c47f6ddfb
>>>>>>>> b86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67
>>>>>>>>>
>>>>>>>> which
>>>>>>>> calls C code which calls iminfl.f
>>>>>>>> <
>>>>>>>> https://github.com/SurajGupta/rsource/blob/master/src/library/sta>>>>>>>> ts/src/lminfl.f
>>>>>>>>>
>>>>>>>> (I
>>>>>>>> don't know fortran so I can't debug further). My understanding is
>>>>>>>> that the specific issue would have to do with the leaveoneout
>>>>>>>> variance estimate associated with this particular point, which it
>>>>>>>> seems based on my understanding should be finite given finite
>>>>>>>> predictors/responses. Let me know. Thanks!
>>>>>>>>
>>>>>>>> Sincerely,
>>>>>>>>
>>>>>>>> 
>>>>>>>> Eric Bridgeford
>>>>>>>> ericwb.me
>>>>>>>> ______________________________________________
>>>>>>>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/rhelp>>>>>>>> PLEASE do read the posting guide
>>>>>>>> http://www.Rproject.org/postingguide.html>>>>>>>> and provide commented, minimal, selfcontained, reproducible code.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> 
>>>>>> Eric Bridgeford
>>>>>> ericwb.me
>>>>>>
>>>>>
>>>
>>> 
>>> Eric Bridgeford
>>> ericwb.me
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/rhelp>>> PLEASE do read the posting guide http://www.Rproject.org/posting>>> guide.html
>>> and provide commented, minimal, selfcontained, reproducible code.
>>
>
>
> 
> Eric Bridgeford
> ericwb.me
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/rhelp> PLEASE do read the posting guide http://www.Rproject.org/postingguide.html> and provide commented, minimal, selfcontained, reproducible code.
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Hey John,
Seems fair, and, I agree a more explicit or clear (ie, giving users
indications as to why/when the lm.influence is going to misfit the data)
warning makes sense in context.
Sincerely,
Eric
On Wed, Apr 3, 2019 at 10:18 AM Fox, John < [hidden email]> wrote:
> Dear Eric,
>
> I'm afraid that your argument doesn't make sense to me. As you saw when
> you tried
>
> fit3 < update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
>
> glm.nb() effectively wasn't able to estimate the theta parameter of the
> negative binomial model. So why would it be better to base deletion
> diagnostics on actually refitting the model?
>
> The lesson to me here is that if you fit a sufficiently unreasonable model
> to data, the computations may break down. Other than drawing attention to
> the NaN with an explicit warning, I don't see what more could usefully be
> done.
>
> Best,
> John
>
> > On Apr 2, 2019, at 9:08 PM, Eric Bridgeford < [hidden email]> wrote:
> >
> > Hey John,
> >
> > I am aware they are high leverage points, and that the model is not the
> > best for them. The purpose of this dataset was to explore high leverage
> > points, and diagnostic statistics through which one would identify them.
> >
> > What I am saying is that the current behavior of the function seems a
> > little nonspecific to me; the influence for this problem is
> > finite/computable manually by fitting n models to n1 points (manually
> > holding out each point individually to obtain the loovariance, and
> > computing the influence in the nonapproximate way).
> >
> > I am just suggesting that it seems the function could be improved by,
> say,
> > throwing specific warnings when NaNs may arise. Ie, "Your have points
> that
> > are very high leverage. The approximation technique is not numerically
> > stable for these points and the results should be used with caution"
> > etc...; I am sure there are other also prehoc approaches to diagnose
> other
> > ways in which this function could fail). The approximation technique not
> > behaving well for points that are ultra high leverage just seems peculiar
> > that that would return an NaN with no other
> recommendations/advice/specific
> > warnings, especially since the influence is frequently used to diagnosing
> > this specific issue.
> >
> > Alternatively, one could afford an optional argument type="manual" that
> > computes the heldout variance manually rather than the approximate
> > fashion, and add a comment to use this in the help menu when you have
> high
> > leverage points (this is what I ended up doing to obtain the true
> influence
> > and the externally studentized residual).
> >
> > I just think some more specificity could be of use for future users, to
> > make the R:stats community even better :) Does that make sense?
> >
> > Sincerely,
> > Eric
> >
> > On Tue, Apr 2, 2019 at 7:53 PM Fox, John < [hidden email]> wrote:
> >
> >> Dear Eric,
> >>
> >> Have you looked at your data?  for example:
> >>
> >> plot(log(Moons) ~ Volume, data = moon_data)
> >> text(log(Moons) ~ Volume, data = moon_data, labels=Name, adj=1,
> >> subset = Volume > 400)
> >>
> >> The negativebinomial model doesn't look reasonable, does it?
> >>
> >> After you eliminate Jupiter there's one very high leverage point left,
> >> Saturn. Computing studentized residuals entails an approximation to
> >> deleting that as well from the model, so try fitting
> >>
> >> fit3 < update(fit, subset = !(Name %in% c("Jupiter ", "Saturn
> ")))
> >> summary(fit3)
> >>
> >> which runs into numeric difficulties.
> >>
> >> Then look at:
> >>
> >> plot(log(Moons) ~ Volume, data = moon_data, subset = Volume <
> 400)
> >>
> >> Finally, try
> >>
> >> plot(log(Moons) ~ log(Volume), data = moon_data)
> >> fit4 < update(fit2, . ~ log(Volume))
> >> rstudent(fit4)
> >>
> >> I hope this helps,
> >> John
> >>
> >> 
> >> John Fox
> >> Professor Emeritus
> >> McMaster University
> >> Hamilton, Ontario, Canada
> >> Web: https://socialsciences.mcmaster.ca/jfox/> >>
> >>
> >>
> >>
> >>> Original Message
> >>> From: Rhelp [mailto: [hidden email]] On Behalf Of Eric
> >>> Bridgeford
> >>> Sent: Tuesday, April 2, 2019 5:01 PM
> >>> To: Bert Gunter < [hidden email]>
> >>> Cc: Rhelp < [hidden email]>
> >>> Subject: Re: [R] Fwd: Potential Issue with lm.influence
> >>>
> >>> I agree the influence documentation suggests NaNs may result; however,
> as
> >>> these can be manually computed and are, indeed, finite/existing (ie,
> >>> computing the heldout influence by manually training n models for n
> >> points
> >>> to obtain n leave one out influence measures), I don't possibly see how
> >> the
> >>> function SHOULD return NaN, and given that it is returning NaN, that
> >>> suggests to me that there should be either a) Providing an alternative
> >>> method to compute them that (may be slower) that returns the correct
> >>> results in the even that lm.influence does not return a good
> >> approximation
> >>> (ie, a command line argument for type="approx" that does the
> >>> approximation strategy employed currently, or an alternative
> >> type="direct"
> >>> or something like that that computes them manually), or b) a heuristic
> to
> >>> suggest why NaNs might result from one's particular inputs/what can be
> >>> done to fix it (if the approximation strategy is the source of the
> >> problem) or
> >>> what the issue is with the data that will cause NaNs. Hence I was
> >> looking to
> >>> start a discussion around the specific strategy employed to compute the
> >>> elements.
> >>>
> >>> Below is the code:
> >>> moon_data < structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L,
> >> 11L,
> >>> 12L, 9L, 10L, 4L, 6L,
> >> 3L), .Label = c("Ceres ", "Earth",
> >>> "Eris ",
> >>>
> >>> "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ",
> >> "Neptune ",
> >>>
> >>> "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
> >>> Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2,
> >> 9.54, 19.22,
> >>> 30.06, 39.5, 43.35, 45.8,
> >> 67.7), Diameter = c(0.382, 0.949,
> >>>
> >>> 1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
> >>>
> >>> 0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e04, 317.8,
> >>>
> >>> 95.2, 14.6, 17.2, 0.0022, 7e04, 7e04,
> >> 0.0025), Moons = c(0L,
> >>>
> >>>
> >>> 0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L),
> >> Volume =
> >>> c(0.0291869497930152,
> >>>
> >>>
> >>>
> >>> 0.447504348276571, 0.523598775598299, 0.0788376225681443,
> >>>
> >>>
> >>>
> >>> 0.000268082573106329, 737.393372232996, 441.729261571372,
> >>>
> >>>
> >>>
> >>> 33.6865588825666, 30.6549628355953, 0.00305362805928928,
> >>>
> >>>
> >>>
> >>> 0.00176714586764426, 0.00090477868423386, 0.00359136400182873
> >>>
> >>>
> >>> )), row.names = c(NA, 13L), class = "data.frame")
> >>>
> >>> fit < glm.nb(Moons ~ Volume, data = moon_data)
> >>> rstudent(fit)
> >>>
> >>> fit2 < update(fit, subset = Name != "Jupiter ")
> >>> rstudent(fit2)
> >>>
> >>> influence(fit2)$sigma
> >>>
> >>> # 1 2 3 4 5 7 8
> 9
> >>> 10 11 12 13
> >>> # 1.077945 1.077813 1.165025 1.181685 1.077954 NaN 1.044454
> 1.152110
> >>> 1.187586 1.181696 1.077954 1.165147
> >>>
> >>> Sincerely,
> >>> Eric
> >>>
> >>> On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter < [hidden email]>
> >>> wrote:
> >>>
> >>>> Also, I suggest you read ?influence which may explain the source of
> >>>> your NaN's .
> >>>>
> >>>> Bert Gunter
> >>>>
> >>>> "The trouble with having an open mind is that people keep coming along
> >>>> and sticking things into it."
> >>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>>>
> >>>>
> >>>> On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter < [hidden email]>
> >>> wrote:
> >>>>
> >>>>> I told you already: **Include code inline **
> >>>>>
> >>>>> See ?dput for how to include a text version of objects, such as data
> >>>>> frames, inline.
> >>>>>
> >>>>> Otherwise, I believe .txt text files are not stripped if you insist
> >>>>> on
> >>>>> *attaching* data or code. Others may have better advice.
> >>>>>
> >>>>>
> >>>>> Bert Gunter
> >>>>>
> >>>>> "The trouble with having an open mind is that people keep coming
> >>>>> along and sticking things into it."
> >>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>>>>
> >>>>>
> >>>>> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford < [hidden email]>
> >>>>> wrote:
> >>>>>
> >>>>>> How can I add attachments? The following two files were attached in
> >>>>>> the initial message
> >>>>>>
> >>>>>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter < [hidden email]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Nothing was attached. The rhelp server strips most attachments.
> >>>>>>> Include your code inline.
> >>>>>>>
> >>>>>>> Also note that
> >>>>>>>
> >>>>>>>> 0/0
> >>>>>>> [1] NaN
> >>>>>>>
> >>>>>>> so maybe something like that occurs in the course of your
> >> calculations.
> >>>>>>> But that's just a guess, so feel free to disregard.
> >>>>>>>
> >>>>>>>
> >>>>>>> Bert Gunter
> >>>>>>>
> >>>>>>> "The trouble with having an open mind is that people keep coming
> >>>>>>> along and sticking things into it."
> >>>>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford
> >>>>>>> < [hidden email]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi R core team,
> >>>>>>>>
> >>>>>>>> I experienced the following issue with the attached data/code
> >>>>>>>> snippet, where the studentized residual for a single observation
> >>>>>>>> appears to be NaN given finite predictors/responses, which appears
> >>>>>>>> to be driven by the glm.influence method in the stats package. I
> >>>>>>>> am curious to whether this is a consequence of the specific
> >>>>>>>> implementation used for computing the influence, which it would
> >>>>>>>> appear is the driving force for the NaN influence for the point,
> >>>>>>>> that I was ultimately able to trace back through the lm.influence
> >>>>>>>> method to this specific line <
> >>>>>>>> https://github.com/SurajGupta/r> >>> source/blob/a28e609e72ed7c47f6ddfb
> >>>>>>>> b86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67
> >>>>>>>>>
> >>>>>>>> which
> >>>>>>>> calls C code which calls iminfl.f
> >>>>>>>> <
> >>>>>>>>
> https://github.com/SurajGupta/rsource/blob/master/src/library/sta> >>>>>>>> ts/src/lminfl.f
> >>>>>>>>>
> >>>>>>>> (I
> >>>>>>>> don't know fortran so I can't debug further). My understanding is
> >>>>>>>> that the specific issue would have to do with the leaveoneout
> >>>>>>>> variance estimate associated with this particular point, which it
> >>>>>>>> seems based on my understanding should be finite given finite
> >>>>>>>> predictors/responses. Let me know. Thanks!
> >>>>>>>>
> >>>>>>>> Sincerely,
> >>>>>>>>
> >>>>>>>> 
> >>>>>>>> Eric Bridgeford
> >>>>>>>> ericwb.me
> >>>>>>>> ______________________________________________
> >>>>>>>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> >>>>>>>> https://stat.ethz.ch/mailman/listinfo/rhelp> >>>>>>>> PLEASE do read the posting guide
> >>>>>>>> http://www.Rproject.org/postingguide.html> >>>>>>>> and provide commented, minimal, selfcontained, reproducible code.
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> 
> >>>>>> Eric Bridgeford
> >>>>>> ericwb.me
> >>>>>>
> >>>>>
> >>>
> >>> 
> >>> Eric Bridgeford
> >>> ericwb.me
> >>>
> >>> [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/rhelp> >>> PLEASE do read the posting guide http://www.Rproject.org/posting> >>> guide.html
> >>> and provide commented, minimal, selfcontained, reproducible code.
> >>
> >
> >
> > 
> > Eric Bridgeford
> > ericwb.me
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > [hidden email] mailing list  To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/rhelp> > PLEASE do read the posting guide
> http://www.Rproject.org/postingguide.html> > and provide commented, minimal, selfcontained, reproducible code.
>
>

Eric Bridgeford
ericwb.me
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


fortune nomination.
The lesson to me here is that if you fit a sufficiently unreasonable
model to data, the computations may break down.
On Wed, Apr 3, 2019 at 10:18 AM Fox, John < [hidden email]> wrote:
>
> Dear Eric,
>
> I'm afraid that your argument doesn't make sense to me. As you saw when you tried
>
> fit3 < update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
>
> glm.nb() effectively wasn't able to estimate the theta parameter of the negative binomial model. So why would it be better to base deletion diagnostics on actually refitting the model?
>
> The lesson to me here is that if you fit a sufficiently unreasonable model to data, the computations may break down. Other than drawing attention to the NaN with an explicit warning, I don't see what more could usefully be done.
>
> Best,
> John
>
> > On Apr 2, 2019, at 9:08 PM, Eric Bridgeford < [hidden email]> wrote:
> >
> > Hey John,
> >
> > I am aware they are high leverage points, and that the model is not the
> > best for them. The purpose of this dataset was to explore high leverage
> > points, and diagnostic statistics through which one would identify them.
> >
> > What I am saying is that the current behavior of the function seems a
> > little nonspecific to me; the influence for this problem is
> > finite/computable manually by fitting n models to n1 points (manually
> > holding out each point individually to obtain the loovariance, and
> > computing the influence in the nonapproximate way).
> >
> > I am just suggesting that it seems the function could be improved by, say,
> > throwing specific warnings when NaNs may arise. Ie, "Your have points that
> > are very high leverage. The approximation technique is not numerically
> > stable for these points and the results should be used with caution"
> > etc...; I am sure there are other also prehoc approaches to diagnose other
> > ways in which this function could fail). The approximation technique not
> > behaving well for points that are ultra high leverage just seems peculiar
> > that that would return an NaN with no other recommendations/advice/specific
> > warnings, especially since the influence is frequently used to diagnosing
> > this specific issue.
> >
> > Alternatively, one could afford an optional argument type="manual" that
> > computes the heldout variance manually rather than the approximate
> > fashion, and add a comment to use this in the help menu when you have high
> > leverage points (this is what I ended up doing to obtain the true influence
> > and the externally studentized residual).
> >
> > I just think some more specificity could be of use for future users, to
> > make the R:stats community even better :) Does that make sense?
> >
> > Sincerely,
> > Eric
> >
> > On Tue, Apr 2, 2019 at 7:53 PM Fox, John < [hidden email]> wrote:
> >
> >> Dear Eric,
> >>
> >> Have you looked at your data?  for example:
> >>
> >> plot(log(Moons) ~ Volume, data = moon_data)
> >> text(log(Moons) ~ Volume, data = moon_data, labels=Name, adj=1,
> >> subset = Volume > 400)
> >>
> >> The negativebinomial model doesn't look reasonable, does it?
> >>
> >> After you eliminate Jupiter there's one very high leverage point left,
> >> Saturn. Computing studentized residuals entails an approximation to
> >> deleting that as well from the model, so try fitting
> >>
> >> fit3 < update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
> >> summary(fit3)
> >>
> >> which runs into numeric difficulties.
> >>
> >> Then look at:
> >>
> >> plot(log(Moons) ~ Volume, data = moon_data, subset = Volume < 400)
> >>
> >> Finally, try
> >>
> >> plot(log(Moons) ~ log(Volume), data = moon_data)
> >> fit4 < update(fit2, . ~ log(Volume))
> >> rstudent(fit4)
> >>
> >> I hope this helps,
> >> John
> >>
> >> 
> >> John Fox
> >> Professor Emeritus
> >> McMaster University
> >> Hamilton, Ontario, Canada
> >> Web: https://socialsciences.mcmaster.ca/jfox/> >>
> >>
> >>
> >>
> >>> Original Message
> >>> From: Rhelp [mailto: [hidden email]] On Behalf Of Eric
> >>> Bridgeford
> >>> Sent: Tuesday, April 2, 2019 5:01 PM
> >>> To: Bert Gunter < [hidden email]>
> >>> Cc: Rhelp < [hidden email]>
> >>> Subject: Re: [R] Fwd: Potential Issue with lm.influence
> >>>
> >>> I agree the influence documentation suggests NaNs may result; however, as
> >>> these can be manually computed and are, indeed, finite/existing (ie,
> >>> computing the heldout influence by manually training n models for n
> >> points
> >>> to obtain n leave one out influence measures), I don't possibly see how
> >> the
> >>> function SHOULD return NaN, and given that it is returning NaN, that
> >>> suggests to me that there should be either a) Providing an alternative
> >>> method to compute them that (may be slower) that returns the correct
> >>> results in the even that lm.influence does not return a good
> >> approximation
> >>> (ie, a command line argument for type="approx" that does the
> >>> approximation strategy employed currently, or an alternative
> >> type="direct"
> >>> or something like that that computes them manually), or b) a heuristic to
> >>> suggest why NaNs might result from one's particular inputs/what can be
> >>> done to fix it (if the approximation strategy is the source of the
> >> problem) or
> >>> what the issue is with the data that will cause NaNs. Hence I was
> >> looking to
> >>> start a discussion around the specific strategy employed to compute the
> >>> elements.
> >>>
> >>> Below is the code:
> >>> moon_data < structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L,
> >> 11L,
> >>> 12L, 9L, 10L, 4L, 6L,
> >> 3L), .Label = c("Ceres ", "Earth",
> >>> "Eris ",
> >>>
> >>> "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ",
> >> "Neptune ",
> >>>
> >>> "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
> >>> Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2,
> >> 9.54, 19.22,
> >>> 30.06, 39.5, 43.35, 45.8,
> >> 67.7), Diameter = c(0.382, 0.949,
> >>>
> >>> 1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
> >>>
> >>> 0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e04, 317.8,
> >>>
> >>> 95.2, 14.6, 17.2, 0.0022, 7e04, 7e04,
> >> 0.0025), Moons = c(0L,
> >>>
> >>>
> >>> 0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L),
> >> Volume =
> >>> c(0.0291869497930152,
> >>>
> >>>
> >>>
> >>> 0.447504348276571, 0.523598775598299, 0.0788376225681443,
> >>>
> >>>
> >>>
> >>> 0.000268082573106329, 737.393372232996, 441.729261571372,
> >>>
> >>>
> >>>
> >>> 33.6865588825666, 30.6549628355953, 0.00305362805928928,
> >>>
> >>>
> >>>
> >>> 0.00176714586764426, 0.00090477868423386, 0.00359136400182873
> >>>
> >>>
> >>> )), row.names = c(NA, 13L), class = "data.frame")
> >>>
> >>> fit < glm.nb(Moons ~ Volume, data = moon_data)
> >>> rstudent(fit)
> >>>
> >>> fit2 < update(fit, subset = Name != "Jupiter ")
> >>> rstudent(fit2)
> >>>
> >>> influence(fit2)$sigma
> >>>
> >>> # 1 2 3 4 5 7 8 9
> >>> 10 11 12 13
> >>> # 1.077945 1.077813 1.165025 1.181685 1.077954 NaN 1.044454 1.152110
> >>> 1.187586 1.181696 1.077954 1.165147
> >>>
> >>> Sincerely,
> >>> Eric
> >>>
> >>> On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter < [hidden email]>
> >>> wrote:
> >>>
> >>>> Also, I suggest you read ?influence which may explain the source of
> >>>> your NaN's .
> >>>>
> >>>> Bert Gunter
> >>>>
> >>>> "The trouble with having an open mind is that people keep coming along
> >>>> and sticking things into it."
> >>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>>>
> >>>>
> >>>> On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter < [hidden email]>
> >>> wrote:
> >>>>
> >>>>> I told you already: **Include code inline **
> >>>>>
> >>>>> See ?dput for how to include a text version of objects, such as data
> >>>>> frames, inline.
> >>>>>
> >>>>> Otherwise, I believe .txt text files are not stripped if you insist
> >>>>> on
> >>>>> *attaching* data or code. Others may have better advice.
> >>>>>
> >>>>>
> >>>>> Bert Gunter
> >>>>>
> >>>>> "The trouble with having an open mind is that people keep coming
> >>>>> along and sticking things into it."
> >>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>>>>
> >>>>>
> >>>>> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford < [hidden email]>
> >>>>> wrote:
> >>>>>
> >>>>>> How can I add attachments? The following two files were attached in
> >>>>>> the initial message
> >>>>>>
> >>>>>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter < [hidden email]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Nothing was attached. The rhelp server strips most attachments.
> >>>>>>> Include your code inline.
> >>>>>>>
> >>>>>>> Also note that
> >>>>>>>
> >>>>>>>> 0/0
> >>>>>>> [1] NaN
> >>>>>>>
> >>>>>>> so maybe something like that occurs in the course of your
> >> calculations.
> >>>>>>> But that's just a guess, so feel free to disregard.
> >>>>>>>
> >>>>>>>
> >>>>>>> Bert Gunter
> >>>>>>>
> >>>>>>> "The trouble with having an open mind is that people keep coming
> >>>>>>> along and sticking things into it."
> >>>>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford
> >>>>>>> < [hidden email]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi R core team,
> >>>>>>>>
> >>>>>>>> I experienced the following issue with the attached data/code
> >>>>>>>> snippet, where the studentized residual for a single observation
> >>>>>>>> appears to be NaN given finite predictors/responses, which appears
> >>>>>>>> to be driven by the glm.influence method in the stats package. I
> >>>>>>>> am curious to whether this is a consequence of the specific
> >>>>>>>> implementation used for computing the influence, which it would
> >>>>>>>> appear is the driving force for the NaN influence for the point,
> >>>>>>>> that I was ultimately able to trace back through the lm.influence
> >>>>>>>> method to this specific line <
> >>>>>>>> https://github.com/SurajGupta/r> >>> source/blob/a28e609e72ed7c47f6ddfb
> >>>>>>>> b86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67
> >>>>>>>>>
> >>>>>>>> which
> >>>>>>>> calls C code which calls iminfl.f
> >>>>>>>> <
> >>>>>>>> https://github.com/SurajGupta/rsource/blob/master/src/library/sta> >>>>>>>> ts/src/lminfl.f
> >>>>>>>>>
> >>>>>>>> (I
> >>>>>>>> don't know fortran so I can't debug further). My understanding is
> >>>>>>>> that the specific issue would have to do with the leaveoneout
> >>>>>>>> variance estimate associated with this particular point, which it
> >>>>>>>> seems based on my understanding should be finite given finite
> >>>>>>>> predictors/responses. Let me know. Thanks!
> >>>>>>>>
> >>>>>>>> Sincerely,
> >>>>>>>>
> >>>>>>>> 
> >>>>>>>> Eric Bridgeford
> >>>>>>>> ericwb.me
> >>>>>>>> ______________________________________________
> >>>>>>>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> >>>>>>>> https://stat.ethz.ch/mailman/listinfo/rhelp> >>>>>>>> PLEASE do read the posting guide
> >>>>>>>> http://www.Rproject.org/postingguide.html> >>>>>>>> and provide commented, minimal, selfcontained, reproducible code.
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> 
> >>>>>> Eric Bridgeford
> >>>>>> ericwb.me
> >>>>>>
> >>>>>
> >>>
> >>> 
> >>> Eric Bridgeford
> >>> ericwb.me
> >>>
> >>> [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/rhelp> >>> PLEASE do read the posting guide http://www.Rproject.org/posting> >>> guide.html
> >>> and provide commented, minimal, selfcontained, reproducible code.
> >>
> >
> >
> > 
> > Eric Bridgeford
> > ericwb.me
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > [hidden email] mailing list  To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/rhelp> > PLEASE do read the posting guide http://www.Rproject.org/postingguide.html> > and provide commented, minimal, selfcontained, reproducible code.
>
> ______________________________________________
> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/rhelp> PLEASE do read the posting guide http://www.Rproject.org/postingguide.html> and provide commented, minimal, selfcontained, reproducible code.
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Second!
Bert Gunter
On Wed, Apr 3, 2019 at 9:35 AM Richard M. Heiberger < [hidden email]> wrote:
> fortune nomination.
>
>
> The lesson to me here is that if you fit a sufficiently unreasonable
> model to data, the computations may break down.
>
> On Wed, Apr 3, 2019 at 10:18 AM Fox, John < [hidden email]> wrote:
> >
> > Dear Eric,
> >
> > I'm afraid that your argument doesn't make sense to me. As you saw when
> you tried
> >
> > fit3 < update(fit, subset = !(Name %in% c("Jupiter ", "Saturn
> ")))
> >
> > glm.nb() effectively wasn't able to estimate the theta parameter of the
> negative binomial model. So why would it be better to base deletion
> diagnostics on actually refitting the model?
> >
> > The lesson to me here is that if you fit a sufficiently unreasonable
> model to data, the computations may break down. Other than drawing
> attention to the NaN with an explicit warning, I don't see what more could
> usefully be done.
> >
> > Best,
> > John
> >
> > > On Apr 2, 2019, at 9:08 PM, Eric Bridgeford < [hidden email]>
> wrote:
> > >
> > > Hey John,
> > >
> > > I am aware they are high leverage points, and that the model is not the
> > > best for them. The purpose of this dataset was to explore high leverage
> > > points, and diagnostic statistics through which one would identify
> them.
> > >
> > > What I am saying is that the current behavior of the function seems a
> > > little nonspecific to me; the influence for this problem is
> > > finite/computable manually by fitting n models to n1 points (manually
> > > holding out each point individually to obtain the loovariance, and
> > > computing the influence in the nonapproximate way).
> > >
> > > I am just suggesting that it seems the function could be improved by,
> say,
> > > throwing specific warnings when NaNs may arise. Ie, "Your have points
> that
> > > are very high leverage. The approximation technique is not numerically
> > > stable for these points and the results should be used with caution"
> > > etc...; I am sure there are other also prehoc approaches to diagnose
> other
> > > ways in which this function could fail). The approximation technique
> not
> > > behaving well for points that are ultra high leverage just seems
> peculiar
> > > that that would return an NaN with no other
> recommendations/advice/specific
> > > warnings, especially since the influence is frequently used to
> diagnosing
> > > this specific issue.
> > >
> > > Alternatively, one could afford an optional argument type="manual" that
> > > computes the heldout variance manually rather than the approximate
> > > fashion, and add a comment to use this in the help menu when you have
> high
> > > leverage points (this is what I ended up doing to obtain the true
> influence
> > > and the externally studentized residual).
> > >
> > > I just think some more specificity could be of use for future users, to
> > > make the R:stats community even better :) Does that make sense?
> > >
> > > Sincerely,
> > > Eric
> > >
> > > On Tue, Apr 2, 2019 at 7:53 PM Fox, John < [hidden email]> wrote:
> > >
> > >> Dear Eric,
> > >>
> > >> Have you looked at your data?  for example:
> > >>
> > >> plot(log(Moons) ~ Volume, data = moon_data)
> > >> text(log(Moons) ~ Volume, data = moon_data, labels=Name, adj=1,
> > >> subset = Volume > 400)
> > >>
> > >> The negativebinomial model doesn't look reasonable, does it?
> > >>
> > >> After you eliminate Jupiter there's one very high leverage point left,
> > >> Saturn. Computing studentized residuals entails an approximation to
> > >> deleting that as well from the model, so try fitting
> > >>
> > >> fit3 < update(fit, subset = !(Name %in% c("Jupiter ", "Saturn
> ")))
> > >> summary(fit3)
> > >>
> > >> which runs into numeric difficulties.
> > >>
> > >> Then look at:
> > >>
> > >> plot(log(Moons) ~ Volume, data = moon_data, subset = Volume <
> 400)
> > >>
> > >> Finally, try
> > >>
> > >> plot(log(Moons) ~ log(Volume), data = moon_data)
> > >> fit4 < update(fit2, . ~ log(Volume))
> > >> rstudent(fit4)
> > >>
> > >> I hope this helps,
> > >> John
> > >>
> > >> 
> > >> John Fox
> > >> Professor Emeritus
> > >> McMaster University
> > >> Hamilton, Ontario, Canada
> > >> Web: https://socialsciences.mcmaster.ca/jfox/> > >>
> > >>
> > >>
> > >>
> > >>> Original Message
> > >>> From: Rhelp [mailto: [hidden email]] On Behalf Of Eric
> > >>> Bridgeford
> > >>> Sent: Tuesday, April 2, 2019 5:01 PM
> > >>> To: Bert Gunter < [hidden email]>
> > >>> Cc: Rhelp < [hidden email]>
> > >>> Subject: Re: [R] Fwd: Potential Issue with lm.influence
> > >>>
> > >>> I agree the influence documentation suggests NaNs may result;
> however, as
> > >>> these can be manually computed and are, indeed, finite/existing (ie,
> > >>> computing the heldout influence by manually training n models for n
> > >> points
> > >>> to obtain n leave one out influence measures), I don't possibly see
> how
> > >> the
> > >>> function SHOULD return NaN, and given that it is returning NaN, that
> > >>> suggests to me that there should be either a) Providing an
> alternative
> > >>> method to compute them that (may be slower) that returns the correct
> > >>> results in the even that lm.influence does not return a good
> > >> approximation
> > >>> (ie, a command line argument for type="approx" that does the
> > >>> approximation strategy employed currently, or an alternative
> > >> type="direct"
> > >>> or something like that that computes them manually), or b) a
> heuristic to
> > >>> suggest why NaNs might result from one's particular inputs/what can
> be
> > >>> done to fix it (if the approximation strategy is the source of the
> > >> problem) or
> > >>> what the issue is with the data that will cause NaNs. Hence I was
> > >> looking to
> > >>> start a discussion around the specific strategy employed to compute
> the
> > >>> elements.
> > >>>
> > >>> Below is the code:
> > >>> moon_data < structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L,
> 5L,
> > >> 11L,
> > >>> 12L, 9L, 10L, 4L, 6L,
> > >> 3L), .Label = c("Ceres ", "Earth",
> > >>> "Eris ",
> > >>>
> > >>> "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ",
> > >> "Neptune ",
> > >>>
> > >>> "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
> > >>> Distance = c(0.39, 0.72, 1, 1.52, 2.75,
> 5.2,
> > >> 9.54, 19.22,
> > >>> 30.06, 39.5, 43.35, 45.8,
> > >> 67.7), Diameter = c(0.382, 0.949,
> > >>>
> > >>> 1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
> > >>>
> > >>> 0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e04, 317.8,
> > >>>
> > >>> 95.2, 14.6, 17.2, 0.0022, 7e04,
> 7e04,
> > >> 0.0025), Moons = c(0L,
> > >>>
> > >>>
> > >>> 0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L),
> > >> Volume =
> > >>> c(0.0291869497930152,
> > >>>
> > >>>
> > >>>
> > >>> 0.447504348276571, 0.523598775598299, 0.0788376225681443,
> > >>>
> > >>>
> > >>>
> > >>> 0.000268082573106329, 737.393372232996, 441.729261571372,
> > >>>
> > >>>
> > >>>
> > >>> 33.6865588825666, 30.6549628355953, 0.00305362805928928,
> > >>>
> > >>>
> > >>>
> > >>> 0.00176714586764426, 0.00090477868423386, 0.00359136400182873
> > >>>
> > >>>
> > >>> )), row.names = c(NA, 13L), class = "data.frame")
> > >>>
> > >>> fit < glm.nb(Moons ~ Volume, data = moon_data)
> > >>> rstudent(fit)
> > >>>
> > >>> fit2 < update(fit, subset = Name != "Jupiter ")
> > >>> rstudent(fit2)
> > >>>
> > >>> influence(fit2)$sigma
> > >>>
> > >>> # 1 2 3 4 5 7 8
> 9
> > >>> 10 11 12 13
> > >>> # 1.077945 1.077813 1.165025 1.181685 1.077954 NaN 1.044454
> 1.152110
> > >>> 1.187586 1.181696 1.077954 1.165147
> > >>>
> > >>> Sincerely,
> > >>> Eric
> > >>>
> > >>> On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter < [hidden email]>
> > >>> wrote:
> > >>>
> > >>>> Also, I suggest you read ?influence which may explain the source of
> > >>>> your NaN's .
> > >>>>
> > >>>> Bert Gunter
> > >>>>
> > >>>> "The trouble with having an open mind is that people keep coming
> along
> > >>>> and sticking things into it."
> > >>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> > >>>>
> > >>>>
> > >>>> On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter < [hidden email]>
> > >>> wrote:
> > >>>>
> > >>>>> I told you already: **Include code inline **
> > >>>>>
> > >>>>> See ?dput for how to include a text version of objects, such as
> data
> > >>>>> frames, inline.
> > >>>>>
> > >>>>> Otherwise, I believe .txt text files are not stripped if you insist
> > >>>>> on
> > >>>>> *attaching* data or code. Others may have better advice.
> > >>>>>
> > >>>>>
> > >>>>> Bert Gunter
> > >>>>>
> > >>>>> "The trouble with having an open mind is that people keep coming
> > >>>>> along and sticking things into it."
> > >>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> > >>>>>
> > >>>>>
> > >>>>> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford < [hidden email]
> >
> > >>>>> wrote:
> > >>>>>
> > >>>>>> How can I add attachments? The following two files were attached
> in
> > >>>>>> the initial message
> > >>>>>>
> > >>>>>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter <
> [hidden email]>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Nothing was attached. The rhelp server strips most attachments.
> > >>>>>>> Include your code inline.
> > >>>>>>>
> > >>>>>>> Also note that
> > >>>>>>>
> > >>>>>>>> 0/0
> > >>>>>>> [1] NaN
> > >>>>>>>
> > >>>>>>> so maybe something like that occurs in the course of your
> > >> calculations.
> > >>>>>>> But that's just a guess, so feel free to disregard.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Bert Gunter
> > >>>>>>>
> > >>>>>>> "The trouble with having an open mind is that people keep coming
> > >>>>>>> along and sticking things into it."
> > >>>>>>>  Opus (aka Berkeley Breathed in his "Bloom County" comic strip
> )
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford
> > >>>>>>> < [hidden email]>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hi R core team,
> > >>>>>>>>
> > >>>>>>>> I experienced the following issue with the attached data/code
> > >>>>>>>> snippet, where the studentized residual for a single observation
> > >>>>>>>> appears to be NaN given finite predictors/responses, which
> appears
> > >>>>>>>> to be driven by the glm.influence method in the stats package. I
> > >>>>>>>> am curious to whether this is a consequence of the specific
> > >>>>>>>> implementation used for computing the influence, which it would
> > >>>>>>>> appear is the driving force for the NaN influence for the point,
> > >>>>>>>> that I was ultimately able to trace back through the
> lm.influence
> > >>>>>>>> method to this specific line <
> > >>>>>>>> https://github.com/SurajGupta/r> > >>> source/blob/a28e609e72ed7c47f6ddfb
> > >>>>>>>> b86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67
> > >>>>>>>>>
> > >>>>>>>> which
> > >>>>>>>> calls C code which calls iminfl.f
> > >>>>>>>> <
> > >>>>>>>>
> https://github.com/SurajGupta/rsource/blob/master/src/library/sta> > >>>>>>>> ts/src/lminfl.f
> > >>>>>>>>>
> > >>>>>>>> (I
> > >>>>>>>> don't know fortran so I can't debug further). My understanding
> is
> > >>>>>>>> that the specific issue would have to do with the leaveoneout
> > >>>>>>>> variance estimate associated with this particular point, which
> it
> > >>>>>>>> seems based on my understanding should be finite given finite
> > >>>>>>>> predictors/responses. Let me know. Thanks!
> > >>>>>>>>
> > >>>>>>>> Sincerely,
> > >>>>>>>>
> > >>>>>>>> 
> > >>>>>>>> Eric Bridgeford
> > >>>>>>>> ericwb.me
> > >>>>>>>> ______________________________________________
> > >>>>>>>> [hidden email] mailing list  To UNSUBSCRIBE and more,
> see
> > >>>>>>>> https://stat.ethz.ch/mailman/listinfo/rhelp> > >>>>>>>> PLEASE do read the posting guide
> > >>>>>>>> http://www.Rproject.org/postingguide.html> > >>>>>>>> and provide commented, minimal, selfcontained, reproducible
> code.
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>> 
> > >>>>>> Eric Bridgeford
> > >>>>>> ericwb.me
> > >>>>>>
> > >>>>>
> > >>>
> > >>> 
> > >>> Eric Bridgeford
> > >>> ericwb.me
> > >>>
> > >>> [[alternative HTML version deleted]]
> > >>>
> > >>> ______________________________________________
> > >>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> > >>> https://stat.ethz.ch/mailman/listinfo/rhelp> > >>> PLEASE do read the posting guide http://www.Rproject.org/posting> > >>> guide.html
> > >>> and provide commented, minimal, selfcontained, reproducible code.
> > >>
> > >
> > >
> > > 
> > > Eric Bridgeford
> > > ericwb.me
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > [hidden email] mailing list  To UNSUBSCRIBE and more, see
> > > https://stat.ethz.ch/mailman/listinfo/rhelp> > > PLEASE do read the posting guide
> http://www.Rproject.org/postingguide.html> > > and provide commented, minimal, selfcontained, reproducible code.
> >
> > ______________________________________________
> > [hidden email] mailing list  To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/rhelp> > PLEASE do read the posting guide
> http://www.Rproject.org/postingguide.html> > and provide commented, minimal, selfcontained, reproducible code.
>
> ______________________________________________
> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/rhelp> PLEASE do read the posting guide
> http://www.Rproject.org/postingguide.html> and provide commented, minimal, selfcontained, reproducible code.
>
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


In reply to this post by Richard M. Heiberger
On 4/04/19 5:34 AM, Richard M. Heiberger wrote:
> fortune nomination.
>
>
> The lesson to me here is that if you fit a sufficiently unreasonable
> model to data, the computations may break down.
<SNIP>
I second the nomination!
cheers,
Rolf

Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +6493737599 ext. 88276
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.

