External validation for a hurdle model (pscl)

External validation for a hurdle model (pscl)

Maria Eugenia Utgés
Hi R-list,
We constructed a hurdle model some time ago. We have now been able to
gather new data in the same city (38 new sites), and we want to carry out
an external validation to see whether the model still performs well. All
the books and lectures I have read say it is the best validation option,
but they stop there. I have searched, but since having genuinely new data
for an existing model seems to be rare, I have not found anything detailed
enough to reproduce or adapt for hurdle models.

I have predicted the probability of a non-zero count:

# column 1 of the "prob" matrix is P(y = 0), so its complement is P(y > 0)
nonzero <- 1 - predict(final, newdata = datosnuevos, type = "prob")[, 1]

and the predicted mean of the count component:

countmean <- predict(final, newdata = datosnuevos, type = "count")

I understand that "newdata" makes predict() evaluate the fitted model at
the new values of the independent (environmental) variables, is that right?
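
As a check on my understanding, a minimal sketch (assuming, as above, that
final is the fitted hurdle object and datosnuevos holds the new sites):

## the old model evaluated at the NEW covariate values;
## type = "response" (the default) gives the overall predicted mean count,
## combining the zero hurdle and the count component
mu_new <- predict(final, newdata = datosnuevos, type = "response")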

So I have to compare the predicted values of y (calculated with the new
values of the environmental variables) against the newly observed values.

That is, I use the model (fitted on the old data), feed it the new
covariate values as input, and obtain a "new" prediction as output, to be
contrasted with the "new" observed y.

Would this comparison be made by means of AUC, correct classification
rate, and/or which other options? Would the result of the external
validation be just a percentage of correctly predicted values? Plots?
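
To make the question concrete, this is roughly what I picture, assuming
the new observed counts are in datosnuevos$y (pROC is just one option for
computing an AUC):

library(pROC)                     # for roc() and auc()

obs   <- datosnuevos$y            # newly observed counts
obs01 <- as.integer(obs > 0)      # observed presence/absence

## discrimination of the zero-hurdle part on the new sites
auc(roc(obs01, nonzero))

## correct classification at a 0.5 cut-off (any cut-off could be argued)
mean((nonzero > 0.5) == obs01)

## accuracy of the predicted mean counts
mu_new <- predict(final, newdata = datosnuevos, type = "response")
mean(abs(obs - mu_new))           # mean absolute error
sqrt(mean((obs - mu_new)^2))      # root mean squared error

## observed versus predicted plot
plot(mu_new, obs, xlab = "predicted count", ylab = "observed count")
abline(0, 1, lty = 2)

Is that the right direction?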

I need some guidance. Sorry if the explanation was basic, but I needed to
write it in my own words so as not to miss any detail.

Thank you very much in advance,

María Eugenia Utgés

CeNDIE-ANLIS
Buenos Aires
Argentina


Re: External validation for a hurdle model (pscl)

Bert Gunter-2
This list is (mostly) about R programming. Your query is (mostly) about
statistics, so you should post on a statistics site like
stats.stackexchange.com, not here; I am pretty sure you'll receive lots of
answers there.

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)



Re: External validation for a hurdle model (pscl)

Jeff Newmiller
That said, the gist of the OP's outline is correct, and the main reason to look elsewhere is to get more thorough advice on what statistical concerns should be addressed than would be on topic here.

One comment: reviewing plots of differences versus the various independent variables for systematic biases is a task R is particularly well suited for. Discovering which plots highlight issues with your model or data, however, takes familiarity with your data (explore), with theory (which you learn elsewhere), and with R (which we can help with if you have more specific questions).
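
For instance, a rough sketch (temp here stands in for any one of your
environmental variables; adjust the names to your own data frame):

## residuals on the new data: observed minus predicted mean
res <- datosnuevos$y -
  predict(final, newdata = datosnuevos, type = "response")

## residuals against one covariate; a smoother helps reveal systematic
## over- or under-prediction across that variable's range
plot(datosnuevos$temp, res, xlab = "temp", ylab = "observed - predicted")
abline(h = 0, lty = 2)
lines(lowess(datosnuevos$temp, res), col = "red")

Repeating this for each covariate is a quick way to see where the model
drifts.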


--
Sent from my phone. Please excuse my brevity.


Re: External validation for a hurdle model (pscl)

Maria Eugenia Utgés
Hi Jeff,
Yes, perhaps my question is more general: it is not about R programming,
data exploration, or statistical theory. It is just that modelling texts
present external validation as the "panacea" but treat it as "unreachable",
and so go on to explain other methods such as cross-validation,
bootstrapping, etc. Here I do have new data for a previously fitted model
(already internally validated by bootstrapping), but I have not found how
to carry out the external validation correctly and sufficiently, nor by
which means (does it all end in just a plot? a percentage of correct
classification?).
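
One thing I imagine reporting, besides an AUC and an observed versus
predicted plot, is a simple calibration check of the zero part. A sketch
(with only 38 sites, a few bins is all the data will support):

## predicted probability of a non-zero count at the new sites
p_hat <- 1 - predict(final, newdata = datosnuevos, type = "prob")[, 1]
obs01 <- as.integer(datosnuevos$y > 0)

## bin the sites by predicted probability and compare each bin's mean
## prediction with its observed proportion of non-zero counts
brks <- unique(quantile(p_hat, probs = seq(0, 1, 0.25)))
bins <- cut(p_hat, breaks = brks, include.lowest = TRUE)
cbind(predicted = tapply(p_hat, bins, mean),
      observed  = tapply(obs01, bins, mean))

If the two columns track each other, the hurdle part is still calibrated
at the new sites.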
