Why do two different calls to mgcv::gamm give different p-values?

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Why do two different calls to mgcv::gamm give different p-values?

Øystein Sørensen
I have longitudinal data where a response y_{ij} has been measured
repeatedly for i=1,...,n individuals, j=1,...,m times for each individual.
Let x_{ij} be the age of individual i at her/his jth measurement. I wish to
see how the response varies with age, and have reason to believe that a
nonlinear fit is needed. I hence wish to model the relationship using an
additive mixed model of the form y_{ij} = f(x_{ij}) + b_{i} +
\epsilon_{ij}, where f() is some smooth function to be estimated, b_{i}
\sim N(0, \sigma_{b}^{2}) is the random effect for individual i,
\epsilon_{ij} \sim N(0, \sigma^{2}) is the residual.

Reading the documentation to the mgcv package, I see that such models can
be set up by calling the mgcv::gamm two different ways. One way shown to
set up such a model is with the call
b1 <- gamm(y ~ s(x), random = list(id =~ 1)),
where id is an indicator of the specific individual. In this way, the
random effect is specified in a list. The other way is to set up the random
effect with a smooth of the re:
b2 <- gamm(y ~ s(x) + s(id, bs = "re")).

As far as I can understand, these two setups should create the same
underlying model matrices. However, when running them on my data, the
p-values for the smooth terms, as well as their estimated degrees of
freedom, are different.

Below is a reproducible example on simulated data, which show that these
two types of specifying the models give different p-values and estimated
degrees of freedom. On my real data, these differences are sometimes even
more exaggerated.

My question is: Are not these two calls to gamm suppose to estimate the
same model, or have I misunderstood?

Here is a reproducible example:

library(mgcv)
set.seed(100)
n_sim <- 100
df <- data.frame(
  model1_pval = numeric(n_sim),
  model1_edf = numeric(n_sim),
  model2_pval = numeric(n_sim),
  model2_edf = numeric(n_sim)
)

# Number of observations
n <- 500
# Number of groups
ng <- 100

for(i in 1:n_sim){
  # Draw the fixed effect covariate
  x <- rnorm(n)
  # Set up the group
  id <- rep(1:ng, n/ng)
  # Draw the random effect
  z <- rnorm(ng)
  # Draw the response
  y <- 0.1 * x + 0.2 * x^3 + z[id] + rnorm(n)

  # Fit the two different models
  b1 <- gamm(y ~ s(x, k = 5), random = list(id =~ 1))
  b2 <- gamm(y ~ s(x, k = 5) + s(id, bs = "re"))

  df[i, "model1_pval"] <- anova(b1$gam)$s.pv
  df[i, "model1_edf"] <- anova(b1$gam)$edf
  df[i, "model2_pval"] <- anova(b2$gam)$s.pv[[1]]
  df[i, "model2_edf"] <- anova(b2$gam)$edf[[1]]
}

# Differences between model 1 and model 2 p values
mean(df$model1_pval)
#> [1] 6.790546e-21
mean(df$model2_pval)
#> [1] 3.090694e-14
max(abs(df$model1_pval - df$model2_pval))
#> [1] 2.770545e-12

# Differences between model1 and model 2 estimated degrees of freedom
mean(df$model1_edf)
#> [1] 3.829992
mean(df$model2_edf)
#> [1] 3.731438
max(abs(df$model1_edf - df$model2_edf))
#> [1] 0.6320281


Thanks in advance,
Øystein Sørensen

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Why do two different calls to mgcv::gamm give different p-values?

Simon Wood
Dear Øystein,

In your code 'id' is set up to be numeric. In your first gamm call it
gets coerced to a factor (since nothing else would make sense). In  your
second gamm call it is simply treated as numeric, since that could makes
sense. To make the two models equivalent you just need to substitute:

id <- as.factor(rep(1:ng, n/ng))

into your loop.

best,
Simon

On 01/10/18 08:15, Øystein Sørensen wrote:

> I have longitudinal data where a response y_{ij} has been measured
> repeatedly for i=1,...,n individuals, j=1,...,m times for each individual.
> Let x_{ij} be the age of individual i at her/his jth measurement. I wish to
> see how the response varies with age, and have reason to believe that a
> nonlinear fit is needed. I hence wish to model the relationship using an
> additive mixed model of the form y_{ij} = f(x_{ij}) + b_{i} +
> \epsilon_{ij}, where f() is some smooth function to be estimated, b_{i}
> \sim N(0, \sigma_{b}^{2}) is the random effect for individual i,
> \epsilon_{ij} \sim N(0, \sigma^{2}) is the residual.
>
> Reading the documentation to the mgcv package, I see that such models can
> be set up by calling the mgcv::gamm two different ways. One way shown to
> set up such a model is with the call
> b1 <- gamm(y ~ s(x), random = list(id =~ 1)),
> where id is an indicator of the specific individual. In this way, the
> random effect is specified in a list. The other way is to set up the random
> effect with a smooth of the re:
> b2 <- gamm(y ~ s(x) + s(id, bs = "re")).
>
> As far as I can understand, these two setups should create the same
> underlying model matrices. However, when running them on my data, the
> p-values for the smooth terms, as well as their estimated degrees of
> freedom, are different.
>
> Below is a reproducible example on simulated data, which show that these
> two types of specifying the models give different p-values and estimated
> degrees of freedom. On my real data, these differences are sometimes even
> more exaggerated.
>
> My question is: Are not these two calls to gamm suppose to estimate the
> same model, or have I misunderstood?
>
> Here is a reproducible example:
>
> library(mgcv)
> set.seed(100)
> n_sim <- 100
> df <- data.frame(
>    model1_pval = numeric(n_sim),
>    model1_edf = numeric(n_sim),
>    model2_pval = numeric(n_sim),
>    model2_edf = numeric(n_sim)
> )
>
> # Number of observations
> n <- 500
> # Number of groups
> ng <- 100
>
> for(i in 1:n_sim){
>    # Draw the fixed effect covariate
>    x <- rnorm(n)
>    # Set up the group
>    id <- rep(1:ng, n/ng)
>    # Draw the random effect
>    z <- rnorm(ng)
>    # Draw the response
>    y <- 0.1 * x + 0.2 * x^3 + z[id] + rnorm(n)
>
>    # Fit the two different models
>    b1 <- gamm(y ~ s(x, k = 5), random = list(id =~ 1))
>    b2 <- gamm(y ~ s(x, k = 5) + s(id, bs = "re"))
>
>    df[i, "model1_pval"] <- anova(b1$gam)$s.pv
>    df[i, "model1_edf"] <- anova(b1$gam)$edf
>    df[i, "model2_pval"] <- anova(b2$gam)$s.pv[[1]]
>    df[i, "model2_edf"] <- anova(b2$gam)$edf[[1]]
> }
>
> # Differences between model 1 and model 2 p values
> mean(df$model1_pval)
> #> [1] 6.790546e-21
> mean(df$model2_pval)
> #> [1] 3.090694e-14
> max(abs(df$model1_pval - df$model2_pval))
> #> [1] 2.770545e-12
>
> # Differences between model1 and model 2 estimated degrees of freedom
> mean(df$model1_edf)
> #> [1] 3.829992
> mean(df$model2_edf)
> #> [1] 3.731438
> max(abs(df$model1_edf - df$model2_edf))
> #> [1] 0.6320281
>
>
> Thanks in advance,
> Øystein Sørensen
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Why do two different calls to mgcv::gamm give different p-values?

Øystein Sørensen
Thanks a lot, Simon. That was a silly mistake on my part. After correcting,
there is still a slight difference between the p-values, but this is so
small that I guess it is just numerics.

Best,
Øystein

On Tue, Oct 2, 2018 at 2:26 PM Simon Wood <[hidden email]> wrote:

> Dear Øystein,
>
> In your code 'id' is set up to be numeric. In your first gamm call it
> gets coerced to a factor (since nothing else would make sense). In  your
> second gamm call it is simply treated as numeric, since that could makes
> sense. To make the two models equivalent you just need to substitute:
>
> id <- as.factor(rep(1:ng, n/ng))
>
> into your loop.
>
> best,
> Simon
>
> On 01/10/18 08:15, Øystein Sørensen wrote:
> > I have longitudinal data where a response y_{ij} has been measured
> > repeatedly for i=1,...,n individuals, j=1,...,m times for each
> individual.
> > Let x_{ij} be the age of individual i at her/his jth measurement. I wish
> to
> > see how the response varies with age, and have reason to believe that a
> > nonlinear fit is needed. I hence wish to model the relationship using an
> > additive mixed model of the form y_{ij} = f(x_{ij}) + b_{i} +
> > \epsilon_{ij}, where f() is some smooth function to be estimated, b_{i}
> > \sim N(0, \sigma_{b}^{2}) is the random effect for individual i,
> > \epsilon_{ij} \sim N(0, \sigma^{2}) is the residual.
> >
> > Reading the documentation to the mgcv package, I see that such models can
> > be set up by calling the mgcv::gamm two different ways. One way shown to
> > set up such a model is with the call
> > b1 <- gamm(y ~ s(x), random = list(id =~ 1)),
> > where id is an indicator of the specific individual. In this way, the
> > random effect is specified in a list. The other way is to set up the
> random
> > effect with a smooth of the re:
> > b2 <- gamm(y ~ s(x) + s(id, bs = "re")).
> >
> > As far as I can understand, these two setups should create the same
> > underlying model matrices. However, when running them on my data, the
> > p-values for the smooth terms, as well as their estimated degrees of
> > freedom, are different.
> >
> > Below is a reproducible example on simulated data, which show that these
> > two types of specifying the models give different p-values and estimated
> > degrees of freedom. On my real data, these differences are sometimes even
> > more exaggerated.
> >
> > My question is: Are not these two calls to gamm suppose to estimate the
> > same model, or have I misunderstood?
> >
> > Here is a reproducible example:
> >
> > library(mgcv)
> > set.seed(100)
> > n_sim <- 100
> > df <- data.frame(
> >    model1_pval = numeric(n_sim),
> >    model1_edf = numeric(n_sim),
> >    model2_pval = numeric(n_sim),
> >    model2_edf = numeric(n_sim)
> > )
> >
> > # Number of observations
> > n <- 500
> > # Number of groups
> > ng <- 100
> >
> > for(i in 1:n_sim){
> >    # Draw the fixed effect covariate
> >    x <- rnorm(n)
> >    # Set up the group
> >    id <- rep(1:ng, n/ng)
> >    # Draw the random effect
> >    z <- rnorm(ng)
> >    # Draw the response
> >    y <- 0.1 * x + 0.2 * x^3 + z[id] + rnorm(n)
> >
> >    # Fit the two different models
> >    b1 <- gamm(y ~ s(x, k = 5), random = list(id =~ 1))
> >    b2 <- gamm(y ~ s(x, k = 5) + s(id, bs = "re"))
> >
> >    df[i, "model1_pval"] <- anova(b1$gam)$s.pv
> >    df[i, "model1_edf"] <- anova(b1$gam)$edf
> >    df[i, "model2_pval"] <- anova(b2$gam)$s.pv[[1]]
> >    df[i, "model2_edf"] <- anova(b2$gam)$edf[[1]]
> > }
> >
> > # Differences between model 1 and model 2 p values
> > mean(df$model1_pval)
> > #> [1] 6.790546e-21
> > mean(df$model2_pval)
> > #> [1] 3.090694e-14
> > max(abs(df$model1_pval - df$model2_pval))
> > #> [1] 2.770545e-12
> >
> > # Differences between model1 and model 2 estimated degrees of freedom
> > mean(df$model1_edf)
> > #> [1] 3.829992
> > mean(df$model2_edf)
> > #> [1] 3.731438
> > max(abs(df$model1_edf - df$model2_edf))
> > #> [1] 0.6320281
> >
> >
> > Thanks in advance,
> > Øystein Sørensen
> >
> >       [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.