Hi,
We are analyzing the relationship between the abundance of groupers in line transects and some variables. We are using the quasipoisson distribution. Do we need to include the length of the transects as an offset if they all have the same length?

Also, can we include in the gam models variables that are measured at different spatial scales? We have done an analysis to see which variables perform better for different sizes of buffers around the transect lines, and some variables are better at different scales. Can we run the gam model with several explanatory variables if they are measured at different spatial scales?

Thanks,
Lucia |
In reply to this post by Lucia Rueda
Could you specify the package you use? If it is mgcv, this one centers your variables before applying the smooths. That's something to take into account when comparing different models.

In any case, if scales are too different, I try rescaling by either:
- expressing things in different units (meter versus kilometer, gram versus kilogram)
- dividing by the standard deviation to get all variables approximately on the same order of magnitude. This does change the interpretation of your model though.

But somehow I have the feeling you're not talking about that kind of difference in scales. Could you please explain in a bit more detail what exactly you're trying to do? I also suspect some autocorrelation problem, which would direct you towards a gamm method.

Cheers
Joris

On Wed, May 19, 2010 at 10:37 AM, Lucia Rueda <[hidden email]> wrote:
> [...]
> View this message in context:
> http://r.789695.n4.nabble.com/offset-in-gam-and-spatial-scale-of-variables-tp2222483p2222483.html
> Sent from the R help mailing list archive at Nabble.com.

--
Joris Meys
Statistical Consultant
Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control
Coupure Links 653
B-9000 Gent
tel : +32 9 264 59 87
[hidden email]
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. |
Hi Joris,
We're using mgcv.

We have data on abundance of groupers on line transects that all have the same length. My coworker has selected a bunch of variables and has calculated them in terms of total area in different sizes of buffers around the centroid of the transect. He has run gam models (quasipoisson, mgcv) for each explanatory variable at each buffer size. Then he has selected the significant variables. Some variables explain a higher percentage of deviance at different buffer sizes. And now he wants to build a gam model trying the different explanatory variables, but using for each one the values from the buffer size at which it explains the most deviance, so one variable might have the values of a smaller scale whereas another might correspond to a larger buffer size (I don't know if I made myself clear). I am wondering if this is correct.

Also, I don't know if he should include an offset even though all the transects have the same length.

I'm in charge of looking at the spatial correlation once he builds the model. I don't know much about it, but I was thinking of doing a Moran test, correlogram and variogram, and then, if there's spatial autocorrelation, using gamm, sar or gee.

Thanks,
Lucia |
In reply to this post by Joris FA Meys
On Wednesday 19 May 2010 15:29, Joris Meys wrote:
> Could you specify the package you use? If it is mgcv, this one centers your
> variables before applying the smooths. That's something to take into
> account when comparing different models.

--- er, actually it only centres variables in this way for some smoothing bases, for numerical stability purposes: but this is done in a way that is user transparent and makes absolutely no difference to model interpretation or comparison.

Of course the smooths themselves are subject to `centering constraints' (but that's very different to centering the variables) --- these are just identifiability constraints --- all gam fitting packages have to put some identifiability constraints on the smooths, and the centering constraints used by `mgcv' and `gam' have the benefit of minimizing the standard errors on the constrained smooths.

best,
Simon

--
> Simon Wood, Mathematical Sciences, University of Bath, Bath, BA2 7AY UK
> +44 1225 386603 www.maths.bath.ac.uk/~sw283 |
In reply to this post by Lucia Rueda
> We are analyzing the relationship between the abundance of groupers in line
> transects and some variables. We are using the quasipoisson distribution.
> Do we need to include the length of the transects as an offset if they all
> have the same length?

--- not just for fitting, I suppose: although I guess you may need some care in interpreting the units of the fitted model predictions, if you leave it out.

> Also, can we include in the gam models variables that are measured at
> different spatial scales? We have done an analysis to see what variables
> are better for different sizes of buffers around the transect lines and
> some variables are better at different scales. Can we run the gam model
> with several explanatory variables if they are measured at different
> spatial scales?

--- Do you mean, for example, that sea surface temperature was measured in 10km grid squares by satellite, whereas salinity was measured every quarter nautical mile directly?

--- If so, I think that you can use such data, but you need a clear method for converting what is measured about the covariate to a covariate value associated with each response measurement. As an example you might have salinity measures that are widely scattered, and do not coincide with the locations of response measurements. One option is to smooth or interpolate the salinity values, and use the resulting predicted salinities at each response datum location as covariates. Of course if you do this sort of thing it's important that only such predicted salinities are used for predicting from the model (i.e. not to switch to direct measurements of salinity for prediction).

best,
Simon

> Thanks,
> Lucia

--
> Simon Wood, Mathematical Sciences, University of Bath, Bath, BA2 7AY UK
> +44 1225 386603 www.maths.bath.ac.uk/~sw283 |
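Simon's remark about the offset can be checked numerically. With a log link the model is log(mu) = log(L) + b0 + f(x), so when every transect has the same length L the term log(L) is a constant that is simply absorbed into the intercept. The sketch below is a minimal illustration in plain Python, not mgcv code: it fits an unpenalized Poisson log-linear model with one covariate by Newton-Raphson, once without an offset and once with a constant one. The data, the transect length of 50 and the function name `fit_poisson` are all made up for the illustration. For an unpenalized model like this one, quasi-Poisson point estimates coincide with the Poisson ones.

```python
import math

def fit_poisson(x, y, offset=None, iters=50):
    """Fit log(mu) = offset + b0 + b1*x by Newton-Raphson; returns (b0, b1)."""
    n = len(y)
    if offset is None:
        offset = [0.0] * n
    # Start from the intercept-only solution so the iteration is stable.
    b0 = math.log(sum(y) / n) - sum(offset) / n
    b1 = 0.0
    for _ in range(iters):
        mu = [math.exp(o + b0 + b1 * xi) for o, xi in zip(offset, x)]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Fisher information matrix (2 x 2).
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 linear system.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    return b0, b1

# Hypothetical data: one covariate and grouper counts on 8 transects.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
counts = [1, 3, 2, 5, 4, 8, 7, 12]
transect_length = 50.0  # assumed identical for every transect

b0_plain, b1_plain = fit_poisson(x, counts)
b0_offset, b1_offset = fit_poisson(
    x, counts, offset=[math.log(transect_length)] * len(x)
)

print("slope without offset:", b1_plain)
print("slope with offset:   ", b1_offset)
print("intercept difference:", b0_plain - b0_offset)
print("log(transect length):", math.log(transect_length))
```

The slope is the same in both fits and the intercepts differ by log(L). That is the sense in which the offset matters only for the units of the predictions (counts per transect versus counts per unit length) when all transects are equally long.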
In reply to this post by Simon Wood-4
Thank you for the correction. I was thinking about the difference between using a variable with a smoother and using that same variable without a smoother. I should specify that I mostly use thin plate regression splines. If the spline itself is not deviating from linearity, you get a nice straight line going through the point (mean(Data),0) if you look at the marginal plots. Use the same variable unchanged but as a simple linear effect in the model, and the same line will run through (0,0). At least, that's what I noticed, and it is also the reason why I center my variables first. The models are essentially the same; the shift is mainly in the intercept. But the centering has become a bit of a reflex.

Cheers
Joris

On Wed, May 19, 2010 at 8:20 PM, Simon Wood <[hidden email]> wrote:
> [...] |
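Joris's observation that centering a covariate only moves the intercept can be reproduced with ordinary least squares. The sketch below is a plain-Python stand-in for the linear-term case, not a gam fit. The depth and response values are hypothetical, and `ols` is a helper written for this illustration.

```python
def ols(x, y):
    """Least-squares fit y = a + b*x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Hypothetical covariate and response values.
depth = [12.0, 15.0, 18.0, 22.0, 25.0, 30.0]
resp = [2.1, 2.9, 3.2, 4.4, 4.8, 6.0]

mean_depth = sum(depth) / len(depth)
mean_resp = sum(resp) / len(resp)
centered = [d - mean_depth for d in depth]

a_raw, b_raw = ols(depth, resp)     # uncentered covariate
a_cen, b_cen = ols(centered, resp)  # centered covariate

print("slopes:    ", b_raw, b_cen)
print("intercepts:", a_raw, a_cen)
```

The two slopes agree, and the centered intercept equals `a_raw + b_raw * mean_depth`, which is the mean response. The fitted line is the same in both cases; only the reference point of the intercept changes.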
In reply to this post by Lucia Rueda
On Wed, May 19, 2010 at 4:51 PM, Lucia Rueda <[hidden email]> wrote:
> Hi Joris,
>
> We're using mgcv.
>
> We have data on abundance of groupers on line transects that have the same
> length.

I only now realized groupers are actually fish :-). Should work on my English skills...

> My coworker has selected a bunch of variables and he has calculated
> them in terms of total area in different sizes of buffers around the
> centroid of the transect. He has run gam models (quasipoisson, mgcv) for
> each explanatory variable at each size of buffer.

Here you lost me a bit. How should I imagine those buffers? Is it, as Simon said, some area? Then that would mean you measure e.g. salinity along the transect, and average the numbers using a window of a specific size? Or am I seeing it wrong?

> Then he has selected the
> significant variables. Some variables explain a higher percentage of
> deviance at different sizes of buffers. And now he wants to build a gam
> model trying the different explanatory variables but using the values that
> correspond to the size of the buffer where they explain a higher deviance,
> so one variable might have the values of a smaller scale whereas other
> might correspond to a higher buffer size (I don't know if I made myself
> clear). I am wondering if this is correct.

It seems not correct to me. Model building in these frameworks, especially when using inference, should be driven by hypotheses, not by any correlation in the data. Especially with smooths one has to be very careful.

Another issue is the correlation between environmental variables. They often covary along transects, meaning that you can have confounding and even aliasing in your dataset. This has to be checked and taken into account _before_ building the models. I have the impression that his approach does not take care of this.

Next, I believe that data should be used as raw as possible, so as not to jeopardize the interpretation. If you use different buffer sizes, you can't just say that variables X and Y contribute significantly to the explanation of the variation, only that variables X and Y contribute significantly depending on the scale at which they are measured.

It also depends on whether your goal is purely predictive, or if you want to do inference. In case you want to conclude something about the significance of the parameters, his approach seems invalid to me. How to explain that the significance of a variable depends on the scale of measurement? One assumes a continuous relation -unless working with factors- so the scale shouldn't make much of a difference anyway.

If you can predict the number of groupers by the amount of bald men in Hong-Kong, by all means, do so. But I wouldn't formulate a scientific conclusion based on the significance of that model, if you get my drift.

> Also I don't know if he should include an offset even though all the
> transects have the same length.

Do you mean an intercept? In that case I'd always include one, except in very specific cases.

> I'm in charge of looking at the spatial correlation once he builds the
> model. I don't know much about it but I was thinking of doing a Moran test,
> correlogram and variogram and then if there's spatial autocorrelation doing
> gamm, sar or gee.

Gamm is a very powerful tool, but -if I understood Simon's book correctly- you cannot trust the anova's on the gam-component of the gamm-object when using link functions. LR tests can give some information, but there is not a solid statistical framework yet for formal hypothesis testing of those models.

I also wonder why you would build a model without, and then do the same with, the correct variance-covariance structure. Personally, I'd do it the other way around. Not that it will change much about the predictions, but it definitely will change the inference.

In any case, all of these are my personal opinions on a problem I do not understand fully. These are some general considerations, feel free to think differently.

> Thanks,
> Lucia
> --
> View this message in context:
> http://r.789695.n4.nabble.com/offset-in-gam-and-spatial-scale-of-variables-tp2222483p2222976.html
> Sent from the R help mailing list archive at Nabble.com. |
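The check Joris asks for, looking at how strongly the environmental covariates covary before any model is built, can be sketched as a pairwise correlation screen. The example below is plain Python with invented values. The covariate names, the numbers and the 0.7 cut-off are assumptions made for the illustration, not anything reported in the thread.

```python
import math
from itertools import combinations

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu = sum(u) / n
    mv = sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Hypothetical covariate values, one entry per transect.
covariates = {
    "rock_area": [10.0, 14.0, 20.0, 26.0, 31.0, 40.0],
    "sand_area": [30.0, 27.0, 21.0, 15.0, 9.0, 2.0],
    "mean_depth": [8.0, 15.0, 9.0, 20.0, 12.0, 17.0],
}
THRESHOLD = 0.7  # arbitrary cut-off chosen for this sketch

flagged = []
for (name_a, vals_a), (name_b, vals_b) in combinations(covariates.items(), 2):
    r = pearson(vals_a, vals_b)
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
    if abs(r) > THRESHOLD:
        flagged.append((name_a, name_b))

print("pairs to look at before modelling:", flagged)
```

In this made-up data set rocky and sandy bottom are almost perfectly negatively correlated, so their separate effects could not be told apart in one model. That is the confounding problem described above.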
Hi,
Thanks for the inputs. I talked to my coworker, who has been the one doing the analysis. Perhaps I wasn't making myself clear about the "differences in spatial scales". Here is what he says:

"The truth is that the measuring scales (i.e. all area-related variables are measured in m2) and the spatial definition of the initial cartography are homogeneous among the extracted variables. But all variables (e.g. the sum of the total rocky bottom in the surrounding area) are computed for each of the different integration areas (buffers) (i.e. in an area of 40 square meters around the sample, in an area of 80 m2, …).

The question is then whether we can build a model that includes variables measured at different buffers (for example a model that includes 3 variables: 1. the amount of rocky bottom in an area of 80 m2; 2. the amount of sandy bottom in an area of 200 m2; and 3. the mean depth calculated in a surrounding area of 50 m2), considering that each variable may be expressing different ecological processes. I believe that if there is no ecological constraint on the interpretation of the variables (and their ecological effect on the species), including them in a model is correct, unless there is a mathematical constraint."

Also, about the spatial correlation, I thought from what I've read so far that I had to build the model and then check whether there was spatial correlation in the residuals, since they are supposed to be i.i.d. And if it turns out that they are spatially correlated, then I have to do something about it, like gamm, gee, sar, car, etc.

Cheers,
Lucia |
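To make the buffer construction described above concrete, the following plain-Python sketch computes one covariate (total rocky bottom) for several buffer sizes around a transect centroid. The habitat cells, their coordinates and the choice of a circular buffer are assumptions made for the illustration. Only the buffer areas of 40, 80 and 200 m2 come from the message.

```python
import math

# Hypothetical habitat map: (x, y, substrate, area_m2) for each mapped cell.
cells = [
    (1.0, 0.0, "rock", 4.0),
    (0.0, 2.0, "sand", 4.0),
    (3.0, 1.0, "rock", 4.0),
    (-4.0, 0.0, "sand", 4.0),
    (5.0, 5.0, "rock", 4.0),
    (0.0, -7.0, "rock", 4.0),
]

def buffer_total(cells, centre, buffer_area_m2, substrate):
    """Sum the area of `substrate` cells whose centre lies inside a circular buffer."""
    # Radius of a circle with the requested area (circular buffer assumed).
    radius = math.sqrt(buffer_area_m2 / math.pi)
    cx, cy = centre
    return sum(
        area
        for x, y, kind, area in cells
        if kind == substrate and math.hypot(x - cx, y - cy) <= radius
    )

centroid = (0.0, 0.0)  # transect centroid, placed at the origin for simplicity
rock_by_buffer = {
    size: buffer_total(cells, centroid, size, "rock")
    for size in (40.0, 80.0, 200.0)
}

for size, total in rock_by_buffer.items():
    print(f"rocky bottom within {size:.0f} m2 buffer: {total} m2")
```

Each covariate therefore yields one value per transect per buffer size, and the question in the thread is which of those columns, if any, may be combined in a single model.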
On Thu, May 20, 2010 at 3:20 PM, Lucia Rueda <[hidden email]> wrote:
> Hi,
>
> Thanks for the inputs. I talked to my coworker, who has been the one doing
> the analysis. Perhaps I wasn't making myself clear about the differences in
> spatial scales. Here is what he says:
>
> "The truth is that measuring scales (i.e all area related variable are
> measured in m2) and spatial definition of initial cartography are
> homogeneous among extracted variables. But all variables (ie. sum of the
> total rocky bottom in the surrounding area) are computed for each different
> integration areas (buffer) (i.e in an area of 40 square meters around the
> sample, in an area of 80m2, …).
> The question is then if we can build a model that includes variables
> measured at different buffers (for example a model that includes 3
> variables: 1.- the amount of rocky bottom in an area of 80m2 ; 2- the
> amount of sandy bottom in an area of 200m2; and the mean depth calculated
> in a surrounding area of 50m2) considering that each variable may be
> expressing different ecological processes. I believe that if there is not
> an ecological constrain in the interpretation of the variables (and their
> ecological effect over the specie), including them in a model is correct,
> unless there is not a mathematical constrain."

[...] different buffers, but as you said, you should be able to interpret them in an ecological way. I'd be surprised if depth and bottom have a different effect-scale, as they both are related to the territory of the animal. Plus, you cannot conclude anything from the difference in deviance explained. You can't say anything about the home range or so based on the observation that more deviance is explained when looking at a scale of 200m2, for example.

So if you have good ecological reasons to include them, you can, but if it's merely because at one scale they explain more of the deviance, I still believe it is a very dangerous approach...

> Also, about the spatial correlation I thought from what I've read so far
> that I had to build the model and then check if there was spatial
> correlation in the residuals since they are supposed to be i.i.d. And if it
> turns out that they are then I have to do something about it like gamm,
> gee, sar, car, etc.

That's an approach that is often used. In essence, that's true. Correlation between the raw data can be due to correlation with some other factor in space or time. But a pre-analysis of correlations and autocorrelations can already tell you quite a lot. In any case, you always have to check the residuals after the model building. My main point was that using the correlation structure will definitely influence the significance of the parameters.

Anyway, good luck with it. I learnt pretty fast that as long as you can explain what you're doing and why you're doing it, there's a big grey zone between right and wrong. Otherwise it wouldn't be statistics, would it? ;-)

Cheers
Joris

> Cheers,
>
> Lucia
> --
> View this message in context:
> http://r.789695.n4.nabble.com/offset-in-gam-and-spatial-scale-of-variables-tp2222483p2224528.html
> Sent from the R help mailing list archive at Nabble.com. |
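The residual check discussed in this exchange usually starts with Moran's I. As a rough sketch of what that statistic measures, the plain-Python function below computes it with binary neighbour weights (1 if two transect centroids are within `max_dist`, 0 otherwise). The coordinates, residual values and the distance threshold are invented for the illustration, and the weighting scheme is only one of several possible choices.

```python
import math

def morans_i(coords, values, max_dist):
    """Moran's I with binary weights: w_ij = 1 if 0 < dist(i, j) <= max_dist."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    cross = 0.0    # sum over neighbouring pairs of z_i * z_j
    weights = 0.0  # sum of all weights (S0)
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= max_dist:
                cross += z[i] * z[j]
                weights += 1.0
    return (n / weights) * cross / sum(zi * zi for zi in z)

# Hypothetical transect centroids laid out along a line, one unit apart.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]

# Residuals that are similar for neighbouring transects (spatially clustered).
clustered = [1.2, 0.8, 0.5, -0.4, -0.9, -1.2]
# Residuals that flip sign between neighbours.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

i_clustered = morans_i(coords, clustered, max_dist=1.0)
i_alternating = morans_i(coords, alternating, max_dist=1.0)

print("Moran's I, clustered residuals:  ", i_clustered)
print("Moran's I, alternating residuals:", i_alternating)
```

Values well above the expectation of -1/(n-1) under no autocorrelation indicate that neighbouring residuals resemble each other, and values well below it indicate the opposite. The sketch gives only the statistic; a permutation or analytic test would still be needed to judge significance.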