Simulations of GAM and MARS models : sample size ; Y-outliers and missing X-data

R help mailing list-2
Dear Experts,

I have fitted MARS and GAM models to a real dataset. My goal is prediction. I have run cross-validation many times to estimate the out-of-sample prediction error, using the Mean Squared Error (MSE) as the evaluation criterion. I have published my paper and the reviewers have asked me to run simulations.
So my goal is now to run a simulation study, which may be a more objective way to compare the performance of these two algorithms. I want to figure out which method (GAM or MARS) performs better (lower MSE) and under which circumstances.
I want to consider three factors: the sample size n, the percentage of Y-outliers, and the percentage of missing X-data, and to assess the influence of each.

Sample size: n = 50, 100, 200, 300 and 500
Y-outliers: 10%, 20%, 30%, 40% and 50% of Y-outliers
Missing data: 10%, 20%, 30%, 40% and 50% of X missing data
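
The post does not define what "p% of Y-outliers" means. One possible reading, offered only as an assumption, is to shift a randomly chosen fraction p of the responses by several standard deviations. A minimal sketch of that reading, where add_y_outliers() and its shift argument are hypothetical names:

```r
# Hypothetical helper: contaminate a fraction `prop` of the responses by
# shifting each one `shift` standard deviations up or down (random direction).
add_y_outliers <- function(y, prop, shift = 5) {
  n.out <- round(prop * length(y))
  idx   <- sample(length(y), n.out)                  # rows to contaminate
  sgn   <- sample(c(-1, 1), n.out, replace = TRUE)   # direction of each shift
  y[idx] <- y[idx] + sgn * shift * sd(y)             # sd(y) is of the clean y
  list(y = y, idx = idx)
}

set.seed(1)
y   <- rnorm(200, mean = 100, sd = 10)
out <- add_y_outliers(y, prop = 0.10)
length(out$idx)          # number of contaminated responses
summary(out$y - y)       # zero for untouched rows, about +/- 5 sd otherwise
```

Other definitions (for example heavy-tailed noise for a subset of rows) are equally defensible; whichever is used should be stated in the paper.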

Below is the reproducible R code for GAM and MARS that I use to calculate the MSE by running cross-validation many times.
How can I modify my R code to simulate the sample size, the presence of Y-outliers and the presence of missing data?

### MSE CROSS-VALIDATION GAM (gam1)
# install.packages(c("ISLR", "mgcv"))   # run once if not yet installed
library(ISLR)   # provides the Wage data set
library(mgcv)

set.seed(123)
n.rep <- 1000                  # number of repetitions
p     <- 0.667                 # proportion of rows used for training
n     <- nrow(Wage)
mse   <- numeric(n.rep)        # stores the test MSE of each repetition

# Repeated random train/test splits
for (i in seq_len(n.rep)) {
  sam      <- sample(seq_len(n), floor(p * n), replace = FALSE)
  Training <- Wage[sam, ]
  Testing  <- Wage[-sam, ]

  # Fit on the training rows only, so the test rows stay unseen
  GAM1 <- gam(wage ~ education + s(age, bs = "ps") + year, data = Training)

  ypred  <- predict(GAM1, newdata = Testing)
  mse[i] <- mean((Testing$wage - ypred)^2)
}
mean(mse)
########

### MSE CROSS-VALIDATION MARS (mars1)
# install.packages(c("ISLR", "earth"))   # run once if not yet installed
library(ISLR)   # provides the Wage data set
library(earth)

set.seed(123)
n.rep <- 1000                  # number of repetitions
p     <- 0.667                 # proportion of rows used for training
n     <- nrow(Wage)
mse   <- numeric(n.rep)        # stores the test MSE of each repetition

# Repeated random train/test splits
for (i in seq_len(n.rep)) {
  sam      <- sample(seq_len(n), floor(p * n), replace = FALSE)
  Training <- Wage[sam, ]
  Testing  <- Wage[-sam, ]

  # Fit on the training rows only, so the test rows stay unseen
  mars1 <- earth(wage ~ age + as.factor(education) + year, data = Training)

  ypred  <- predict(mars1, newdata = Testing)
  mse[i] <- mean((Testing$wage - ypred)^2)
}
mean(mse)
#########

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Simulations of GAM and MARS models : sample size ; Y-outliers and missing X-data

Abby Spurdle
> How can I modify my R codes to simulate the sample size, the presence of Y-outliers and the presence of missing data ?

I don't know what it means for data to have 50% Y-outliers.
That's new to me...

As for the rest of your question:
Modify your code so that a single function, say sim.test(), computes your
simulated statistics for sample size n and proportion m of missing values,
and returns the results, say as a two-element list.
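
A minimal sketch of what such a sim.test() could look like. Everything beyond the name sim.test() and the two returned elements is an assumption made for illustration: the data-generating model in make.data(), the fixed clean test set, the wrapper names fit.gam and fit.mars, and the choice to put missing values only in the training predictors. Incomplete rows are dropped explicitly so that both fitters see exactly the same training rows.

```r
# Hypothetical data generator: two predictors, one nonlinear effect.
make.data <- function(n) {
  x1 <- runif(n); x2 <- runif(n)
  data.frame(y = sin(2 * pi * x1) + 2 * x2 + rnorm(n, sd = 0.3),
             x1 = x1, x2 = x2)
}

# Wrappers around the two fitters; the packages are only needed when called.
fit.gam  <- function(d) mgcv::gam(y ~ s(x1) + s(x2), data = d)
fit.mars <- function(d) earth::earth(y ~ x1 + x2, data = d)

sim.test <- function(n, m, n.test = 1000,
                     gam.fit = fit.gam, mars.fit = fit.mars) {
  train <- make.data(n)
  test  <- make.data(n.test)              # clean test set (assumption)
  xcols <- c("x1", "x2")
  # Set a fraction m of all predictor cells (collectively) to NA
  X <- as.matrix(train[xcols])
  X[sample(length(X), round(m * length(X)))] <- NA
  train[xcols] <- as.data.frame(X)
  train <- train[complete.cases(train), ] # same rows for both methods
  mse <- function(fit)
    mean((test$y - as.numeric(predict(fit, newdata = test)))^2)
  list(GAM.stat = mse(gam.fit(train)), MARS.stat = mse(mars.fit(train)))
}

# Runs only if both packages are installed
if (requireNamespace("mgcv", quietly = TRUE) &&
    requireNamespace("earth", quietly = TRUE)) {
  set.seed(123)
  str(sim.test(n = 200, m = 0.1))
}
```

With a small n and a large m only a handful of complete rows may remain, and the default smooths in gam() can then fail for lack of data, so the smallest design cells may need simpler models or a different missing-data strategy.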

Then write a top level script (or function), something like:

ns = c(50, 100, 200, 300, 500)
ms = (1:5) * 0.1

n = rep(ns, each = 5)
m = rep(ms, times = 5)
GAM.stat = MARS.stat = numeric(25)

for (i in 1:25)
{   results = sim.test(n[i], m[i], ...other.args...)
    GAM.stat[i] = results$GAM.stat
    MARS.stat[i] = results$MARS.stat
}

cbind(n, m, GAM.stat, MARS.stat)

Note that, in my experience, what you are doing may produce misleading
results, because your results depend on your simulated data.
(Different simulated data will produce different results, and different end
conclusions.)

I haven't checked how the functions you've used to fit the models handle
missing values.
But assuming that missing values are NAs, this should be easy to do.

Do you want *each* x variable to have m% missing values, or *all* the x
variables (collectively) to have m% missing values?
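
The two readings of "m% missing values" can be sketched as follows, starting from a data frame of predictors with no NAs. Both helper names, na.each() and na.all(), are hypothetical.

```r
# (a) each x variable separately loses a fraction m of its values
na.each <- function(X, m) {
  X[] <- lapply(X, function(col) {
    col[sample(length(col), round(m * length(col)))] <- NA
    col
  })
  X
}

# (b) all x variables collectively lose a fraction m of their cells
na.all <- function(X, m) {
  n <- nrow(X); p <- ncol(X)
  mask <- matrix(FALSE, n, p)
  mask[sample(n * p, round(m * n * p))] <- TRUE   # cells to blank out
  for (j in seq_len(p)) X[[j]][mask[, j]] <- NA
  X
}

set.seed(42)
X <- data.frame(x1 = rnorm(100), x2 = runif(100), x3 = rexp(100))
colMeans(is.na(na.each(X, 0.2)))   # same share of NAs in every column
colMeans(is.na(na.all(X, 0.2)))    # shares vary by column
mean(is.na(na.all(X, 0.2)))        # overall share of NA cells
```

Under (a) every column has the same number of NAs; under (b) only the total is fixed and the per-column counts vary from draw to draw.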

Re: Simulations of GAM and MARS models : sample size ; Y-outliers and missing X-data

Abby Spurdle
Sorry.
Some more comments.

(1) If you want an arbitrary sample size, you may need to write a function
to produce a simulated data set, for that given sample size.
Alternatively, you can use a random sample of size n, of an initial data
set, assuming the initial data set is relatively large.
(2) For each combination of n sample size and m% missing values, you may
need to compute your statistic, many (i.e. thousands of) times.
Then compute the mean and variance of your statistic.
(3) I have assumed that you want to start with a data.frame with no missing
values, and then set random subset(s) to missing values.
(4) Note that GAMs can have interaction terms.
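
Points (1) and (2) can be sketched as below: draw a random sample of size n from a large initial data set, compute the test MSE, repeat many times, then take the mean and variance. The data frame big, the function one.rep() and the lm() model are illustrative stand-ins only; in the actual study lm() would be replaced by gam() or earth().

```r
# Large initial data set (simulated here purely for illustration)
set.seed(2019)
N   <- 5000
big <- data.frame(x = runif(N))
big$y <- 1 + 2 * big$x + rnorm(N, sd = 0.5)

# One repetition: train on n sampled rows, test on the remaining rows
one.rep <- function(n, data) {
  idx   <- sample(nrow(data), n)
  train <- data[idx, ]
  test  <- data[-idx, ]
  fit   <- lm(y ~ x, data = train)        # stand-in for gam() / earth()
  mean((test$y - predict(fit, newdata = test))^2)
}

R   <- 200                                # use thousands in a real study
mse <- replicate(R, one.rep(n = 100, data = big))
c(mean = mean(mse), var = var(mse))
```

Reporting the variance (or a standard error) alongside the mean shows whether a difference between the two methods is larger than the simulation noise.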


Re: Simulations of GAM and MARS models : sample size ; Y-outliers and missing X-data

R help mailing list-2
In reply to this post by Abby Spurdle
Dear Abby,

Many thanks for your response.

To answer your question: I would prefer all the x variables (collectively) to have m% missing values.

You wrote: "Modify your code so that a single function, say sim.test(), computes your simulated statistics, for n sample size and m missing values, and returns the results, say as a two-element list".
I trust you and think it is a really good idea, but I don't know how to do that... :=(

On Thursday, 8 August 2019 at 05:29:55 UTC+2, Abby Spurdle <[hidden email]> wrote:

> [quoted message snipped; see above]

Re: Simulations of GAM and MARS models : sample size ; Y-outliers and missing X-data

Abby Spurdle
> For me better all the x variables (collectively), to have m% missing values.

I checked the mgcv documentation.
Observations with any missing values are ignored
(i.e. the entire row of your input data is dropped).

"If there are missing values in the response or covariates of a GAM then the
default is simply to use only the ‘complete cases’."
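
A practical consequence of complete-case fitting: if a fraction m of the covariate cells is missing completely at random across p covariates, the share of usable rows is roughly (1 - m)^p, so the effective sample size can be far below n. A small base-R sketch under that assumption, with lm() used only because it also drops incomplete rows by default:

```r
# Effective sample size under complete-case analysis.
# Assumption: a fraction m of the n * p covariate cells is set to NA
# completely at random.
set.seed(7)
n <- 500; p <- 3; m <- 0.3
d <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

mask <- matrix(FALSE, n, p)
mask[sample(n * p, round(m * n * p))] <- TRUE
for (j in seq_len(p)) d[[j + 1]][mask[, j]] <- NA   # columns 2..4 are x1..x3

sum(complete.cases(d))                 # rows a complete-case fit can use
n * (1 - m)^p                          # rough expectation
nobs(lm(y ~ x1 + x2 + x3, data = d))   # rows actually used by lm()
```

With 50% missing cells and several covariates, very few rows survive, which confounds the effect of missingness with the effect of sample size in the simulation design.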

I haven't checked the ISLR package.
