glm predict issue

glm predict issue

bravegag
Hello,

I have tried reading the documentation and googling for the answer, but after reviewing the online matches I end up more confused than before.

My problem is apparently simple. I fit a glm model (2^k experiment), and then I would like to predict the response variable (Throughput) for unseen factor levels.

When I try to predict I get the following error:
> throughput.pred <- predict(throughput.fit,experiments,type="response")
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
  factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000

Of course these are new factor levels; that is exactly what I am trying to achieve, i.e. extrapolate the values of Throughput.

Can anyone please advise? Below I include all details.

Thanks in advance,
Best regards,
Giovanni

> # define the extreme (factors and levels)
> experiments <- expand.grid(No_databases   = seq(1000,100,by=-200),
+       Partitioning   = c("sharding", "replication"),
+       No_middlewares = seq(500,100,by=-100),
+       Queue_size     = c(100))
> experiments$No_databases <- as.factor(experiments$No_databases)
> experiments$Partitioning <- as.factor(experiments$Partitioning)
> experiments$No_middlewares <- as.factor(experiments$No_middlewares)
> experiments$Queue_size <- as.factor(experiments$Queue_size)      
> str(experiments)
'data.frame': 50 obs. of  4 variables:
 $ No_databases  : Factor w/ 5 levels "200","400","600",..: 5 4 3 2 1 5 4 3 2 1 ...
 $ Partitioning  : Factor w/ 2 levels "sharding","replication": 1 1 1 1 1 2 2 2 2 2 ...
 $ No_middlewares: Factor w/ 5 levels "100","200","300",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Queue_size    : Factor w/ 1 level "100": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim     : Named int  5 2 5 1
  .. ..- attr(*, "names")= chr  "No_databases" "Partitioning" "No_middlewares" "Queue_size"
  ..$ dimnames:List of 4
  .. ..$ No_databases  : chr  "No_databases=1000" "No_databases= 800" "No_databases= 600" "No_databases= 400" ...
  .. ..$ Partitioning  : chr  "Partitioning=sharding" "Partitioning=replication"
  .. ..$ No_middlewares: chr  "No_middlewares=500" "No_middlewares=400" "No_middlewares=300" "No_middlewares=200" ...
  .. ..$ Queue_size    : chr "Queue_size=100"
> head(experiments)
  No_databases Partitioning No_middlewares Queue_size
1         1000     sharding            500        100
2          800     sharding            500        100
3          600     sharding            500        100
4          400     sharding            500        100
5          200     sharding            500        100
6         1000  replication            500        100
> # fit the model on the measured data (data frame 'throughput')
> throughput.fit <- glm(log(Throughput)~(No_databases*No_middlewares)+Partitioning+Queue_size,
+ data=throughput)
> summary(throughput.fit)

Call:
glm(formula = log(Throughput) ~ (No_databases * No_middlewares) +
    Partitioning + Queue_size, data = throughput)

Deviance Residuals:
    Min       1Q   Median       3Q      Max  
-2.5966  -0.6612  -0.1944   0.5548   3.2136  

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    5.74701    0.09127  62.970  < 2e-16 ***
No_databases4                  0.43309    0.10985   3.943 8.66e-05 ***
No_middlewares2               -1.99374    0.11035 -18.067  < 2e-16 ***
No_middlewares4               -1.23004    0.10969 -11.214  < 2e-16 ***
Partitioningreplication        0.33291    0.06181   5.386 9.15e-08 ***
Queue_size100                  0.15850    0.06181   2.564   0.0105 *  
No_databases4:No_middlewares2  2.71525    0.15262  17.791  < 2e-16 ***
No_databases4:No_middlewares4  1.94191    0.15226  12.754  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.8921778)

    Null deviance: 2175.58  on 936  degrees of freedom
Residual deviance:  828.83  on 929  degrees of freedom
AIC: 2562.2

Number of Fisher Scoring iterations: 2

> throughput.pred <- predict(throughput.fit,experiments,type="response")
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
  factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000
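
For readers who want to reproduce the error without the original data, here is a minimal sketch with simulated values. The data frame `train`, the response `y` and the levels used are made up for illustration; only the mechanism (a factor in `newdata` carrying levels the fit never saw) mirrors the session above.

```r
set.seed(1)

# Simulated stand-in for the 'throughput' data (hypothetical values).
# The predictor is stored as a factor with the levels "1" and "4".
train <- data.frame(
  No_databases = factor(rep(c(1, 4), each = 20)),
  y            = rnorm(40)
)
fit <- glm(y ~ No_databases, data = train)

# The fit remembers only the factor levels it was trained on.
print(fit$xlevels)

# New data whose factor levels ("200", "400") were never observed.
newdata <- data.frame(No_databases = factor(c(200, 400)))

# predict() stops with an error; capture its message instead of aborting.
msg <- tryCatch(
  {
    predict(fit, newdata, type = "response")
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

The captured message names the factor and the unseen levels, which is the same complaint raised by `model.frame.default` in the session above.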
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: glm predict issue

Weidong Gu-2
Hi,

This might be because factor levels are arbitrary labels unless the
factor is ordinal, and even then the quantitative relationship between
levels is unclear. Therefore, the model has no way to predict for
unseen factor levels.

Does it make sense to treat 'No_databases' as numeric instead of a
factor variable?
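
To illustrate the point, the following sketch compares the design matrices R builds for the same values coded as a factor and as a number. The toy data frame `d` and the column `x` are assumptions made for the example, not part of the original data.

```r
# Toy predictor with three observed values (hypothetical).
d <- data.frame(x = c(1, 2, 4))

# As a factor: one indicator column per non-reference level.
# There is no column that could represent an unseen value such as 3 or 200.
mm_factor <- model.matrix(~ factor(x), data = d)
print(mm_factor)

# As a numeric: a single slope column, so any value of x can be plugged in.
mm_numeric <- model.matrix(~ x, data = d)
print(mm_numeric)
```

With the factor coding each level gets its own coefficient, so a level absent from the training data has no coefficient at all; with the numeric coding one slope covers every value, at the price of assuming a linear effect.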

Weidong

On Mon, Dec 26, 2011 at 6:29 AM, Giovanni Azua <[hidden email]> wrote:


Re: glm predict issue

bbolker
Giovanni Azua <bravegag <at> gmail.com> writes:

>
> Hello,
>
> I have tried reading the documentation and googling for the answer, but
> after reviewing the online matches I end up more confused than before.
>
> My problem is apparently simple. I fit a glm model (2^k experiment), and
> then I would like to predict the response variable (Throughput) for
> unseen factor levels.
>
> When I try to predict I get the following error:
> > throughput.pred <- predict(throughput.fit,experiments,type="response")
> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
>   factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000
>
> Of course these are new factor levels; that is exactly what I am trying
> to achieve, i.e. extrapolate the values of Throughput.
>
> Can anyone please advise? Below I include all details.

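  Any predictors that you want to treat as continuous
(which would be the only way you can extrapolate to unobserved
values) should be numeric, not factor variables -- use

mydata <- transform(mydata, var=as.numeric(as.character(var)))

for example (going through as.character matters if var is already
a factor, because as.numeric on a factor returns the level codes).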
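
One pitfall when converting a column that is already a factor: `as.numeric()` applied directly to a factor returns the internal integer codes, not the numbers printed as labels. A small sketch, with a made-up factor `f`:

```r
# A factor whose labels look like numbers (hypothetical example).
f <- factor(c("200", "400", "1000"))

# Levels are sorted as strings, so "1000" comes first.
print(levels(f))

# Direct conversion yields the level codes, not the values.
codes <- as.numeric(f)
print(codes)

# Converting through character recovers the intended numbers.
values <- as.numeric(as.character(f))
print(values)
```

If the column is character or numeric to begin with, the direct `as.numeric(var)` call is fine; the detour through `as.character()` is only needed for factors.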

Re: glm predict issue

bravegag
Hi Ben,

Yes, thanks, you are right. I was able to fix it, but first I had to change the data frame on which I built my model to use numeric columns for those predictors. After making the grid values numeric as well, it finally worked.
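
A minimal sketch of what such a fix might look like, using simulated data since the original `throughput` data frame is not shown in the thread. The training values, the coefficients used to simulate `Throughput`, and the omission of `Queue_size` are all assumptions made to keep the example short.

```r
set.seed(42)

# Simulated training data with NUMERIC predictors (hypothetical values).
train <- expand.grid(
  No_databases   = c(1, 4),
  No_middlewares = c(1, 2, 4),
  Partitioning   = c("sharding", "replication"),
  rep            = 1:10
)
train$Partitioning <- factor(train$Partitioning,
                             levels = c("sharding", "replication"))
train$Throughput <- exp(
  5 + 0.1 * train$No_databases - 0.2 * train$No_middlewares +
    0.3 * (train$Partitioning == "replication") +
    rnorm(nrow(train), sd = 0.1)
)

# Same model form as in the thread, minus Queue_size.
fit <- glm(log(Throughput) ~ No_databases * No_middlewares + Partitioning,
           data = train)

# Prediction grid: the numeric columns stay numeric, and the factor
# reuses the levels seen during fitting.
grid <- expand.grid(
  No_databases   = seq(1000, 100, by = -200),
  Partitioning   = c("sharding", "replication"),
  No_middlewares = seq(500, 100, by = -100)
)
grid$Partitioning <- factor(grid$Partitioning,
                            levels = levels(train$Partitioning))

# The modelled response is log(Throughput), so predictions are on the
# log scale; exp() maps them back to the Throughput scale.
grid$log_pred        <- predict(fit, newdata = grid, type = "response")
grid$Throughput_pred <- exp(grid$log_pred)

head(grid)
```

The call now runs without the "new level(s)" error, but the predictions rest entirely on the linear trend being valid far outside the observed range (here 1 to 4 versus 100 to 1000), so the numbers should be read with caution.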

Thank you for your help!
Best regards,
Giovanni

On Dec 26, 2011, at 4:57 PM, Ben Bolker wrote:
