Random Forest: OOB performance = test set performance?


thebudget72
Hi ML,

For random forest, I thought that the out-of-bag performance should be
the same as (or at least very similar to) the performance calculated on
a separate test set.

But this does not seem to be the case.

In the following code, the accuracy computed on the out-of-bag samples
is 77.81%, while the one computed on a separate test set is 81%.

Can you please check what I am doing wrong?

Thanks in advance and best regards.

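library(randomForest)
library(ISLR)

# Binary response: "No" if Sales <= 8, otherwise "Yes"
Carseats$High <- ifelse(Carseats$Sales <= 8, "No", "Yes")
Carseats$High <- as.factor(Carseats$High)

# Random 200-row training set; the remaining rows form the test set
train <- sample(1:nrow(Carseats), 200)

rf <- randomForest(High ~ . - Sales,
                   data = Carseats,
                   subset = train,
                   mtry = 6,
                   importance = TRUE)

# OOB accuracy from the OOB confusion matrix. The third column of
# rf$confusion is class.error, so sum only the two count columns.
acc <- (rf$confusion[1,1] + rf$confusion[2,2]) / sum(rf$confusion[, 1:2])
print(paste0("Accuracy OOB: ", round(acc*100,2), "%"))

# Accuracy on the held-out rows
yhat <- predict(rf, newdata = Carseats[-train,])
y <- Carseats[-train,]$High
conftest <- table(y, yhat)
acctest <- (conftest[1,1] + conftest[2,2]) / sum(conftest)
print(paste0("Accuracy test set: ", round(acctest*100,2), "%"))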

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Random Forest: OOB performance = test set performance?

plangfelder
I think the only thing you are doing wrong is not setting the random
seed (set.seed()), so your results are not reproducible. Depending on
the random sample used to select the training and test sets, you get
slightly different accuracies for both; sometimes one is better and
sometimes the other.

HTH,

Peter
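
To see the variation described above, one can repeat the split-and-fit for several seeds and compare the two accuracies side by side. This is only a minimal sketch, assuming the randomForest and ISLR packages are installed; the helper name compare_once and the choice of seeds 1 to 10 are illustrative and not from the thread.

```r
library(randomForest)
library(ISLR)

# Same response definition as in the original post, on a local copy.
carseats <- Carseats
carseats$High <- factor(ifelse(carseats$Sales <= 8, "No", "Yes"))

# Hypothetical helper: for a given seed, draw a training set, fit one
# forest, and return the OOB accuracy and the held-out test accuracy.
compare_once <- function(seed, n_train = 200) {
  set.seed(seed)
  train <- sample(nrow(carseats), n_train)
  rf <- randomForest(High ~ . - Sales, data = carseats,
                     subset = train, mtry = 6)
  # rf$predicted holds the OOB prediction for each training row,
  # and rf$y the corresponding observed class.
  oob_acc <- mean(rf$predicted == rf$y)
  test_pred <- predict(rf, newdata = carseats[-train, ])
  test_acc <- mean(test_pred == carseats$High[-train])
  c(oob = oob_acc, test = test_acc)
}

# One row per seed, columns "oob" and "test".
results <- t(sapply(1:10, compare_once))
print(round(100 * results, 2))

# Test minus OOB accuracy, in percentage points, across the seeds.
print(summary(100 * (results[, "test"] - results[, "oob"])))
```

With only 200 training rows and 200 test rows, differences of a few percentage points in either direction would not be surprising; averaging over many seeds gives a fairer picture than any single split.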

On Sat, Apr 10, 2021 at 8:49 PM <[hidden email]> wrote:

> [...]


Re: Random Forest: OOB performance = test set performance?

thebudget72
Thanks Peter.

Indeed, by setting a seed the two results are similar.

I am self-studying and wanted to make sure I understood the concept of
OOB samples and how reliable the performance metrics calculated on
them are.

It seems I did get it. That's good :)
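
As a check on the OOB idea: each tree is grown on a bootstrap sample, and each training row is predicted only by the trees whose bootstrap sample left it out. The sketch below, assuming the randomForest and ISLR packages and an arbitrary seed of 1 (not a value from the thread), reads the OOB accuracy from a fitted forest in three ways that should agree.

```r
library(randomForest)
library(ISLR)

carseats <- Carseats
carseats$High <- factor(ifelse(carseats$Sales <= 8, "No", "Yes"))

set.seed(1)  # arbitrary seed, chosen only for reproducibility
train <- sample(nrow(carseats), 200)
rf <- randomForest(High ~ . - Sales, data = carseats,
                   subset = train, mtry = 6)

# 1. Directly from the OOB predictions stored in the fit.
acc_pred <- mean(rf$predicted == rf$y)

# 2. From the OOB confusion matrix, after dropping the class.error
#    column so that only the counts are summed.
cm <- rf$confusion[, colnames(rf$confusion) != "class.error"]
acc_cm <- sum(diag(cm)) / sum(cm)

# 3. From the running OOB error rate after the last tree.
acc_err <- 1 - rf$err.rate[rf$ntree, "OOB"]

print(c(predicted = acc_pred, confusion = acc_cm, err.rate = acc_err))
```

Note that summing the whole of rf$confusion, including class.error, slightly inflates the denominator and therefore understates the OOB accuracy.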

On 4/11/21 6:34 AM, Peter Langfelder wrote:

> [...]
