Classification Tree Prediction Error


Classification Tree Prediction Error

Xu Jun
Dear all R experts,

I have a question about using a held-out test set to assess results estimated
from a classification tree model. I annotated what each line does in the R
code chunk below. Basically, I split the data, named usedta, 70/30: the
training set gets 70% and the test set 30% of the original cases. After
splitting the data, I first fit a classification tree on the training set and
then validate the results on the test set. It turns out that if I use no
predictors at all and simply bet on the majority class of the zero-one-coded
binary response variable, I do better than the classification tree does on
the test set. What would this imply, and what could cause this problem? Does
it mean that a classification tree is not an appropriate method for my data,
or is it because I have too few variables? Thanks a lot!

Jun Xu, PhD
Professor
Department of Sociology
Ball State University
Muncie, IN 47306
USA

Using the estimates, I get the following correct-prediction rate on the test
set; equivalently, the misclassification error rate is 1 - 0.837 = 0.163

> (tab[1,1] + tab[2,2]) / sum(tab)
[1] 0.837


Without any predictors, betting on the majority class every time gives the
following rate, again on the test set. In this case, the misclassification
error rate is 1 - 0.85 = 0.15

> table(h2.test)
h2.test
1poorHlth 0goodHlth
      101       575
> 571/(571+101)
[1] 0.85



R Code Chunk

# load the tree package, which provides tree() and tree.control()
library(tree)
# set the seed of the random number generator for replicability
set.seed(47306)
# 70/30 split: sample the row indices that form the training set
class.train <- sample(1:nrow(usedta), nrow(usedta) * 0.7)
# negative indices select the complement, i.e. the test set
class.test <- -class.train
# binary response from the test set, for later cross-tabulation
h2.test <- usedta$h2[class.test]
# number of training cases
Ntrain <- length(usedta$h2[class.train])
# fit the classification tree on the training set;
# h2 is the binary response, the other variables are predictors
tree.h2 <- tree(h2 ~ age + educ + female + white + married + happy,
                data = usedta, subset = class.train,
                control = tree.control(nobs = Ntrain, mindev = 0.003))
# summarize the fitted tree
summary(tree.h2)
# predict h2 for the test set
tree.h2.pred <- predict(tree.h2, usedta[class.test, ], type = "class")
# cross-tabulate predictions against the observed test-set response
tab <- table(tree.h2.pred, h2.test)
tab
# proportion correctly predicted in the test set
(tab[1, 1] + tab[2, 2]) / sum(tab)
# proportion correct under the naive approach of always
# betting on the majority category
table(h2.test)[2] / sum(tab)
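For what it's worth, the tree package also offers cv.tree() for genuine
K-fold cross-validation on the training data, to choose a pruning size.
Here is a minimal sketch of that workflow on a built-in dataset (iris,
since usedta is not attached); the steps mirror the chunk above:

```r
library(tree)  # CRAN package providing tree(), cv.tree(), prune.misclass()

# Illustrative data: iris stands in for usedta here
set.seed(47306)
train <- sample(seq_len(nrow(iris)), 0.7 * nrow(iris))
fit <- tree(Species ~ ., data = iris, subset = train)
# K-fold cross-validation on the TRAINING data to choose a subtree size
cv <- cv.tree(fit, FUN = prune.misclass, K = 10)
best <- cv$size[which.min(cv$dev)]  # size with the lowest CV error count
pruned <- prune.misclass(fit, best = max(best, 2))
# evaluate the pruned tree on the held-out 30%
pred <- predict(pruned, iris[-train, ], type = "class")
mean(pred == iris$Species[-train])  # hold-out accuracy
```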


______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Classification Tree Prediction Error

Bert Gunter
Purely statistical questions -- as opposed to R programming queries -- are
generally off topic here. They are on topic at
https://stats.stackexchange.com/

Suggestion: when you post, do include the name of the package your tree()
comes from, as more than one package may provide a function with that name.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )



Re: Classification Tree Prediction Error

Xu Jun
Thank you for your comment! The tree function is from the tree package.
Although it might be a purely statistical question, it could be related to
how the tree function is used. I will explore the site you suggested, but if
anyone can figure it out off the top of their head, I'd very much appreciate
it.

Jun


Re: Classification Tree Prediction Error

John Smith
As Bert correctly advised, this is not an R programming question. There is
some misunderstanding about how training and test data work together in
prediction. Consider an extreme case: suppose your test data contain only
one class. Then betting on the majority class every time yields a
misclassification rate of 0, and of course no classification algorithm can
beat a prediction that already uses the truth from the test data. The
majority-class baseline is a benchmark, not a fair competitor. As it stands,
the tree model you fit has accuracy 0.837, which is very close to the 0.85
baseline; I would not complain.
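To make the point concrete, here is a base-R sketch with hypothetical
numbers (shaped like yours, but not taken from your data): when about 85%
of the test cases fall in one class, a classifier must be right more than
85% of the time to beat the majority-class baseline.

```r
# Hypothetical 2x2 confusion matrix for an imbalanced test set
# (numbers chosen for illustration only)
tab <- matrix(c(40, 61, 50, 525), nrow = 2,
              dimnames = list(pred  = c("poorHlth", "goodHlth"),
                              truth = c("poorHlth", "goodHlth")))
acc_tree <- (tab[1, 1] + tab[2, 2]) / sum(tab)  # classifier accuracy
acc_base <- max(colSums(tab)) / sum(tab)        # always predict the majority class
round(acc_tree, 3)  # 0.836
round(acc_base, 3)  # 0.851
```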
