Odd results from rpart classification tree

Odd results from rpart classification tree

Marshall, Jonathan
The following code produces a tree with only a root, even though the tree with a split at x=0.5 is clearly better. rpart doesn't seem to want to produce it.

library(rpart)
y <- c(rep(0,65),rep(1,15),rep(0,20))
x <- c(rep(0,70),rep(1,30))
f <- rpart(y ~ x, method='class', minsplit=1, cp=0.0001, parms=list(split='gini'))

Computing the improvement for a split at x=0.5 manually:

obs_L <- y[x<.5]
obs_R <- y[x>.5]
n_L <- sum(x<.5)
n_R <- sum(x>.5)
gini <- function(p) {sum(p*(1-p))}
impurity_root <- gini(prop.table(table(y)))
impurity_L <- gini(prop.table(table(obs_L)))
impurity_R <- gini(prop.table(table(obs_R)))
n <- length(y)
improvement <- impurity_root * n - (n_L*impurity_L + n_R*impurity_R) # 2.880952

Thus, an improvement of 2.88 should result in a split. It does not.

Why?

Jonathan

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Odd results from rpart classification tree

Therneau, Terry M., Ph.D.
You are mixing up two of the steps in rpart.  1: how to find the best candidate split and
2: evaluation of that split.

With the "class" method we use the information or Gini criterion for step 1.  The code
finds a worthwhile candidate split at 0.5 using exactly the calculations you outline.  For
step 2 the criterion is the "decision theory" loss.  In your data the estimated rate is
5/70 = .071 for the left node and 10/30 = .333 for the right node.  As a decision rule both
predict y=0 (since both are < 1/2).  The split predicts 0 on the left and 0 on the right,
so it does nothing to reduce the misclassification loss.
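
The step 2 evaluation described above can be checked by hand. The following is a minimal sketch in base R, not rpart's internal code: node_loss is a hypothetical helper that counts misclassified observations when a node predicts its majority class, and the node rates are computed directly from the posted y and x.

```r
# Data from the original post
y <- c(rep(0, 65), rep(1, 15), rep(0, 20))
x <- c(rep(0, 70), rep(1, 30))

# Hypothetical helper (not an rpart function): number of observations
# misclassified when the node predicts its most frequent class
node_loss <- function(obs) {
  counts <- table(obs)
  length(obs) - max(counts)
}

# Estimated rate of y = 1 in each candidate child node
rate_L <- mean(y[x < 0.5])
rate_R <- mean(y[x > 0.5])

# Misclassification loss without and with the split at x = 0.5
loss_root  <- node_loss(y)
loss_split <- node_loss(y[x < 0.5]) + node_loss(y[x > 0.5])

c(rate_L = rate_L, rate_R = rate_R)
loss_root - loss_split   # reduction in misclassified observations
```

Both rates are below 1/2, so both children predict 0 and the loss with the split equals the loss at the root, which is why the Gini improvement alone does not lead to a retained split.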

The CART book (Breiman, Friedman, Olshen and Stone) on which rpart is based highlights the
difference between odds-regression (for which the final prediction is a percent, and error
is Gini) and classification.  For the former treat y as continuous.
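
As an illustration of treating y as continuous, here is a small sketch using the data from the original post with method = "anova" and rpart's default control settings. The data frame d and the object name fit are chosen for this example only, and the default settings are an assumption rather than part of the advice above.

```r
library(rpart)  # recommended package, included with standard R installations

# Data from the original post, kept numeric (0/1) rather than as a factor
d <- data.frame(
  y = c(rep(0, 65), rep(1, 15), rep(0, 20)),
  x = c(rep(0, 70), rep(1, 30))
)

# With method = "anova" each node predicts its mean of y, i.e. an
# estimated proportion, and splits are judged by reduction in sum of squares
fit <- rpart(y ~ x, data = d, method = "anova")

fit$splits[, "index"]          # split point chosen on x
sort(unique(predict(fit)))     # fitted proportion in each terminal node
```

If the split is made, the two fitted values should be the node proportions 5/70 and 10/30 rather than a single class label.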

Terry T.


Re: Odd results from rpart classification tree

Marshall, Jonathan
Thanks Terry!

I managed to figure that out shortly after posting (as is the way!). If I add a covariate that splits below one of the x branches but not the other, and that pushes the class proportion in a resulting node over 0.5, then the x split is retained.

However, I now have another conundrum, this time with rpart in anova mode...

library(rpart)
test_split <- function(offset) {
  y <- c(rep(0,10),rep(0.5,2)) + offset
  x <- c(rep(0,10),rep(1,2))
  if (is.null(rpart(y ~ x, minsplit=1, cp=0, xval=0)$splits)) 0 else 1
}

sum(replicate(1000, test_split(0))) # 1000, i.e. always splits
sum(replicate(1000, test_split(0.5))) # 2-12, i.e. splits only sometimes...

It is a bit strange that adding a constant to y gives different trees, and stranger still that the result varies from run to run.
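
For reference, the reduction in sum of squares for the split at x = 0.5 can be computed by hand in base R. This sketch does not explain rpart's behaviour. It only checks that the quantity an anova split is scored on is the same for both offsets, under the assumption that the split is judged by reduction in sum of squares. The helpers ss and split_gain are hypothetical names for this example.

```r
# Sum of squared deviations from the mean
ss <- function(v) sum((v - mean(v))^2)

# Reduction in sum of squares from splitting at x = 0.5,
# using the same data as test_split()
split_gain <- function(offset) {
  y <- c(rep(0, 10), rep(0.5, 2)) + offset
  x <- c(rep(0, 10), rep(1, 2))
  ss(y) - (ss(y[x < 0.5]) + ss(y[x > 0.5]))
}

split_gain(0)
split_gain(0.5)
all.equal(split_gain(0), split_gain(0.5))
```

Since the hand-computed gain does not depend on the offset, the differing trees presumably arise somewhere other than the split criterion itself.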

Will see if I can track down a copy of the CART book.

Jonathan
