Let me give an outline of how to answer Alfredo's question via an example.

I will split the data set "lung" into two pieces. For these subjects with
advanced lung cancer the physician's assessment of ECOG performance status
(ph.ecog) is one of the most powerful indicators of outcome. Try to
predict it from other variables.

library(survival)  # for the test data set
library(rpart)

data1 <- lung[1:125,]
data2 <- lung[126:228,]

rfit1 <- rpart(ph.ecog ~ ., data=data1)
printcp(rfit1)

        CP nsplit rel error  xerror     xstd
1 0.565788      0   1.00000 1.04037 0.100516
2 0.098516      1   0.43421 0.44906 0.045001
3 0.042708      2   0.33570 0.35134 0.041692
4 0.031032      3   0.29299 0.37610 0.042971
5 0.019949      4   0.26196 0.37753 0.044692
6 0.010000      5   0.24201 0.39166 0.050332

# Validate using data2. First get predictions for each of the pruned trees
cpvalues <- rfit1$cptable[,1]
pmat <- matrix(0, nrow(data2), length(cpvalues))
for (i in 1:length(cpvalues))
    pmat[,i] <- predict(prune(rfit1, cpvalues[i]), newdata=data2)

Now we need to decide on a measure of error. Try simple squared error.

error <- colMeans((data2$ph.ecog - pmat)^2)
round(error, 3)
[1] 0.493 0.280 0.210 0.225 0.186 0.198
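Once an error measure is chosen, the remaining step is to pick the cp value with the smallest validation error and prune there. Below is a minimal sketch, assuming mean squared error is the measure you want. The names best and final_tree are mine for illustration, and na.rm=TRUE is added only as a guard in case the validation response has missing values.

```r
library(survival)  # for the lung data set
library(rpart)

data1 <- lung[1:125, ]
data2 <- lung[126:228, ]
rfit1 <- rpart(ph.ecog ~ ., data = data1)

# One column of validation predictions per row of the cp table
cpvalues <- rfit1$cptable[, "CP"]
pmat <- sapply(cpvalues, function(cp)
    predict(prune(rfit1, cp = cp), newdata = data2))

# Mean squared error of each pruned tree on the validation set
error <- colMeans((data2$ph.ecog - pmat)^2, na.rm = TRUE)

# Index of the tree with the smallest validation error, then prune there
best <- which.min(error)
final_tree <- prune(rfit1, cp = cpvalues[best])
```

With the error values printed above, which.min would select the fifth tree (4 splits). A different error measure could easily select a different tree.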

This is simple, but other cases are more complex. The performance score is
actually an integer from 0-4 (5 = dead), see
http://ecog-acrin.org/resources/ecog-performance-status

table(lung$ph.ecog)

  0   1   2   3
 63 113  50   1

Suppose instead we fit a model that treats the response as categorical.
The total number of nested models is a bit smaller (5 rather than 6).

rfit2 <- rpart(ph.ecog ~ ., data=data1, method="class")
printcp(rfit2)

       CP nsplit rel error  xerror     xstd
1 0.35938      0   1.00000 1.00000 0.086951
2 0.12500      1   0.64062 0.64062 0.081854
3 0.06250      2   0.51562 0.70312 0.083662
4 0.03125      4   0.39062 0.57812 0.079610
5 0.01000      5   0.35938 0.56250 0.078977

predict(rfit2, newdata=data2)[1:5,]

          0      1       2 3
126 0.03125 0.9375 0.03125 0
127 0.03125 0.9375 0.03125 0
128 0.03125 0.9375 0.03125 0
129 0.03125 0.9375 0.03125 0
130 0.37500 0.6250 0.00000 0

Now, we can ask for predicted probabilities for each class (default), which is a vector
of length 4 for each subject, or for the predicted class, which is a single value. Which
do we want, and then what is the best measure of prediction error?
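If the predicted class is what you want, one candidate measure is the plain misclassification rate on the validation set. This is a sketch of that single choice, not a recommendation. The names cp2 and misclass are mine, and type="class" is the predict.rpart option that returns the predicted class instead of the probability matrix.

```r
library(survival)  # for the lung data set
library(rpart)

data1 <- lung[1:125, ]
data2 <- lung[126:228, ]
rfit2 <- rpart(ph.ecog ~ ., data = data1, method = "class")

cp2 <- rfit2$cptable[, "CP"]

# Fraction of validation subjects whose predicted class differs from the
# observed ph.ecog, computed for each pruned tree in turn
misclass <- sapply(cp2, function(cp) {
    pclass <- predict(prune(rfit2, cp = cp), newdata = data2, type = "class")
    mean(as.character(pclass) != as.character(data2$ph.ecog), na.rm = TRUE)
})

round(misclass, 3)
```

This treats every wrong class as equally wrong, so predicting 2 for a true 0 costs the same as predicting 1.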

If three subjects with true value 0 had predicted class probability vectors of
(.8, .2, 0, 0), (.8, .1, .1, 0) and (.45, .25, .2, .1), one outlook would say
they are all the same (all pick 0 as the best class); others would give them
different errors. Is the second prediction worse than the first?
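To make the disagreement concrete, here is a small sketch that scores those three probability vectors under four common measures. The choice of measures (misclassification, Brier score, log loss, and an expected absolute distance that uses the ordering of the classes) is my own illustration; none of them is built into rpart.

```r
# The three predicted probability vectors; the true class is 0 for all three
p <- rbind(c(.80, .20, .00, .00),
           c(.80, .10, .10, .00),
           c(.45, .25, .20, .10))
classes <- 0:3
truth <- c(0, 0, 0)

# 1. Misclassification: is the most probable class wrong?
pred_class <- classes[max.col(p, ties.method = "first")]
misclass <- as.numeric(pred_class != truth)

# 2. Brier score: squared distance from the one-hot vector of the truth
onehot <- diag(length(classes))[truth + 1, ]
brier <- rowSums((p - onehot)^2)

# 3. Log loss: minus the log of the probability given to the true class
logloss <- -log(p[cbind(seq_along(truth), truth + 1)])

# 4. Expected absolute distance, treating the classes as ordered scores
ord <- as.vector(abs(outer(truth, classes, "-") * p) %*% rep(1, 4))

rbind(misclass, brier, logloss, ord)
```

Misclassification rates all three as equally good. The Brier scores are 0.08, 0.06 and 0.415, so the second prediction looks slightly better than the first. Log loss ties the first two, and the ordinal distance (0.2, 0.3, 0.95) says the second is worse. The answer to "is the second worse than the first?" depends entirely on the measure.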

What if the single subject with ph.ecog=3 had ended up in the validation data
set; how should we judge their prediction?

This complexity is one reason that there is not a simple function for
"validation" with a new data set.

On 02/27/2017 09:48 AM, Alfredo wrote:

> Thank you, Terry, for your answer.

>

> I’ll try to explain my question better. When you create a classification or regression
> tree you first grow a tree based on a splitting criterion: this usually results in a large
> tree that provides a good fit to the training data. The problem with this tree is its
> potential for overfitting the data: the tree can be tailored too specifically to the
> training data and not generalize well to new data. The solution (apart from cross-validation)
> is to find a smaller subtree that results in a low error rate on *holdout or validation data.*
>
> Hope this helps to clarify my question.

>

> Best,

>

> Alfredo

>

> -----Original Message-----
> From: Therneau, Terry M., Ph.D. [mailto:[hidden email]]

>

> You will need to give more detail of exactly what you mean by "prune using a validation
> set". The prune.rpart function will prune at any value you want; what I suspect you are
> looking for is to compute the error of each possible tree, using a validation data set,
> then find the best one, and then prune there.

>

> How do you define "best"?

>
