Trees (and Forests) with packages 'party' vs. 'partykit': Different results

apeshifter
Dear all,

I'm currently exploring a dataset with the help of conditional inference trees (still very much a beginner with this technique & log. reg. methods as a whole t.b.h.), since they explained more variation in my dataset than a binary logistic regression with glm. I started out with the party package, but after a while I ran into the 'updated' partykit package and tried this out, too. Now, the strange thing is that both trees look quite different - actually even the very first split is different.

So I did some research and came across the 'forest' concept. However, it seems that the varImp function does not yet work in the partykit implementation, which raises the question of how I should evaluate the partykit forest - how can I find out whether the same variables are important in the forest as in my partykit tree? Is there some way to do this or some other solution for this problem? I'd prefer to continue with the partykit implementation of ctree, since it allows more settings for the final plot, which I'd need to get the final (large) plot into a readable form.

Related to this project, I'd also like to give statistics for the overall model, e.g. overall significance, Nagelkerke's R², a C-value. After a 'regular' binary log. reg., I would use the lrm function to get these values, but I am unsure whether it would be correct to also apply this method to my tree data.

Any help would be greatly appreciated!

-- Christopher

Re: Trees (and Forests) with packages 'party' vs. 'partykit': Different results

Achim Zeileis-4
Christopher,

thanks for your interest.

> I'm currently exploring a dataset with the help of conditional inference
> trees (still very much a beginner with this technique & log. reg.
> methods as a whole t.b.h.), since they explained more variation in my
> dataset than a binary logistic regression with glm. I started out with
> the party package, but after a while I ran into the 'updated'
> partykit package and tried this out, too.

If you want to use individual trees (as opposed to forests), then the
"partykit" package is recommended because it contains much improved
re-implementations of ctree() and mob() as well as the mob() convenience
interfaces lmtree() and glmtree(). For forests see below.

> Now, the strange thing is that both trees look quite different -
> actually even the very first split is different.

This might be due to several partitioning variables being associated with
tiny p-values in the root node. The re-implementation in partykit
internally computes with log-p-values and hence should be numerically more
stable. In the old implementation it could happen that from several highly
significant variables, always the first is chosen because the p-values
were essentially indistinguishable for the computer.
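
To illustrate the effect (a minimal sketch in base R, not the actual party or partykit code): the test statistics below are made-up values, and the chi-squared reference distribution with one degree of freedom is an assumption chosen for simplicity. On the raw scale the p-values underflow and tie, so taking the minimum falls back to the first variable; on the log scale they remain distinguishable.

```r
# Hypothetical test statistics for three highly significant variables
stat <- c(x1 = 2000, x2 = 3000, x3 = 2500)

# Raw p-values: the upper tail is far below the smallest representable
# double, so the values underflow and can no longer be told apart
p_raw <- pchisq(stat, df = 1, lower.tail = FALSE)

# Log p-values: computed directly on the log scale, they stay finite
p_log <- pchisq(stat, df = 1, lower.tail = FALSE, log.p = TRUE)

print(p_raw)
print(p_log)

# With tied raw p-values, which.min() returns the first position;
# on the log scale the variable with the largest statistic wins
best_raw <- names(stat)[which.min(p_raw)]
best_log <- names(stat)[which.min(p_log)]
print(c(raw = best_raw, log = best_log))
```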

If you think that this is not the problem, then please contact the package
maintainer with a reproducible example.

Except for bug fixes like the one above, the trees grown by
partykit::ctree and party::ctree should be the same.
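
A quick way to check this on a given dataset is to fit both implementations side by side. The following is only a sketch: it uses the built-in iris data as a stand-in for the real dataset, and it assumes that both packages are installed. The functions are called with their namespace prefix so that neither package masks the other.

```r
# Fit the same model with the old and the new implementation
ct_old <- party::ctree(Species ~ ., data = iris)
ct_new <- partykit::ctree(Species ~ ., data = iris)

# Inspect the tree structures, including the root split of each
print(ct_old)
print(ct_new)

# Share of observations for which both trees predict the same class
pred_old <- as.character(predict(ct_old))
pred_new <- as.character(predict(ct_new))
agree <- mean(pred_old == pred_new)
print(agree)
```

If the printed trees differ for your own data, the pair of calls above together with the data would already form most of a reproducible example.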

> So I did some research and came across the 'forest' concept. However, it
> seems that the varImp function does not yet work in the partykit
> implementation,

Correct. While the ctree() implementation in partykit is better than that
in party, the same is _not_ true for cforest(). The new partykit::cforest
is currently still a basic implementation which doesn't offer as many
features as the party::cforest implementation. More work is needed
especially for variable importance measures and different kinds of
predictions.
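
For the time being, variable importance can be obtained from the party implementation. A minimal sketch, again with the iris data as a placeholder and with arbitrary, untuned settings for ntree and mtry:

```r
library(party)

set.seed(42)  # forests are random; fix the seed for reproducibility

# Conditional inference forest from the party package
# (ntree and mtry are illustrative values, not recommendations)
cf <- cforest(Species ~ ., data = iris,
              controls = cforest_unbiased(ntree = 50, mtry = 2))

# Permutation-based variable importance; note the lower-case name varimp()
vi <- varimp(cf)
print(sort(vi, decreasing = TRUE))
```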

> which raises the question of how I should evaluate the partykit
> forest - how can I find out whether the same variables are important in
> the forest as in my partykit tree? Is there some way to do this or some
> other solution for this problem? I'd prefer to continue with the partykit
> implementation of ctree, since it allows more settings for the final
> plot, which I'd need to get the final (large) plot into a readable form.
>
> Related to this project, I'd also like to give statistics for the overall
> model, e.g. overall significance, Nagelkerke's R², a C-value. After a
> 'regular' binary log. reg., I would use the lrm function to get these
> values, but I am unsure whether it would be correct to also apply this
> method to my tree data.

Overall significance is difficult because you have done model selection
when growing the tree. As for pseudo R-squared or information criteria
etc., it is relatively easy to compute these "by hand" based on the
observed and fitted responses. An example for this is provided at:
http://stackoverflow.com/questions/29524670/how-to-find-the-the-deviance-of-an-as-party-object-converted-from-rpart-tree-in/29693223#29693223
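
Along the same lines, here is a sketch of such a "by hand" computation for a binary response. The data are simulated and the variable names (x1, x2, x3, y) are hypothetical; only the fitted probabilities from predict() and the observed responses are needed. Because the values are computed in-sample on a tree that was selected from the same data, they are likely to be optimistic.

```r
library(partykit)

# Simulated binary data (hypothetical stand-in for the real dataset)
set.seed(1)
n <- 500
d <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = factor(sample(c("a", "b"), n, replace = TRUE))
)
d$y <- factor(rbinom(n, 1, plogis(1.5 * d$x1 - (d$x3 == "b"))),
              levels = 0:1, labels = c("no", "yes"))

ct <- ctree(y ~ x1 + x2 + x3, data = d)

# Observed response (0/1) and in-sample fitted probability of "yes"
y <- as.integer(d$y == "yes")
p <- predict(ct, type = "prob")[, "yes"]

# Binomial log-likelihood of the tree and of the null (intercept-only) model
ll1 <- sum(ifelse(y == 1, log(p), log(1 - p)))
p0  <- mean(y)
ll0 <- sum(y * log(p0) + (1 - y) * log(1 - p0))

# Cox-Snell R-squared, rescaled by its maximum to give Nagelkerke's R-squared
r2_cs         <- 1 - exp(2 * (ll0 - ll1) / n)
r2_nagelkerke <- r2_cs / (1 - exp(2 * ll0 / n))

# C-value (area under the ROC curve) via the rank-sum formula
n1 <- sum(y == 1)
n0 <- sum(y == 0)
c_index <- (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(c(logLik = ll1, nagelkerke = r2_nagelkerke, C = c_index))
```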

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Trees (and Forests) with packages 'party' vs. 'partykit': Different results

apeshifter
Achim,

thank you very much for your help; this really cleared up a number of issues.

As for the differences in results between the party and partykit implementations of ctree, I guess that the situation is indeed as you assumed. Four out of five variables have p-values <2.2e-16. (However, it is not the first of these variables that is selected but the one in the second column.) I will just continue using the newer implementation.

-- Christopher