I understand that the common practice for selecting the CP value is to choose the lowest level with the minimum xerror value. However, in my case below, using
cp <- fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]
gives me 0.17647059, which results in no split (just the root) after pruning with this value.
> myFormula <- Kyphosis~Age+Number+Start
> set.seed(1)
> fit <- rpart(myFormula,data=data,method="class",control=rpart.control(minsplit=20,xval=10,cp=0.01))
> fit$cptable
          CP nsplit rel error   xerror      xstd
1 0.17647059      0 1.0000000 1.000000 0.2155872
2 0.01960784      1 0.8235294 1.000000 0.2155872
3 0.01000000      4 0.7647059 1.058824 0.2200975
Is there an alternative or good practice for selecting the CP value?
Generally, a cptable like the one you have is a warning that the tree is probably of no use and unlikely to generalise well to future data. So the answer is not to find another way to choose cp, but rather to build a useful tree if you can, or to admit defeat and say that, based on the examples and features available, we cannot create a model that is predictive of kyphosis.
In your case, all is not necessarily lost. The data set is very small, and the cross-validation that produces the xerror column is very volatile. If you set your seed to 2 or to 3, you will see very different values in that column (some even worse).
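To see that volatility directly, here is a minimal sketch that refits the same tree under a handful of seeds and lines up the resulting xerror columns. It assumes the kyphosis data set shipped with the rpart package and the same control settings as in the question; the seed range 1:5 and the name xerror_by_seed are arbitrary choices for illustration.

```r
library(rpart)  # provides rpart() and the kyphosis data set

# Refit the same model under several seeds. Only the random assignment of
# observations to the 10 cross-validation folds changes between runs.
xerror_by_seed <- sapply(1:5, function(seed) {
  set.seed(seed)
  fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               method = "class",
               control = rpart.control(minsplit = 20, xval = 10, cp = 0.01))
  fit$cptable[, "xerror"]  # one xerror value per row of the CP table
})
colnames(xerror_by_seed) <- paste0("seed", 1:5)

# Rows are CP table rows, columns are seeds.
print(round(xerror_by_seed, 3))
```

If the columns differ noticeably from one seed to the next, picking cp from a single run is not telling you much.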
One interesting thing to try on this data is to increase the number of cross-validation folds to the number of observations, so that you get leave-one-out cross-validation (LOOCV). If you do this:
myFormula <- Kyphosis ~ Age + Number + Start
rpart_1 <- rpart(myFormula, data = kyphosis,
method = "class",
control = rpart.control(minsplit = 20, xval = 81, cp = 0.01))
rpart_1$cptable
you will find a CP table that you will like better! (Note that setting a seed is not necessary any more since the folds are the same each time).
In general (and considering parsimony), you should prefer the smallest tree among those whose xerror is within one standard error of the minimum, that is, any tree whose xerror value lies within [min(xerror) - xstd; min(xerror) + xstd].
According to rpart vignette: "Any risk within one standard error of the achieved minimum is marked as being equivalent to the minimum (i.e. considered to be part of the flat plateau). Then the simplest model, among all those “tied” on the plateau, is chosen."
See: https://stackoverflow.com/a/15318542/2052738
You can select the most appropriate cp value (to prune the initial your.tree, overfitted with rpart) with an ad-hoc function such as:
cp.select <- function(big.tree) {
  cptable <- big.tree$cptable
  min.x <- which.min(cptable[, "xerror"])  # row with the lowest xerror
  threshold <- cptable[min.x, "xerror"] + cptable[min.x, "xstd"]
  for (i in seq_len(nrow(cptable))) {
    # rows go from simplest to most complex, so return the first one under the threshold
    if (cptable[i, "xerror"] < threshold) return(cptable[i, "CP"])
  }
}
pruned.tree <- prune(your.tree, cp = cp.select(your.tree))
[In your particular example, all trees are equivalent so size 1 (no splits) is to be preferred, as the selected response already explained]
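That remark can be checked by hand against the CP table printed in the question. The sketch below rebuilds the table as a plain matrix (values copied as printed, so no rpart fit is needed) and applies the one-standard-error rule; the names min_row, threshold, on_plateau and selected_cp are only illustrative.

```r
# CP table copied from the question (values as printed there).
cptable <- matrix(
  c(0.17647059, 0, 1.0000000, 1.000000, 0.2155872,
    0.01960784, 1, 0.8235294, 1.000000, 0.2155872,
    0.01000000, 4, 0.7647059, 1.058824, 0.2200975),
  ncol = 5, byrow = TRUE,
  dimnames = list(as.character(1:3),
                  c("CP", "nsplit", "rel error", "xerror", "xstd"))
)

# One-standard-error rule: the threshold is the minimum xerror plus its xstd.
min_row   <- which.min(cptable[, "xerror"])
threshold <- cptable[min_row, "xerror"] + cptable[min_row, "xstd"]

# Every row below the threshold counts as tied with the minimum.
on_plateau <- cptable[, "xerror"] < threshold

# Rows are ordered from simplest to most complex, so take the first tied row.
selected_cp <- cptable[which(on_plateau)[1], "CP"]

print(on_plateau)
print(selected_cp)
```

All three rows fall under the threshold of about 1.2156, so the rule picks the first row (CP 0.17647059, no splits).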