I understand that the common practice for selecting the CP value is to choose the row of the cptable with the minimum xerror value. However, in my case below, cp <- fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"] gives 0.17647059, and pruning with this value leaves no splits, just the root.

> myFormula <- Kyphosis~Age+Number+Start
> set.seed(1)
> fit <- rpart(myFormula,data=data,method="class",control=rpart.control(minsplit=20,xval=10,cp=0.01))
> fit$cptable
          CP nsplit rel error   xerror      xstd
1 0.17647059      0 1.0000000 1.000000 0.2155872
2 0.01960784      1 0.8235294 1.000000 0.2155872
3 0.01000000      4 0.7647059 1.058824 0.2200975

Is there an alternative or a good practice for selecting the CP value?

Generally, a cptable like the one you have is a warning that the tree is probably of no use at all and is unlikely to generalise well to future data. So the answer is not to find another way to choose cp but rather to create a useful tree if you can, or to admit defeat and say that, based on the examples and features we have, we cannot create a model that is predictive of kyphosis.

In your case, all is not necessarily lost. The data set is very small, and the cross-validation that gives rise to the xerror column is very volatile. If you set your seed to 2 or to 3 you will see very different answers in that column (some even worse).
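The volatility can be seen by refitting the same tree under several seeds. This is only a sketch: it assumes the data in the question is the kyphosis data set that ships with rpart, and the name xerror_by_seed is made up for illustration.

```r
library(rpart)  # provides rpart() and the kyphosis data set

# Refit the same tree under seeds 1 to 5 and keep only the xerror column.
# The CP and nsplit columns come from the full-data fit and do not depend
# on the seed, so every fit yields a vector of the same length.
xerror_by_seed <- sapply(1:5, function(seed) {
  set.seed(seed)
  fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               method = "class",
               control = rpart.control(minsplit = 20, xval = 10, cp = 0.01))
  fit$cptable[, "xerror"]
})
colnames(xerror_by_seed) <- paste0("seed_", 1:5)

# One column per seed, one row per row of the cptable.
print(round(xerror_by_seed, 3))
```

Each column is the xerror column you would get with that seed, so differences across a row show how much the 10-fold estimate moves around.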

So one interesting thing to try on this data is to increase the number of cross-validation folds to the number of observations, which gives you leave-one-out cross-validation (LOOCV). If you do this:

library(rpart)  # provides rpart() and the kyphosis data set
myFormula <- Kyphosis ~ Age + Number + Start
rpart_1 <- rpart(myFormula, data = kyphosis,
                 method = "class",
                 control = rpart.control(minsplit = 20, xval = 81, cp = 0.01))  # 81 folds = one per observation
rpart_1$cptable

you will find a CP table that you will like better! (Note that setting a seed is no longer necessary, since with one fold per observation the folds are the same each time.)

If you have computing time to spare, control = rpart.control(xval = [data.length], minsplit = 2, minbucket = 1, cp = 0) will give you the most overfitted sequence of trees with the most informative k-fold cross-validation. With plotcp(model) and printcp(model) you can explore the whole range of possible trees – FairMiles Sep 25, 2018 at 16:42
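A minimal sketch of that suggestion, assuming the kyphosis data set from rpart and reading [data.length] as the number of rows; big_tree and n_obs are illustrative names only.

```r
library(rpart)  # provides rpart(), printcp(), plotcp() and kyphosis

n_obs <- nrow(kyphosis)  # one cross-validation fold per observation

# Grow the tree as far as the control settings allow: cp = 0 disables the
# complexity threshold, and minsplit = 2 / minbucket = 1 permit tiny nodes.
big_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class",
                  control = rpart.control(xval = n_obs, minsplit = 2,
                                          minbucket = 1, cp = 0))

printcp(big_tree)  # full table of CP, nsplit, rel error, xerror, xstd
plotcp(big_tree)   # xerror against cp / tree size
```

The resulting tree is meant to be pruned afterwards, not used as it stands.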

In general (and considering parsimony) you should prefer the smallest tree among those with a minimum xerror value, that is, among those whose xerror value is within [min(xerror) - xstd; min(xerror) + xstd].

According to rpart vignette: "Any risk within one standard error of the achieved minimum is marked as being equivalent to the minimum (i.e. considered to be part of the flat plateau). Then the simplest model, among all those “tied” on the plateau, is chosen."

See: https://stackoverflow.com/a/15318542/2052738

You can select the most appropriate cp value (to prune the initial your.tree, overfitted with rpart) with an ad-hoc function such as:

cp.select <- function(big.tree) {
  min.x <- which.min(big.tree$cptable[, 4]) # column 4 is xerror
  for (i in 1:nrow(big.tree$cptable)) {
    # return the cp of the first (smallest) tree within one xstd of the minimum
    if (big.tree$cptable[i, 4] < big.tree$cptable[min.x, 4] + big.tree$cptable[min.x, 5]) {
      return(big.tree$cptable[i, 1]) # column 5: xstd, column 1: cp
    }
  }
}
pruned.tree <- prune(your.tree, cp = cp.select(your.tree))

[In your particular example, all trees are equivalent so size 1 (no splits) is to be preferred, as the selected response already explained]
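The bracketed remark can be checked by hand against the cptable printed in the question. The sketch below hard-codes that table (so it does not need rpart) and applies the same one-standard-error rule; the variable names are illustrative.

```r
# cptable values copied from the question
cptable <- matrix(
  c(0.17647059, 0, 1.0000000, 1.000000, 0.2155872,
    0.01960784, 1, 0.8235294, 1.000000, 0.2155872,
    0.01000000, 4, 0.7647059, 1.058824, 0.2200975),
  ncol = 5, byrow = TRUE,
  dimnames = list(as.character(1:3),
                  c("CP", "nsplit", "rel error", "xerror", "xstd"))
)

# Threshold = minimum xerror plus its own xstd: 1.000000 + 0.2155872
min_row   <- which.min(cptable[, "xerror"])
threshold <- cptable[min_row, "xerror"] + cptable[min_row, "xstd"]

# All three rows fall below the threshold, so the plateau is the whole table
on_plateau <- which(cptable[, "xerror"] < threshold)

# The simplest tree on the plateau is the first row: CP 0.17647059, nsplit 0
chosen <- cptable[min(on_plateau), ]
print(chosen[c("CP", "nsplit")])
```

Because even the largest tree (xerror 1.058824) is below the threshold of about 1.2156, the rule falls back to the root-only tree.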
