
Sklearn StratifiedKFold: ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead

num_classes = len(np.unique(y_train))
y_train_categorical = keras.utils.to_categorical(y_train, num_classes)
kf=StratifiedKFold(n_splits=5, shuffle=True, random_state=999)
# splitting data into different folds
for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train_categorical)):
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.

keras.utils.to_categorical produces a one-hot encoded class vector, i.e. the multilabel-indicator mentioned in the error message. StratifiedKFold is not designed to work with such input; from the split method docs:

split(X, y, groups=None)

[...]

y : array-like, shape (n_samples,)

The target variable for supervised learning problems. Stratification is done based on the y labels.

i.e. your y must be a 1-D array of your class labels.

Essentially, what you have to do is simply invert the order of the operations: split first (using your initial y_train), and convert with to_categorical afterwards.
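A minimal sketch of the corrected order, with made-up data standing in for x_train/y_train; np.eye(num_classes)[labels] is used as a stand-in for keras.utils.to_categorical(labels, num_classes) so the snippet stays keras-free:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# made-up data: y_train holds the original 1-D integer class labels
rng = np.random.RandomState(0)
x_train = rng.rand(20, 3)
y_train = np.array([0, 1, 2, 3] * 5)
num_classes = len(np.unique(y_train))

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)
for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):  # split on 1-D labels
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    # one-hot encode *after* splitting
    # (np.eye(...)[labels] mirrors keras.utils.to_categorical(labels, num_classes))
    y_train_kf = np.eye(num_classes)[y_train[train_index]]
    y_val_kf = np.eye(num_classes)[y_train[val_index]]
```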

I don't think this is a good idea, because in an imbalanced dataset with a multi-class classification problem, the validation part whose labels you want to convert may not contain all the classes. So when you call to_categorical(val, n_class) it will raise an error. – Minions Nov 8, 2018 at 14:55

@Minions this is not correct; StratifiedKFold takes care that "The folds are made by preserving the percentage of samples for each class" (docs). In very special cases where some of the classes are very under-represented, some extra caution (and manual checks) is obviously recommended, but the answer here is about the general case only and not for other, hypothetical ones... – desertnaut Nov 8, 2018 at 15:53

I have tried inverting the order, and still get the same error. Any ideas? I have made a topic for my problem, if you could check that, thanks. – Murilo Apr 7 at 8:25

@Murilo please open a new question with the details and a minimal reproducible example; link here if necessary – desertnaut Apr 7 at 14:30

Call to split() like this:

for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train_categorical.argmax(1))):
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]

If your target variable is continuous then use simple KFold cross validation instead of StratifiedKFold.

from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
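A quick sketch of how the plain KFold split would look with a continuous (regression) target, using hypothetical data:

```python
import numpy as np
from sklearn.model_selection import KFold

# hypothetical regression data
X = np.arange(20).reshape(10, 2)
y = np.linspace(0.0, 1.0, 10)  # continuous target; stratification does not apply

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, val_index in kfold.split(X):  # note: y is not needed for the split itself
    X_tr, X_val = X[train_index], X[val_index]
    y_tr, y_val = y[train_index], y[val_index]
```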

I bumped into the same problem and found out that you can check the type of the target with this util function:

>>> from sklearn.utils.multiclass import type_of_target
>>> type_of_target(y)
'multilabel-indicator'

From its docstring:

  • 'binary': y contains <= 2 discrete values and is 1d or a column vector.
  • 'multiclass': y contains more than two discrete values, is not a sequence of sequences, and is 1d or a column vector.
  • 'multiclass-multioutput': y is a 2d array that contains more than two discrete values, is not a sequence of sequences, and both dimensions are of size > 1.
  • 'multilabel-indicator': y is a label indicator matrix, an array of two dimensions with at least two columns, and at most 2 unique values.
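For instance, checking a few small targets against these rules:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

print(type_of_target(np.array([0, 1, 1, 0])))        # 'binary'
print(type_of_target(np.array([0, 1, 2, 2])))        # 'multiclass'
print(type_of_target(np.array([[1, 0], [0, 1]])))    # 'multilabel-indicator'
```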
With LabelEncoder you can transform your classes into a 1-D array of numbers (given your target labels are in a 1-D array of categoricals/objects):

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(target_labels)

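A quick usage sketch, with made-up string labels standing in for target_labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

target_labels = np.array(['cat', 'dog', 'cat', 'bird'])  # hypothetical labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(target_labels)
print(y)                       # [1 2 1 0] -- classes are encoded in sorted order
print(label_encoder.classes_)  # ['bird' 'cat' 'dog']
```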

In my case, x was a 2-D matrix, and y was also a 2-D matrix, i.e. indeed a multi-class multi-output case. I just passed a dummy np.zeros(shape=(n, 1)) for y, and x as usual. Full code example:

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [3, 7], [9, 4]])
# y = np.array([0, 0, 1, 1, 0, 1])  # <<< works
y = X  # does not work if passed into `.split`
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=36851234)
for train_index, test_index in rskf.split(X, np.zeros(shape=(X.shape[0], 1))):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
What is the point of using StratifiedKFold if you do not pass the labels to it? Simply use KFold instead. – Mehraban May 29, 2018 at 8:31

StratifiedKFold would normally use the target, but in my particular shortcut I'm passing 0's for the target, so you're right. – Shadi May 29, 2018 at 8:48

Complementing what @desertnaut said: to convert your one-hot encoding back to a 1-D array, all you need to do is:

class_labels = np.argmax(y_train, axis=1)

This will convert it back to the initial representation of your classes.
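A round-trip sketch of this, with made-up labels (np.eye is used as a stand-in for keras.utils.to_categorical so the snippet has no keras dependency):

```python
import numpy as np

y_orig = np.array([0, 2, 1, 2])            # original integer labels
one_hot = np.eye(3)[y_orig]                # one-hot encode (stand-in for to_categorical)
class_labels = np.argmax(one_hot, axis=1)  # recover the labels
print(class_labels)  # [0 2 1 2]
```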
