
Sklearn StratifiedKFold: ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead

num_classes = len(np.unique(y_train))
y_train_categorical = keras.utils.to_categorical(y_train, num_classes)
kf=StratifiedKFold(n_splits=5, shuffle=True, random_state=999)
# splitting data into different folds
for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train_categorical)):
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.

keras.utils.to_categorical produces a one-hot encoded class vector, i.e. the multilabel-indicator mentioned in the error message. StratifiedKFold is not designed to work with such input; from the split method docs:

split(X, y, groups=None)

[...]

y : array-like, shape (n_samples,)

The target variable for supervised learning problems. Stratification is done based on the y labels.

i.e. your y must be a 1-D array of your class labels.

Essentially, what you have to do is simply invert the order of the operations: split first (using your initial y_train), and convert with to_categorical afterwards.
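A minimal sketch of the corrected order, with made-up data standing in for x_train/y_train; np.eye(num_classes)[labels] is used as a stand-in for keras.utils.to_categorical(labels, num_classes) so the snippet stays keras-free:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# made-up data: y_train holds the original 1-D integer class labels
rng = np.random.RandomState(0)
x_train = rng.rand(20, 3)
y_train = np.array([0, 1, 2, 3] * 5)
num_classes = len(np.unique(y_train))

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)
for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):  # split on 1-D labels
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    # one-hot encode *after* splitting
    # (np.eye(...)[labels] mirrors keras.utils.to_categorical(labels, num_classes))
    y_train_kf = np.eye(num_classes)[y_train[train_index]]
    y_val_kf = np.eye(num_classes)[y_train[val_index]]
```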

I don't think this is a good idea, because in an imbalanced dataset with a multi-class classification problem, the validation part whose labels you want to convert may not contain all the classes. So when you call to_categorical(val, n_class) it will raise an error. – Minions Nov 8, 2018 at 14:55

@Minions this is not correct; StratifiedKFold takes care that "The folds are made by preserving the percentage of samples for each class" (docs). In very special cases where some of the classes are very under-represented, some extra caution (and manual checks) is obviously recommended, but the answer here is about the general case only and not for other, hypothetical ones... – desertnaut Nov 8, 2018 at 15:53

I have tried inverting the order, and still get the same error. Any ideas? I have made a topic for my problem, if you could check that, thanks. – Murilo Apr 7 at 8:25

@Murilo please open a new question with the details and a minimal reproducible example; link here if necessary – desertnaut Apr 7 at 14:30

Call to split() like this:

for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train_categorical.argmax(1))):
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]

If your target variable is continuous then use simple KFold cross validation instead of StratifiedKFold.

from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
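A quick sketch of how the plain KFold split would look with a continuous (regression) target, using hypothetical data:

```python
import numpy as np
from sklearn.model_selection import KFold

# hypothetical regression data
X = np.arange(20).reshape(10, 2)
y = np.linspace(0.0, 1.0, 10)  # continuous target; stratification does not apply

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, val_index in kfold.split(X):  # note: y is not needed for the split itself
    X_tr, X_val = X[train_index], X[val_index]
    y_tr, y_val = y[train_index], y[val_index]
```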

I bumped into the same problem and found out that you can check the type of the target with this util function:

>>> from sklearn.utils.multiclass import type_of_target
>>> type_of_target(y)
'multilabel-indicator'

From its docstring:

  • 'binary': y contains <= 2 discrete values and is 1d or a column vector.
  • 'multiclass': y contains more than two discrete values, is not a sequence of sequences, and is 1d or a column vector.
  • 'multiclass-multioutput': y is a 2d array that contains more than two discrete values, is not a sequence of sequences, and both dimensions are of size > 1.
  • 'multilabel-indicator': y is a label indicator matrix, an array of two dimensions with at least two columns, and at most 2 unique values.
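For instance, checking a few small targets against these rules:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

print(type_of_target(np.array([0, 1, 1, 0])))        # 'binary'
print(type_of_target(np.array([0, 1, 2, 2])))        # 'multiclass'
print(type_of_target(np.array([[1, 0], [0, 1]])))    # 'multilabel-indicator'
```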
With LabelEncoder you can transform your classes into a 1-D array of numbers (given your target labels are in a 1-D array of categoricals/objects):

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(target_labels)

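A quick usage sketch, with made-up string labels standing in for target_labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

target_labels = np.array(['cat', 'dog', 'cat', 'bird'])  # hypothetical labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(target_labels)
print(y)                       # [1 2 1 0] -- classes are encoded in sorted order
print(label_encoder.classes_)  # ['bird' 'cat' 'dog']
```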

In my case, x was a 2-D matrix, and y was also a 2-D matrix, i.e. indeed a multi-class multi-output case. I just passed a dummy np.zeros(shape=(n, 1)) for y, and x as usual. Full code example:

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [3, 7], [9, 4]])
# y = np.array([0, 0, 1, 1, 0, 1])  # <<< works
y = X  # does not work if passed into `.split`
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=36851234)
for train_index, test_index in rskf.split(X, np.zeros(shape=(X.shape[0], 1))):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
What is the point of using StratifiedKFold if you do not pass the labels to it? Simply use KFold instead. – Mehraban May 29, 2018 at 8:31

StratifiedKFold would normally use the target, but in my particular shortcut I'm passing 0's for the target, so you're right. – Shadi May 29, 2018 at 8:48

Complementing what @desertnaut said: to convert your one-hot encoding back to a 1-D array, all you need to do is:

class_labels = np.argmax(y_train, axis=1)

This will convert it back to the initial representation of your classes.
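A round-trip sketch of this, with made-up labels (np.eye is used as a stand-in for keras.utils.to_categorical so the snippet has no keras dependency):

```python
import numpy as np

y_orig = np.array([0, 2, 1, 2])            # original integer labels
one_hot = np.eye(3)[y_orig]                # one-hot encode (stand-in for to_categorical)
class_labels = np.argmax(one_hot, axis=1)  # recover the labels
print(class_labels)  # [0 2 1 2]
```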
