python - Creating train/test/val split with StratifiedKFold

I'm trying to use StratifiedKFold to create train/test/val splits for use in a non-sklearn machine learning workflow, so the DataFrame needs to be split once and then stay that way.
I'm trying to do it like the following, using .values because I'm passing pandas DataFrames:
skf = StratifiedKFold(n_splits=3, shuffle=False)
skf.get_n_splits(X, y)
for train_index, test_index, valid_index in skf.split(X.values, y.values):
    print("TRAIN:", train_index, "TEST:", test_index,  "VALID:", valid_index)
    X_train, X_test, X_valid = X.values[train_index], X.values[test_index], X.values[valid_index]
    y_train, y_test, y_valid = y.values[train_index], y.values[test_index], y.values[valid_index]
This fails with: 
ValueError: not enough values to unpack (expected 3, got 2).
I read through all of the sklearn docs and ran the example code, but did not gain a better understanding of how to use stratified k fold splits outside of a sklearn cross-validation scenario. 
EDIT:
I also tried like this:
# Create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)
# Create validation split from train split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.05)
Which seems to work, although I imagine I'm messing with the stratification by doing so.
                What is your question exactly? In what way does this behave differently from your expectations?
– Ryan Stout
                Jul 20, 2017 at 17:56
I'm not exactly sure if this question is about KFold or just stratified splits, but I wrote this quick wrapper for StratifiedKFold that also yields a cross-validation set for each fold.
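import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
class StratifiedKFold3(StratifiedKFold):
    def split(self, X, y, groups=None):
        # Convert to an array so positional indexing also works for a pandas Series
        y = np.asarray(y)
        s = super().split(X, y, groups)
        for train_indxs, test_indxs in s:
            y_train = y[train_indxs]
            # Carve a validation (cv) part, the same size as the test part, out of the training indices
            train_indxs, cv_indxs = train_test_split(train_indxs, stratify=y_train, test_size=(1 / (self.n_splits - 1)))
            yield train_indxs, cv_indxs, test_indxs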
It can be used like this:
import numpy as np
X = np.random.rand(100)
y = np.random.choice([0,1],100)
g = StratifiedKFold3(10).split(X,y)
train, cv, test = next(g)
train.shape, cv.shape, test.shape
>> ((80,), (10,), (10,))
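As a sanity check on a wrapper like this, the sketch below loops over all folds and confirms that the three index arrays never overlap within a fold. The class is repeated (with `np.asarray(y)` added) only so the snippet runs on its own, and the balanced toy labels are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

class StratifiedKFold3(StratifiedKFold):
    # Same idea as the wrapper above, repeated so this snippet is self-contained
    def split(self, X, y, groups=None):
        y = np.asarray(y)
        for train_idx, test_idx in super().split(X, y, groups):
            train_idx, cv_idx = train_test_split(
                train_idx, stratify=y[train_idx],
                test_size=1 / (self.n_splits - 1))
            yield train_idx, cv_idx, test_idx

# Toy data: 100 samples, two balanced classes
X = np.random.rand(100)
y = np.array([0, 1] * 50)

test_seen = []
for train_idx, cv_idx, test_idx in StratifiedKFold3(5).split(X, y):
    combined = np.concatenate([train_idx, cv_idx, test_idx])
    # Within a fold, the three parts are disjoint and cover every sample
    assert len(np.unique(combined)) == len(X)
    test_seen.append(test_idx)

# Across folds, each sample lands in the test part exactly once
all_test = np.sort(np.concatenate(test_seen))
```

Note that only the test parts partition the data across folds. The cv parts come from a fresh random split inside each fold, so the cv sets of different folds can overlap.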
StratifiedKFold can only be used to split your dataset into two parts per fold. You are getting an error because the split() method will only yield a tuple of train_index and test_index (see https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/model_selection/_split.py#L94).
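To see this directly, here is a minimal sketch on made-up toy data that collects what `split()` yields. Every item is a pair of index arrays, which is why unpacking into three names raises the `ValueError`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (made up for illustration): 12 samples, two balanced classes
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

skf = StratifiedKFold(n_splits=3, shuffle=False)
folds = list(skf.split(X, y))

# Each item is a (train_index, test_index) pair, never a triple
n_parts = [len(fold) for fold in folds]
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in folds]
print(n_parts, fold_sizes)
```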
For this use case you should first split your data into validation and rest, and then split the rest again into test and train, like so (0.25 of the remaining 80% is 20% of the full data, so this gives a 60/20/20 train/test/val split):
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, train_size=0.8, stratify=y)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, train_size=0.75, stratify=y_rest)
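If you want to state the validation and test fractions relative to the full dataset, the second `test_size` has to be rescaled by what is left after the first cut. A minimal sketch of that arithmetic follows; `stratified_three_way_split` is a hypothetical helper name and the toy data is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_three_way_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Hypothetical helper: stratified train/val/test split from two chained calls."""
    # First cut: carve the validation set off the full data
    X_rest, X_val, y_rest, y_val = train_test_split(
        X, y, test_size=val_frac, stratify=y, random_state=seed)
    # Second cut: test_frac is relative to the full data, so rescale it
    rel_test = test_frac / (1.0 - val_frac)
    X_train, X_test, y_train, y_test = train_test_split(
        X_rest, y_rest, test_size=rel_test, stratify=y_rest, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Toy data: 1000 samples with a 70/30 class imbalance
X = np.random.rand(1000, 3)
y = np.array([0] * 700 + [1] * 300)

X_train, X_val, X_test, y_train, y_val, y_test = stratified_three_way_split(X, y)
sizes = (len(y_train), len(y_val), len(y_test))
print(sizes, y_train.mean(), y_val.mean(), y_test.mean())
```

With the defaults this should come out at roughly 60/20/20, with each part keeping close to the original 30% share of class 1.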
                I'm not sure the use of stratify='column' here, but when I run your code on my data I get: TypeError: Singleton array array('column', dtype='<U6') cannot be considered a valid collection.
– tw0000
                Jul 25, 2017 at 4:05
Pass the target array to the stratify parameter. For the first split, pass the complete target array (y in my case). For the second split, pass the target of the subset that is being split again (y_train in my case):
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)
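Since the question asks for a split that stays fixed for a non-sklearn workflow, one option is to split row positions instead of values and record the assignment in a column. This is only a sketch: the toy frame and the column names `feature`, `label`, and `split` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df; the column names are hypothetical
rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": rng.random(200), "label": [0] * 150 + [1] * 50})
labels = df["label"].to_numpy()

# Split row positions rather than values, stratifying on the target each time
idx = np.arange(len(df))
train_idx, test_idx = train_test_split(
    idx, test_size=0.2, random_state=42, stratify=labels)
train_idx, val_idx = train_test_split(
    train_idx, test_size=0.2, random_state=42, stratify=labels[train_idx])

# Record the assignment in a column so the split can be saved and reused
df["split"] = "train"
df.loc[df.index[val_idx], "split"] = "val"
df.loc[df.index[test_idx], "split"] = "test"

# Share of class 1 per split; each should be close to the overall 0.25
label_share = df.groupby("split")["label"].mean()
print(df["split"].value_counts())
print(label_share)
```

Writing `df` to disk with the `split` column then fixes the assignment, and any downstream tool can filter on it.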
Here is my stab at it, nesting another StratifiedGroupKFold inside the first split. First we work out how many folds are needed to get the train indices, then we look at the ratio between val and test and split the held-out part accordingly.
Note that there are some caveats here that I did not check for. For example, when the number of groups is quite low, we may "run out of" groups before we reach the test-val split: with 10 groups and 0.9, 0.05 and 0.05 splits, the train set will use up 9 groups, leaving only 1 to share between test and val.
Furthermore, this code does not work if the requested train ratio is not the largest. In that case you should invert the train and val-test assignment, as I did with the inner val and test split.
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold
# set the ratios for train, validation, and test splits
train_ratio = 0.5
val_ratio = 0.1
test_ratio = 0.4
assert train_ratio >= 0.5, "This code only works when train_ratio is the biggest"
num_splits = int(1 / (val_ratio + test_ratio))
N = 10000
X = np.random.rand(N, 10)
groups = np.random.randint(0, 100, N)
y = np.random.randint(0, 10, N)
num_folds = 3
for fold in range(num_folds):
    # We instantiate a new one every time since we control the number of folds ourselves
    sgkf = StratifiedGroupKFold(n_splits=num_splits, random_state=fold, shuffle=True)
    for train_indices, val_test_indices in sgkf.split(X, y, groups):
        X_train = X[train_indices]
        y_train = y[train_indices]
        groups_train = groups[train_indices]
        X_val_test = X[val_test_indices]
        y_val_test = y[val_test_indices]
        groups_val_test = groups[val_test_indices]
        # Now we have to split it based on the ratio between test and val
        split_ratio = test_ratio / val_ratio
        test_val_order = True
        if split_ratio < 1: # In this case we invert the ratio and the assignment of test-val / val-test
            test_val_order = False
            split_ratio = 1 / split_ratio
        split_ratio = int(split_ratio) + 1
        sgkf2 = StratifiedGroupKFold(n_splits=split_ratio)
        i1, i2 = next(sgkf2.split(X_val_test, y_val_test, groups_val_test))
        if test_val_order:
            test_indices = i1
            val_indices = i2
        else:
            test_indices = i2
            val_indices = i1
        X_val = X_val_test[val_indices]
        groups_val = groups_val_test[val_indices]
        X_test = X_val_test[test_indices]
        groups_test = groups_val_test[test_indices]
        print("train groups = ", np.unique(groups_train))
        print("val groups =", np.unique(groups_val))
        print("test groups =", np.unique(groups_test))
        print(X_train.shape, X_val.shape, X_test.shape)
    print()