
I'm reading this tutorial that combines PCA and logistic regression in a pipeline and then applies cross-validation with a defined set of parameters for PCA and logistic regression. Here is what I understood from the example, and then I will ask my question.

I understood:

When GridSearchCV is executed, it defaults to 3 folds. So it starts by computing PCA with 20 components, transforms the data, and passes the result to logistic regression for training. Then, for each value of the logistic regression C parameter, it applies 3-fold cross-validation, so it ends up with 3*3 = 9 trainings of logistic regression: 3 values of C times 3 folds per value.

After that it does the same with the second PCA parameter value, 40, so another 9 trainings, and then 9 more for the last PCA value, 64. So in total we have 9 * 3 = 27 trainings for logistic regression.
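For reference, the setup described above can be sketched as follows (the exact C values and data loading are assumptions based on the scikit-learn PCA-plus-logistic-regression example; only the grid shape matters for the fit count):

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_digits, y_digits = datasets.load_digits(return_X_y=True)

# PCA feeds its transformed output into logistic regression.
pipe = Pipeline([("pca", PCA()),
                 ("logistic", LogisticRegression(max_iter=1000))])

n_components = [20, 40, 64]   # three PCA settings
Cs = [0.1, 1.0, 10.0]         # three C values (assumed grid)

param_grid = {"pca__n_components": n_components, "logistic__C": Cs}
# 3 n_components x 3 Cs = 9 candidates; with 3-fold CV, 27 fits.
estimator = GridSearchCV(pipe, param_grid, cv=3)
```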

My question: is my understanding correct for the procedure?

Yes, this is entirely correct. You can easily check it by setting the grid search procedure in verbose mode:

>>> estimator = GridSearchCV(pipe, dict(pca__n_components=n_components,
...                                     logistic__C=Cs),
...                          verbose=1)
>>> estimator.fit(X_digits, y_digits)
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[...snip...]

More generally, the number of fit calls is the product of the number of values per parameter, times k, plus 1 if you refit the best parameters on the full training set (which happens by default).
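That counting rule can be checked with a tiny calculation, using the numbers from the question:

```python
# Fit calls for an exhaustive grid search:
# (product of values per parameter) * k folds, + 1 for the final refit.
n_pca_values = 3   # n_components in {20, 40, 64}
n_C_values = 3     # three C values
k = 3              # default number of CV folds here

candidates = n_pca_values * n_C_values  # 9 parameter combinations
cv_fits = candidates * k                # 27 cross-validation fits
total_fits = cv_fits + 1                # 28, counting the final refit
print(candidates, cv_fits, total_fits)  # 9 27 28
```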

Note that for the current version of scikit-learn (0.18.1), the PCA step is performed inside each fold of the GridSearchCV cross-validation, even though its output doesn't change with the parameter C. This should be corrected in version 0.19 of scikit-learn. – Victor Deplasse May 13, 2017 at 11:31

@VictorDeplasse Do you mean that after version 0.19 there would be only 9 fits per CV iteration (three logistic regression fits for each PCA fit)? – drake Nov 16, 2018 at 14:13
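Regarding the comment above: since scikit-learn 0.19, one way to avoid refitting an identical transformer across candidates is transformer caching via the Pipeline's `memory` argument. A minimal sketch (the cache directory choice is an assumption; any writable path works):

```python
from tempfile import mkdtemp
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

cachedir = mkdtemp()  # temporary directory used as a joblib cache

# With memory set, fitted transformers are cached, so a PCA fitted with
# the same parameters on the same training fold is reused rather than
# recomputed for every value of C.
cached_pipe = Pipeline(
    [("pca", PCA(n_components=20)),
     ("logistic", LogisticRegression(max_iter=1000))],
    memory=cachedir,
)
```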
