
I have a data set with 100 columns of continuous features and a continuous label, and I want to run SVR: selecting the relevant features, tuning the hyperparameters, and then cross-validating the model fit to my data.

I wrote this code:

from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import train_test_split, RepeatedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define the pipeline to evaluate
model = SVR()
fs = SelectKBest(score_func=mutual_info_regression)
pipeline = Pipeline(steps=[('sel',fs), ('svr', model)])
# define the grid
grid = dict()
#How many features to try
grid['estimator__sel__k'] = [i for i in range(1, X_train.shape[1]+1)]
# define the grid search
#search = GridSearchCV(pipeline, grid, scoring='neg_mean_squared_error', n_jobs=-1, cv=cv)
search = GridSearchCV(
        pipeline,
#        estimator=SVR(kernel='rbf'),
        param_grid={
            'estimator__svr__C': [0.1, 1, 10, 100, 1000],
            'estimator__svr__epsilon': [0.0001, 0.0005,  0.001, 0.005,  0.01, 0.05, 1, 5, 10],
            'estimator__svr__gamma': [0.0001, 0.0005,  0.001, 0.005,  0.01, 0.05, 1, 5, 10]
        },
        scoring='neg_mean_squared_error',
        verbose=1,
        n_jobs=-1)
for param in search.get_params().keys():
    print(param)
# perform the search
results = search.fit(X_train, y_train)
# summarize best
print('Best MAE: %.3f' % results.best_score_)
print('Best Config: %s' % results.best_params_)
# summarize all
means = results.cv_results_['mean_test_score']
params = results.cv_results_['params']
for mean, param in zip(means, params):
    print(">%.3f with: %r" % (mean, param))

I get the error:

ValueError: Invalid parameter estimator for estimator Pipeline(memory=None,
         steps=[('sel',
                 SelectKBest(k=10,
                             score_func=<function mutual_info_regression at 0x7fd2ff649cb0>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.

When I print estimator.get_params().keys(), as suggested in the error message, I get:

error_score estimator__memory estimator__steps estimator__verbose estimator__sel estimator__svr estimator__sel__k estimator__sel__score_func estimator__svr__C estimator__svr__cache_size estimator__svr__coef0 estimator__svr__degree estimator__svr__epsilon estimator__svr__gamma estimator__svr__kernel estimator__svr__max_iter estimator__svr__shrinking estimator__svr__tol estimator__svr__verbose estimator n_jobs param_grid pre_dispatch refit return_train_score scoring verbose

and then, once the search starts fitting:

Fitting 5 folds for each of 405 candidates, totalling 2025 fits

But when I change the line:

pipeline = Pipeline(steps=[('sel',fs), ('svr', model)])

to:

pipeline = Pipeline(steps=[('estimator__sel',fs), ('estimator__svr', model)])

I get the error:

ValueError: Estimator names must not contain __: got ['estimator__sel', 'estimator__svr']

Could someone explain what I'm doing wrong, i.e. how do I combine the pipeline/feature selection step into the GridSearchCV?

As a side note, if I comment out pipeline in the GridSearchCV and uncomment estimator=SVR(kernel='rbf'), the cell runs without issue, but in that case I presume I am not incorporating the feature selection, as it's not called anywhere. I have seen some previous SO questions, e.g. here, but they don't seem to answer this specific question.

Is there a cleaner way to write this?

The first error message is about the pipeline's parameters, not the search's, and it indicates that your param_grid is wrong, not the pipeline step names. Because the pipeline itself is the estimator you pass to GridSearchCV, the grid keys take no estimator__ prefix, just <step name>__<parameter>. Running pipeline.get_params().keys() should show you the right parameter names. Your grid should be:

        param_grid={
            'svr__C': [0.1, 1, 10, 100, 1000],
            'svr__epsilon': [0.0001, 0.0005,  0.001, 0.005,  0.01, 0.05, 1, 5, 10],
            'svr__gamma': [0.0001, 0.0005,  0.001, 0.005,  0.01, 0.05, 1, 5, 10]
        },

I don't know how substituting the plain SVR for the pipeline runs; your parameter grid doesn't specify the right things there either...
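
For concreteness, here is a minimal, self-contained sketch of the corrected pipeline search. It runs on synthetic data from make_regression rather than your scaled_df/target, the hyperparameter ranges are trimmed placeholders, and the sel__k entry is folded into param_grid (which the otherwise-unused grid dict in the question appears intended for):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# placeholder data; substitute your own scaled_df/target here
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

pipeline = Pipeline(steps=[
    ('sel', SelectKBest(score_func=mutual_info_regression)),
    ('svr', SVR(kernel='rbf')),
])

# keys are <step name>__<parameter>; no 'estimator__' prefix, because the
# pipeline itself is the estimator handed to GridSearchCV
param_grid = {
    'sel__k': list(range(1, X_train.shape[1] + 1)),
    'svr__C': [0.1, 1, 10],
    'svr__epsilon': [0.01, 0.1, 1],
    'svr__gamma': [0.01, 0.1, 1],
}

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
search = GridSearchCV(pipeline, param_grid, scoring='neg_mean_squared_error',
                      cv=cv, n_jobs=-1, verbose=1)
results = search.fit(X_train, y_train)
print('Best CV score (neg MSE): %.3f' % results.best_score_)
print('Best params: %s' % results.best_params_)

Because the whole pipeline is cross-validated, the number of selected features and the SVR hyperparameters are tuned jointly, and the best combination is refit on the full training set.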

Thanks. When I change the param_grid to your code, it runs (with a really bad Spearman between y_pred and y_test). That's fine if that's really the case, but I wanted to check that wasn't because of an improperly made model. I don't understand 'I don't know how substituting the plain SVR for the pipeline runs; your parameter grid doesn't specify the right things there either...'. Are you saying this code should not run? I have the code in a file by itself with just some data read in before it, and when I run y_pred = results.predict(X_test); print(y_pred[0:10]), it prints 10 predictions. – Slowat_Kela Mar 3, 2021 at 16:29

@Slowat_Kela I was referring to your "side note" paragraph. If estimator=SVR(...), then having estimator__svr__C in the param_grid should fail with the same error as you originally reported; in this version, the parameter name should be just C. – Ben Reiniger Mar 3, 2021 at 17:10

Oh sorry, that's my fault, I wasn't clear. I had estimator=SVR(...) in when I just had C, epsilon and gamma in the param grid (not estimator__svr__C). I meant that I can get this piece of code generally to run if I just use plain SVR, but not if I swap it over to a pipeline. – Slowat_Kela Mar 3, 2021 at 17:14
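
To illustrate the point made in that comment exchange: if a bare SVR, rather than the pipeline, is passed to GridSearchCV, the grid keys drop the step prefix entirely. A small sketch with placeholder values (search_plain is just an illustrative name):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# with a bare SVR there is no pipeline step name, so the keys are the
# estimator's own parameters: 'C', 'epsilon', 'gamma' (no 'svr__' prefix)
search_plain = GridSearchCV(
    SVR(kernel='rbf'),
    param_grid={
        'C': [0.1, 1, 10],
        'epsilon': [0.01, 0.1, 1],
        'gamma': [0.01, 0.1, 1],
    },
    scoring='neg_mean_squared_error',
    n_jobs=-1)
# search_plain.fit(X_train, y_train)  # with X_train/y_train defined as above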
