gridSearchCV（网格搜索）的参数、方法及示例__gridsearchcv参数scoring

1. 简介

GridSearchCV 的 sklearn 官方网址：

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV

GridSearchCV ，它存在的意义就是自动调参，只要把参数输进去，就能给出最优化的结果和参数。但是这个方法适合于小数据集，一旦数据的量级上去了，很难得出结果。这个时候就是需要动脑筋了。数据量比较大的时候可以使用一个快速调优的方法——坐标下降。它其实是一种贪心算法：拿当前对模型影响最大的参数调优，直到最优化；再拿下一个影响最大的参数调优，如此下去，直到所有的参数调整完毕。这个方法的缺点就是可能会调到局部最优而不是全局最优，但是省时间省力，巨大的优势面前，还是试一试吧，后续可以再拿 bagging 再优化。

通常算法不够好，需要调试参数时必不可少。比如 SVM 的惩罚因子 C ，核函数 kernel ， gamma 参数等，对于不同的数据使用不同的参数，结果效果可能差 1-5 个点， sklearn 为我们提供专门调试参数的函数 grid_search 。

2. 参数说明

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch=‘2*n_jobs’, error_score=’raise’, return_train_score=’warn’)

（1） estimator

选择使用的分类器，并且传入除需要确定最佳的参数之外的其他参数。每一个分类器都需要一个 scoring 参数，或者 score 方法：如 estimator=RandomForestClassifier(min_samples_split=100,min_samples_leaf=20,max_depth=8,max_features='sqrt',random_state=10),

（2） param_grid

需要最优化的参数的取值，值为字典或者列表，例如： param_grid =param_test1 ， param_test1 = {'n_estimators':range(10,71,10)} 。

（3） scoring=None

模型评价标准，默认 None, 这时需要使用 score 函数；或者如 scoring='roc_auc' ，根据所选模型不同，评价准则不同。字符串（函数名），或是可调用对象，需要其函数签名形如： scorer(estimator, X, y) ；如果是 None ，则使用 estimator 的误差估计函数。具体值的选取看本篇第三节内容。

（4） fit_params=None

（5） n_jobs=1

n_jobs: 并行数， int ：个数 ,-1 ：跟 CPU 核数一致 , 1: 默认值

（6） iid=True

iid : 默认 True, 为 True 时，默认为各个样本 fold 概率分布一致，误差估计为所有样本之和，而非各个 fold 的平均。

（7） refit=True

默认为 True, 程序将会以交叉验证训练集得到的最佳参数，重新对所有可用的训练集与开发集进行，作为最终用于性能评估的最佳模型参数。即在搜索参数结束后，用最佳参数结果再次 fit 一遍全部数据集。

（8） cv=None

交叉验证参数，默认 None ，使用三折交叉验证。指定 fold 数量，默认为 3 ，也可以是 yield 训练 / 测试数据的生成器。

（9） verbose=0 , scoring=None

verbose ：日志冗长度， int ：冗长度， 0 ：不输出训练过程， 1 ：偶尔输出， >1 ：对每个子模型都输出。

（10） pre_dispatch=‘2*n_jobs’

指定总共分发的并行任务数。当 n_jobs 大于 1 时，数据将在每个运行点进行复制，这可能导致 OOM ，而设置 pre_dispatch 参数，则可以预先划分总共的 job 数量，使数据最多被复制 pre_dispatch 次

（11） error_score=’raise’

（12） return_train_score=’warn’

如果 “False” ， cv_results_ 属性将不包括训练分数。

回到 sklearn 里面的 GridSearchCV ， GridSearchCV 用于系统地遍历多种参数组合，通过交叉验证确定最佳效果参数。

3. Scoring parameter: 评价标准参数详细说明

Model-evaluation tools using cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV ) rely on an internal scoring strategy. This is discussed in the section The scoring parameter: defining model evaluation rules .

For the most common use cases, you can designate a scorer object with the scoring parameter; the table below shows all possible values. All scorer objects follow the convention that higher return values are better than lower return values . Thus metrics which measure the distance between the model and the data, like metrics.mean_squared_error , are available as neg_mean_squared_error which return the negated value of the metric.

4. 属性

（ 1 ） cv_results_ : dict of numpy (masked) ndarrays

具有键作为列标题和值作为列的 dict ，可以导入到 DataFrame 中。注意， “params” 键用于存储所有参数候选项的参数设置列表。

（ 2 ） best_estimator_ : estimator

通过搜索选择的估计器，即在左侧数据上给出最高分数（或指定的最小损失）的估计器。如果 refit = False ，则不可用。

（ 3 ） best_score_ : float best_estimator 的分数

（ 4 ） best_params_ : dict 在保存数据上给出最佳结果的参数设置

（ 5 ） best_index_ : int 对应于最佳候选参数设置的索引（ cv_results_ 数组）。
search.cv_results _ ['params'] [search.best_index_] 中的 dict 给出了最佳模型的参数设置，给出了最高的平均分数（ search.best_score_ ）。

（ 6 ） scorer_ : function

Scorer function used on the held out data to choose the best parameters for the model.

（ 7 ） n_splits_ : int

The number of cross-validation splits (folds/iterations).

5. 进行预测的常用方法和属性

grid.fit() ：运行网格搜索

grid_scores_ ：给出不同参数情况下的评价结果

best_params_ ：描述了已取得最佳结果的参数的组合

best_score_ ：提供优化过程期间观察到的最好的评分

6. 网格搜索实例

6.1 随机森林

（ 1 ） http://ju.outofmemory.cn/entry/329943 转自 “ 蓝鲸网站分析博客 ” 。

（ 2 ）

http://blog.csdn.net/u010159842/article/details/79317828 利用 sklearn api gridsearch 进行 keras 参数调优

[python] view plain copy

1. param_test1 ={ 'n_estimators' :range(10,71,10)}

2. gsearch1= GridSearchCV(estimator =RandomForestClassifier(min_samples_split=100,

3. min_samples_leaf=20,max_depth= 8,max_features= 'sqrt' ,random_state=10),

4. param_grid =param_test1,scoring= 'roc_auc' ,cv=5)

5. gsearch1.fit(X,y)

6. gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

输出结果如下：

([mean: 0.80681, std:0.02236, params: {'n_estimators': 10},

mean: 0.81600, std: 0.03275, params:{'n_estimators': 20},

mean: 0.81818, std: 0.03136, params:{'n_estimators': 30},

mean: 0.81838, std: 0.03118, params:{'n_estimators': 40},

mean: 0.82034, std: 0.03001, params:{'n_estimators': 50},

mean: 0.82113, std: 0.02966, params:{'n_estimators': 60},

mean: 0.81992, std: 0.02836, params:{'n_estimators': 70}],

{'n_estimators': 60},

0.8211334476626017)

如果有transform,使用 Pipeline 简化系统搭建流程，将transform与分类器串联起来（Pipelineof transforms with a final estimator）

1. pipeline= Pipeline([( "features" , combined_features), ( "svm" , svm)])

2. param_grid= dict(features__pca__n_components=[1, 2, 3],

3. features__univ_select__k=[1,2],

4. svm__C=[0.1, 1, 10])

6. grid_search= GridSearchCV(pipeline, param_grid=param_grid, verbose=10)

7. grid_search.fit(X,y)

8. print (grid_search.best_estimator_)

http://scikit-learn.org/stable/modules/model_evaluation.html