I want to grid search over a set of hyperparameters to tune a clustering model. GridSearchCV offers a bunch of scoring functions for unsupervised learning, but I want to use a function that's not in there, e.g. silhouette score.

The documentation on implementing a custom scoring function is unclear about how the function should be defined. The example there simply imports a custom scorer and uses make_scorer to create a scoring function. However, make_scorer seems to require the true values (which don't exist in unsupervised learning), so it's not clear how to use it.

Here's what I have so far:

from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, make_scorer

Z, _ = make_blobs()

gs = GridSearchCV(estimator=DBSCAN(),
                  param_grid={'min_samples': range(2, 5)},
                  cv=5,
                  scoring=make_scorer(my_custom_function))
gs.fit(Z)

I attempted to write my_custom_function in various ways but I get warnings or errors such as the following:

TypeError: __call__() missing 1 required positional argument: 'y_true'
ValueError: Found input variables with inconsistent numbers of samples: [20, 80]

How do I correctly define my custom scoring function?

Your model must have a .fit_predict() method to get the labels (using .labels_ won't work), and your scorer function must return a single value where a greater value is better.¹ All clustering algorithms in scikit-learn implement .fit_predict(), so there's no problem on that front.

For example, to implement silhouette score as a scoring metric for DBSCAN, define it like the following and pass it directly to GridSearchCV as the scoring argument. Note that silhouette score is undefined when there is only a single label, so we need to include a check for that.

import numpy as np

def my_silhouette_score(model, X, y=None):
    # fit the candidate model on the data it receives and score its labels
    preds = model.fit_predict(X)
    # silhouette score is undefined for a single cluster, so guard against that
    return silhouette_score(X, preds) if len(set(preds)) > 1 else float('nan')

model = DBSCAN()
pgrid = {
    'eps': np.linspace(0.01, 0.5, 10),
    'min_samples': np.arange(2, 10)
}
gs = GridSearchCV(model, pgrid, scoring=my_silhouette_score).fit(Z)

best_estimator = gs.best_estimator_
highest_silhouette_score = gs.score(Z)
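
The best parameter combination and its cross-validated score can also be read off the fitted search object; the lines below are just an illustration:

print(gs.best_params_)   # e.g. {'eps': ..., 'min_samples': ...}
print(gs.best_score_)    # mean cross-validated silhouette score of the best candidate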

¹: This is not really a restriction; for a metric that is better when smaller, simply flipping the sign of the returned value turns it into one where greater is better.
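
For instance, the Davies-Bouldin index is better when lower, so a scorer can return its negative. This is a minimal sketch following the same pattern as above; the function name is just for illustration:

from sklearn.metrics import davies_bouldin_score

def neg_davies_bouldin_score(model, X, y=None):
    # Davies-Bouldin is better when lower; negating it gives GridSearchCV
    # a "greater is better" value to maximize
    preds = model.fit_predict(X)
    return -davies_bouldin_score(X, preds) if len(set(preds)) > 1 else float('nan')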

The grid search is fitting the model on training folds and then using the scoring method on test folds, so you should not have the scoring method call fit_predict, just predict. – Ben Reiniger Nov 2, 2022 at 3:09

@BenReiniger but DBSCAN doesn't implement predict (KMeans etc. implement it, though); I thought it implements fit_predict instead? If fit_predict is not the right method, then a custom function is just not feasible? Can you post an answer that works if this one is incorrect? – cottontail Nov 2, 2022 at 3:16
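
If the re-fitting inside the scorer is a concern, one possible workaround (an assumption on my part, not something stated in the answer above) is to disable the train/test split altogether by passing GridSearchCV a single fold whose train and test indices are identical, so each candidate is fit and scored on the full dataset:

import numpy as np

# a single "split" whose train and test sets are both the full dataset
idx = np.arange(len(Z))
cv_all = [(idx, idx)]

gs = GridSearchCV(DBSCAN(), pgrid, scoring=my_silhouette_score, cv=cv_all).fit(Z)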
