I want to grid search over a set of hyperparameters to tune a clustering model. GridSearchCV offers a bunch of scoring functions for unsupervised learning, but I want to use a function that's not in there, e.g. silhouette score.
The documentation on implementing a custom scoring function is unclear about how the function should be defined. The example there simply shows importing a custom scorer and using make_scorer to create a custom scoring function. However, make_scorer seems to require the true labels (which don't exist in unsupervised learning), so it's not clear how to use it.
Here's what I have so far:
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, make_scorer

Z, _ = make_blobs()

gs = GridSearchCV(estimator=DBSCAN(),
                  param_grid={'eps': [0.1, 0.2, 0.3]},
                  cv=5,
                  scoring=make_scorer(my_custom_function))  # my_custom_function is what I can't figure out
gs.fit(Z)
I attempted to write my_custom_function in various ways, but I get warnings or errors such as the following:
TypeError: __call__() missing 1 required positional argument: 'y_true'
ValueError: Found input variables with inconsistent numbers of samples: [20, 80]
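For instance, simply wrapping silhouette_score with make_scorer (a sketch of one attempt is below) gives the first error, presumably because the wrapped scorer still expects ground-truth labels:

# Sketch of one failing attempt: wrapping the metric directly with make_scorer.
# The wrapped scorer expects (estimator, X, y_true); GridSearchCV has no y to
# pass for unsupervised data, hence the missing 'y_true' TypeError above.
gs = GridSearchCV(estimator=DBSCAN(),
                  param_grid={'eps': [0.1, 0.2, 0.3]},
                  cv=5,
                  scoring=make_scorer(silhouette_score))
gs.fit(Z)  # raises the TypeError shown above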
How do I correctly define my custom scoring function?
Your model must have a .fit_predict() method to get the labels (using .labels_ won't work). Your scorer function must then return a single value, where a greater value is better¹. All clustering algorithms in scikit-learn implement .fit_predict(), so there's no problem on that front.
For example, to use silhouette score as the scoring metric for DBSCAN, define the function like the following and pass it directly into GridSearchCV as the scoring argument. Note that silhouette score is not defined when there is only a single label, so we need to include a check for that.
def my_silhouette_score(model, X, y=None):
    preds = model.fit_predict(X)
    return silhouette_score(X, preds) if len(set(preds)) > 1 else float('nan')
import numpy as np

model = DBSCAN()
pgrid = {
    'eps': np.linspace(0.01, 0.5, 10),
    'min_samples': np.arange(2, 10)
}
gs = GridSearchCV(model, pgrid, scoring=my_silhouette_score).fit(Z)
best_estimator = gs.best_estimator_
highest_silhouette_score = gs.score(Z)
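To see which hyperparameters were selected, you can also inspect the standard GridSearchCV attributes on the fitted search object:

print(gs.best_params_)  # e.g. {'eps': ..., 'min_samples': ...}
print(gs.best_score_)   # mean cross-validated silhouette score of the best candidate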
¹: This is not really a restriction: if your metric is one where smaller is better, simply flip its sign so that maximizing the score still selects the best model.
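As a concrete sketch of that sign flip (assuming, purely for illustration, you wanted to tune against the Davies-Bouldin index, where lower is better), you would return the negated value:

from sklearn.metrics import davies_bouldin_score

def neg_davies_bouldin_score(model, X, y=None):
    # Davies-Bouldin is better when lower, so negate it: GridSearchCV
    # always treats a greater score as better.
    preds = model.fit_predict(X)
    return -davies_bouldin_score(X, preds) if len(set(preds)) > 1 else float('nan')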