python - sklearn: KDE not working for small values

I am struggling to use scikit-learn's implementation of KDE on data with a small range. The following code works, but when I increase the divisor variable to 100, the KDE struggles:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.neighbors import KernelDensity
# make data:
np.random.seed(0)
divisor = 1
gaussian1 = (3 * np.random.randn(1700))/divisor
gaussian2 = (9 + 1.5 * np.random.randn(300)) / divisor
gaussian_mixture = np.hstack([gaussian1, gaussian2])
# illustrate proper KDE with seaborn:
sns.distplot(gaussian_mixture);
x_grid = np.linspace(min(gaussian1), max(gaussian2), 200)
kde_skl = KernelDensity(bandwidth=0.5)
kde_skl.fit(gaussian_mixture[:, np.newaxis])
# score_samples() returns the log-likelihood of the samples
log_pdf = kde_skl.score_samples(x_grid[:, np.newaxis])
pdf = np.exp(log_pdf)
fig, ax = plt.subplots(1, 1, sharey=True, figsize=(7, 4))
ax.plot(x_grid, pdf, linewidth=3, alpha=0.5)
Kernel density estimation is called a nonparametric method, but it does have a parameter: the bandwidth.
Every application of KDE needs this parameter set!
When you make the seaborn plot:
sns.distplot(gaussian_mixture);
you do not pass a bandwidth, so seaborn falls back on a default heuristic (Scott or Silverman). These rules compute the bandwidth from the data, so it scales with the spread of the data.
Your sklearn code looks like:
kde_skl = KernelDensity(bandwidth=0.5)
Here the bandwidth is fixed at 0.5! Dividing the data by 100 shrinks its spread by the same factor while the bandwidth stays the same, so the kernel becomes far wider than the data. This is probably the reason for your trouble, and it is at least something to look at. In general, one would combine sklearn's KDE with GridSearchCV as a cross-validation tool to select a good bandwidth. In many cases this is slower, but better than the heuristics above.
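A minimal sketch of the cross-validated bandwidth selection described above, using the question's data with divisor = 100. The candidate range (fractions of the sample standard deviation) and the shuffled 5-fold split are illustrative assumptions, not recommended settings:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KernelDensity

# Same mixture as in the question, scaled down by divisor = 100
rng = np.random.RandomState(0)
divisor = 100
data = np.hstack([3 * rng.randn(1700), 9 + 1.5 * rng.randn(300)]) / divisor
X = data[:, np.newaxis]

# Candidate bandwidths tied to the spread of the data (assumed range)
bandwidths = data.std() * np.logspace(-2, 0, 20)

# Shuffle the folds: the samples are ordered by mixture component,
# so unshuffled folds would not be representative of the whole data set
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# With no explicit scorer, GridSearchCV uses KernelDensity.score,
# i.e. the total log-likelihood of the held-out samples
search = GridSearchCV(KernelDensity(kernel="gaussian"),
                      {"bandwidth": bandwidths}, cv=cv)
search.fit(X)

best_bandwidth = search.best_params_["bandwidth"]
kde = search.best_estimator_

# Evaluate the selected model on a grid, as in the question
x_grid = np.linspace(data.min(), data.max(), 200)
pdf = np.exp(kde.score_samples(x_grid[:, np.newaxis]))
```

Because the candidates are expressed relative to data.std(), the same search should carry over to other values of divisor without retuning the grid.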
Sadly, you did not explain why you want to use sklearn's KDE. My personal ranking of the three popular candidates is statsmodels > sklearn > scipy.
Hi @sascha. I've just implemented my own KDE using a method similar to seaborn's. It uses the Silverman reference rule for the bandwidth and seems to estimate it suitably:
x = gaussian_mixture
bandwidth = 1.06 * x.std() * x.size ** (-1 / 5.)
However, I want to be able to use a cross-validation method in sklearn, so I will use this as an initial value and then do a small grid search around it to find a local minimum of the cost function. Thanks.
– EB88
Jul 11, 2017 at 21:21
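As a rough illustration of why the formula in the comment copes with small values, the sketch below plugs it into sklearn's KernelDensity for divisor = 1 and divisor = 100. The helper names rule_of_thumb_bandwidth and fit_pdf are hypothetical, introduced only for this example:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def rule_of_thumb_bandwidth(x):
    """Bandwidth from the formula quoted in the comment: 1.06 * std * n^(-1/5)."""
    return 1.06 * x.std() * x.size ** (-1 / 5.)

def fit_pdf(x, n_grid=200):
    """Fit a Gaussian KDE with the rule-of-thumb bandwidth; evaluate it on a grid."""
    bandwidth = rule_of_thumb_bandwidth(x)
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(x[:, np.newaxis])
    x_grid = np.linspace(x.min(), x.max(), n_grid)
    pdf = np.exp(kde.score_samples(x_grid[:, np.newaxis]))
    return bandwidth, x_grid, pdf

# Same mixture as in the question, before dividing
rng = np.random.RandomState(0)
base = np.hstack([3 * rng.randn(1700), 9 + 1.5 * rng.randn(300)])

bw_1, grid_1, pdf_1 = fit_pdf(base / 1)
bw_100, grid_100, pdf_100 = fit_pdf(base / 100)

# The bandwidth shrinks with the data, so the two curves have the same shape
print(bw_1, bw_100)
```

Since the bandwidth is proportional to the standard deviation, rescaling the data rescales the estimate with it: the divisor = 100 curve is the divisor = 1 curve compressed by 100 along x and stretched by 100 along y.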