
Expectation-Maximization algorithm (Gaussian Mixture Model): ValueError: the input matrix must be positive semidefinite


I am trying to implement the Expectation-Maximization algorithm (Gaussian Mixture Model) on a data set data=[[x,y],...]. I am using mv_norm.pdf(data, mean, cov) to compute cluster responsibilities. But after recomputing the covariance matrices, at around 6-7 iterations a covariance matrix becomes singular, i.e. its determinant is 0 (or a very small value), and hence it raises errors:

ValueError: the input matrix must be positive semidefinite

raise np.linalg.LinAlgError('singular matrix')

Can someone suggest any solution for this?

import copy
from scipy.stats import multivariate_normal as mv_norm

# E-step: compute cluster responsibilities, given cluster parameters
def calculate_cluster_responsibility(data, centroids, cov_m):
    pdfmain = [[] for _ in range(len(data))]
    for i in range(len(data)):
        # Weighted likelihood of point i under each of the 3 clusters
        # (uniform mixing weights of 1/3 each)
        pdfeach = [1/3. * mv_norm.pdf(data[i], mean=centroids[k], cov=cov_m[k])
                   for k in range(len(centroids))]
        # Normalize to get responsibilities
        total = sum(pdfeach)
        pdfmain[i] = [p / total for p in pdfeach]
    global old_pdfmain
    if old_pdfmain == pdfmain:  # responsibilities unchanged: converged
        return
    old_pdfmain = copy.deepcopy(pdfmain)
    softcounts = [sum(col) for col in zip(*pdfmain)]
    calculate_cluster_weights(data, centroids, pdfmain, softcounts)

Initially, I've passed [[3,0],[0,3]] for each cluster covariance, since the expected number of clusters is 3.

The problem is that your data lies on some manifold of dimension strictly smaller than the dimension of the input space. For example, your data might lie on a circle while you have 3-dimensional points. As a consequence, when your method tries to estimate a 3-dimensional ellipsoid (covariance matrix) that fits your data, it fails, since the optimal one is a 2-dimensional ellipse (the third dimension is 0).
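You can reproduce the degeneracy with a toy example (hypothetical data, assuming NumPy): when all points lie exactly on a line, the sample covariance is rank-deficient, one eigenvalue is numerically zero, and any routine that requires a positive-definite covariance (such as the multivariate-normal PDF) will reject it.

```python
import numpy as np

# Hypothetical degenerate data: 2-D points that all lie on the line y = 2x,
# so the point cloud is effectively 1-dimensional.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, 2 * x])          # shape (100, 2)

cov = np.cov(X, rowvar=False)            # 2x2 sample covariance
print(np.linalg.matrix_rank(cov))        # 1 -- rank-deficient
print(np.linalg.eigvalsh(cov))           # one eigenvalue is (numerically) 0
```

This is exactly the situation the error message describes: the matrix is at best positive *semi*-definite, not positive definite.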

How to fix it? You need some regularization of your covariance estimator. There are many possible solutions, all in the M step, not the E step; the problem is in computing the covariance:

  • Simple solution: instead of doing something like cov = np.cov(X), add a regularizing term, like cov = np.cov(X) + eps * np.identity(X.shape[1]) with a small eps.
  • Use a nicer estimator, like the LedoitWolf estimator from scikit-learn.
  • Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.

    This makes no sense: the covariance matrix values have nothing to do with the number of clusters. You can initialize it with anything more or less reasonable.
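Both regularization options above can be sketched as follows (hypothetical nearly-degenerate data; the `eps` value is an assumption you would tune):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

# Hypothetical nearly-degenerate data: points almost on the line y = 2x.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, 2 * x + 1e-8 * rng.normal(size=100)])

# Option 1: ridge-style regularization -- add eps to the diagonal.
eps = 1e-6
cov_reg = np.cov(X, rowvar=False) + eps * np.identity(X.shape[1])
print(np.linalg.eigvalsh(cov_reg))   # all eigenvalues pushed above zero

# Option 2: Ledoit-Wolf shrinkage estimator from scikit-learn,
# which shrinks the sample covariance toward a scaled identity.
lw = LedoitWolf().fit(X)
print(np.linalg.eigvalsh(lw.covariance_))
```

Either estimate can then be passed as the `cov` argument of `mv_norm.pdf` without triggering the positive-semidefinite error.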
