I am working on my first major data science project. I am trying to match names from a large list from one source against a cleansed dictionary from another. I am using this string matching blog as a guide.

I am attempting to apply it to two different data sets. Unfortunately, I can't get good results, and I suspect I am not applying the technique correctly.

Code:

import pandas as pd, numpy as np, re, sparse_dot_topn.sparse_dot_topn as ct
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
df_dirty = {"name":["gogle","bing","amazn","facebook","fcbook","abbasasdfzz","zsdfzl","gogle","bing","amazn","facebook","fcbook","abbasasdfzz","zsdfzl"]}
df_clean = {"name":["google","bing","amazon","facebook"]}
print (df_dirty["name"])
print (df_clean["name"])
def ngrams(string, n=3):
    string = (re.sub(r'[,-./]|\sBD',r'', string)).upper()
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]
def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B as a CSR matrix.
    # If they have already been CSR, there is no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape
    idx_dtype = np.int32
    nnz_max = M * ntop
    indptr = np.zeros(M + 1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)
    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)
    return csr_matrix((data, indices, indptr), shape=(M, N))
def get_matches_df(sparse_matrix, name_vector, top=5):
    non_zeros = sparse_matrix.nonzero()
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]
    if top:
        print (top)
        nr_matches = top
    else:
        print (sparsecols.size)
        nr_matches = sparsecols.size
    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)
    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]
    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similairity': similairity})
company_names = df_clean['name']
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(company_names)
import time
t1 = time.time()
matches = awesome_cossim_top(tf_idf_matrix, tf_idf_matrix.transpose(), 4, 0.8)
t = time.time()-t1
print("SELFTIMED:", t)
matches_df = get_matches_df(matches, company_names, top=4)
matches_df = matches_df[matches_df['similairity'] < 0.99999] # Remove all exact matches
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(matches_df)

The expected result is as follows:

  • gogle = google
  • amazn = amazon
  • fcbook = facebook

    You can import this function directly from the sparse_dot_topn library instead of defining it yourself (in releases before 1.0 it is exposed as awesome_cossim_topn).

    Replace the get_matches_df function with this version, which takes the two name lists separately:

    def get_matches_df(sparse_matrix, A, B, top=100):
        non_zeros = sparse_matrix.nonzero()
        sparserows = non_zeros[0]
        sparsecols = non_zeros[1]
        # Cap at the number of stored matches so the loop cannot index past the data
        nr_matches = min(top, sparsecols.size) if top else sparsecols.size
        left_side = np.empty([nr_matches], dtype=object)
        right_side = np.empty([nr_matches], dtype=object)
        similairity = np.zeros(nr_matches)
        for index in range(0, nr_matches):
            left_side[index] = A[sparserows[index]]
            right_side[index] = B[sparsecols[index]]
            similairity[index] = sparse_matrix.data[index]
        return pd.DataFrame({'left_side': left_side,
                             'right_side': right_side,
                             'similairity': similairity})
    

    Now you can execute your code as below:

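    # Assumes the imports, ngrams() and awesome_cossim_top() from the question are already defined
    df_dirty = {"name":["gogle","bing","amazn","facebook","fcbook","abbasasdfzz","zsdfzl"]}
    df_clean = {"name":["google","bing","amazon","facebook"]}
    print(df_dirty["name"])
    print(df_clean["name"])
    # Fit on the clean names only, then reuse that vocabulary for the dirty names
    # so both matrices have the same number of columns
    vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
    tf_idf_matrix_clean = vectorizer.fit_transform(df_clean['name'])
    tf_idf_matrix_dirty = vectorizer.transform(df_dirty['name'])
    t1 = time.time()
    # Rows are dirty names, columns are clean names; keep the single best match per dirty name
    matches = awesome_cossim_top(tf_idf_matrix_dirty, tf_idf_matrix_clean.transpose(), 1, 0)
    t = time.time()-t1
    print("SELFTIMED:", t)
    # top=0 returns every stored match instead of truncating the result
    matches_df = get_matches_df(matches, df_dirty['name'], df_clean['name'], top=0)
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        print(matches_df)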
    

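    Basically, the example you found looks for duplicates within a single array, whereas you want to match one source against another. The key change is to fit the vectorizer on the clean names and only transform the dirty names, so both matrices share the same n-gram vocabulary.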
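
The difference between one source and two can be sketched without any third-party packages. The snippet below is a simplified stand-in, not the sparse_dot_topn routine: it uses plain character-trigram counts with cosine similarity (no IDF weighting), and the helper names `trigrams`, `cosine`, and `best_match` are hypothetical. Every dirty name is scored against the clean list only, so a name is never compared with itself or with another dirty entry.

```python
import math
from collections import Counter


def trigrams(text, n=3):
    """Return a Counter of upper-cased character n-grams in text."""
    text = text.upper()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram Counters (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(name, candidates):
    """Return (candidate, score) with the highest similarity, or (None, 0.0)."""
    grams = trigrams(name)
    scored = [(cand, cosine(grams, trigrams(cand))) for cand in candidates]
    best = max(scored, key=lambda pair: pair[1])
    # A score of zero means no shared trigram at all, so report no match
    return best if best[1] > 0 else (None, 0.0)


dirty = ["gogle", "bing", "amazn", "facebook", "fcbook", "abbasasdfzz", "zsdfzl"]
clean = ["google", "bing", "amazon", "facebook"]

# Each dirty name is scored against the clean list only (two sources)
matches = {name: best_match(name, clean) for name in dirty}
for name, (match, score) in matches.items():
    print(f"{name!r} -> {match!r} ({score:.2f})")
```

Names such as "abbasasdfzz" share no trigram with any clean name, so they come back without a match instead of being forced onto the nearest entry.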

    Hope it helps!

    @pieter, when the data frame is returned, there are duplicate matches, although in a different order. The matches are not grouped together, and oftentimes you will see results such as Row1="Facebook" "Fcbook" and Row2="Fcbook" "Facebook". – Matthew Metros May 10, 2020 at 15:37

    Getting an error while calling the awesome_cossim_topn method: ValueError: A matrix multiplication will be operated. A.shape[1] must be equal to B.shape[0]! – Raushan Sep 15, 2022 at 9:10
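
For the mirrored rows mentioned in the comments (Facebook/Fcbook appearing in both orders), one possible clean-up is to treat each pair as unordered and keep only its first occurrence. This is a minimal sketch in plain Python; the function name `drop_mirrored_pairs` and the sample rows are hypothetical, not output from the code above.

```python
def drop_mirrored_pairs(rows):
    """Keep the first row for each unordered (left, right) name pair.

    Names are compared case-insensitively, so ("A", "b") and ("B", "a")
    count as the same pair.
    """
    seen = set()
    unique = []
    for left, right, score in rows:
        key = frozenset((left.lower(), right.lower()))
        if key in seen:
            continue
        seen.add(key)
        unique.append((left, right, score))
    return unique


# Hypothetical rows in the shape (left_side, right_side, similarity)
rows = [
    ("Facebook", "Fcbook", 0.41),
    ("Fcbook", "Facebook", 0.41),
    ("Google", "Gogle", 0.58),
]
deduped = drop_mirrored_pairs(rows)
print(deduped)
```

Mirrored rows are typical when a list is matched against itself, because the similarity matrix is then symmetric. When the rows come from one source and the columns from another, each pair should appear in one direction only.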
