Day 29 [Python ML、資料清理] 處理輸入資料不一致 - iT 邦幫忙::一起幫忙解決難題，拯救 IT 人的一天

相关文章推荐
英姿勃勃的饺子 · WindowsOptionalFeature ...· 2 年前 ·
怕老婆的扁豆 · swift 跨平台 gui-掘金· 2 年前 ·
活泼的莴苣 · GStreamer系列 - 基本介绍 - ...· 2 年前 ·
英勇无比的双杠 · selenium ...· 3 年前 ·
professors = pd.read_csv("./pakistan_intellectual_capital.csv") # set seed for reproducibility np.random.seed(0)
/opt/conda/lib/python3.6/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Do some preliminary text pre-processing
我們快速的看過資料
professors.head()
我們現在先看Country這個column中，有沒有哪邊有資料不一致的問題
# get all the unique values in the 'Country' column
countries = professors['Country'].unique()
# sort them alphabetically and then take a closer look
countries.sort()
countries
array([' Germany', ' New Zealand', ' Sweden', ' USA', 'Australia',
       'Austria', 'Canada', 'China', 'Finland', 'France', 'Greece',
       'HongKong', 'Ireland', 'Italy', 'Japan', 'Macau', 'Malaysia',
       'Mauritius', 'Netherland', 'New Zealand', 'Norway', 'Pakistan',
       'Portugal', 'Russian Federation', 'Saudi Arabia', 'Scotland',
       'Singapore', 'South Korea', 'SouthKorea', 'Spain', 'Sweden',
       'Thailand', 'Turkey', 'UK', 'USA', 'USofA', 'Urbana', 'germany'],
      dtype=object)
從資料中我們可以看出有一些不一致的問題，例如: Germany、germany,  New Zealand、New Zealand
第一步是要先將所有資料轉為小寫，並且將空格移除掉
在文本數據中，大寫和空格的問題非常常見，這樣處理過後，可以修復80%以上的不一致問題
# convert to lower case
professors['Country'] = professors['Country'].str.lower()
# remove trailing white spaces
professors['Country'] = professors['Country'].str.strip()
然後我們來處理更困難的不一致問題
Use fuzzy matching to correct inconsistent data entry
然後我們現在來看一下前面處理過後的資料
# get all the unique values in the 'Country' column
countries = professors['Country'].unique()
# sort them alphabetically and then take a closer look
countries.sort()
countries
array(['australia', 'austria', 'canada', 'china', 'finland', 'france',
       'germany', 'greece', 'hongkong', 'ireland', 'italy', 'japan',
       'macau', 'malaysia', 'mauritius', 'netherland', 'new zealand',
       'norway', 'pakistan', 'portugal', 'russian federation',
       'saudi arabia', 'scotland', 'singapore', 'south korea',
       'southkorea', 'spain', 'sweden', 'thailand', 'turkey', 'uk',
       'urbana', 'usa', 'usofa'], dtype=object)
現在又有出現其他的不一致，例如說south korea、southkorea
我們將會使用fuzzywuzzy套件來幫助我們確定那些字符彼此是接近的，只要數據集夠小，就可以很好的糾正錯誤。
Fuzzy mathcing:自動尋找非常相似的字串的過程，稱為Fuzzy matching。這個方法沒辦法100%的解決問題，但是至少可以節省一些時間
Fuzzywuzzy會返回兩個字串之間的比例，比例越接近100，就代表說兩個字串越相向
# get the top 10 closest matches to "south korea"
matches = fuzzywuzzy.process.extract("south korea", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
# take a look at them
matches
[('south korea', 100),
 ('southkorea', 48),
 ('saudi arabia', 43),
 ('norway', 35),
 ('ireland', 33),
 ('portugal', 32),
 ('singapore', 30),
 ('netherland', 29),
 ('macau', 25),
 ('usofa', 25)]
從上面的結果可以看出，south korea和southkorea是非常接近的，我們將相似程度>47的資料替換成south korea
我們將這個步驟寫成一個function，比較方便後續的處理
# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio = 47):
    # get a list of unique strings
    strings = df[column].unique()
    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, 
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
    # only get matches with a ratio > 90
    close_matches = [matches[0] for matches in matches if matches[1] >= min_ratio]
    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)
    # replace all rows with close matches with the input matches 
    df.loc[rows_with_matches, column] = string_to_match
    # let us know the function's done
    print("All done!")
# use the function we just wrote to replace close matches to "south korea" with "south korea"
replace_matches_in_column(df=professors, column='Country', string_to_match="south korea")
All done!
# get all the unique values in the 'Country' column
countries = professors['Country'].unique()
# sort them alphabetically and then take a closer look
countries.sort()
countries
array(['australia', 'austria', 'canada', 'china', 'finland', 'france',
       'germany', 'greece', 'hongkong', 'ireland', 'italy', 'japan',
       'macau', 'malaysia', 'mauritius', 'netherland', 'new zealand',
       'norway', 'pakistan', 'portugal', 'russian federation',
       'saudi arabia', 'scotland', 'singapore', 'south korea', 'spain',
       'sweden', 'thailand', 'turkey', 'uk', 'urbana', 'usa', 'usofa'],
      dtype=object)
將資料替換過後再過來檢查，確認資料已經被替換了