Day14 - Feature Engineering -- 5. Outliers (1)

5. Outlier

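Because of variability in how data are collected and measured, human error, or experimental error, a dataset may contain a value that differs greatly from the other values; we call such a value an outlier. Outliers cause a range of problems in statistical analysis and can noticeably distort the mean and the standard deviation.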

Ways to handle outliers:
5.1 Outlier detection and removal
5.2 Treating outliers as missing values
5.3 Top / bottom / zero coding
5.4 Discretisation

5.1 Outlier detection and removal

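Outlier detection and removal means deleting the outliers from the dataset. By nature, outliers are few, so this step should not significantly damage the integrity of the data. However, if the outliers are spread across several columns, so that many different rows each contain at least one, we may end up removing a large share of the data.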
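To make the last point concrete, here is a minimal sketch on a made-up DataFrame. The columns `a` and `b`, the values, and the helper `iqr_inlier_mask` are all hypothetical, and the 1.5 × IQR rule it uses is the one covered later in this section. Each column contains a single outlier, but because the two sit in different rows, dropping every row that has at least one outlier removes twice as many rows as either column alone would suggest.

```python
import pandas as pd

# Hypothetical toy data: each column has one extreme value, in different rows.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 100],
    "b": [200, 11, 12, 13, 14, 15, 16, 17],
})

def iqr_inlier_mask(s, k=1.5):
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.between(q1 - k * iqr, q3 + k * iqr)

# A row is kept only if it is an inlier in every column.
keep = df.apply(iqr_inlier_mask).all(axis=1)
rows_dropped = int((~keep).sum())
print(rows_dropped, "of", len(df), "rows dropped")
```

With more columns the effect grows, so it may be worth checking the share of dropped rows before committing to removal.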

Outliers can be found with the following methods:

IQR (interquartile range)

Percentile

z score

Scatter plots

Box plot
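
As one illustration of the percentile approach in the list above, the sketch below flags values outside a chosen percentile band. The data are made up, and the 5th and 95th percentile cut-offs are an assumption for this example, not values taken from the article.

```python
import pandas as pd

# Hypothetical measurements; 250 is a planted extreme value.
s = pd.Series([12, 15, 14, 13, 16, 15, 14, 13, 12, 250])

# Assumed cut-offs: the 5th and 95th percentiles.
low, high = s.quantile(0.05), s.quantile(0.95)

# Anything strictly outside the band is flagged.
flagged = s[(s < low) | (s > high)]
print(low, high)
print(flagged.tolist())
```

Keep in mind that percentile cut-offs tend to flag a roughly fixed share of the data whether or not those values are truly unusual, so the choice of percentiles is a judgment call.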

Outlier detection and removal - using the IQR (interquartile range)

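Sort a set of values in ascending order and split the sequence into four equal parts; the values at the three cut points are the quartiles. We call them the first, second, and third quartiles, written Q1, Q2, and Q3. The difference between the third quartile and the first quartile is called the interquartile range (IQR).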
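
A small worked example may help here. The eight values below are made up for illustration; the sketch computes Q1, Q2, Q3, and the IQR with pandas, which interpolates linearly between neighbouring values by default (other quartile conventions can give slightly different numbers).

```python
import pandas as pd

# Hypothetical values, deliberately unsorted; quantile() handles the ordering.
values = pd.Series([7, 1, 3, 5, 9, 11, 13, 15])

# The three cut points that split the sorted values into four parts.
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])

# Interquartile range: the spread of the middle half of the data.
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Here Q2 is simply the median, and the IQR covers the middle 50% of the values.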

We use the Age variable from Kaggle's Titanic dataset to illustrate:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_columns', None)
data = pd.read_csv('../input/titanic/train.csv')
data.head()
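Assuming the data are normally distributed, apply the three-standard-deviation rule to find the upper and lower boundaries; any value outside this range is an outlier.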
Upper_boundary_limit = data.Age.mean() + 3* data.Age.std()
Lower_boundary_limit = data.Age.mean() - 3* data.Age.std()
Upper_boundary_limit, Lower_boundary_limit
(73.27860964406095, -13.88037434994331)

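The upper boundary for Age is between 73 and 74. The lower boundary is negative, which is meaningless here because age cannot be negative; this happens because the data are not normally distributed.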
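
The same effect can be reproduced on a small made-up sample. The values below are hypothetical (they are not the Titanic ages): they are all positive and skewed to the right, and the three-standard-deviation rule still yields a negative lower boundary.

```python
import pandas as pd

# Hypothetical non-negative, right-skewed values (not the Titanic ages).
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 8, 20])

mean, std = s.mean(), s.std()  # Series.std() uses the sample formula (ddof=1)
lower, upper = mean - 3 * std, mean + 3 * std

print(round(lower, 2), round(upper, 2))
print(s.min(), s.skew() > 0)   # smallest value and whether the skew is positive
```

In this sample the lower boundary can never be reached, and the large value 20 inflates the standard deviation enough that it is not flagged either.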
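Computing the upper and lower boundaries with the IQR (interquartile range)

IQR = Q3 - Q1

Lower boundary = Q1 - 1.5 × IQR

Upper boundary = Q3 + 1.5 × IQR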
IQR = data.Age.quantile(0.75) - data.Age.quantile(0.25)
Lower_quantile_lower = data.Age.quantile(0.25) - (IQR * 1.5)
Upper_quantile_lower = data.Age.quantile(0.75) + (IQR * 1.5)
Upper_quantile_lower, Lower_quantile_lower, IQR
(64.8125, -6.6875, 17.875)

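The boundaries computed with 1.5 × IQR are in the same range as those from the previous example (three standard deviations), though somewhat tighter: the upper boundary is about 65 instead of about 73.
Let's look at a more extreme example.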
IQR = data.Age.quantile(0.75) - data.Age.quantile(0.25)
Lower_quantile = data.Age.quantile(0.25) - (IQR * 3)
Upper_quantile = data.Age.quantile(0.75) + (IQR * 3)
Upper_quantile, Lower_quantile, IQR
(91.625, -33.5, 17.875)

With 3 × IQR, the upper boundary (91.625) is higher than the average human life expectancy.
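
The two sets of boundaries differ only in the multiplier applied to the IQR. As a quick arithmetic check, the sketch below wraps the calculation in a hypothetical helper, `iqr_fences`, and feeds it the Age quartiles implied by the printed results (Q1 = 20.125 and Q3 = 38.0, back-calculated from IQR = 17.875 and the upper boundary 64.8125).

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) boundaries: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of Age implied by the outputs shown in the article.
q1, q3 = 20.125, 38.0

fences_15 = iqr_fences(q1, q3, k=1.5)
fences_3 = iqr_fences(q1, q3, k=3)
print(fences_15)  # (-6.6875, 64.8125)
print(fences_3)   # (-33.5, 91.625)
```

A larger multiplier widens the band on both sides, so fewer values are treated as outliers.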
Now we can use the boundaries above to find the outliers that fall outside them.
Remove missing data
data = data.dropna(subset=['Age'])
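Count the passengers
total_passengers = float(data.shape[0])  # np.float was removed from NumPy; the built-in float does the same job
print('Fraction of passengers older than 73 (normal distribution method): {}'.format(data[data.Age > 73].shape[0] / total_passengers))
print('Fraction of passengers older than 65 (1.5 x IQR): {}'.format(data[data.Age > 65].shape[0] / total_passengers))
print('Fraction of passengers older than 91 (3 x IQR): {}'.format(data[data.Age > 91].shape[0] / total_passengers))
Fraction of passengers older than 73 (normal distribution method): 0.0028011204481792717

Fraction of passengers older than 65 (1.5 x IQR): 0.011204481792717087

Fraction of passengers older than 91 (3 x IQR): 0.0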
Depending on the method, very old passengers make up between 0% and about 1.1% of all passengers with a known age.
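
A more compact way to get such fractions is to take the mean of a boolean mask, since True counts as 1 and False as 0. The sketch below uses a small hypothetical list of ages rather than the Titanic data, with the same three thresholds.

```python
import pandas as pd

# Hypothetical ages, not the Titanic data.
ages = pd.Series([22, 38, 26, 35, 54, 2, 27, 14, 71, 80])

# The mean of a boolean mask is the fraction of True entries.
shares = {t: float((ages > t).mean()) for t in (73, 65, 91)}

for threshold, share in shares.items():
    print('older than {}: {:.1%}'.format(threshold, share))
```

Formatting with `{:.1%}` prints the value as a percentage, which avoids confusing a fraction such as 0.011 with 0.011%.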
The details of the outliers identified with 1.5 × IQR are as follows:
data[(data.Age<Lower_quantile_lower)|(data.Age>Upper_quantile_lower)]
From the data above, most of the passengers flagged as outliers did not survive.
Now let's remove these outliers
data_with_no_outlier = data[(data.Age>Lower_quantile_lower)&(data.Age<Upper_quantile_lower)]
data_with_no_outlier
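
One possible variation, sketched here on hypothetical ages rather than the Titanic data, is to build the outlier mask once and negate it for the cleaned subset. That way the flagged rows and the kept rows always add up to the full dataset, even when a value falls exactly on a boundary. The names `is_outlier` and `cleaned` are made up for this example.

```python
import pandas as pd

# Hypothetical ages with no missing values; 70 is a planted extreme value.
ages = pd.Series([20, 21, 22, 23, 24, 25, 26, 27, 28, 70], name="Age")

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Build the mask once, then negate it to get the rows to keep.
is_outlier = (ages < lower) | (ages > upper)
cleaned = ages[~is_outlier]

print(ages[is_outlier].tolist())
print(len(cleaned), "of", len(ages), "rows kept")
```

Note that this sketch assumes missing values were already dropped, as was done above; a missing value compares as False on both sides and would otherwise be kept by the negated mask.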