Introduction to PyOD
Anomaly detection, also known as outlier analysis or outlier detection, has a very wide range of industrial application scenarios:
It can also serve as a preprocessing step in machine learning tasks, preventing training or prediction failures caused by a small number of anomalous points. Put simply, anomaly detection means finding the data points that "look different" from the rest of the data. Detection is usually nontrivial: in practice the data typically carry no labels, so we do not know in advance which points are anomalous, which makes it hard to apply simple supervised learning directly. Outlier detection faces further difficulties as well, such as extreme class imbalance, the diverse forms anomalies can take, and the complexity of analyzing their root causes.
Outliers are not necessarily a bad thing. For example, in a biology experiment, if one rat survives while all the others die, understanding why would be extremely interesting and might lead to a new scientific discovery. This is another reason detecting outliers matters.
Python Outlier Detection (PyOD) is a Python toolkit for anomaly detection. Besides the four models available in scikit-learn, it additionally provides many models such as:
Its main highlights include:
Built-in Algorithms in PyOD
The PyOD toolkit consists of three major functional groups:
i) Individual Detection Algorithms:
| Type | Abbreviation | Algorithm |
| --- | --- | --- |
| Linear Model | PCA | Principal Component Analysis (the sum of weighted projected distances to the eigenvector hyperplanes) |
| Linear Model | MCD | Minimum Covariance Determinant (use the Mahalanobis distance as the outlier score) [9] [22] |
| Linear Model | OCSVM | One-Class Support Vector Machine |
| Linear Model | LMDD | Deviation-based Outlier Detection |
| Proximity-Based | LOF | Local Outlier Factor |
| Proximity-Based | COF | Connectivity-Based Outlier Factor |
| Proximity-Based | CBLOF | Clustering-Based Local Outlier Factor |
| Proximity-Based | LOCI | Fast outlier detection using the local correlation integral |
| Proximity-Based | HBOS | Histogram-based Outlier Score |
| Proximity-Based | kNN | k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score) |
| Proximity-Based | AvgKNN | Average kNN (use the average distance to k nearest neighbors as the outlier score) |
| Proximity-Based | MedKNN | Median kNN (use the median distance to k nearest neighbors as the outlier score) |
| Proximity-Based | SOD | Subspace Outlier Detection |
| Probabilistic | ABOD | Angle-Based Outlier Detection |
| Probabilistic | FastABOD | Fast Angle-Based Outlier Detection using approximation |
| Probabilistic | SOS | Stochastic Outlier Selection |
| Outlier Ensembles | IForest | Isolation Forest |
| Outlier Ensembles | | Feature Bagging |
| Outlier Ensembles | LSCP | Locally Selective Combination of Parallel Outlier Ensembles |
| Outlier Ensembles | XGBOD | Extreme Boosting Based Outlier Detection (Supervised) |
| Outlier Ensembles | LODA | Lightweight On-line Detector of Anomalies |
| Neural Networks | AutoEncoder | Fully connected AutoEncoder (use the reconstruction error as the outlier score) [1] [Ch.3] |
| Neural Networks | VAE | Variational AutoEncoder (use the reconstruction error as the outlier score) |
| Neural Networks | SO_GAAL | Single-Objective Generative Adversarial Active Learning |
| Neural Networks | MO_GAAL | Multiple-Objective Generative Adversarial Active Learning |

ii) Outlier Ensembles & Outlier Detector Combination Frameworks:
| Type | Abbreviation | Algorithm |
| --- | --- | --- |
| Outlier Ensembles | | Feature Bagging |
| Outlier Ensembles | LSCP | Locally Selective Combination of Parallel Outlier Ensembles |
| Outlier Ensembles | XGBOD | Extreme Boosting Based Outlier Detection (Supervised) |
| Outlier Ensembles | LODA | Lightweight On-line Detector of Anomalies |
| Combination | Average | Simple combination by averaging the scores |
| Combination | Weighted Average | Simple combination by averaging the scores with detector weights |
| Combination | Maximization | Simple combination by taking the maximum scores |
| Combination | AOM | Average of Maximum |
| Combination | MOA | Maximization of Average |
| Combination | Median | Simple combination by taking the median of the scores |
| Combination | Majority Vote | Simple combination by taking the majority vote of the labels (weights can be used) |

iii) Utility Functions:
| Function | Description |
| --- | --- |
| generate_data | Synthetic data generation; normal data is generated by a multivariate Gaussian and outliers are generated by a uniform distribution |
| generate_data_clusters | Synthetic data generation in clusters; more complex data patterns can be created with multiple clusters |
| wpearsonr | Calculate the weighted Pearson correlation of two samples |
| get_label_n | Turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores |
| precision_n_scores | Calculate precision @ rank n |

Angle-Based Outlier Detection (ABOD)
k-Nearest Neighbors Detector
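As the algorithm table above notes, the kNN detector uses the distance to the k-th nearest neighbor as the outlier score. A brute-force NumPy sketch of the idea (not PyOD's implementation; the choice of k and the toy data are illustrative):

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    """Score each point by its distance to its k-th nearest neighbor.

    Larger scores mean more isolated points, i.e. likelier outliers.
    Brute force for clarity; PyOD's KNN uses optimized neighbor search.
    """
    # pairwise Euclidean distances, shape (n, n)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # sort each row; column 0 is the distance to the point itself (0.0),
    # so column k holds the distance to the k-th nearest neighbor
    sorted_dists = np.sort(dists, axis=1)
    return sorted_dists[:, k]

# tight cluster around the origin plus one far-away point
X = np.vstack([np.random.RandomState(0).randn(20, 2) * 0.1, [[5.0, 5.0]]])
scores = knn_outlier_scores(X, k=3)
print(np.argmax(scores))  # the far-away point gets the highest score
```

The same idea with `method='mean'` or `method='median'` (averaging over the k neighbors instead of taking the k-th distance) gives the AvgKNN and MedKNN variants from the table.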
Isolation Forest
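Isolation Forest isolates points with random recursive splits; anomalies take fewer splits to isolate, so they end up closer to the roots of the random trees. PyOD's IForest wraps scikit-learn's IsolationForest, which can be sketched directly (the toy data and parameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# dense cluster near the origin plus one injected far-away point
X = np.vstack([rng.randn(100, 2) * 0.5, [[6.0, 6.0]]])

iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = iso.fit_predict(X)  # scikit-learn convention: 1 = inlier, -1 = outlier
print(labels[-1])  # the injected far-away point is flagged as -1
```

Note that scikit-learn uses 1/-1 labels, whereas PyOD detectors return 0/1 (1 for outliers).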
Histogram-based Outlier Detection
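HBOS builds a histogram per feature and scores a point by how sparse the bins it falls into are. A simplified NumPy sketch of that idea (not PyOD's implementation, which also handles bin-width selection and regularization; the bin count here is an illustrative assumption):

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """Simplified histogram-based outlier score.

    For each feature, bin the values and estimate a density per bin;
    a point's score is the sum over features of log(1 / density),
    so points falling into sparse bins receive high scores.
    """
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        hist, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        # map each value to its bin index (0 .. n_bins - 1)
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        density = np.maximum(hist[idx], 1e-12)  # avoid log(0) for empty bins
        scores += np.log(1.0 / density)
    return scores

rng = np.random.RandomState(42)
X = np.vstack([rng.randn(100, 2), [[8.0, 8.0]]])
scores = hbos_scores(X)
print(np.argmax(scores))  # the injected point lies in the sparsest bins
```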
Local Correlation Integral (LOCI)
Feature Bagging
Clustering Based Local Outlier Factor
Using PyOD
API Overview
Note in particular that anomaly detection algorithms are essentially unsupervised, so they only need X (the input data) and not y (labels). Using PyOD feels much like using the clustering algorithms in scikit-learn: every PyOD detector (clf) exposes the same unified API.
Once a detector clf has been initialized and fit(X) has been called, clf generates two important attributes:

- decision_scores_: the raw outlier scores of the training data (higher means more abnormal)
- labels_: the binary outlier labels of the training data (0 for inliers, 1 for outliers)
In other words, after initializing a detector clf we can "train" it directly on the data X, and then read off the outlier scores of X (clf.decision_scores_) and its outlier labels (clf.labels_). Once clf has been fitted (i.e. after fit() has run), we can use decision_function() and predict() to score and label previously unseen data.
Example code:
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
from pyod.models.abod import ABOD
from pyod.models.knn import KNN
from pyod.utils.data import generate_data, get_outliers_inliers

# generate random data with two features
X_train, Y_train = generate_data(n_train=200, train_only=True, n_features=2)

# by default the outlier fraction is 0.1 in generate_data
outlier_fraction = 0.1

# store outliers and inliers in different numpy arrays
x_outliers, x_inliers = get_outliers_inliers(X_train, Y_train)
n_inliers = len(x_inliers)
n_outliers = len(x_outliers)

# separate the two features and use them to plot the data
F1 = X_train[:, [0]].reshape(-1, 1)
F2 = X_train[:, [1]].reshape(-1, 1)

# create a meshgrid
xx, yy = np.meshgrid(np.linspace(-10, 10, 200), np.linspace(-10, 10, 200))

# scatter plot
plt.scatter(F1, F2)
plt.xlabel('F1')
plt.ylabel('F2')

# create a dictionary holding every model used for outlier detection
classifiers = {
    'Angle-based Outlier Detector (ABOD)': ABOD(contamination=outlier_fraction),
    'K Nearest Neighbors (KNN)': KNN(contamination=outlier_fraction)
}

# fit the data to each model in the dictionary, then see how each detects outliers
# set the figure size
plt.figure(figsize=(12, 6))

for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the dataset to the model
    clf.fit(X_train)
    # predict raw anomaly score
    scores_pred = clf.decision_function(X_train) * -1
    # prediction of a datapoint category: outlier or inlier
    y_pred = clf.predict(X_train)
    # number of errors in prediction
    n_errors = (y_pred != Y_train).sum()
    print('No of Errors : ', clf_name, n_errors)

    # rest of the code creates the visualization
    # threshold value to consider a datapoint inlier or outlier
    threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)
    # decision_function calculates the raw anomaly score for every point
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
    Z = Z.reshape(xx.shape)

    subplot = plt.subplot(1, 2, i + 1)
    # fill blue colormap from minimum anomaly score to threshold value
    subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 10),
                     cmap=plt.cm.Blues_r)
    # draw red contour line where the anomaly score equals the threshold
    a = subplot.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
    # fill orange where the anomaly score ranges from threshold to maximum
    subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')
    # scatter plot of inliers with white dots
    b = subplot.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1],
                        c='white', s=20, edgecolor='k')
    # scatter plot of outliers with black dots
    c = subplot.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1],
                        c='black', s=20, edgecolor='k')
    subplot.axis('tight')
    subplot.legend(
        [a.collections[0], b, c],
        ['learned decision function', 'true inliers', 'true outliers'],
        prop=matplotlib.font_manager.FontProperties(size=10),
        loc='lower right')
    subplot.set_title(clf_name)
    subplot.set_xlim((-10, 10))
    subplot.set_ylim((-10, 10))
plt.show()
```

PyOD in Practice: Anomaly Detection on Big Mart Sales Data
Dataset: https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/
Data description:
| Variable | Description |
| --- | --- |
| Item_Identifier | Unique product ID |
| Item_Weight | Weight of product |
| Item_Fat_Content | Whether the product is low fat or not |
| Item_Visibility | The % of total display area of all products in a store allocated to the particular product |
| Item_Type | The category to which the product belongs |
| Item_MRP | Maximum Retail Price (list price) of the product |
| Outlet_Identifier | Unique store ID |
| Outlet_Establishment_Year | The year in which the store was established |
| Outlet_Size | The size of the store in terms of ground area covered |
| Outlet_Location_Type | The type of city in which the store is located |
| Outlet_Type | Whether the outlet is just a grocery store or some sort of supermarket |
| Item_Outlet_Sales | Sales of the product in the particular store; this is the outcome variable to be predicted |

1. Load the required Python packages and modules:
```python
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
import warnings
warnings.filterwarnings('ignore')

# Import models
from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF
```

2. Read the data and draw a scatter plot of Item_MRP vs Item_Outlet_Sales to get a feel for it:
```python
df = pd.read_csv("train_kOBLwZA.csv")
df.plot.scatter('Item_MRP', 'Item_Outlet_Sales')
```

3. Normalize the data:
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
df[['Item_MRP', 'Item_Outlet_Sales']] = scaler.fit_transform(
    df[['Item_MRP', 'Item_Outlet_Sales']])
df[['Item_MRP', 'Item_Outlet_Sales']].head()
```

4. Store the values in a NumPy array for later use in our models:
```python
X1 = df['Item_MRP'].values.reshape(-1, 1)
X2 = df['Item_Outlet_Sales'].values.reshape(-1, 1)
X = np.concatenate((X1, X2), axis=1)
```

5. Build a dictionary of models, setting the outlier fraction to 0.05 (5%):
```python
random_state = np.random.RandomState(1024)
outliers_fraction = 0.05

# Define seven outlier detection tools to be compared
classifiers = {
    'Angle-based Outlier Detector (ABOD)': ABOD(contamination=outliers_fraction),
    'Cluster-based Local Outlier Factor (CBLOF)': CBLOF(
        contamination=outliers_fraction, check_estimator=False,
        random_state=random_state),
    'Feature Bagging': FeatureBagging(
        LOF(n_neighbors=35), contamination=outliers_fraction,
        check_estimator=False, random_state=random_state),
    'Histogram-base Outlier Detection (HBOS)': HBOS(contamination=outliers_fraction),
    'Isolation Forest': IForest(contamination=outliers_fraction,
                                random_state=random_state),
    'K Nearest Neighbors (KNN)': KNN(contamination=outliers_fraction),
    'Average KNN': KNN(method='mean', contamination=outliers_fraction)
}
```

6. Fit each model in turn and see how their outlier predictions differ:
```python
xx, yy = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))

for i, (clf_name, clf) in enumerate(classifiers.items()):
    clf.fit(X)
    # predict raw anomaly score
    scores_pred = clf.decision_function(X) * -1
    # prediction of a datapoint category: outlier or inlier
    y_pred = clf.predict(X)
    n_inliers = len(y_pred) - np.count_nonzero(y_pred)
    n_outliers = np.count_nonzero(y_pred == 1)
    plt.figure(figsize=(10, 10))

    # copy of the dataframe, so the original df is left untouched
    dfx = df.copy()
    dfx['outlier'] = y_pred.tolist()

    # IX1 - inlier feature 1, IX2 - inlier feature 2
    IX1 = np.array(dfx['Item_MRP'][dfx['outlier'] == 0]).reshape(-1, 1)
    IX2 = np.array(dfx['Item_Outlet_Sales'][dfx['outlier'] == 0]).reshape(-1, 1)
    # OX1 - outlier feature 1, OX2 - outlier feature 2
    OX1 = dfx['Item_MRP'][dfx['outlier'] == 1].values.reshape(-1, 1)
    OX2 = dfx['Item_Outlet_Sales'][dfx['outlier'] == 1].values.reshape(-1, 1)

    print('OUTLIERS : ', n_outliers, 'INLIERS : ', n_inliers, clf_name)

    # threshold value to consider a datapoint inlier or outlier
    threshold = stats.scoreatpercentile(scores_pred, 100 * outliers_fraction)
    # decision_function calculates the raw anomaly score for every point
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
    Z = Z.reshape(xx.shape)

    # fill blue colormap from minimum anomaly score to threshold value
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
                 cmap=plt.cm.Blues_r)
    # draw red contour line where the anomaly score equals the threshold
    a = plt.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
    # fill orange where the anomaly score ranges from threshold to maximum
    plt.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')
    b = plt.scatter(IX1, IX2, c='white', s=20, edgecolor='k')
    c = plt.scatter(OX1, OX2, c='black', s=20, edgecolor='k')
    plt.axis('tight')

    # loc=2 places the legend in the top-left corner
    plt.legend(
        [a.collections[0], b, c],
        ['learned decision function', 'inliers', 'outliers'],
        prop=matplotlib.font_manager.FontProperties(size=20),
        loc=2)
    plt.xlim((0, 1))
    plt.ylim((0, 1))
    plt.title(clf_name)
    plt.show()
```

Execution results:
```
OUTLIERS :  447 INLIERS :  8076 Angle-based Outlier Detector (ABOD)
OUTLIERS :  427 INLIERS :  8096 Cluster-based Local Outlier Factor (CBLOF)
OUTLIERS :  392 INLIERS :  8131 Feature Bagging
OUTLIERS :  501 INLIERS :  8022 Histogram-base Outlier Detection (HBOS)
OUTLIERS :  427 INLIERS :  8096 Isolation Forest
OUTLIERS :  311 INLIERS :  8212 K Nearest Neighbors (KNN)
OUTLIERS :  176 INLIERS :  8347 Average KNN
```

References: