Reliable anomaly/outlier detection algorithms have practical applications in many fields. For instance, anomaly detection makes it possible to filter and clean the data used to train machine learning algorithms, improving their performance. However, outlier mining is challenging when the data is high-dimensional, and different approaches have been proposed for different types of data (temporal, spatial, network, etc.). Here we propose a methodology to mine outliers in generic datasets for which a meaningful distance between elements can be defined. The methodology is based on a fully connected, undirected graph in which the nodes are the elements of the dataset and the weight of each link is the distance between the two nodes it connects. Outlier scores are defined by analyzing the structure of the graph, in particular by using the Jensen–Shannon (JS) divergence to compare the distributions of link weights of different nodes. We demonstrate the method on a publicly available database of credit-card transactions, some of which are labeled as fraudulent. We compare the results with the performance obtained when using Euclidean distances and graph percolation, and show that the JS divergence improves performance at the price of a higher computational cost.
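As an illustration of the idea of comparing the weight distributions of different nodes, the following is a minimal sketch, not the implementation used in this work. It assumes Euclidean distance as the distance between elements, a histogram with a shared set of bins as the estimate of each node's weight distribution, and the mean JS divergence to all other nodes as the outlier score. The function names, the `n_bins` parameter, and the averaging rule are all hypothetical choices made for the example.

```python
import numpy as np

def pairwise_distances(X):
    """Weights of the fully connected graph: Euclidean distance (an assumption)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def weight_histograms(D, n_bins=10):
    """Normalized histogram of each node's link weights, on shared bin edges."""
    n = D.shape[0]
    off_diag = ~np.eye(n, dtype=bool)  # ignore the zero self-distances
    edges = np.linspace(0.0, D[off_diag].max(), n_bins + 1)
    H = np.empty((n, n_bins))
    for i in range(n):
        counts, _ = np.histogram(D[i, off_diag[i]], bins=edges)
        H[i] = counts / counts.sum()
    return H

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (base-2 logarithm), bounded by [0, 1]."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def outlier_scores(X, n_bins=10):
    """Hypothetical score: mean JS divergence between a node and all others."""
    H = weight_histograms(pairwise_distances(X), n_bins)
    n = len(H)
    J = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            J[i, j] = J[j, i] = js_divergence(H[i], H[j])
    return J.sum(axis=1) / (n - 1)

# Usage on synthetic data: a Gaussian cluster plus one distant point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [[20.0, 20.0]]])
scores = outlier_scores(X)
ranking = np.argsort(scores)[::-1]  # indices from most to least anomalous
```

The double loop over node pairs makes the quadratic cost of comparing distributions explicit, on top of the quadratic cost of building the distance matrix.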