Number of downloads in 2021 is based on the cran_downloads() function in the cranlogs R library. NA, not applicable.
Table 1 shows that neither the cluster R package nor the sklearn-cluster module in Python allows the evaluation of cluster stability. As indicated in Section 2, one should code this step oneself in Python, or link (if possible) with other packages in R. Of the selected R packages, FPC was the most downloaded in 2021 and provides the most internal assessment metrics. Both Clue and FPC evaluate cluster stability by bootstrapping, but only FPC includes other methods such as noising, the complementary use of these two methods being recommended by Hennig (2008). clValid, on the other hand, proposes simpler methods, mainly used in biology, for evaluating the stability of clusters by removing the variables one by one.
The table in Appendix A is based on Table 1 in Fahad et al. (2014), which we adapted for our purpose. Overall, Appendix A shows that none of the algorithms included in the selected packages satisfies all the properties sought in terms of genericity, simplicity of use and implementation, and robustness. For example, CLARA and Mini-batch K-means both allow very good handling of large data, are adapted to some extent to high dimensionality, and rely on few parameters to be optimized. However, they only apply to continuous data and are not particularly suitable for noisy data. Moreover, unlike CLARA, Mini-batch K-means is only available in the Python scikit-learn module.
This first synthesis highlights the challenge to be overcome.
Based on the literature review (Section 2) and the preliminary work (Section 3.3), we propose the Qluster workflow (refer to Figure 3), a set of methods that together represent a good balance for data scientists to perform clustering on health data in a practical, efficient, robust, and simple way. It covers the cluster generation step (step 3) through (1) factor analysis, (2) data clustering, and (3) stability evaluation. The output of the factor analysis (PCA, MCA, or FAMD) is the matrix of the coordinates of the individuals on the factorial dimensions, i.e., a table of continuous variables, which can then be clustered with a PAM-type algorithm. For an in-depth discussion of the Qluster workflow, refer to Section 5.
To summarize, Qluster tries to generalize clustering tasks through a generic framework that is:
In addition, the Qluster workflow relies solely on four state-of-the-art R packages (FactoMineR, factoextra, FPC, and missMDA), allowing data scientists to quickly manage data of different natures and volumes and perform robust clustering:
Finally, the Qluster workflow is operationalizable and implementable from end to end (see Appendix B for a picture of the implementation in the Dataiku 20 platform, available upon request: contact@quinten-france.com).
This generic workflow, usable in most situations, can be described through the following pseudocode ( Algorithm 1 ):
The Qluster pseudo-code.
input: X: the input data
packages: FactoMineR, factoextra, FPC, missMDA
output: Q: a clustering of X and associated measures
1  if X is continuous only then
2      F = PCA(X), with F a FactoMineR object of class PCA
3  else if X is categorical only then
4      F = MCA(X), with F a FactoMineR object of class MCA
5  else  // mixed continuous and categorical
6      F = FAMD(X), with F a FactoMineR object of class FAMD
7  end
8  Define from F the matrix M of the coordinates of individuals on each dimension
9  if X is "large" then
10     Apply fviz_eig() on F to select C_opt, a sufficient number of components
11     Define the matrix M_opt as M restricted to the C_opt components
12     P = pamk(M_opt), with usepam = FALSE (CLARA), criterion = "asw", scaling = FALSE, and samples, sampsize, and krange set at convenience
13     Let K be the optimal number of clusters in P
14 else  // X is not "large"
15     if X is continuous only then
16         F_ncp = estim_ncpPCA(X), with a large [ncp.min, ncp.max] range
17     else if X is categorical only then
18         F_ncp = estim_ncpMCA(X), with a large [ncp.min, ncp.max] range
19     else  // mixed continuous and categorical
20         F_ncp = estim_ncpFAMD(X), with a large [ncp.min, ncp.max] range
21     end
22     Retrieve the optimal number of dimensions C_opt from F_ncp
23     Define the matrix M_opt as M restricted to the C_opt components
24     P = pamk(M_opt), with usepam = TRUE (PAM), criterion = "asw", scaling = FALSE, and krange set at convenience
25     Let K be the optimal number of clusters in P
26 end
27 S = clusterboot(M_opt), with bootmethod = c("boot", "noise"), krange = K, clustermethod = "pamkCBI", and the same parameters as in the previous step; loop on the noise_level parameter to test different noise levels
28 Return all useful results in Q
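To make the pseudocode concrete, below is a minimal R sketch of the categorical, "not large" branch (MCA followed by PAM and stability assessment). The data.frame X, the object names, and the parameter values (ncp.max, krange, B) are illustrative assumptions, not the settings used in the article.

```r
# Minimal sketch of the categorical, "not large" branch of Algorithm 1.
# X is assumed to be a data.frame of factors with no missing values.
library(FactoMineR)  # MCA()
library(missMDA)     # estim_ncpMCA()
library(fpc)         # pamk(), clusterboot(), pamkCBI

# Estimate the number of MCA dimensions to keep (illustrative range)
n_comp <- estim_ncpMCA(X, ncp.min = 1, ncp.max = 8)$ncp

# Factor analysis and matrix of individual coordinates
fa    <- MCA(X, ncp = n_comp, graph = FALSE)
M_opt <- fa$ind$coord            # n x n_comp matrix of continuous coordinates

# PAM with K selected by average silhouette width (ASW)
P <- pamk(M_opt, krange = 2:8, criterion = "asw", usepam = TRUE, scaling = FALSE)
K <- P$nc

# Cluster stability under bootstrap and noise perturbations
S <- clusterboot(M_opt, B = 50, bootmethod = c("boot", "noise"),
                 clustermethod = pamkCBI, krange = K,
                 criterion = "asw", usepam = TRUE, scaling = FALSE,
                 seed = 1234, count = FALSE)
S$bootmean                       # mean Jaccard similarity per cluster (bootstrap)
```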
The Cardiovascular Disease 21 dataset includes 70,000 patients with or without cardiovascular disease and 12 variables (five of them are continuous).
The following raw variables were used (raw variables' names are in italic):
The objective of this section is to present in detail the application of the Qluster workflow proposed in Section 3 on the following use case: to characterize the phenotypes of patients with cardiovascular disease (a subset of patients with cardio = Y). This represents 34,979 patients (about 50% of the whole dataset).
The following section details the application of the Qluster workflow to the cardiovascular dataset to help scientists use it for their own projects. Additional elements to the ones presented in Section 2 supporting the present methodology are also provided when relevant. We first present the preprocessing of the dataset, in which notably the few continuous variables are converted into qualitative data, before applying an MCA, which is a data-reduction technique for exploring the associations among multiple categorical variables (Greenacre, 1984 ; Warwick et al., 1989 ; Murtagh, 2005 ; Greenacre and Blasius, 2006 ; Nishisato, 2019 ). Then, given the large size of the database, the CLARA algorithm is applied and optimized. Finally, the clusters' stability is assessed and a brief interpretation of the clusters is provided.
First, the Body Mass Index (BMI) variable was created from both height and weight (Ortega et al., 2016). Then, outliers were detected by defining, for each quantitative variable, the thresholds above or below which values are most likely to be inaccurate. Acceptable values should be in the following ranges: 18 ≤ Age < 120, 10 ≤ BMI < 100, SBP ≤ 400, and DBP ≤ 200 [Ortega et al., 2016; Mayo Clinic 22; French HTA (HAS) recommendations 23]. For simplicity, patients with at least one outlier were removed from the analysis (sensitivity analyses could be performed). Quantitative variables were then discretized, both to create variables with clinical meaning and to enable the use of the MCA algorithm (refer to Table 2).
Description of quantitative feature engineering.
Variable | Description | Modalities | Accepted value range |
---|---|---|---|
Age | Age | age ≤ 55; age > 55 | [18, 120] |
BMI | BMI | Underweight: <18.5 kg/m²; Normal: between 18.5 and 24.9 kg/m²; Overweight: between 25.0 and 29.9 kg/m²; Obese: ≥30.0 kg/m² | [10, 100] |
high_sbp | High systolic blood pressure | 1: SBP > 130 mmHg; 0: SBP ≤ 130 mmHg | ≤ 400 |
high_dbp | High diastolic blood pressure | 1: DBP > 80 mmHg; 0: DBP ≤ 80 mmHg | ≤ 200 |
An additional binary hypertension variable was created from the high_sbp and high_dbp variables, used as a proxy for patients with hypertension [hypertension = 1 if high_sbp = 1 and high_dbp = 1; else hypertension = 0 (Williams et al., 2018)]. A hedged preprocessing sketch covering these steps is given below.
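The sketch below illustrates this preprocessing in base R. The column names (age, height, weight, ap_hi, ap_lo, cardio), the semicolon separator, and the assumption that the raw age is expressed in days all follow the Kaggle file and are assumptions; the thresholds come from Table 2.

```r
# Hedged preprocessing sketch; column names follow the Kaggle cardio_train.csv file
dat <- read.csv("cardio_train.csv", sep = ";")
dat <- dat[dat$cardio == 1, ]                 # keep patients with cardiovascular disease

dat$age_years <- dat$age / 365.25             # raw age is assumed to be given in days
dat$bmi <- dat$weight / (dat$height / 100)^2  # BMI from weight (kg) and height (cm)

# Remove patients with at least one out-of-range value (Table 2)
keep <- with(dat, age_years >= 18 & age_years < 120 &
                  bmi >= 10 & bmi < 100 &
                  ap_hi <= 400 & ap_lo <= 200)
dat <- dat[keep, ]

# Discretize quantitative variables and derive the hypertension proxy
dat$age_cat  <- ifelse(dat$age_years <= 55, "age<=55", "age>55")
dat$bmi_cat  <- cut(dat$bmi, breaks = c(-Inf, 18.5, 25, 30, Inf), right = FALSE,
                    labels = c("underweight", "normal", "overweight", "obese"))
dat$high_sbp <- as.integer(dat$ap_hi > 130)
dat$high_dbp <- as.integer(dat$ap_lo > 80)
dat$hypertension <- as.integer(dat$high_sbp == 1 & dat$high_dbp == 1)
```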
Finally, the variables used to discriminate the patients must be chosen according to their medical relevance to the context of the study. To this end, users should always consider the results they would obtain depending on whether a variable is included or not. In particular, they should ask themselves whether active discrimination of the clusters by a variable is sought: taking the two common variables age and race as an example, actively including them in the clustering step will tend to create groups of young vs. old and Caucasian vs. non-Caucasian patients. If such discrimination is not sought, these variables can instead be kept for a passive analysis of the generated clusters, to assess a posteriori any heterogeneity on these variables. In this use case, we removed the height, weight, and systolic and diastolic blood pressure features, as they are used to create the derived features listed above and are not useful on their own for clustering.
In the end, we obtained a database of 34,134 patients and 11 variables.
Clustering variables with low-prevalence modalities are known to be challenging in data analysis, especially for techniques that are very sensitive to the data and/or to anomalous cases [e.g., regression analysis and factor analysis (Fahrmeir et al., 2013)]. The most common remedies consist of either merging rare modalities into groups of higher frequency or discarding the modalities and/or variables concerned. Likewise, binary clustering variables with a low prevalence in the study population may be discarded from the analysis or grouped with other features when appropriate.
An arbitrary threshold of 10% was set to identify and exclude features with rare modalities from the clustering features. This is consistent with recommendations prior to using Multiple Correspondence Analysis, which over-weights rare modalities and multi-modality variables (Le Roux and Rouanet, 2010; Di Franco, 2016). When possible, modalities with a prevalence below 10% were grouped with others based on medical relevance, as illustrated in the sketch below. As a result, both the Smoking (8.3% with smoke = Yes) and Alcohol intake (5.2% with alco = Yes) variables were excluded from the clustering and only used for the a posteriori description of the clusters.
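As a minimal illustration of this filtering step (the data.frame X of categorical clustering variables is an assumption), the prevalence of the rarest modality of each variable can be checked as follows:

```r
# Flag categorical variables whose rarest modality falls below the 10% threshold
min_prev <- sapply(X, function(v) min(prop.table(table(v))))
names(min_prev)[min_prev < 0.10]   # candidates for grouping or removal
```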
Moreover, some modalities were aggregated for the two following variables:
In the end, the dataset used to perform the MCA contains a total of nine categorical variables (age, BMI, high_sbp, high_dbp, hypertension, gluc, gender, cholesterol, and physical activity).
As with other methods for factor analysis (e.g., PCA and CA), MCA was combined with cluster analysis to capture data heterogeneity, through clusters of observations in the population that show distinctive patterns (Buuren and Heiser, 1989 ; Hwang et al., 2006 ; Mitsuhiro and Hiroshi, 2015 ; van de Velden et al., 2017 ; Testa et al., 2021 ).
The number of MCA components to be used was decided using the standard scree plot, by identifying the "elbow" of the curve [a method widely used with PCA (Cattell, 1996)], while constraining eigenvalues to be strictly above a threshold of 0.11, the equivalent of Kaiser's rule in PCA (i.e., 1/C, with C the number of categorical variables).
Based on the scree plot (refer to Figure 4 ), three dimensions were chosen, the third marking a clear elbow in the curve (related eigenvalue: 0.12; related percentage of variance explained: 9.8%).
Moreover, for interpretation purposes, eigenvalues were corrected using the Benzecri correction 24 to account for the fact that the binary coding scheme used in MCA creates artificial factors and, therefore, underestimates the inertia explained (Greenacre, 1984). The top three components gather 99.9% of the inertia after correction with the Benzecri method (more details in Appendix D).
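Below is a hedged sketch of the dimension-selection and Benzecri-correction steps. The article's own code is given in Appendix C; here, fa denotes the MCA result and X the input table, both assumptions, and the benzecri() helper is one possible implementation of the standard correction formula.

```r
library(factoextra)

# Scree plot and Kaiser-like threshold (1/C) for the MCA dimensions
fviz_eig(fa, choice = "eigenvalue", addlabels = TRUE)
eig <- get_eigenvalue(fa)[, "eigenvalue"]
sum(eig > 1 / ncol(X))                    # number of dimensions above 1/C

# Benzecri correction: keep eigenvalues above 1/C and rescale them
# as ((C/(C-1)) * (lambda - 1/C))^2, then re-express them as % of inertia
benzecri <- function(eig, C) {
  keep <- eig > 1 / C
  adj  <- ((C / (C - 1)) * (eig[keep] - 1 / C))^2
  data.frame(dim = which(keep), eigenvalue = adj, pct = 100 * adj / sum(adj))
}
benzecri(eig, C = ncol(X))
```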
The CLARA algorithm was used through the pamk() function in the FPC R package (version 2.2-5), a reliable package for flexible procedures for clustering, and with the following main parameters:
Other parameters include not scaling the input data, so as not to modify the observation space obtained from the MCA.
The optimal ASW was obtained for a pool of three clusters (ASW: 0.42, refer to Figure 5 ).
Homogeneity and separability of the clusters were further studied by analyzing the Silhouette Widths of patients in the best sample used to generate the clusters' medoids, using the fviz_silhouette() function in the factoextra R package. As a reminder, the Silhouette Width characterizes both the cohesion of a cluster and its separation from the other clusters: a positive (respectively, negative) Silhouette Width for a patient indicates a correct (respectively, incorrect) assignment to its own cluster.
Figure 6 shows a high level of intra-cluster cohesion and inter-cluster separability, as only a few patients (in clusters 2 and 3) have negative Silhouettes. Clusterwise Silhouette Widths are also all positive (ASW of 0.42, 0.47, and 0.30 for clusters 1, 2, and 3, respectively).
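A hedged sketch of this CLARA-and-silhouette step is given below. M_opt is the matrix of retained MCA coordinates, and the samples, sampsize, and krange values are illustrative rather than those used in the study.

```r
library(fpc)
library(factoextra)

# CLARA through pamk(): usepam = FALSE, K chosen by average silhouette width
P <- pamk(M_opt, krange = 2:8, criterion = "asw", usepam = FALSE,
          scaling = FALSE, samples = 100, sampsize = 1000)
P$nc                                  # optimal number of clusters
P$crit                                # ASW obtained for each K in krange

# Silhouette inspection of the best CLARA sample
fviz_silhouette(P$pamobject)          # per-patient silhouette widths by cluster
P$pamobject$silinfo$clus.avg.widths   # clusterwise average silhouette widths
```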
In order to evaluate clustering robustness, the clustering was performed several times on randomly perturbed versions of the cohort. This generates, under perturbation, new versions of the original clusters and thus allows the evaluation of their stability: the more similar the perturbed clusters are to the original ones, the more stable the clustering. The data perturbation step was performed using two approaches that may provide complementary information based on the results in Hennig (2007): the bootstrap and noise methods.
Clusters are considered all the more stable as the Jaccard similarity statistics and the number of recovered clusters are high, and the number of dissolved clusters is low.
The results of the data perturbation step are shown below:
Regardless of the method used, the results appear very robust, which can most likely be explained by the large size of the database and the small number of clusters retained in the context of synthetic data. Cluster stability may be more variable in real cases.
It is worth noting that the clusterboot() function can also provide useful results and plots of clusters' stability (histogram of Jaccard similarity statistics by cluster, summary information for each cluster, etc.), but we did not provide them in this article since the obtained Jaccard similarity metrics were all around 100%.
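For reference, the stability outputs discussed above can be read directly from the clusterboot() result. In the sketch below, S is assumed to have been produced with bootmethod = c("boot", "noise") as in the pseudocode, and the component names follow the names of the chosen bootmethods.

```r
# Reading clusterboot() stability output
print(S)        # summary: Jaccard means, recovered and dissolved counts per method
S$bootmean      # mean Jaccard similarity per cluster (bootstrap)
S$noisemean     # mean Jaccard similarity per cluster (noise)
S$bootrecover   # number of times each cluster was recovered (bootstrap)
S$bootbrd       # number of times each cluster was dissolved (bootstrap)
```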
Descriptive statistics (proportions and lift values) were computed for variables included or not in the clustering step. Cluster 1 [n = 12,272 (36.0%)] groups patients who all have high values of diastolic and systolic blood pressure and, consequently, hypertension. These patients are slightly more likely than average to have well-above-normal cholesterol values (18.5% vs. 17.7%) and above-normal glucose values (20.0% vs. 18.3%). In contrast, patients from clusters 2 and 3 [n = 15,477 (45.3%) and n = 6,385 (18.7%), respectively] have normal values of diastolic and systolic blood pressure in 81% to 87% of cases, and none have hypertension. Compared with cluster 2, many more patients from cluster 3 have above-normal cholesterol values (26.6% vs. 8.5%) and above-normal glucose values (57.3% vs. 0.8%).
Patients from clusters 1 and 2 are overall younger than those in cluster 3 (age > 55: 54.7% and 51.8% vs. 65.9%). Patients from clusters 1 and 3 are overall more often obese than those in cluster 2 (41.4% and 44.5% vs. 21.8%).
To summarize, among patients with cardiovascular disease, cluster 1 gathers patients with hypertension, cluster 2 gathers healthier patients (although of about the same age as cluster 1), and cluster 3 gathers slightly older patients with high cholesterol and glucose levels (although no hypertension). Interestingly, the description of cluster 1 is consistent with a poorer lifestyle (lift values of 1.21 and 1.28 for Smoke and Alcohol, respectively), although these variables did not actively participate in the clustering. Refer to Table 3 for more details.
Prevalence and lift values of each modality and by cluster.
Modalities | Prevalence C1: n = 12,272 (36.0%) | Prevalence C2: n = 15,477 (45.3%) | Prevalence C3: n = 6,385 (18.7%) | Prevalence cohort: n = 34,134 (100%) | Lift C1 | Lift C2 | Lift C3 |
---|---|---|---|---|---|---|---|
Female | 62.1 | 64.2 | 71.4 | 64.8 | 0.96 | 0.99 | 1.10 |
Male | 37.9 | 35.8 | 28.6 | 35.2 | 1.08 | 1.02 | 0.81 |
Cholesterol normal | 60.6 | 90.7 | 16.5 | 66.0 | 0.92 | 1.37 | 0.25 |
Cholesterol above normal | 20.8 | 8.5 | 26.6 | 16.3 | 1.28 | 0.52 | 1.63 |
Cholesterol well-above normal | 18.5 | 0.7 | 56.9 | 17.7 | 1.05 | 0.04 | 3.22 |
Glucose normal | 80.0 | 99.2 | 42.7 | 81.7 | 0.98 | 1.21 | 0.52 |
Glucose above normal | 20.0 | 0.8 | 57.3 | 18.3 | 1.10 | 0.04 | 3.13 |
Physical activity | 81.0 | 76.8 | 79.6 | 78.8 | 1.03 | 0.97 | 1.01 |
Age ≤ 55 | 45.3 | 48.2 | 34.1 | 44.5 | 1.02 | 1.08 | 0.77 |
Age > 55 | 54.7 | 51.8 | 65.9 | 55.5 | 0.99 | 0.93 | 1.19 |
BMI obese | 41.4 | 21.8 | 44.5 | 33.1 | 1.25 | 0.66 | 1.34 |
BMI overweight | 36.5 | 37.3 | 36.1 | 36.8 | 0.99 | 1.01 | 0.98 |
BMI normal or underweight | 22.1 | 40.9 | 19.5 | 30.1 | 0.73 | 1.36 | 0.65 |
High Systolic blood pressure | 100.0 | 14.3 | 18.5 | 45.9 | 2.18 | 0.31 | 0.40 |
High Diastolic blood pressure | 100.0 | 12.9 | 16.8 | 44.9 | 2.23 | 0.29 | 0.37 |
Hypertension | 100.0 | 0 | 0 | 36.0 | 2.78 | 0 | 0 |
Smoke | 10.1 | 7.2 | 7.6 | 8.3 | 1.21 | 0.87 | 0.91 |
Alcohol | 6.6 | 3.8 | 5.7 | 5.2 | 1.28 | 0.74 | 1.09 |
C1, C2, and C3 stand for Cluster 1, Cluster 2, and Cluster 3, respectively. Color coding of lift values: light blue: ≤0.5; blue: ≤1.0; yellow: ≤1.5; green: >1.5. Lift is defined as the prevalence in the cluster divided by the prevalence in the cohort.
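As a hedged illustration, the prevalence and lift values of Table 3 can be recomputed as follows. The names flag (a 0/1 modality indicator) and clusters (the vector of cluster labels) are assumptions introduced for the example.

```r
# Prevalence (%) and lift of a binary modality by cluster
lift_table <- function(flag, clusters) {
  overall    <- mean(flag)                       # prevalence in the whole cohort
  by_cluster <- tapply(flag, clusters, mean)     # prevalence within each cluster
  data.frame(cluster    = names(by_cluster),
             prevalence = round(100 * by_cluster, 1),
             lift       = round(by_cluster / overall, 2))
}
# Example (assumed objects): lift_table(dat$smoke, P$pamobject$clustering)
```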
In this section, we will first discuss some limitations of the Qluster workflow and possible enhancements, then discuss choices of parameters and the practical use of this workflow.
As is often the case in data mining, one limitation concerns the size of the data. It is clear that for massive data, where the number of rows is very high, specific algorithms such as grid-based methods or canopy pre-clustering (McCallum et al., 2000) are needed for the algorithms to scale up.
More specifically, in such cases, factor analysis may be impossible to compute, as it requires matrix calculations and the inversion of a matrix of size n × p (n individuals, p binary variables). Note that in the case of categorical variables, one may prefer the Anglo-Saxon MCA method, which applies the CA algorithm to a Burt table (p × p) instead of the complete disjunctive table (n × p); this is more efficient in computing time and, thus, more appropriate for large data [it is also implemented in the MCA() function in FactoMineR (Greenacre, 2007)]. Equally, in the case of very large data, the CLARA algorithm may be too time-consuming to compute, as enough samples and enough observations per sample must still be maintained for representativeness. For all these reasons, we suggest simply analyzing a random sample of the original dataset, which is likely to be very representative of the latter while allowing the use of the Qluster workflow. Note also that PCA() and FAMD() are known to take more computation time than MCA(). We also suggest (when possible) converting the data into a single type (continuous only or categorical only) in a data preparation step: the upstream scaling of mixed data can be challenging, and the computation times of FAMD are longer. Alternatives may consist of not using the proposed workflow but algorithms that are fast on (very) large data, such as Mini-batch K-means used on continuous variables or one-hot-encoded categorical variables. However, in addition to relying solely on the Euclidean distance, these strategies may not allow for the prior use of factor analysis due to the size of the data, nor for the stability of clusters to be easily and properly assessed.
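The Burt-table and random-sampling options mentioned above can be sketched as follows; the ncp value and sample size are illustrative and X is an assumed categorical table.

```r
# MCA computed from the Burt table (p x p) rather than the indicator matrix (n x p)
fa_burt <- FactoMineR::MCA(X, ncp = 3, method = "Burt", graph = FALSE)

# Or: work on a representative random subsample of a very large dataset
set.seed(42)
X_small <- X[sample(nrow(X), 5e4), ]
```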
Conversely, when the number of columns is greater than the number of rows (p > n), the dimension reduction step via factor analysis methods makes even more sense for easily managing the high dimensionality of the data. However, in the most extreme cases where p >> n, standard factor methods may fail to yield consistent estimators of the loading vectors. In addition, the results may be difficult to interpret. In such situations, regularized methods may be a solution to improve the robustness, consistency, and interpretability of the results (e.g., penalized PCA, Lee et al., 2012). It is also recommended that a relevant subset of the variables be selected prior to the analysis (when possible).
Missing values management is not covered in this workflow, and it is, therefore, assumed that no missing values are present in the dataset. Indeed, both the factor methods (PCA, MCA, FAMD) and the proposed clustering methods (PAM, CLARA, …) require data without missing values. However, this workflow can easily be generalized to missing data by using the same missMDA package as for selecting the optimal number of dimensions in factor analysis, in order to first impute missing values using factorial methods. The latter are state-of-the-art methods for handling missing values [e.g., the imputePCA(), imputeMCA(), and imputeFAMD() functions for single imputation (Audigier et al., 2013)] and can, thus, easily be integrated and/or used in an automated workflow to handle missing data. In addition, this R package makes it possible to perform multiple imputation [MIPCA(), MIMCA(), and MIFAMD()] for assessing the uncertainty of imputed values and increasing confidence in the results (Josse and Husson, 2011). In this sense, the Qluster workflow can easily be modified to reach the state of the art in missing data management (refer to Appendix E for an example of the Qluster workflow adapted for handling missing values).
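A hedged sketch of this missing-data variant (detailed in Appendix E) could look like the following, assuming X is a categorical table containing NAs and using illustrative ncp and nboot values:

```r
library(missMDA)

# Impute missing values within the same factorial framework, then proceed as usual
n_comp <- estim_ncpMCA(X, ncp.min = 1, ncp.max = 8)$ncp
X_comp <- imputeMCA(X, ncp = n_comp)$completeObs   # completed data set
fa     <- FactoMineR::MCA(X_comp, ncp = n_comp, graph = FALSE)

# Optional: multiple imputation to assess the uncertainty of imputed values
mi <- MIMCA(X, ncp = n_comp, nboot = 20)
```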
Factor analysis allows the transformation of structured data of all kinds into continuous data, while dealing with large, collinear, noisy, and high-dimensional data. It also facilitates clustering by aggregating groups of homogeneous information within dimensions. Nevertheless, it cannot be guaranteed that the results will be “better” or “as good” with factor analysis in the clustering process. Similarly, the choice of factor analysis in this workflow comes with drawbacks that include the following items:
Alternatives may consist of data dimension reduction using feature selection methods, or manually, by grouping, transforming, and/or deleting variables based on clinical expertise.
In order to provide a simple, yet generic and robust workflow for the practical use of the same methodology in many applications, we have made a careful selection of both algorithms and software packages. In particular, the decision to use the PAM/CLARA algorithm is based on many aspects such as the fact that it is:
Yet, it is clear that algorithms other than the ones chosen could be routinely used, including those contained in the FPC R package to facilitate their integration within the workflow (e.g., DBSCAN and HAC). In particular, it is known that with non-flat geometry and/or uneven cluster sizes, DBSCAN is more appropriate than K-means and PAM. Equally, if the final goal is to obtain a hierarchy rather than a unique hard partition, the user may prefer an algorithm such as HAC, which can easily be used with the packages proposed in this workflow. However, the presence of additional parameters to tune or the lack of compatibility with massive data would make the workflow more complex. It is also important to note that this workflow is not intended to replace more in-depth work by data scientists to find what is optimal for a specific case study. More experienced data scientists can use the generic Qluster workflow for a first look at the data but are encouraged to adapt the general principles of this workflow to their case study (e.g., finding the most suitable algorithm). Such adaptations would be out of the scope of this workflow in the sense of the initial objectives: genericity of application while maintaining simplicity of implementation and reliability/robustness of the methodology.
Equally, the user may want to benchmark several clustering algorithms as suggested by Hennig (2020). The comparison of the methods' solutions can be based on information measures (e.g., entropy and mutual information), internal validity measures (e.g., silhouette, refer to Section 2.4), set-matching (i.e., mapping each cluster from the first clustering to the most similar cluster in the second clustering and computing recall, precision, or any other measure), and pair counting [including dedicated visualization tools, refer to Achtert et al. (2012)]. Some of these strategies are directly implemented in the clusterbenchstats() function from the FPC R package or in the clValid() function of the clValid R package. However, as our goal is to propose a simple-to-use workflow, this complexification (which would also greatly impact computing times and memory capacities) is left to the user's discretion. Moreover, multiplying the algorithms and the combinations of parameters forces one to rely more heavily on a purely statistical criterion (e.g., ASW) to select the "best" clustering of the data, although this may not reflect the best partitioning in a clinical sense. Indeed, the ASW remains a criterion characterizing the average separability over all the clusters, and its optimum may miss (the set of) results that are clinically relevant and/or useful for the desired objective. 25 If the data scientist wants to compare different algorithms, we recommend instead fully exploring the results of a well-chosen first algorithm before challenging it with others, in order to be less dependent on the ASW as the sole selection criterion. This article, thus, takes the opposite view of the auto-ML literature by first advocating a full investigation of a parsimonious workflow made of well-chosen algorithms, rather than directly covering a wide range of algorithmic possibilities. On this topic, readers may be interested in recent areas of research around meta-clustering (Caruana et al., 2006) and ensemble clustering methods (Greene et al., 2004; Alqurashi and Wang, 2019). The first aims to produce several partitioning results so that the user can select those that are most useful. The second is intended to combine the clusterings of several methods to propose a consensual result.
The bootstrapping and noise methods were chosen in the workflow because they are both available in the same function clusterboot() from the same package as for pamk() , and for their complementarity as recommended by Hennig ( 2007 ). Nevertheless, other methods may also be used as sensitivity analyses, including those proposed in the same FPC package. Furthermore, although this step allows for the clusters to be assessed, data scientists should keep in mind that stability is not the only important validity criterion—clusters obtained by very inflexible clustering methods may be stable but also not valid, as discussed in Hennig ( 2008 ). Finally, although several choices were made to try to manage outliers as best as possible, such as using a K -medoid algorithm and the Manhattan distance, the Qluster workflow does not fully address the issues related to outliers and extreme values. One solution may be to define threshold values to manually detect extreme values as a pre-processing step (as in the case study in Section 4), or to use more sophisticated statistical methods such as Yang et al. ( 2021 ).
Clusters' description is not covered in the Qluster workflow. However, many methods exist to interpret clusters (refer to Section 2.3). Data scientists can easily generalize Qluster to the description of clusters by using functions already present in the FPC package, such as plotcluster() and cluster.varstats(), following the methodologies recommended by Hennig (2004), so as not to make the workflow too complex.
Although general, the Qluster workflow does not cover all types of data and it is clear that for medical imaging data, omics data, or data in the form of signals, dedicated approaches must be considered. Nevertheless, most tabular data can be processed using the Qluster workflow. In this respect, although the Qluster workflow was specifically designed in the context of healthcare data analysis, it can easily be applied in other fields.
Cluster stability assessment could be considered as a criterion to be optimized, by iterating on this step in order to make this property an integral part of the clustering process itself. For example, stability measures could be used to select the optimal number of clusters, assuming that the clustering results are more stable with the correct number of clusters (Fränti and Rezaei, 2020 ).
However, attention should be paid to the fact that the bootstrap and noise methods are more computationally expensive than simple methods such as deleting the variables one by one (methods used on biological measurements and proposed in the clValid R package). Also, optimizing the clustering on cluster stability may not be straightforward if the two proposed methods do not give similar results. For example, compared with the noise method, the bootstrap method is more likely to produce stable results as the size of the dataset increases in the case of PAM, and as the percentage of sample representativeness increases in the case of CLARA.
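If one nevertheless wants to explore stability-based selection of K, a hedged (and computationally heavy) sketch could look as follows; M_opt, the K grid, and B are illustrative assumptions.

```r
# Compare candidate numbers of clusters by their mean bootstrap Jaccard stability
stab <- sapply(2:6, function(k) {
  s <- fpc::clusterboot(M_opt, B = 20, bootmethod = "boot",
                        clustermethod = fpc::pamkCBI, krange = k,
                        criterion = "asw", usepam = FALSE, scaling = FALSE,
                        seed = 123, count = FALSE)
  mean(s$bootmean)
})
names(stab) <- 2:6
stab    # higher values suggest more stable partitions
```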
The question of the ultimate relevance of the clusters is not addressed in this workflow. It should be noted that the absence of results may be a result in itself, as it may characterize a population that cannot be described in terms of several homogeneous subgroups (either because such subgroups do not exist or because the variables used do not allow us to find them). Nevertheless, it is clear that, as in the Data Mining process, one can consider looping back on this workflow by changing certain parameters if the results are not satisfactory or if an important criterion of the clustering was not taken into account at the beginning (e.g., the maximum number of clusters). More generally, the data scientist is encouraged to keep in mind that the final objective of clustering is often the clinical relevance and usefulness of the results generated. In this sense, and as mentioned in Section 5.1, it is not forbidden to relax a purely statistical criterion such as the ASW (whose optimum may miss some relevant subgroups, as it is an indicator of overall separability) to better represent the diversity of the population studied, or to favor the generation of hypotheses in the case where the statistical optimum only gives broad results that are not specific enough for the initial objective.
In the same vein, negative silhouette values are viewed too pejoratively in the cluster validity analysis (interpreted as clustering failure). In fact, negative silhouettes characterize patients who, on average, are closer to patients from another cluster than to patients from their own cluster. Therefore, patients with a negative Silhouette may be informative of potential relationships between clusters and should, therefore, be considered as potential additional information about disease history and phenotypic complexity, such as one cluster that is the natural evolution of another. Hence, it is recommended that an analysis of patients with negative Silhouettes be included in the workflow to better assess whether they are a reflection of “bad” clustering or the key to better understanding the disease.
In the case where the optimal number of clusters is the minimum of the range of K (as in our example in Section 4), we recommend (if appropriate) that data scientists test lower values of K to challenge the obtained optimum. Similarly, if the optimum is obtained for K = 2, data scientists should test whether the dataset should be split into two clusters at all, using the Duda–Hart test, which tests the null hypothesis of homogeneity of the whole dataset. This can be done using the same pamk() function by setting the minimum of K to 1, or directly using the dudahart2() function (also in the FPC R package). In any case, if the primary objective is to provide fine-grained knowledge of the study population, it will still be possible to provide results with the optimal K that was initially obtained, keeping in mind that the levels of inter-cluster separability and intra-cluster homogeneity may not be much higher than those that would be obtained with a smaller number of clusters.
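A hedged illustration of the Duda–Hart check is given below; the 2-cluster PAM partition pam2 and the coordinate matrix M_opt are assumptions introduced for the example.

```r
# Test the null hypothesis of homogeneity (one cluster) against a 2-cluster split
pam2 <- cluster::pam(M_opt, k = 2)
dh   <- fpc::dudahart2(M_opt, pam2$clustering)
dh$p.value    # small p-value: evidence against a single homogeneous cluster
dh$cluster1   # TRUE: the data should not be split into two clusters
```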
The Qluster workflow can easily be automated for data scientists and organizations that need a routine way to cluster clinical data. Indeed, data scientists may create a main function for applying this workflow, for instance by setting the nature of the data (categorical/continuous/mixed), the volume (normal/large), and the parameters related to each called function. It is worth mentioning, however, that the quality of the input data and the structure of the groups to be found are factors that may not allow the present workflow to identify relevant results every time. In this case, data scientists can refer to the indications given above or, if necessary, consider an approach more adapted to their data.
In this article, we propose Qluster, a practical workflow for data scientists because of its genericity of application (e.g., usable on small or big data, on continuous, categorical, or mixed variables, and on databases of high dimensionality or not), while maintaining simplicity of implementation and use (e.g., need for few packages and algorithms, few parameters to tune, …) and the robustness and reliability of the methodology (e.g., evaluation of the stability of clusters, use of proven algorithms and robust packages, and management of noisy or multicollinear data). It, therefore, does not rely on any innovative approach per se but rather on a careful selection and combination of state-of-the-art clustering methods for practical purposes and robustness.
Data clustering is a difficult task for many data scientists, who are faced with a large literature and a large number of algorithms and implementations. We believe that Qluster can (1) improve the quality of analyses carried out as part of such studies (refer to Qluster's criteria for robustness and reliability), (2) promote and ease clustering studies (refer to Qluster's criteria for genericity and simplicity of use), and (3) increase the skills of some of the statisticians/data scientists involved (refer to the literature review provided and the general principles of Qluster). This workflow can also be used by more experienced data scientists for initial explorations of the data before designing more in-depth analyses.
Finally, this workflow can be fully operationalized, using either scripted tools or a Data Science platform supporting the use of R packages. As an illustrative example, we made an implementation of the Qluster workflow on the Dataiku platform to process a Kaggle dataset (refer to Appendix B). This implementation is usable on the free edition and is made available on request (email: contact@quinten-france.com).
Publicly available datasets were analyzed in this study. This data can be found here: https://www.kaggle.com/sulianova/cardiovascular-disease-dataset .
MR and PG contributed very significantly throughout this study. CE and J-DZ were the main contributors to both the writing and the methodology. All authors participated in writing the manuscript, contributed to the revision of the manuscript, and read and approved the submitted version.
The team would like to thank Dr. Martin Montmerle for initiating the scientific study that led to this research work. The team would also like to thank Dr. Martin Montmerle, Vincent Martenot, Valentin Masdeu, Pierre Tang, and Sam Ekhtiari for their careful review of the article. Finally, the team would like to thank Quinten for providing the opportunity to conduct this study.
https://www.kaggle.com/sulianova/cardiovascular-disease-dataset?select=cardio_train.csv
2 For further details, see Celebi ( 2014 ).
https://scikit-learn.org/stable/
https://cran.r-project.org/web/packages/cluster/index.html
https://cran.r-project.org/web/packages/fpc/index.html
https://cran.r-project.org/web/packages/FactoMineR/index.html
https://github.com/MaxHalford/prince
https://cran.r-project.org/web/packages/missMDA/index.html
https://cran.r-project.org/web/packages/factoextra/index.html
https://rdrr.io/github/QTCAT/qtcat/man/clarans.html
https://github.com/daveti/pycluster
https://pyclustering.github.io/
https://github.com/sandysa/Interpretable~Clustering
https://cran.r-project.org/web/packages/clValid/index.html
15 A discussion on other possible criteria and ways to integrate them into the workflow is presented in Section 5.1 (e.g., management of missing data and outliers).
16 e.g., CRAN Task View for cluster analysis: https://cran.r-project.org/web/views/Cluster.html .
17 e.g. of handy R packages: https://towardsdatascience.com/a-comprehensive-list-of-handy-r-packages-e85dad294b3d .
18 The notion of large data, as well as how to set the hyperparameters samples and sampsize in the CLARA algorithm, may vary according to the computing capabilities of the user's system. We recommend that users pre-test different scenarios to adapt these thresholds to their own settings. For guidance, this workflow applied to the case study in Section 4 (34,134 observations and nine variables) took 5 h and 30 min with 8 CPUs and 10 GB RAM.
19 It is worth noting that standardization of continuous data is recommended before using PCA, to not give excessive importance to the variables with the largest variances.
21 On Kaggle: www.kaggle.com/sulianova/cardiovascular-disease-dataset
23 Prise en charge des patients adultes atteints d'hypertension arterielle essentielle - Actualisation 2005. https://www.has-sante.fr/upload/docs/application/pdf/2011-09/hta_2005-recommandations.pdf .
24 One may prefer the Greenacre adjustment to the Benzecri correction, as it tends to be less optimistic. Please note that neither method is currently implemented in the proposed R packages; they must therefore be implemented by data scientists if desired. R code for the Benzecri correction is provided in Appendix C.
25 Similarly, the user may want to compare several methods for selecting the optimal number of clusters, including other direct methods (e.g., elbow method on the total within-cluster sum of square) or methods based on statistical testing (e.g., gap statistic).
CE, MR, and PG were employed by Quinten. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frai.2022.1055294/full#supplementary-material
Description of algorithms in selected libraries on some of the criteria defining genericity, ease of implementation and use, and robustness.
Example of an implementation of the Qluster workflow on the Dataiku platform.
R Code to implement Benzecri correction from MCA eigenvalues.
Eigenvalues and variances explained with and without Benzecri correction.
Example of the Qluster workflow adapted for handling missing values.