Number of downloads in 2021 is based on the cran_downloads() function in the cranlogs R library. NA, not applicable.
Table 1 shows that neither the cluster R package nor the sklearn-cluster module in Python allows the evaluation of cluster stability. As indicated in Section 2, one should code this step oneself in Python, or link (if possible) with other packages in R. Of the selected R packages, FPC was the most downloaded in 2021 and provides the most internal assessment metrics. Both Clue and FPC evaluate cluster stability by bootstrapping, but only FPC includes other methods such as noising, the complementary use of these two methods being recommended by Hennig (2008). clValid, on the other hand, proposes simpler methods, mainly used in biology, for evaluating the stability of clusters by removing the variables one by one.
The table in Appendix A is based on Table 1 in Fahad et al. (2014), which we adapted for our purpose. Overall, Appendix A shows that none of the algorithms included in the selected packages satisfies all the properties sought in terms of genericity, simplicity of use and implementation, and robustness. For example, CLARA and Mini-batch K-means both allow very good handling of large data, are adapted to some extent to high dimensionality, and rely on few parameters to be optimized. However, they only apply to continuous data and are not particularly suitable for noisy data. Moreover, unlike CLARA, Mini-batch K-means is only available in the Python scikit-learn module.
This first synthesis highlights the challenge to be overcome.
Based on the literature review (Section 2) and the preliminary work (Section 3.3), we propose the Qluster workflow (refer to Figure 3), a set of methods that together represent a good balance for data scientists to perform clustering on health data in a practical, efficient, robust, and simple way. It covers the cluster generation step (step 3) through (1) factor analysis, (2) data clustering, and (3) stability evaluation. The output of the factor analysis (PCA, MCA, or FAMD) is the matrix of the coordinates of the individuals on the factorial dimensions, i.e., a table of continuous variables, which can then be clustered with a PAM-type algorithm. For an in-depth discussion of the Qluster workflow, refer to Section 5.
To summarize, Qluster tries to generalize clustering tasks through a generic framework that is:
In addition, the Qluster workflow relies solely on four state-of-the-art R packages (FactoMineR, factoextra, FPC, and missMDA), allowing data scientists to quickly manage data of different natures and volumes and perform robust clustering:
Finally, the Qluster workflow is operationalizable and implementable from end to end (see Appendix B for a picture of the implementation in the Dataiku 20 platform, available upon request: contact@quinten-france.com).
This generic workflow, usable in most situations, can be described through the following pseudocode ( Algorithm 1 ):
The Qluster pseudo-code.
input: X: the input data
packages: FactoMineR, factoextra, FPC, missMDA
output: Q: a clustering of X and associated measures
1  if X is continuous only then
2      F = PCA(X), with F a FactoMineR object of class PCA
3  else if X is categorical only then
4      F = MCA(X), with F a FactoMineR object of class MCA
5  else  // mixed continuous and categorical
6      F = FAMD(X), with F a FactoMineR object of class FAMD
7  end
8  Define from F the matrix M of the coordinates of individuals on each dimension
9  if X is "large" then
10     Apply fviz_eig() on F to select C_opt, a sufficient number of components
11     Define the matrix M_opt as M restricted to the C_opt components
12     P = pamk(M_opt), with usepam = FALSE (CLARA), criterion = "asw", scaling = FALSE, and samples, sampsize, and krange set at convenience
13     Let K be the optimal number of clusters in P
14 else  // X is not "large"
15     if X is continuous only then
16         F_ncp = estim_ncpPCA(X), with a large [ncp.min, ncp.max] range
17     else if X is categorical only then
18         F_ncp = estim_ncpMCA(X), with a large [ncp.min, ncp.max] range
19     else  // mixed continuous and categorical
20         F_ncp = estim_ncpFAMD(X), with a large [ncp.min, ncp.max] range
21     end
22     Retrieve the optimal number of dimensions C_opt from F_ncp
23     Define the matrix M_opt as M restricted to the C_opt components
24     P = pamk(M_opt), with usepam = TRUE (PAM), criterion = "asw", scaling = FALSE, and krange set at convenience
25     Let K be the optimal number of clusters in P
26 end
27 S = clusterboot(M_opt), with bootmethod = c("boot", "noise"), krange = K, clustermethod = "pamkCBI", and the same parameters as in the previous step; loop on the noise_level parameter to test different noise levels
28 Return all useful results in Q
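To make the pseudocode concrete, below is a minimal R sketch of the categorical, "not large" branch (MCA followed by PAM and stability assessment). The data.frame X, the object names, and the parameter values (ncp.max, krange, B) are illustrative assumptions, not the settings used in the article.

```r
# Minimal sketch of the categorical, "not large" branch of Algorithm 1.
# X is assumed to be a data.frame of factors with no missing values.
library(FactoMineR)  # MCA()
library(missMDA)     # estim_ncpMCA()
library(fpc)         # pamk(), clusterboot(), pamkCBI

# Estimate the number of MCA dimensions to keep (illustrative range)
n_comp <- estim_ncpMCA(X, ncp.min = 1, ncp.max = 8)$ncp

# Factor analysis and matrix of individual coordinates
fa    <- MCA(X, ncp = n_comp, graph = FALSE)
M_opt <- fa$ind$coord            # n x n_comp matrix of continuous coordinates

# PAM with K selected by average silhouette width (ASW)
P <- pamk(M_opt, krange = 2:8, criterion = "asw", usepam = TRUE, scaling = FALSE)
K <- P$nc

# Cluster stability under bootstrap and noise perturbations
S <- clusterboot(M_opt, B = 50, bootmethod = c("boot", "noise"),
                 clustermethod = pamkCBI, krange = K,
                 criterion = "asw", usepam = TRUE, scaling = FALSE,
                 seed = 1234, count = FALSE)
S$bootmean                       # mean Jaccard similarity per cluster (bootstrap)
```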
The Cardiovascular Disease 21 dataset includes 70,000 patients with or without cardiovascular disease and 12 variables (five of them are continuous).
The following raw variables were used (raw variables' names are in italic):
The objective of this section is to present in detail the application of the Qluster workflow proposed in Section 3 on the following use case: to characterize the phenotypes of patients with cardiovascular disease (a subset of patients with cardio = Y). This represents 34,979 patients (about 50% of the whole dataset).
The following section details the application of the Qluster workflow to the cardiovascular dataset to help scientists use it for their own projects. Additional elements to the ones presented in Section 2 supporting the present methodology are also provided when relevant. We first present the preprocessing of the dataset, in which notably the few continuous variables are converted into qualitative data, before applying an MCA, which is a data-reduction technique for exploring the associations among multiple categorical variables (Greenacre, 1984 ; Warwick et al., 1989 ; Murtagh, 2005 ; Greenacre and Blasius, 2006 ; Nishisato, 2019 ). Then, given the large size of the database, the CLARA algorithm is applied and optimized. Finally, the clusters' stability is assessed and a brief interpretation of the clusters is provided.
First, the Body Mass Index (BMI) variable was created from both height and weight (Ortega et al., 2016). Then, outliers were detected by defining, for each quantitative variable, the thresholds above or below which values are most likely to be inaccurate. Acceptable values should be in the following ranges: 18 ≤ Age < 120, 10 ≤ BMI < 100, SBP ≤ 400, and DBP ≤ 200 [Ortega et al., 2016; Mayo Clinic 22; French HTA (HAS) recommendations 23]. For simplicity, patients with at least one outlier were removed from the analysis (sensitivity analyses could be performed). Quantitative variables were then discretized, both to create variables with clinical meaning and to enable the use of the MCA algorithm (refer to Table 2).
Description of quantitative feature engineering.
Variable | Description | Modalities | Accepted value range |
---|---|---|---|
Age | Age | age ≤ 55; age > 55 | [18, 120] |
BMI | BMI | Underweight: <18.5 kg/m²; Normal: between 18.5 and 24.9 kg/m²; Overweight: between 25.0 and 29.9 kg/m²; Obese: ≥30.0 kg/m² | [10, 100] |
high_sbp | High systolic blood pressure | 1: SBP > 130 mmHg; 0: SBP ≤ 130 mmHg | ≤ 400 |
high_dbp | High diastolic blood pressure | 1: DBP > 80 mmHg; 0: DBP ≤ 80 mmHg | ≤ 200 |
An additional binary hypertension variable was created from the high_sbp and high_dbp variables, used as a proxy for patients with hypertension [hypertension = 1 if high_sbp = 1 and high_dbp = 1; else hypertension = 0 (Williams et al., 2018)]. A hedged preprocessing sketch covering these steps is given below.
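The sketch below illustrates this preprocessing in base R. The column names (age, height, weight, ap_hi, ap_lo, cardio), the semicolon separator, and the assumption that the raw age is expressed in days all follow the Kaggle file and are assumptions; the thresholds come from Table 2.

```r
# Hedged preprocessing sketch; column names follow the Kaggle cardio_train.csv file
dat <- read.csv("cardio_train.csv", sep = ";")
dat <- dat[dat$cardio == 1, ]                 # keep patients with cardiovascular disease

dat$age_years <- dat$age / 365.25             # raw age is assumed to be given in days
dat$bmi <- dat$weight / (dat$height / 100)^2  # BMI from weight (kg) and height (cm)

# Remove patients with at least one out-of-range value (Table 2)
keep <- with(dat, age_years >= 18 & age_years < 120 &
                  bmi >= 10 & bmi < 100 &
                  ap_hi <= 400 & ap_lo <= 200)
dat <- dat[keep, ]

# Discretize quantitative variables and derive the hypertension proxy
dat$age_cat  <- ifelse(dat$age_years <= 55, "age<=55", "age>55")
dat$bmi_cat  <- cut(dat$bmi, breaks = c(-Inf, 18.5, 25, 30, Inf), right = FALSE,
                    labels = c("underweight", "normal", "overweight", "obese"))
dat$high_sbp <- as.integer(dat$ap_hi > 130)
dat$high_dbp <- as.integer(dat$ap_lo > 80)
dat$hypertension <- as.integer(dat$high_sbp == 1 & dat$high_dbp == 1)
```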
Finally, the variables used to discriminate the patients must be chosen according to their medical relevance to the context of the study. To this end, users should always consider the results they would obtain depending on whether a variable is included or not. In particular, they should ask themselves whether active discrimination of the clusters by a variable is sought: taking the two common variables age and race as an example, actively including them in the clustering step will tend to create groups of young vs. old and Caucasian vs. non-Caucasian patients. If such discrimination is not sought, these variables can instead be kept for a passive analysis of the generated clusters, to assess a posteriori any heterogeneity on these variables. In this use case, we removed the height, weight, and systolic and diastolic blood pressure features, as they are used to create the derived features listed above and are not useful on their own for clustering.
In the end, we obtained a database of 34,134 patients and 11 variables.
Clustering variables with low-prevalence modalities are known to be challenging in data analysis, especially for techniques that are very sensitive to the data and/or to anomalous cases [e.g., regression analysis and factor analysis (Fahrmeir et al., 2013)]. The most common remedies consist of either merging rare modalities into groups of higher frequency or discarding the modalities and/or variables concerned. Likewise, binary clustering variables with a low prevalence in the study population may be discarded from the analysis or grouped with other features when appropriate.
An arbitrary threshold of 10% was set to identify and exclude features with rare modalities from the clustering features. This is consistent with recommendations prior to using Multiple Correspondence Analysis, which over-weights rare modalities and multi-modality variables (Le Roux and Rouanet, 2010; Di Franco, 2016). When possible, modalities with a prevalence below 10% were grouped with others based on medical relevance, as illustrated in the sketch below. As a result, both the Smoking (8.3% with smoke = Yes) and Alcohol intake (5.2% with alco = Yes) variables were excluded from the clustering and only used for the a posteriori description of the clusters.
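As a minimal illustration of this filtering step (the data.frame X of categorical clustering variables is an assumption), the prevalence of the rarest modality of each variable can be checked as follows:

```r
# Flag categorical variables whose rarest modality falls below the 10% threshold
min_prev <- sapply(X, function(v) min(prop.table(table(v))))
names(min_prev)[min_prev < 0.10]   # candidates for grouping or removal
```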
Moreover, some modalities were aggregated for the two following variables:
In the end, the dataset used to perform the MCA contains a total of nine categorical variables (age, BMI, high_sbp, high_dbp, hypertension, gluc, gender, cholesterol, and physical activity).
As with other methods for factor analysis (e.g., PCA and CA), MCA was combined with cluster analysis to capture data heterogeneity, through clusters of observations in the population that show distinctive patterns (Buuren and Heiser, 1989 ; Hwang et al., 2006 ; Mitsuhiro and Hiroshi, 2015 ; van de Velden et al., 2017 ; Testa et al., 2021 ).
The number of MCA components to be used was decided using the standard scree plot, by identifying the "elbow" of the curve [a method widely used with PCA (Cattell, 1996)], while constraining eigenvalues to be strictly above a threshold of 0.11, the equivalent of Kaiser's rule in PCA (i.e., 1/C, with C the number of categorical variables).
Based on the scree plot (refer to Figure 4 ), three dimensions were chosen, the third marking a clear elbow in the curve (related eigenvalue: 0.12; related percentage of variance explained: 9.8%).
Moreover, for interpretation purposes, eigenvalues were corrected using the Benzecri correction 24 to account for the fact that the binary coding scheme used in MCA creates artificial factors and, therefore, underestimates the inertia explained (Greenacre, 1984). The top three components gather 99.9% of the inertia after correction with the Benzecri method (more details in Appendix D).
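Below is a hedged sketch of the dimension-selection and Benzecri-correction steps. The article's own code is given in Appendix C; here, fa denotes the MCA result and X the input table, both assumptions, and the benzecri() helper is one possible implementation of the standard correction formula.

```r
library(factoextra)

# Scree plot and Kaiser-like threshold (1/C) for the MCA dimensions
fviz_eig(fa, choice = "eigenvalue", addlabels = TRUE)
eig <- get_eigenvalue(fa)[, "eigenvalue"]
sum(eig > 1 / ncol(X))                    # number of dimensions above 1/C

# Benzecri correction: keep eigenvalues above 1/C and rescale them
# as ((C/(C-1)) * (lambda - 1/C))^2, then re-express them as % of inertia
benzecri <- function(eig, C) {
  keep <- eig > 1 / C
  adj  <- ((C / (C - 1)) * (eig[keep] - 1 / C))^2
  data.frame(dim = which(keep), eigenvalue = adj, pct = 100 * adj / sum(adj))
}
benzecri(eig, C = ncol(X))
```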
The CLARA algorithm was used through the pamk() function in the FPC R package (version 2.2-5), a reliable package for flexible procedures for clustering, and with the following main parameters:
Other parameters include not scaling the input data, so as not to modify the observation space obtained from the MCA.
The optimal ASW was obtained for a pool of three clusters (ASW: 0.42, refer to Figure 5 ).
Homogeneity and separability of the clusters were further studied by analyzing the Silhouette Widths of patients in the best sample used to generate the clusters' medoids, using the fviz_silhouette() function in the factoextra R package. As a reminder, the Silhouette Width characterizes both the cohesion of a cluster and its separation from the other clusters: a positive (respectively, negative) Silhouette Width for a patient indicates a correct (respectively, incorrect) assignment to its own cluster.
Figure 6 shows a high level of intra-cluster cohesion and inter-cluster separability, as only a few patients (in clusters 2 and 3) have negative Silhouettes. Clusterwise Silhouette Widths are also all positive (ASW of 0.42, 0.47, and 0.30 for clusters 1, 2, and 3, respectively).
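A hedged sketch of this CLARA-and-silhouette step is given below. M_opt is the matrix of retained MCA coordinates, and the samples, sampsize, and krange values are illustrative rather than those used in the study.

```r
library(fpc)
library(factoextra)

# CLARA through pamk(): usepam = FALSE, K chosen by average silhouette width
P <- pamk(M_opt, krange = 2:8, criterion = "asw", usepam = FALSE,
          scaling = FALSE, samples = 100, sampsize = 1000)
P$nc                                  # optimal number of clusters
P$crit                                # ASW obtained for each K in krange

# Silhouette inspection of the best CLARA sample
fviz_silhouette(P$pamobject)          # per-patient silhouette widths by cluster
P$pamobject$silinfo$clus.avg.widths   # clusterwise average silhouette widths
```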
In order to evaluate clustering robustness, the clustering was performed several times on randomly perturbed versions of the cohort. This generates, under perturbation, new versions of the original clusters and thus allows the evaluation of their stability: the more similar the perturbed clusters are to the original ones, the more stable the clustering. The data perturbation step was performed using two approaches that may provide complementary information based on the results in Hennig (2007): the bootstrap and noise methods.
Clusters are considered all the more stable as the Jaccard similarity statistics and the number of recovered clusters are high, and the number of dissolved clusters is low.
The results of the data perturbation step are shown below:
Regardless of the method used, the results appear very robust, which can most likely be explained by the large size of the database and the small number of clusters retained in the context of synthetic data. Cluster stability may be more variable in real cases.
It is worth noting that the clusterboot() function can also provide useful results and plots of clusters' stability (histogram of Jaccard similarity statistics by cluster, summary information for each cluster, etc.), but we did not provide them in this article since the obtained Jaccard similarity metrics were all around 100%.
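For reference, the stability outputs discussed above can be read directly from the clusterboot() result. In the sketch below, S is assumed to have been produced with bootmethod = c("boot", "noise") as in the pseudocode, and the component names follow the names of the chosen bootmethods.

```r
# Reading clusterboot() stability output
print(S)        # summary: Jaccard means, recovered and dissolved counts per method
S$bootmean      # mean Jaccard similarity per cluster (bootstrap)
S$noisemean     # mean Jaccard similarity per cluster (noise)
S$bootrecover   # number of times each cluster was recovered (bootstrap)
S$bootbrd       # number of times each cluster was dissolved (bootstrap)
```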
Descriptive statistics (proportions and lift values) were computed for variables included or not in the clustering step. Cluster 1 [n = 12,272 (36.0%)] groups patients who all have high values of diastolic and systolic blood pressure and, consequently, hypertension. These patients are slightly more likely than average to have well-above-normal cholesterol values (18.5% vs. 17.7%) and above-normal glucose values (20.0% vs. 18.3%). In contrast, patients from clusters 2 and 3 [n = 15,477 (45.3%) and n = 6,385 (18.7%), respectively] have normal values of diastolic and systolic blood pressure in 81% to 87% of cases, and none have hypertension. Compared with cluster 2, many more patients from cluster 3 have above-normal cholesterol values (26.6% vs. 8.5%) and above-normal glucose values (57.3% vs. 0.8%).
Patients from clusters 1 and 2 are overall younger than those in cluster 3 (age > 55: 54.7% and 51.8% vs. 65.9%). Patients from clusters 1 and 3 are overall more often obese than those in cluster 2 (41.4% and 44.5% vs. 21.8%).
To summarize, among patients with cardiovascular disease, cluster 1 gathers patients with hypertension, cluster 2 gathers healthier patients (although of about the same age as cluster 1), and cluster 3 gathers slightly older patients with high cholesterol and glucose levels (although no hypertension). Interestingly, the description of cluster 1 is consistent with a poorer lifestyle (lift values of 1.21 and 1.28 for Smoke and Alcohol, respectively), although these variables did not actively participate in the clustering. Refer to Table 3 for more details.
Prevalence and lift values of each modality and by cluster.
Modalities | Prevalence C1: n = 12,272 (36.0%) | Prevalence C2: n = 15,477 (45.3%) | Prevalence C3: n = 6,385 (18.7%) | Prevalence cohort: n = 34,134 (100%) | Lift C1 | Lift C2 | Lift C3 |
---|---|---|---|---|---|---|---|
Female | 62.1 | 64.2 | 71.4 | 64.8 | 0.96 | 0.99 | 1.10 |
Male | 37.9 | 35.8 | 28.6 | 35.2 | 1.08 | 1.02 | 0.81 |
Cholesterol normal | 60.6 | 90.7 | 16.5 | 66.0 | 0.92 | 1.37 | 0.25 |
Cholesterol above normal | 20.8 | 8.5 | 26.6 | 16.3 | 1.28 | 0.52 | 1.63 |
Cholesterol well-above normal | 18.5 | 0.7 | 56.9 | 17.7 | 1.05 | 0.04 | 3.22 |
Glucose normal | 80.0 | 99.2 | 42.7 | 81.7 | 0.98 | 1.21 | 0.52 |
Glucose above normal | 20.0 | 0.8 | 57.3 | 18.3 | 1.10 | 0.04 | 3.13 |
Physical activity | 81.0 | 76.8 | 79.6 | 78.8 | 1.03 | 0.97 | 1.01 |
Age ≤ 55 | 45.3 | 48.2 | 34.1 | 44.5 | 1.02 | 1.08 | 0.77 |
Age > 55 | 54.7 | 51.8 | 65.9 | 55.5 | 0.99 | 0.93 | 1.19 |
BMI obese | 41.4 | 21.8 | 44.5 | 33.1 | 1.25 | 0.66 | 1.34 |
BMI overweight | 36.5 | 37.3 | 36.1 | 36.8 | 0.99 | 1.01 | 0.98 |
BMI normal or underweight | 22.1 | 40.9 | 19.5 | 30.1 | 0.73 | 1.36 | 0.65 |
High Systolic blood pressure | 100.0 | 14.3 | 18.5 | 45.9 | 2.18 | 0.31 | 0.40 |
High Diastolic blood pressure | 100.0 | 12.9 | 16.8 | 44.9 | 2.23 | 0.29 | 0.37 |
Hypertension | 100.0 | 0 | 0 | 36.0 | 2.78 | 0 | 0 |
Smoke | 10.1 | 7.2 | 7.6 | 8.3 | 1.21 | 0.87 | 0.91 |
Alcohol | 6.6 | 3.8 | 5.7 | 5.2 | 1.28 | 0.74 | 1.09 |
C1, C2, and C3 stand for Cluster 1, Cluster 2, and Cluster 3, respectively. Color coding of lift values: light blue: ≤0.5; blue: ≤1.0; yellow: ≤1.5; green: >1.5. Lift is defined as the prevalence in the cluster divided by the prevalence in the cohort.
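As a hedged illustration, the prevalence and lift values of Table 3 can be recomputed as follows. The names flag (a 0/1 modality indicator) and clusters (the vector of cluster labels) are assumptions introduced for the example.

```r
# Prevalence (%) and lift of a binary modality by cluster
lift_table <- function(flag, clusters) {
  overall    <- mean(flag)                       # prevalence in the whole cohort
  by_cluster <- tapply(flag, clusters, mean)     # prevalence within each cluster
  data.frame(cluster    = names(by_cluster),
             prevalence = round(100 * by_cluster, 1),
             lift       = round(by_cluster / overall, 2))
}
# Example (assumed objects): lift_table(dat$smoke, P$pamobject$clustering)
```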
In this section, we will first discuss some limitations of the Qluster workflow and possible enhancements, then discuss choices of parameters and the practical use of this workflow.
As is often the case in data mining, one limitation concerns the size of the data. It is clear that for massive data, where the number of rows is very high, specific algorithms such as grid-based methods or canopy pre-clustering (McCallum et al., 2000) are needed for the algorithms to scale up.
More specifically, in such cases, factor analysis may be impossible to compute, as it requires matrix calculations and the inversion of a matrix of size n × p (n individuals, p binary variables). Note that in the case of categorical variables, one may prefer the Anglo-Saxon MCA method, which applies the CA algorithm to a Burt table (p × p) instead of the complete disjunctive table (n × p); this is more efficient in computing time and, thus, more appropriate for large data [it is also implemented in the MCA() function in FactoMineR (Greenacre, 2007)]. Equally, in the case of very large data, the CLARA algorithm may be too time-consuming to compute, as enough samples and enough observations per sample must still be maintained for representativeness. For all these reasons, we suggest simply analyzing a random sample of the original dataset, which is likely to be very representative of the latter while allowing the use of the Qluster workflow. Note also that PCA() and FAMD() are known to take more computation time than MCA(). We also suggest (when possible) converting the data into a single type (continuous only or categorical only) in a data preparation step: the upstream scaling of mixed data can be challenging, and the computation times of FAMD are longer. Alternatives may consist of not using the proposed workflow but algorithms that are fast on (very) large data, such as Mini-batch K-means used on continuous variables or one-hot-encoded categorical variables. However, in addition to relying solely on the Euclidean distance, these strategies may not allow for the prior use of factor analysis due to the size of the data, nor for the stability of clusters to be easily and properly assessed.
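The Burt-table and random-sampling options mentioned above can be sketched as follows; the ncp value and sample size are illustrative and X is an assumed categorical table.

```r
# MCA computed from the Burt table (p x p) rather than the indicator matrix (n x p)
fa_burt <- FactoMineR::MCA(X, ncp = 3, method = "Burt", graph = FALSE)

# Or: work on a representative random subsample of a very large dataset
set.seed(42)
X_small <- X[sample(nrow(X), 5e4), ]
```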
Conversely, when the number of columns is greater than the number of rows (p > n), the dimension reduction step via factor analysis methods makes even more sense for easily managing the high dimensionality of the data. However, in the most extreme cases where p >> n, standard factor methods may fail to yield consistent estimators of the loading vectors. In addition, the results may be difficult to interpret. In such situations, regularized methods may be a solution to improve the robustness, consistency, and interpretability of the results (e.g., penalized PCA, Lee et al., 2012). It is also recommended that a relevant subset of the variables be selected prior to the analysis (when possible).
Missing values management is not covered in this workflow, and it is, therefore, assumed that no missing values are present in the dataset. Indeed, both the factor methods (PCA, MCA, FAMD) and the proposed clustering methods (PAM, CLARA, …) require data without missing values. However, this workflow can easily be generalized to missing data by using the same missMDA package as for selecting the optimal number of dimensions in factor analysis, in order to first impute missing values using factorial methods. The latter are state-of-the-art methods for handling missing values [e.g., the imputePCA(), imputeMCA(), and imputeFAMD() functions for single imputation (Audigier et al., 2013)] and can, thus, easily be integrated and/or used in an automated workflow to handle missing data. In addition, this R package makes it possible to perform multiple imputation [MIPCA(), MIMCA(), and MIFAMD()] for assessing the uncertainty of imputed values and increasing confidence in the results (Josse and Husson, 2011). In this sense, the Qluster workflow can easily be modified to reach the state of the art in missing data management (refer to Appendix E for an example of the Qluster workflow adapted for handling missing values).
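A hedged sketch of this missing-data variant (detailed in Appendix E) could look like the following, assuming X is a categorical table containing NAs and using illustrative ncp and nboot values:

```r
library(missMDA)

# Impute missing values within the same factorial framework, then proceed as usual
n_comp <- estim_ncpMCA(X, ncp.min = 1, ncp.max = 8)$ncp
X_comp <- imputeMCA(X, ncp = n_comp)$completeObs   # completed data set
fa     <- FactoMineR::MCA(X_comp, ncp = n_comp, graph = FALSE)

# Optional: multiple imputation to assess the uncertainty of imputed values
mi <- MIMCA(X, ncp = n_comp, nboot = 20)
```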
Factor analysis allows the transformation of structured data of all kinds into continuous data, while dealing with large, collinear, noisy, and high-dimensional data. It also facilitates clustering by aggregating groups of homogeneous information within dimensions. Nevertheless, it cannot be guaranteed that the results will be “better” or “as good” with factor analysis in the clustering process. Similarly, the choice of factor analysis in this workflow comes with drawbacks that include the following items:
Alternatives may consist of data dimension reduction using feature selection methods, or manually, by grouping, transforming, and/or deleting variables based on clinical expertise.
In order to provide a simple, yet generic and robust workflow for the practical use of the same methodology in many applications, we have made a careful selection of both algorithms and software packages. In particular, the decision to use the PAM/CLARA algorithm is based on many aspects such as the fact that it is:
Yet, it is clear that algorithms other than the ones chosen could be routinely used, including those contained in the FPC R package to facilitate their integration within the workflow (e.g., DBSCAN and HAC). In particular, it is known that with non-flat geometry and/or uneven cluster sizes, DBSCAN is more appropriate than K-means and PAM. Equally, if the final goal is to obtain a hierarchy rather than a unique hard partition, the user may prefer an algorithm such as HAC, which can easily be used with the packages proposed in this workflow. However, the presence of additional parameters to tune or the lack of compatibility with massive data would make the workflow more complex. It is also important to note that this workflow is not intended to replace more in-depth work by data scientists to find what is optimal for a specific case study. More experienced data scientists can use the generic Qluster workflow for a first look at the data but are encouraged to adapt the general principles of this workflow to their case study (e.g., finding the most suitable algorithm). Such adaptations would be out of the scope of this workflow in the sense of the initial objectives: genericity of application while maintaining simplicity of implementation and reliability/robustness of the methodology.
Equally, the user may want to benchmark several clustering algorithms as suggested by Hennig (2020). The comparison of the methods' solutions can be based on information measures (e.g., entropy and mutual information), internal validity measures (e.g., silhouette, refer to Section 2.4), set-matching (i.e., mapping each cluster from the first clustering to the most similar cluster in the second clustering and computing recall, precision, or any other measure), and pair counting [including dedicated visualization tools, refer to Achtert et al. (2012)]. Some of these strategies are directly implemented in the clusterbenchstats() function from the FPC R package or in the clValid() function of the clValid R package. However, as our goal is to propose a simple-to-use workflow, this complexification (which would also greatly impact computing times and memory capacities) is left to the user's discretion. Moreover, multiplying the algorithms and the combinations of parameters forces one to rely more heavily on a purely statistical criterion (e.g., ASW) to select the "best" clustering of the data, although this may not reflect the best partitioning in a clinical sense. Indeed, the ASW remains a criterion characterizing the average separability over all the clusters, and its optimum may miss (the set of) results that are clinically relevant and/or useful for the desired objective. 25 If the data scientist wants to compare different algorithms, we recommend instead fully exploring the results of a well-chosen first algorithm before challenging it with others, in order to be less dependent on the ASW as the sole selection criterion. This article, thus, takes the opposite view of the auto-ML literature by first advocating a full investigation of a parsimonious workflow made of well-chosen algorithms, rather than directly covering a wide range of algorithmic possibilities. On this topic, readers may be interested in recent areas of research around meta-clustering (Caruana et al., 2006) and ensemble clustering methods (Greene et al., 2004; Alqurashi and Wang, 2019). The first aims to produce several partitioning results so that the user can select those that are most useful. The second is intended to combine the clusterings of several methods to propose a consensual result.
The bootstrapping and noise methods were chosen in the workflow because they are both available in the same function clusterboot() from the same package as for pamk() , and for their complementarity as recommended by Hennig ( 2007 ). Nevertheless, other methods may also be used as sensitivity analyses, including those proposed in the same FPC package. Furthermore, although this step allows for the clusters to be assessed, data scientists should keep in mind that stability is not the only important validity criterion—clusters obtained by very inflexible clustering methods may be stable but also not valid, as discussed in Hennig ( 2008 ). Finally, although several choices were made to try to manage outliers as best as possible, such as using a K -medoid algorithm and the Manhattan distance, the Qluster workflow does not fully address the issues related to outliers and extreme values. One solution may be to define threshold values to manually detect extreme values as a pre-processing step (as in the case study in Section 4), or to use more sophisticated statistical methods such as Yang et al. ( 2021 ).
Clusters' description is not covered in the Qluster workflow. However, many methods exist to interpret clusters (refer to Section 2.3). Data scientists can easily generalize Qluster to the description of clusters by using functions already present in the FPC package, such as plotcluster() and cluster.varstats(), following the methodologies recommended by Hennig (2004), so as not to make the workflow too complex.
Although general, the Qluster workflow does not cover all types of data and it is clear that for medical imaging data, omics data, or data in the form of signals, dedicated approaches must be considered. Nevertheless, most tabular data can be processed using the Qluster workflow. In this respect, although the Qluster workflow was specifically designed in the context of healthcare data analysis, it can easily be applied in other fields.
Cluster stability assessment could be considered as a criterion to be optimized, by iterating on this step in order to make this property an integral part of the clustering process itself. For example, stability measures could be used to select the optimal number of clusters, assuming that the clustering results are more stable with the correct number of clusters (Fränti and Rezaei, 2020 ).
However, attention should be paid to the fact that the bootstrap and noise methods are more computationally expensive than simple methods such as deleting the variables one by one (methods used on biological measurements and proposed in the clValid R package). Also, optimizing the clustering on cluster stability may not be straightforward if the two proposed methods do not give similar results. For example, compared with the noise method, the bootstrap method is more likely to produce stable results as the size of the dataset increases in the case of PAM, and as the percentage of sample representativeness increases in the case of CLARA.
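If one nevertheless wants to explore stability-based selection of K, a hedged (and computationally heavy) sketch could look as follows; M_opt, the K grid, and B are illustrative assumptions.

```r
# Compare candidate numbers of clusters by their mean bootstrap Jaccard stability
stab <- sapply(2:6, function(k) {
  s <- fpc::clusterboot(M_opt, B = 20, bootmethod = "boot",
                        clustermethod = fpc::pamkCBI, krange = k,
                        criterion = "asw", usepam = FALSE, scaling = FALSE,
                        seed = 123, count = FALSE)
  mean(s$bootmean)
})
names(stab) <- 2:6
stab    # higher values suggest more stable partitions
```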
The question of the ultimate relevance of the clusters is not addressed in this workflow. It should be noted that the absence of results may be a result in itself, as it may characterize a population that cannot be described in terms of several homogeneous subgroups (either because such subgroups do not exist or because the variables used do not allow us to find them). Nevertheless, it is clear that, as in the Data Mining process, one can consider looping back on this workflow by changing certain parameters if the results are not satisfactory or if an important criterion of the clustering was not taken into account at the beginning (e.g., the maximum number of clusters). More generally, the data scientist is encouraged to keep in mind that the final objective of clustering is often the clinical relevance and usefulness of the results generated. In this sense, and as mentioned in Section 5.1, it is not forbidden to relax a purely statistical criterion such as the ASW (whose optimum may miss some relevant subgroups, as it is an indicator of overall separability) to better represent the diversity of the population studied, or to favor the generation of hypotheses in the case where the statistical optimum only gives broad results that are not specific enough for the initial objective.
In the same vein, negative silhouette values are viewed too pejoratively in the cluster validity analysis (interpreted as clustering failure). In fact, negative silhouettes characterize patients who, on average, are closer to patients from another cluster than to patients from their own cluster. Therefore, patients with a negative Silhouette may be informative of potential relationships between clusters and should, therefore, be considered as potential additional information about disease history and phenotypic complexity, such as one cluster that is the natural evolution of another. Hence, it is recommended that an analysis of patients with negative Silhouettes be included in the workflow to better assess whether they are a reflection of “bad” clustering or the key to better understanding the disease.
In the case where the optimal number of clusters is the minimum of the range of K (as in our example in Section 4), we recommend (if appropriate) that data scientists test lower values of K to challenge the obtained optimum. Similarly, if the optimum is obtained for K = 2, data scientists should test whether the dataset should be split into two clusters at all, using the Duda–Hart test, which tests the null hypothesis of homogeneity of the whole dataset. This can be done using the same pamk() function by setting the minimum of K to 1, or directly using the dudahart2() function (also in the FPC R package). In any case, if the primary objective is to provide fine-grained knowledge of the study population, it will still be possible to provide results with the optimal K that was initially obtained, keeping in mind that the levels of inter-cluster separability and intra-cluster homogeneity may not be much higher than those that would be obtained with a smaller number of clusters.
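A hedged illustration of the Duda–Hart check is given below; the 2-cluster PAM partition pam2 and the coordinate matrix M_opt are assumptions introduced for the example.

```r
# Test the null hypothesis of homogeneity (one cluster) against a 2-cluster split
pam2 <- cluster::pam(M_opt, k = 2)
dh   <- fpc::dudahart2(M_opt, pam2$clustering)
dh$p.value    # small p-value: evidence against a single homogeneous cluster
dh$cluster1   # TRUE: the data should not be split into two clusters
```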
The Qluster workflow can easily be automated for data scientists and organizations that need a routine way to cluster clinical data. Indeed, data scientists may create a main function for applying this workflow, for instance by setting the nature of the data (categorical/continuous/mixed), the volume (normal/large), and the parameters related to each called function. It is worth mentioning, however, that the quality of the input data and the structure of the groups to be found are factors that may not allow the present workflow to identify relevant results every time. In this case, data scientists can refer to the indications given above or, if necessary, consider an approach more adapted to their data.
In this article, we propose Qluster, a practical workflow for data scientists because of its genericity of application (e.g., usable on small or big data, on continuous, categorical, or mixed variables, and on databases of high dimensionality or not), while maintaining simplicity of implementation and use (e.g., need for few packages and algorithms, few parameters to tune, …) and the robustness and reliability of the methodology (e.g., evaluation of the stability of clusters, use of proven algorithms and robust packages, and management of noisy or multicollinear data). It, therefore, does not rely on any innovative approach per se but rather on a careful selection and combination of state-of-the-art clustering methods for practical purposes and robustness.
Data clustering is a difficult task for many data scientists, who are faced with a large literature and a large number of algorithms and implementations. We believe that Qluster can (1) improve the quality of analyses carried out as part of such studies (refer to Qluster's criteria for robustness and reliability), (2) promote and ease clustering studies (refer to Qluster's criteria for genericity and simplicity of use), and (3) increase the skills of some of the statisticians/data scientists involved (refer to the literature review provided and the general principles of Qluster). This workflow can also be used by more experienced data scientists for initial explorations of the data before designing more in-depth analyses.
Finally, this workflow can be fully operationalized, using either scripted tools or a Data Science platform supporting the use of R packages. As an illustrative example, we made an implementation of the Qluster workflow on the Dataiku platform to process a Kaggle dataset (refer to Appendix B). This implementation is usable on the free edition and is made available on request (email: contact@quinten-france.com).
Publicly available datasets were analyzed in this study. This data can be found here: https://www.kaggle.com/sulianova/cardiovascular-disease-dataset .
MR and PG contributed very significantly throughout this study. CE and J-DZ were the main contributors to both the writing and the methodology. All authors participated in writing the manuscript, contributed to the revision of the manuscript, and read and approved the submitted version.
The team would like to thank Dr. Martin Montmerle for initiating the scientific study that led to this research work. The team would also like to thank Dr. Martin Montmerle, Vincent Martenot, Valentin Masdeu, Pierre Tang, and Sam Ekhtiari for their careful review of the article. Finally, the team would like to thank Quinten for providing the opportunity to conduct this study.
https://www.kaggle.com/sulianova/cardiovascular-disease-dataset?select=cardio_train.csv
2 For further details, see Celebi ( 2014 ).
https://scikit-learn.org/stable/
https://cran.r-project.org/web/packages/cluster/index.html
https://cran.r-project.org/web/packages/fpc/index.html
https://cran.r-project.org/web/packages/FactoMineR/index.html
https://github.com/MaxHalford/prince
https://cran.r-project.org/web/packages/missMDA/index.html
https://cran.r-project.org/web/packages/factoextra/index.html
https://rdrr.io/github/QTCAT/qtcat/man/clarans.html
https://github.com/daveti/pycluster
https://pyclustering.github.io/
https://github.com/sandysa/Interpretable~Clustering
https://cran.r-project.org/web/packages/clValid/index.html
15 A discussion on other possible criteria and ways to integrate them into the workflow is presented in Section 5.1 (e.g., management of missing data and outliers).
16 e.g., CRAN Task View for cluster analysis: https://cran.r-project.org/web/views/Cluster.html .
17 e.g. of handy R packages: https://towardsdatascience.com/a-comprehensive-list-of-handy-r-packages-e85dad294b3d .
18 The notion of large data, as well as how to set the hyperparameters samples and sampsize in the CLARA algorithm, may vary according to the computing capabilities of the user's system. We recommend that users pre-test different scenarios to adapt these thresholds to their own settings. For guidance, this workflow applied to the case study in Section 4 (34,134 observations and nine variables) took 5 h and 30 min with 8 CPUs and 10 GB RAM.
19 It is worth noting that standardization of continuous data is recommended before using PCA, to not give excessive importance to the variables with the largest variances.
21 On Kaggle: www.kaggle.com/sulianova/cardiovascular-disease-dataset
23 Prise en charge des patients adultes atteints d'hypertension arterielle essentielle - Actualisation 2005. https://www.has-sante.fr/upload/docs/application/pdf/2011-09/hta_2005-recommandations.pdf .
24 One may prefer the Greenacre adjustment to the Benzecri correction, as it tends to be less optimistic. Please note that neither method is currently implemented in the proposed R packages; they must therefore be implemented by data scientists if desired. R code for the Benzecri correction is provided in Appendix C.
25 Similarly, the user may want to compare several methods for selecting the optimal number of clusters, including other direct methods (e.g., elbow method on the total within-cluster sum of square) or methods based on statistical testing (e.g., gap statistic).
CE, MR, and PG were employed by Quinten. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frai.2022.1055294/full#supplementary-material
Description of algorithms in selected libraries on some of the criteria defining genericity, ease of implementation and use, and robustness.
Example of an implementation of the Qluster workflow on the Dataiku platform.
R Code to implement Benzecri correction from MCA eigenvalues.
Eigenvalues and variances explained with and without Benzecri correction.
Example of the Qluster workflow adapted for handling missing values.