Affiliation Department of Neurology, Massachusetts General Hospital, Boston, Massachusetts, United States of America

Affiliations Department of Emergency Medicine, Massachusetts General Hospital, Boston, Massachusetts, United States of America, Harvard Medical School, Boston, Massachusetts, United States of America

Affiliations Department of Neurology, Massachusetts General Hospital, Boston, Massachusetts, United States of America, Harvard Medical School, Boston, Massachusetts, United States of America, Mass General Brigham, Boston, Massachusetts, United States of America

Affiliation Department of Neurology, Massachusetts General Hospital, Boston, Massachusetts, United States of America

Abstract

Large collaborative research networks provide opportunities to jointly analyze multicenter electronic health record (EHR) data, which can improve the sample size, diversity of the study population, and generalizability of the results. However, analyzing multicenter EHR data poses challenges, including privacy protection, large-scale computational resource requirements, heterogeneity across sites, and correlated observations. In this paper, we propose a federated algorithm for generalized linear mixed models (Fed-GLMM), which can flexibly model multicenter longitudinal or correlated data while accounting for site-level heterogeneity. Fed-GLMM can be applied to both federated and centralized research networks to enable privacy-preserving data integration and improve computational efficiency. By communicating only a limited amount of summary statistics, Fed-GLMM achieves results nearly identical to those of the gold-standard method, in which the GLMM is fitted directly to the pooled dataset. We demonstrate the performance of Fed-GLMM in numerical experiments and in an application to longitudinal EHR data from multiple healthcare facilities.

Fig 4. Adjusted odds ratios of virtual vs. in-person visit by patient and visit characteristics.

The forest plot shows the adjusted odds ratios obtained through Fed-GLMM for all facilities (federated setting, demonstrating privacy preservation) and for a single facility (centralized setting, demonstrating the computational improvement). The points and bars represent the point estimates and 95% confidence intervals, respectively. Abbreviations: OR, odds ratio; NH, non-Hispanic; LEP, limited English proficiency.

https://doi.org/10.1371/journal.pone.0280192.g004
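The quantities plotted in a forest plot such as Fig 4 can be derived from a fitted log-odds coefficient and its standard error. The sketch below assumes a Wald-type interval, which the text does not state explicitly. The function name and the numeric inputs are hypothetical and are not estimates from this study.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.959964):
    """Convert a log-odds coefficient and its standard error into an
    odds ratio with a Wald-type confidence interval (95% by default)."""
    point = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return point, lower, upper

# Hypothetical coefficient, for illustration only (not a result from the paper).
or_point, or_low, or_high = odds_ratio_with_ci(beta=-0.25, se=0.05)
print(f"OR = {or_point:.2f} (95% CI {or_low:.2f} to {or_high:.2f})")
```

An odds ratio below 1 whose interval excludes 1 would correspond to lower odds of a virtual visit for that characteristic.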

To address the computational concern, analyses in both federated and centralized settings involve splitting data at facilities with a large number of records. We evaluated the performance of the clustering-based splitting method, which is designed to address the issue of crossed patient- and physician-level random effects as described in Methods. As demonstrated in S2 Fig, for all coefficients of interest, clustering-based splitting resulted in negligible bias compared with the random splitting strategy. This also demonstrates the good performance of Fed-GLMM on real-world data, provided that an appropriate splitting strategy is applied.

In both federated and centralized settings, the characteristics associated with lower odds of conducting a virtual visit (i.e., higher odds of conducting an in-person visit) were increasing age; Hispanic, non-Hispanic Black, non-Hispanic Asian, or other non-Hispanic race/ethnicity relative to non-Hispanic White; limited English proficiency; and an inactive patient portal. Compared with primary care, behavioral health and specialty visits were more likely to be conducted virtually. Visits of female patients, as well as visits billed to Medicaid, were also more likely to be conducted virtually. Except for the behavioral health specialty, the parameter estimates for the federated and centralized settings were very close to each other, suggesting that the site contributing the largest proportion of data may dominate the results based on all sites.

Discussion

In light of the increasing need for multicenter collaborative research utilizing EHR data and the potential challenges in data sharing and large-scale computation, we proposed the Fed-GLMM algorithm to model correlated EHR data in a way that allows privacy-preserving integration of datasets from multiple healthcare systems. Our method enables fitting a GLMM with much less computation time and memory cost in both federated and centralized networks, and thus can also be applied to EHR data from a single site. Our simulation study demonstrated that Fed-GLMM achieves results nearly identical to those of the pooled analysis, with reduced computation time, over a broad spectrum of settings. Our real-world data analysis demonstrated the feasibility of applying Fed-GLMM to single-site and multicenter EHR-based studies to fit a model with millions of observations.

In contrast to a meta-analysis, our method is not based on constructing a weighted average of local estimators obtained from each site. When studying rare events, the local estimators would be biased due to the limited number of cases, which can lead to biases in the meta-analysis. Our method is more robust to such biases because we directly aggregate the first- and second-order derivatives, which are not sensitive to the rareness of the event. The contribution of each study in Fed-GLMM is captured by the shared derivatives, which cannot be summarized as a single weight as in a univariate meta-analysis. Nevertheless, compared with existing work on federated and distributed algorithms, the most important contribution of Fed-GLMM is that it allows the modeling of longitudinal and correlated data within each institution and can accommodate all GLMM specifications, including crossed or nested random effects. However, when using Fed-GLMM to improve computational efficiency by splitting large-scale data in a centralized setting, one needs to be mindful of the data-splitting strategy. For nested random effects, splitting the data by the highest-level factors will allow Fed-GLMM estimates to converge to the gold-standard pooled analysis results. For crossed random effects, we recommend splitting the data such that the correlated observations are allocated to the same subsets as much as possible, as shown in S2 Fig. This makes the Fed-GLMM estimates close (though not identical) to the pooled analysis estimates.
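The recommendation for nested random effects can be illustrated with a minimal sketch that assigns whole top-level groups to subsets, so that no group is divided across subsets. The function name, the record layout, and the greedy balancing heuristic are assumptions made for illustration; this is not the clustering-based method for crossed random effects described in Methods.

```python
from collections import defaultdict

def split_by_group(records, group_key, n_subsets):
    """Assign whole groups to subsets so that no group is divided.

    Groups are placed largest-first into the currently smallest subset,
    a simple greedy heuristic (an assumption of this sketch) that keeps
    subset sizes roughly balanced.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    subsets = [[] for _ in range(n_subsets)]
    for _, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        smallest = min(subsets, key=len)
        smallest.extend(members)
    return subsets

# Hypothetical visit records nested within a top-level factor (physician).
visits = [{"physician_id": p, "visit_id": v}
          for p, n in [("A", 5), ("B", 3), ("C", 3), ("D", 1)]
          for v in range(n)]
parts = split_by_group(visits, "physician_id", n_subsets=2)
for i, part in enumerate(parts):
    print(i, len(part), sorted({r["physician_id"] for r in part}))
```

Because every observation sharing a top-level factor lands in the same subset, each subset can be treated like a separate site without breaking the within-group correlation structure.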

As demonstrated in our simulation and real-world data analyses, iterative communication between the central analytics site and the individual sites is not required. In most cases, a single round of parameter updating yields negligible bias. Thus, in federated research networks that rely on manual data transfer, our method with one round of iteration is preferred to reduce the communication cost. However, when multiple rounds of communication are feasible, our method will eventually converge to the pooled analysis as the number of iterations increases. When studying rare conditions, extra iterations help correct the bias, so a balance needs to be reached between the communication cost and estimation accuracy. In addition, the sharing of first- and second-order derivatives is common among federated algorithms but may still entail a risk of identifiability for small datasets with rare events. Nevertheless, this risk is limited in that the transmission of summary-level statistics is typically regulated and protected by the data-sharing protocols of collaborative research networks. Methods such as differential privacy and data encryption techniques can be combined with Fed-GLMM to improve privacy protection. While we have demonstrated Fed-GLMM for analyzing EHR data to assess virtual care utilization, our method can be used to address correlated structures in many types of real-world datasets. Future steps include pairing Fed-GLMM with high-dimensional data analysis methods to model genetic datasets, as well as combining Fed-GLMM with vertical data integration techniques to integrate granular clinical and health service information from longitudinal administrative claims and survey databases [35].
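The trade-off between one round and several rounds of communication can be sketched with a deliberately simplified model. The example below assumes a one-parameter logistic regression without random effects (not a GLMM), a Newton-type update, and an initial value taken from a single site; all names and data are hypothetical. Each site shares only the first and second derivatives of its local log-likelihood, and the central site sums them before updating the parameter.

```python
import math

def local_derivatives(beta, xs, ys):
    """First and second derivatives of one site's logistic log-likelihood."""
    grad, hess = 0.0, 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-beta * x))
        grad += x * (y - p)
        hess -= x * x * p * (1.0 - p)
    return grad, hess

def federated_update(beta, sites):
    """One communication round: sum site derivatives, take a Newton step."""
    total_grad = total_hess = 0.0
    for xs, ys in sites:
        g, h = local_derivatives(beta, xs, ys)
        total_grad += g
        total_hess += h
    return beta - total_grad / total_hess

def fit(sites, beta0, rounds):
    """Repeat the federated update for a fixed number of rounds."""
    beta = beta0
    for _ in range(rounds):
        beta = federated_update(beta, sites)
    return beta

# Hypothetical data for two sites (covariate values and binary outcomes).
site_a = ([-2, -1, -0.5, 0.5, 1, 2, 1.5, -1.5], [0, 0, 1, 0, 1, 1, 1, 0])
site_b = ([-1, 1, 0.3, -0.3, 2, -2], [0, 1, 0, 1, 1, 0])
sites = [site_a, site_b]

# Assumed initial value: the estimate fitted at site A alone.
beta_init = fit([site_a], 0.0, 25)
beta_one_round = fit(sites, beta_init, 1)
beta_many_rounds = fit(sites, beta_init, 25)

# Reference: the same model fitted to the pooled data.
pooled_data = (site_a[0] + site_b[0], site_a[1] + site_b[1])
beta_pooled = fit([pooled_data], 0.0, 25)
print(beta_init, beta_one_round, beta_many_rounds, beta_pooled)
```

Because the summed site-level derivatives equal the derivatives of the pooled log-likelihood, repeated rounds reproduce the pooled fit in this toy setting, while a single round already moves the initial estimate most of the way there.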

Supporting information

S2 Fig. Comparison of Fed-GLMM accuracy relative to pooled analysis between different data splitting strategies.

Using randomly extracted small sub-datasets (n = 100,000) from the EHR of a single facility, we compared the accuracy of Fed-GLMM under different data splitting strategies. Two splitting strategies were compared: random splitting and our proposed clustering-based splitting introduced in S1 Table. Each sub-dataset was split into 5 subsets by both strategies. The absolute relative bias was calculated as the absolute difference, in percent, between the Fed-GLMM estimates and the corresponding estimates given by the pooled analysis. A total of 50 randomly extracted sub-datasets were used in the evaluation. For all coefficients of interest, clustering-based splitting resulted in negligible bias compared with the random splitting strategy. Abbreviations: NH, non-Hispanic; LEP, limited English proficiency.

https://doi.org/10.1371/journal.pone.0280192.s002

(JPG)
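The accuracy metric described for S2 Fig can be written as a short function. This is a sketch under the assumption that the difference is normalized by the magnitude of the pooled-analysis estimate. The coefficient names and values are hypothetical and are not results from the study.

```python
def absolute_relative_bias(fed_estimates, pooled_estimates):
    """Absolute relative bias (in percent) of each coefficient, taking the
    pooled-analysis estimate as the reference value."""
    return {
        name: abs(fed_estimates[name] - pooled) / abs(pooled) * 100.0
        for name, pooled in pooled_estimates.items()
    }

# Hypothetical coefficient estimates, for illustration only.
pooled = {"age": -0.20, "lep": -0.50}
fed = {"age": -0.21, "lep": -0.49}
print(absolute_relative_bias(fed, pooled))
```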

S4 Table. Local variance parameter estimates at each facility.

We present variance-component parameter estimates obtained separately for each site to evaluate our assumption of equal variance-component parameters across sites. For sites with a relatively small number of records, the estimates were obtained by fitting the GLMM locally. For sites with a large number of records, the estimates were obtained by a meta-analysis aggregating the GLMM estimates from split data batches. The variance-component parameter estimates for both patient- and physician-level random effects are similar across the sites, supporting our model specification with homogeneous variance-component parameters.

https://doi.org/10.1371/journal.pone.0280192.s006

(DOCX)
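The text does not specify how the batch-level estimates were combined in the meta-analysis. One common choice, shown here purely as an assumption, is fixed-effect inverse-variance weighting. The function name and the input values are hypothetical.

```python
def inverse_variance_pool(estimates, standard_errors):
    """Combine batch-level estimates with fixed-effect inverse-variance
    weights; returns the pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = (1.0 / total_weight) ** 0.5
    return pooled, pooled_se

# Hypothetical batch estimates of one variance-component parameter.
estimate, se = inverse_variance_pool([0.90, 1.10, 1.00], [0.10, 0.10, 0.20])
print(estimate, se)
```

Batches with smaller standard errors receive more weight, so a less precise batch has less influence on the combined estimate.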

References

1. Botsis T, Hartvigsen G, Chen F, Weng C. Secondary use of EHR: Data quality issues and informatics opportunities. Summit on Translational Bioinformatics 2010;2010:1. pmid:21347133