Numbers computed from N = 1,000,000. $\pi_{\mathrm{mis}}$ is the probability of a missing value in each variable with missingness. The overall missing rate is $\pi_{\mathrm{mis}}/2$. β, the regression coefficient; ϕ, the factor correlation; λ, the factor loading. The population FMIs of the four factor loadings corresponding to variables containing missingness are all within 0.01 of each other, and therefore only the FMIs of the λ for $X_2$ are reported.
In Model 2 (the two-factor model), four factor loadings were associated with variables that contained missing data, namely the loadings of $X_2$ and $X_4$ on $F_1$, and the loadings of $Y_2$ and $Y_4$ on $F_2$. As these loadings were of the same size, and the missing rate was the same for all four variables, only the FMI of the loading of $X_2$ is reported here. Under MCAR, the FMIs of the factor loading (λ) were $\delta_{\mathrm{pop},\lambda}$ = 0.20, 0.40, and 0.60 for per-variable missing rates of $\pi_{\mathrm{mis}}$ = 0.2, 0.4, and 0.6, respectively. Under MAR-L, the FMIs were $\delta_{\mathrm{pop},\lambda}$ = 0.37, 0.60, and 0.71. Under MAR-NL, they were $\delta_{\mathrm{pop},\lambda}$ = 0.44, 0.71, and 0.83. Once again, in each condition the FMI increased as the missing rate increased, and the MAR-NL condition led to the highest FMIs, followed by MAR-L and then MCAR.
The pattern of differences between the missing data mechanisms, however, did not hold for the FMIs of the factor correlation (ϕ) in Model 2. The FMIs under per-variable missing rates of $\pi_{\mathrm{mis}}$ = 0.2, 0.4, and 0.6, respectively, were $\delta_{\mathrm{pop},\phi}$ = 0.13, 0.26, and 0.38 under MCAR, $\delta_{\mathrm{pop},\phi}$ = 0.14, 0.27, and 0.38 under MAR-L, and $\delta_{\mathrm{pop},\phi}$ = 0.14, 0.28, and 0.39 under MAR-NL. The FMIs of the factor correlation, unlike the FMIs of other parameters, did not differ notably across the missing data mechanisms. Even within the same model, parameters affected by the same missing rates and the same missing data mechanism could see drastically different amounts of information loss. For instance, under $\pi_{\mathrm{mis}}$ = 0.2 and MAR-NL, the FMI of ϕ was $\delta_{\mathrm{pop},\phi}$ = 0.14, but the FMI of λ in the same model was $\delta_{\mathrm{pop},\lambda}$ = 0.44. These results emphasize that information loss is not predictable from the missing rates and from whether the missing data mechanism is MCAR or MAR. Information loss can only be quantified by computing the FMI for a particular mechanism and parameter of interest.
For the regression coefficient β, the three sample FMI estimates were identical, because the regression model is saturated. We therefore report a single sample FMI estimate in Table 2. As the tables show, the sample FMI estimates were largely unbiased, with the long-run means of the estimate over the 1,000 replications falling within 0.05 of the population value in practically all conditions. Notable bias (i.e., >0.05) only arose when the missing data mechanism was MAR-NL, the sample size was small, and the missing rate was high, where the sample FMI could underestimate the population value by as much as 0.07 (near the top of Table 2 under the MAR-NL columns).
π mis MAR-L MAR-NL | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
0.2 | 50 | 0.01 | −0.01 | −0.02 | ||||||
0.4 | 50 | 0.00 | −0.02 | −0.05 | ||||||
0.6 | 50 | 0.00 | −0.02 | −0.07 | ||||||
0.2 | 100 | 0.00 | 0.05 | 0.19 | 0.00 | −0.01 | ||||
0.4 | 100 | 0.00 | −0.01 | −0.03 | ||||||
0.6 | 100 | 0.01 | −0.01 | −0.06 | ||||||
0.2 | 200 | 0.00 | 0.03 | 0.13 | 0.00 | |||||
0.4 | 200 | 0.00 | 0.05 | 0.19 | −0.01 | −0.02 | ||||
0.6 | 200 | 0.00 | −0.03 | |||||||
0.2 | 500 | 0.00 | 0.02 | 0.09 | 0.00 | 0.04 | 0.16 | 0.00 | 0.05 | 0.19 |
0.4 | 500 | 0.00 | 0.03 | 0.12 | 0.00 | −0.01 | ||||
0.6 | 500 | 0.00 | 0.04 | 0.14 | 0.00 | −0.01 |
$\pi_{\mathrm{mis}}$ is the probability of a missing value in each variable with missingness. The overall missing rate is $\pi_{\mathrm{mis}}/2$. ETI denotes the 95% ETI width, which is the distance between the 2.5th and 97.5th percentiles of the sampling distribution. Bias values with magnitudes >0.05, RMSEs >0.05, and ETI widths >0.20 are indicated in bold. The conditional formatting was applied prior to rounding.
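The three performance measures named in the table note can be sketched in a few lines of base R. This is a minimal illustration and not the authors' simulation code: the function name `fmi_performance` and the synthetic replications are assumptions, and the RMSE is taken in its usual sense as the root of the mean squared deviation from the population value.

```r
# Hypothetical helper: summarize sample FMI estimates across replications.
# est: numeric vector of sample FMI estimates (one per replication)
# pop: the population FMI for the same condition
fmi_performance <- function(est, pop) {
  stopifnot(is.numeric(est), length(pop) == 1)
  # 2.5th and 97.5th percentiles of the sampling distribution
  q <- quantile(est, probs = c(0.025, 0.975), names = FALSE)
  c(bias      = mean(est) - pop,              # raw bias
    rmse      = sqrt(mean((est - pop)^2)),    # root mean squared error
    eti_width = q[2] - q[1])                  # 95% ETI width
}

# Synthetic replications for illustration only (not results from the study)
set.seed(1)
fake_estimates <- rnorm(1000, mean = 0.20, sd = 0.05)
round(fmi_performance(fake_estimates, pop = 0.20), 2)
```

Because the 95% ETI width is based on percentiles rather than on a normal approximation, it remains meaningful for skewed sampling distributions.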
The sample FMI estimates were not particularly efficient at the smaller sample sizes typical of regression analyses, and would produce highly variable results from run to run. At N = 50 and 100, the RMSEs were around 0.10 or higher across nearly all missing rates and missing data mechanisms. At N = 100, the estimate was most precise under MCAR with $\pi_{\mathrm{mis}}$ = 0.2, with an RMSE of 0.05 and a 95% ETI width of 0.19. For the larger sample sizes, N = 200 and 500, the performance of the estimate was acceptable overall under MCAR, but not under MAR-L and MAR-NL. Under both MAR conditions, good performance was only achieved when N = 500 and $\pi_{\mathrm{mis}}$ = 0.2.
For a more intuitive illustration of bias and variability, the smoothed densities of the 1,000 replications in each condition are shown in Figures 1 – 3 . We can see that almost all sampling distributions centered around the population value, with the exception of the top right panels of Figure 3 , which correspond to the small sample size, high missing rate, nonlinear MAR conditions. These distributions, however, do not pack very tightly around the population value, except when the sample size is 500, and only when either the missing rate was low, or the missing mechanism was MCAR.
In the two-factor model, not all replications were usable. In some cases, the model would fail to converge. In other cases, when FMIs were requested, lavaan would sometimes produce an error, leading to NAs or negative values for the FMI estimates. These issues were exacerbated by higher missing rates, and were particularly pronounced for the Hessian-based estimate when the sample size was small. Under N = 100 and $\pi_{\mathrm{mis}}$ = 0.6 (overall missing rate 0.3), the Hessian-based estimate would encounter issues in more than 30% of the runs. In this regard, the best-performing estimate was the analytic estimate based on the unstructured model, which almost never produced negative values and had the lowest rate of NA occurrences. However, even for this estimate, the rate of failed or improper estimates was often close to or above 10% at N = 100. At N = 200, this estimate would encounter the issue at most 3% of the time, even when the per-variable missing rate was 0.6. See Supplementary Tables 3, 4 for the rate of failed or improper estimates for the factor correlations and factor loadings, respectively. The following results were obtained by excluding all occurrences of NAs and negative values when aggregating the sample FMI estimates.
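The exclusion rule described here (dropping NAs and negative values before aggregating) can be sketched as follows. The helper name `usable_fmi` and the example vector are hypothetical, and non-converged replications are assumed to be recorded as `NA`.

```r
# Hypothetical helper: separate usable FMI estimates from failed or improper ones.
usable_fmi <- function(fmi) {
  # NA marks a failed run; a negative value is an improper estimate
  improper <- is.na(fmi) | fmi < 0
  list(rate_excluded = mean(improper),   # share of replications dropped
       usable        = fmi[!improper])   # estimates kept for aggregation
}

# Made-up values for illustration only
example <- c(0.21, NA, -0.03, 0.18, 0.25)
out <- usable_fmi(example)
out$rate_excluded
out$usable
```

Reporting the exclusion rate alongside the aggregated results matters, because conditions with many dropped replications are summarized from a selected subset of samples.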
The sample FMIs of the factor correlation (ϕ) were largely unbiased under MCAR, or when the mechanism was MAR but the sample size was 200 or above (see Tables 3 – 5). Bias was also generally small at N = 100 when the missing data mechanism was MCAR, but notable bias could occur under MAR-L and MAR-NL. Among the three estimates, one showed the largest amount of bias overall, severely overestimating the FMI (0.17 above the true value on average) when the missing data mechanism was nonlinear MAR and the per-variable missing rate was $\pi_{\mathrm{mis}}$ = 0.6. A second also showed some notable bias at N = 100, but only up to 0.11 in the worst case. The third showed the least bias, with raw bias values close to or below 0.05 across all conditions.
π mis 95% ETI width | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
0.2 | 100 | 0.01 | 0.00 | 0.01 | ||||||
0.4 | 100 | 0.02 | 0.00 | 0.03 | ||||||
0.6 | 100 | −0.02 | 0.05 | |||||||
0.2 | 200 | 0.00 | 0.00 | 0.00 | 0.04 | 0.04 | 0.04 | 0.17 | 0.16 | 0.15 |
0.4 | 200 | −0.01 | 0.00 | 0.01 | ||||||
0.6 | 200 | −0.02 | −0.01 | 0.02 | ||||||
0.2 | 500 | 0.00 | 0.00 | 0.00 | 0.03 | 0.03 | 0.03 | 0.10 | 0.10 | 0.10 |
0.4 | 500 | 0.00 | 0.00 | 0.00 | 0.05 | 0.05 | 0.05 | 0.20 | 0.19 | 0.18 |
0.6 | 500 | −0.01 | −0.01 | 0.01 | ||||||
0.2 | 1000 | 0.00 | 0.00 | 0.00 | 0.02 | 0.02 | 0.02 | 0.07 | 0.07 | 0.07 |
0.4 | 1000 | 0.00 | 0.00 | 0.00 | 0.03 | 0.03 | 0.03 | 0.13 | 0.13 | 0.13 |
0.6 | 1000 | −0.01 | 0.00 | 0.01 | 0.19 | 0.19 | 0.20 |
$\pi_{\mathrm{mis}}$ is the probability of a missing value in each variable with missingness. The overall missing rate is $\pi_{\mathrm{mis}}/2$. Distribution width is the distance between the 2.5th and 97.5th percentiles of the sampling distribution. Bias values with magnitudes >0.05, RMSEs >0.05, and ETI widths >0.20 are indicated in bold. The conditional formatting was applied prior to rounding.
π mis 95% ETI width | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
0.2 | 100 | 0.04 | 0.02 | 0.02 | ||||||
0.4 | 100 | |||||||||
0.6 | 100 | |||||||||
0.2 | 200 | 0.01 | 0.00 | 0.00 | 0.05 | 0.04 | 0.19 | 0.18 | ||
0.4 | 200 | 0.02 | 0.01 | 0.03 | ||||||
0.6 | 200 | 0.04 | 0.02 | |||||||
0.2 | 500 | 0.00 | 0.00 | 0.00 | 0.03 | 0.03 | 0.03 | 0.12 | 0.12 | 0.11 |
0.4 | 500 | 0.01 | 0.01 | 0.01 | ||||||
0.6 | 500 | 0.00 | 0.00 | 0.02 | ||||||
0.2 | 1000 | 0.00 | 0.00 | 0.00 | 0.02 | 0.02 | 0.02 | 0.08 | 0.08 | 0.07 |
0.4 | 1000 | 0.00 | 0.00 | 0.01 | 0.05 | 0.05 | 0.04 | 0.18 | 0.18 | 0.15 |
0.6 | 1000 | 0.00 | 0.00 | 0.01 |
$\pi_{\mathrm{mis}}$ is the probability of a missing value in each variable with missingness. The overall missing rate is $\pi_{\mathrm{mis}}/2$. Distribution width is the distance between the 2.5th and 97.5th percentiles of the sampling distribution. Bias values with magnitudes >0.05, RMSEs >0.05, and ETI widths >0.20 are indicated in bold. The conditional formatting was applied prior to rounding.
π mis 95% ETI width | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
0.2 | 100 | 0.03 | 0.01 | 0.02 | ||||||
0.4 | 100 | 0.02 | 0.04 | |||||||
0.6 | 100 | |||||||||
0.2 | 200 | 0.01 | 0.00 | 0.00 | 0.05 | 0.05 | 0.19 | 0.18 | ||
0.4 | 200 | 0.01 | 0.01 | 0.01 | ||||||
0.6 | 200 | −0.01 | 0.00 | 0.04 | ||||||
0.2 | 500 | 0.00 | 0.00 | 0.00 | 0.03 | 0.03 | 0.03 | 0.12 | 0.11 | 0.11 |
0.4 | 500 | 0.00 | 0.00 | 0.00 | 0.20 | |||||
0.6 | 500 | 0.00 | 0.00 | 0.01 | ||||||
0.2 | 1000 | 0.00 | 0.00 | 0.00 | 0.02 | 0.02 | 0.02 | 0.08 | 0.08 | 0.07 |
0.4 | 1000 | 0.00 | 0.00 | 0.00 | 0.04 | 0.04 | 0.04 | 0.14 | 0.14 | 0.14 |
0.6 | 1000 | 0.00 | 0.00 | 0.01 | 0.20 | 0.20 |
$\pi_{\mathrm{mis}}$ is the probability of a missing value in each variable with missingness. The overall missing rate is $\pi_{\mathrm{mis}}/2$. Distribution width is the distance between the 2.5th and 97.5th percentiles of the sampling distribution. Bias values with magnitudes >0.05, RMSEs >0.05, and ETI widths >0.20 are indicated in bold. The conditional formatting was applied prior to rounding.
Although the population FMIs of the factor correlations were quite low, they proved difficult to estimate precisely. As illustrated in Figures 4 – 6, the sampling distributions of the FMI estimates were very wide, and showed considerable bias at small sample sizes. Although the bias was absent at larger sample sizes, the estimates would only fall closely around the population value when the missing rate was low. As seen in Tables 3 – 5, when the per-variable missing rate was $\pi_{\mathrm{mis}}$ = 0.4, a sample size of 500 was required for the best-performing estimate to be precise, with the other two estimates performing worse. Under the high missing rate of $\pi_{\mathrm{mis}}$ = 0.6, a sample size of 1,000 was required. Overall, the analytic estimate based on the unstructured model gave the best performance, showing the lowest RMSE and 95% ETI width.
Compared to the factor correlation FMIs, the factor loading FMIs performed much better in Model 2. As in the population simulation, here we report on the FMIs of the $X_2$ loading on $F_1$. The results for the other loadings, $X_4$, $Y_2$, and $Y_4$, are largely identical. Overall, as illustrated in Figures 7 – 9, the sample FMIs of the factor loadings showed little bias and would typically fall much closer to the population values. As shown in Supplementary Tables 5 – 7, the bias, RMSE, and 95% ETI width were satisfactory for all three estimates when the sample size was N = 500 or greater. At N = 200, the FMIs only performed well when the missing data mechanism was MCAR and the per-variable missing rate was 0.2. At N = 100, all three estimates performed poorly, but the analytic estimate based on the unstructured model was closer to acceptable performance, producing 95% ETI widths of 0.30, while the other two estimates produced widths close to 0.50 and 0.70. Once again, the unstructured estimate showed the best performance overall, with the lowest RMSE and 95% ETI width, and comparable bias.
Here we provide an example of how to obtain FMIs from the lavaan package (version 0.6-7 and up) in R, using the Holzinger and Swineford (1939) dataset and simulated MCAR missing data. The example is adapted from Savalei and Rosseel (in press). The dataset, available through the lavaan package, contains cognitive performance test scores from 301 school children. The data for this example can be loaded into the R workspace using the following code.
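A minimal sketch of the loading step, assuming the lavaan package is installed. The object name `HS` is an assumption made here for illustration; the authors' own listing is in their Appendix A and may differ.

```r
library(lavaan)

# HolzingerSwineford1939 ships with lavaan; load it explicitly by name
data("HolzingerSwineford1939", package = "lavaan")

# Keep only the nine test-score variables used in the example (x1 to x9)
HS <- HolzingerSwineford1939[, paste0("x", 1:9)]
dim(HS)
```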
For the purpose of this demonstration, we will conduct a confirmatory factor analysis with three correlated factors: visual skills, verbal skills, and mental speed. Each factor is measured by three tests, with variable names x1 to x9 in the dataset, following a model given by the lavaan model syntax below.
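The three-factor structure described here can be written in lavaan model syntax roughly as follows. The factor labels `visual`, `verbal`, and `speed` and the object name `HS.model` are assumptions chosen to match the text, and the assignment of x1 to x3, x4 to x6, and x7 to x9 follows the grouping shown in Table 6.

```r
# lavaan model syntax: "=~" reads "is measured by"
HS.model <- '
  visual =~ x1 + x2 + x3
  verbal =~ x4 + x5 + x6
  speed  =~ x7 + x8 + x9
'
```

Correlations among the three factors do not need to be written out, since `cfa()` lets exogenous latent variables covary by default.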
As the dataset does not contain missing data, we will introduce MCAR missingness by randomly removing 61 values from each variable independently, resulting in an overall missing rate of 20%.
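One way to impose this kind of MCAR missingness is sketched below. The seed and the object name `HSMiss` are arbitrary assumptions, so the specific cells removed (and hence the exact FMI values obtained later) will not match the published example.

```r
library(lavaan)
data("HolzingerSwineford1939", package = "lavaan")

# Work on a copy of the nine test-score variables
HSMiss <- HolzingerSwineford1939[, paste0("x", 1:9)]

n_remove <- 61   # values deleted per variable, as stated in the text
set.seed(2021)   # arbitrary seed, for reproducibility only (assumption)

# Delete rows independently for each variable, so missingness is MCAR
for (v in names(HSMiss)) {
  HSMiss[sample(nrow(HSMiss), n_remove), v] <- NA
}

# Per-variable missing rate: 61 / 301, roughly 0.20
colMeans(is.na(HSMiss))
```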
With the example data and model syntax prepared, we fit the data to the CFA model and obtain the FMIs with the following code.
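A self-contained sketch of the fitting step is given below, covering the default estimate and the two analytic variants discussed in this section. The setup lines are repeated so the block runs on its own. The option values `observed.information = "h1"` and `h1.information = "unstructured"` reflect lavaan's documented options as understood here and should be checked against the installed version; the object names, the seed, and the helper `get_fmi` are assumptions.

```r
library(lavaan)

# Setup repeated so that this block runs on its own
data("HolzingerSwineford1939", package = "lavaan")
HSMiss <- HolzingerSwineford1939[, paste0("x", 1:9)]
set.seed(2021)  # arbitrary seed (assumption)
for (v in names(HSMiss)) HSMiss[sample(nrow(HSMiss), 61), v] <- NA

HS.model <- '
  visual =~ x1 + x2 + x3
  verbal =~ x4 + x5 + x6
  speed  =~ x7 + x8 + x9
'

# Default: observed information from the numeric Hessian
fit_hessian <- cfa(HS.model, data = HSMiss, std.lv = TRUE, missing = "ml")

# Analytic approximation based on structured information
fit_struct <- cfa(HS.model, data = HSMiss, std.lv = TRUE, missing = "ml",
                  observed.information = "h1")

# Analytic approximation based on unstructured information
fit_unstruct <- cfa(HS.model, data = HSMiss, std.lv = TRUE, missing = "ml",
                    observed.information = "h1",
                    h1.information = "unstructured")

# Hypothetical helper: FMIs of the freely estimated loadings and (co)variances
get_fmi <- function(fit) {
  pe <- parameterEstimates(fit, fmi = TRUE, remove.nonfree = TRUE)
  pe[pe$op %in% c("=~", "~~"), c("lhs", "op", "rhs", "fmi")]
}

# Side-by-side comparison of the three FMI estimates
fmi_table <- cbind(get_fmi(fit_hessian)[, c("lhs", "op", "rhs")],
                   hessian      = get_fmi(fit_hessian)$fmi,
                   structured   = get_fmi(fit_struct)$fmi,
                   unstructured = get_fmi(fit_unstruct)$fmi)
print(fmi_table, digits = 2)
```

Fitting the model three times keeps each estimate tied to a single, explicit set of options, which makes the comparison easy to audit.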
The option std.lv = TRUE fixes all variances of the latent factors to 1, allowing all loadings to be freely estimated, while missing = "ml" asks lavaan to handle the missing data using FIML. The function parameterEstimates extracts the results from the model fit, where the option fmi = TRUE requests FMI estimates from lavaan alongside the parameter estimates. The remove.nonfree = TRUE option omits parameters that are not freely estimated from the output; in this case, the latent factor variances are not printed in the output table, as they were fixed to 1. By default, lavaan uses numeric Hessian estimation of the observed information matrix, which yields the Hessian-based FMI estimates. To produce the analytic FMI estimates, the observed.information input must be specified to request the analytic approximation of the information.
By default, the analytic approximation is based on structured information. To obtain the estimate based on unstructured information, the h1.information input must be provided to request the unstructured information.
The three FMI estimates provide largely similar values (see Table 6; values below are listed in the column order of that table). They agree on which loading estimates have the highest FMIs, namely the loadings of X1 (0.29, 0.27, 0.29), X2 (0.26, 0.25, 0.28), X3 (0.28, 0.28, 0.31), and X7 (0.27, 0.27, 0.28). The three estimates also agree on which factor correlation has the highest FMI, namely the correlation between visual skill and mental speed (0.23, 0.24, 0.29). The FMI of this factor correlation also shows the largest difference among the FMI estimates, ranging between 0.23 and 0.29. The three estimates disagree on which factor correlation has the lowest FMI, but the differences are small: for the correlation between visual and verbal skills, the estimates are 0.18, 0.15, and 0.18; for the correlation between verbal skill and mental speed, they are 0.18, 0.18, and 0.20.
Parameter | Variables | | | |
---|---|---|---|---|
Loading | Visual, X1 | 0.29 | 0.27 | 0.29 |
Loading | Visual, X2 | 0.26 | 0.25 | 0.28 |
Loading | Visual, X3 | 0.28 | 0.28 | 0.31 |
Loading | Verbal, X4 | 0.24 | 0.22 | 0.24 |
Loading | Verbal, X5 | 0.17 | 0.17 | 0.19 |
Loading | Verbal, X6 | 0.19 | 0.18 | 0.19 |
Loading | Speed, X7 | 0.27 | 0.27 | 0.28 |
Loading | Speed, X8 | 0.23 | 0.24 | 0.26 |
Loading | Speed, X9 | 0.19 | 0.22 | 0.22 |
Factor correlation | Visual, verbal | 0.18 | 0.15 | 0.18 |
Factor correlation | Visual, speed | 0.23 | 0.24 | 0.29 |
Factor correlation | Verbal, speed | 0.18 | 0.18 | 0.20 |
Overall, the highest FMI estimate in the model is 0.31. The associated WIF is approximately 1.20, which indicates an estimated 20% increase in the width of the confidence interval due to missing data. Reporting these FMIs alongside analyses of empirical data will give readers a better sense of how much the presence of missing data has affected the efficiency of the parameter estimates, especially when comparing across studies where FMIs may differ even under the same missing rates. The full R code of this example is provided in Appendix A, and a summary of the lavaan options for the three estimates is given in Supplementary Table 1.
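As a worked check of the 20% figure, assume the width inflation factor is defined in the usual way, as the square root of the ratio of observed-data to complete-data sampling variance (the definition is not restated in this section, so this is an assumption). With an FMI of 0.31:

```latex
\mathrm{WIF} = \sqrt{\frac{1}{1 - \mathrm{FMI}}}
             = \sqrt{\frac{1}{1 - 0.31}}
             = \sqrt{\frac{1}{0.69}}
             \approx 1.20
```

Under this definition the confidence interval is about 1.20 times as wide as it would have been with complete data, which corresponds to the 20% increase.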
The current simulation study suggests that a relatively large sample size may be necessary for the estimation of the FMIs. Sample FMI estimates were largely unbiased, even in very small samples with N = 50, which are typical of regression analyses. However, at such a small sample size, the estimates were imprecise, varying greatly from sample to sample, especially when the missing rate was high. When the missing rates were reasonably low ($\pi_{\mathrm{mis}}$ = 0.2, 0.4; corresponding to a 10–20% overall missing rate), sample sizes of several hundred, which are typical in structural equation models, were able to produce reasonably efficient estimates. However, at an overall missing rate of 30%, sample sizes exceeding 1,000 would be required to produce precise estimates. It is worth noting that we used very strong selection mechanisms in the MAR-L and MAR-NL conditions to contrast the results with those from MCAR. In applied settings, the MAR selection mechanism would typically be weaker, which would lead to sample FMI estimates performing closer to the better performance we saw in the MCAR conditions.
The three estimates are identical for saturated models, such as regression. However, the choice makes a difference when the model is not saturated. In the two-factor model, the estimate via the numeric Hessian showed a distinct disadvantage, as it was more likely to break down when the sample size was small or when the missing rate was high. In contrast, the analytic estimate based on the unstructured model was much less likely to break down in all cases, and was also more precise. Although the unstructured estimate occasionally showed slightly higher bias than the analytic estimate based on the structured model, its performance was overall more favorable in these simulations.
The simulation study investigated FMI in FIML, but the results should generalize to FMI computed from MI with a large number of imputations. In MI, the FMI is conceptually given by the ratio of the between-imputation variance over the sum of the within- and between- imputation variances. As the number of imputations approaches infinity, this ratio becomes equivalent to the ratio of variance increase due to missing data over variance in the observed data as estimated from FIML. For simulation studies, FMI can be more computationally expensive in MI, as the estimate is produced in the final pooling stage of the analysis, and often requires a large number of imputations (more than 100) to achieve an acceptable level of accuracy (Harel, 2007 ). For substantive research, the researcher may simply choose between FIML and MI as the estimation method of FMI based on the missing data technique they are already using to produce the estimates of the model parameters.
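The correspondence described in this paragraph can be written compactly. In the sketch below, the symbols are assumptions introduced for illustration: $B$ and $W$ denote the between- and within-imputation variances of a parameter estimate, and $V_{\mathrm{obs}}$ and $V_{\mathrm{com}}$ denote its sampling variance with the observed (incomplete) data and with complete data, respectively.

```latex
\mathrm{FMI}_{\mathrm{MI}} \approx \frac{B}{W + B},
\qquad
\mathrm{FMI}_{\mathrm{FIML}} = \frac{V_{\mathrm{obs}} - V_{\mathrm{com}}}{V_{\mathrm{obs}}}
```

As the number of imputations grows, $W$ plays the role of $V_{\mathrm{com}}$ and $W + B$ the role of $V_{\mathrm{obs}}$, so the two ratios express the same quantity.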
As far as we are aware, this study is the first to look at the properties of sample FMIs computed using FIML. As such, we focused on two relatively simple and commonly used models, with three missing data mechanisms selected to contrast the impact of the specific mechanism on the FMI values and to stress that these values are not the same as the rates of missing data. Future research may wish to expand on the study conditions, for example, by controlling for the number of missing data patterns, or by examining how changing the values of parameters in the model (such as the regression coefficient) would change the properties of the FMI estimates. It would also be worthwhile to investigate the relative performance of the three FMI estimates under incorrect models. When the model is wrong, the Hessian-based estimate is theoretically superior, as it is the only consistent estimate. However, whether this theoretical advantage would translate into a practical advantage needs to be examined in simulation studies. It would also be helpful to develop bootstrap SEs and CIs for the sample FMIs, so that researchers would have a better sense of the precision of the FMI estimates in their particular sample.
While our focus was on evaluating the properties of sample FMI estimates in terms of how well they served as estimates of the corresponding population FMIs, the properties of population FMIs themselves may be of interest, and are a topic we are exploring in other work. In this ongoing work, we are finding that information loss can occur in unintuitive and unpredictable ways, and patterns in population FMIs observed in one context do not always generalize to other contexts. For example, based on the population-level FMI values we obtained for the conditions in this study, one may be tempted to conclude that the population FMIs of factor correlations are, in general, insensitive to missing data mechanisms (e.g., the middle rows of Table 1). However, this was not always the case. In conditions not reported here, when the indicators of $F_1$ were all completely observed, and the indicators of $F_2$ contained missingness conditioned on the indicators of $F_1$, the FMI of the factor correlation became more sensitive to the missing data mechanism. Although the population quantities estimated by the sample FMIs may exhibit different patterns, we do not expect the sample FMI estimates themselves to show drastically different properties in these alternative scenarios.
The FMI estimates via FIML are available in recent releases of lavaan (version 0.6-7 and up), and example R code for how to retrieve them is given in Appendix A. We recommend that empirical researchers routinely examine and report FMIs for key parameters in substantive analyses. FMIs capture the complex interplay between numerous factors such as missing rates, missing data mechanisms, and model parameters, sometimes in unintuitive ways. They can provide critical additional insight into how the standard errors, confidence intervals, and hypothesis tests may have been affected by the presence of missing data. For methodologists conducting simulation studies with missing data, the population FMIs in the different study design conditions should be computed and reported, in order to provide better context for the performance of the missing data techniques being studied.
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://osf.io/xyzt8/ .
VS developed the equations, statistical properties, and details of implementation for FMI estimates. LC designed and carried out the simulation study and wrote the initial draft of the manuscript. All authors contributed to the revisions of the manuscript.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
1 As the sample size approaches infinity, all parameter estimates approach their true values, and therefore all standard errors approach 0. Multiplying the standard errors by $\sqrt{N}$ makes them converge to nonzero numbers.
2 See Appendix A for how to request FMI estimates in lavaan. We would like to thank Yves Rosseel, the developer of lavaan, for implementing these variations in the computation of FMI, permitting us to carry out this study.
Funding. This research was supported by grant RGPIN-2015-05251 from the Natural Sciences and Engineering Research Council of Canada (NSERC) to VS.
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2021.667802/full#supplementary-material