Front Psychol. 2021; 12: 667802.
Published online 2021 Aug 26. doi: 10.3389/fpsyg.2021.667802
PMCID: PMC8426626

Three Sample Estimates of Fraction of Missing Information From Full Information Maximum Likelihood

Lihan Chen

Table 1

Model Parameter π mis δ pop,j: MCAR MAR-L MAR-NL
Regression β 0.2 0.15 0.29 0.35
Regression β 0.4 0.33 0.51 0.65
Regression β 0.6 0.52 0.63 0.79
Two-factor ϕ 0.2 0.13 0.14 0.14
Two-factor ϕ 0.4 0.26 0.27 0.28
Two-factor ϕ 0.6 0.38 0.38 0.39
Two-factor λ 0.2 0.20 0.37 0.44
Two-factor λ 0.4 0.40 0.60 0.71
Two-factor λ 0.6 0.60 0.71 0.83

Numbers computed from N = 1,000,000. $\pi_{mis}$ is the probability of a missing value in each variable with missingness. The overall missing rate is $\pi_{mis}/2$. β, the regression coefficient; ϕ, the factor correlation; λ, the factor loading. The population FMIs of the four factor loadings corresponding to variables containing missingness are all within 0.01 of each other, and therefore only the FMI of λ for $X_2$ is reported.

In Model 2 (the two-factor model), four factor loadings were associated with variables that contained missing data: the loadings of $X_2$ and $X_4$ on $F_1$, and the loadings of $Y_2$ and $Y_4$ on $F_2$. As these loadings were of the same size, and the missing rate was the same for all four variables, only the FMI of the loading of $X_2$ is reported here. Under MCAR, the FMIs of the factor loading (λ) were $\delta_{pop,\lambda}$ = 0.20, 0.40, and 0.60 for per-variable missing rates of $\pi_{mis}$ = 0.2, 0.4, and 0.6, respectively. Under MAR-L, the FMIs were $\delta_{pop,\lambda}$ = 0.37, 0.60, and 0.71. Under MAR-NL, they were $\delta_{pop,\lambda}$ = 0.44, 0.71, and 0.83. Once again, the FMI increased with the missing rate in every condition, and the MAR-NL condition led to the highest FMIs, followed by MAR-L, and then MCAR.

The pattern of differences between the missing data mechanisms, however, did not hold for the FMIs of the factor correlation (ϕ) in Model 2. The FMIs under per-variable missing rates of $\pi_{mis}$ = 0.2, 0.4, and 0.6, respectively, were $\delta_{pop,\phi}$ = 0.13, 0.26, and 0.38 in MCAR, $\delta_{pop,\phi}$ = 0.14, 0.27, and 0.38 in MAR-L, and $\delta_{pop,\phi}$ = 0.14, 0.28, and 0.39 in MAR-NL. The FMIs of the factor correlation, unlike the FMIs of the other parameters, did not differ notably across the missing data mechanisms. Even within the same model, parameters affected by the same missing rates and the same missing data mechanism could see drastically different amounts of information loss. For instance, under $\pi_{mis}$ = 0.2 and MAR-NL, the FMI of ϕ was $\delta_{pop,\phi}$ = 0.14, but the FMI of λ in the same model was $\delta_{pop,\lambda}$ = 0.44. These results emphasize that information loss cannot be predicted from the missing rates and from whether the missing data mechanism is MCAR or MAR. Information loss can only be quantified by computing the FMI for a particular mechanism and parameter of interest.

2.2.2. Model 1

For the regression coefficient β, the three sample FMI estimates, $\hat{\delta}_{1,\beta}$, $\hat{\delta}_{2,\beta}$, and $\hat{\delta}_{3,\beta}$, were identical because the regression model is saturated. We refer to the resulting estimate simply as $\hat{\delta}_{\beta}$ in Table 2. As the table shows, the sample FMI estimates were largely unbiased, with the long-run means of $\hat{\delta}_{\beta}$ over the 1,000 replications falling within 0.05 of the population value in practically all conditions. Notable bias (i.e., >0.05) arose only when the missing data mechanism was MAR-NL, the sample size was small, and the missing rate was high, where the sample FMI could underestimate the population value by as much as 0.07 (near the top of Table 2 under the MAR-NL columns).

Table 2

The bias, RMSE, and the 95% equal-tailed interval width of the regression coefficient FMI estimate ( δ ^ β ).

π mis MAR-L MAR-NL
0.2 50 0.01 −0.01 −0.02
0.4 50 0.00 −0.02 −0.05
0.6 50 0.00 −0.02 −0.07
0.2 100 0.00 0.05 0.19 0.00 −0.01
0.4 100 0.00 −0.01 −0.03
0.6 100 0.01 −0.01 −0.06
0.2 200 0.00 0.03 0.13 0.00
0.4 200 0.00 0.05 0.19 −0.01 −0.02
0.6 200 0.00 −0.03
0.2 500 0.00 0.02 0.09 0.00 0.04 0.16 0.00 0.05 0.19
0.4 500 0.00 0.03 0.12 0.00 −0.01
0.6 500 0.00 0.04 0.14 0.00 −0.01

π mis is the probability of missing value in each variable with missingness. The overall missing rate is π mis /2. ETI denotes the 95% ETI width, which is the distance between the 2.5 and 97.5 percentile of the sampling distributions. Bias values with magnitudes >0.05, RMSEs >0.05, and ETI widths >0.20 are indicated with bold fonts. The conditional formatting was applied prior to rounding .

The sample FMI estimates were not particularly efficient at the smaller sample sizes typical of regression analyses, and would produce highly variable results from run to run. At N = 50 and 100, the RMSEs were around 0.10 or higher across nearly all missing rates and missing data mechanisms. At N = 100, the estimate was most precise under MCAR with $\pi_{mis}$ = 0.2, with an RMSE of 0.05 and a 95% ETI width of 0.19. For the larger sample sizes of N = 200 and 500, the performance of $\hat{\delta}_{\beta}$ was acceptable overall under MCAR, but not under MAR-L and MAR-NL. Under both MAR conditions, good performance was only achieved when N = 500 and $\pi_{mis}$ = 0.2.
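As a rough illustration of how the three summary statistics discussed above (bias, RMSE, and 95% ETI width) can be computed from a set of replications, the base R sketch below may help. The function name `summarize_fmi` and the simulated estimates are hypothetical and are not taken from the study's code or results.

```r
# Summarize replicated sample FMI estimates against a known population FMI.
# `est` is a numeric vector of sample estimates; `pop` is the population value.
summarize_fmi <- function(est, pop) {
  q <- quantile(est, probs = c(0.025, 0.975), names = FALSE)
  c(
    bias      = mean(est) - pop,            # long-run mean minus population value
    rmse      = sqrt(mean((est - pop)^2)),  # root mean squared error
    eti_width = q[2] - q[1]                 # 95% equal-tailed interval width
  )
}

# Made-up replications for illustration only (not the study's results).
set.seed(1)
fake_est <- rnorm(1000, mean = 0.15, sd = 0.05)
round(summarize_fmi(fake_est, pop = 0.15), 2)
```

The ETI width here is the distance between the 2.5th and 97.5th percentiles of the replicated estimates, matching the definition given in the note to Table 2.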

For a more intuitive illustration of bias and variability, the smoothed densities of the 1,000 replications in each condition are shown in Figures 1–3. Almost all sampling distributions are centered around the population value, with the exception of the top right panels of Figure 3, which correspond to the small sample size, high missing rate, nonlinear MAR conditions. These distributions, however, are not tightly concentrated around the population value, except when the sample size is 500 and either the missing rate is low or the missing data mechanism is MCAR.

Figure 1

The sampling distribution of FMI for regression coefficients in MCAR. The panel rows correspond to sample sizes of N = 50, 100, 200, and 500. The panel columns correspond to per variable missing rates of 20, 40, and 60%, or overall population missing rates of 10, 20, and 30%. The population FMI value in each panel is given as a vertical black dotted line. The sampling distributions of the three estimates are virtually identical in simple regression.

Figure 2

The sampling distribution of FMI for regression coefficients in linear MAR. The panel rows correspond to sample sizes of N = 50, 100, 200, and 500. The panel columns correspond to per variable missing rates of 20, 40, and 60%, or overall population missing rates of 10, 20, and 30%. The population FMI value in each panel is given as a vertical black dotted line. The sampling distributions of the three estimates are virtually identical in simple regression.

Figure 3

The sampling distribution of FMI for regression coefficients in nonlinear MAR. The panel rows correspond to sample sizes of N = 50, 100, 200, and 500. The panel columns correspond to per variable missing rates of 20, 40, and 60%, or overall population missing rates of 10, 20, and 30%. The population FMI value in each panel is given as a vertical black dotted line. The sampling distributions of the three estimates are virtually identical in simple regression.

2.2.3. Model 2

In the two-factor model, not all replications were usable. In some cases, the model would fail to converge. In other cases, when FMIs were requested, lavaan would sometimes produce an error, leading to NAs or negative values for the FMI estimates. These issues were exacerbated by higher missing rates, and were particularly pronounced for $\hat{\delta}_{1,j}$ when the sample size was small. Under N = 100 and $\pi_{mis}$ = 0.6 (overall missing rate 0.3), $\hat{\delta}_{1,j}$ would encounter issues in more than 30% of the runs. In this regard, the best performing estimate was $\hat{\delta}_{3,j}$, which almost never produced negative values and had the lowest rate of NA occurrences. However, even for $\hat{\delta}_{3,j}$, the rate of failed or improper estimates was often close to or above 10% at N = 100. At N = 200, $\hat{\delta}_{3,j}$ would encounter this issue at most 3% of the time, even when the per-variable missing rate was 0.6. See Supplementary Tables 3 and 4 for the rates of failed or improper estimates for the factor correlations and factor loadings, respectively. The following results were obtained by excluding all occurrences of NAs and negative values when aggregating the sample FMI estimates.
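The exclusion rule described above can be sketched in a few lines of base R. The helper name `usable_fmi` and the example values are hypothetical. They only illustrate dropping NAs and negative estimates before aggregating, and computing the rate of failed or improper estimates.

```r
# Keep only usable replications: drop NAs (failed runs) and negative estimates.
usable_fmi <- function(fmi) fmi[!is.na(fmi) & fmi >= 0]

# Made-up replication results for illustration only.
fmi_reps <- c(0.41, NA, -0.03, 0.38, 0.45)
kept <- usable_fmi(fmi_reps)

# Proportion of failed or improper estimates among all replications.
failure_rate <- 1 - length(kept) / length(fmi_reps)
failure_rate

# Aggregate over the usable replications only.
mean(kept)
```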

The sample FMIs of the factor correlation (ϕ) were largely unbiased in MCAR, or when the mechanism was MAR but the sample size was 200 or above (see Tables 3–5). Bias was also generally small at N = 100 when the missing data mechanism was MCAR, but notable bias could occur in MAR-L and MAR-NL. Among the three estimates, $\hat{\delta}_{1,\phi}$ showed the largest amount of bias overall, severely overestimating the FMI (0.17 above the true value on average) when the missing data mechanism was nonlinear MAR and the per-variable missing rate was $\pi_{mis}$ = 0.6. $\hat{\delta}_{3,\phi}$ also showed some notable bias under N = 100, but only up to 0.11 in the worst case. $\hat{\delta}_{2,\phi}$ showed the least amount of bias, producing raw bias values very close to 0.05 or below across all the conditions.

Table 3

The bias, RMSE, and the 95% equal-tailed interval width of factor correlation FMIs under MCAR.

π mis 95% ETI width δ ^ 1 , ϕ δ ^ 2 , ϕ δ ^ 3 , ϕ δ ^ 1 , ϕ δ ^ 2 , ϕ δ ^ 3 , ϕ δ ^ 1 , ϕ δ ^ 2 , ϕ δ ^ 3 , ϕ
0.2 100 0.01 0.00 0.01
0.4 100 0.02 0.00 0.03
0.6 100 −0.02 0.05
0.2 200 0.00 0.00 0.00 0.04 0.04 0.04 0.17 0.16 0.15
0.4 200 −0.01 0.00 0.01
0.6 200 −0.02 −0.01 0.02
0.2 500 0.00 0.00 0.00 0.03 0.03 0.03 0.10 0.10 0.10
0.4 500 0.00 0.00 0.00 0.05 0.05 0.05 0.20 0.19 0.18
0.6 500 −0.01 −0.01 0.01
0.2 1000 0.00 0.00 0.00 0.02 0.02 0.02 0.07 0.07 0.07
0.4 1000 0.00 0.00 0.00 0.03 0.03 0.03 0.13 0.13 0.13
0.6 1000 −0.01 0.00 0.01 0.19 0.19 0.20

π mis is the probability of missing value in each variable with missingness. The overall missing rate is π mis /2. Distribution width is the distance between the 2.5 and 97.5 percentile in the sampling distributions. Bias values with magnitudes >0.05, RMSEs >0.05, and ETI widths >0.20 are indicated with bold fonts. The conditional formatting was applied prior to rounding .

Table 5

The bias, RMSE, and the 95% equal-tailed interval width of factor correlation FMIs under nonlinear MAR.

π mis 95% ETI width δ ^ 1 , ϕ δ ^ 2 , ϕ δ ^ 3 , ϕ δ ^ 1 , ϕ δ ^ 2 , ϕ δ ^ 3 , ϕ δ ^ 1 , ϕ δ ^ 2 , ϕ δ ^ 3 , ϕ
0.2 100 0.04 0.02 0.02
0.4 100
0.6 100
0.2 200 0.01 0.00 0.00 0.05 0.04 0.19 0.18
0.4 200 0.02 0.01 0.03
0.6 200 0.04 0.02
0.2 500 0.00 0.00 0.00 0.03 0.03 0.03 0.12 0.12 0.11
0.4 500 0.01 0.01 0.01
0.6 500 0.00 0.00 0.02
0.2 1000 0.00 0.00 0.00 0.02 0.02 0.02 0.08 0.08 0.07
0.4 1000 0.00 0.00 0.01 0.05 0.05 0.04 0.18 0.18 0.15
0.6 1000 0.00 0.00 0.01

π mis is the probability of missing value in each variable with missingness. The overall missing rate is π mis /2. Distribution width is the distance between the 2.5 and 97.5 percentile in the sampling distributions. Bias values with magnitudes >0.05, RMSEs >0.05, and ETI widths >0.20 are indicated with bold fonts. The conditional formatting was applied prior to rounding .

Table 4

The bias, RMSE, and the 95% equal-tailed interval width of factor correlation FMIs under linear MAR.

π mis 95% ETI width δ ^ 1 , ϕ δ ^ 2 , ϕ δ ^ 3 , ϕ δ ^ 1 , ϕ δ ^ 2 , ϕ δ ^ 3 , ϕ δ ^ 1 , ϕ δ ^ 2 , ϕ δ ^ 3 , ϕ
0.2 100 0.03 0.01 0.02
0.4 100 0.02 0.04
0.6 100
0.2 200 0.01 0.00 0.00 0.05 0.05 0.19 0.18
0.4 200 0.01 0.01 0.01
0.6 200 −0.01 0.00 0.04
0.2 500 0.00 0.00 0.00 0.03 0.03 0.03 0.12 0.11 0.11
0.4 500 0.00 0.00 0.00 0.20
0.6 500 0.00 0.00 0.01
0.2 1000 0.00 0.00 0.00 0.02 0.02 0.02 0.08 0.08 0.07
0.4 1000 0.00 0.00 0.00 0.04 0.04 0.04 0.14 0.14 0.14
0.6 1000 0.00 0.00 0.01 0.20 0.20

π mis is the probability of missing value in each variable with missingness. The overall missing rate is π mis /2. Distribution width is the distance between the 2.5 and 97.5 percentile in the sampling distributions. Bias values with magnitudes >0.05, RMSEs >0.05, and ETI widths >0.20 are indicated with bold fonts. The conditional formatting was applied prior to rounding.

Although the population FMIs of the factor correlations were quite low, they proved difficult to estimate precisely. As illustrated in Figures 4–6, the sampling distributions of the FMI estimates were very wide, and showed considerable bias at small sample sizes. Although the bias was absent at larger sample sizes, the estimates would only fall closely around the population value when the missing rate was low. As seen in Tables 3–5, when the per-variable missing rate was $\pi_{mis}$ = 0.4, a sample size of 500 was required for $\hat{\delta}_{3,\phi}$ to provide a precise estimate, with $\hat{\delta}_{1,\phi}$ and $\hat{\delta}_{2,\phi}$ performing worse. Under the high missing rate of $\pi_{mis}$ = 0.6, a sample size of 1,000 was required. Overall, $\hat{\delta}_{3,\phi}$ gave the best performance, showing the lowest RMSE and 95% ETI width.

The sampling distribution of FMI for factor correlations in MCAR. The panel rows correspond to sample sizes of N = 50, 100, 200, and 500. The panel columns correspond to per variable missing rates of 20, 40, and 60%, or overall population missing rates of 10, 20, and 30%. The population FMI value in each panel is given as a vertical black dotted line.

The sampling distribution of FMI for factor correlations in nonlinear MAR. The panel rows correspond to sample sizes of N = 50, 100, 200, and 500. The panel columns correspond to per variable missing rates of 20, 40, and 60%, or overall population missing rates of 10, 20, and 30%. The population FMI value in each panel is given as a vertical black dotted line.

The sampling distribution of FMI for factor correlations in linear MAR. The panel rows correspond to sample sizes of N = 50, 100, 200, and 500. The panel columns correspond to per variable missing rates of 20, 40, and 60%, or overall population missing rates of 10, 20, and 30%. The population FMI value in each panel is given as a vertical black dotted line.

Compared to the factor correlation FMIs, the factor loading FMIs performed much better in Model 2. As in the population simulation, here we report on the FMIs of the $X_2$ loading on $F_1$. The results for the other loadings ($X_4$, $Y_2$, and $Y_4$) are largely identical. Overall, as illustrated in Figures 7–9, the sample FMIs of factor loadings showed little bias and would typically fall much closer to the population values. As shown in Supplementary Tables 5–7, the bias, RMSE, and 95% ETI width were satisfactory for all three estimates when the sample size was N = 500 or greater. At N = 200, the FMIs only performed well when the missing data mechanism was MCAR and the per-variable missing rate was 0.2. At N = 100, all three estimates performed poorly, but $\hat{\delta}_{3,\lambda}$ was closer to acceptable performance, producing 95% ETI widths of 0.30, whereas $\hat{\delta}_{2,\lambda}$ would produce widths close to 0.50 and $\hat{\delta}_{1,\lambda}$ would produce widths close to 0.70. Once again, $\hat{\delta}_{3,\lambda}$ showed the best performance overall, with the lowest RMSE and 95% ETI width, and a bias similar to that of $\hat{\delta}_{2,\lambda}$.

Figure 7

The sampling distribution of FMI for factor loadings in MCAR. The panel rows correspond to sample sizes of N = 50, 100, 200, and 500. The panel columns correspond to per variable missing rates of 20, 40, and 60%, or overall population missing rates of 10, 20, and 30%. The population FMI value in each panel is given as a vertical black dotted line.

The sampling distribution of FMI for factor loadings in nonlinear MAR. The panel rows correspond to sample sizes of N = 50, 100, 200, and 500. The panel columns correspond to per variable missing rates of 20, 40, and 60%, or overall population missing rates of 10, 20, and 30%. The population FMI value in each panel is given as a vertical black dotted line.

The sampling distribution of FMI for factor loadings in linear MAR. The panel rows correspond to sample sizes of N = 50, 100, 200, and 500. The panel columns correspond to per variable missing rates of 20, 40, and 60%, or overall population missing rates of 10, 20, and 30%. The population FMI value in each panel is given as a vertical black dotted line.

3. Example Analysis

Here we provide an example of how to obtain FMIs from the lavaan package (version 0.6-7 and up) in R, using the Holzinger and Swineford (1939) dataset and simulated MCAR missing data. The example is adapted from Savalei and Rosseel (in press). The dataset, available through the lavaan package, contains cognitive performance test scores from 301 school children. The data for this example can be loaded into the R workspace using the following code.
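A minimal sketch of this step, assuming the lavaan package is installed. The object name `HSdata` is an illustrative choice and is not taken from the article's appendix.

```r
library(lavaan)

# HolzingerSwineford1939 ships with lavaan; keep only the nine test scores.
data("HolzingerSwineford1939", package = "lavaan")
HSdata <- HolzingerSwineford1939[, paste0("x", 1:9)]

# Check the dimensions: 301 children, 9 observed variables.
dim(HSdata)
```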

For the purpose of this demonstration, we will conduct a confirmatory factor analysis with three correlated factors: visual skills, verbal skills, and mental speed. Each factor is measured by three tests, with variable names x1 to x9 in the dataset, following the model given by the lavaan model syntax below.
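The three-factor structure described above can be written in lavaan model syntax roughly as follows. The factor labels `visual`, `verbal`, and `speed` and the object name `HS.model` are assumptions made for illustration; only the assignment of x1 to x9 to the three factors comes from the text.

```r
# lavaan model syntax: each latent factor (left of =~) is measured by three tests.
HS.model <- "
  visual =~ x1 + x2 + x3
  verbal =~ x4 + x5 + x6
  speed  =~ x7 + x8 + x9
"

# The syntax is an ordinary character string that is later passed to cfa().
cat(HS.model)
```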

As the dataset does not contain missing data, we will introduce MCAR missingness by randomly removing 61 values from each variable independently, resulting in an overall missing rate of 20%.
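One way to introduce this kind of MCAR missingness is sketched below. The seed and the object name `HSMiss` are hypothetical, so the resulting pattern of missing values (and any estimates based on it) will not reproduce the article's numbers exactly.

```r
library(lavaan)

data("HolzingerSwineford1939", package = "lavaan")
HSMiss <- HolzingerSwineford1939[, paste0("x", 1:9)]

set.seed(1234)   # hypothetical seed, for reproducibility of this sketch only
n_remove <- 61   # values removed per variable, as described in the text

# For each variable independently, set 61 randomly chosen rows to NA.
for (v in names(HSMiss)) {
  HSMiss[sample(nrow(HSMiss), n_remove), v] <- NA
}

colSums(is.na(HSMiss))          # 61 missing values in every column
mean(is.na(as.matrix(HSMiss)))  # overall missing rate, 61/301 (about 20%)
```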

With the example data and model syntax prepared, we fit the CFA model to the data and obtain the FMIs with the following code.
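A sketch of this step is given below, using the `cfa` and `parameterEstimates` options named in the text. The data preparation is repeated so that the block runs on its own. The seed, the factor labels, and the object names (`HSMiss`, `HS.model`, `fit1`, `est1`) are illustrative assumptions.

```r
library(lavaan)

# Rebuild the example inputs (hypothetical seed and object names).
data("HolzingerSwineford1939", package = "lavaan")
HSMiss <- HolzingerSwineford1939[, paste0("x", 1:9)]
set.seed(1234)
for (v in names(HSMiss)) HSMiss[sample(nrow(HSMiss), 61), v] <- NA

HS.model <- "
  visual =~ x1 + x2 + x3
  verbal =~ x4 + x5 + x6
  speed  =~ x7 + x8 + x9
"

# Fit the CFA with FIML; latent variances are fixed to 1 via std.lv = TRUE.
fit1 <- cfa(HS.model, data = HSMiss, std.lv = TRUE, missing = "ml")

# Request FMI estimates alongside the free parameter estimates.
est1 <- parameterEstimates(fit1, fmi = TRUE, remove.nonfree = TRUE)

# Show loadings (=~) and (co)variances (~~) with their FMIs.
est1[est1$op %in% c("=~", "~~"), c("lhs", "op", "rhs", "est", "se", "fmi")]
```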

The option std.lv = TRUE fixes all variances of the latent factors to 1, allowing all loadings to be freely estimated, while the option missing = "ml" asks lavaan to handle the missing data using FIML. The function parameterEstimates extracts the results from the model fit, where the option fmi = TRUE requests FMI estimates from lavaan alongside the parameter estimates. The remove.nonfree = TRUE option omits parameters that are not freely estimated from the output; in this case, the latent factor variances are not printed in the output table, as they were fixed to 1. By default, lavaan uses Hessian-based numeric estimation of the observed information matrix, which yields $\hat{\delta}_1$ for the FMI estimates. To produce $\hat{\delta}_2$, the observed.information input must be specified to request the analytic approximation of the information.

By default, the analytic approximation is based on structured information. To obtain $\hat{\delta}_3$, the h1.information input must also be provided to request the unstructured information.
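The two analytic variants can be requested roughly as follows. This is a sketch under assumptions: the argument values `observed.information = "h1"` and `h1.information = "unstructured"` are lavaan options that we take to correspond to the two variants described in the text, and the seed, factor labels, and object names are illustrative.

```r
library(lavaan)

# Rebuild the example inputs (hypothetical seed and object names).
data("HolzingerSwineford1939", package = "lavaan")
HSMiss <- HolzingerSwineford1939[, paste0("x", 1:9)]
set.seed(1234)
for (v in names(HSMiss)) HSMiss[sample(nrow(HSMiss), 61), v] <- NA

HS.model <- "
  visual =~ x1 + x2 + x3
  verbal =~ x4 + x5 + x6
  speed  =~ x7 + x8 + x9
"

# Second estimate: analytic approximation, structured information (the default).
fit2 <- cfa(HS.model, data = HSMiss, std.lv = TRUE, missing = "ml",
            observed.information = "h1")

# Third estimate: analytic approximation, unstructured information.
fit3 <- cfa(HS.model, data = HSMiss, std.lv = TRUE, missing = "ml",
            observed.information = "h1", h1.information = "unstructured")

est2 <- parameterEstimates(fit2, fmi = TRUE, remove.nonfree = TRUE)
est3 <- parameterEstimates(fit3, fmi = TRUE, remove.nonfree = TRUE)

# Side-by-side comparison of the two analytic FMI estimates.
cbind(est2[, c("lhs", "op", "rhs")], fmi2 = est2$fmi, fmi3 = est3$fmi)
```

The point estimates themselves are the same across the fits; only the information matrix used for the standard errors and FMIs changes.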

The three FMI estimates provide largely similar values (see Table 6). They agree on which loading estimates have the highest FMIs, namely the loadings of $X_1$ ($\hat{\delta}_1 = 0.29$, $\hat{\delta}_2 = 0.27$, $\hat{\delta}_3 = 0.29$), $X_2$ ($\hat{\delta}_1 = 0.26$, $\hat{\delta}_2 = 0.25$, $\hat{\delta}_3 = 0.28$), $X_3$ ($\hat{\delta}_1 = 0.28$, $\hat{\delta}_2 = 0.28$, $\hat{\delta}_3 = 0.31$), and $X_7$ ($\hat{\delta}_1 = 0.27$, $\hat{\delta}_2 = 0.27$, $\hat{\delta}_3 = 0.28$). The three estimates also agree on which factor correlation has the highest FMI, namely the correlation between visual skill and mental speed ($\hat{\delta}_1 = 0.23$, $\hat{\delta}_2 = 0.24$, $\hat{\delta}_3 = 0.29$). The FMI of this factor correlation also shows the largest difference among the FMI estimates, ranging between 0.23 and 0.29. The three estimates disagree on which factor correlation has the lowest FMI, but the differences are small: for the correlation between visual and verbal skills, $\hat{\delta}_1 = 0.18$, $\hat{\delta}_2 = 0.15$, $\hat{\delta}_3 = 0.18$; for the correlation between verbal skill and mental speed, $\hat{\delta}_1 = 0.18$, $\hat{\delta}_2 = 0.18$, $\hat{\delta}_3 = 0.20$.

Table 6

Results from the example analysis.

Parameter Variables $\hat{\delta}_1$ $\hat{\delta}_2$ $\hat{\delta}_3$
Loading Visual, X1 0.29 0.27 0.29
Loading Visual, X2 0.26 0.25 0.28
Loading Visual, X3 0.28 0.28 0.31
Loading Verbal, X4 0.24 0.22 0.24
Loading Verbal, X5 0.17 0.17 0.19
Loading Verbal, X6 0.19 0.18 0.19
Loading Speed, X7 0.27 0.27 0.28
Loading Speed, X8 0.23 0.24 0.26
Loading Speed, X9 0.19 0.22 0.22
Factor correlation Visual, verbal 0.18 0.15 0.18
Factor correlation Visual, speed 0.23 0.24 0.29
Factor correlation Verbal, speed 0.18 0.18 0.20

Overall, the highest FMI estimate in the model is 0.31. The associated WIF is $1/\sqrt{1 - 0.31} \approx 1.2$, which indicates an estimated 20% increase in the width of the confidence interval due to missing data. Reporting these FMIs alongside analyses of empirical data will give readers a better sense of how much the presence of missing data has affected the efficiency of the parameter estimates, especially when comparing across studies where FMIs may differ even under the same missing rates. The full R code of this example is provided in Appendix A, and a summary of the lavaan options for the three estimates is given in Supplementary Table 1.
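The width inflation factor arithmetic above can be checked with a short R sketch. The helper name `wif` is hypothetical; the formula is the one implied by the reported value of 1.2 for an FMI of 0.31.

```r
# Width inflation factor (WIF) implied by a fraction of missing information (FMI).
wif <- function(fmi) 1 / sqrt(1 - fmi)

fmi_max <- 0.31          # highest FMI estimate reported in Table 6
round(wif(fmi_max), 2)   # 1.2, i.e., about a 20% wider confidence interval

# The same helper applied to the three factor correlation FMIs (third estimate).
round(wif(c(0.18, 0.29, 0.20)), 2)
```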

4. Discussion

The current simulation study suggests that a relatively large sample size may be necessary for the estimation of the FMIs. Sample FMI estimates were largely unbiased, even in very small samples with N = 50, which are typical of regression analyses. However, at such a small sample size, the estimates were imprecise, varying greatly from sample to sample, especially when the missing rate was high. When the missing rates were reasonably low ($\pi_{mis}$ = 0.2, 0.4; corresponding to 10–20% overall missing rates), sample sizes of several hundred, which are typical in structural equation models, were able to produce reasonably efficient estimates. However, at an overall missing rate of 30%, sample sizes exceeding 1,000 would be required to produce precise estimates. It is worth noting that we used very strong selection mechanisms in the MAR-L and MAR-NL conditions to contrast the results with those from MCAR. In applied settings, the MAR selection mechanism would typically be weaker, which would lead to sample FMI estimates that perform closer to the better performance we saw in the MCAR conditions.

The three estimates are identical for saturated models, such as regression. However, the choice makes a difference when the model is not saturated. In the two-factor model, $\hat{\delta}_{1,j}$, the estimate via the numeric Hessian, showed a distinct disadvantage, as it was more likely to break down when the sample size was small or when the missing rate was high. In contrast, $\hat{\delta}_{3,j}$, the analytic estimate based on the unstructured model, was much less likely to break down in all cases, and was more precise than $\hat{\delta}_{2,j}$. Although $\hat{\delta}_{3,j}$ occasionally showed a slightly higher bias than $\hat{\delta}_{2,j}$, which was based on the structured model, its performance was more favorable overall in these simulations.

The simulation study investigated FMI in FIML, but the results should generalize to FMI computed from MI with a large number of imputations. In MI, the FMI is conceptually given by the ratio of the between-imputation variance over the sum of the within- and between-imputation variances. As the number of imputations approaches infinity, this ratio becomes equivalent to the ratio of the variance increase due to missing data over the variance in the observed data as estimated from FIML. For simulation studies, FMI can be more computationally expensive in MI, as the estimate is produced in the final pooling stage of the analysis, and often requires a large number of imputations (more than 100) to achieve an acceptable level of accuracy (Harel, 2007). For substantive research, the researcher may simply choose between FIML and MI as the estimation method of FMI based on the missing data technique they are already using to produce the estimates of the model parameters.
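The MI ratio described above can be written out as a short worked expression. The symbols are assumptions introduced here for illustration: $M$ is the number of imputations, $W$ the within-imputation variance, and $B$ the between-imputation variance of a given parameter estimate. The finite-$M$ form follows the usual pooling rules for multiple imputation and omits any small-sample degrees-of-freedom adjustment.

```latex
% Total variance of a pooled estimate under the usual MI pooling rules
T = W + \left(1 + \frac{1}{M}\right) B

% Proportion of the total variance attributable to missing data (finite M)
\lambda_M = \frac{\left(1 + \frac{1}{M}\right) B}{W + \left(1 + \frac{1}{M}\right) B}

% Limit as the number of imputations grows: between over within-plus-between
\lim_{M \to \infty} \lambda_M = \frac{B}{W + B}
```

The limiting expression is the conceptual ratio referred to in the text, which is why a large number of imputations is needed before the MI-based value stabilizes.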

As far as we are aware, this study is the first to look at the properties of sample FMIs computed using FIML. As such, we focused on two relatively simple and commonly used models, with three missing data mechanisms selected to contrast the impact of the specific mechanism on the FMI values and to stress that these values are not the same as the rates of missing data. Future research may wish to expand on the study conditions, for example, by controlling for the number of missing data patterns, or by examining how changing the values of parameters in the model (such as the regression coefficient) would change the properties of the FMI estimates. It would also be worthwhile to investigate the relative performance of the three FMI estimates under incorrect models. When the model is wrong, the Hessian-based estimate, $\hat{\delta}_{1,j}$, is theoretically superior, as it is the only consistent estimate. However, whether this theoretical advantage would translate into a practical advantage needs to be examined in simulation studies. It would also be helpful to develop bootstrap SEs and CIs for the sample FMIs, so that researchers would have a better sense of the precision of the FMI estimates in their particular sample.

While our focus was on evaluating how well the sample FMI estimates served as estimates of the corresponding population FMIs, the properties of the population FMIs themselves may be of interest, and are a topic we are exploring in other work. In this ongoing work, we are finding that information loss can occur in unintuitive and unpredictable ways, and patterns in population FMIs observed in one context do not always generalize to other contexts. For example, based on the population-level FMI values we obtained for the conditions in this study, one may be tempted to conclude that the population FMIs of factor correlations are, in general, insensitive to missing data mechanisms (e.g., the middle rows of Table 1). However, this was not always the case. In conditions not reported here, when the indicators of $F_1$ were all completely observed, and the indicators of $F_2$ contained missingness conditioned on the indicators of $F_1$, the FMI of the factor correlation became more sensitive to the missing data mechanism. Although the population quantities estimated by the sample FMIs may exhibit different patterns, we do not expect the sample FMI estimates themselves to show drastically different properties in these alternative scenarios.

The FMI estimates via FIML are available in recent releases of lavaan (version 0.6-7 and up), and example R code for how to retrieve them is given in Appendix A. We recommend that empirical researchers routinely examine and report FMIs for key parameters in substantive analyses. FMIs capture the complex interplay between numerous factors such as missing rates, missing data mechanisms, and model parameters, sometimes in unintuitive ways. They can provide critical additional insights into how the standard errors, confidence intervals, and hypothesis tests may have been affected by the presence of missing data. For methodologists conducting simulation studies with missing data, the population FMIs in the different study design conditions should be computed and reported, in order to provide better context for the performance of the missing data techniques being studied.

Data Availability Statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://osf.io/xyzt8/ .

Author Contributions

VS developed the equations, statistical properties, and details of implementation for FMI estimates. LC designed and carried out the simulation study and wrote the initial draft of the manuscript. All authors contributed to the revisions of the manuscript.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

1 As the sample size approaches infinity, all parameter estimates approach their true values, and therefore all standard errors approach 0. Multiplying the standard errors by $\sqrt{n}$ makes them converge to nonzero numbers.

2 See Appendix A for how to request FMI estimates in lavaan. We would like to thank Yves Rosseel, the developer of lavaan, for implementing these variations in the computation of FMI, permitting us to carry out this study.

Funding. This research was supported by grant RGPIN-2015-05251 from the Natural Sciences and Engineering Research Council of Canada (NSERC) to VS.

References

  • Allison P. D. (1987). Estimation of linear models with incomplete data. Sociol. Methodol. 17, 71–103. doi: 10.2307/271029
  • Arbuckle J. L. (1996). Full Information Estimation in the Presence of Incomplete Data. Mahwah, NJ: Lawrence Erlbaum Associates.
  • Chen L., Savalei V., Rhemtulla M. (2020). Two-stage maximum likelihood approach for item-level missing data in regression. Behav. Res. Methods 52, 2306–2323. doi: 10.3758/s13428-020-01355-x
  • Collins L. M., Schafer J. L., Kam C.-M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol. Methods 6, 330–351. doi: 10.1037/1082-989X.6.4.330
  • Graham J. W., Taylor B. J., Olchowski A. E., Cumsille P. E. (2006). Planned missing data designs in psychological research. Psychol. Methods 11, 323–343. doi: 10.1037/1082-989X.11.4.323
  • Harel O. (2007). Inferences on missing information under multiple imputation and two-stage multiple imputation. Stat. Methodol. 4, 75–89. doi: 10.1016/j.stamet.2006.03.002
  • Holzinger K. J., Swineford F. (1939). A study in factor analysis: the stability of a bi-factor solution. Suppl. Educ. Monogr. 48:xi + 91.
  • Little R. J. A., Rubin D. B. (1987). Statistical Analysis With Missing Data. New York, NY: Wiley.
  • Little R. J. A., Rubin D. B. (2002). Statistical Analysis With Missing Data, 2nd Edn. Wiley series in probability and statistics. Hoboken, NJ: Wiley. doi: 10.1002/9781119013563
  • Muthén B., Kaplan D., Hollis M. (1987). On structural equation modeling with data that are not missing completely at random. Psychometrika 52, 431–462. doi: 10.1007/BF02294365
  • Orchard T., Woodbury M. A. (1972). “A missing information principle: theory and applications,” in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics (Berkeley, CA: University of California Press), 697–715. doi: 10.1525/9780520325883-036
  • R Core Team (2019). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.
  • Rao C. R. (1973). Linear Statistical Inference and Its Applications. New York, NY: John Wiley & Sons. doi: 10.1002/9780470316436
  • Rosseel Y. (2012). lavaan: An R package for structural equation modeling. J. Stat. Softw. 48, 1–36. doi: 10.18637/jss.v048.i02
  • Rubin D. B. (1976). Inference and missing data. Biometrika 63, 581–592. doi: 10.1093/biomet/63.3.581
  • Rubin D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York, NY: John Wiley & Sons. doi: 10.1002/9780470316696
  • Savalei V. (2010). Expected versus observed information in SEM with incomplete normal and nonnormal data. Psychol. Methods 15, 352–367. doi: 10.1037/a0020143
  • Savalei V., Bentler P. M. (2005). A statistically justified pairwise ML method for incomplete nonnormal data: a comparison with direct ML and pairwise ADF. Struct. Equat. Model. Multidisc. J. 12, 183–214. doi: 10.1207/s15328007sem1202_1
  • Savalei V., Falk C. F. (2014). Robust two-stage approach outperforms robust full information maximum likelihood with incomplete nonnormal data. Struct. Equat. Model. Multidisc. J. 21, 280–302. doi: 10.1080/10705511.2014.882692
  • Savalei V., Rhemtulla M. (2012). On obtaining estimates of the fraction of missing information from full information maximum likelihood. Struct. Equat. Model. Multidisc. J. 19, 477–494. doi: 10.1080/10705511.2012.687669
  • Savalei V., Rhemtulla M. (2017). Normal theory two-stage ML estimator when data are missing at the item level. J. Educ. Behav. Stat. 42, 405–431. doi: 10.3102/1076998617694880
  • Savalei V., Rosseel Y. (in press). Computational options for standard errors test statistics with incomplete nonnormal data. Struct. Equat. Model.
  • Sullivan T. R., White I. R., Salter A. B., Ryan P., Lee K. J. (2018). Should multiple imputation be the method of choice for handling missing data in randomized trials? Stat. Methods Med. Res. 27, 2610–2626. doi: 10.1177/0962280216683570
  • van Buuren S., Groothuis-Oudshoorn K. (2011). mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 45, 1–67. doi: 10.18637/jss.v045.i03
  • Yoo J. E. (2009). The effect of auxiliary variables and multiple imputation on parameter estimation in confirmatory factor analysis. Educ. Psychol. Measure. 69, 929–947. doi: 10.1177/0013164409332225
  • Yucel R. M., He Y., Zaslavsky A. M. (2011). Gaussian-based routines to impute categorical variables in health surveys. Stat. Med. 30, 3447–3460. doi: 10.1002/sim.4355

Articles from Frontiers in Psychology are provided here courtesy of Frontiers Media SA