Bichteler A. 2016. Constrained multiple imputation: A case study in estimation and modeling on data missing below the limit of detection. Presented in the “Melding Dose-Response Relationships” session at the Society for Risk Analysis, December 14, San Diego, CA.
Abstract
High proportions of data missing below the limit of detection (LOD) pose significant challenges to estimating population blood serum levels of many dioxin-like compounds (DLCs). Single substitutions, e.g. LOD/√2 in NHANES, are known to underestimate variability and to bias mean and upper bound estimates when missingness is high. Modeled relations with other markers, e.g. glycohemoglobin, are likewise unreliable. Pooling blood serum in sets of 8 individuals, introduced in NHANES 2005-06, improved detection rates, but the resulting drop in LODs (by up to a factor of 10) further complicates matters: within-pool membership and weighting are unknown, so within-pool variability is also unknown, and plausible estimates of change over time become untenable. In this case study of DLCs in 4 biennials of NHANES (2001-2008), we suggest a multiple imputation by chained equations (MICE) approach to meeting these challenges. Multivariate chained Bayesian regression models, using other DLCs and related individual covariates (e.g. cholesterol, triglycerides) as independent variables and constraining the fitted value between 0 and the LOD, resulted in credible distributions of concentrations in NHANES 2001-04. Convergence was not problematic, and imputation regression diagnostics demonstrated consistency with measured values. Stable survey-weighted means, upper bounds, and the uncertainty around those values were estimated. Using the variance calculated from retrospectively pooled NHANES 2003-04 individuals, we imputed 8 lognormal deviates for each pooled sample in NHANES 2005-08. With 4 biennials of multiply imputed individual data, we could estimate trends in population levels over time, incorporating realistic variability while mitigating the plunge in LODs. These robust imputations also formed the basis for modeling diabetes risk (e.g. glycohemoglobin) with fully continuous DLCs, boosting power and precision while avoiding the pitfalls inherent in categorizing continuous variables.
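The two imputation ideas in the abstract can be illustrated with a minimal Python sketch. It is not the study's implementation: the chained Bayesian regression is replaced by a single univariate lognormal with given parameters, and the function names (`impute_below_lod`, `impute_pool_members`), the parameter values, and the assumption that a pooled measurement equals the equally weighted arithmetic mean of its 8 members are all hypothetical choices made for illustration.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for CDF and inverse CDF


def impute_below_lod(mu_log, sigma_log, lod, rng):
    """Draw one concentration from a lognormal constrained to (0, lod].

    mu_log and sigma_log are the mean and SD on the log scale. In a
    MICE-style model mu_log would be a regression prediction; here it is
    simply passed in (assumption for illustration).
    """
    # Probability mass of the untruncated lognormal lying below the LOD.
    p_below = _STD_NORMAL.cdf((math.log(lod) - mu_log) / sigma_log)
    # Uniform draw restricted to that mass; guard against an exact zero.
    u = max(rng.random() * p_below, 1e-300)
    # Inverse-CDF transform back to the concentration scale.
    value = math.exp(mu_log + sigma_log * _STD_NORMAL.inv_cdf(u))
    return min(value, lod)  # protect against floating-point overshoot


def impute_pool_members(pool_value, sigma_log, n_members, rng):
    """Draw n_members lognormal deviates whose arithmetic mean is pool_value.

    Assumes equal weighting within the pool (hypothetical; the abstract
    notes that true within-pool weighting is unknown).
    """
    raw = [rng.lognormvariate(0.0, sigma_log) for _ in range(n_members)]
    # Rescaling shifts the log-scale mean only, so the log-scale SD is kept.
    scale = pool_value * n_members / sum(raw)
    return [scale * x for x in raw]


if __name__ == "__main__":
    rng = random.Random(2016)
    # Five imputations for one non-detect with LOD = 1.0 (illustrative units).
    print([round(impute_below_lod(0.0, 1.0, 1.0, rng), 3) for _ in range(5)])
    # Eight individual-level values for one pooled measurement of 2.5.
    print([round(x, 3) for x in impute_pool_members(2.5, 0.8, 8, rng)])
```

Repeating the draws several times per non-detect gives multiple completed datasets, so that between-imputation variability can be carried into the survey-weighted estimates rather than being suppressed as it is under a single substitution.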