The Generalized Multivariate Difference estimator (GMDe) is a simple, robust, model-based method for high-dimensional complex sample surveys. GMDe improves population estimates for the study variables, which serve the analysis objectives, by using auxiliary variables that are well‑correlated with those study variables. GMDe is a multivariate generalization of the long‑neglected class of difference estimators. It is a pioneering, multivariate alternative to the conventional mixed estimator, calibration estimator, and regression estimators such as the generalized regression estimator (GREG).
Auxiliary statistics can be known constants, e.g., population totals from administrative records, full‑coverage remotely sensed data, and transactional datasets commonly referred to as "big data." In addition, auxiliary statistics may be population estimates that include random sampling or prediction errors. GMDe can process large vectors of population estimates, where sub‑partitions can represent different cells and margins in multiple statistical tables, different time‑periods, different domains and sub‑domains, and different small areas.
GMDe is a hybrid estimator; it splits the estimation method into an initial design-based component, followed by a model‑based component. This simplifies the estimator for complex surveys.
GMDe begins with a probability sample and a design‑based estimator, e.g., the multivariate Horvitz-Thompson estimator. The sample can be a simple two‑phase sample, or a time‑series of multi‑stage samples for interpenetrating panels and post‑stratification with supplemental sampling frames. The estimator produces a single vector of population estimates, and its covariance matrix, for the M study variables and J auxiliary residuals for each phase and/or stage in the sample. This is the design-based component.
GMDe follows with the model‑based component, which is a simple linear function. It computes the product of an M×J coefficient matrix of known constants, i.e., "the model," and the J×1 vector partition for the design-based population estimates of the J auxiliary residuals. That matrix product is an M×1 vector of "adjustments" to the M×1 vector of design-based population estimates for the M study variables.
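The linear function described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the names y_hat, r_hat and B, the example numbers, and the convention that the adjustment is added to the design-based estimates are hypothetical and are not taken from the GMDe book.

```python
import numpy as np

def apply_difference_adjustment(y_hat, r_hat, B):
    """Add the M x 1 adjustment B @ r_hat to the design-based estimates y_hat.

    y_hat : M design-based estimates for the study variables
    r_hat : J design-based estimates for the auxiliary residuals
    B     : M x J coefficient matrix of known constants ("the model")
    The sign convention (adjustment is added) is an assumption of this sketch.
    """
    y_hat = np.asarray(y_hat, dtype=float)
    r_hat = np.asarray(r_hat, dtype=float)
    B = np.asarray(B, dtype=float)
    if B.shape != (y_hat.size, r_hat.size):
        raise ValueError("B must have shape (M, J)")
    adjustment = B @ r_hat            # M x 1 vector of adjustments
    return y_hat + adjustment, adjustment

# Hypothetical numbers: M = 2 study variables, J = 3 auxiliary residuals.
y_hat = np.array([100.0, 40.0])
r_hat = np.array([2.0, -1.0, 0.5])
B = np.array([[-0.5, 0.0, 1.0],
              [0.0, 2.0, 0.0]])
y_adj, adjustment = apply_difference_adjustment(y_hat, r_hat, B)
```

Each row of B holds the coefficients for one study variable, so a zero entry means that auxiliary residual does not adjust that study variable.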
GMDe uses the multivariate minimum variance criterion, and the M×J population correlation matrix between the M study variables and the J auxiliary variables from the design-based component, to parameterize the model's "optimal" M×J coefficient matrix. Therefore, there is an optimal scalar coefficient specific to each study variable for each auxiliary variable. The degree of variance reduction depends upon the strength of the correlations between the study variables and the auxiliary variables. GMDe estimates are conditional on the sample; a different realization of the sample would produce a different model coefficient matrix. Bootstrap methods can more fully consider the stochastic effects from all random errors.
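As a rough illustration of the minimum variance criterion, the sketch below computes a coefficient matrix from a partitioned covariance matrix using the textbook result B = -V_yr V_rr^{-1}. It assumes the design-based covariance matrix is ordered with the M study variables first and the J auxiliary residuals last. The text describes a correlation matrix; the sketch works with covariances, which coincide with correlations only when all variables are standardized. The function name, the ordering, and the numbers are hypothetical.

```python
import numpy as np

def min_variance_coefficients(V, M):
    """Return (B, V_adj) for a covariance matrix V partitioned as
    [[V_yy, V_yr], [V_ry, V_rr]] with the M study variables first.

    B     = -V_yr @ inv(V_rr)   (M x J minimum variance coefficients)
    V_adj =  V_yy + B @ V_ry    (M x M covariance after adjustment)
    Assumes V_rr is full rank; the partition layout is an assumption.
    """
    V = np.asarray(V, dtype=float)
    V_yy, V_yr, V_rr = V[:M, :M], V[:M, M:], V[M:, M:]
    # Solve V_rr X = V_ry instead of forming the inverse explicitly.
    B = -np.linalg.solve(V_rr, V_yr.T).T
    V_adj = V_yy + B @ V_yr.T
    return B, V_adj

# Hypothetical example: M = 1 study variable, J = 2 auxiliary residuals.
V = np.array([[4.0, 1.2, 0.8],
              [1.2, 1.0, 0.2],
              [0.8, 0.2, 1.0]])
B, V_adj = min_variance_coefficients(V, M=1)
```

In this example the adjusted variance V_adj[0, 0] is smaller than the original variance V[0, 0], and the reduction grows with the strength of the covariances in the first row.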
There is a fundamental problem with GMDe. To parameterize the optimal M×J coefficient matrix, GMDe must invert the J×J partition of the design-based correlation matrix for the J auxiliary residuals. But that J×J partition can be rank‑deficient, which makes the matrix inverse infeasible.
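A small, hypothetical example shows how rank deficiency can arise. Suppose two auxiliary residuals are table cells and a third is their margin, so the third is the exact sum of the first two. The numbers below are invented for illustration only.

```python
import numpy as np

# Covariance of two "cell" residuals (hypothetical values).
V_cells = np.array([[1.0, 0.3],
                    [0.3, 2.0]])

# Map the two cells to three residuals: cell 1, cell 2, and their margin.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# 3 x 3 covariance partition for the auxiliary residuals.
V_rr = A @ V_cells @ A.T

# Three residuals but only two independent sources of variation.
rank = np.linalg.matrix_rank(V_rr)
```

Here rank is 2 although V_rr is 3×3, so the ordinary matrix inverse is not defined for this partition.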
There is a simple solution. The recursive version of GMDe, known as "rGMDe," is always feasible, regardless of the dimensions, condition, or rank of the correlation matrix. Rather than invert a J×J matrix, rGMDe uses a recursive sequence of J scalar inversions. Within each recursion, rGMDe employs orthogonalization and stepwise selection of the auxiliary residual that best improves population estimates for the study variables.
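The idea of replacing one J×J inversion with a sequence of scalar inversions can be sketched as follows. This is a minimal illustration and not the rGMDe algorithm from the book: it assumes a covariance matrix ordered with the M study variables first, selects at each step the residual with the largest total variance reduction for the study variables, and skips residuals whose remaining variance falls below a hypothetical tolerance tol.

```python
import numpy as np

def recursive_difference_estimate(z, V, M, tol=1e-10):
    """Sequentially sweep auxiliary residuals out of (z, V).

    z : estimates, M study variables followed by J auxiliary residuals
    V : covariance matrix of z
    Each step divides by one scalar variance V[j, j]; no matrix is inverted.
    """
    z = np.array(z, dtype=float)
    V = np.array(V, dtype=float)
    remaining = list(range(M, z.size))
    selected = []
    while remaining:
        # Variance reduction each usable candidate would deliver.
        gains = {j: np.sum(V[:M, j] ** 2) / V[j, j]
                 for j in remaining if V[j, j] > tol}
        if not gains:          # only redundant residuals are left
            break
        j = max(gains, key=gains.get)
        k = V[:, j] / V[j, j]  # the only inversion: a scalar
        z = z - k * z[j]       # adjust estimates; z[j] becomes 0
        V = V - np.outer(k, V[j, :])  # orthogonalize everything against j
        remaining.remove(j)
        selected.append(j - M)
    return z[:M], V[:M, :M], selected

# Hypothetical example: the third residual is the sum of the first two,
# so the 3 x 3 residual partition of V is rank-deficient.
base = np.array([[4.0, 1.2, 0.8],
                 [1.2, 1.0, 0.2],
                 [0.8, 0.2, 1.0]])
T = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
V = T @ base @ T.T
z = T @ np.array([10.0, 1.0, -2.0])
y_adj, V_adj, selected = recursive_difference_estimate(z, V, M=1)
```

After two residuals are swept, the remaining one has essentially zero variance and is skipped, so the recursion finishes even though the residual partition cannot be inverted. An absolute tolerance is a simplification; a production routine would likely scale it to the data.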
Within each recursion, rGMDe censors weak correlations between the study variables and the auxiliary residual to reduce risks from spurious correlations, resulting in a sparse M×J model coefficient matrix. rGMDe modifies the model's minimum variance coefficients to mitigate the influence of suspected outliers, impose inequality constraints, and assure a "coherent" M×M covariance matrix for the final M×1 vector of rGMDe estimates. Minor modifications to the "optimal" model coefficients minimally sacrifice statistical efficiency, but the gain in robustness is substantial.
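Censoring can be illustrated with a simple threshold rule. The sketch below is an assumption, not the rule used by rGMDe: it sets a coefficient to zero whenever the absolute correlation between a study variable and the selected auxiliary residual is below a hypothetical cutoff min_abs_corr.

```python
import numpy as np

def censored_coefficients(V, M, j, min_abs_corr=0.3):
    """Scalar minimum variance coefficients for residual j, with
    coefficients zeroed where the correlation is weak.

    V : covariance matrix, M study variables first
    j : index (in V) of the auxiliary residual used in this recursion
    min_abs_corr : hypothetical cutoff, chosen only for illustration
    """
    V = np.asarray(V, dtype=float)
    sd = np.sqrt(np.diag(V))
    corr = V[:M, j] / (sd[:M] * sd[j])   # correlations with residual j
    coef = -V[:M, j] / V[j, j]           # uncensored coefficients
    keep = np.abs(corr) >= min_abs_corr
    return np.where(keep, coef, 0.0), corr

# Hypothetical example: two study variables and one auxiliary residual.
V = np.array([[4.0, 0.2, 1.2],
              [0.2, 1.0, 0.05],
              [1.2, 0.05, 1.0]])
coef, corr = censored_coefficients(V, M=2, j=2)
```

The first study variable has a correlation of 0.6 with the residual and keeps its coefficient; the second has a correlation of 0.05 and is censored, which is how zeros accumulate in the coefficient matrix.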
rGMDe can impose equality constraints that assure "additivity" in statistical tables; they also assure that the sums of small‑area estimates agree with larger‑area estimates in a hierarchy of geospatial tessellations. Equality constraints improve accuracy for all estimates, much like a calibration estimator. The final M×1 vector of rGMDe population estimates, and its M×M covariance matrix, support post‑rGMDe ratio and product estimators. A SQL routine could implement the rGMDe algorithm within a database environment, which would be useful in institutional surveys.
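One standard way to impose linear equality constraints on a vector of estimates is to treat the constraint violation as a residual with expected value zero and adjust by minimum variance. The sketch below uses that generic restricted-estimator formula; the text does not say how rGMDe imposes its constraints, so the function name, the constraint form A y = c, and the numbers are assumptions.

```python
import numpy as np

def impose_equality_constraints(y, V, A, c):
    """Adjust estimates y (covariance V) so that A @ y_c equals c.

    y_c = y - K @ (A @ y - c),  K = V A' (A V A')^{-1}
    V_c = V - K @ A @ V
    Assumes A V A' is full rank (no redundant constraints).
    """
    y = np.asarray(y, dtype=float)
    V = np.asarray(V, dtype=float)
    A = np.atleast_2d(np.asarray(A, dtype=float))
    c = np.atleast_1d(np.asarray(c, dtype=float))
    d = A @ y - c                      # constraint violations
    VA = V @ A.T
    S = A @ VA                         # covariance of the violations
    K = np.linalg.solve(S, VA.T).T     # V A' S^{-1}, S is symmetric
    return y - K @ d, V - K @ VA.T

# Hypothetical table: two cells and their margin, which should add up.
y = np.array([60.0, 45.0, 100.0])
V = np.diag([4.0, 4.0, 2.0])
A = np.array([[1.0, 1.0, -1.0]])      # cell1 + cell2 - margin = 0
y_c, V_c = impose_equality_constraints(y, V, A, c=0.0)
```

The constrained estimates are 58, 43 and 101, so the cells now sum to the margin, and every diagonal element of V_c is smaller than the corresponding element of V.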
The accuracy of model‑based rGMDe estimates strongly depends on the M×J matrix of model coefficients. That matrix, in turn, depends on the initial design‑based estimate for the M×J population correlation matrix between the M study variables and the J auxiliary variables. Accurate estimates for strong correlations require measurements of each study variable matched to each auxiliary variable within domains, but the number of sampling units in each domain can be inadequate.
Exogenous sources can improve the accuracy of the estimated correlations. Consider this example. Each panel in a longitudinal survey uses a multi‑stage sampling design. Each panel produces a separate estimate of the M×J population correlation matrix. A Bayesian GMDe would combine estimates of the correlation matrix from multiple panels into a single composite estimate; this approach assumes the correlation matrix is time‑invariant. The success of GMDe depends on the statistician's innovation and skill in identifying and accurately estimating strong correlations between the study variables and auxiliary variables.
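As a simple, non-Bayesian stand-in for such a composite, the sketch below pools panel-level correlation matrices with a weighted average on Fisher's z scale, using weights of n - 3. This is a swapped-in textbook technique, not the Bayesian GMDe: it assumes a time-invariant correlation matrix, independent panels, and that an effective sample size is available for each panel. All names and numbers are hypothetical.

```python
import numpy as np

def pool_correlations(corr_matrices, effective_sizes):
    """Pool P estimates of an M x J correlation matrix.

    Averages on Fisher's z scale (arctanh) with weights n - 3, then
    transforms back with tanh. Requires every |correlation| < 1 and
    every effective sample size > 3.
    """
    R = np.stack([np.asarray(r, dtype=float) for r in corr_matrices])
    w = np.asarray(effective_sizes, dtype=float) - 3.0
    if np.any(w <= 0):
        raise ValueError("effective sample sizes must exceed 3")
    z = np.arctanh(R)                              # P x M x J
    z_bar = np.tensordot(w, z, axes=1) / w.sum()   # weighted mean, M x J
    return np.tanh(z_bar)

# Hypothetical example: two panels, M = 1 study variable, J = 2 auxiliaries.
panel_1 = np.array([[0.6, 0.2]])
panel_2 = np.array([[0.7, 0.1]])
pooled = pool_correlations([panel_1, panel_2], effective_sizes=[50, 100])
```

Each pooled correlation lies between the two panel estimates and sits closer to the panel with the larger effective sample size. With a complex multi-stage design, the effective sample size would have to reflect the design effect.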
Auxiliary residuals in a longitudinal survey can be differences between design‑based population estimates for the current panel and predictions of those current estimates from a deterministic model for population dynamics, where previously measured panels provide the initial conditions for the deterministic model. Time‑series analyses of residuals, i.e., observed vs. expected population estimates, can identify unexpected trends in the population, apply model‑based inference, and help improve the population model, all of which helps the analyst better understand the population.
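A minimal sketch of such residuals follows, assuming the deterministic model is a linear transition matrix G applied to the previous panel's estimates and that the two panels are independent samples. Both assumptions, along with the names and numbers, are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def panel_residual(x_curr, V_curr, x_prev, V_prev, G):
    """Observed minus expected population estimates for the current panel.

    expected = G @ x_prev          (deterministic model prediction)
    residual = x_curr - expected
    V_resid  = V_curr + G @ V_prev @ G.T   (valid if panels are independent)
    """
    x_curr = np.asarray(x_curr, dtype=float)
    x_prev = np.asarray(x_prev, dtype=float)
    G = np.asarray(G, dtype=float)
    expected = G @ x_prev
    residual = x_curr - expected
    V_resid = np.asarray(V_curr, dtype=float) + G @ np.asarray(V_prev, dtype=float) @ G.T
    return residual, V_resid

# Hypothetical model: 10% of class 1 moves to class 2 between panels.
G = np.array([[0.9, 0.0],
              [0.1, 1.0]])
x_prev = np.array([100.0, 50.0])
V_prev = np.diag([2.0, 2.0])
x_curr = np.array([88.0, 63.0])
V_curr = np.diag([4.0, 4.0])
residual, V_resid = panel_residual(x_curr, V_curr, x_prev, V_prev, G)
```

The model predicts 90 and 60, so the residuals are -2 and 3. Tracking such residuals over successive panels is one way to look for the unexpected trends mentioned above.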
GMDe has the efficiency of the model-based method and the robustness of the design‑based method. GMDe simplifies a complex sample survey by splitting it into its design‑based and model‑based components. GMDe uses a simple linear function of the design‑based estimate for the population correlation matrix, and the design‑based estimates for the vector of auxiliary residuals, to adjust the design‑based estimate for the vector of study variables and its covariance matrix.
rGMDe uses inequality constraints and analyses of residuals to enhance robustness and reduce risks from spurious correlations and an inaccurate model. rGMDe is numerically dependable with high dimensions.
The analyst can use GMDe to help improve knowledge through deterministic models and model‑based inference. GMDe is adaptable to changes over time in objectives and technologies within a long‑term, institutional, sample survey program.
However, GMDe is a nascent method, and the assertions here have not been rigorously challenged. The author seeks a research partnership that compares GMDe to conventional methods for a complex sample survey design.
The 244-page book "Generalized Multivariate Difference Estimator (GMDe): The Recursive Algorithm (rGMDe)" is available at Amazon for US$20 (ISBN-13: 979-8874319038).
Copyright © 2024 Environmetrika - All Rights Reserved.