Abstract
Small area estimation in sample surveys aims to produce reliable estimates of subpopulation characteristics for domains with very small or even zero sample sizes, together with an assessment of their accuracy. Small area estimation methods not only enable the testing and adjustment of census counts, but also play an irreplaceable role in planning governmental programs in areas such as education, health, and the environment. Obtaining accurate small area estimates is essential but not straightforward because of the limited sample size: traditional inference methods yield large standard errors, which makes the estimates unstable and unreliable. Collecting additional samples requires more human, material, and financial resources and also defeats the purpose of a sample survey. Solving the small area estimation problem therefore requires improving the estimation method at the inference stage. The problem can be regarded as an extension of traditional statistical inference theory. The existing literature on small area estimation mainly focuses on leveraging auxiliary information and organizes its frameworks by the characteristics of the estimated quantities or of the estimation methods. In this study, we propose a novel approach built on error correction, arguing that the core difficulty of small area estimation lies in correcting the error. We develop a framework for small area estimation based on error correction and study three typical scenarios that arise in practice: biased auxiliary variables, biased target variables, and biased samples. A corresponding small area estimation model is constructed for each of these error sources.
Beyond the theoretical study, the proposed approach is validated through simulation studies and empirical analyses, demonstrating its viability and feasibility for advancing small area estimation theory. The main research contents and innovations of this study are as follows:
Chapter 2 constructs a framework of small area estimation based on error correction. First, the fundamental theoretical approaches to small area estimation are summarized. To obtain accurate small area estimates, more attention must be paid to the construction of, and inference for, the small area estimation model. The small area estimation problem is then interpreted from the perspective of error, and a small area estimation framework based on error correction is proposed. We highlight that a small sample size is not in itself the root cause of the difficulty in small area estimation; rather, the resulting larger standard error is the key factor to be controlled. Thus, we hypothesize that error correction is the vital element in the small area estimation problem. This chapter groups the main sources of error in small area estimation into three categories: errors in data information, errors in sample information extraction, and unrepresentative samples. In this vein, we introduce the correction idea, parameter estimation, model selection, and testing for each source of error, making small area estimation methods more interpretable and adaptable across a wide range of real-world application scenarios. This chapter also lays the foundation for the overall research design of this study. The subsequent chapters address the three error sources in turn, constructing small area estimation models that correct each of them.
Chapter 3 discusses small area estimation methods based on biased auxiliary variables. This chapter focuses on the first source of error, i.e., errors in the data information, and in particular non-negligible errors in the auxiliary variables. Because small area samples carry limited information, the quality of the auxiliary variables is crucial to the quality of the estimates. In real sample surveys, however, data often contain non-negligible errors, and using them directly may amplify those errors and lead to erroneous conclusions. This chapter treats continuous and discrete auxiliary variables with measurement errors separately, establishes a measurement error model to correct the errors in the auxiliary variables, and estimates their bias mechanisms in a Bayesian framework, yielding a unit-level hierarchical Bayesian small area estimation model. In particular, because data collected in actual sampling are often discrete and demand greater model stability, this chapter verifies the validity of the Bayesian algorithm under a noninformative prior through rigorous theoretical derivation, which refines the study of small area estimation with biased auxiliary variables and broadens the applicability of the method. The simulation and empirical results show that correcting for the measurement errors of the auxiliary variables is necessary; otherwise, there is a risk of reduced inference accuracy or even invalid conclusions.
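To illustrate why an error-prone auxiliary variable needs correcting, the following is a minimal sketch of the classical attenuation effect and a simple method-of-moments correction. It is not the hierarchical Bayesian model developed in Chapter 3: it assumes a single continuous covariate, additive measurement error, and a known error variance `sigma_u2`, and all names and simulated values are hypothetical.

```python
import random


def sample_variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def corrected_slope(w, y, sigma_u2):
    """Divide the naive slope by the estimated reliability ratio.

    Assumes w = x + u with u independent of x and y, and a KNOWN
    measurement error variance sigma_u2 (an assumption of this sketch).
    """
    var_w = sample_variance(w)
    reliability = (var_w - sigma_u2) / var_w  # estimates var(x) / var(w)
    if reliability <= 0:
        raise ValueError("sigma_u2 is not smaller than the observed variance")
    return ols_slope(w, y) / reliability


# Simulated data (hypothetical values, fixed seed for reproducibility).
rng = random.Random(0)
n = 20000
true_beta = 2.0
sigma_u2 = 1.0
x = [rng.gauss(0.0, 1.0) for _ in range(n)]           # true covariate
w = [xi + rng.gauss(0.0, sigma_u2 ** 0.5) for xi in x]  # observed with error
y = [1.0 + true_beta * xi + rng.gauss(0.0, 0.5) for xi in x]

naive = ols_slope(w, y)                    # attenuated towards zero
corrected = corrected_slope(w, y, sigma_u2)
print(f"naive slope: {naive:.3f}, corrected slope: {corrected:.3f}")
```

With equal variances for the true covariate and the measurement error, the naive slope is pulled to roughly half of the true coefficient, while the corrected slope is close to it. A Bayesian treatment would instead place a prior on the error mechanism and propagate its uncertainty.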
Chapter 4 discusses small area estimation methods based on biased target variables. Building on the first source of error, this chapter concentrates on correcting the second source, i.e., errors in sample information extraction that leave the information in the sample underexploited. Besides external auxiliary information, the spatial hierarchy of the small areas themselves is an often overlooked factor in model construction, and it enters through the model assumptions on the target variables. In this chapter, a multilevel model is constructed by characterizing inter-domain heterogeneity and correlation, and the characteristics of a given small area are estimated with the help of samples from other areas. This fully exploits the hidden information in the samples and yields a more precise estimator in a bottom-up manner, thereby achieving error correction. In addition, to reflect common application scenarios in sample surveys, this chapter discusses not only the small area mean estimator but also the small area ratio estimator, which broadens the applicability of the model to the different estimators needed in practice. The simulation and empirical results show that, when auxiliary information is limited, it is necessary to dig deeper into the internal information of the sample to correct the error, which enhances the stability of the estimator and reduces the sensitivity of the model to outliers.
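The idea of estimating one area with the help of samples from other areas can be sketched with a textbook composite (shrinkage) estimator under a random-intercept model without covariates. This is only an illustration, not the multilevel model of Chapter 4: the between-area variance `sigma_v2` and within-area variance `sigma_e2` are assumed known, and the function name and example numbers are hypothetical.

```python
def shrinkage_estimates(area_means, area_sizes, sigma_v2, sigma_e2):
    """Composite estimates gamma_i * ybar_i + (1 - gamma_i) * mu_hat.

    gamma_i = sigma_v2 / (sigma_v2 + sigma_e2 / n_i) is the weight on the
    direct estimate; it approaches 1 as the area sample size n_i grows.
    Variance components are treated as known (an assumption of this sketch).
    """
    gammas = [sigma_v2 / (sigma_v2 + sigma_e2 / n) for n in area_sizes]
    # Precision-weighted overall mean: areas with larger samples count more.
    mu_hat = sum(g * m for g, m in zip(gammas, area_means)) / sum(gammas)
    estimates = [g * m + (1.0 - g) * mu_hat for g, m in zip(gammas, area_means)]
    return estimates, gammas, mu_hat


# Three hypothetical areas with very different sample sizes.
direct_means = [10.0, 20.0, 30.0]
sample_sizes = [2, 20, 200]
estimates, gammas, mu_hat = shrinkage_estimates(
    direct_means, sample_sizes, sigma_v2=4.0, sigma_e2=16.0
)
for mean, n, g, est in zip(direct_means, sample_sizes, gammas, estimates):
    print(f"n={n:3d}  direct={mean:5.1f}  weight={g:.3f}  composite={est:.3f}")
```

The area with only two observations is pulled strongly towards the overall mean, while the area with two hundred observations keeps almost all of its direct estimate. In practice the variance components would be estimated from the data, and a full multilevel model would also capture correlation between areas.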
Chapter 5 discusses small area estimation methods based on biased samples. Taking the second source of error into account, this chapter concentrates on correcting the third source, i.e., the case where the sample is not representative. For non-probability samples with coverage error and selection bias, whose structure deviates significantly from that of the population, the chosen correction is data integration between non-probability and probability samples. The representativeness of the probability sample can correct the coverage error and selection bias of the non-probability sample well, while the non-probability sample alleviates, to a certain extent, the problem of small sample sizes in small area estimation. The population distribution is modeled using a model-assisted approach: for the non-probability samples, a propensity score model is built to give consistent and asymptotically unbiased pseudo-weighted estimates; for the probability samples, a prediction model is built and used for imputation to obtain consistent, unbiased estimates. To improve the stability of the model and correct for the model assumption errors associated with the second source of error, a doubly robust model is constructed. This chapter gives a rigorous derivation of the properties and variance of the doubly robust estimator and extends it to the Bayesian framework to fully incorporate uncertainty, which provides new ideas and methods for the small area estimation problem and also opens up new paths for data integration problems. The simulation and empirical results show that the doubly robust model has higher stability and broader applicability, reducing the dependence on model assumptions.
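A common form of a doubly robust estimator of a population mean, combining a non-probability sample with a probability sample, can be sketched as follows. This is an assumption-laden illustration rather than the estimator derived in Chapter 5: the propensities and predictions are taken as already fitted, the normalizing totals are estimated from the weights, and the function name, argument names, and example values are all hypothetical.

```python
def doubly_robust_mean(y_np, pred_np, propensity_np, pred_p, design_weights_p):
    """Doubly robust estimate of a population mean.

    y_np, pred_np, propensity_np: outcomes, model predictions, and estimated
        inclusion propensities for units in the non-probability sample.
    pred_p, design_weights_p: model predictions and design weights for units
        in the probability sample (outcomes are not needed there).
    """
    n_hat_np = sum(1.0 / p for p in propensity_np)  # estimated population size
    n_hat_p = sum(design_weights_p)
    # Propensity-weighted mean of the prediction residuals.
    residual_term = sum(
        (y - m) / p for y, m, p in zip(y_np, pred_np, propensity_np)
    ) / n_hat_np
    # Design-weighted mean of the predictions in the probability sample.
    prediction_term = sum(
        d * m for d, m in zip(design_weights_p, pred_p)
    ) / n_hat_p
    return residual_term + prediction_term


y_np = [1.0, 2.0, 3.0]

# Case 1: perfect predictions, arbitrary (possibly wrong) propensities.
# The residual term vanishes, so only the prediction term remains.
dr_perfect_prediction = doubly_robust_mean(
    y_np, [1.0, 2.0, 3.0], [0.9, 0.1, 0.3], [1.0, 3.0], [10.0, 30.0]
)

# Case 2: uninformative predictions (all zero). The estimator reduces to
# the propensity-weighted (Hajek-type) mean of the non-probability sample.
dr_no_prediction = doubly_robust_mean(
    y_np, [0.0, 0.0, 0.0], [0.5, 0.25, 0.25], [0.0, 0.0], [5.0, 5.0]
)

print(dr_perfect_prediction, dr_no_prediction)
```

The two cases show the two protections separately: when the prediction model is right, misspecified propensities do no harm, and when the predictions carry no information, the estimate falls back on the propensity weights alone.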
| Date of Award | 24 Jul 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Chi Wing CHU (Supervisor), Yongjin JIN (External Supervisor) & Kwok Fai Geoffrey TSO (Supervisor) |
Keywords
- small area estimation
- error correction
- multilevel modeling
- data integration
- Bayesian approach