Semi-Supervised Statistical Learning Theory and Method for Mean Estimation
均值估計的半監督統計學習理論與方法
Student thesis: Doctoral Thesis
Author(s)
Related Research Unit(s)
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors | |
Award date | 13 Sept 2024 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(0684944e-f184-4a8c-b428-1ee1e9dc1d3a).html |
---|---|
Abstract
With the advancement of technology, acquiring large-scale datasets has become increasingly straightforward. However, in many real-world applications, obtaining a sufficiently large labeled dataset is time-consuming and costly, especially in highly specialized fields. Strategies are therefore needed that use all of the available data to improve model performance. Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data, reducing annotation costs while improving model performance, and it holds considerable potential in a wide range of domains. In this thesis, we propose a semi-supervised statistical learning method for estimating population means. Based on this method, we systematically study several key issues in semi-supervised learning.
Firstly, we propose a class of efficient and adaptive estimators via two-step semiparametric imputation under the SSL-MCAR assumption with fixed-dimensional covariates. The first step introduces an index model for dimension reduction, and the second step reweights the labeled and unlabeled data. We establish the asymptotic normality of the proposed estimators based on influence function expansions. Our theoretical analysis shows that the convergence rate of the semi-supervised estimator depends on the number of labeled samples, and that unlabeled samples can improve estimation efficiency. Under the MCAR assumption, we establish the semiparametric efficiency of mean estimation within a given model class, and we verify the efficiency gain over the supervised estimator both theoretically and experimentally. Furthermore, we introduce a variance estimation technique based on perturbation resampling.
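As a rough illustration of why unlabeled covariates can help under SSL-MCAR, the sketch below imputes Y with a working model fitted on the labeled data, averages the imputations over all covariates, and adds back the mean labeled residual. It is not the two-step index-model estimator described above: the linear working model, the function names `ss_mean` and `perturbation_se`, and the Exp(1) perturbation weights are assumptions made here for brevity.

```python
import numpy as np

def ss_mean(x_lab, y_lab, x_unlab, w_lab=None, w_unlab=None):
    """Semi-supervised mean of Y: imputation average plus labeled residual correction.

    x_lab, x_unlab are 2-D arrays (rows = samples); weights default to 1.
    """
    n, n_unlab = len(y_lab), len(x_unlab)
    w_lab = np.ones(n) if w_lab is None else w_lab
    w_unlab = np.ones(n_unlab) if w_unlab is None else w_unlab
    d_lab = np.column_stack([np.ones(n), x_lab])
    d_all = np.column_stack([np.ones(n + n_unlab), np.vstack([x_lab, x_unlab])])
    # Weighted least squares on labeled data (hypothetical linear working model).
    sw = np.sqrt(w_lab)
    beta, *_ = np.linalg.lstsq(d_lab * sw[:, None], y_lab * sw, rcond=None)
    # Average the imputed values over labeled and unlabeled covariates.
    imputed = np.average(d_all @ beta, weights=np.concatenate([w_lab, w_unlab]))
    # Correct with the (weighted) mean residual on the labeled data.
    correction = np.average(y_lab - d_lab @ beta, weights=w_lab)
    return imputed + correction

def perturbation_se(x_lab, y_lab, x_unlab, n_rep=200, seed=0):
    """Standard error from perturbation resampling with i.i.d. Exp(1) weights."""
    rng = np.random.default_rng(seed)
    reps = [
        ss_mean(x_lab, y_lab, x_unlab,
                rng.exponential(1.0, len(y_lab)),
                rng.exponential(1.0, len(x_unlab)))
        for _ in range(n_rep)
    ]
    return np.std(reps, ddof=1)

# Simulated example: 200 labeled and 5000 unlabeled observations.
rng = np.random.default_rng(1)
x_lab = rng.normal(size=(200, 2))
x_unlab = rng.normal(size=(5000, 2))
y_lab = 1.0 + 2.0 * x_lab[:, 0] - x_lab[:, 1] + rng.normal(scale=0.5, size=200)
print(ss_mean(x_lab, y_lab, x_unlab), perturbation_se(x_lab, y_lab, x_unlab))
```

Each perturbation replicate reweights every observation and refits the whole estimator, so the spread of the replicates reflects the variability of both the fitted working model and the averaging step.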
Secondly, under the SSL-MAR assumption, we introduce the propensity score π_M(X) and the observation probability π*_M, both of which depend on the total sample size M. We propose an estimator that does not require estimating the propensity score yet is asymptotically equivalent to the estimator with known π_M(X). Theoretical analysis shows that its convergence rate is the degenerate √(Mπ*_M), which is slower than the usual √M rate; this is the most significant difference from the IPW estimator in the traditional missing-data literature. Furthermore, drawing on the idea of double machine learning, we propose an estimator that is robust to model misspecification. We also establish theoretical properties, such as convergence rates, for a series of kernel estimators when the positivity (overlap) assumption is violated. We further discuss the semiparametric efficiency lower bound under the SSL-MAR assumption: we derive the asymptotic variance lower bound for the target parameter and prove that the proposed estimators achieve this bound when the model is correctly specified.
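For orientation, the following sketch shows the textbook augmented IPW (AIPW) form of a mean estimator when labels are missing at random with a small labeling probability. It is a generic illustration, not the propensity-free or double-machine-learning estimators proposed in the thesis. The function name `aipw_mean`, the simulated propensity, and the reading of π*_M as the overall labeling probability are assumptions; under that reading the expected number of labels is M·π*_M, which gives some intuition for a √(Mπ*_M) rate.

```python
import numpy as np

def aipw_mean(y, r, m_hat, pi):
    """Augmented IPW mean: average of m_hat(X) + R / pi(X) * (Y - m_hat(X)).

    y may hold NaN where r == 0; those entries never enter the estimate.
    """
    y_filled = np.where(r == 1, y, 0.0)
    return np.mean(m_hat + r / pi * (y_filled - m_hat))

# Simulated example with a known, small labeling probability (assumed design).
rng = np.random.default_rng(0)
M = 20000
x = rng.normal(size=M)
pi = 0.1 / (1.0 + np.exp(-x))            # labeling probability depends on X (MAR)
r = (rng.random(M) < pi).astype(int)     # label indicator
y = np.where(r == 1, 2.0 + x + rng.normal(size=M), np.nan)

# Correct outcome model and correct propensity.
print(aipw_mean(y, r, 2.0 + x, pi))
# Misspecified outcome model (all zeros) with correct propensity.
print(aipw_mean(y, r, np.zeros(M), pi))
```

With the true propensity, both calls target the same mean even though the second outcome model is wrong, which is the usual double-robustness property of the AIPW form.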
Finally, we consider semi-supervised data integration methods for multi-source heterogeneous data, accommodating varying data structures across different sources. In particular, we allow fragmented (block-wise) missing covariates in the unlabeled data: at least one dataset has complete covariates and labels, while the other data sources have no labels and different missing patterns in the covariates. Our method directly imputes the loss function and obtains the target parameter estimate by empirical risk minimization. Furthermore, we introduce a meta estimator that uses only summary information. The proposed methods have theoretical guarantees, and their performance is validated through simulation experiments and real data applications.
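To give a sense of what combining sources from summary information alone can look like, the sketch below uses classical inverse-variance weighting of per-source estimates. This is a standard fixed-effect meta-analysis rule offered only as an illustration; the abstract does not specify the form of the thesis's meta estimator. The function name `meta_estimate` and the assumption of independent sources estimating a common scalar parameter are introduced here.

```python
import numpy as np

def meta_estimate(estimates, variances):
    """Inverse-variance weighted combination of independent scalar estimates.

    Returns the combined estimate and its variance under independence.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = 1.0 / variances
    combined = np.sum(weights * estimates) / np.sum(weights)
    combined_var = 1.0 / np.sum(weights)
    return combined, combined_var

# Two hypothetical sources reporting only an estimate and its variance.
est, var = meta_estimate([1.0, 3.0], [1.0, 1.0])
print(est, var)
```

Only the per-source estimates and variances are exchanged, so no individual-level data need to leave any source.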
- Semi-parametric Efficiency, Semi-supervised Learning, Block-wise Missing, Perturbation Resampling, Multi-Source Data Fusion