Divide and Conquer in High-dimensional Statistical Models
Project: Research
Description
Given the recent rapid increase in the availability of extremely large datasets, the storage, access, and analysis of such data have become critical. The proposed research is concerned with downstream statistical analysis of such big datasets. Since a data set is often too large to load into the memory of a single machine, let alone to analyze statistically all at once, the divide and conquer methodology has received significant attention. Conceptually, it simply involves distributing the entire data set across multiple machines, carrying out standard statistical model fitting at each local machine separately to obtain multiple estimates of the same quantities/parameters of interest, and finally pooling the estimates into a single estimate on a central machine by a simple averaging step. This simple but powerful strategy fits into the well-known MapReduce framework of Hadoop and thus can be implemented naturally in a parallel computing environment, although the proposed research is mainly concerned with fundamental theoretical questions rather than detailed implementation. For many models, the simple divide and conquer method can be shown theoretically to achieve the same estimation performance as when the entire data set is analyzed on a single machine; this is called the oracle property of the divide and conquer method. However, for high-dimensional models, where the number of parameters to estimate may exceed the number of observations, the situation is more complicated. In particular, naïve averaging fails because of the bias introduced by the penalty used to make high-dimensional estimation feasible, which does not average away across machines; debiasing before aggregation is therefore critical. In this proposal, we plan to study the divide and conquer method for several high-dimensional statistical models, including partially linear models, quantile regression models, and support vector classification.
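The split–fit–average recipe described above can be sketched in a few lines. The following is a minimal illustration in a low-dimensional, unpenalized setting (ordinary least squares on simulated data, with illustrative sample sizes and number of machines chosen here for the demonstration), where naïve averaging of the local estimates closely matches the full-data fit, consistent with the oracle property discussed above; it is not the proposal's method for the high-dimensional penalized case, where debiasing would be required first.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 10000, 5, 10  # illustrative: n samples, d features, k machines
beta = np.arange(1.0, d + 1.0)  # true coefficients
X = rng.standard_normal((n, d))
y = X @ beta + rng.standard_normal(n)

# "Oracle" benchmark: OLS on the entire data set on a single machine.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

# Divide and conquer: fit OLS separately on each of k shards,
# then pool the local estimates by simple averaging.
local_fits = [np.linalg.lstsq(Xs, ys, rcond=None)[0]
              for Xs, ys in zip(np.array_split(X, k), np.array_split(y, k))]
beta_avg = np.mean(local_fits, axis=0)

# The averaged estimate is close to both the oracle fit and the truth.
print(np.max(np.abs(beta_avg - beta_full)))
print(np.max(np.abs(beta_avg - beta)))
```

In this unpenalized setting each local estimate is unbiased, so averaging reduces variance without accumulating bias; with an L1-type penalty, every local estimate would carry a bias of the same sign, which averaging cannot remove.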
The purpose of this study is to propose debiasing methods for these penalized regression models and to establish rigorously the optimal convergence rate, or in some cases even the asymptotic distribution, of the aggregated estimates. The technical challenges include handling the propagation of error from the linear part to the nonlinear part in partially linear models, estimating the unknown conditional density function in quantile regression, and approximating the Dirac delta function in support vector classification. Achieving these goals will deepen our understanding of the divide and conquer strategy and significantly expand its applicability.
Detail(s)
| Project number | 9042684 |
| --- | --- |
| Grant type | GRF |
| Status | Finished |
| Effective start/end date | 1/10/18 → 24/08/23 |