Distributed Statistical Estimation and Inference in Big Data


Student thesis: Doctoral Thesis



Award date: 18 Jun 2020


With the development of modern technology, large-scale datasets are becoming increasingly available in various research fields, providing opportunities for more effective data analysis and statistical modeling. Conventional statistical inference methods are generally designed for the centralized paradigm, in which the dataset is stored on a single central machine that can process it in its entirety. These conditions no longer hold in the big data era, however, where datasets are frequently stored across multiple servers and the memory and computational capacity of a single machine cannot handle the entire dataset. This creates an urgent need for statistical inference methods in the distributed paradigm.

Based on the divide-and-conquer strategy, this dissertation develops novel distributed statistical inference methods in the context of big data. In particular, we focus on parameter estimation and inferential procedures for three important classes of statistical problems: the quantile regression model, the accelerated failure time model, and U-statistics-based empirical risk minimization.

The first part of this dissertation focuses on quantile regression. Based on the divide-and-conquer strategy, we propose a simple and efficient distributed approach to statistical inference for quantile regression models. Specifically, the proposed approach first obtains compact summary statistics from each data block, and then uses them to reconstruct the full-data estimator with asymptotically negligible approximation error. The proposed approach is a one-shot algorithm, which makes it particularly appealing when data arrive as a stream. Furthermore, under proper partitioning of the data, the proposed estimator is shown to be consistent and asymptotically as efficient as the original estimating-equation estimator based on the full sample. Empirical analyses based on both simulated and real massive datasets demonstrate the merits of the proposed method.
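To give a concrete feel for the divide-and-conquer pattern, the following sketch fits a quantile regression separately on each data block and combines the block estimates by simple averaging. This naive averaging combiner, the pinball-loss fit via a generic optimizer, and the simulated data are all illustrative assumptions, not the thesis's actual summary-statistic construction.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def pinball_loss(beta, X, y, tau):
    # Check (pinball) loss for quantile level tau.
    r = y - X @ beta
    return np.mean(np.maximum(tau * r, (tau - 1) * r))

def fit_block(X, y, tau):
    # Local quantile-regression fit on a single data block.
    beta0 = np.zeros(X.shape[1])
    return minimize(pinball_loss, beta0, args=(X, y, tau),
                    method="Nelder-Mead").x

# Simulated data: N observations, intercept plus one covariate.
N, K, tau = 4000, 8, 0.5
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)

# One-shot divide-and-conquer: each block contributes only its
# local estimate (a compact summary); combine by averaging.
blocks = np.array_split(np.arange(N), K)
beta_dc = np.mean([fit_block(X[idx], y[idx], tau) for idx in blocks], axis=0)
```

Each block is visited exactly once, which is what makes the scheme suitable for streaming data: a block's raw observations can be discarded as soon as its summary has been recorded.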

The second part of this dissertation studies distributed statistical inference for the accelerated failure time (AFT) model, which has been widely used in survival analysis due to its direct interpretation. In particular, we focus on robust rank-based inference for the AFT model. For large-scale datasets, such rank-based methods are computationally intensive and require a huge amount of memory, which limits the application of the AFT model in big data analysis. To address this issue, within the framework of rank-based inference, we propose an effective distributed statistical inference method for the AFT model. The proposed method requires only computing an initial estimator based on a small subset of the entire data. It then progressively improves the estimator through simple operations involving only local pairwise summations and linear combinations. Asymptotic results show that the proposed estimator achieves the same statistical efficiency as the benchmark estimator computed on the entire data after a small, constant number of iterations. In addition, an efficient distributed inferential procedure, including estimation of both the covariance matrix and confidence intervals, is provided. Extensive simulation studies, including a sensitivity analysis, further demonstrate the efficacy of the proposed method.
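The iterative structure above can be sketched as follows: start from a cheap initial estimate computed on a small subset, then repeatedly let each block report a gradient built from its own pairwise summations, average these across blocks, and update. The smoothed Gehan-type rank loss, the no-censoring assumption, and the plain gradient-descent update are simplifying stand-ins for exposition; the thesis's actual refinement step differs.

```python
import numpy as np

rng = np.random.default_rng(1)

def gehan_grad(beta, X, logt, h=0.05):
    # Gradient of a smoothed Gehan-type rank loss over all pairs
    # in this block (assumes no censoring for simplicity).
    e = logt - X @ beta                     # residuals e_i
    diff = e[None, :] - e[:, None]          # e_j - e_i
    w = 1.0 / (1.0 + np.exp(-diff / h))     # smoothed I(e_i < e_j)
    # Pair (i, j) contributes w_ij * (x_i - x_j); sum over all pairs.
    g = (w.sum(axis=1)[:, None] * X).sum(0) - (w.sum(axis=0)[:, None] * X).sum(0)
    return g / (len(e) ** 2)

# Simulated AFT data: log T = X beta + error, no censoring.
N, K = 2000, 5
X = rng.normal(size=(N, 2))
logt = X @ np.array([1.0, -0.5]) + rng.normal(scale=0.5, size=N)

# Initial estimate from a small subset (least squares on log-times).
sub = np.arange(200)
beta = np.linalg.lstsq(X[sub], logt[sub], rcond=None)[0]

# Refinement: each block reports a pairwise-summation gradient,
# the center averages them and takes a descent step.
blocks = np.array_split(np.arange(N), K)
for _ in range(40):
    g = np.mean([gehan_grad(beta, X[idx], logt[idx]) for idx in blocks],
                axis=0)
    beta -= 0.5 * g
```

Note that each block only ever forms pairs among its own observations, so no machine needs to hold the full N x N set of pairwise comparisons.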

The third part of this dissertation considers U-statistics-based empirical risk minimization (U-ERM), in which the loss function depends on a pair of data points. Since the computational complexity of a U-statistic grows quadratically with the total sample size N, minimizing such U-statistic-type empirical risks requires considerable computation and memory even for moderate N. This raises critical challenges for the application of U-ERM in big data analysis. To address this issue, we propose two computationally and statistically efficient distributed estimation methods. The first method is built on a surrogate empirical risk, which can be computed in a distributed manner and considers only sample pairs within the same data block. The second approach extends the one-step updating scheme of classical M-estimation to the case of pairwise losses. Furthermore, theoretical upper bounds on the approximation error and mean squared error are provided, and the proposed estimators are both shown to be asymptotically as efficient as the benchmark estimator based on the entire data under certain conditions. To further relax the theoretical conditions, we also introduce two distributed iterative algorithms, which progressively improve the proposed estimators and require only a very small number of iterations to converge. Extensive simulation studies confirm the excellent properties of the two proposed methods.
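The within-block surrogate idea can be illustrated with a classical pairwise loss: minimizing the U-statistic risk built from |(x_i + x_j)/2 - theta| yields the Hodges-Lehmann location estimator, which needs all N(N-1)/2 pairwise averages. The sketch below is a hedged illustration under assumed simulated data: restricting to pairs inside the same block cuts the pair count by roughly a factor of K while keeping the estimator's form.

```python
import numpy as np

rng = np.random.default_rng(2)

def pairwise_means(x):
    # All pairwise averages (x_i + x_j) / 2 with i < j inside one block.
    i, j = np.triu_indices(len(x), k=1)
    return (x[i] + x[j]) / 2.0

# Heavy-tailed simulated data centered at 3; the Hodges-Lehmann
# estimator (median of all pairwise averages) minimizes the
# U-statistic risk (1 / #pairs) * sum_{i<j} |(x_i + x_j)/2 - theta|,
# at quadratic cost in N.
N, K = 6000, 10
x = rng.standard_cauchy(N) + 3.0

# Surrogate risk: keep only pairs that fall inside the same block.
# Each block's pairwise averages can be formed on its own machine.
blocks = np.array_split(x, K)
theta_surrogate = np.median(np.concatenate([pairwise_means(b) for b in blocks]))
```

The within-block restriction discards the cross-block pairs, which is exactly the source of the approximation error that the theoretical upper bounds in this part control.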