Addressing the Massive Data Challenge with Distributed Inference for U-Statistics-Based Empirical Risk Minimization

Project: Research

View graph of relations


Big data has presented the field of statistics with opportunities as well as challenges. One opportunity widely recognized is the possibility of unearthing subtle population patterns capable of informing decision making. On the other hand, challenges have arisen due to the sheer size of data sets which can result in pronounced performance bottlenecks if estimation is attempted using a single machine, even with current computing power and memory. A common solution is to “divide-and-conquer” the massive data into manageable blocks, stored on multiple machines, with analysis performed on the data subsets independently, then combining these independent results into a final decision. This strategy is now the popular approach for overcoming the large data sets problem, with various machine learning algorithms having been developed. Yet, the statistical validity of results generated by applying these distributed algorithms remains understudied. Traditionally, the validity of statistical procedures has been established under the assumption of centralized computing. These procedures may no longer produce valid results when they are executed in a distributed environment. The proposed project focuses on empirical risk minimization (ERM) with pairwise loss functions under distributed computing. ERM is a general statistical estimation principle, and pairwise losses that depend on pairs of examples have been found useful in many fields. Under this setup, the empirical risk estimator is a U-statistic. It is well-known that when minimizing U-statistic-type empirical risks, the computational complexity increases quadratically with the sample size, which virtually prohibits centralized computing. The large number of sample pairs also makes it difficult if not impossible to obtain a complete U-statistic. The latter represents a major issuewith distributed inference for pairwise problems, for which the objective function is inseparable across observations, rendering existing distributed computing methods inapplicable. The object of the proposed project is to develop new distributed methods for U-statistic-based ERM. In this proposal we lay out a systematic approach to tackle the above problem. We propose distributed methods which are shown by our exploratory study to be as efficient as the global U-estimator obtained in a centralized environment. We will further investigate their properties, develop valid inference procedures including methods of estimating the covariance matrices, and devise iterative techniques for improving efficiency. In addition, we will provide simulation verification to the theoretical results in real world settings, evaluate the proposed approaches on real data applications, and provide open source software for others to use and build upon. The requested funding is mainly to recruit a research assistant to conduct the empirical analysis. 


Project number9043419
Grant typeGRF
StatusNot started
Effective start/end date1/01/23 → …