
Optimal distributed subsampling under heterogeneity

  • Yujing Shao
  • Lei Wang*
  • Heng Lian

*Corresponding author for this work

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

Abstract

Distributed subsampling approaches have been proposed to process massive data in a distributed computing environment, where subsamples are taken from each site and then analyzed collectively to address statistical problems when the full data set is not available. In this paper, we consider the setting in which each site involves a common parameter and site-specific nuisance parameters, and we formulate a unified framework of optimal distributed subsampling under heterogeneity for general optimization problems with convex, possibly nonsmooth, loss functions. By establishing the consistency and asymptotic normality of the distributed subsample estimators for the common parameter of interest, we derive the optimal subsampling probabilities and allocation sizes under the A- and L-optimality criteria. A two-step algorithm is proposed for practical implementation, and the asymptotic properties of the resultant estimator are established. For nonsmooth loss functions, an alternating direction method of multipliers (ADMM) algorithm is proposed to obtain the subsample estimator, and a random perturbation procedure is proposed to estimate the covariance matrices for statistical inference. The finite-sample performance of the proposed method is demonstrated through simulation studies for linear regression, logistic regression and quantile regression models, and an application to the National Longitudinal Survey of Youth dataset is also provided. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
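To make the idea of a two-step subsampling algorithm concrete, here is a minimal Python sketch for ordinary least squares at a single site. It is not the paper's algorithm: it ignores site-specific nuisance parameters and the allocation of subsample sizes across sites, and it assumes the residual-based probabilities (proportional to |residual| times the covariate norm) that are commonly used as L-optimality-style weights in the subsampling literature. The function name `two_step_subsample_ols` and the parameters `r0` (pilot size) and `r` (second-step size) are hypothetical.

```python
import numpy as np


def two_step_subsample_ols(X, y, r0, r, rng):
    """Illustrative two-step subsampling for least squares (single site).

    Assumption: probabilities proportional to |pilot residual| * ||x_i||,
    a common L-optimality-style choice; not taken from the paper itself.
    """
    n = X.shape[0]

    # Step 1: uniform pilot subsample gives a rough estimate of the parameter.
    pilot_idx = rng.choice(n, size=r0, replace=True)
    beta_pilot, *_ = np.linalg.lstsq(X[pilot_idx], y[pilot_idx], rcond=None)

    # Step 2: compute approximate optimal probabilities from the pilot fit.
    scores = np.abs(y - X @ beta_pilot) * np.linalg.norm(X, axis=1)
    probs = scores / scores.sum()

    # Draw the second-step subsample with replacement using these probabilities.
    idx = rng.choice(n, size=r, replace=True, p=probs)

    # Inverse-probability weighting keeps the subsample estimator consistent
    # for the full-data solution; weighted least squares via sqrt-weights.
    sqrt_w = np.sqrt(1.0 / probs[idx])
    beta_hat, *_ = np.linalg.lstsq(
        X[idx] * sqrt_w[:, None], y[idx] * sqrt_w, rcond=None
    )
    return beta_hat, probs
```

In a distributed setting, a step like this would run at each site and the site-level results would then be combined; how the paper combines them and chooses the allocation sizes is not shown here.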
Original language: English
Article number: 26
Journal: Statistics and Computing
Volume: 35
Issue number: 2
Online published: 6 Jan 2025
Publication status: Published - Apr 2025

Research Keywords

  • ADMM
  • Heterogeneity
  • Nonsmooth loss
  • Random perturbation
  • Site-specific nuisance parameters
