Optimal decorrelated score subsampling for generalized linear models with massive data

Junzhuo Gao, Lei Wang*, Heng Lian

*Corresponding author for this work

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

7 Citations (Scopus)

Abstract

In this paper, we consider unified optimal subsampling estimation and inference for the low-dimensional parameter of main interest in the presence of a nuisance parameter, for low- and high-dimensional generalized linear models (GLMs) with massive data. We first present a general subsampling decorrelated score function that reduces the influence of the less accurate, slowly converging nuisance parameter estimate. We establish the consistency and asymptotic normality of the resulting subsample estimator from a general decorrelated score subsampling algorithm, and derive two optimal subsampling probabilities under the A- and L-optimality criteria to downsize the data volume and reduce the computational burden. The proposed optimal subsampling probabilities provably improve asymptotic efficiency over the subsampling schemes in low-dimensional GLMs and perform better than the uniform subsampling scheme in high-dimensional GLMs. A two-step algorithm is further proposed for implementation, and the asymptotic properties of the corresponding estimators are also given. Simulations show satisfactory performance of the proposed estimators, and two applications to the census income and Fashion-MNIST datasets demonstrate their practical applicability. © 2023, Science China Press.
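To make the two-step idea concrete, the following is a minimal sketch for an ordinary low-dimensional logistic regression with no nuisance parameter. It is not the decorrelated score algorithm of the paper. It assumes subsampling probabilities proportional to |y - p| times the covariate norm, a form used in the optimal subsampling literature for logistic regression, and the function names `weighted_logistic_fit` and `two_step_subsample` are hypothetical.

```python
import numpy as np


def weighted_logistic_fit(X, y, w, n_iter=50, tol=1e-8):
    """Newton's method for a weighted logistic log-likelihood (illustrative)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta


def two_step_subsample(X, y, r0, r, rng):
    """Two-step subsampling sketch: uniform pilot, then non-uniform draw.

    Assumption: second-step probabilities are proportional to
    |y_i - p_i| * ||x_i||, evaluated at the pilot estimate.
    """
    n = X.shape[0]

    # Step 1: uniform pilot subsample of size r0 gives a rough estimate.
    idx0 = rng.choice(n, size=r0, replace=True)
    beta_pilot = weighted_logistic_fit(X[idx0], y[idx0], np.ones(r0))

    # Step 2: approximate the subsampling probabilities with the pilot fit.
    p = 1.0 / (1.0 + np.exp(-(X @ beta_pilot)))
    score = np.abs(y - p) * np.linalg.norm(X, axis=1)
    pi = score / score.sum()
    idx1 = rng.choice(n, size=r, replace=True, p=pi)

    # Combine both subsamples with inverse-probability weights.
    idx = np.concatenate([idx0, idx1])
    w = np.concatenate([np.full(r0, float(n)), 1.0 / pi[idx1]])
    w = w / w.mean()  # rescaling the weights does not change the maximizer
    beta_hat = weighted_logistic_fit(X[idx], y[idx], w)
    return beta_hat, pi


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, beta_true = 20000, np.array([0.5, -0.5, 0.5])
    X = rng.standard_normal((n, 3))
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)
    beta_hat, _ = two_step_subsample(X, y, r0=500, r=1500, rng=rng)
    print(beta_hat)
```

The inverse-probability weights keep the weighted score unbiased for the full-data score, which is why a non-uniform draw can still target the full-data estimator while fitting only r0 + r observations.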
Original language: English
Pages (from-to): 405-430
Journal: Science China Mathematics
Volume: 67
Issue number: 2
Online published: 29 Jun 2023
Publication status: Published - Feb 2024

Funding

This work was supported by the Fundamental Research Funds for the Central Universities, National Natural Science Foundation of China (Grant No. 12271272) and the Key Laboratory for Medical Data Analysis and Statistical Research of Tianjin. The authors are grateful to the referees for their insightful comments and suggestions on this article, which have led to significant improvements.

Research Keywords

  • 62H12
  • 62R07
  • A-optimality
  • decorrelated score subsampling
  • high-dimensional inference
  • L-optimality
  • massive data
