Optimal decorrelated score subsampling for generalized linear models with massive data

Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review

4 Scopus Citations

Detail(s)

Original language: English
Pages (from-to): 405-430
Journal / Publication: Science China Mathematics
Volume: 67
Issue number: 2
Online published: 29 Jun 2023
Publication status: Published - Feb 2024

Abstract

In this paper, we consider unified optimal subsampling estimation and inference for the low-dimensional parameter of main interest in the presence of a nuisance parameter for low/high-dimensional generalized linear models (GLMs) with massive data. We first present a general subsampling decorrelated score function to reduce the influence of less accurate nuisance parameter estimation, which has a slow convergence rate. The consistency and asymptotic normality of the resultant subsample estimator from a general decorrelated score subsampling algorithm are established, and two optimal subsampling probabilities are derived under the A- and L-optimality criteria to downsize the data volume and reduce the computational burden. The proposed optimal subsampling probabilities provably improve asymptotic efficiency over the subsampling schemes in low-dimensional GLMs and perform better than the uniform subsampling scheme in high-dimensional GLMs. A two-step algorithm is further proposed for implementation, and the asymptotic properties of the corresponding estimators are also given. Simulations show satisfactory performance of the proposed estimators, and two applications to the census income and Fashion-MNIST datasets also demonstrate their practical applicability. © 2023, Science China Press.
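The abstract gives no formulas, so the following is only a rough illustration of what a generic two-step subsampling scheme can look like for a low-dimensional logistic GLM. It is not the paper's decorrelated score algorithm and does not handle nuisance parameters. It assumes a uniform pilot subsample followed by sampling with probabilities proportional to |y_i - p_i| ||x_i||, a weighting often associated with L-optimality for logistic regression, and an inverse-probability-weighted fit. The function names, subsample sizes `r0` and `r`, and the simulated data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_logistic_fit(X, y, w, n_iter=50, tol=1e-8):
    """Newton iterations for a weighted logistic log-likelihood."""
    d = X.shape[1]
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        # Tiny ridge term only guards against a numerically singular Hessian.
        step = np.linalg.solve(hess + 1e-10 * np.eye(d), grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def two_step_subsample(X, y, r0, r, rng):
    """Hypothetical two-step scheme: uniform pilot, then informed sampling."""
    n = X.shape[0]
    # Step 1: uniform pilot subsample gives a rough estimate.
    idx0 = rng.choice(n, size=r0, replace=True)
    beta0 = weighted_logistic_fit(X[idx0], y[idx0], np.ones(r0))
    # Step 2: probabilities proportional to |y_i - p_i| * ||x_i||
    # (an assumed L-optimality-style weighting, not taken from the paper).
    score = np.abs(y - sigmoid(X @ beta0)) * np.linalg.norm(X, axis=1)
    probs = score / score.sum()
    idx1 = rng.choice(n, size=r, replace=True, p=probs)
    # Inverse-probability weights; uniform pilot draws get weight 1.
    w1 = 1.0 / (n * probs[idx1])
    Xc = np.vstack([X[idx0], X[idx1]])
    yc = np.concatenate([y[idx0], y[idx1]])
    wc = np.concatenate([np.ones(r0), w1])
    return weighted_logistic_fit(Xc, yc, wc)

# Simulated data, purely for illustration.
rng = np.random.default_rng(0)
n, d = 20000, 5
beta_true = np.full(d, 0.5)
X = rng.standard_normal((n, d))
y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)

beta_hat = two_step_subsample(X, y, r0=500, r=2000, rng=rng)
print(np.round(beta_hat, 3))
```

The inverse-probability weights keep the weighted score unbiased for the full-data score, which is why non-uniform sampling does not by itself bias the estimator in this sketch.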

Research Area(s)

  • 62H12, 62R07, A-optimality, decorrelated score subsampling, high-dimensional inference, L-optimality, massive data