Advancing Data Efficiency in Deep Learning: A Focus on Fairness and Robustness

數據高效深度學習中的公平性與魯棒性優化

Student thesis: Doctoral Thesis

Supervisors/Advisors
  • Minming LI (Supervisor)
  • Qing Li (External person) (External Co-Supervisor)
Award date: 30 Aug 2024

Abstract

Deep learning models are notoriously data-hungry, requiring vast amounts of training data and long training times. Training deep models more efficiently has therefore become a significant research focus. Data-efficient training aims to devise strategies that select or generate a smaller yet representative subset of the original data, enabling deep neural networks to be trained more quickly while maintaining comparable predictive performance. Various approaches, such as coreset selection, dataset condensation, and curriculum learning, have demonstrated remarkable success in reducing the time and hardware requirements for model training. For instance, applying dataset condensation to graph-structured data can reduce the graph size by over 99.9% while still achieving approximately 95.3% of the original node-classification test accuracy on the Reddit dataset.
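To make the coreset-selection idea mentioned above concrete, here is a minimal k-center greedy sketch in plain Python: it repeatedly adds the point farthest from the current selection, so the chosen subset covers the data diversely. The point set and subset size are hypothetical, and real methods operate on learned feature embeddings rather than raw coordinates.

```python
import math

def k_center_greedy(points, k):
    """Greedy k-center coreset selection: repeatedly add the point
    farthest from the already-selected set."""
    selected = [0]  # seed with the first point
    # minimum distance from every point to the selected set
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(selected) < k:
        far = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(far)
        for i, p in enumerate(points):
            d = math.dist(p, points[far])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected

# Two tight clusters plus one outlier: the greedy rule picks
# one representative per region instead of near-duplicates.
data = [(0, 0), (0.1, 0.1), (5, 5), (5.1, 5.0), (10, 0)]
coreset = k_center_greedy(data, 3)
```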

Despite the advancements in data-efficient training, existing research primarily emphasizes optimizing prediction accuracy, often neglecting the fairness and robustness of the trained models. Fairness in machine learning evaluates whether a model exhibits disparate performance across different subgroups (e.g., male vs. female, white vs. black). Robustness, on the other hand, assesses a model's ability to maintain consistent performance under small input perturbations such as image distortion or text misspelling. Both fairness and robustness are critical for the development of trustworthy AI systems. However, during the data selection process in data-efficient training, considerations for fair treatment of various subgroups are often overlooked, potentially leading to a loss of balance and diversity in the resultant data. Additionally, deep models trained on limited data are prone to overfitting, resulting in decreased robustness to input perturbations and poorer generalizability to unseen data.
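The notion of disparate subgroup performance can be made precise with a simple metric: the gap between the best and worst per-group accuracy. A short sketch, with hypothetical labels and two subgroups:

```python
def subgroup_accuracy_gap(y_true, y_pred, groups):
    """Per-subgroup accuracy and the max-min gap; a large gap
    indicates disparate performance across subgroups."""
    accs = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        accs[g] = sum(y_true[i] == y_pred[i] for i in idx) / len(idx)
    return max(accs.values()) - min(accs.values()), accs

# Hypothetical predictions: group 'a' is classified perfectly,
# group 'b' only one time out of three.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1]
groups = ['a', 'a', 'a', 'b', 'b', 'b']
gap, per_group = subgroup_accuracy_gap(y_true, y_pred, groups)
```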

In this thesis, we aim to bridge the gap between data-efficient training and the emerging need for model fairness and robustness. Specifically, the paradigm of data-efficient training can be divided into two stages: dataset summarization and model training. In the first stage, we utilize dataset condensation to compress the original, larger dataset into a synthetic, smaller one. Existing condensation methods often overlook subgroup fairness, leading to a loss of diversity and the generation of biased datasets due to lossy compression. To address this, we propose a novel framework that employs adversarial learning to create an unbiased agent model for condensation. Furthermore, because it remains unclear why existing methods worsen fairness, we conduct a theoretical analysis to elucidate the relationship between original and condensed datasets. Based on these theoretical insights, we tackle the root cause of unfairness by aligning subgroup representations using multi-marginal optimal transport. In the second stage, we focus on training models on the resultant smaller datasets. Existing approaches often fail to exploit the synergy between dataset condensation and model training. Given that deep neural networks tend to overfit when trained on reduced data, we introduce a consistency learning framework. This framework incorporates both model- and data-level perturbations to smooth the model in the function space, thereby enhancing its robustness.
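The subgroup-alignment idea in the first stage can be sketched with the one-dimensional Wasserstein-1 distance, which for two equal-size samples reduces to sorting both and averaging pairwise gaps. This is only an illustrative special case: the thesis uses multi-marginal optimal transport over learned representations, and the scalar features below are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples:
    sort both and average the pairwise absolute differences."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# Hypothetical scalar representations for two subgroups; group_b
# is group_a shifted by 1, so the transport cost is exactly 1.0.
group_a = [0.0, 1.0, 2.0]
group_b = [1.0, 2.0, 3.0]
dist = wasserstein_1d(group_a, group_b)
```

Minimizing such a distance between subgroup representations during condensation pushes the synthetic data to treat the subgroups alike.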

To summarize, in this thesis, we propose an adversarially regularized dataset condensation framework aimed at mitigating the exacerbated unfairness observed in models trained on condensed datasets. Through a comprehensive theoretical analysis of existing methods, we reveal the relationship between condensed and original data. We address fairness by aligning subgroup representations during condensation, minimizing the Wasserstein distance between them. Additionally, to combat the loss of robustness caused by overfitting to small-scale datasets, we introduce a consistency learning algorithm. Experiments on real-world datasets demonstrate that our proposed methods significantly advance data-efficient training by enhancing the fairness and robustness of deep learning models.
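The data-level half of the consistency-learning idea can be sketched as a regularizer that penalizes changes in a model's output under small input perturbations; model-level perturbations (e.g., weight noise) would follow the same pattern. The linear model and noise scale below are hypothetical.

```python
import random

def consistency_loss(model, xs, noise=0.1, seed=0):
    """Mean squared change in the model's output when each input is
    jittered by small uniform noise; low values mean the model is
    smooth (robust) around the training points."""
    rng = random.Random(seed)
    total = 0.0
    for x in xs:
        x_pert = [xi + rng.uniform(-noise, noise) for xi in x]
        diff = model(x) - model(x_pert)
        total += diff * diff
    return total / len(xs)

# Hypothetical linear model on 2-D inputs: each coordinate moves by
# at most 0.1, so the squared output change is at most 0.04.
model = lambda x: x[0] + x[1]
loss = consistency_loss(model, [[0.0, 0.0], [1.0, 2.0]])
```

Adding this term to the training objective discourages the network from fitting sharp, perturbation-sensitive functions to the small condensed dataset.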