A Comprehensive Test-Driven Framework for Enhancing Deep Learning Models


Student thesis: Doctoral Thesis





Award date: 25 Sept 2023


Deep learning (DL) models have been widely deployed in many applications. Enhancing a DL model requires finding failing test cases of the deployed model, localizing model components with lower performance, and repairing the model. However, such an enhancement task needs to address several challenges.

Detecting failing test cases is challenging because they must be selected from many unlabeled samples before any labeling takes place. Existing techniques fail to detect failing test cases with high prediction confidence effectively, and they apply only to single types of DL models. Besides, localizing erroneous components is challenging due to the unconditional feed-forward inference nature of DL models, which makes profiling internal model states to differentiate failing test cases from the rest ineffective. Moreover, maintaining a DL model for a large robustness improvement without sacrificing its original standard accuracy is challenging: existing maintenance techniques often trade away the standard accuracy of the trained model to obtain large robustness gains.

This thesis presents a comprehensive framework to address the aforementioned problems with the following contributions.

The first contribution is the formulation of a novel technique, DeepPatch, to protect the standard accuracy from deterioration while improving the robustness, especially under large perturbations. DeepPatch introduces a novel division-of-labor method that adaptively activates a subset of its inserted patching units to process individual samples. The produced model can generate either the original or the replacement feature maps in each forward pass, giving the patched model an intrinsic ability to behave like the model under maintenance on demand. The experiments show that DeepPatch retains the standard accuracy of all pretrained models while improving their robustness substantially.
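To make the division-of-labor idea concrete, the following minimal sketch shows one way such adaptive routing could work. This is not DeepPatch's actual implementation: the gate, its threshold, and the patch function are all assumptions for illustration only. Each sample is routed either through the original (frozen) block or through a patching unit, so samples the pretrained model already handles keep its original behavior.

```python
import numpy as np

def original_block(x):
    # Stand-in for a frozen layer of the pretrained model.
    return np.tanh(x)

def patch_unit(x):
    # Hypothetical patching unit producing a replacement feature map.
    return np.maximum(0.0, x + 0.1)

def gate(x, threshold=0.5):
    # Hypothetical per-sample gate: activate the patch only when the
    # sample's mean activation magnitude suggests a hard input.
    return float(np.abs(x).mean()) > threshold

def patched_forward(x):
    # Division of labor: each sample takes either the original path
    # (preserving standard accuracy on easy inputs) or the patched
    # path (improving robustness on hard inputs).
    return patch_unit(x) if gate(x) else original_block(x)

easy = np.full(4, 0.1)  # low-magnitude sample: takes the original path
hard = np.full(4, 2.0)  # high-magnitude sample: takes the patched path
```

Because the gate defaults to the original path, the patched model degenerates to the model under maintenance whenever the patch is not needed, which is the intrinsic property the paragraph above describes.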

The second contribution of this thesis is the formulation of a pair of novel techniques, A3Rank and EffiMAP, to tackle the ineffectiveness and generalization problems in test case prioritization. A3Rank is the first work to effectively prioritize failing test cases among test cases with high confidence. It proposes a novel augmentation alignment analysis that diagnoses the intrinsic property of prediction consistency between test cases and their augmented variants for data-augmented DL models. The experiments show that A3Rank outperforms peer techniques by 163.63% in detection rate. EffiMAP is the first generalized and effective predictive mutation analysis technique for both classification and regression models. It predicts whether a test case kills model mutants from the execution trace, without performing comprehensive mutation analysis in the test phase, and the experimental results validate the feasibility and effectiveness of EffiMAP in the DL model testing domain.
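The augmentation-alignment idea behind A3Rank can be sketched as follows. This is an illustrative assumption, not A3Rank's actual scoring function: here the score is simply the mean L1 divergence between a model's output distribution on a test case and on its augmented variants, and cases are ranked most-divergent first.

```python
import numpy as np

def alignment_score(probs_orig, augmented_probs):
    # Hypothetical alignment score: mean L1 divergence between the
    # model's output distribution on a test case and on each of its
    # augmented variants. A large divergence signals inconsistent
    # predictions, hinting the case may fail even at high confidence.
    return float(np.mean([np.abs(probs_orig - p).sum()
                          for p in augmented_probs]))

def prioritize(cases):
    # cases: list of (case_id, probs_on_original, [probs_on_variant, ...]).
    # Rank the least-aligned (most suspicious) cases first for labeling.
    scored = [(cid, alignment_score(p, augs)) for cid, p, augs in cases]
    return [cid for cid, _ in sorted(scored, key=lambda t: -t[1])]

# Both cases are predicted with 0.9 confidence, but t2's prediction
# flips under augmentation, so it is prioritized for labeling.
consistent = ("t1", np.array([0.9, 0.1]), [np.array([0.88, 0.12])])
drifting = ("t2", np.array([0.9, 0.1]), [np.array([0.4, 0.6])])
```

The point of the sketch is that confidence alone cannot separate the two cases, whereas prediction consistency across augmented variants can.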

The third contribution is the formulation of FilterFuzz to address the ineffectiveness problem in faulty component localization. FilterFuzz localizes a set of suspicious components in DL models whose ablation would improve model performance. It is the first work to propose a test adequacy criterion, filter coverage, with a clear cause-and-effect chain. Our case study shows that FilterFuzz is 33% more effective than fuzzing guided by neuron coverage.
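The ablation-based localization described above can be illustrated with a toy sketch. This is not FilterFuzz itself: a linear "model" stands in for a network, each weight row plays the role of a filter, and the loop simply flags filters whose removal raises test accuracy, which is the cause-and-effect relation the paragraph refers to.

```python
import numpy as np

def model_accuracy(weights, mask, X, y):
    # Toy linear "model": each row of `weights` acts as one filter;
    # `mask` zeroes the ablated filters.
    logits = X @ (weights * mask[:, None])
    return float((logits.argmax(axis=1) == y).mean())

def localize_suspicious(weights, X, y):
    # Hypothetical ablation loop: zero out one filter at a time and
    # flag those whose removal improves accuracy on the test set.
    base = model_accuracy(weights, np.ones(len(weights)), X, y)
    suspicious = []
    for i in range(len(weights)):
        mask = np.ones(len(weights))
        mask[i] = 0.0
        if model_accuracy(weights, mask, X, y) > base:
            suspicious.append(i)
    return suspicious

weights = np.array([[2.0, 0.0],
                    [0.0, 2.0],
                    [0.0, 3.0]])  # filter 2 drags predictions to class 1
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
y = np.array([0, 1])
print(localize_suspicious(weights, X, y))  # → [2]
```

Exhaustively ablating every filter is what a coverage criterion such as filter coverage would let a fuzzer avoid, by steering test generation toward unexercised filters instead.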

In summary, this thesis presents a comprehensive framework to enhance DL models with three major contributions: finding failing test cases, localizing faulty components in DL models, and maintaining DL models for robustness improvement.