Abstract
Deep learning (DL) models have been widely applied in many application domains. Assuring the robustness of these DL models before deployment is essential to identify their weaknesses and vulnerabilities and to prevent those weaknesses from being exploited to cause damage in the real world. One of the most extensively studied assurance methods is robustness testing, which evaluates DL models for vulnerabilities by generating high-quality test suites composed of failing test cases derived from seeds, the input samples of the models under test. DL fuzzing techniques should address both the quantity and the quality of their generated test suites: they should effectively produce test cases that cover the specific input scenarios represented by individual seeds while also generating comprehensive robustness-oriented test suites.
Many DL model fuzzing techniques generate failing test cases by perturbing each sample using its gradient data; these are known as gradient-based fuzzing techniques. Despite their high effectiveness compared to other fuzzing techniques, gradient-based fuzzing techniques typically produce failing test cases with low diversity in perturbation sources, often limited to the gradients of individual seeds or their variants. Several empirical studies further show that their testing metrics do not outperform random selection in guiding models to achieve substantially greater robustness. They also generate test suites of much lower quality when fuzzing models regularized on the gradients of their training samples.
This thesis presents a novel contextual robustness-oriented testing framework for the robustness testing of DL models to address the above-mentioned problems. It makes two main contributions:
The first contribution is the formulation of a fuzzing technique called Clover, which achieves high diversity in the gradient compositions of the test cases within its generated test suites. These test suites are produced by a novel fuzzing algorithm guided by a novel testing metric called Contextual Confidence (CC). CC represents a new class of testing metrics that evaluate an individual sample based on its surrounding samples: it computes the mean confidence with which these surrounding samples are predicted to the prediction label of the corresponding perturbed sample. The fuzzing algorithm incrementally groups past test cases by their seed labels and adversarial labels, identifies the test case with the highest CC value for each seed in each such group, and represents each group by the perturbations that produced these test cases from their seeds. It then combines the perturbations grouped under the same group as the seed under fuzzing with that seed to produce new test cases. Experiments show that Clover outperforms peer gradient-based fuzzing techniques, producing test suites with significantly greater diversity in both unique categories at the model level and prediction labels at the seed level, and with a significantly greater guiding effect on robustness improvement.
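The CC metric described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the neighborhood construction (uniform noise within a small radius), the neighbor count, and all function names are assumptions made for the sketch; the thesis defines only that CC is the mean confidence of a perturbed sample's surrounding samples toward that sample's own predicted label.

```python
import numpy as np

def contextual_confidence(model, x_perturbed, n_neighbors=8, radius=0.05, rng=None):
    """Sketch of a CC-style metric: the mean confidence with which samples
    surrounding x_perturbed are predicted to x_perturbed's own label.
    The uniform-noise neighborhood and parameter values are illustrative."""
    rng = rng if rng is not None else np.random.default_rng(0)
    probs = model(x_perturbed[None, :])[0]            # softmax output for the test case
    label = int(np.argmax(probs))                     # its predicted label
    noise = rng.uniform(-radius, radius, size=(n_neighbors,) + x_perturbed.shape)
    neighbors = x_perturbed[None, :] + noise          # surrounding samples
    neighbor_probs = model(neighbors)                 # batch prediction on the neighborhood
    return float(neighbor_probs[:, label].mean())     # mean confidence toward that label

# Toy two-class softmax model over a fixed linear layer (illustrative only).
W = np.array([[2.0, -1.0], [-1.0, 2.0]])

def toy_model(batch):
    logits = batch @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

cc = contextual_confidence(toy_model, np.array([1.0, 0.0]))
```

A perturbed sample deep inside a prediction region yields a CC close to 1, while one near a decision boundary yields a lower CC, which is what makes the metric useful for guiding fuzzing.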
The second contribution is the formulation of a pair of novel techniques, Aster and Basil, designed to keep gradient-based fuzzers effective when fuzzing deep learning models regularized on the gradients of their training samples. Aster is a seed generation technique with a novel reachability-based strategy: it creates replacement seeds by systematically encoding the perturbations of other seeds into the seeds being replaced. Experiments show that Aster outperforms the baseline methods by 22% to 31% in seed success rate. Basil is a more generic seed generation technique that iteratively and incrementally encodes a vicinity perturbation, constructed from the perturbation of the seed itself and that of another seed, into the former seed while diversifying their relative prediction outputs, until the diversity of the set of all resulting seeds converges. The experimental results show that Basil surpasses the baseline methods even further, achieving improvements of 27% to 54% in seed success rate, 44% to 71% and 57% to 108% in enabling gradient-based fuzzers to generate test suites with greater diversity in unique categories and in prediction labels at the seed level, respectively, and 18% to 46% in guiding effect on robustness improvement. It also effectively addresses the problem of diminished fuzzing effectiveness on such gradient-regularized models.
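The core seed-encoding idea shared by Aster and Basil, folding the perturbation that produced another seed's test case into the current seed, can be sketched roughly as below. This is a speculative simplification under stated assumptions: the mixing weight `alpha`, the L-infinity budget `eps`, and all names are hypothetical, and the sketch omits the reachability analysis, iterative diversification, and convergence test that the actual techniques perform.

```python
import numpy as np

def encode_perturbation(seed, other_seed, other_test_case, alpha=0.5, eps=0.1):
    """Illustrative sketch of seed encoding: fold the perturbation that moved
    another seed to its test case into the current seed, clipped to an
    L-infinity budget. alpha and eps are assumed parameters, not the thesis's."""
    perturbation = other_test_case - other_seed          # what moved the other seed
    replacement = seed + alpha * perturbation            # encode it into this seed
    return np.clip(replacement, seed - eps, seed + eps)  # stay within the budget

seed = np.zeros(4)
other_seed = np.ones(4)
other_case = other_seed + 0.08        # a small prior perturbation of the other seed
new_seed = encode_perturbation(seed, other_seed, other_case)
```

The design intuition is that a replacement seed built this way carries perturbation directions discovered elsewhere, giving a gradient-based fuzzer starting points whose local gradients are not flattened by gradient regularization.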
In summary, this thesis presents a novel approach to robustness testing of deep learning models by contributing novel and effective fuzzing and seed generation algorithms that leverage information from the seed list and all test case candidates.
| Date of Award | 3 Apr 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Wing Kwong CHAN (Supervisor) |