Gradient-based Visual Explanation Methods and Applications for Object Detectors and CLIP

Student thesis: Doctoral Thesis

Abstract

The black-box nature of deep neural networks (DNNs) makes their decision-making processes difficult to interpret, which limits our ability to understand model behavior and ultimately to establish user trust. Visual explanation methods aim to explain DNNs by visualizing a heat map of the input regions the model relies on to generate its output. Previous gradient-based visual explanation methods focus on image classification or its variants. In this thesis, we propose methods for generating visual explanations suitable for two rarely explored model categories: object detectors and CLIP (Contrastive Language-Image Pre-training) models. Building on the proposed explanation methods, we further develop applications that enhance model performance and reliability. The main research results are as follows:

1) ODAM: Gradient-Weighted Activation Maps for Object Detectors
We propose gradient-weighted Object Detector Activation Maps (ODAM), a visual explanation technique for interpreting the predictions of object detectors. ODAM produces heat maps that show the influence of image regions on the detector's decision for each predicted attribute. We then discuss two explanation tasks for object detection: a) object specification and b) object discrimination. Addressing these two aspects, we present a detailed analysis of the visual explanations of detectors.
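The general gradient-weighted activation-map idea behind this family of methods can be sketched as follows. This is a minimal numpy illustration, not the thesis's actual ODAM formulation: it assumes we already have a feature map and the gradient of a chosen prediction score with respect to it, and combines them element-wise.

```python
import numpy as np

def gradient_weighted_map(activations, gradients):
    """Combine feature activations with the gradient of a chosen
    prediction score to obtain a per-location importance map.

    activations, gradients: arrays of shape (C, H, W).
    Returns a (H, W) heat map normalized to [0, 1].
    """
    # Element-wise weighting, channel sum, then ReLU: keep only
    # locations that positively support the prediction.
    heat = np.maximum((gradients * activations).sum(axis=0), 0.0)
    # Normalize for visualization.
    if heat.max() > 0:
        heat = heat / heat.max()
    return heat
```

In practice the activations and gradients would come from a backbone or detection-head layer via backpropagation; here they are plain arrays so the weighting step itself is easy to see.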

2) ODAM-Driven Applications: Enhancing Detection via Explainability
For the explanation task of "object specification", ODAM highlights the regions important for a specific prediction, so we propose ODAM-based knowledge distillation (ODAM-KD) for object detection: guided by the top-down attention provided by ODAM, a student detector learns more effectively from the teacher detector. For the explanation task of "object discrimination", ODAM interprets which object was actually detected, so we propose ODAM-NMS to aid duplicate removal while preserving overlapping detections of distinct instances in crowded scenes.
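The duplicate-removal idea can be illustrated with a toy sketch: standard NMS suppresses any highly overlapping lower-scored box, whereas an explanation-aware variant suppresses only when the two boxes' explanation maps also agree, i.e., they likely explain the same instance. The helper names and thresholds below are hypothetical illustrations, not the thesis's ODAM-NMS algorithm.

```python
import numpy as np

def iou(a, b):
    # Boxes as [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def heatmap_similarity(h1, h2):
    # Cosine similarity between flattened heat maps.
    v1, v2 = h1.ravel(), h2.ravel()
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8))

def explanation_aware_nms(boxes, scores, heats, iou_thr=0.5, sim_thr=0.8):
    """Keep a box unless it both overlaps a higher-scored box AND
    shares a similar explanation map with it."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        duplicate = False
        for j in keep:
            if iou(boxes[i], boxes[j]) > iou_thr and \
               heatmap_similarity(heats[i], heats[j]) > sim_thr:
                duplicate = True
                break
        if not duplicate:
            keep.append(int(i))
    return keep
```

With this rule, two overlapping detections of different people in a crowd survive as long as their explanation maps point at different evidence, while true duplicates (same box, same evidence) are still removed.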

3) Grad-ECLIP: Gradient-based Visual-Textual Explanations for CLIP
We propose a Gradient-based visual and textual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for a specific input image-text pair. Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP result. Furthermore, we conduct a series of analyses based on our visual and textual explanation results.
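The underlying idea, scoring each image patch or text token by how its feature interacts with the gradient of the image-text similarity, can be sketched generically. This is a hedged numpy illustration with a hypothetical helper name, not the thesis's exact Grad-ECLIP formulation; it assumes per-token features and their similarity gradients are already available.

```python
import numpy as np

def token_relevance(token_feats, grads):
    """Relevance of each image patch or text token.

    token_feats: (N, D) per-token (or per-patch) features.
    grads:       (N, D) gradient of the image-text similarity
                 w.r.t. each token feature.
    Returns a (N,) non-negative relevance vector, normalized to [0, 1].
    """
    # Rectified feature-gradient dot product: tokens whose features
    # push the similarity up get positive relevance.
    rel = np.maximum((token_feats * grads).sum(axis=-1), 0.0)
    if rel.max() > 0:
        rel = rel / rel.max()
    return rel
```

The same function serves both modalities: applied to image-patch features it yields a spatial heat map, and applied to word embeddings it yields per-word importance for the matching score.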

4) Grad-ECLIP Application: Fine-Grained CLIP Alignment via Explainability
CLIP has been shown to have limitations in understanding fine-grained details, because its pre-training focuses on matching a whole image to a text description. Since the explanation map indicates the text-specific salient regions of an input image, we propose a Grad-ECLIP-based application that boosts fine-grained alignment during CLIP fine-tuning.
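One simple way such an explanation map could steer fine-grained alignment is to pool patch features weighted by the text-specific saliency before comparing them to the text embedding, so the image-side representation emphasizes the regions the text actually describes. The sketch below is a hypothetical illustration of this weighting step, not the thesis's fine-tuning procedure.

```python
import numpy as np

def heat_pooled_feature(patch_feats, heat):
    """Pool patch embeddings weighted by a saliency map.

    patch_feats: (N, D) image-patch embeddings.
    heat:        (N,) non-negative saliency weights (e.g., from an
                 explanation map, flattened to match the patches).
    Returns a (D,) image feature that emphasizes salient patches.
    """
    w = heat / (heat.sum() + 1e-8)          # normalize weights
    return (w[:, None] * patch_feats).sum(axis=0)
```

The pooled feature could then enter a contrastive loss against the corresponding phrase embedding, encouraging region-level rather than whole-image alignment.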
Date of Award: 23 Jun 2025
Original language: English
Awarding Institution
  • City University of Hong Kong
Supervisor: Antoni Bert CHAN
