Towards Efficient Learning from Mono-Modal to Multi-Modal Data for Medical Image Diagnosis


Student thesis: Doctoral Thesis

Award date: 13 Sept 2024

Abstract

Medical imaging plays a crucial role in clinical diagnosis, providing valuable insights for healthcare professionals in assessing and treating various medical conditions. Recent deep learning techniques have revolutionized automatic medical image diagnosis across various imaging modalities, offering remarkable advancements in diagnostic accuracy and efficiency.

Despite these impressive advances, previous methods face two main challenges when applied in real clinical practice. The first is the label limitation issue. Most current methods rely on massive data annotations by professional physicians and follow a fully-supervised learning paradigm, yet such expert annotations are expensive and time-consuming to produce manually. To address this, we aim to take advantage of the large amount of unlabeled samples, which are usually easy to access, to provide additional knowledge when labeled medical data are limited or even absent. The second challenge is incomplete multi-modal learning. In many diagnostic applications, physicians need to utilize multi-modal data to reach accurate diagnoses; leveraging multi-modal data to train strong deep learning models is therefore also beneficial for better results. However, in real clinical scenarios, many medical centers may not have access to complete modalities and can only use incomplete multi-modal data for model inference, causing substantial performance drops in current multi-modal learning approaches. To address this problem, we aim to develop deep learning systems that are robust to possible missing-modality cases.

In the first part, in order to fully leverage unlabeled data together with a small amount of labeled images, we propose a novel semi-supervised framework, TEmporal knowledge-Aware Regularization (TEAR). In TEAR, Adaptive Pseudo Labeling (AdaPL) is proposed as a mild learning strategy that relaxes hard pseudo labels into soft-form ones and provides cautious training. Furthermore, to reduce the excessive dependency on biased pseudo labels, we take advantage of temporal knowledge and propose Iterative Prototype Harmonizing (IPH) to encourage the model to learn discriminative representations in an unsupervised manner.
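The following is a minimal sketch of the soft pseudo-labeling idea behind AdaPL in a PyTorch setting; the temperature sharpening, confidence weighting, and function names (e.g. `adaptive_pseudo_labels`) are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

def adaptive_pseudo_labels(logits: torch.Tensor,
                           temperature: float = 0.5,
                           conf_threshold: float = 0.7):
    """Relax hard (one-hot) pseudo labels into soft targets.

    Instead of taking the argmax as a hard label, the unlabeled prediction is
    sharpened with a temperature, and low-confidence samples are down-weighted
    so that training on them stays cautious.
    """
    probs = F.softmax(logits.detach(), dim=-1)                 # teacher-style prediction
    sharpened = probs ** (1.0 / temperature)                   # sharpen the distribution
    soft_targets = sharpened / sharpened.sum(dim=-1, keepdim=True)
    confidence, _ = probs.max(dim=-1)                          # per-sample confidence
    weights = (confidence / conf_threshold).clamp(max=1.0)     # soft mask instead of a hard cut-off
    return soft_targets, weights

def unlabeled_loss(student_logits: torch.Tensor,
                   soft_targets: torch.Tensor,
                   weights: torch.Tensor) -> torch.Tensor:
    # Cross-entropy against soft targets, weighted per sample.
    log_probs = F.log_softmax(student_logits, dim=-1)
    per_sample = -(soft_targets * log_probs).sum(dim=-1)
    return (weights * per_sample).mean()
```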

In the second part, we revisit the whole training pipeline of semi-supervised learning, identify three hierarchical biases, and propose the HierArchical BIas miTigation (HABIT) framework, which consists of three modules. First, a Mutual Reconciliation Network is devised to jointly utilize convolution- and permutator-based paths, with a mutual information transfer module to exchange features. Second, Recalibrated Feature Compensation is designed to adaptively adjust the strongly and weakly augmented distributions, maintaining a well-calibrated discrepancy between them. Third, we tailor Consistency-aware Momentum Heredity (CMH) to enforce consistency among different sample augmentations and improve model dependability.
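As a rough sketch of the momentum-plus-consistency idea behind CMH, assuming a standard weak/strong augmentation setup with an EMA (momentum) teacher; the class names, momentum value, and KL-based consistency loss here are assumptions for illustration.

```python
import copy
import torch
import torch.nn.functional as F

class MomentumTeacher:
    """EMA copy of the student model, used to provide stable prediction targets."""
    def __init__(self, student: torch.nn.Module, momentum: float = 0.99):
        self.teacher = copy.deepcopy(student)
        self.momentum = momentum
        for p in self.teacher.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, student: torch.nn.Module):
        # Exponential moving average of student parameters.
        for t, s in zip(self.teacher.parameters(), student.parameters()):
            t.mul_(self.momentum).add_(s, alpha=1.0 - self.momentum)

def consistency_loss(student: torch.nn.Module,
                     teacher_model: torch.nn.Module,
                     weak_batch: torch.Tensor,
                     strong_batch: torch.Tensor) -> torch.Tensor:
    # Targets come from the momentum teacher on the weakly augmented view...
    with torch.no_grad():
        target = F.softmax(teacher_model(weak_batch), dim=-1)
    # ...and the student is kept consistent on the strongly augmented view.
    pred = F.log_softmax(student(strong_batch), dim=-1)
    return F.kl_div(pred, target, reduction="batchmean")

# Usage sketch: ema = MomentumTeacher(student); each step compute
# consistency_loss(student, ema.teacher, weak, strong), then ema.update(student).
```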

In the third part, we aim to leverage multi-modal medical data for multi-modal self-supervised pre-training and present an enhanced masked relation modeling (MRM++) framework. Instead of randomly masking input data as in previous MIM methods, which can discard disease-related semantic information, we design prior-guided relation masking to mask out token-wise feature relations at both the self- and cross-modality levels, which preserves intact semantics within the input and allows the model to learn rich disease-related information. Moreover, to enhance semantic relation modeling, we propose relation matching to align the sample-wise relations between the intact and masked features. Additionally, to bridge the gap between pre-training and downstream fine-tuning, we devise task-oriented adapting as an intermediate phase before fine-tuning, transferring compatible knowledge from the pre-trained model to various downstream diagnosis tasks.
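To make the relation-masking idea concrete, here is a simplified sketch: relations are computed as token-wise similarities between (self- or cross-modality) feature sets, a fraction of relation entries is hidden rather than masking input tokens, and a matching loss aligns masked-branch relations with intact ones. The `prior` argument and all function names are assumed interfaces, not the MRM++ code.

```python
import torch
import torch.nn.functional as F

def relation_matrix(tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
    """Token-wise relation (cosine similarity) between two feature sets.

    tokens_a: (B, Na, D), tokens_b: (B, Nb, D). For self-modality relations both
    arguments are the same tensor; for cross-modality relations they come from
    different modality encoders.
    """
    a = F.normalize(tokens_a, dim=-1)
    b = F.normalize(tokens_b, dim=-1)
    return a @ b.transpose(1, 2)                                   # (B, Na, Nb)

def mask_relations(rel: torch.Tensor, mask_ratio: float = 0.5,
                   prior: torch.Tensor = None):
    """Mask out a fraction of relation entries instead of masking input tokens.

    `prior`, if given, is a (B, Na, Nb) weight map (e.g. a disease-region prior)
    biasing which relations are hidden.
    """
    scores = torch.rand_like(rel) if prior is None else prior * torch.rand_like(rel)
    k = int(mask_ratio * rel.shape[-1])
    idx = scores.topk(k, dim=-1).indices                           # relations to hide per token
    mask = torch.zeros_like(rel).scatter_(-1, idx, 1.0).bool()
    return rel.masked_fill(mask, 0.0), mask

def relation_matching_loss(rel_from_masked: torch.Tensor,
                           rel_from_intact: torch.Tensor) -> torch.Tensor:
    # Align relations of the masked branch with those of the intact branch.
    return F.mse_loss(rel_from_masked, rel_from_intact.detach())
```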

In the fourth part, we consider a practical clinical scenario in which the available multi-modal data may be incomplete, hampering the model's utilization of the data. To address this problem, we propose a dual-disentanglement network (D2Net) consisting of the modality disentanglement (MD)-Stage and the tumor-region disentanglement (TD)-Stage. The MD-Stage involves a novel spatial-frequency jointly modality contrastive (SFMC) learning scheme that helps the model explicitly exploit the correlations across modalities. The TD-Stage then presents an affinity-guided dense tumor-region knowledge distillation (ADT-KD) mechanism to produce decoupled tumor-specific knowledge that is unrelated to particular MRI modalities.
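A minimal sketch of what an affinity-guided dense distillation loss can look like, assuming a complete-modality teacher and a missing-modality student that produce dense decoder features and segmentation logits; the tensor shapes, `tumor_mask` prior, and weighting scheme are assumptions for illustration rather than the ADT-KD implementation.

```python
import torch
import torch.nn.functional as F

def affinity_weighted_distillation(student_feat: torch.Tensor,
                                   teacher_feat: torch.Tensor,
                                   student_logits: torch.Tensor,
                                   teacher_logits: torch.Tensor,
                                   tumor_mask: torch.Tensor) -> torch.Tensor:
    """Distill dense, tumor-focused knowledge from a complete-modality teacher.

    student_feat / teacher_feat:   (B, C, H, W) decoder features.
    student_logits / teacher_logits: (B, K, H, W) segmentation logits.
    tumor_mask: (B, 1, H, W) soft foreground prior restricting where to distill.
    """
    # Pixel-wise affinity between the two feature maps guides how strongly
    # each location is distilled.
    affinity = F.cosine_similarity(student_feat, teacher_feat, dim=1, eps=1e-6)   # (B, H, W)
    weight = affinity.clamp(min=0).unsqueeze(1) * tumor_mask                      # (B, 1, H, W)

    # Per-pixel KL divergence between teacher and student class distributions.
    t = F.softmax(teacher_logits.detach(), dim=1)
    s = F.log_softmax(student_logits, dim=1)
    per_pixel_kl = (t * (t.clamp_min(1e-8).log() - s)).sum(dim=1, keepdim=True)   # (B, 1, H, W)

    return (weight * per_pixel_kl).sum() / weight.sum().clamp_min(1e-6)
```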

In the last part, we formulate a new clinical scenario, i.e., federated incomplete multi-modal brain tumor segmentation, and present a Progressive distiLlation with Optimal Transport framework (PLOT) to gradually train a modality-robust segmentation model at each client and achieve compatible model aggregation at the server. To remedy the unstable local training caused by random modality input, we present Modality Progressive Distillation (MPD), a multi-level knowledge distillation strategy guided by a modality routing mechanism. Moreover, to address the problem that layer-wise knowledge from different models may conflict, we design an Optimal Transport-guided Model Aggregation (OTMA) strategy at the server, which yields a global alignment solution for model parameters by solving an optimal transport problem.
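As a rough illustration of optimal-transport-guided aggregation, the sketch below softly matches the units of each client's layer to the corresponding global layer with a Sinkhorn-computed transport plan before averaging; the plain Sinkhorn solver, uniform marginals, and barycentric projection are assumptions for exposition, not the OTMA algorithm itself.

```python
import torch

def sinkhorn(cost: torch.Tensor, reg: float = 0.05, n_iters: int = 100) -> torch.Tensor:
    """Entropy-regularized OT plan between two uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    K = torch.exp(-cost / reg)
    a = torch.full((n,), 1.0 / n)
    b = torch.full((m,), 1.0 / m)
    u = torch.full((n,), 1.0 / n)
    v = torch.full((m,), 1.0 / m)
    for _ in range(n_iters):
        u = a / (K @ v).clamp_min(1e-12)
        v = b / (K.T @ u).clamp_min(1e-12)
    return torch.diag(u) @ K @ torch.diag(v)                    # transport plan (n, m)

def align_and_average(global_weight: torch.Tensor,
                      client_weights: list) -> torch.Tensor:
    """Align each client's layer weights to the global layer before averaging.

    Weights are treated as (out_units, in_features); rows (units) of a client
    layer are softly matched to rows of the global layer via the OT plan, so
    that functionally similar units are averaged together.
    """
    aligned = []
    for w in client_weights:
        cost = torch.cdist(global_weight, w)                    # pairwise distances between units
        plan = sinkhorn(cost)
        aligned.append((plan.shape[0] * plan) @ w)              # barycentric projection onto global order
    return torch.stack(aligned).mean(dim=0)
```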

In conclusion, this thesis aims to learn efficiently from mono-modal to multi-modal data for medical image diagnosis by analyzing a series of issues from the data and annotation perspectives, including learning from limited labeled data, label-free training, modality missing, and data decentralization. To tackle these problems, we propose multiple novel methods: TEAR as a strong regularization for better semi-supervised learning, HABIT as a holistic framework to address the biases of pseudo-labeling baselines, MRM++ as an efficient self-supervised pre-training framework for transferable representation learning, D2Net for robust brain tumor segmentation under missing-modality situations, and PLOT as a decentralized training strategy for superior federated incomplete multi-modal learning (FedIML). We consider clinical scenarios involving both mono-modal data, e.g., radiology images, wireless capsule endoscopy images, and histopathology images, and multi-modal data, e.g., multiple MRI sequences, radiology images with text reports, and histopathology images with genetic profiles. For the supervision manner of model training, we consider fully-supervised learning, semi-supervised learning, self-supervised learning, learning with missing modalities, and decentralized training.