Exploiting Interaction and Co-occurrence of Multi-modal Features for Early Fusion in Retrieval
Description

In this project, early fusion is investigated by assessing the interaction and co-occurrence of different modalities prior to semantic retrieval. Specifically, the inter-dependency of speech-based text and visual tokens is exploited by coupling the two modalities for concept detection during retrieval; the detected concepts then serve as semantic filters when answering multimedia queries. To model multi-modal feature interaction, multiple instance learning is proposed, which allows visual-text associations to be learned from a small set of partially labelled examples. Most importantly, the problem of many-to-one region-to-word mapping is addressed through visual flow characterization and an Expectation-Maximization (EM) algorithm, in contrast to state-of-the-art techniques that mostly rely on one-to-one region-to-word translation. In addition to modality interaction, event-based retrieval is examined by characterizing high-level concepts through the co-clustering of multi-modal features, which allows the co-utilization of different modalities to be investigated when answering difficult queries involving multimedia concepts.
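The project description does not detail the EM formulation, but the many-to-one region-to-word association it mentions can be illustrated with a minimal translation-table EM sketch. Everything here is a hypothetical illustration: the function name `em_region_word`, the blob/word identifiers, and the representation of each image as a set of region cluster ids paired with a set of words are all assumptions, not the project's actual method.

```python
# Hypothetical sketch: EM for region-to-word association, where several
# image regions (blobs) may map to the same word (many-to-one mapping).
from collections import defaultdict

def em_region_word(pairs, n_iters=10):
    """pairs: list of (blobs, words) tuples per image.
    Returns t[(w, b)] ~ P(word w | blob b)."""
    words = {w for _, ws in pairs for w in ws}
    blobs = {b for bs, _ in pairs for b in bs}
    # Uniform initialisation of the translation table.
    t = {(w, b): 1.0 / len(words) for w in words for b in blobs}
    for _ in range(n_iters):
        counts = defaultdict(float)   # expected co-occurrence counts
        totals = defaultdict(float)   # per-blob normaliser
        for bs, ws in pairs:
            for b in bs:
                z = sum(t[(w, b)] for w in ws)
                for w in ws:
                    c = t[(w, b)] / z          # E-step: responsibility of w for b
                    counts[(w, b)] += c
                    totals[b] += c
        for (w, b), c in counts.items():
            t[(w, b)] = c / totals[b]          # M-step: renormalise per blob
    return t
```

Run on toy data where blob `b1` consistently co-occurs with "sky" and `b2` with "grass", the table concentrates probability mass on the correct pairings after a few iterations; nothing constrains the mapping to be one-to-one, so multiple blobs can converge onto the same word.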
Effective start/end date: 1/01/07 → 1/03/10