Beyond Data Augmentation: Generative Modeling of Close-to-real Training Examples in Machine Learning through Domain Knowledge Injection
Description
Because current machine learning frameworks are highly data-driven, their applicability to scenarios involving rare yet important classes of events is severely hampered by insufficient training data. Examples include traffic accident prediction through the detection of unusual pedestrian movements, and disease probability estimation from gene expression profiles. Various attempts have been made to synthesize new training examples, including data augmentation techniques that transform existing training data to generate new samples, and Generative Adversarial Network (GAN)-based approaches that match the probability distribution of the synthetic data to the real distribution through adversarial learning. However, these approaches cannot adequately capture the fine-grained relationships across data features and instances that are needed to generate semantically valid training examples. To address this problem, we propose a new framework for synthesizing close-to-real training examples by capturing fine-grained high-order relationships across data features/instances. Starting from preliminary human- and/or machine-annotated prior knowledge in the form of low-order relationships, such as correlations between feature pairs, our framework will extrapolate these links into multiple high-order relationships across features/instances through a new graph convolutional neural network approach. The resulting enriched relationships, such as long-range image region correlations in computer vision and multiple gene subset interactions in bioinformatics, can effectively constrain the data generation process to synthesize semantically valid training examples. A key research question is how to learn an optimized embedded representation for these high-order relationships, and how to seamlessly integrate these representations with the learnt latent representations of the training data to synthesize new examples.
This will significantly broaden the types of prior knowledge available for guiding the data synthesis process, beyond coarse-grained information such as class labels in GANs. Another important research question is how to extend the original GAN discriminator into a generalized critic module that judges whether synthesized examples are semantically valid by determining the presence or absence of the previously discovered high-order feature/instance relationships. In this way, more refined control of the training data synthesis process can be achieved. Given the pervasive data scarcity problem in machine learning and the inability of current approaches to synthesize semantically valid training examples, the proposed framework, which enables the judicious injection and enrichment of human- and/or machine-annotated knowledge in the training data synthesis process, represents a significant advance beyond conventional coarse-grained data synthesis approaches. The resulting high-quality synthetic training data will significantly contribute to downstream machine learning applications that must rely on small training sets, including traffic accident prediction in self-driving vehicles and disease classification based on gene expression profiles.
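As a rough illustration of how pairwise feature links can be extended into longer-range relationships, the following minimal sketch builds a graph from thresholded feature-pair correlations and applies repeated, untrained graph propagation with the symmetric normalization commonly used in graph convolutional layers. It is not the project's method, which is described only at a high level above. The function names, the 0.5 correlation threshold, and the omission of learned weights and nonlinearities are all assumptions made for illustration.

```python
import numpy as np


def correlation_graph(X, threshold=0.5):
    """Low-order prior (assumed form): link feature pairs with |Pearson r| >= threshold.

    X has shape (n_samples, n_features). The threshold is a hypothetical choice.
    """
    corr = np.corrcoef(X, rowvar=False)
    A = (np.abs(corr) >= threshold).astype(float)
    np.fill_diagonal(A, 0.0)  # no self-links in the prior graph
    return A


def normalized_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with added self-loops."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def propagate(A_norm, H, hops=2):
    """Apply `hops` rounds of propagation (no trained weights, no nonlinearity)."""
    for _ in range(hops):
        H = A_norm @ H
    return H


if __name__ == "__main__":
    # Chain of three features: 0 -- 1 -- 2, with no direct link between 0 and 2.
    A = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]], dtype=float)
    A_norm = normalized_adjacency(A)
    one_hop = propagate(A_norm, np.eye(3), hops=1)
    two_hop = propagate(A_norm, np.eye(3), hops=2)
    print("influence of feature 2 on feature 0 after 1 hop:", one_hop[0, 2])
    print("influence of feature 2 on feature 0 after 2 hops:", two_hop[0, 2])
```

In the three-feature chain, features 0 and 2 share no direct link, so the one-hop entry connecting them is zero, while the two-hop entry is positive because information flows through feature 1. A trained graph convolutional network would add learned weight matrices and nonlinearities between the propagation steps.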
Effective start/end date: 1/01/23 → …