Large-scale Multi-label Classification and Its Application to Unstructured Text Data

Project: Research

Project Details

Description

This research project will develop a large-scale multi-label classification framework for text summarization, which aims to create a set of tags capturing the most essential aspects of the original text documents. A novel tagging loss function is introduced to measure the discrepancy between predicted and actual tag sets; it is expressed as a sum of pairwise margins between tags, weighted by their degrees of similarity. On this basis, a regularized empirical loss is constructed to incorporate certain linguistic knowledge and to identify a tagger that maximizes the separations between the pairwise margins. One salient feature of the proposed method is its capability of detecting novel tags absent from a training sample by exploring similarity among existing tags. This is in sharp contrast to most existing summarization methods, which may completely ignore novel tags. The PI will investigate the theoretical properties of the proposed summarization method and establish asymptotic and finite-sample upper bounds on its tagging error. The PI will also develop efficient computing algorithms to facilitate large-scale optimization, integrating the strengths of the inexact alternating direction method of multipliers and parallel computing platforms. The proposed method will be applied to summarize the Reuters dataset, which consists of over 800,000 news stories.
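
As a rough illustration of what a similarity-weighted sum of pairwise margins could look like, the sketch below computes a hinge-type loss over pairs of one actual tag and one non-actual tag. It is not the project's loss function: the name `tagging_loss`, the hinge form, the `margin` parameter, and the weighting `1 - similarity` (so that confusing two similar tags costs less) are all assumptions made for illustration.

```python
import numpy as np

def tagging_loss(scores, true_tags, similarity, margin=1.0):
    """Illustrative similarity-weighted pairwise hinge loss (hypothetical).

    scores:     (K,) real-valued score per tag from some tagger
    true_tags:  (K,) 0/1 indicator of the actual tag set
    similarity: (K, K) tag similarity matrix with entries in [0, 1]
    """
    scores = np.asarray(scores, dtype=float)
    true_tags = np.asarray(true_tags, dtype=bool)
    similarity = np.asarray(similarity, dtype=float)

    pos = np.flatnonzero(true_tags)    # indices of actual tags
    neg = np.flatnonzero(~true_tags)   # indices of all other tags
    if pos.size == 0 or neg.size == 0:
        return 0.0                     # no (actual, other) pairs to compare

    # Pairwise margin: score of an actual tag minus score of another tag.
    margins = scores[pos][:, None] - scores[neg][None, :]
    # Assumed weighting: dissimilar pairs are penalized more heavily.
    weights = 1.0 - similarity[np.ix_(pos, neg)]
    # Hinge penalty whenever a pairwise margin falls below `margin`.
    hinge = np.maximum(0.0, margin - margins)
    return float(np.sum(weights * hinge))

# Toy example with three tags, of which only the first is an actual tag.
example_similarity = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
example_loss = tagging_loss([2.0, 0.0, 1.5], [1, 0, 0], example_similarity)
```

In the toy example, the second tag is separated from the actual tag by more than the margin and contributes nothing, while the third tag scores too close to the actual tag and is penalized in proportion to how dissimilar the two tags are.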
Project number: 9042394
Grant type: GRF
Status: Finished
Effective start/end date: 1/01/17 to 1/12/20

Keywords

  • Statistical learning, Classification, Multiple responses, RKHS, Regularization
