Sequence Labeling with Multiple Annotations
多標簽序列標註
Student thesis: Doctoral Thesis
Author(s)
Related Research Unit(s)
Detail(s)
Awarding Institution  

Supervisors/Advisors 

Award date  11 Sept 2020 
Link(s)
Permanent Link  https://scholars.cityu.edu.hk/en/theses/theses(4862291ce7ed42df91249e4359d9d963).html 

Abstract
Sequence labeling, which assigns a label to each token in a given input sequence, has been widely applied in Natural Language Processing and Computational Biology. Many effective sequence models, such as Conditional Random Fields and structured SVMs, have delivered promising results in tagging sequential data. In the supervised learning framework, training these models requires a large amount of sequential data with exact annotations, which is costly and laborious to produce. In recent years, crowdsourcing has received increasing attention because it provides an efficient and cheap way to obtain large labeled datasets from a group of ordinary people. Extending crowdsourcing to sequence labeling can accelerate the construction of sequential datasets. However, traditional sequence models cannot be directly applied to input sequences with multiple annotations. This thesis focuses on sequence labeling with multiple annotations.
By analyzing the graph structure of existing probabilistic models for sequence labeling, we propose Semi-Markov Conditional Random Fields with Duration Modeling (DMSMCRFs) and apply them to keyphrase extraction. First, by assuming that state transition and state duration are independent, DMSMCRFs model the distribution of keyphrase duration in order to exploit state duration information further. Since a keyphrase is more likely to consist of a specific number of words (e.g. two words), explicitly modeling keyphrase duration helps determine the length of a keyphrase. Second, a constrained Viterbi algorithm is derived to improve the effectiveness of decoding in DMSMCRFs. Based on the convexity of the parametric duration feature derived from the duration distribution, subpaths that have no chance of yielding the best predecessor of state KP can be pruned. Third, to demonstrate the effectiveness of the proposed model, we collect datasets from various domains, such as Psychology, Economics and History. The experimental results show that the proposed approach outperforms traditional methods.
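As a rough illustration of how a duration term can enter semi-Markov decoding, the sketch below runs a plain segment-level Viterbi search in which each segment's score is the sum of a segment score, a transition score, and a log duration probability (reflecting the independence assumption between transition and duration). It is not the thesis's constrained Viterbi algorithm: no convexity-based pruning is performed, and the names `semi_markov_viterbi`, `seg_score`, `trans_score`, `dur_logp`, and the toy scores are all assumptions made for this example.

```python
import math

def semi_markov_viterbi(n, labels, seg_score, trans_score, dur_logp, max_dur):
    """Best-scoring segmentation of n tokens into labeled segments.

    A segment (s, t, y) covers tokens s..t-1 with label y and contributes
    seg_score(s, t, y) + dur_logp(y, t - s), plus trans_score(prev, y)
    when it follows a segment labeled prev.
    """
    # best[t][y]: best score for tokens[0:t] whose last segment has label y
    best = [{y: -math.inf for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]
    for t in range(1, n + 1):
        for y in labels:
            for d in range(1, min(max_dur, t) + 1):
                s = t - d
                local = seg_score(s, t, y) + dur_logp(y, d)
                if s == 0:
                    cand, prev = local, None
                else:
                    prev = max(labels, key=lambda yp: best[s][yp] + trans_score(yp, y))
                    cand = best[s][prev] + trans_score(prev, y) + local
                if cand > best[t][y]:
                    best[t][y] = cand
                    back[t][y] = (s, prev)
    # Backtrack from the best final label.
    y = max(labels, key=lambda lab: best[n][lab])
    score = best[n][y]
    segments, t = [], n
    while t > 0:
        s, prev = back[t][y]
        segments.append((s, t, y))
        t, y = s, prev
    return score, segments[::-1]

# Toy example (hypothetical scores): tokens 1 and 2 form a keyphrase ("KP").
kp_tokens = {1, 2}

def seg_score(s, t, y):
    # +1 for every token whose label matches the toy truth, -1 otherwise.
    return float(sum(1 if ((i in kp_tokens) == (y == "KP")) else -1
                     for i in range(s, t)))

dur_table = {1: 0.2, 2: 0.7, 3: 0.1}  # assumed keyphrase length distribution

def dur_logp(y, d):
    # Only keyphrase segments carry a duration term in this toy setup.
    return math.log(dur_table[d]) if y == "KP" else 0.0

score, segments = semi_markov_viterbi(
    4, ["KP", "O"], seg_score, lambda a, b: 0.0, dur_logp, max_dur=3)
```

Because two-word keyphrases have the highest assumed duration probability, the decoder prefers a single two-token KP segment over two adjacent one-token KP segments, which is the effect that explicit duration modeling is meant to have.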
However, existing sequence labeling models require exact annotations. In practice, we are often given sequential data with multiple annotations collected from crowdsourcing platforms. Unlike the tagging of independent instances, the quality of a crowd-annotated label sequence depends on each annotator's expertise in capturing internal dependencies. In this thesis, we propose modeling Sequential Annotation for Sequence Labeling with Crowds (SASLC). First, a conditional probabilistic model is developed to jointly model sequential data and annotators' expertise, in which a categorical distribution is introduced to estimate the reliability of each annotator in capturing local and non-local label dependencies for sequential annotation. To accelerate the marginalization of the proposed model, a Valid Label Sequence Inference (VLSE) method is proposed to derive the valid ground-truth label sequences from crowd sequential annotations. VLSE derives possible ground-truth labels at the token-wise level and further prunes subpaths in the forward inference for label sequence decoding. VLSE reduces the number of candidate label sequences and improves the quality of the possible ground-truth label sequences. The experimental results on several Natural Language Processing sequence labeling tasks show the effectiveness of the proposed model.
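The idea of deriving token-wise candidates and pruning subpaths during a forward pass can be sketched as follows. This is only an illustration under stated assumptions, not the VLSE procedure itself: the candidate set at each token is taken to be the union of the annotators' labels, and "valid" is assumed to mean consistency with the BIO tagging scheme. The function names `bio_valid` and `candidate_sequences` and the toy annotations are hypothetical.

```python
def bio_valid(prev, cur):
    """Assumed validity rule: I-X may only follow B-X or I-X.

    prev is None at the start of the sequence.
    """
    if cur.startswith("I-"):
        return prev is not None and prev[0] in "BI" and prev[2:] == cur[2:]
    return True

def candidate_sequences(annotations, is_valid=bio_valid):
    """Enumerate label sequences built from token-wise candidate labels.

    annotations: one label sequence per annotator, all of equal length.
    Prefixes that violate is_valid are dropped during the forward pass,
    so they are never extended.
    """
    # Token-wise candidates: every label some annotator gave at that position.
    candidates = [sorted(set(column)) for column in zip(*annotations)]
    prefixes = [[]]
    for labels in candidates:
        prefixes = [p + [y]
                    for p in prefixes
                    for y in labels
                    if is_valid(p[-1] if p else None, y)]
    return prefixes

# Toy crowd annotations from three hypothetical annotators.
annotations = [
    ["B-PER", "I-PER", "O"],
    ["B-PER", "O",     "O"],
    ["O",     "I-PER", "O"],
]
valid = candidate_sequences(annotations)
```

In this toy case the token-wise candidates admit four combinations, and the forward pruning discards the one that starts an entity with an I- tag, including the third annotator's own sequence.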
Unlike crowd sequential annotation, partial sequence labeling assumes that the ground-truth label sequence is masked by multiple annotations. Identifying the ground-truth label from ambiguous annotations is therefore more important for partial sequence labeling models. Existing disambiguation strategies for partial sequence labeling do not generalize well when some candidates are false positives or are similar to the ground-truth label. In this thesis, we propose a novel Weak Disambiguation for Partial Sequence Labeling (WDPSL). First, a piecewise large-margin formulation is generalized to partial sequence labeling, which effectively avoids handling the large number of candidate structured outputs that arise for complex structures. Second, in the proposed weak disambiguation strategy, each candidate label is assigned a confidence value indicating how likely it is to be the true label, which aims to reduce the negative effects of wrong ground-truth label assignments in the learning process. Two large margins are then formulated to combine two types of constraints: the disambiguation between candidates and non-candidates, and the weak disambiguation among candidates. Within the framework of alternating optimization, a new 2n-slack-variable cutting-plane algorithm is developed to accelerate each iteration of the optimization. We conduct experiments on Part-of-Speech tagging and Chunking to verify the proposed model.
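To make the notion of a per-candidate confidence value concrete, the following minimal sketch assigns confidences by a softmax over the current model scores, restricted to the candidate set, so that non-candidates receive zero confidence. The abstract does not state how WDPSL computes its confidence values; the softmax rule, the function name `candidate_confidences`, and the toy scores are assumptions standing in for that step.

```python
import math

def candidate_confidences(scores, candidates):
    """Confidence for each label: softmax over candidate labels only.

    scores: dict mapping every label to a model score for one token.
    candidates: the labels offered by the ambiguous annotation.
    Non-candidate labels get confidence 0.0.
    """
    top = max(scores[y] for y in candidates)  # subtract max for stability
    exp = {y: math.exp(scores[y] - top) for y in candidates}
    total = sum(exp.values())
    return {y: (exp[y] / total if y in candidates else 0.0) for y in scores}

# Toy token whose annotation leaves two candidates, NN and VB.
scores = {"NN": 2.0, "VB": 1.0, "JJ": 3.0}
conf = candidate_confidences(scores, candidates={"NN", "VB"})
```

In an alternating scheme of the kind the abstract describes, a step like this would be interleaved with a parameter update that treats the confidences as fixed.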
Existing partial sequence labeling models mainly rely on the max-margin framework, which fails to provide an uncertainty estimate for the prediction. Furthermore, the unique ground-truth disambiguation strategy employed by these models may introduce wrong label information into parameter learning. In this thesis, we propose Structured Gaussian Processes for Partial Sequence Labeling (SGPPSL), which encodes uncertainty in the prediction and does not need extra effort for model selection and hyperparameter learning. The model employs a factor-as-piece approximation that divides the linear-chain graph structure into a set of pieces, which preserves the basic Markov Random Field structure and effectively avoids handling the large number of candidate output sequences generated by partially annotated data. A confidence measure is then introduced in the model to account for the different contributions of candidate labels, which enables the ground-truth label information to be utilized in parameter learning. Based on the derived lower bound of the variational lower bound of the evidence for the proposed model, the variational parameters and confidence measures are estimated within the framework of alternating optimization. Moreover, a weighted Viterbi algorithm is proposed to incorporate the confidence measure into sequence prediction, which takes into account the label ambiguity arising from multiple annotations in the training data and thus helps improve performance. SGPPSL is evaluated on several sequence labeling tasks, and the experimental results show the effectiveness of the proposed model.
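One simple way a confidence measure could be folded into Viterbi decoding is sketched below: the log of a per-label weight is added to the emission score at every position. The abstract does not specify how the thesis's weighted Viterbi algorithm applies its confidence measure, so the additive log-weight form, the function name `weighted_viterbi`, and the toy numbers are assumptions.

```python
import math

def weighted_viterbi(emissions, transitions, label_weights):
    """Viterbi decoding with an extra log-weight per label.

    emissions: list of dicts, one per token, mapping label -> score.
    transitions: dict mapping (prev_label, label) -> score.
    label_weights: dict mapping label -> positive weight (assumed to
        summarize label confidence learned from training data).
    """
    labels = list(label_weights)
    best = {y: emissions[0][y] + math.log(label_weights[y]) for y in labels}
    backptrs = []
    for em in emissions[1:]:
        new_best, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda yp: best[yp] + transitions[(yp, y)])
            new_best[y] = (best[prev] + transitions[(prev, y)]
                           + em[y] + math.log(label_weights[y]))
            ptr[y] = prev
        best = new_best
        backptrs.append(ptr)
    # Backtrack from the best final label.
    y = max(labels, key=best.get)
    path = [y]
    for ptr in reversed(backptrs):
        y = ptr[y]
        path.append(y)
    return path[::-1]

# Toy two-token example with flat transitions.
emissions = [{"A": 1.0, "B": 0.9}, {"A": 0.5, "B": 0.6}]
transitions = {(p, c): 0.0 for p in "AB" for c in "AB"}
plain = weighted_viterbi(emissions, transitions, {"A": 1.0, "B": 1.0})
weighted = weighted_viterbi(emissions, transitions, {"A": 0.5, "B": 1.0})
```

With uniform weights the decoder reduces to ordinary Viterbi, while down-weighting label A flips the first token's prediction to B in this toy setting.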
In summary, we explore the information contained in multiple annotations from the perspectives of the annotators and of the label distribution. The proposed models, which are verified by extensive comparisons, effectively solve the problems addressed in sequence labeling with multiple annotations.