Abstract
Despite tremendous recent progress on natural language inference (NLI), driven largely by large-scale investment in new datasets (e.g., SNLI, MNLI) and advances in modeling, most of that progress has been limited to English because reliable datasets are lacking for most of the world’s languages. In this paper, we present the first large-scale NLI dataset for Chinese, the Original Chinese Natural Language Inference dataset (OCNLI), consisting of ∼56,000 annotated sentence pairs. Unlike recent attempts at extending NLI to other languages, our dataset does not rely on any automatic translation or non-expert annotation. Instead, we elicit annotations from native speakers specializing in linguistics. We closely follow the annotation protocol used for MNLI, but create new strategies for eliciting diverse hypotheses. We establish several baseline results on our dataset using state-of-the-art pre-trained models for Chinese, and find that even the best-performing models are far outpaced by human performance (∼12% absolute performance gap), making OCNLI a challenging new resource that we hope will help accelerate progress in Chinese natural language understanding. To the best of our knowledge, this is the first human-elicited MNLI-style corpus for a non-English language.
© 2020 Association for Computational Linguistics
| Original language | English |
|---|---|
| Title of host publication | Findings of the Association for Computational Linguistics (Findings of ACL) |
| Subtitle of host publication | EMNLP 2020 |
| Editors | Trevor Cohn, Yulan He, Yang Liu |
| Publisher | Association for Computational Linguistics |
| Pages | 3512-3526 |
| Number of pages | 15 |
| ISBN (Electronic) | 978-1-952148-90-3 |
| Publication status | Published - Nov 2020 |
| Externally published | Yes |
| Event | 2020 Conference on Empirical Methods in Natural Language Processing, Virtual, Online; Duration: 16 Nov 2020 → 20 Nov 2020; https://aclanthology.org/2020.emnlp-main.0/, https://aclanthology.org/volumes/2020.findings-emnlp/ |
Publication series
| Name | Findings of the Association for Computational Linguistics (Findings of ACL): EMNLP |
|---|---|
Conference
| Conference | 2020 Conference on Empirical Methods in Natural Language Processing |
|---|---|
| Abbreviated title | EMNLP 2020 |
| City | Virtual, Online |
| Period | 16/11/20 → 20/11/20 |
Funding
This work was supported by the CLUE benchmark and the Grant-in-Aid of Doctoral Research from Indiana University Graduate School. Special thanks to the beaker team at AI2 for providing technical support for the beaker experiment platform. Computations on beaker.org were supported in part by credits from Google Cloud.
Publisher's Copyright Statement
- This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/