OCNLI: Original Chinese Natural Language Inference

Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra Kübler, Lawrence S. Moss

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review

66 Citations (Scopus)
4 Downloads (CityUHK Scholars)

Abstract

Despite the tremendous recent progress on natural language inference (NLI), driven largely by large-scale investment in new datasets (e.g., SNLI, MNLI) and advances in modeling, most progress has been limited to English due to a lack of reliable datasets for most of the world’s languages. In this paper, we present the first large-scale NLI dataset (consisting of ∼56,000 annotated sentence pairs) for Chinese called the Original Chinese Natural Language Inference dataset (OCNLI). Unlike recent attempts at extending NLI to other languages, our dataset does not rely on any automatic translation or non-expert annotation. Instead, we elicit annotations from native speakers specializing in linguistics. We follow closely the annotation protocol used for MNLI, but create new strategies for eliciting diverse hypotheses. We establish several baseline results on our dataset using state-of-the-art pre-trained models for Chinese, and find even the best performing models to be far outpaced by human performance (∼12% absolute performance gap), making it a challenging new resource that we hope will help to accelerate progress in Chinese natural language understanding. To the best of our knowledge, this is the first human-elicited MNLI-style corpus for a non-English language.

© 2020 Association for Computational Linguistics

Original language: English
Title of host publication: Findings of the Association for Computational Linguistics Findings of ACL
Subtitle of host publication: EMNLP 2020
Editors: Trevor Cohn, Yulan He, Yang Liu
Publisher: Association for Computational Linguistics
Pages: 3512-3526
Number of pages: 15
ISBN (Electronic): 978-1-952148-90-3
Publication status: Published - Nov 2020
Externally published: Yes
Event: 2020 Conference on Empirical Methods in Natural Language Processing - Virtual, Online
Duration: 16 Nov 2020 to 20 Nov 2020
https://aclanthology.org/2020.emnlp-main.0/
https://aclanthology.org/volumes/2020.findings-emnlp/

Publication series

Name: Findings of the Association for Computational Linguistics Findings of ACL: EMNLP

Conference

Conference: 2020 Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP 2020
City: Virtual, Online
Period: 16/11/20 to 20/11/20

Funding

This work was supported by the CLUE benchmark and the Grant-in-Aid of Doctoral Research from Indiana University Graduate School. Special thanks to the beaker team at AI2 for providing technical support for the beaker experiment platform. Computations on beaker.org were supported in part by credits from Google Cloud.

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
