A Sentence is Worth 128 Pseudo Tokens: A Semantic-Aware Contrastive Learning Framework for Sentence Embeddings

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review


Detail(s)

Original language: English
Title of host publication: Findings of the Association for Computational Linguistics
Subtitle of host publication: ACL 2022
Publisher: Association for Computational Linguistics
Pages: 246-256
ISBN (Print): 978-1-955917-25-4
Publication status: Published - 2022

Conference

Title: 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)
Location: Hybrid
Place: Ireland
City: Dublin
Period: 22 - 27 May 2022


Abstract

Contrastive learning has shown great potential in unsupervised sentence embedding tasks, e.g., SimCSE (Gao et al., 2021). However, we find that these existing solutions are heavily affected by superficial features like the length of sentences or syntactic structures. In this paper, we propose a semantics-aware contrastive learning framework for sentence embeddings, termed Pseudo-Token BERT (PTBERT), which is able to exploit the pseudo-token space (i.e., latent semantic space) representation of a sentence while eliminating the impact of superficial features such as sentence length and syntax. Specifically, we introduce an additional pseudo token embedding layer independent of the BERT encoder to map each sentence into a sequence of pseudo tokens of fixed length. Leveraging these pseudo sequences, we are able to construct same-length positive and negative pairs based on the attention mechanism to perform contrastive learning. In addition, we utilize both the gradient-updating and momentum-updating encoders to encode instances while dynamically maintaining an additional queue to store the representation of sentence embeddings, enhancing the encoder’s learning performance for negative examples. Experiments show that our model outperforms the state-of-the-art baselines on six standard semantic textual similarity (STS) tasks. Furthermore, experiments on alignment and uniformity losses, as well as hard examples with different sentence lengths and syntax, consistently verify the effectiveness of our method.
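
The mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single-head dot-product attention in which learnable pseudo-token embeddings act as queries over the encoder's token states, a MoCo-style momentum update, a first-in-first-out negative queue, and an InfoNCE-style loss. All function names, the hidden size, the momentum value, and the temperature are hypothetical choices made for illustration; only the pseudo-sequence length of 128 comes from the title.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def to_pseudo_sequence(token_states, pseudo_tokens):
    """Map (n, d) token states to an (m, d) pseudo sequence.

    Assumption: pseudo tokens are attention queries and the encoder's
    token states are keys and values, so the output length m does not
    depend on the sentence length n.
    """
    d = token_states.shape[-1]
    scores = pseudo_tokens @ token_states.T / np.sqrt(d)  # (m, n)
    return softmax(scores, axis=-1) @ token_states        # (m, d)

def momentum_update(key_params, query_params, m=0.99):
    # The momentum encoder slowly tracks the gradient-updated encoder.
    return m * key_params + (1.0 - m) * query_params

def enqueue(queue, new_keys, max_size):
    # Append the newest embeddings and keep only the latest max_size rows.
    return np.concatenate([queue, new_keys], axis=0)[-max_size:]

def info_nce(query, positive, negatives, temperature=0.05):
    # Contrastive loss: one positive against the queued negatives.
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    q, p, n = unit(query), unit(positive), unit(negatives)
    logits = np.concatenate([[q @ p], n @ q]) / temperature
    logits = logits - logits.max()
    return -(logits[0] - np.log(np.exp(logits).sum()))

# Two sentences of different lengths map to same-length pseudo sequences.
rng = np.random.default_rng(0)
pseudo_tokens = rng.normal(size=(128, 16))  # 128 pseudo tokens, hidden size 16 (assumed)
short = to_pseudo_sequence(rng.normal(size=(5, 16)), pseudo_tokens)
long = to_pseudo_sequence(rng.normal(size=(40, 16)), pseudo_tokens)
print(short.shape, long.shape)
```

Because the number of queries fixes the output length, a 5-token sentence and a 40-token sentence both yield a 128-row pseudo sequence in this sketch, which is the property the abstract relies on to build same-length positive and negative pairs.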


Citation Format(s)

A Sentence is Worth 128 Pseudo Tokens: A Semantic-Aware Contrastive Learning Framework for Sentence Embeddings. / Tan, Haochen; Shao, Wei; Wu, Han et al.
Findings of the Association for Computational Linguistics: ACL 2022. Association for Computational Linguistics, 2022. p. 246-256.

