Tokenizer Effect on Functional Material Prediction: Investigating Contextual Word Embeddings for Knowledge Discovery

Tong Xie, Yuwei Wan, Ke Lu, Wenjie Zhang, Chunyu Kit*, Bram Hoex*

*Corresponding author for this work

Research output: Conference Papers; RGC 32 - Refereed conference paper (without host publication); peer-review

Abstract

Exploring the predictive capabilities of natural language processing models in material science is a subject of ongoing interest. This study examines material property prediction, relying on models to extract latent knowledge from compound names and material properties. We assessed various methods for contextual embeddings and explored pre-trained models like BERT and GPT. Our findings indicate that using information-dense embeddings from the third layer of domain-specific BERT models, such as MatBERT, combined with the context-average method, is the optimal approach for utilizing unsupervised word embeddings from material science literature to identify material-property relationships. The stark contrast between the domain-specific MatBERT and the general BERT model emphasizes the value of domain-specific training and tokenization for material prediction. Our research identifies a "tokenizer effect", highlighting the importance of specialized tokenization techniques to capture material names effectively during the pretraining phase. We discovered that a tokenizer which preserves compound names entirely, while maintaining a consistent token count, enhances the efficacy of context-aware embeddings in functional material prediction.
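
The "tokenizer effect" described in the abstract can be illustrated with a small sketch. The code below is not the authors' pipeline and does not use the real BERT or MatBERT vocabularies. It is a minimal greedy longest-match-first (WordPiece-style) splitter applied to two hypothetical toy vocabularies, `general_vocab` and `domain_vocab`, which are assumptions made purely for illustration. It shows how a vocabulary lacking a compound name fragments it into sub-word pieces, while a vocabulary that contains the name keeps it as a single token.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lowercased word with greedy longest-match-first matching.

    Continuation pieces carry the conventional '##' prefix. If no piece
    matches at some position, the whole word maps to the unknown token.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabularies (assumptions, not the real model vocabularies).
general_vocab = {"ga", "##as", "cd", "##te"}
# The domain vocabulary additionally stores the compound names whole.
domain_vocab = general_vocab | {"gaas", "cdte"}

for name in ("gaas", "cdte"):
    print(name,
          wordpiece_tokenize(name, general_vocab),
          wordpiece_tokenize(name, domain_vocab))
```

With the general toy vocabulary, each compound name is represented by two sub-word embeddings that must be pooled in some way. With the domain toy vocabulary, each name maps to exactly one token, which is the situation the abstract associates with better context-aware embeddings.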

Original language: English
Number of pages: 14
Publication status: Published - Dec 2023
Event: 37th Conference on Neural Information Processing Systems (NeurIPS 2023) - New Orleans Ernest N. Morial Convention Center, New Orleans, United States
Duration: 10 Dec 2023 to 16 Dec 2023
https://papers.nips.cc/paper_files/paper/2023
https://nips.cc/Conferences/2023

Conference

Conference: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
Abbreviated title: NIPS '23
Country/Territory: United States
City: New Orleans
Period: 10/12/23 to 16/12/23
