From Material Descriptions to Embeddings: Language Models for Materials Discovery

Student thesis: Doctoral Thesis

Abstract

The technical language of science often requires different forms of description to serve distinct communicative functions and bridge knowledge gaps between experts and learners. In materials science, materials are described using various symbolic representations beyond ordinary language, such as chemical compositions (e.g., “Fe2O3”), material names (e.g., “silicon carbide”), and SMILES strings (Simplified Molecular Input Line Entry System, e.g., “CN1CCC[C@H]1c2cccnc2”), each providing a distinct perspective on the material in question. These material descriptions are dispersed across scientific literature, curated datasets, and patent documents, encompassing synthesis protocols, experimental procedures, property measurements, and industrial applications. A critical challenge in materials science is to develop systematic approaches that bridge these disparate representations, thereby enabling the integration of implicit domain knowledge into downstream applications, particularly prediction tasks such as property forecasting and synthesis planning. Robust methods capable of recognizing and unifying heterogeneous forms of material knowledge are needed to address this challenge.

Recent advances in natural language processing (NLP) offer promising approaches to systematically extract and represent fragmented material knowledge from unstructured sources. NLP techniques such as word embeddings and contextual language models (LMs) are largely grounded in the distributional hypothesis: words that appear in similar contexts tend to have similar meanings. Large language models (LLMs) further extend this principle through more complex architectures for sequence modeling and reasoning. This fundamental idea aligns naturally with materials science, where materials with similar compositions, structures, or properties often appear in similar scientific contexts. Although NLP models have shown impressive abilities in processing human language, applying them to materials science requires rigorous investigation of domain-specific challenges, with particular attention to the design of data curation and preprocessing, model architecture selection, domain-specific training protocols, and evaluation metrics. This in turn calls for consideration of the nature of the material system at hand, along with systematic experiments that evaluate how effectively various approaches unify different material representations, so as to establish the best mapping between integration strategies and specific materials discovery tasks. In this context, this dissertation, through three interconnected studies, explores how materials science can leverage text embeddings and LMs to bridge scattered knowledge and fragmented representations, with the goal of improving knowledge integration and supporting materials discovery.
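The distributional hypothesis above can be illustrated with a minimal sketch: materials that occur in similar sentence contexts receive similar co-occurrence vectors. All sentences, material names, and counts below are invented toy examples, not data from the thesis corpus.

```python
# Toy illustration of the distributional hypothesis for materials text:
# materials sharing sentence contexts get similar co-occurrence vectors.
import math
from collections import Counter

corpus = [
    "Fe2O3 is a stable oxide used in thermoelectric devices",
    "Bi2Te3 is a stable compound used in thermoelectric devices",
    "polyethylene is a flexible polymer used in packaging films",
]

vocab = sorted({w for s in corpus for w in s.split()})

def context_vector(word):
    # Count co-occurrences of `word` with every other vocabulary term
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return [counts[v] for v in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Two thermoelectric compounds share most context words; the polymer does not.
sim_oxide_telluride = cosine(context_vector("Fe2O3"), context_vector("Bi2Te3"))
sim_oxide_polymer = cosine(context_vector("Fe2O3"), context_vector("polyethylene"))
print(sim_oxide_telluride > sim_oxide_polymer)  # shared contexts give higher similarity
```

Real word embeddings learn dense vectors rather than raw counts, but the underlying signal is the same: contextual company determines semantic proximity.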

The first study (Chapter 3) lays the foundation by demonstrating how static word embeddings can be leveraged for materials discovery through two complementary experiments. The first trains Word2vec embeddings to identify potential passivating contact materials, showing how basic word embedding techniques can capture meaningful material–property relationships in a highly specialized subdomain. The second trains a novel Scientific Sentiment Network (SSNet), using pre-trained Mat2vec embeddings as input features, to extract expert opinions from scientific literature and accurately categorize them as “challenges” or “opportunities”. Transformed into quantifiable sentiment features, these categorized opinions serve as input parameters for multiple downstream tasks, demonstrating remarkable versatility. Together, the two experiments establish a comprehensive framework for applying Word2vec embeddings to materials discovery.
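Embedding-based screening of the kind described in Chapter 3 can be sketched as ranking candidate materials by cosine similarity to a target property term. The 3-dimensional vectors below are invented stand-ins; a real run would load Word2vec or Mat2vec vectors trained on a materials corpus.

```python
# Sketch of embedding-based candidate screening: rank materials by
# cosine similarity to a property term. All vectors are hypothetical
# placeholders, not trained embeddings.
import math

embeddings = {
    "passivating": [0.9, 0.1, 0.2],   # target property term
    "SiO2":        [0.8, 0.2, 0.1],   # invented candidate vectors
    "Al2O3":       [0.7, 0.3, 0.2],
    "NaCl":        [0.1, 0.9, 0.4],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

query = embeddings["passivating"]
ranking = sorted(
    (m for m in embeddings if m != "passivating"),
    key=lambda m: cosine(embeddings[m], query),
    reverse=True,
)
print(ranking)  # most property-relevant candidates first
```

With trained embeddings, the same loop over thousands of material names produces a shortlist for expert review, which is the essence of embedding-driven discovery.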

The second study (Chapter 4) investigates the application of dynamic embeddings to material property prediction. The first step examines embeddings from various models, including Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT), on thermoelectric material ranking prediction (TMRP), a typical task in artificial intelligence (AI) for materials science. The results reveal a critical “tokenizer effect”: excessive subword tokenization of chemical compositions leads to information loss, thereby limiting model performance. Using the [CLS] token embedding of corpus sentences as a substitute for material name embeddings, or employing domain-specific tokenizers, can both effectively alleviate this issue. The second step develops a novel sentence embedding model, SentMatBERT_MNR, to further enhance the effectiveness of dynamic embeddings. Combining a materials-specific BERT with specialized pooling layers, the model is fine-tuned in a contrastive learning framework using natural language inference (NLI) triplets and material description pairs. Its embeddings achieve a Spearman correlation of 0.59 on the TMRP task, outperforming two baselines: Density Functional Theory, a computational quantum modelling method for calculating electronic structures (0.31), and Word2vec (0.52). In the third step, SentMatBERT_MNR embeddings power the retrieval module of a retrieval-augmented generation (RAG) framework, enabling the identification of similar materials. For the task of predicting the synthesis protocol of a target material, the synthesis protocols of its similar materials are supplied as supplementary knowledge to generative models. Evaluating synthesis protocols generated by LLMs enriched with SentMatBERT_MNR retrieval, we demonstrate that the enhanced retrieval mechanism significantly improves predictive performance, validating the model’s potential for materials discovery.
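The tokenizer effect can be made concrete with a minimal greedy longest-match subword tokenizer in the style of WordPiece. Both vocabularies below are tiny invented examples, not any real model's vocabulary: a generic vocabulary fragments the formula into chemically meaningless pieces, while a domain vocabulary keeps it intact.

```python
# Sketch of the "tokenizer effect" on chemical formulas: a greedy
# longest-match subword tokenizer fragments a formula unless the
# vocabulary contains domain tokens. Vocabularies are toy examples.
def subword_tokenize(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        # Take the longest vocabulary entry matching at position i;
        # fall back to a single character if nothing matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

generic_vocab = {"Fe", "O"}                # no whole-formula entry
domain_vocab = generic_vocab | {"Fe2O3"}   # domain tokenizer keeps it whole

print(subword_tokenize("Fe2O3", generic_vocab))  # ['Fe', '2', 'O', '3']
print(subword_tokenize("Fe2O3", domain_vocab))   # ['Fe2O3']
```

When a composition is split into four unrelated pieces, its embedding is assembled from fragments that carry little chemical meaning, which is the information loss the study identifies; a domain tokenizer, or the sentence-level [CLS] embedding, sidesteps the fragmentation.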

Unlike the first two studies, which directly employ material name representations as input features for downstream tasks, the third study (Chapter 5) explores alternative approaches to materials knowledge acquisition through two experiments. The first investigates the inherent potential of LLMs to bridge diverse material representations by means of multi-task fine-tuning. Its results verify that pre-trained language models (PLMs) can effectively integrate various material representations, from chemical formulas to SMILES notations. This finding suggests that LMs’ understanding of scientific concepts builds on the general language capabilities that underpin their domain-specific knowledge. By merging diverse representations within a unified foundation model, this approach offers a pathway toward universal material representation. The second experiment explores the integration of material name embeddings with structural representations in experimental bandgap prediction tasks, where each modality contributes complementary information that improves prediction accuracy. This underscores the untapped potential of combining linguistic knowledge with structural information in materials science, pointing toward a new paradigm of materials prediction in which material names serve not merely as identifiers but as valuable sources of implicit domain knowledge that can enhance the accuracy and reliability of property prediction models.
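One simple form of the multimodal integration described above is early fusion: concatenating a text-derived name embedding with a structural descriptor before prediction. The sketch below uses a 1-nearest-neighbour lookup in the fused space; all vectors and bandgap values are invented placeholders, not measured data or the thesis's actual model.

```python
# Sketch of early fusion for bandgap prediction: concatenate a name
# embedding with a structural descriptor, then predict via a
# 1-nearest-neighbour lookup. All numbers are invented placeholders.
import math

# material -> (name embedding, structural descriptor, bandgap in eV)
training = {
    "GaAs": ([0.8, 0.1], [0.30, 0.70], 1.42),
    "Si":   ([0.7, 0.2], [0.25, 0.65], 1.12),
    "NaCl": ([0.1, 0.9], [0.90, 0.10], 8.50),
}

def fuse(name_vec, struct_vec):
    # Early fusion: simple concatenation of the two modalities
    return name_vec + struct_vec

def predict_bandgap(name_vec, struct_vec):
    query = fuse(name_vec, struct_vec)
    def dist(entry):
        return math.dist(query, fuse(entry[0], entry[1]))
    nearest = min(training.values(), key=dist)
    return nearest[2]  # bandgap of the closest fused neighbour

# A hypothetical query material resembling Si in both modalities
print(predict_bandgap([0.72, 0.18], [0.26, 0.66]))
```

A practical system would replace the nearest-neighbour step with a learned regressor, but the fusion step, letting each modality contribute its own dimensions, is the core of the complementarity argument.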

In summary, this thesis establishes comprehensive frameworks for leveraging NLP techniques to accelerate materials discovery, with a particular focus on harnessing the potential of LMs and aligned representations in scientific applications. It also lays a foundation for data-driven advances in related fields, in the hope of promoting more efficient solutions to pressing global challenges with the aid of AI for materials science. Future directions include enhancing LLM integration with experimental and computational pipelines, designing multimodal models that combine textual, structural, and spectroscopic data, and fostering collaborative platforms for AI-powered materials innovation.
Date of Award: 2 Sept 2025
Original language: English
Awarding Institution: City University of Hong Kong
Supervisor: Chun Yu KIT
