Discovering New Molecules and Materials Using Machine Learning


Student thesis: Doctoral Thesis

Award date: 4 Jan 2023


Optimizing reaction conditions and designing new molecules or materials are crucial for solving energy and environmental challenges in human society. Testing different reaction conditions or synthesizing new molecules and materials is costly in time and resources. High-throughput screening based on theoretical calculations is a practical strategy for making experiments cost-effective. The screening is usually combined with machine learning (ML) methods, which can reduce the cost of theoretical calculations and directly predict experimental results. A key step is constructing descriptors that allow ML models to recognize molecules and materials.

In Chapter 1, basic concepts of machine learning are summarized, and applications of machine learning in the discovery of molecules and materials are introduced. In Chapter 2, density functional theory (DFT), which is indispensable in the thesis for constructing descriptors, is briefly introduced, followed by the principles of several descriptors. Finally, the neural network and machine learning models used in the thesis are discussed.

In Chapter 3, harmonic vibrational frequencies computed with several semiempirical methods (PM6, PM7, and GFN2-xTB) are used as the frequency descriptor (FD) in Δ-machine learning (Δ-ML). Among these methods, GFN2-xTB yields the best-performing FD. Combined with single-point calculations at density functional theory levels, the FD reaches chemical accuracy with a small training set. We further add infrared intensities to the FD, giving the FD-II, with which chemical accuracy of energies is reached with a small training set size (3%). This approach may accelerate the prediction of various other properties in the future.
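
The Δ-ML idea in this chapter can be illustrated with a minimal sketch: a model learns the difference between a high-level and a low-level energy as a function of a descriptor, and the learned correction is added to the low-level energy of a new molecule. The nearest-neighbor regressor, the class name `NearestNeighborDelta`, and all numbers below are assumptions chosen for illustration; they are not the models, descriptors, or data used in the thesis.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class NearestNeighborDelta:
    """Toy delta-learning model (illustrative only).

    It stores the correction E_high - E_low for each training molecule and
    predicts the correction of the nearest training descriptor.
    """

    def __init__(self) -> None:
        self._train: List[Tuple[Vector, float]] = []

    def fit(self, descriptors: Sequence[Vector],
            e_low: Sequence[float], e_high: Sequence[float]) -> None:
        # The learning target is the difference, not the absolute energy.
        self._train = [(d, hi - lo)
                       for d, lo, hi in zip(descriptors, e_low, e_high)]

    def predict(self, descriptor: Vector, e_low: float) -> float:
        if not self._train:
            raise ValueError("fit() must be called before predict()")
        _, delta = min(self._train,
                       key=lambda pair: euclidean(pair[0], descriptor))
        # Corrected estimate = cheap low-level energy + learned correction.
        return e_low + delta


# Made-up descriptors (e.g. two frequencies in cm^-1) and energies.
model = NearestNeighborDelta()
model.fit(
    descriptors=[[1000.0, 1500.0], [800.0, 3000.0]],
    e_low=[-10.0, -20.0],
    e_high=[-10.5, -20.2],
)
estimate = model.predict([990.0, 1510.0], e_low=-11.0)
```

Learning the difference rather than the absolute energy is the defining choice of Δ-ML: the correction is usually smoother than the property itself, which is one reason a small training set can suffice.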

In Chapter 4, a screening framework is implemented for discovering promising photovoltaic materials among double hybrid organic-inorganic perovskites (DHOIPs). DHOIPs are promising for photovoltaic applications because of their excellent optoelectronic properties and low production costs. Their discovery is accelerated by combining ML techniques, high-throughput screening, and density functional theory calculations. Unlike previous works, this work considers the anisotropy of the organic cations of DHOIPs for the first time and uses Δ-ML in the high-throughput screening of DHOIPs to further improve the accuracy of the ML models. From 78400 DHOIPs, 19 promising candidates with bandgaps appropriate for solar cells were screened out and verified by HSE06 calculations. This work demonstrates an effective method for predicting and discovering hidden novel photovoltaic materials.
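
A two-stage screen of this kind can be sketched as follows: a cheap predictor shortlists candidates whose predicted bandgap falls in a target window, and only the shortlist is passed to an expensive reference calculation. The function names `predict_gap` and `verify_gap`, the candidate labels, and the 0.9 to 1.6 eV window are assumptions made for illustration; the abstract does not state the window or the interfaces used in the thesis.

```python
from typing import Callable, Iterable, List, Tuple


def screen_candidates(
    candidates: Iterable[str],
    predict_gap: Callable[[str], float],  # cheap ML estimate (hypothetical)
    verify_gap: Callable[[str], float],   # expensive reference (hypothetical)
    gap_min: float = 0.9,                 # assumed lower bound in eV
    gap_max: float = 1.6,                 # assumed upper bound in eV
) -> List[Tuple[str, float]]:
    """Return (candidate, verified gap) pairs that pass both stages."""
    # Stage 1: keep candidates whose predicted gap lies in the window.
    shortlisted = [c for c in candidates
                   if gap_min <= predict_gap(c) <= gap_max]
    # Stage 2: run the expensive calculation only on the shortlist.
    verified = []
    for c in shortlisted:
        gap = verify_gap(c)
        if gap_min <= gap <= gap_max:
            verified.append((c, gap))
    return verified


# Made-up lookup tables standing in for an ML model and a reference method.
predicted = {"A": 1.2, "B": 2.5, "C": 1.0}
reference = {"A": 1.3, "B": 2.4, "C": 1.9}
hits = screen_candidates(["A", "B", "C"],
                         predicted.__getitem__, reference.__getitem__)
```

In this toy run, candidate B is rejected by the cheap predictor and never reaches the reference calculation, which is where the cost saving of such a funnel comes from.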

Chapter 5 combines steric and electronic effects into a single descriptor, %Vbur (min) - 3 · HOMO–LUMO gap (eV), to construct volcano plots of reaction yields in cross-coupling catalysis. The descriptor captures the two fundamental factors, steric and electronic effects, that jointly influence reaction outcomes, and it is straightforward, with a clear physical meaning. More experiments with other metal catalysts are required to verify its performance. The concept of volcano plots is general and can act as a predictive tool to accelerate the understanding of cross-coupling catalysis.
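
As a small worked example, the descriptor %Vbur (min) - 3 · HOMO–LUMO gap (eV) can be evaluated as below. The function name, the ligand labels, and the input values are hypothetical and serve only to show the arithmetic; they are not data from the thesis.

```python
def steric_electronic_descriptor(
    vbur_min_percent: float,
    homo_lumo_gap_ev: float,
    weight: float = 3.0,
) -> float:
    """Return %Vbur (min) - weight * HOMO-LUMO gap (eV).

    The default weight of 3 follows the descriptor quoted in the abstract.
    """
    return vbur_min_percent - weight * homo_lumo_gap_ev


# Made-up ligands: (minimum percent buried volume, HOMO-LUMO gap in eV).
ligands = {
    "ligand_a": (28.0, 5.0),
    "ligand_b": (36.0, 4.0),
}
descriptor_values = {
    name: steric_electronic_descriptor(vbur, gap)
    for name, (vbur, gap) in ligands.items()
}
```

Plotting reaction yield against such descriptor values is what produces a volcano plot; the position of the peak has to come from experimental data and is not assumed here.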

Chapter 6 gives brief conclusions of the thesis and an outlook for future work. Machine learning methods are popular and influential in predicting the properties of molecules and materials, and developing effective descriptors plays a critical role in the performance of the models.