Scientists at MIT and a collaborative research lab have created a unified framework that simultaneously predicts molecular properties and generates new molecules, and does so more efficiently than conventional deep-learning methods.
To train a machine learning model to predict a molecule’s biological or mechanical attributes, researchers typically present it with millions of labelled molecular structures. However, obtaining such extensive training datasets is often challenging and expensive because discovering and hand-labelling so many structures is difficult. As a result, the efficacy of machine learning techniques is often limited.
In contrast, the system developed by the MIT researchers can accurately forecast molecular properties with a minimal amount of data. The system has a built-in understanding of the principles that govern how building blocks combine to form valid molecules. These principles capture the resemblances between molecular structures, enabling the system to generate new molecules and predict their properties efficiently, even with limited data.
This approach surpassed alternative machine learning methods when tested on datasets of varying sizes, delivering precise predictions of molecular properties and producing viable molecules when provided with datasets containing fewer than 100 samples.
Minghao Guo, a graduate student in electrical engineering and computer science (EECS) and the study’s lead author, explains that the project’s objective is to use data-driven techniques to accelerate the discovery of novel molecules.
The aim is to train a model capable of making predictions without relying on expensive experimental procedures. The intention is to reduce costs and expedite the molecular discovery process by implementing these data-driven methods.
To achieve optimal outcomes with machine learning models, researchers require extensive training datasets consisting of millions of molecules with properties similar to the ones they aim to discover. In practice, however, these domain-specific datasets are often limited in size. Consequently, researchers pre-train models on large datasets of general molecules.
These pre-trained models are then applied to the smaller, targeted datasets. Unfortunately, their performance tends to be poor because the models have not acquired much domain-specific knowledge.
The researchers at MIT adopted a different approach. They developed a machine learning system that automatically learns the “language” of molecules, referred to as a molecular grammar, from a limited, domain-specific dataset. The system then uses the learned grammar to generate viable molecules and accurately predict their properties.
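To make the idea of a grammar more concrete, here is a minimal sketch in Python of how production rules can generate molecule-like strings. It is a toy illustration only, not the researchers’ method: the rules are hand-written rather than learned from data, they rewrite SMILES-like text rather than molecular graphs, and every name in it (`RULES`, `generate`, `is_well_formed`) is hypothetical.

```python
import random

# Toy grammar, written by hand for illustration (NOT learned from data).
# "X" is the only non-terminal: an open position that can still grow.
NONTERMINAL = "X"
START = "X"
TERMINATING_RULE = "C"   # closes an open position with a single carbon
RULES = [
    TERMINATING_RULE,    # X -> C       stop growing here
    "CX",                # X -> CX      extend the chain by one carbon
    "C(X)X",             # X -> C(X)X   open a branch
    "COX",               # X -> COX     insert an ether-style oxygen
]


def generate(rng, max_expansions=8):
    """Expand the leftmost non-terminal until none remain.

    Returns the generated string and the list of rules applied
    (the derivation). After `max_expansions` steps only the
    terminating rule is used, so the loop always ends.
    """
    text = START
    derivation = []
    while NONTERMINAL in text:
        if len(derivation) < max_expansions:
            rule = rng.choice(RULES)
        else:
            rule = TERMINATING_RULE
        text = text.replace(NONTERMINAL, rule, 1)  # leftmost occurrence only
        derivation.append(rule)
    return text, derivation


def is_well_formed(text):
    """Check that parentheses are balanced and only C, O, ( and ) appear."""
    depth = 0
    for ch in text:
        if ch not in "CO()":
            return False
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        molecule, derivation = generate(rng)
        print(molecule, "from", len(derivation), "rule applications")
```

Because every rule keeps brackets balanced, any string this toy grammar produces is well formed by construction. The same intuition lies behind the claim that a grammar yields valid molecules. The recorded derivation also hints at why a grammar can expose resemblances: structures built from many of the same rules are, in that sense, similar.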
Guo emphasised both the effectiveness of the grammar-based representation and its generality, which the researchers hope to exploit in other fields.
“This grammar-based representation is very efficacious. And due to its general nature, the grammar can be applied to diverse types of graph-based data. Our aim is to explore additional applications beyond the domains of chemistry and material science,” Guo said.
The researchers envision expanding their molecular grammar to encompass the three-dimensional (3D) geometry of molecules and polymers, which is crucial to understanding the interactions between polymer chains.
Additionally, they are developing an interface that would allow users to view the learned grammar production rules. The interface would also let users provide feedback to correct any inaccurate rules, enhancing the system’s accuracy.