
Liu Qi's team develops "X-MOL", a large-scale small-molecule pretraining model to aid AI-driven drug molecule design

Author: BioArtMED

AI-based design of small-molecule drugs has accelerated drug development and has become an important research direction in recent years. Effective representation and understanding of small molecules is a core problem in AI drug design. Although new AI models in this field keep emerging, there is still no universal computational framework that can uniformly model the individual tasks of small-molecule generation, optimization, property prediction, and interaction prediction.

Recently, Professor Liu Qi's research group in the Department of Bioinformatics, School of Life Science and Technology, Tongji University, together with Baidu's Natural Language Processing group, published a paper in Science Bulletin entitled "X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis", presenting the large-scale small-molecule pretrained model X-MOL (Figure 1) and its open-source release (https://github.com/bm2-lab/x-mol). In this work, the researchers built a large-scale Transformer-based model and, by combining massive training data with substantial computational resources, trained X-MOL to represent small molecules efficiently. They then verified the performance gains brought by small-molecule pretraining on five different downstream tasks: molecular activity prediction, chemical reaction yield prediction, drug-drug interaction prediction, de novo small-molecule generation, and small-molecule optimization (Figure 1a).


Figure 1. The X-MOL computational framework

The core of the X-MOL pipeline is its self-supervised pre-training strategy. The researchers chose SMILES [1] as the representation of small molecules and designed a generative pre-training objective: the model generates one random SMILES of a molecule from another random SMILES of the same molecule. Through this "SMILES-to-SMILES transformation", the model learns the grammar of SMILES and an effective representation of molecules, so that the computer can "understand" the semantics of SMILES. Because this generative pre-training strategy differs from the conventional Masked Language Model (MLM) objective, the standard Transformer [2] is not directly applicable here. The researchers therefore proposed a mixed-attention Transformer that integrates a bidirectional attention mechanism with a unidirectional attention mechanism (Figure 1b), allowing X-MOL to achieve the effect of an encoder-decoder architecture within a single Transformer encoder and thereby support small-molecule generation.
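To make the idea of this "random SMILES to random SMILES" objective concrete, the short sketch below uses RDKit (not part of the X-MOL codebase) to enumerate several equivalent randomized SMILES strings for one molecule; pairs of such strings are the kind of input and output the generative objective operates on. The aspirin SMILES used here is only an arbitrary example.

    # A minimal sketch, not from the X-MOL codebase: RDKit is used to produce
    # several different-but-equivalent "random SMILES" strings for one molecule,
    # which is the kind of string pair the generative pre-training objective uses.
    from rdkit import Chem

    canonical = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, an arbitrary example molecule
    mol = Chem.MolFromSmiles(canonical)

    # doRandom=True starts the SMILES traversal at a random atom, so each call
    # can return a different string for the same molecular graph.
    random_smiles = {Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(5)}

    for s in random_smiles:
        # Every variant decodes back to the same canonical molecule.
        assert Chem.MolToSmiles(Chem.MolFromSmiles(s)) == Chem.MolToSmiles(mol)
        print(s)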

In terms of model scale, X-MOL is a 12-layer Transformer encoder with a hidden dimension of 768 and 12 attention heads per layer. To train such a large model effectively, the researchers used all small molecules in the ZINC15 [3] database, more than 1.1 billion in total, as pre-training data. The entire training process was carried out on Baidu's cloud computing platform, with 8 to 16 GPUs used for each training run.
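To give a rough sense of that scale, the following PyTorch sketch builds an encoder of comparable size (12 layers, hidden dimension 768, 12 heads). It is only an illustration of the parameter budget: the real X-MOL is implemented on Baidu's PaddlePaddle stack and uses the mixed bidirectional/unidirectional attention mask described above, which this plain encoder does not reproduce; the vocabulary size and sequence length are assumed values.

    # A size-for-size sketch in PyTorch only; the real X-MOL runs on PaddlePaddle
    # and replaces the standard attention mask with the mixed attention described
    # in the text. Vocabulary size and sequence length are assumed values.
    import torch
    import torch.nn as nn

    vocab_size, seq_len = 128, 64          # assumed SMILES vocabulary and length
    hidden, heads, layers = 768, 12, 12    # scale reported for X-MOL

    embedding = nn.Embedding(vocab_size, hidden)
    layer = nn.TransformerEncoderLayer(
        d_model=hidden, nhead=heads, dim_feedforward=4 * hidden, batch_first=True
    )
    encoder = nn.TransformerEncoder(layer, num_layers=layers)

    tokens = torch.randint(0, vocab_size, (2, seq_len))   # a dummy batch of 2 sequences
    states = encoder(embedding(tokens))                    # shape: (2, 64, 768)
    print(states.shape)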

Most published pre-trained models in this field use the conventional MLM-style objective and are not suitable for generative downstream tasks. X-MOL, which adopts a generative pre-training strategy, can be fine-tuned for a wider range of downstream tasks:

(1) Molecular property prediction, covering physicochemical properties and ADMET endpoints. The researchers selected seven tasks from MoleculeNet [4]: four classification tasks (BACE, HIV, BBBP, ClinTox) and three regression tasks (Lipophilicity, ESOL, FreeSolv). X-MOL performed best on all seven tasks (Figure 2a).

(2) Chemical reaction yield prediction. X-MOL reached an RMSE of 0.0626, clearly surpassing the baseline RMSE of 0.078 [5], and also outperformed the recent Yield-BERT [6] in terms of R² (Figure 2b).

(3) Drug-drug interaction prediction. The researchers chose the classic works DeepDDI [7] and CASTER [8] as baselines. X-MOL achieved a prediction accuracy of 0.952, surpassing DeepDDI's 0.924, and also outperformed both baselines on ROC-AUC, PR-AUC, and F1 score (Figure 2c).

(4) Small-molecule generation, including distribution-learning and goal-directed generation [9]. Evaluation of the former focuses on the quality of the generated molecules, while the latter examines whether the generated molecules meet a stated objective. In distribution-learning generation, X-MOL matched the level of graph-based models on all three evaluation metrics. In goal-directed generation, the top-3 molecules generated by X-MOL all reached the QED [10] value set as the generation target (a sketch of the QED metric follows this list), whereas the best previous graph-based model only reached the target with its top-2 molecules (Figure 2d).

(5) Small-molecule optimization. In this task, both the pretrained X-MOL and an untrained cold-start X-MOL were able to effectively perform specific optimizations on input small molecules (Figure 2e).
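For readers unfamiliar with the QED drug-likeness score [10] used as the goal-directed target, it can be computed directly with RDKit. The snippet below only illustrates the metric and is not part of the X-MOL evaluation code; the example molecules are arbitrary.

    # Computing the QED drug-likeness score with RDKit; this only illustrates the
    # metric used as the generation target and is not X-MOL evaluation code.
    from rdkit import Chem
    from rdkit.Chem import QED

    examples = {
        "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
        "nicotine": "CN1CCC[C@H]1c1cccnc1",
    }
    for name, smi in examples.items():
        mol = Chem.MolFromSmiles(smi)
        print(name, round(QED.qed(mol), 3))   # QED ranges from 0 (poor) to 1 (drug-like)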


Figure 2. Comparison of X-MOL performance on various downstream tasks

Beyond the five downstream tasks above, which involve only small-molecule representations, the research team also showed that X-MOL effectively improves performance on ligand-protein interaction prediction, a task that involves protein entities in addition to small molecules. This indicates that X-MOL's effective representation of small molecules can be generalized to an even wider range of downstream tasks.

The researchers further attempted to demonstrate X-MOL's understanding of small molecules in individual tasks by visualizing its attention mechanism (Figure 3). They visualized attention matrices from the middle layers of X-MOL after fine-tuning on the molecular activity prediction task. This example further shows that the X-MOL model has a degree of interpretability.
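As a generic illustration of this kind of plot (not the authors' visualization code), an attention matrix over SMILES tokens can be rendered as a heatmap; the token list and the attention weights below are placeholders.

    # A generic sketch of attention visualization (not the authors' plotting code):
    # given one head's attention matrix, draw it as a heatmap indexed by the SMILES
    # tokens of the input molecule. Random weights stand in for the real matrix.
    import numpy as np
    import matplotlib.pyplot as plt

    tokens = list("CC(=O)O")                                             # toy character-level tokens
    attn = np.random.dirichlet(np.ones(len(tokens)), size=len(tokens))  # each row sums to 1

    fig, ax = plt.subplots()
    im = ax.imshow(attn, cmap="viridis")
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens)
    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens)
    fig.colorbar(im, label="attention weight")
    fig.savefig("attention_heatmap.png")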


Figure 3. Visualization of X-MOL attention mechanisms

Taken together, X-MOL achieves state-of-the-art performance on a range of small-molecule-related downstream tasks while retaining good interpretability. X-MOL should further encourage the AI-driven pharmaceutical field to adopt a large-scale pre-training and fine-tuning strategy that unifies the existing variety of AI-assisted small-molecule design tasks, providing a universal AI computational framework and open-source platform for the field.

The first authors of the paper are Xue Dongyu and Chen Xiaohan of Professor Liu Qi's research group at the School of Life Science and Technology, Tongji University, and Zhang Han of Baidu's Natural Language Processing department; the corresponding authors are Professor Liu Qi and Li Yukun of Baidu. Sun Yu, Tian Hao, Wu Hua, and others at Baidu provided useful guidance for this work. The work was also strongly supported by Baidu's PaddlePaddle platform and by the intelligent science discipline of Tongji University together with the Shanghai Autonomous Intelligent Unmanned Systems Science Center.

Original link:

https://www.sciencedirect.com/science/article/abs/pii/S2095927322000445

Typesetting: Eleven

References

[1] Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 1988, 28: 31-36

[2] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Advances in Neural Information Processing Systems, 2017.

[3] Sterling T, Irwin JJ. ZINC 15 - ligand discovery for everyone. Journal of Chemical Information and Modeling, 2015, 55: 2324-2337

[4] Wu Z, Ramsundar B, Feinberg EN, et al. MoleculeNet: A benchmark for molecular machine learning. Chemical Science, 2018, 9: 513-530

[5] Ahneman DT, Estrada JG, Lin S, et al. Predicting reaction performance in C–N cross-coupling using machine learning. Science, 2018, 360: 186-190

[6] Schwaller P, Vaucher AC, Laino T, et al. Prediction of chemical reaction yields using deep learning. Machine Learning: Science and Technology, 2021, 2: 015016

[7] Ryu JY, Kim HU, Lee SY. Deep learning improves prediction of drug–drug and drug–food interactions. Proceedings of the National Academy of Sciences, 2018, 115: E4304-E4311

[8] Huang K, Xiao C, Hoang T, et al. CASTER: Predicting drug interactions with chemical substructure representation. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2020.

[9] Brown N, Fichera M, Segler M, et al. GuacaMol: Benchmarking models for de novo molecular design. Journal of Chemical Information and Modeling, 2019, 59: 1096-1108

[10] Bickerton GR, Paolini GV, Besnard J, et al. Quantifying the chemical beauty of drugs. Nature Chemistry, 2012, 4: 90-98

