
Use large language models to explore molecular discovery – translating molecules and textual descriptions to each other

Author: Data-pie THU

Source: ScienceAI. This article is about 6,000 words; suggested reading time: 5 minutes. It describes how prompts can guide LLMs to translate between molecules and their textual descriptions.

Paper Title:

Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective

Paper Link:

https://arxiv.org/abs/2306.06615

Project Link:

https://github.com/phenixace/MolReGPT

01 Introduction


Figure 1: An illustration of molecule-caption translation in molecular discovery. (a) Molecules can be represented by chemical formulas, SMILES strings, and two-dimensional molecular diagrams. (b) Molecule captioning aims to generate a text describing the structure, properties, and functions of a molecule so that humans can understand it better. (c) Given a textual description of a molecule, text-based molecule generation aims to generate the corresponding molecule. (d) Large language models (e.g., ChatGPT) can perform both molecule captioning (Mol2Cap) and text-based molecule generation (Cap2Mol) through well-designed prompts.

Molecules are the basic building blocks of matter and make up the complex systems of the world around us. A molecule consists of multiple atoms held together by chemical bonds, and it exhibits distinct chemical properties determined by its specific structure. With a comprehensive understanding of molecules, scientists can effectively design materials, drugs, and products with desired properties and functions.

However, traditional molecular discovery is a long, expensive, and failure-prone process, with limitations in scalability, accuracy, and data management. To overcome these challenges, computational technologies such as artificial intelligence (AI) have become powerful tools for accelerating the discovery of new molecules.

Specifically, a molecule can be represented as a Simplified Molecular-Input Line-Entry System (SMILES) string. As shown in Figure 1(a), the structure of phenol, which consists of a benzene ring and a hydroxyl group, can be represented by such a SMILES string. To better understand and generate molecules, Text2Mol [1] and MolT5 [2] proposed a new task of translating between molecules and natural language, i.e., the molecule-caption translation task.
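To make the SMILES notation concrete, here is a minimal sketch using the open-source RDKit library (a general-purpose cheminformatics toolkit, not specific to this paper) that parses phenol's SMILES string and inspects the resulting molecule:

```python
# Minimal sketch: parsing a SMILES string with RDKit.
# "c1ccccc1O" is a SMILES string for phenol (a benzene ring plus a hydroxyl group).
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

mol = Chem.MolFromSmiles("c1ccccc1O")  # returns None if the SMILES is invalid
if mol is not None:
    print(rdMolDescriptors.CalcMolFormula(mol))  # C6H6O
    print(mol.GetNumAtoms())                     # 7 heavy atoms
    print(Chem.MolToSmiles(mol))                 # canonical form: Oc1ccccc1
```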

It consists of two subtasks: molecule captioning (Mol2Cap) and text-based molecule generation (Cap2Mol). As shown in Figure 1(b-c), the goal of molecule captioning is to generate a text describing the molecule's SMILES string, providing people with a better molecular understanding; text-based molecule generation, in turn, aims to generate the corresponding molecule (i.e., its SMILES string) from a given natural language description of properties and functional groups.

Imagine the following scenarios:

• 【Molecule Captioning (Mol2Cap)】A doctor wants to know the properties of a drug, so he gives the drug molecule and his question to a large language model; the model analyzes and predicts the characteristics of the molecule, helping the doctor prescribe the right medicine, as shown in Figure 1(b);

• 【Text-based Molecule Generation (Cap2Mol)】A chemist states his needs directly to a large language model, which helps him generate one or more candidate molecules; experimenting only on these candidates can greatly simplify the molecule or drug discovery process, as shown in Figure 1(c).

Although most existing work has made satisfactory progress on the molecule-caption translation task, it suffers from several limitations. First, the design of model architectures for this task relies heavily on domain experts, which greatly limits the development of AI-driven molecular discovery. Second, most existing methods follow the "pre-train & fine-tune" paradigm, which incurs excessive computational costs. Third, existing methods such as Text2Mol [1] and MolT5 [2] cannot reason about complex tasks or generalize to unseen samples.

Recently, large language models (LLMs) have made great strides in the field of natural language processing (NLP). Beyond their impressive capabilities in natural language understanding and generation, LLMs demonstrate strong generalization and reasoning abilities. They can generalize to unseen tasks through In-Context Learning (ICL) without fine-tuning, greatly reducing computational costs. As a result, LLMs have unprecedented potential to advance molecular discovery, especially the molecule-caption translation task.

While building task-specific LLMs for molecular discovery has great potential to advance scientific research, it also faces significant challenges. First, due to privacy and security concerns, many advanced large language models (such as ChatGPT and GPT-4) are not publicly available; that is, their architectures and parameters are not released, so they cannot be fine-tuned on downstream tasks. Second, training advanced LLMs requires substantial computing resources because of their complex architectures and the large amount of training data needed; designing, pre-training, and fine-tuning one's own LLM is therefore very challenging. Finally, designing appropriate guidelines or prompts, accompanied by a small number of high-quality examples, is essential to improve LLMs' understanding of and reasoning about molecular discovery.

To address these issues, researchers from The Hong Kong Polytechnic University and Michigan State University have explored ways to harness the power of LLMs in molecular discovery. They propose a novel solution that uses prompts to guide LLMs in translating between molecules and molecular text descriptions, as shown in Figure 1(d). Specifically, inspired by ChatGPT, they developed a retrieval-based prompt paradigm, MolReGPT [5], which performs the two subtasks (i.e., molecule captioning and text-based molecule generation) via Morgan-fingerprint-based molecule retrieval and BM25-based caption retrieval combined with in-context learning (ICL), without fine-tuning. Experiments show that MolReGPT reaches a Text2Mol score of 0.560 on Mol2Cap and 0.571 on Cap2Mol, surpassing the fine-tuned MolT5-base on both subtasks. On text-based molecule generation, MolReGPT even surpasses MolT5, improving the Text2Mol metric by 3%. Notably, all of MolReGPT's gains are achieved without any fine-tuning.

02 Method


Figure 2: Overall process framework of MolReGPT.

Training or fine-tuning LLMs on domain-specific corpora for molecular discovery is often infeasible in practice due to the huge computational cost. To address these limitations, the researchers harness the power of LLMs without altering them, proposing an innovative framework, MolReGPT, that enables ChatGPT to perform molecule-caption translation. Specifically, to improve the quality of the guidance/prompts, they introduce a retrieval-based prompt paradigm that steers ChatGPT through in-context learning on the two molecule-related tasks: molecule captioning (Mol2Cap) and text-based molecule generation (Cap2Mol). The MolReGPT framework, shown in Figure 2, consists of four main stages: molecule-caption retrieval, prompt management, in-context few-shot molecule learning, and generation calibration.

1. Molecule-Caption Retrieval (Figure 3): This stage retrieves the n molecule-caption pairs from the database that are most similar to the input molecule or caption (i.e., the few-shot learning examples). It relies on two retrieval methods: molecular Morgan fingerprints (for Mol2Cap) and BM25 (for Cap2Mol).


Figure 3: Molecule-Caption Retrieval.

Morgan fingerprint-based molecular retrieval (for Mol2Cap)


Figure 4: Illustration of molecular Morgan fingerprints and Dice similarity. Green marks substructures that contribute positively to the intermolecular similarity score; purple marks substructures that contribute negatively or differ between the molecules.

To extract the Morgan fingerprint, the SMILES representation of a molecule is converted into an RDKit molecule object using the RDKit library. Dice similarity is then applied to measure the similarity between the input molecule and those in the local database, as shown in Figure 3. Mathematically, it can be expressed as:

$$\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$

where $A$ and $B$ are the Morgan fingerprints of the two molecules, $|A|$ and $|B|$ denote the cardinalities of $A$ and $B$ (e.g., the number of substructures), and $|A \cap B|$ denotes the number of substructures shared by $A$ and $B$. Dice similarity ranges from 0 to 1, where 0 indicates no overlap between the molecules and 1 indicates complete overlap.
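As a concrete illustration, here is a minimal sketch of this retrieval scoring with RDKit; the fingerprint radius and bit size below are illustrative assumptions rather than the paper's exact settings:

```python
# Minimal sketch: Morgan fingerprints + Dice similarity with RDKit.
# Radius 2 and 2048 bits are illustrative assumptions, not the paper's settings.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles: str, radius: int = 2, n_bits: int = 2048):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)

query = morgan_fp("c1ccccc1O")                # phenol as the input molecule
database = ["c1ccccc1", "CCO", "Cc1ccccc1O"]  # toy local database

# Rank database molecules by Dice similarity to the query and keep the
# top-n pairs as few-shot examples.
scored = sorted(
    ((DataStructs.DiceSimilarity(query, morgan_fp(s)), s) for s in database),
    reverse=True,
)
print(scored)
```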

BM25-based caption retrieval (for Cap2Mol)

BM25 is one of the most representative ranking methods in information retrieval, used to compute the relevance of documents to a given query. In the Cap2Mol task, the input caption serves as the query, and the captions in the local database serve as the document corpus, where each caption is one document. Mathematically, the BM25 score can be defined as follows:

$$\mathrm{BM25}(D, Q) = \sum_{i=1}^{N} \mathrm{IDF}(Q_i) \cdot \frac{f(Q_i, D)\,(k_1 + 1)}{f(Q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

where $D$ is a caption from the corpus and $Q$ is the query caption. $N$ is the number of query terms in the query caption, $Q_i$ is the $i$-th query term, $\mathrm{IDF}(Q_i)$ is the inverse document frequency of $Q_i$, $f(Q_i, D)$ is the term frequency of $Q_i$ in $D$, $k_1$ and $b$ are tuning parameters, $|D|$ is the length of $D$, and $\mathrm{avgdl}$ is the average caption length in the corpus. In caption retrieval, BM25 computes similarity scores between captions, so that by selecting the top-scoring molecule-caption pairs, the molecular structures corresponding to the text descriptions can be learned.
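For illustration, here is a minimal sketch of this caption retrieval using the open-source rank_bm25 package; the paper does not mandate a particular BM25 implementation, so treat this choice as an assumption:

```python
# Minimal sketch: BM25 caption retrieval with the rank_bm25 package
# (pip install rank-bm25); the paper's exact implementation may differ.
from rank_bm25 import BM25Okapi

corpus = [
    "The molecule is a member of the class of phenols.",
    "The molecule is an aromatic ether used as a solvent.",
    "The molecule is an amino acid found in proteins.",
]
bm25 = BM25Okapi([caption.lower().split() for caption in corpus])

query = "a phenol derivative".lower().split()
print(bm25.get_scores(query))              # BM25 score for each caption
print(bm25.get_top_n(query, corpus, n=2))  # top-2 captions -> paired molecules
```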

2. Prompt Management (Figure 5): This stage manages and constructs ChatGPT's prompts, which consist of four parts: Role Identification, Task Description, Retrieved Examples, and Output Instructions. The examples come from the retrieval stage in step 1. Each part serves as a specific guide for the output; a sketch of how the parts might be assembled follows the four descriptions below.


Figure 5: Prompt Management.

a. Role Identification

Role identification helps the LLM recognize its role as an expert in chemistry and molecular discovery. By assuming this role, the LLM is encouraged to produce responses consistent with the expertise expected in the domain.

b. Task Description

The task description provides a comprehensive explanation of the task, ensuring that the LLM has a clear understanding of what it needs to do. It also includes key definitions that clarify technical terms or concepts in the molecule-caption translation task.

c. Retrieved Examples

Providing the retrieved examples as part of the user input prompt enables the LLM to exploit the information contained in these few-shot examples when producing its response.

d. Output Instructions

The output instructions dictate the format of the response. Here, the researchers restrict the output to JSON format. Choosing JSON allows quick and efficient validation of the LLM's responses, ensuring that they match the expected structure for further processing and analysis.
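As referenced above, here is a hypothetical sketch of how the four parts might be assembled for the Mol2Cap task; the exact prompt wording used by MolReGPT is given in the paper and its repository:

```python
# Hypothetical sketch of assembling the four prompt parts for Mol2Cap;
# the wording is illustrative, not MolReGPT's actual prompt text.
def build_prompt(examples, query_smiles):
    system_prompt = (
        # a. Role Identification
        "You are an expert chemist working on molecule discovery.\n"
        # b. Task Description
        "Task: given a molecule's SMILES string, write a caption describing "
        "its structure, properties, and functions.\n"
        # d. Output Instructions
        'Respond strictly in JSON: {"caption": "..."}'
    )
    # c. Retrieved Examples (from the molecule-caption retrieval stage)
    shots = "\n\n".join(f"SMILES: {smi}\nCaption: {cap}" for smi, cap in examples)
    user_prompt = f"{shots}\n\nSMILES: {query_smiles}\nCaption:"
    return system_prompt, user_prompt
```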

3. In-Context Few-Shot Molecule Learning (Figure 6): At this stage, the system prompt and the user input prompt are provided to ChatGPT for in-context few-shot molecule learning. This process relies on the in-context learning ability of the large language model: with only a small number of similar examples, it can capture the correspondence between molecular structures and captions to perform molecule-caption translation, without any fine-tuning of the model.

Together, the system prompt and the user input prompt give ChatGPT clear guidance through in-context learning: the system prompt establishes the task framework and the molecular-domain expertise, while the user prompt narrows the scope and directs the model's attention to the specific user input.


Figure 6: In-Context Few-Shot Molecule Learning.
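A minimal sketch of this stage using the openai Python client (v1 interface) might look as follows; the model name and decoding parameters are illustrative assumptions:

```python
# Minimal sketch of the in-context few-shot call; model name and parameters
# are illustrative, not necessarily the paper's exact settings.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_llm(system_prompt: str, user_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",   # the paper reports results with GPT-3.5-turbo
        temperature=0.0,         # assumption: deterministic decoding
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return resp.choices[0].message.content  # raw text, calibrated in stage 4
```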

4. Generation Calibration (Figure 7): At this stage, ChatGPT's output is calibrated to ensure that it conforms to the expected format and requirements. If the output does not meet expectations, the system re-queries ChatGPT until a valid response is obtained or the maximum number of allowed errors is reached.

Despite being given the desired output format, LLMs such as ChatGPT occasionally produce unexpected responses, including incorrectly formatted output or refusals to answer. To address these issues, the researchers introduce a generation calibration mechanism to verify ChatGPT's responses. In generation calibration, the format of the raw response is first checked by parsing it as a JSON object. If parsing fails, indicating a deviation from the expected format, several predefined format-correction strategies, such as regular-expression matching, are applied to correct the format and extract the desired result from the response. If the raw response passes the format check, or can be calibrated by a format-correction strategy, it is considered valid and accepted as the final answer. Otherwise, a re-query is initiated. Notably, there is one special case: when the response reports a "maximum input length exceeded" error, the longest retrieved example is removed during the re-query phase until the query length meets the limit. The re-query process issues additional queries to the LLM until a valid response is obtained or the maximum number of allowed errors is reached. This limit ensures that the system does not get stuck in an endless loop and instead returns a suitable response to the user within an acceptable time.

By employing the generation calibration stage, unexpected deviations from the desired output format are reduced, and the final response is ensured to be consistent with the expected format and requirements.
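The loop just described can be sketched as follows; query_llm is the chat call from the previous stage's sketch, and the retry limit and regex strategy are illustrative assumptions:

```python
# Sketch of the generation-calibration loop: parse the response as JSON,
# fall back to regex extraction, and re-query up to a retry limit.
import json
import re

MAX_RETRIES = 3  # assumption: the paper sets a maximum allowed error count

def calibrate(query_llm, system_prompt, user_prompt):
    for _ in range(MAX_RETRIES):
        raw = query_llm(system_prompt, user_prompt)
        try:
            return json.loads(raw)  # format check: is it valid JSON?
        except json.JSONDecodeError:
            # Format-correction strategy: extract the first {...} span by regex.
            match = re.search(r"\{.*\}", raw, re.DOTALL)
            if match:
                try:
                    return json.loads(match.group(0))
                except json.JSONDecodeError:
                    pass  # correction failed; fall through to re-query
    return None  # retry limit reached; report failure instead of looping forever
```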


Figure 7: Generation Calibration.

03 Results

Molecular Description Generation Task (Mol2Cap)


Table 1: Performance comparison of different models on the molecular description generation (Mol2Cap) task on the ChEBI-20 dataset [3,4].


Table 3: Performance comparison of MolReGPT with different numbers of shots (N-shot) on the molecule captioning (Mol2Cap) task.

The results of the Mol2Cap task are shown in Tables 1 and 3: MolReGPT obtains ROUGE scores comparable to the fine-tuned MolT5-base [2] while outperforming all selected baseline models on the remaining metrics.

In addition, the ablation experiment compares three retrieval strategies, as shown in Table 3: Random, BM25, and Morgan FTS (the strategy adopted by MolReGPT). The Random strategy retrieves n random examples, while BM25 here applies the character-level BM25 algorithm to the SMILES string representations of molecules. Among the three strategies, Morgan FTS performs best under the same number of few-shot examples, outperforming BM25 by 37% on the Text2Mol [1] metric.

Moreover, Morgan FTS nearly doubles the ROUGE-L score compared to the Random and BM25 retrieval strategies. This shows that comparing distinctive structural features, such as functional groups, yields a better estimate of the structural similarity between molecules, features that are often reflected in the detailed wording of molecule captions. In this case, retrieving similar molecules via Morgan FTS effectively guides the LLM to learn the association between molecular structures and captions, resulting in more accurate and desirable outputs.

Figure 8 shows an example of molecule captioning to compare the performance of different models. From the given example, it can be seen that MolReGPT generates captions containing the key information about the input molecule. Moreover, the generated captions are grammatically more polished and easier for humans to understand.


Figure 8: Examples of molecule captions generated by different models (the SMILES strings are converted to molecular diagrams for better presentation).

Text-based molecular generation task (Cap2Mol)


Table 2: Performance comparison of different models on text-based molecular generation (Cap2Mol) tasks on the ChEBI-20 dataset.


Table 4: Performance comparison of MolReGPT with different numbers of shots (N-shot) on the text-based molecule generation (Cap2Mol) task.

Given the caption of a molecule (covering its structure and properties), Cap2Mol aims to generate the corresponding molecule (i.e., a SMILES string) for molecular discovery. The results are presented in Tables 2 and 4. Comparing all baseline models, 10-shot MolReGPT significantly enhances the capabilities of GPT-3.5-turbo and achieves the best overall performance. MolReGPT achieves a significant 15% improvement over MolT5-base on the Text2Mol metric. Considering the molecular fingerprint scores (MACCS FTS, RDK FTS, and Morgan FTS), 10-shot MolReGPT also obtains an average improvement of 18% over MolT5-base. In addition, MolReGPT achieves the highest exact-match score, with 13.9% of the generated molecules exactly matching the ground truth. Notably, all of these impressive results are achieved without any additional training or fine-tuning.

Figure 9 shows examples of text-based molecule generation results to compare the performance of different models. As can be seen from the given example, MolReGPT generates structures that are more similar to the ground truth.


Figure 9: Examples of molecules generated by different models (the SMILES strings are converted to molecular diagrams for better presentation).

04 Discussion


Figure 10: Comparison of molecules generated by MolT5 and MolReGPT given the same input.

The paper further explores molecule generation from customized text. As shown in Figure 10, the input in Example 1 specifies five benzene rings and hydrophobic groups in the structure. However, MolT5 produces an incorrect number of benzene rings, and its generated structure contains some hydrophilic groups. In contrast, MolReGPT gives the correct structure corresponding to the input. In Example 2, both MolT5 and MolReGPT generate the correct number of benzene rings, but MolReGPT generates more hydrophilic groups, which better matches the given input.

05 Conclusion

This paper proposes MolReGPT, a general retrieval-based in-context few-shot molecule learning prompt paradigm that equips large language models such as ChatGPT with molecular discovery capabilities. MolReGPT uses the principle of molecular similarity to retrieve molecule-caption pairs from a local database as in-context learning examples, guiding the large language model to generate molecule SMILES strings without any fine-tuning.

This work focuses on the molecule-caption translation task, comprising molecule captioning (Mol2Cap) and text-based molecule generation (Cap2Mol), and evaluates the capabilities of large language models on it. The experimental results show that MolReGPT enables ChatGPT to reach Text2Mol scores of 0.560 and 0.571 on caption generation and molecule generation, respectively. In both molecular understanding and text-based molecule generation, its performance exceeds fine-tuned models such as MolT5-base and is even comparable to the fine-tuned MolT5-large. In conclusion, MolReGPT provides a novel and versatile paradigm for deploying large language models in molecular discovery through in-context learning, greatly reducing the cost of domain transfer and exploring the potential of large language models in molecular discovery.

References

[1] Edwards, C., Zhai, C., and Ji, H. Text2Mol: Cross-modal molecule retrieval with natural language queries. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 595–607, 2021.

[2] Edwards, C., Lai, T., Ros, K., Honke, G., Cho, K., and Ji, H. Translation between molecules and natural language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 375–413, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.

[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[4] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.

[5] Li, J., Liu, Y., Fan, W., Wei, X. Y., Liu, H., Tang, J., & Li, Q. (2023). Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective. arXiv preprint arXiv:2306.06615.
