
Use large language models to explore molecular discovery – translating molecules and textual descriptions to each other

Author: Data-pie THU

Source: ScienceAI. This article is about 6,000 words; suggested reading time: 5 minutes. It describes how prompts can guide LLMs to translate between molecules and their textual descriptions.

Paper Title:

Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective

Paper Link:

https://arxiv.org/abs/2306.06615

Project Link:

https://github.com/phenixace/MolReGPT

01 Introduction


Figure 1: An illustration of molecule-caption translation in molecular discovery. (a) Molecules can be represented by chemical formulas, SMILES strings, and two-dimensional molecular diagrams. (b) Molecule captioning aims to generate a text describing the structure, properties, and functions of a molecule so that humans can understand it better. (c) Given a textual description of a molecule, text-based molecule generation aims to generate the corresponding molecule. (d) Large language models (e.g., ChatGPT) can perform both molecule captioning (Mol2Cap) and text-based molecule generation (Cap2Mol) through well-designed prompts.

Molecules are the basic building blocks of matter and make up the complex systems of the world around us. A molecule consists of multiple atoms held together by chemical bonds, and it exhibits distinct chemical properties determined by its specific structure. With a comprehensive understanding of molecules, scientists can effectively design materials, drugs, and products with desired properties and functions.

However, traditional molecular discovery is a long, expensive, and failure-prone process, with limitations in scalability, accuracy, and data management. To overcome these challenges, computational technologies such as artificial intelligence (AI) have become powerful tools for accelerating the discovery of new molecules.

Specifically, a molecule can be represented as a Simplified Molecular-Input Line-Entry System (SMILES) string. As shown in Figure 1(a), the structure of phenol, which consists of a benzene ring and a hydroxyl group, can be represented by such a SMILES string. To better understand and generate molecules, Text2Mol [1] and MolT5 [2] proposed a new task of translating between molecules and natural language, i.e., the molecule-caption translation task.
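To make the SMILES notation concrete, here is a minimal sketch using the open-source RDKit library (a general-purpose cheminformatics toolkit, not specific to this paper) that parses phenol's SMILES string and inspects the resulting molecule:

```python
# Minimal sketch: parsing a SMILES string with RDKit.
# "c1ccccc1O" is a SMILES string for phenol (a benzene ring plus a hydroxyl group).
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

mol = Chem.MolFromSmiles("c1ccccc1O")  # returns None if the SMILES is invalid
if mol is not None:
    print(rdMolDescriptors.CalcMolFormula(mol))  # C6H6O
    print(mol.GetNumAtoms())                     # 7 heavy atoms
    print(Chem.MolToSmiles(mol))                 # canonical form: Oc1ccccc1
```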

It consists of two subtasks: molecule captioning (Mol2Cap) and text-based molecule generation (Cap2Mol). As shown in Figure 1(b-c), the goal of molecule captioning is to generate a text describing the molecule's SMILES string, providing people with a better molecular understanding; text-based molecule generation, in turn, aims to generate the corresponding molecule (i.e., its SMILES string) from a given natural language description of properties and functional groups.

Imagine the following scenarios:

• 【Molecule Captioning (Mol2Cap)】A doctor wants to know the properties of a drug, so he gives the drug molecule and his question to a large language model; the model analyzes and predicts the characteristics of the molecule, helping the doctor prescribe the right medicine, as shown in Figure 1(b);

• 【Text-based Molecule Generation (Cap2Mol)】A chemist states his needs directly to a large language model, which helps him generate one or more candidate molecules; experimenting only on these candidates can greatly simplify the molecule or drug discovery process, as shown in Figure 1(c).

Although most existing work has made satisfactory progress on the molecule-caption translation task, it suffers from several limitations. First, the design of model architectures for this task relies heavily on domain experts, which greatly limits the development of AI-driven molecular discovery. Second, most existing methods follow the "pre-train & fine-tune" paradigm, which incurs excessive computational costs. Third, existing methods such as Text2Mol [1] and MolT5 [2] cannot reason about complex tasks or generalize to unseen samples.

Recently, large language models (LLMs) have made great strides in the field of natural language processing (NLP). Beyond their impressive capabilities in natural language understanding and generation, LLMs demonstrate strong generalization and reasoning abilities. They can generalize to unseen tasks through In-Context Learning (ICL) without fine-tuning, greatly reducing computational costs. As a result, LLMs have unprecedented potential to advance molecular discovery, especially the molecule-caption translation task.

While building task-specific LLMs for molecular discovery has great potential to advance scientific research, it also faces significant challenges. First, due to privacy and security concerns, many advanced large language models (such as ChatGPT and GPT-4) are not publicly available; that is, their architectures and parameters are not released, so they cannot be fine-tuned on downstream tasks. Second, training advanced LLMs requires substantial computing resources because of their complex architectures and the large amount of training data needed; designing, pre-training, and fine-tuning one's own LLM is therefore very challenging. Finally, designing appropriate guidelines or prompts, accompanied by a small number of high-quality examples, is essential to improve LLMs' understanding of and reasoning about molecular discovery.

To address these issues, researchers from The Hong Kong Polytechnic University and Michigan State University have explored ways to harness the power of LLMs in molecular discovery. They propose a novel solution that uses prompts to guide LLMs in translating between molecules and molecular text descriptions, as shown in Figure 1(d). Specifically, inspired by ChatGPT, they developed a retrieval-based prompt paradigm, MolReGPT [5], which performs the two subtasks (i.e., molecule captioning and text-based molecule generation) via Morgan-fingerprint-based molecule retrieval and BM25-based caption retrieval combined with in-context learning (ICL), without fine-tuning. Experiments show that MolReGPT reaches a Text2Mol score of 0.560 on Mol2Cap and 0.571 on Cap2Mol, surpassing the fine-tuned MolT5-base on both subtasks. On text-based molecule generation, MolReGPT even surpasses MolT5, improving the Text2Mol metric by 3%. Notably, all of MolReGPT's gains are achieved without any fine-tuning.

02 Method


Figure 2: Overall process framework of MolReGPT.

Training or fine-tuning LLMs on domain-specific corpora for molecular discovery is often infeasible in practice due to the huge computational cost. To address these limitations, the researchers harness the power of LLMs without altering them, proposing an innovative framework, MolReGPT, that enables ChatGPT to perform molecule-caption translation. Specifically, to improve the quality of the guidance/prompts, they introduce a retrieval-based prompt paradigm that steers ChatGPT through in-context learning on the two molecule-related tasks: molecule captioning (Mol2Cap) and text-based molecule generation (Cap2Mol). The MolReGPT framework, shown in Figure 2, consists of four main stages: molecule-caption retrieval, prompt management, in-context few-shot molecule learning, and generation calibration.

1. Molecule-Caption Retrieval (Figure 3): This stage retrieves the n molecule-caption pairs from the database that are most similar to the input molecule or caption (i.e., the few-shot learning examples). It relies on two retrieval methods: molecular Morgan fingerprints (for Mol2Cap) and BM25 (for Cap2Mol).


Figure 3: Molecule-Caption Retrieval.

Morgan fingerprint-based molecular retrieval (for Mol2Cap)


Figure 4: Illustration of molecular Morgan fingerprints and Dice similarity. Green marks substructures that contribute positively to the intermolecular similarity score; purple marks substructures that contribute negatively or differ between the molecules.

To extract the Morgan fingerprint, the SMILES representation of a molecule is converted into an RDKit molecule object using the RDKit library. Dice similarity is then applied to measure the similarity between the input molecule and those in the local database, as shown in Figure 3. Mathematically, it can be expressed as:

$$\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$

where $A$ and $B$ are the Morgan fingerprints of the two molecules, $|A|$ and $|B|$ denote the cardinalities of $A$ and $B$ (e.g., the number of substructures), and $|A \cap B|$ denotes the number of substructures shared by $A$ and $B$. Dice similarity ranges from 0 to 1, where 0 indicates no overlap between the molecules and 1 indicates complete overlap.
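As a concrete illustration, here is a minimal sketch of this retrieval scoring with RDKit; the fingerprint radius and bit size below are illustrative assumptions rather than the paper's exact settings:

```python
# Minimal sketch: Morgan fingerprints + Dice similarity with RDKit.
# Radius 2 and 2048 bits are illustrative assumptions, not the paper's settings.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles: str, radius: int = 2, n_bits: int = 2048):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)

query = morgan_fp("c1ccccc1O")                # phenol as the input molecule
database = ["c1ccccc1", "CCO", "Cc1ccccc1O"]  # toy local database

# Rank database molecules by Dice similarity to the query and keep the
# top-n pairs as few-shot examples.
scored = sorted(
    ((DataStructs.DiceSimilarity(query, morgan_fp(s)), s) for s in database),
    reverse=True,
)
print(scored)
```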

BM25-based caption retrieval (for Cap2Mol)

BM25 is one of the most representative ranking methods in information retrieval, used to compute the relevance of documents to a given query. In the Cap2Mol task, the input caption serves as the query, and the captions in the local database serve as the document corpus, where each caption is one document. Mathematically, the BM25 score can be defined as follows:

$$\mathrm{BM25}(D, Q) = \sum_{i=1}^{N} \mathrm{IDF}(Q_i) \cdot \frac{f(Q_i, D)\,(k_1 + 1)}{f(Q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

where $D$ is a caption from the corpus and $Q$ is the query caption. $N$ is the number of query terms in the query caption, $Q_i$ is the $i$-th query term, $\mathrm{IDF}(Q_i)$ is the inverse document frequency of $Q_i$, $f(Q_i, D)$ is the term frequency of $Q_i$ in $D$, $k_1$ and $b$ are tuning parameters, $|D|$ is the length of $D$, and $\mathrm{avgdl}$ is the average caption length in the corpus. In caption retrieval, BM25 computes similarity scores between captions, so that by selecting the top-scoring molecule-caption pairs, the molecular structures corresponding to the text descriptions can be learned.
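For illustration, here is a minimal sketch of this caption retrieval using the open-source rank_bm25 package; the paper does not mandate a particular BM25 implementation, so treat this choice as an assumption:

```python
# Minimal sketch: BM25 caption retrieval with the rank_bm25 package
# (pip install rank-bm25); the paper's exact implementation may differ.
from rank_bm25 import BM25Okapi

corpus = [
    "The molecule is a member of the class of phenols.",
    "The molecule is an aromatic ether used as a solvent.",
    "The molecule is an amino acid found in proteins.",
]
bm25 = BM25Okapi([caption.lower().split() for caption in corpus])

query = "a phenol derivative".lower().split()
print(bm25.get_scores(query))              # BM25 score for each caption
print(bm25.get_top_n(query, corpus, n=2))  # top-2 captions -> paired molecules
```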

2. Prompt Management (Figure 5): This stage manages and constructs ChatGPT's prompts, which consist of four parts: Role Identification, Task Description, Retrieved Examples, and Output Instructions. The examples come from the retrieval stage in step 1. Each part serves as a specific guide for the output; a sketch of how the parts might be assembled follows the four descriptions below.


Figure 5: Prompt Management.

a. Role Identification

Role identification helps the LLM recognize its role as an expert in chemistry and molecular discovery. By assuming this role, the LLM is encouraged to produce responses consistent with the expertise expected in the domain.

b. Task Description

The task description provides a comprehensive explanation of the task, ensuring that the LLM has a clear understanding of what it needs to do. It also includes key definitions that clarify technical terms or concepts in the molecule-caption translation task.

c. Retrieved Examples

Providing the retrieved examples as part of the user input prompt enables the LLM to exploit the information contained in these few-shot examples when producing its response.

d. Output Instructions

The output instructions dictate the format of the response. Here, the researchers restrict the output to JSON format. Choosing JSON allows quick and efficient validation of the LLM's responses, ensuring that they match the expected structure for further processing and analysis.
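As referenced above, here is a hypothetical sketch of how the four parts might be assembled for the Mol2Cap task; the exact prompt wording used by MolReGPT is given in the paper and its repository:

```python
# Hypothetical sketch of assembling the four prompt parts for Mol2Cap;
# the wording is illustrative, not MolReGPT's actual prompt text.
def build_prompt(examples, query_smiles):
    system_prompt = (
        # a. Role Identification
        "You are an expert chemist working on molecule discovery.\n"
        # b. Task Description
        "Task: given a molecule's SMILES string, write a caption describing "
        "its structure, properties, and functions.\n"
        # d. Output Instructions
        'Respond strictly in JSON: {"caption": "..."}'
    )
    # c. Retrieved Examples (from the molecule-caption retrieval stage)
    shots = "\n\n".join(f"SMILES: {smi}\nCaption: {cap}" for smi, cap in examples)
    user_prompt = f"{shots}\n\nSMILES: {query_smiles}\nCaption:"
    return system_prompt, user_prompt
```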

3. In-Context Few-Shot Molecule Learning (Figure 6): At this stage, the system prompt and the user input prompt are provided to ChatGPT for in-context few-shot molecule learning. This process relies on the in-context learning ability of the large language model: with only a small number of similar examples, it can capture the correspondence between molecular structures and captions to perform molecule-caption translation, without any fine-tuning of the model.

Together, the system prompt and the user input prompt give ChatGPT clear guidance through in-context learning: the system prompt establishes the task framework and the molecular-domain expertise, while the user prompt narrows the scope and directs the model's attention to the specific user input.


Figure 6: In-Context Few-Shot Molecule Learning.
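A minimal sketch of this stage using the openai Python client (v1 interface) might look as follows; the model name and decoding parameters are illustrative assumptions:

```python
# Minimal sketch of the in-context few-shot call; model name and parameters
# are illustrative, not necessarily the paper's exact settings.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_llm(system_prompt: str, user_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",   # the paper reports results with GPT-3.5-turbo
        temperature=0.0,         # assumption: deterministic decoding
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return resp.choices[0].message.content  # raw text, calibrated in stage 4
```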

4. Generation Calibration (Figure 7): At this stage, ChatGPT's output is calibrated to ensure that it conforms to the expected format and requirements. If the output does not meet expectations, the system re-queries ChatGPT until a valid response is obtained or the maximum number of allowed errors is reached.

Despite being given the desired output format, LLMs such as ChatGPT occasionally produce unexpected responses, including incorrectly formatted output or refusals to answer. To address these issues, the researchers introduce a generation calibration mechanism to verify ChatGPT's responses. In generation calibration, the format of the raw response is first checked by parsing it as a JSON object. If parsing fails, indicating a deviation from the expected format, several predefined format-correction strategies, such as regular-expression matching, are applied to correct the format and extract the desired result from the response. If the raw response passes the format check, or can be calibrated by a format-correction strategy, it is considered valid and accepted as the final answer. Otherwise, a re-query is initiated. Notably, there is one special case: when the response reports a "maximum input length exceeded" error, the longest retrieved example is removed during the re-query phase until the query length meets the limit. The re-query process issues additional queries to the LLM until a valid response is obtained or the maximum number of allowed errors is reached. This limit ensures that the system does not get stuck in an endless loop and instead returns a suitable response to the user within an acceptable time.

By employing the generation calibration stage, unexpected deviations from the desired output format are reduced, and the final response is ensured to be consistent with the expected format and requirements.
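The loop just described can be sketched as follows; query_llm is the chat call from the previous stage's sketch, and the retry limit and regex strategy are illustrative assumptions:

```python
# Sketch of the generation-calibration loop: parse the response as JSON,
# fall back to regex extraction, and re-query up to a retry limit.
import json
import re

MAX_RETRIES = 3  # assumption: the paper sets a maximum allowed error count

def calibrate(query_llm, system_prompt, user_prompt):
    for _ in range(MAX_RETRIES):
        raw = query_llm(system_prompt, user_prompt)
        try:
            return json.loads(raw)  # format check: is it valid JSON?
        except json.JSONDecodeError:
            # Format-correction strategy: extract the first {...} span by regex.
            match = re.search(r"\{.*\}", raw, re.DOTALL)
            if match:
                try:
                    return json.loads(match.group(0))
                except json.JSONDecodeError:
                    pass  # correction failed; fall through to re-query
    return None  # retry limit reached; report failure instead of looping forever
```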


Figure 7: Generation Calibration.

03 Results

Molecular Description Generation Task (Mol2Cap)


Table 1: Performance comparison of different models on the molecular description generation (Mol2Cap) task on the ChEBI-20 dataset [3,4].


Table 3: Performance comparison of MolReGPT with different numbers of shots (N-shot) on the molecule captioning (Mol2Cap) task.

The results of the Mol2Cap task are shown in Tables 1 and 3: MolReGPT obtains ROUGE scores comparable to the fine-tuned MolT5-base [2] while outperforming all selected baseline models on the remaining metrics.

In addition, the ablation experiment compares three retrieval strategies, as shown in Table 3: Random, BM25, and Morgan FTS (the strategy adopted by MolReGPT). The Random strategy retrieves n random examples, while BM25 here applies the character-level BM25 algorithm to the SMILES string representations of molecules. Among the three strategies, Morgan FTS performs best under the same number of few-shot examples, outperforming BM25 by 37% on the Text2Mol [1] metric.

Moreover, Morgan FTS nearly doubles the ROUGE-L score compared to the Random and BM25 retrieval strategies. This shows that comparing distinctive structural features, such as functional groups, yields a better estimate of the structural similarity between molecules, features that are often reflected in the detailed wording of molecule captions. In this case, retrieving similar molecules via Morgan FTS effectively guides the LLM to learn the association between molecular structures and captions, resulting in more accurate and desirable outputs.

Figure 8 shows an example of molecule captioning to compare the performance of different models. From the given example, it can be seen that MolReGPT generates captions containing the key information about the input molecule. Moreover, the generated captions are grammatically more polished and easier for humans to understand.


Figure 8: Examples of molecule captions generated by different models (the SMILES strings are converted to molecular diagrams for better presentation).

Text-based molecular generation task (Cap2Mol)


Table 2: Performance comparison of different models on text-based molecular generation (Cap2Mol) tasks on the ChEBI-20 dataset.


Table 4: Performance comparison of MolReGPT with different numbers of shots (N-shot) on the text-based molecule generation (Cap2Mol) task.

Given the caption of a molecule (covering its structure and properties), Cap2Mol aims to generate the corresponding molecule (i.e., a SMILES string) for molecular discovery. The results are presented in Tables 2 and 4. Comparing all baseline models, 10-shot MolReGPT significantly enhances the capabilities of GPT-3.5-turbo and achieves the best overall performance. MolReGPT achieves a significant 15% improvement over MolT5-base on the Text2Mol metric. Considering the molecular fingerprint scores (MACCS FTS, RDK FTS, and Morgan FTS), 10-shot MolReGPT also obtains an average improvement of 18% over MolT5-base. In addition, MolReGPT achieves the highest exact-match score, with 13.9% of the generated molecules exactly matching the ground truth. Notably, all of these impressive results are achieved without any additional training or fine-tuning.

Figure 9 shows examples of text-based molecule generation results to compare the performance of different models. As can be seen from the given example, MolReGPT generates structures that are more similar to the ground truth.


Figure 9: Examples of molecules generated by different models (the SMILES strings are converted to molecular diagrams for better presentation).

04 Discussion


Figure 10: Comparison of molecules generated by MolT5 and MolReGPT given the same input.

The paper further explores molecule generation from customized text. As shown in Figure 10, the input in Example 1 specifies five benzene rings and hydrophobic groups in the structure. However, MolT5 produces an incorrect number of benzene rings, and its generated structure contains some hydrophilic groups. In contrast, MolReGPT gives the correct structure corresponding to the input. In Example 2, both MolT5 and MolReGPT generate the correct number of benzene rings, but MolReGPT generates more hydrophilic groups, which better matches the given input.

05 Conclusion

This paper proposes MolReGPT, a general retrieval-based in-context few-shot molecule learning prompt paradigm that equips large language models such as ChatGPT with molecular discovery capabilities. MolReGPT uses the principle of molecular similarity to retrieve molecule-caption pairs from a local database as in-context learning examples, guiding the large language model to generate molecule SMILES strings without any fine-tuning.

This work focuses on the molecule-caption translation task, comprising molecule captioning (Mol2Cap) and text-based molecule generation (Cap2Mol), and evaluates the capabilities of large language models on it. The experimental results show that MolReGPT enables ChatGPT to reach Text2Mol scores of 0.560 and 0.571 on caption generation and molecule generation, respectively. In both molecular understanding and text-based molecule generation, its performance exceeds fine-tuned models such as MolT5-base and is even comparable to the fine-tuned MolT5-large. In conclusion, MolReGPT provides a novel and versatile paradigm for deploying large language models in molecular discovery through in-context learning, greatly reducing the cost of domain transfer and exploring the potential of large language models in molecular discovery.

References

[1] Edwards, C., Zhai, C., and Ji, H. Text2Mol: Cross-modal molecule retrieval with natural language queries. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 595–607, 2021.

[2] Edwards, C., Lai, T., Ros, K., Honke, G., Cho, K., and Ji, H. Translation between molecules and natural language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 375–413, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.

[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[4] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.

[5] Li, J., Liu, Y., Fan, W., Wei, X. Y., Liu, H., Tang, J., & Li, Q. (2023). Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective. arXiv preprint arXiv:2306.06615.
