
Exploring the potential of large language models for graph learning (with link)

Author: Data-pie THU
Source: NewbeeNLP. About 6,800 words; suggested reading time 14 minutes. This article mainly uses node classification as its research task; the limitations of this choice and the possibility of extending to other tasks are discussed at the end.

Paper URL:

https://arxiv.org/abs/2307.03393

Code address:

https://github.com/CurryTang/Graph-LLM

Graphs are a very important form of structured data with broad application scenarios. In the real world, the nodes of a graph are often associated with attributes in some textual form. Taking the product graph in e-commerce (the OGBN-Products dataset) as an example, each node represents a product on the e-commerce website, and the product description can serve as the corresponding node attribute. In graph learning, related work often calls this type of graph, where text serves as the node attributes, a text-attributed graph (hereinafter TAG). TAGs are very common in graph machine learning research; for example, several of the most commonly used paper-citation datasets in graph learning are TAGs. Beyond the structural information of the graph itself, the text attributes attached to nodes also provide important textual information, so it is necessary to consider structural information, textual information, and the interplay between the two at the same time. However, in past research the importance of textual information has often been overlooked. For example, common datasets available in popular libraries such as PyG and DGL (such as the classic Cora dataset) do not provide the raw text attributes, but only bag-of-words features in embedded form, as the short sketch below illustrates. Moreover, the commonly used GNNs focus more on modeling the topology of the graph and lack the ability to understand node attributes.
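As a concrete illustration of this gap, here is a minimal sketch (assuming PyTorch Geometric is installed) showing that the classic Cora dataset ships only a bag-of-words feature matrix and no raw text:

```python
# Minimal sketch: the classic Cora dataset in PyG exposes only bag-of-words
# node features; the original titles/abstracts are not part of the dataset.
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root="data/Planetoid", name="Cora")
data = dataset[0]

print(data.x.shape)           # torch.Size([2708, 1433]) -- 0/1 bag-of-words vectors
print(data.edge_index.shape)  # torch.Size([2, 10556]) -- citation edges
# No attribute of `data` holds the raw text that produced these features.
```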

Compared with previous work, this paper investigates how to better process the text information and how different text embeddings combined with GNNs affect downstream task performance. The most popular tools for processing text today are large language models (LLMs); this paper uses the term LLM broadly to cover language models from BERT to GPT-4 that have been pre-trained on large-scale corpora. Compared with bag-of-words features such as TF-IDF, LLMs have the following potential advantages.

  • First, LLMs are context-aware and can better handle polysemous words.
  • Second, through pre-training on large-scale corpora, LLMs are generally considered to have stronger semantic understanding, as reflected in their excellent performance on various NLP tasks.

Given the diversity of LLMs, the goal of this paper is to design suitable frameworks for different types of LLMs. For the problem of fusing LLMs and GNNs, this paper first divides LLMs into two categories: embedding-visible and embedding-invisible. LLMs such as ChatGPT, which can only be accessed through an interface, fall into the latter category. For embedding-visible LLMs, this paper considers three paradigms:

  1. Pre-trained language models with an encoder-only structure, represented by BERT. Such models typically need to be fine-tuned on downstream data.
  2. Sentence embedding models, represented by Sentence-BERT. These generally undergo further supervised/unsupervised training on top of the first type of model and do not need to be fine-tuned on downstream data. This paper also considers commercial embedding models, represented by OpenAI's text-embedding-ada-002.
  3. Open-source decoder-only large models, represented by LLaMA. These generally have far more parameters than the first type of model. Considering the cost of fine-tuning and the risk of catastrophic forgetting, this article mainly examines the untuned base models.

These embedding-visible large models can first be used to generate text embeddings, and the text embeddings can then serve as the initial features of a GNN, fusing the two types of models together. For embedding-invisible LLMs such as ChatGPT, however, how to apply their powerful capabilities to graph learning tasks becomes a challenge.
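A minimal sketch of this cascading use of an embedding-visible model is shown below; the Sentence-BERT checkpoint name and the toy node texts are illustrative assumptions, not the paper's exact configuration.

```python
# Encode each node's text with a sentence-embedding model and use the vectors
# as the initial node feature matrix of a GNN.
import torch
from sentence_transformers import SentenceTransformer

node_texts = [
    "Title: Graph attention networks. Abstract: ...",
    "Title: Language models are few-shot learners. Abstract: ...",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed checkpoint
x = torch.tensor(encoder.encode(node_texts))        # [num_nodes, 384]

# `x` can now replace the bag-of-words matrix (e.g. `data.x`) before GNN training.
```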

To address these problems, this paper proposes a framework for applying LLMs to graph learning tasks, as shown in Figures 1 and 2 below. In the first mode, LLMs-as-Enhancers, the ability of the large model is mainly used to enhance the original node attributes, which are then fed into a GNN to improve downstream task performance. For embedding-visible LLMs, feature-level enhancement is used, and the language model is combined with the GNN through either a cascading or an iterative (GLEM, ICLR 2023) optimization scheme. For embedding-invisible LLMs, text-level enhancement is used, augmenting the original node attributes through the LLM. Considering the zero-shot learning and reasoning ability of LLMs such as ChatGPT, this paper further explores representing the attributes and structure of graph nodes in prompt form and letting the large model directly generate predictions; this paradigm is called LLMs-as-Predictors.

In the experimental part, this paper mainly uses node classification as the research task; the limitations of this choice and the possibility of extending it to other tasks are discussed at the end. Next, following the structure of the paper, here is a brief share of the interesting findings in each mode.


Figure 1. Schematic diagram of LLMs-as-Enhancers. For embedding-visible large language models, text embeddings are generated directly and used as the initial node features of the GNN. For embedding-invisible large models, prompts are designed to enhance the original node attributes.


Figure 2. Schematic diagram of LLMs-as-Predictors. For embedding-invisible large language models, go one step further and directly design prompts so that the LLM outputs the final prediction.

Feature enhancement with LLMs: LLMs-as-Enhancers

First, this paper examines the mode of using LLMs to generate text embeddings that are then fed into GNNs. In this mode, depending on whether the LLM's embeddings are visible, feature-level enhancement and text-level enhancement are proposed. For feature-level enhancement, the optimization process between the language model and the GNN is further considered, subdividing it into cascading structures and iterative structures. The two enhancement methods are described in turn below.


Figure 3. Flow diagram of LLMs-as-Enhancers. The first two diagrams correspond to feature-level enhancement, namely the cascading structure and the iterative structure. The last one corresponds to text-level enhancement.

Feature-level enhancements

For feature-level enhancement, this paper mainly considers three factors: the language model, the GNN, and the optimization method. In terms of language models, it considers pre-trained language models represented by DeBERTa, open-source sentence embedding models represented by Sentence-BERT, commercial embedding models represented by text-embedding-ada-002, and open-source large models represented by LLaMA. For these language models, the paper mainly examines how model type and parameter scale affect downstream tasks.

From the GNN perspective, this paper mainly considers how the message-passing mechanism in GNN design affects downstream tasks. It selects GCN, GraphSAGE, and GAT as three representative models, and for the OGB datasets it selects the top-ranked models RevGAT and SAGN. MLP is also included to examine the downstream performance of the raw embeddings themselves.
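For reference, a minimal two-layer GCN of the kind used as the downstream classifier might look as follows; this is a sketch with assumed hyperparameters, written with PyTorch Geometric.

```python
# A simple two-layer GCN that consumes the LLM-generated text embeddings.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        # x: [num_nodes, in_dim] text embeddings; edge_index: [2, num_edges]
        h = F.relu(self.conv1(x, edge_index))
        h = F.dropout(h, p=0.5, training=self.training)
        return self.conv2(h, edge_index)

# Swapping GCNConv for SAGEConv or GATConv gives the GraphSAGE / GAT variants.
```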

From the perspective of optimization methods, this paper mainly examines cascading structures and iterative structures.

  • For cascading structures, this article considers outputting text embeddings directly from the language model. For smaller models that can be fine-tuned, the paper compares text-based fine-tuning with structure-based self-supervised training (GIANT, ICLR 2022). Either way, the result is a language model that is then used to generate text embeddings. In this process, the training of the language model is separate from the training of the GNN.
  • For iterative structures, this paper focuses on the GLEM method (ICLR 2023), which uses EM-style variational inference to iteratively co-train the GNN and the language model; a toy skeleton of this alternation follows the list.
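To make the contrast concrete, here is a toy skeleton of the alternating pseudo-label idea behind the iterative structure. It is not the actual GLEM algorithm: both the "language model" and the "GNN" are stand-in MLPs over random features, and the variational EM derivation is reduced to plain pseudo-label exchange for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_nodes, n_feat, n_class = 500, 32, 5
x_text = torch.randn(n_nodes, n_feat)      # stand-in for LM text embeddings
x_struct = torch.randn(n_nodes, n_feat)    # stand-in for structure-aware features
y = torch.randint(0, n_class, (n_nodes,))
labeled = torch.zeros(n_nodes, dtype=torch.bool)
labeled[:100] = True                       # small labeled set

lm = nn.Sequential(nn.Linear(n_feat, 64), nn.ReLU(), nn.Linear(64, n_class))
gnn = nn.Sequential(nn.Linear(n_feat, 64), nn.ReLU(), nn.Linear(64, n_class))

def train_model(model, feats, targets, epochs=50):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(model(feats), targets)
        loss.backward()
        opt.step()

for em_round in range(3):
    # E-step analogue: fit the "LM" on gold labels plus the GNN's pseudo-labels.
    with torch.no_grad():
        pseudo_gnn = gnn(x_struct).argmax(dim=1)
    train_model(lm, x_text, torch.where(labeled, y, pseudo_gnn))
    # M-step analogue: fit the "GNN" on gold labels plus the LM's pseudo-labels.
    with torch.no_grad():
        pseudo_lm = lm(x_text).argmax(dim=1)
    train_model(gnn, x_struct, torch.where(labeled, y, pseudo_lm))
```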

In the experimental part, this paper selects several representative common TAG datasets; the specific experimental settings can be found in our paper. Below, the experimental results for this part are shown first (given limited space, only the results on the two large graphs are shown here), followed by a brief discussion of some interesting findings.


Figure 4. Experimental results on the Arxiv and Products datasets; the rows (left) correspond to different language models and the columns (top) to different GNNs. Yellow, green, and red mark the first-, second-, and third-best combinations, respectively.

Several interesting conclusions can be drawn from the experimental results.

First, the effectiveness of GNNs varies greatly across different text embeddings. A particularly clear example occurs on the Products dataset: with MLP as the classifier, the embeddings of the fine-tuned pre-trained language model DeBERTa-base are much better than TF-IDF. However, once a GNN is used, the gap between the two becomes small, and with the SAGN model TF-IDF even performs better. This phenomenon may be related to over-smoothing and feature correlation in GNNs, but there is no complete explanation yet, so it remains an interesting research topic.

Second, using a sentence embedding model as the encoder and then cascading it with a GNN yields good downstream performance. On the Arxiv dataset in particular, simply cascading Sentence-BERT with RevGAT achieves performance close to GLEM and even surpasses GIANT with self-supervised training. Note that this is not because a language model with more parameters is used: the Sentence-BERT used here is the MiniLM version, which is even smaller than the BERT used by GIANT. One possible reason is that Sentence-BERT, trained on the natural language inference (NLI) task, provides implicit structural information, and NLI is somewhat similar in form to link prediction. Of course, this is only a very preliminary conjecture, and the specific conclusions need further exploration. This result is also suggestive in other ways: for example, could pre-training for graphs be carried out directly through language-model pre-training, a much more mature recipe, and could that achieve better results than pre-trained GNNs? Meanwhile, OpenAI's paid embedding model offers little improvement over open-source models on the node classification task.

Third, LLaMA achieves better results than un-fine-tuned DeBERTa, but there is still a gap compared with sentence embedding models. This suggests that the type of model may be a more important consideration than its parameter count. For DeBERTa, this article uses the [CLS] hidden state as the sentence vector. For LLaMA, it uses the llama-cpp embedding in LangChain, whose implementation uses the [EOS] (last-token) hidden state as the sentence vector. Previous studies have explained why [CLS] performs poorly without fine-tuning, mainly due to its anisotropy, which results in poor separability. In our experiments, LLaMA-generated text embeddings achieve good downstream performance at high labeling rates, which suggests that increasing the number of parameters may alleviate this problem to some extent. A sketch of the two pooling choices is given below.
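The sketch below illustrates the difference between the two pooling strategies, using a small encoder and a small decoder-only model as stand-ins; the paper's actual checkpoints are DeBERTa and LLaMA via llama-cpp, so the models and pooling code here are illustrative assumptions.

```python
# [CLS] pooling for an encoder model vs. last-token pooling for a decoder-only model.
import torch
from transformers import AutoModel, AutoTokenizer

text = "Title: Attention is all you need. Abstract: ..."

# Encoder-only: take the hidden state of the [CLS] token (position 0).
enc_tok = AutoTokenizer.from_pretrained("microsoft/deberta-base")
enc = AutoModel.from_pretrained("microsoft/deberta-base")
enc_inputs = enc_tok(text, return_tensors="pt")
with torch.no_grad():
    cls_embedding = enc(**enc_inputs).last_hidden_state[:, 0]   # [1, hidden_dim]

# Decoder-only: take the hidden state of the last token (EOS-style pooling).
dec_tok = AutoTokenizer.from_pretrained("gpt2")   # small stand-in for LLaMA
dec = AutoModel.from_pretrained("gpt2")
dec_inputs = dec_tok(text, return_tensors="pt")
with torch.no_grad():
    hidden = dec(**dec_inputs).last_hidden_state                # [1, seq_len, hidden_dim]
    eos_embedding = hidden[:, -1]                               # last-token pooling
```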

Text-level enhancements

Feature-level enhancement already produces some interesting results. However, it requires the language model to be embedding-visible. For embedding-invisible models such as ChatGPT, text-level enhancement can be used instead. For this part, the paper first examines a recent arXiv preprint, Explanations as Features (TAPE), which uses LLM-generated explanations of its predictions as enhanced attributes and, through an ensemble approach, ranks first on the OGB Arxiv leaderboard. In addition, this paper proposes an LLM-based knowledge enhancement method, Knowledge-Enhanced Augmentation (KEA). Its core idea is to use the LLM as a knowledge base: identify the knowledge-related key information in the text and generate more detailed explanations, mainly to compensate for the lack of knowledge in language models with fewer parameters. Schematic diagrams of the two methods are shown below.
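A minimal sketch of what such a text-level augmentation call might look like is given below. The prompt wording is paraphrased for illustration (it is not the exact prompt of TAPE or KEA), and the code assumes the openai>=1.0 Python client with an API key set in the environment.

```python
# Ask an embedding-invisible LLM to generate knowledge-style explanations that
# are later attached to the original node text (in the spirit of TAPE / KEA).
from openai import OpenAI

client = OpenAI()

def augment_node_text(title: str, abstract: str) -> str:
    prompt = (
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "List the key technical terms in this paper and give a short, "
        "self-contained explanation of each."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# The returned explanations are concatenated with (or encoded alongside) the
# original node text before being fed to the sentence encoder and GNN.
```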


Figure 5. Schematic diagram of text-level enhancement


Figure 6. Sample output of TAPE. The enhanced attribute has three parts: the original attribute TA, the generated explanation E, and the pseudo-label P.


Figure 7. Sample output of KEA. The enhancement produces a dictionary-like mapping from keywords to their explanations. This paper tries two ways of combining the original and enhanced attributes: concatenating them directly at the text level, or encoding them separately and then ensembling, denoted KEA-I and KEA-S respectively.

To test the effectiveness of the two methods, the experimental setup of the previous part is followed. Considering the cost of using LLMs, experiments are conducted on two small graphs, Cora and Pubmed. For the LLM, we chose gpt-3.5-turbo, i.e. ChatGPT. First, to better understand how text-level enhancement works and why TAPE is effective, we conduct detailed ablation experiments on TAPE.


Figure 8. Ablation results for TAPE. TA denotes the original features, E the LLM-generated prediction and explanation, and P the LLM-generated pseudo-label.

In the ablation experiments, we mainly considered the following questions:

  • Whether TAPE's effectiveness mainly comes from the generated explanation E or from the pseudo-label P
  • Which language model is most appropriate for encoding the enhanced attributes

From the experimental results, the pseudo-labels depend heavily on the zero-shot prediction ability of the LLM itself (discussed in more detail in the next chapter) and may drag down performance after ensembling at low labeling rates. Therefore, in subsequent experiments, this paper only uses the original attribute TA together with the explanation E. Second, sentence encoders achieve better results at low labeling rates than fine-tuned pre-trained models, so this paper adopts the sentence embedding model E5. In addition, an interesting phenomenon is that on the Pubmed dataset, the fine-tuning-based method achieves very good performance once the enhanced features are used. One possible explanation is that the model mainly learns a "shortcut" from the LLM's prediction results, so TAPE's performance will be highly correlated with the prediction accuracy of the LLM itself. A small sketch of the TA + E encoding step follows. Next, we compare the effectiveness of TAPE and KEA.
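This is roughly what the TA + E variant might look like in code; the E5 checkpoint name and the example texts are assumptions for illustration.

```python
# Concatenate the original attribute (TA) with the LLM-generated explanation (E)
# and encode the result with an E5-style sentence encoder.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("intfloat/e5-small-v2")   # assumed E5 checkpoint

original_text = "Title: ... Abstract: ..."
llm_explanation = "This paper most likely belongs to cs.LG because ..."

# E5 models expect a "query: " / "passage: " prefix; node texts are treated as passages.
augmented = f"passage: {original_text} {llm_explanation}"
feature = encoder.encode([augmented])   # one vector per node, used as the GNN input feature
```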


Figure 9. Comparison of KEA and TAPE

In the experimental results, both KEA and TAPE improve over the original features. KEA achieves better results on Cora, while TAPE is more effective on Pubmed. As discussed in the next chapter, this is related to the LLM's good predictive performance on Pubmed itself. Compared with TAPE, KEA performs more consistently across datasets because it does not rely on the LLM's predictions. Beyond these two datasets, this kind of text-level enhancement has many more use cases. Smaller pre-trained language models such as BERT or T5 often lack ChatGPT-level reasoning ability and cannot understand diverse inputs such as code or formatted text as well as ChatGPT does. Therefore, for problems in these scenarios, the original content can first be transformed through a large model such as ChatGPT; a smaller model trained on the transformed data then offers faster inference and lower inference cost. At the same time, if a certain number of labeled samples is available, fine-tuning captures the dataset-specific information better than in-context learning.

Predicting with LLMs: LLMs-as-Predictors

In this part, the paper further considers whether the GNN can be discarded entirely by designing prompts that let the LLM generate predictions directly. Since this article mainly considers node classification, a simple baseline is to treat node classification as a text classification task. Based on this idea, the paper first designs some simple prompts to test how well the LLM performs without using any graph structure, considering zero-shot and few-shot prompts and testing the effect of chain-of-thought prompting.
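A minimal sketch of such a structure-free zero-shot prompt is shown below; the wording is paraphrased rather than the paper's exact prompt, and the label list used here is Cora's seven classes.

```python
# Zero-shot node classification cast as plain text classification.
from openai import OpenAI

client = OpenAI()
CATEGORIES = ["Case Based", "Genetic Algorithms", "Neural Networks",
              "Probabilistic Methods", "Reinforcement Learning",
              "Rule Learning", "Theory"]   # Cora label set

def zero_shot_classify(node_text: str) -> str:
    prompt = (
        f"Paper: {node_text}\n\n"
        f"Which of the following categories does this paper belong to: "
        f"{', '.join(CATEGORIES)}? Answer with the category name only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```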


Figure 10. Prompt designs that do not use graph structure information

The experimental results are shown in the figure below. The LLM's performance varies greatly across datasets. On Pubmed, the LLM even outperforms the GNN in the zero-shot setting, while on datasets such as Cora and Arxiv there is a large gap relative to the GNN. Note that for the GNN results here, 20 samples per class are used for training on Cora, CiteSeer, and Pubmed, while the Arxiv and Products datasets provide many more training samples. In contrast, the LLM's predictions are based on zero or only a few samples, whereas GNNs have no zero-shot ability and perform poorly with very few samples. Of course, the input length limit also prevents the LLM from including more in-context examples.


Figure 11. Results of prompts that do not use graph structure information

Analyzing the experimental results shows that some of the LLM's prediction errors are actually reasonable. An example is shown in Figure 12. Many papers are themselves interdisciplinary, so when the LLM reasons from its own commonsense knowledge, its prediction sometimes does not match the preference of the annotated label. It is also worth pondering whether this single-label setting is reasonable.


Figure 12. A reasonable error

In addition, the LLM performed worst on the Arxiv dataset, which is inconsistent with the conclusions in TAPE, so it is necessary to compare how the two prompts differ. The prompt used by TAPE is shown below.

Abstract: <abstract text> \n Title: <title text> \n Question: Which arXiv CS sub-category does this paper belong to? Give 5 likely arXiv CS sub-categories as a comma-separated list ordered from most to least likely, in the form "cs.XX", and provide your reasoning. \n \n Answer:

Interestingly, TAPE's prompt does not even state which categories exist in the dataset, but directly leverages the knowledge about arXiv already stored in the LLM. Curiously, with this small change the LLM's prediction performance changes dramatically, which raises the suspicion that this has something to do with test-set label leakage. As a high-quality corpus, arXiv data is very likely included in the pre-training of various LLMs, and TAPE's prompt may help the LLM better recall that pre-training corpus. This reminds us to rethink the soundness of the evaluation: accuracy may reflect not the quality of the prompt or the ability of the language model, but merely the LLM's memorization. Both of the above issues concern dataset evaluation and are very valuable future directions.

Further, this article also considers whether structural information can be included in the prompt in textual form, and tests several ways of representing it. Specifically, we try expressing edge relations explicitly in natural language (e.g. stating that two papers are "connected"), and expressing them implicitly by summarizing the information of neighboring nodes.

The results show that the following implicit expression is the most effective.

Paper: <paper content>

Neighbor Summary: <neighbor summary>

Instruction: <task instruction>

Specifically, imitating the idea of GNNs, second-order neighbors are sampled, and their text content is fed to the LLM to produce a summary that serves as the structural information, as shown in Figure 13.
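A sketch of how such a neighbor-summary prompt could be assembled is shown below. The sampling strategy and prompt wording are simplified assumptions; `chat` stands for any chat-completion helper such as the one sketched earlier.

```python
# Build the "Paper / Neighbor Summary / Instruction" prompt from sampled neighbors.
import random

def sample_two_hop(node, adj, k=5):
    """Sample up to k nodes from the 1- and 2-hop neighborhood of `node`."""
    one_hop = set(adj[node])
    two_hop = {v for u in one_hop for v in adj[u]} - one_hop - {node}
    pool = list(one_hop | two_hop)
    return random.sample(pool, min(k, len(pool)))

def build_structure_prompt(node, adj, texts, instruction, chat):
    neighbors = sample_two_hop(node, adj)
    summary = chat(
        "Summarize the common topic of the following papers:\n"
        + "\n".join(texts[v] for v in neighbors)
    )
    return (
        f"Paper: {texts[node]}\n"
        f"Neighbor Summary: {summary}\n"
        f"Instruction: {instruction}"
    )
```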


Figure 13. Example of a prompt with an LLM-generated neighbor summary

This paper tests the effectiveness of this prompt on several datasets, with results shown in Figure 14. On the four datasets other than Pubmed, it obtains a consistent improvement over the structure-free prompts, reflecting the effectiveness of the method. The paper further analyzes why this prompt fails on the Pubmed dataset.


Figure 14. Results of prompts that use graph structure information

On the Pubmed dataset, the label of a sample often appears directly in its text attribute. An example is shown below. Because of this property, achieving better results on Pubmed largely amounts to learning this "shortcut", and the LLM's particularly good performance on this dataset may stem from it. In this case, adding the summarized neighbor information may make it harder for the LLM to capture the "shortcut", so performance degrades.

Title: Predictive power of sequential measures of albuminuria for progression to ESRD or death in Pima Indians with type 2 diabetes. ... (content omitted here)

Ground truth label: Diabetes Mellitus Type 2

Further, at heterophilous nodes whose neighbors have different labels from their own, the LLM, like a GNN, is disturbed by the neighbor information and outputs incorrect predictions.


Figure 15. Neighbor heterophily causes incorrect predictions

Heterophily in GNNs is also an interesting research direction; see also our paper Demystifying Structural Disparity in Graph Neural Networks: Can One Size Fit All?

Case study: Generating annotations with LLM

As the discussion above shows, the LLM can achieve good zero-shot prediction performance in some cases, which makes it possible to replace manual annotation. This paper preliminarily explores the possibility of using the LLM to generate labels and then training a GNN with those labels.


Figure 16. Training a GNN with LLM-generated labels

For this problem, two points need to be studied; a sketch of such a pipeline follows the list below:

  • How to select important nodes in the graph, based on its structure and attributes, so as to maximize the benefit of annotation; this is similar to the setting of active learning on graphs
  • How to estimate the quality of the annotations generated by the LLM and filter out wrong annotations
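A sketch of the overall pipeline implied by these two questions is given below; the degree-based selection heuristic and the confidence threshold are illustrative assumptions, not the paper's final design.

```python
# Select structurally important nodes, label them with an LLM, filter by
# confidence, and use the surviving pseudo-labels to train the GNN.
import torch

def select_nodes_by_degree(edge_index, num_nodes, budget):
    """Spend the annotation budget on high-degree nodes (a simple proxy for
    structural importance; active-learning criteria could replace it)."""
    deg = torch.bincount(edge_index[0], minlength=num_nodes)
    return torch.topk(deg, budget).indices

def annotate_with_llm(nodes, texts, classify):
    """`classify(text)` returns (label, confidence); keep confident labels only."""
    pseudo = {}
    for n in nodes.tolist():
        label, conf = classify(texts[n])
        if conf >= 0.8:           # assumed confidence threshold
            pseudo[n] = label
    return pseudo

# The surviving pseudo-labels are then used exactly like human labels when
# training the GCN sketched earlier.
```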

Discussion

Finally, a brief discussion of the limitations of this article and some interesting follow-up directions. First, it should be noted that this article mainly targets the node classification task; extending this pipeline to more graph learning tasks needs further research, and from this point of view the title may be somewhat overclaimed. In addition, there are scenarios where valid node attributes cannot be obtained. For example, in financial transaction networks, user nodes are anonymous in many cases, and constructing meaningful prompts that the LLM can understand becomes a new challenge.

Second, how to reduce the cost of using LLMs is also worth considering. The enhancement discussed in this article takes each node as input: with N nodes, it requires N interactions with the LLM, which is expensive. During the experiments we also tried open-source models such as Vicuna, but the quality of the generated content was still far from ChatGPT's. In addition, API calls to ChatGPT cannot currently be batched, which is inefficient. How to reduce cost and improve efficiency while maintaining performance is worth studying.

Finally, an important issue is the evaluation of LLMs. This article has discussed possible test-set leakage and the possibly unreasonable single-label setting. To address the first problem, a simple idea is to use data that is not in the large model's pre-training corpus, but this requires continually updating the dataset and generating correct human labels. For the second problem, one possible solution is a multi-label setting. For paper classification datasets such as Arxiv, high-quality multi-label annotations can be generated from arXiv's own categories, but in more general cases, how to generate correct labels remains a difficult problem.

References

[1] Zhao J, Qu M, Li C, et al. Learning on large-scale text-attributed graphs via variational inference[J]. arXiv preprint arXiv:2210.14709, 2022.

[2] Chien E, Chang W C, Hsieh C J, et al. Node feature extraction by self-supervised multi-scale neighborhood prediction[J]. arXiv preprint arXiv:2111.00064, 2021.

[3] He X, Bresson X, Laurent T, et al. Explanations as Features: LLM-Based Features for Text-Attributed Graphs[J]. arXiv preprint arXiv:2305.19523, 2023.
