
Scaling Knowledge Graphs by Eliminating Edges


How we changed the data model to reduce the complexity class of adding nodes while enabling faster traversal.

Translated from Scaling Knowledge Graphs by Eliminating Edges, by Ben Chambers.

Knowledge graphs link related content in a way that complements vector similarity. Just as hyperlinks connect a website to relevant information, edges in a graph connect content to other related, but not necessarily similar, content.

In general, this provides more complete context for generative AI applications, resulting in more complete answers. Instead of ten near-duplicate copies of the most relevant information, you get one copy plus nine pieces of related information that add depth and breadth on the topic.

We recently introduced content-centric knowledge graphs as an approach better suited to generative AI and graph retrieval-augmented generation (graph RAG). Since then, we've contributed GraphVectorStore to LangChain and introduced several techniques for inferring links between content, including explicit links in HTML, common keywords using KeyBERT, named entity extraction using GLiNER, and document and heading hierarchies.

Some of these techniques produce highly connected knowledge graphs. For example, linking nodes by common keywords will create clusters of highly connected chunks when applied to documents on the same topic.

In the worst case, if every node has the same five keywords, we create 5 * n * (n - 1) edges between the n nodes. Because edges are created when data is loaded, the time to load nodes grows quadratically: each new node must be linked to all previous nodes!
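A quick back-of-the-envelope sketch makes the difference concrete (the function names here are ours, purely for illustration):

```python
def materialized_edge_count(n: int, t: int) -> int:
    """Directed edges when all n nodes share the same t keywords:
    every keyword links every ordered pair of distinct nodes."""
    return t * n * (n - 1)


def link_count(n: int, t: int) -> int:
    """Stored links instead: one outbound and one inbound link
    per keyword per node, regardless of how many nodes match."""
    return 2 * t * n


# For 100 nodes sharing 5 keywords: 49,500 edges vs. 1,000 links.
edges = materialized_edge_count(100, 5)
links = link_count(100, 5)
```

The edge count grows quadratically in the number of nodes; the link count grows only linearly.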

Here, we'll discuss how we changed the data model to change the complexity class of adding nodes while enabling faster traversal. The key change is to store outbound and inbound links instead of materializing edges.

In the keyword example, this lets us persist five outbound and five inbound links per node instead of 5 * (n - 1) edges, ensuring no performance degradation as new nodes are added. It requires changing the traversal so that edges are discovered at query time rather than at load time. As we'll see, there are ways to exploit this for faster traversal.

Content-centric knowledge graphs

A content-centric knowledge graph is a knowledge graph whose nodes represent content, such as text passages, images, and tables. These are particularly well suited to capturing multimodal information and are easier to build than finer-grained, entity-centric knowledge graphs. The relationships between content (links between passages, images cited by passages) enable more complete large language model (LLM) context.

Using content as nodes fits the GenAI ecosystem well: a LangChain Document becomes a node. We've discussed a number of ways to introduce edges between these nodes, such as explicit links in HTML, common keywords, and cross-references. The ability to extract links between documents is critical to building these knowledge graphs. Depending on how densely the nodes are linked, the original implementation, which materialized edges, could run into scaling issues.

Links and edges

To make content-centric knowledge graphs easier to scale, we want to describe edges using nothing beyond each document's own metadata. Rather than describing an edge directly (which is impossible from a single document's metadata, since an edge involves two documents), we use the concept of a "link." Each document defines zero or more links that connect to matching links on other documents.

A node with an outbound link has an edge to each node with a matching inbound link.

In the example below, we see three nodes. All three are linked together by the common keyword "foo". Node 2 is the only one containing the keyword "bar", so no edge is created for that label. Node 1 also has an outgoing link whose label matches an incoming link on node 3, so those two nodes are connected by a directed edge.
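The link-matching rule can be sketched in a few lines of Python (a toy illustration, not the library's data structures; the "baz" label is a hypothetical stand-in for the directed link between nodes 1 and 3):

```python
from collections import defaultdict

# Each node declares (direction, label) links extracted from its content.
links = {
    "node1": [("out", "foo"), ("in", "foo"), ("out", "baz")],
    "node2": [("out", "foo"), ("in", "foo"), ("out", "bar")],
    "node3": [("out", "foo"), ("in", "foo"), ("in", "baz")],
}


def derive_edges(links):
    """An edge exists from every node with an outbound label to every
    other node with a matching inbound label."""
    sources, targets = defaultdict(set), defaultdict(set)
    for node, node_links in links.items():
        for direction, label in node_links:
            (sources if direction == "out" else targets)[label].add(node)
    return {
        (src, dst, label)
        for label, srcs in sources.items()
        for src in srcs
        for dst in targets[label]
        if src != dst
    }
```

Here "foo" produces edges between every ordered pair of nodes, "bar" produces none (no matching inbound link), and "baz" produces a single directed edge from node 1 to node 3.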


Similar to how a hypergraph can be represented as a bipartite graph, the scheme above can be visualized as a graph in which edges between nodes pass through a different kind of node representing a label. An outgoing link becomes an edge from the node to the label, and an incoming link becomes an edge from the label to the node. The edges between nodes in the original graph correspond to paths of length 2 through label nodes in the bipartite graph.


Problem: Common keywords and highly connected graphs

Keywords are a double-edged sword. They can link nodes that share keywords, letting retrieval pull in nodes that expand on a given topic. But when keywords overlap too heavily, the result quickly degenerates into a fully connected graph with edges between every pair of nodes. Techniques such as TF-IDF (term frequency-inverse document frequency) can help by selecting keywords that are more distinctive across documents, but ideally the knowledge graph's performance should not degrade even when connectivity is high.
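One simple mitigation in the spirit of TF-IDF is a document-frequency cutoff: drop keywords that appear in too many documents, since those are exactly the ones that create near-complete subgraphs. A sketch of the idea (not the library's API; names are ours):

```python
from collections import Counter


def idf_filter(doc_keywords, max_doc_fraction=0.5):
    """Drop keywords appearing in more than `max_doc_fraction` of the
    documents; such keywords link almost every pair of nodes."""
    n_docs = len(doc_keywords)
    doc_freq = Counter(kw for kws in doc_keywords for kw in set(kws))
    return [
        [kw for kw in kws if doc_freq[kw] / n_docs <= max_doc_fraction]
        for kws in doc_keywords
    ]


docs = [
    ["travel", "india", "monsoon"],
    ["travel", "india", "taj mahal"],
    ["travel", "india", "cuisine"],
    ["travel", "hotels", "budget"],
]
# "travel" (4/4 docs) and "india" (3/4 docs) exceed the cutoff and are dropped.
filtered = idf_filter(docs)
```

This trims the densest labels, but the data-model change described next makes the graph scale even without such filtering.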

Solution: Avoid materializing edges

Instead of linking nodes by explicitly materializing edges when nodes are added, we store the links on each node and discover the connections with queries as we traverse the graph.

|          | Materialized edges | Query-based edges |
| -------- | ------------------ | ----------------- |
| Load     | Query and write all edges: O(t * n^2) writes for n fully connected nodes with t labels (O(n) if there are no connections). | Write each node's incoming and outgoing links: O(t * n) writes for n nodes, connected or not. |
| Traverse | One query per source node: O(n) queries to find all nodes reachable from n source nodes. | One query per outbound label: O(T) queries (where T is the number of unique labels) to find all reachable nodes. |

Load

The main change is that instead of storing edges keyed by the source node, we store incoming links keyed by the label. This lets us request all targets of edges leaving a given node by looking up each of its outbound labels.
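A minimal in-memory sketch of this storage layout (hypothetical names; the real implementation stores these as rows in Cassandra):

```python
from collections import defaultdict

# "Which nodes accept an incoming edge with this label?" -- all a
# traversal needs to answer at query time.
targets_by_label = defaultdict(set)
links_by_node = {}


def add_node(node_id, out_labels, in_labels):
    """Adding a node writes O(t) rows for its t labels, independent of
    how many existing nodes it would connect to."""
    links_by_node[node_id] = {"out": set(out_labels), "in": set(in_labels)}
    for label in in_labels:
        targets_by_label[label].add(node_id)


def targets_of(node_id):
    """Edge targets are discovered at query time: one lookup per
    outbound label instead of one stored row per edge."""
    return {
        target
        for label in links_by_node[node_id]["out"]
        for target in targets_by_label[label]
        if target != node_id
    }


add_node("a", out_labels=["foo"], in_labels=["foo"])
add_node("b", out_labels=["foo"], in_labels=["foo"])
add_node("c", out_labels=[], in_labels=["foo"])
```

Loading stays O(t) per node no matter how connected the graph is; the join against other nodes is deferred to `targets_of`.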

Traverse

Moving the join to query time means that finding the targets of a given source node requires a separate query for each outbound edge label. However, this doesn't hurt performance, for several reasons:

  1. The implementation can query the linked nodes for each label in parallel. A highly scalable database such as DataStax Astra DB / Apache Cassandra makes issuing these queries concurrently practical.
  2. We denormalize the target's text embedding, so each query can retrieve the target nodes most similar to the query. This lets us limit each query to the best candidates for a given outbound edge label rather than fetching every node.
  3. We can remember which outgoing labels we have already handled. For highly connected (common) labels, once we retrieve the set of nodes with an incoming link for a given keyword, we never need to do so again: even if other edges point at those nodes, the result is the same (we reach the same targets). This lets the traversal cut off earlier than one that must follow every edge. Essentially, we exploit the bipartite view and remember both the content nodes and the label nodes that have been visited.
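Point 3 can be sketched as a breadth-first traversal that remembers visited labels (a simplified, in-memory illustration with toy data; the real implementation issues concurrent database queries):

```python
# Toy data: node -> outbound labels, label -> nodes with that inbound label.
out_labels = {"a": {"foo"}, "b": {"foo", "bar"}, "c": set(), "d": set()}
targets = {"foo": {"b", "c"}, "bar": {"d"}}


def traverse(start, out_labels, targets, depth=2):
    """Breadth-first traversal that visits each *label* at most once:
    after a label's target set has been fetched, other edges with the
    same label cannot reach anything new."""
    visited_nodes, visited_labels = set(start), set()
    frontier = set(start)
    for _ in range(depth):
        next_frontier = set()
        for node in frontier:
            for label in out_labels[node] - visited_labels:
                visited_labels.add(label)  # never query this label again
                next_frontier |= targets[label] - visited_nodes
        visited_nodes |= next_frontier
        frontier = next_frontier
    return visited_nodes
```

Starting from node "a", the label "foo" is queried once even though node "b" also carries it outbound; only the new label "bar" triggers a second lookup.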

In fact, as the benchmarks below show, traversing labels this way is actually faster than traversing edges. Building on a general-purpose database for interconnected content lets us optimize the schema and queries for retrieval. Here, it means each label connecting nodes is considered only once per traversal (revisiting it cannot change the set of reached nodes), whereas a traditional graph must consider every edge between nodes.

Use case: Keyword links from PDFs

To demonstrate keyword links, we show how to load a PDF, split it into chunks, and use KeyBERT to extract keywords for each chunk.

from langchain_openai import OpenAIEmbeddings
import cassio
from langchain_community.graph_vectorstores.cassandra import (
    CassandraGraphVectorStore,
)
from langchain_community.graph_vectorstores.extractors import (
    LinkExtractorTransformer,
    KeybertLinkExtractor,
)
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize AstraDB / Cassandra connections.
cassio.init(auto=True)

# Create a GraphVectorStore, combining Vector nodes and Graph edges.
knowledge_store = CassandraGraphVectorStore(OpenAIEmbeddings())

# Load and split content as normal.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=64,
    length_function=len,
    is_separator_regex=False,
)
loader = PyPDFLoader("Tourbook.pdf")
pages = loader.load_and_split(text_splitter)

# Create a document transformer extracting keywords.
transformer = LinkExtractorTransformer([
    KeybertLinkExtractor(extract_keywords_kwargs={
      "stop_words": "english",
    }),
])

# Apply the document transformer.
pages = transformer.transform_documents(pages)

# Store the pages in the knowledge store.
knowledge_store.add_documents(pages)

Benchmark results

The PDF, a document about travel in India, was split into 136 chunks and loaded six times into a content-centric knowledge graph; each load adds 136 new chunks. With the old approach (materialized edges), the time to load each batch is high from the start and grows roughly linearly: each new document must be linked to all previous documents, whose number grows over time. As a result, the total time to load all documents grows quadratically. With the new approach (query-time edges), the load time is constant and much lower, because no edges are written.


We also found that in both cases, the time to traverse the graph and retrieve documents was about the same, and relatively constant.


Try it!

Content-centric knowledge graphs are just as fast and easy to populate as vector stores. By storing information about the content each Document links to, you can efficiently store and scale graphs with arbitrarily high connectivity. Traversal guided by maximal marginal relevance (MMR) retrieves the most relevant and diverse information for a query, along with the related information it links to.

To experience the benefits of these advancements, try the latest improvements in langchain-core 0.2.23 and langchain-community 0.2.10. We invite you to use LangChain to integrate a content-centric knowledge graph into your project and to explore further enhancements for connecting and retrieving content. Share your feedback and join us in improving these tools as we push the boundaries of what knowledge graphs can do in AI.