To predict the fragment spectra of intact glycopeptides, Zhejiang University developed DeepGlyco, a deep learning method

author:ScienceAI
Edit | Radish peel

Deep learning has achieved remarkable success in the field of mass spectrometry-based proteomics and is currently making its mark in the field of glycoproteomics. While various deep learning models can predict fragment mass spectra of peptides with great accuracy, they cannot cope with the nonlinear glycan structure in intact glycopeptides.

The Zhejiang University team proposed DeepGlyco, a deep learning-based method for predicting fragment spectra of intact glycopeptides. The model uses a tree-structured long short-term memory (LSTM) network to process the glycan moiety, and a graph neural network architecture to incorporate the potential fragmentation pathways of a specific glycan structure.

This design supports the model's interpretability and its ability to distinguish glycan structural isomers. The researchers further demonstrated that predicted spectral libraries can be used for data-independent acquisition (DIA) glycoproteomics, where they help make up for incomplete library coverage.

The study, titled "Prediction of glycopeptide fragment mass spectra by deep learning", was published in Nature Communications on March 19, 2024.

Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is the most widely used method in proteomics and glycoproteomics. At the heart of proteomics data analysis is the identification of peptides by matching fragment spectra to theoretical or experimental spectra of candidate peptides.

The most commonly used proteomics and glycoproteomics search engines are based on database searching, in which peptide-spectrum matches (PSMs) and glycopeptide-spectrum matches (GPSMs) are scored against fragment ions theoretically generated from peptide sequences and glycans, while fragment ion intensities are largely ignored.

Spectral library searching instead compares the intensity pattern of the analyte's fragment ions with library spectra, which yields a more discriminating match score. Spectral libraries are also commonly used to analyze data-independent acquisition (DIA) experiments. However, incomplete library coverage sets the upper limit on what a library search can identify.
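To make the idea of intensity-aware matching concrete, here is a minimal sketch that scores an observed spectrum against library entries by cosine similarity. Cosine similarity is only one common choice of spectral similarity and is not necessarily the score used in the paper; the ion labels, intensities, and candidate names are hypothetical.

```python
import math

def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra given as {ion_label: intensity}.

    Ions missing from one spectrum count as zero intensity there.
    """
    ions = set(spec_a) | set(spec_b)
    dot = sum(spec_a.get(ion, 0.0) * spec_b.get(ion, 0.0) for ion in ions)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical observed spectrum and two hypothetical library entries.
observed = {"y3": 0.40, "y4": 0.15, "Y1": 1.00, "Y2": 0.55}
library = {
    "candidate_A": {"y3": 0.35, "y4": 0.20, "Y1": 1.00, "Y2": 0.60},
    "candidate_B": {"y3": 0.90, "y4": 0.80, "Y1": 0.10, "Y2": 0.05},
}

# Rank library entries from most to least similar to the observed spectrum.
ranked = sorted(library, key=lambda name: cosine_similarity(observed, library[name]), reverse=True)
for name in ranked:
    print(name, round(cosine_similarity(observed, library[name]), 3))
```

Both candidates contain the same four ions, so a score based only on the presence of theoretical fragments could not separate them; the intensity pattern does.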

Over the years, machine learning, and deep learning in particular, has become increasingly common in proteomics. Scientists use deep neural networks to predict peptide properties and behaviors throughout MS-based proteomics workflows, including detectability in relation to protease digestibility, retention time in LC, collision cross-sections in ion mobility spectrometry, and fragment ion intensities in MS/MS.

Most existing peptide property prediction tools use long short-term memory (LSTM), gated recurrent unit, or transformer-based models. These models can only handle the linear input of peptide sequences (simple post-translational modifications, PTMs, are treated as indivisible tokens) and cannot handle glycan structures.

In addition, intact glycopeptides fragment differently in MS/MS than non-glycosylated peptides. Higher-energy collisional dissociation (HCD) with stepped collision energy (CE) is the most common fragmentation strategy for N-glycopeptides; it fragments both the glycan and the peptide backbone. The resulting composite spectra contain not only peptide fragments (b/y ions) but also glycan fragments (B/Y ions), which existing peptide fragment spectrum prediction models do not cover.

In the latest study, the Zhejiang University team proposed a deep learning-based framework called DeepGlyco for predicting MS/MS spectra of intact glycopeptides. The input peptide sequence is processed by a conventional LSTM network, while the glycan structure is processed by a tree LSTM network. The putative fragmentation pathways of a specific glycan structure are modeled by a graph neural network with an attention mechanism, which makes it possible to interpret the likely origin of predicted fragment ions. This feature helps differentiate glycan structural isomers. The investigators further demonstrated that the predicted library is also suitable for analyzing glycopeptide DIA data, where it helps make up for incomplete library coverage.
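As a rough illustration of why glycan fragments do not fit a linear sequence model, the sketch below enumerates the glycan compositions that can remain attached to the peptide (Y-type fragments) for a small branched glycan. It is a simplification under stated assumptions: the glycan is a tree of (name, children) tuples rooted at the reducing-end HexNAc, only glycosidic cleavages are considered, and the reported value is the glycan mass added to the peptide, not an m/z. The example structure and function names are hypothetical; the residue masses are standard monoisotopic values.

```python
from itertools import product

# Monoisotopic residue masses (Da) of common monosaccharides.
RESIDUE_MASS = {"HexNAc": 203.0794, "Hex": 162.0528, "Fuc": 146.0579}

def rooted_subtrees(node):
    """Compositions of all connected subtrees that keep the root of `node`.

    A node is (name, [children]); a composition is a sorted tuple of names.
    """
    name, children = node
    # Each child branch is either lost entirely (empty tuple) or partly retained.
    options = [{()} | rooted_subtrees(child) for child in children]
    result = set()
    for combo in product(*options):
        parts = [name]
        for part in combo:
            parts.extend(part)
        result.add(tuple(sorted(parts)))
    return result

def glycan_mass(composition):
    """Summed residue mass of a composition, rounded to 4 decimals."""
    return round(sum(RESIDUE_MASS[m] for m in composition), 4)

# Hypothetical example: core-fucosylated trimannosyl N-glycan core.
core = ("HexNAc", [
    ("Fuc", []),
    ("HexNAc", [("Hex", [("Hex", []), ("Hex", [])])]),
])

# List retained-glycan compositions from lightest to heaviest.
for comp in sorted(rooted_subtrees(core), key=glycan_mass):
    print(glycan_mass(comp), "+".join(comp))
```

Moving the fucose from the core to an antenna would keep the overall composition but change which partial compositions are reachable, which is one reason structure-aware input matters.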

Figure: Overview of the deep learning model for glycopeptide fragment spectrum prediction. (Source: Paper)

The main difference between this method and other peptide MS/MS prediction methods is its ability to process nonlinear glycan structures through a tree LSTM network. The individual modules extract features from the peptide and glycan moieties separately, then share information about the glycopeptide as a whole through feature fusion. Multi-task learning is used to predict the entire glycopeptide spectrum, with peptide and glycan fragments handled as separate tasks to accommodate the wide range of peak intensities across fragment types.
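For readers unfamiliar with tree LSTMs, the following is a minimal NumPy sketch of the child-sum Tree-LSTM cell, one standard variant. The article does not say which variant DeepGlyco uses, so the cell equations, the one-hot monosaccharide encoding, the hidden size, and the random untrained weights are all assumptions made for illustration.

```python
import numpy as np

MONOSACCHARIDES = ["HexNAc", "Hex", "Fuc", "NeuAc"]  # hypothetical vocabulary

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ChildSumTreeLSTM:
    """Child-sum Tree-LSTM cell with random, untrained weights."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden_dim = hidden_dim
        # One (W, U, b) triple per gate: input i, forget f, output o, candidate u.
        self.params = {
            gate: (rng.normal(0.0, 0.1, (hidden_dim, input_dim)),
                   rng.normal(0.0, 0.1, (hidden_dim, hidden_dim)),
                   np.zeros(hidden_dim))
            for gate in "ifou"
        }

    def _pre(self, gate, x, h):
        W, U, b = self.params[gate]
        return W @ x + U @ h + b

    def encode(self, node):
        """Return (h, c) for the subtree rooted at node = (name, [children])."""
        name, children = node
        x = np.zeros(len(MONOSACCHARIDES))
        x[MONOSACCHARIDES.index(name)] = 1.0
        # Children are encoded first, so information flows from leaves to root.
        child_states = [self.encode(child) for child in children]
        h_sum = sum((h for h, _ in child_states), np.zeros(self.hidden_dim))
        i = sigmoid(self._pre("i", x, h_sum))
        o = sigmoid(self._pre("o", x, h_sum))
        u = np.tanh(self._pre("u", x, h_sum))
        c = i * u
        # Each child gets its own forget gate, computed from that child's state.
        for h_k, c_k in child_states:
            c = c + sigmoid(self._pre("f", x, h_k)) * c_k
        return o * np.tanh(c), c

cell = ChildSumTreeLSTM(input_dim=len(MONOSACCHARIDES), hidden_dim=8)

# Two hypothetical isomers with the same composition but different topology.
core_fuc = ("HexNAc", [("Fuc", []), ("HexNAc", [("Hex", [])])])
antenna_fuc = ("HexNAc", [("HexNAc", [("Hex", [("Fuc", [])])])])

h_core, _ = cell.encode(core_fuc)
h_antenna, _ = cell.encode(antenna_fuc)
print(np.round(h_core - h_antenna, 4))
```

Because the root representation depends on how the residues are connected, two isomers with identical composition generally map to different vectors, which a linear sequence encoder cannot express.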

The method achieves high prediction accuracy using a model trained on data from the same organisms and instrument settings. Changes in organisms and instrument settings may result in a loss of predictive performance. Due to the difficulty of accessing large-scale glycopeptide MS/MS datasets compared to traditional proteomic datasets, the generalization ability of the model is still limited by the size of the training data.

Figure: Performance of glycopeptide fragment spectrum prediction. (Source: Paper)

The investigators believe that adding encoders for spectral metadata, such as instrument type and collision energy, may help spectral prediction models generalize to other glycoproteomic datasets from independent laboratories.

Another distinguishing feature of this deep learning model is that its predictions can be explained by the attention weights calculated in the model. The attention weights were found to reflect the importance of each possible cleavage in the putative fragmentation pathways of a specific glycan structure. This sheds light on how the model learns glycopeptide MS/MS fragmentation.
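The sketch below shows, in generic form, how attention weights can be read as relative importance: scaled dot-product scores between a query vector and one key vector per candidate cleavage are passed through a softmax, so the weights are positive and sum to 1. This is the textbook attention formulation, not the paper's exact layer; the vectors and cleavage labels are hypothetical.

```python
import math

def attention_weights(query, keys):
    """Softmax over scaled dot-product scores between a query and each key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical query (a fragment being predicted) and one key per candidate cleavage.
query = [1.0, 0.0]
cleavages = {
    "lose Fuc": [2.0, 0.0],
    "lose antenna Hex": [0.0, 2.0],
    "lose core HexNAc": [-1.0, 0.0],
}

weights = attention_weights(query, list(cleavages.values()))
for label, weight in zip(cleavages, weights):
    print(f"{label}: {weight:.3f}")
```

Inspecting which candidate receives the largest weight is the kind of readout that lets a predicted peak be traced back to a likely cleavage.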

This feature allows for the differentiation of glycan structural isomers by modeling the changes in peak intensity originating from different cleavage pathways. The team demonstrated that the predicted spectra can be used for spectral library searches, which can rank potential glycan structures based on a given glycopeptide composition and filter out unlikely candidates.

Figure: Differentiation of structurally isomeric glycopeptides using a predicted spectral library. (Source: Paper)

Although library search alone cannot yet identify glycan structures with full accuracy, it can partially distinguish glycan structural isomers, for example those differing in core fucosylation. Unlike methods that rely on confirming the presence of signature ions, library searches account for intensity patterns across the whole spectrum, an approach that has been shown to be effective in peptide identification and phosphorylation site localization.

Through spectral prediction, the team overcame a key limitation of spectral library searching, namely that libraries cover the glycan structure space incompletely, and demonstrated its potential to validate or supplement glycopeptide structural identifications made by other methods. The researchers further hypothesized that spectral prediction could improve the scoring of glycopeptide database searches and de novo sequencing.

The results of the paper also suggest that predicted spectral libraries can be used to analyze DIA data for glycopeptides. Predicted libraries can not only correct low-quality spectra in sample-specific experimental spectral libraries while keeping the same glycopeptide space, but also expand glycoproteome coverage and improve library completeness.

Figure: Performance of a predicted spectral library for DIA analysis. (Source: Paper)

Current glycopeptide-centric DIA data analysis methods cannot cope with a large query space in which a large proportion of the target glycopeptides are false targets that are not detectable in the sample. This limitation is not specific to glycoproteomics; it stems from the statistical control strategy inherited from traditional proteomic DIA analysis.

It is therefore impractical to use a predicted glycopeptide library generated from the organism-wide proteome and glycome space. Instead, an initial list of glycopeptides of interest is still needed to define the search space. The investigators anticipate that this issue will be addressed by major advances in glycoproteomics DIA data analysis, such as deep learning-based scoring of the kind that makes proteome-scale predicted libraries usable in traditional proteomics.

The team expects this work to provide a valuable deep learning resource for the glycoproteomics community and to find other applications in users' own bioinformatics workflows. Although the method is demonstrated in the context of N-glycoproteomics, the general architecture of the deep learning model can also be applied to the spectral prediction of O-glycopeptides.

The researchers envision that future extensions of the model architecture will support other fragmentation techniques and fragment ion types, such as electron-transfer dissociation, as well as analytes carrying multiple glycans per glycopeptide.

Paper link: https://www.nature.com/articles/s41467-024-46771-1
