
Nat. Genet. | Deep protein language models predict disease variant effects at the genome scale

Author | Seishin Treasure Book

Compile | Zeng Quanchen

Review | Wang

Today, we introduce a paper from the teams of Chun Jimmie Ye and Vasilis Ntranos on the application of protein language models. Predicting the effects of coding variants is a major challenge. Although recent deep learning models have improved the accuracy of variant effect prediction, they cannot analyze all coding variants because they rely on close homologs or are limited by their software implementations. Here, the authors developed a workflow that uses ESM1b, a 650-million-parameter protein language model, to predict the effects of all approximately 450 million possible missense variants in the human genome. ESM1b outperformed existing methods both in classifying approximately 150,000 ClinVar/HGMD missense variants as pathogenic or benign and in predicting measurements from 28 deep mutational scan datasets.


Predicting the phenotypic consequences of genetic variation, known as variant effect prediction (VEP), is a key challenge in human genetics. Coding variants that alter a protein's amino acid sequence are of particular interest because they are enriched for disease associations, are easier to interpret mechanistically, and are more amenable to therapeutic intervention. Most naturally occurring coding variants are missense mutations that replace one amino acid with another. Despite advances in functional genomics and genetic studies, distinguishing harmful variants that disrupt protein function from neutral ones remains difficult. In addition, most human genes undergo alternative splicing, and the same variant may be damaging in some protein isoforms yet neutral in others, depending on how it interacts with the rest of the protein. As a result, most missense variants remain variants of uncertain significance (VUS), limiting the utility of exome sequencing in clinical diagnosis. VEP is even more challenging for coding variants that affect multiple amino acid residues, such as in-frame indels.

Experimental VEP methods, such as deep mutational scanning (DMS) and Perturb-seq, can measure molecular and cellular phenotypes for thousands of variants simultaneously. However, these endophenotypes are imperfect proxies for clinically relevant phenotypes and are difficult to scale genome-wide. In contrast, computational methods that learn the biophysical properties or evolutionary constraints of proteins can in principle cover all coding variants. Although most computational methods are trained on variants labeled as pathogenic or benign, unsupervised homology-based methods can predict variant effects directly from multiple sequence alignments (MSAs) without training on labeled data. Recently, an unsupervised deep learning method called EVE, based on a generative variational autoencoder, was shown to outperform supervised methods. However, because they depend on MSAs, homology-based methods provide predictions only for the subset of proteins and residues that are well aligned. Moreover, because different isoforms of the same gene share the same homologs, it is unclear whether such methods can distinguish the effects of a variant on different isoforms.


Figure 1

Another deep learning approach to VEP uses protein language models, a technique borrowed from natural language processing. These are deep neural networks trained on large protein datasets such as UniProt to model the space of known protein sequences sampled over evolution (Figure 1a). Notably, protein language models require no explicit homology information and can estimate the likelihood of any possible amino acid sequence. They have been shown to implicitly learn how protein sequences determine many aspects of protein structure and function, including secondary structure, long-range residue interactions, post-translational modifications, and binding sites. One of the largest protein language models is ESM1b, a publicly available 650-million-parameter model trained on approximately 250 million protein sequences. It has been shown to predict variant effects that correlate with DMS measurements without any further training.

However, the use of ESM1b is subject to several limitations. First, the model's input is limited to 1,022 amino acids, excluding approximately 12% of human protein isoforms. Second, although it was evaluated on DMS data for 32 genes, 10 of them human, it was unclear how well the model predicts the clinical effects of coding variants on a genome-wide scale. Finally, using ESM1b requires software engineering skills, deep learning expertise, and high-memory GPUs, which together create technical barriers to widespread use. Here, the authors generalized ESM1b to protein sequences of arbitrary length and used it to predict the effects of all roughly 450 million possible missense variants across all 42,336 protein isoforms in the human genome. They evaluated it on three different benchmarks and compared it with 45 other VEP methods. (One way such a length limit can be worked around is sketched below.)
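
A minimal sketch of one way to extend a length-limited model to arbitrary protein lengths: split the sequence into overlapping windows, score each window, and average per-residue predictions where windows overlap. The window size, overlap, averaging rule, and the `score_window` callable below are illustrative assumptions, not necessarily the authors' exact tiling scheme.

```python
# Sketch: tiling a long protein into overlapping windows and averaging per-residue scores.
import numpy as np

MAX_LEN = 1022     # ESM1b input limit in residues
OVERLAP = 511      # assumed overlap between consecutive windows

def tile_windows(seq_len: int, max_len: int = MAX_LEN, overlap: int = OVERLAP):
    """Return (start, end) slices covering [0, seq_len) with the given overlap."""
    step = max_len - overlap
    starts = range(0, max(seq_len - overlap, 1), step)
    return [(s, min(s + max_len, seq_len)) for s in starts]

def score_long_sequence(sequence: str, score_window) -> np.ndarray:
    """score_window(subseq) must return a (len(subseq), 20) array of per-residue LLRs."""
    totals = np.zeros((len(sequence), 20))
    counts = np.zeros((len(sequence), 1))
    for start, end in tile_windows(len(sequence)):
        totals[start:end] += score_window(sequence[start:end])
        counts[start:end] += 1
    # Average the predictions of every window that covers each residue.
    return totals / counts
```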

Predicting the effects of all possible missense variants in the human genome

The authors developed an improved ESM1b workflow and applied it to obtain a complete catalog of approximately 450 million missense variant effects across all 42,336 known human protein isoforms. The effect score for each variant is the log-likelihood ratio (LLR) between the variant residue and the wild-type (WT) residue (Figure 1b). Unlike current homology-based models (Figure 1c), which are available only for the subset of human proteins and residues with MSA coverage, ESM1b predicts the effect of every possible missense variant. Protein regions in which most possible mutations are predicted by ESM1b to be damaging often coincide with known protein domains (Figure 1d). As shown for SPAST, SLC7A3, and ARX, such domains may fall outside MSA coverage and thus be inaccessible to homology-based models (Figure 1d), yet they may carry disease-associated variants.
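
The LLR scoring idea can be sketched in a few lines of Python with the publicly released fair-esm package. This is a minimal illustration of a wild-type-marginal LLR, not the authors' exact pipeline (which additionally handles long sequences and all isoforms, and may use a masked-marginal scheme); the helper name `esm1b_llr` and the toy sequence are ours.

```python
# Sketch: scoring a single missense variant with ESM1b as an LLR (variant vs. wild type).
import torch
import esm

# Load the 650M-parameter ESM1b model and its alphabet (tokenizer).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

def esm1b_llr(sequence: str, position: int, wt: str, mut: str) -> float:
    """LLR = log P(mut at position) - log P(wt at position); position is 1-based."""
    assert sequence[position - 1] == wt, "wild-type residue mismatch"
    _, _, tokens = batch_converter([("protein", sequence)])
    with torch.no_grad():
        logits = model(tokens)["logits"]              # shape (1, L+2, vocab): BOS + residues + EOS
    # Token index equals the 1-based residue position because of the BOS token at index 0.
    log_probs = torch.log_softmax(logits[0, position], dim=-1)
    return (log_probs[alphabet.get_idx(mut)] - log_probs[alphabet.get_idx(wt)]).item()

# Illustrative call on a toy sequence: a negative LLR suggests a damaging substitution.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"
print(esm1b_llr(seq, position=10, wt="R", mut="W"))
```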

ESM1b outperforms other VEP methods in clinical and experimental benchmarks


Figure 2

To evaluate ESM1b's ability to predict the clinical impact of variants, the authors compared the model's effect scores between pathogenic and benign variants in two datasets. The first contains pathogenic and benign variants annotated in ClinVar; the second contains variants annotated as pathogenic in HGMD together with common variants from gnomAD (allele frequency greater than 1%) treated as benign. The distributions of ESM1b effect scores differ markedly between pathogenic and benign variants in both datasets (Figure 2a). Moreover, the pathogenic and benign distributions are consistent across the two datasets, indicating that the predictions are well calibrated. Using an LLR threshold of -7.5 to separate pathogenic from benign variants, the true-positive rates were 81% and 82% in the two datasets, respectively. Comparing ESM1b with EVE as a classifier of variant pathogenicity, ESM1b achieved a ROC-AUC of 0.905 in distinguishing 19,925 pathogenic from 16,612 benign ClinVar variants (spanning 2,765 genes), compared with 0.885 for EVE. On HGMD/gnomAD (covering 1,991 genes, with 27,754 pathogenic and 2,743 common variants), ESM1b achieved a ROC-AUC of 0.897 versus 0.882 for EVE (Figure 2b).
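
As a rough illustration of this kind of evaluation, the sketch below computes a ROC-AUC from LLR scores and clinical labels and applies the -7.5 threshold mentioned above; the arrays are placeholder values, not the paper's benchmark data.

```python
# Sketch: using negated LLRs as a pathogenicity score and evaluating with ROC-AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical per-variant LLRs and clinical labels (1 = pathogenic, 0 = benign).
llr = np.array([-12.3, -1.4, -9.8, -0.2, -7.9, -3.1])
label = np.array([1, 0, 1, 0, 1, 0])

# More negative LLR means more damaging, so negate scores before computing ROC-AUC.
print("ROC-AUC:", roc_auc_score(label, -llr))

# Binary classification with the LLR threshold of -7.5 used in the paper.
predicted_pathogenic = llr < -7.5
print("True-positive rate:", (predicted_pathogenic & (label == 1)).sum() / (label == 1).sum())
```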

Having confirmed ESM1b's high accuracy as a pathogenicity classifier, the authors sought to predict the effects of VUS in ClinVar. To this end, they modeled the ESM1b effect scores of VUS as a two-component Gaussian mixture (Figure 2c). The two fitted components agree well with the score distributions of annotated pathogenic and benign variants (Figure 2d). Based on this model, the authors estimate that about 58% of missense VUS in ClinVar are benign and about 42% are pathogenic. Beyond EVE, the authors compared ESM1b with 44 other VEP methods, including all functional prediction methods and conservation scores from the Database for Nonsynonymous SNPs' Functional Predictions (dbNSFP). For the clinical benchmark, they considered only methods that (1) were not trained on clinical databases such as ClinVar and HGMD and did not use features from methods trained on them, and (2) did not use allele frequency as a feature, since allele frequency is often used to label variants as benign. Of the 46 methods, 19 (including ESM1b and EVE) met these criteria for an unbiased comparison. On the set of variants scored by all 19 methods, ESM1b outperformed every other method on both ClinVar and HGMD/gnomAD (Figure 2e,f). Likewise, ESM1b outperformed each individual method on that method's own set of scored variants (Figure 2g,h). All pairwise comparisons were statistically significant, with P values below 0.001.
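
A minimal sketch of the mixture-model step described above, using scikit-learn's GaussianMixture on simulated scores (the real analysis fits the ClinVar VUS effect scores; the means, spreads, and counts below are placeholders):

```python
# Sketch: fitting a two-component Gaussian mixture to VUS effect scores to estimate
# the fractions of likely-benign and likely-pathogenic VUS.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated LLRs: a benign-like mode near -2 and a pathogenic-like mode near -12.
vus_llr = np.concatenate([rng.normal(-2, 1.5, 580), rng.normal(-12, 2.5, 420)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(vus_llr)
# The mixture weights estimate the proportion of VUS in each component.
benign_component = int(np.argmax(gmm.means_))   # component with the higher (less damaging) mean
print("Estimated benign fraction:", gmm.weights_[benign_component])
print("Estimated pathogenic fraction:", gmm.weights_[1 - benign_component])
```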


Figure 3

The authors further compared the 46 VEP methods on their ability to predict DMS experimental measurements. The full DMS benchmark comprised 28 experiments covering 15 human genes (166,132 experimental measurements on 76,133 variants). The authors compared 43 methods on a subset of 16,049 variants in 11 genes scored by all of these methods. ESM1b led with a mean Spearman correlation of 0.426 between effect scores and experimental measurements (Figure 3a), followed by DEOGEN2 (0.423), REVEL (0.419), and EVE (0.418). DEOGEN2 and REVEL are supervised methods, whereas EVE, like ESM1b, is unsupervised. Comparing ESM1b directly with EVE on the 64,580 variants (spanning 15 genes) with EVE scores shows a similar trend (Figure 3b). Likewise, ESM1b outperformed the other 45 methods on each method's own set of scored variants (Figure 3c), with 37 of the comparisons statistically significant. Two additional analyses further support the functional interpretation of ESM1b predictions. First, as shown in individual examples (Figure 1d), missense variants located within protein domains have more negative (more damaging) effect scores, whereas variants outside domains score similarly to benign variants (Figure 3d). Second, ESM1b effect scores agree well with allele frequencies, with common variants predicted to be less damaging (Figure 3e), consistent with the expectation that common variants are generally benign.
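
The DMS benchmark metric itself is simple to reproduce in outline; a minimal sketch with placeholder arrays (not the paper's data):

```python
# Sketch: Spearman correlation between predicted variant effect scores and DMS measurements.
import numpy as np
from scipy.stats import spearmanr

predicted_llr = np.array([-10.2, -0.5, -6.7, -1.1, -8.9])   # illustrative LLRs for five variants
dms_fitness   = np.array([0.10, 0.95, 0.35, 0.88, 0.20])    # illustrative measured functional scores

rho, pval = spearmanr(predicted_llr, dms_fitness)
print(f"Spearman rho = {rho:.3f} (p = {pval:.3g})")
# A strong positive correlation means lower (more damaging) LLRs track lower measured fitness.
```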

ESM1b can predict the effects of variants on different protein isoforms


Figure 4

As a protein language model, ESM1b evaluates each variant in the context of the input amino acid sequence, allowing the same variant to be assessed in the context of different protein isoforms. A variant may be damaging in some isoforms but not in others, for example because of interactions with alternatively spliced domains (Figure 4a). For instance, comparing ESM1b scores between the primary isoform of P53 and a shorter isoform, the authors found that 170 variants (mostly located near the splice junction) scored very differently (LLR difference > 4), including three ClinVar variants annotated as VUS (Figure 4b). Across ClinVar, 3,477 missense variants showed substantial differences in predicted effect between isoforms (LLR standard deviation > 2) (Figure 4c). Notably, the authors considered only reviewed, manually curated protein isoforms. Of these 3,477 variants, 148 (4%) are benign or likely benign, 437 (13%) are pathogenic or likely pathogenic, and 2,892 (83%) are VUS. Interestingly, when the most damaging isoform is considered, the effect score distribution of these VUS resembles that of pathogenic variants; when the least damaging isoform is considered, it resembles that of benign variants (Figure 4c). As with P53, many clinically important genes harbor ClinVar variants with high effect-score variance across isoforms, including BRCA1, IRF6, and TGFB3 (Figure 4d).
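
A minimal sketch of the isoform comparison, reusing the hypothetical `esm1b_llr` helper sketched earlier; the toy isoform sequences and positions are illustrative, and the LLR-difference cutoff of 4 comes from the text above:

```python
# Sketch: scoring the same substitution against two isoforms and flagging large LLR differences.
canonical = "MKTLWQERLAVNDPGSIKHFYRMTCE"   # toy sequence standing in for the canonical isoform
short_iso = "MKTLWQERLKHFYRMTCE"           # toy isoform with an internal segment "spliced out"

# After splicing, the same residue sits at a different position in each isoform.
llr_canonical = esm1b_llr(canonical, position=20, wt="F", mut="P")
llr_short     = esm1b_llr(short_iso, position=12, wt="F", mut="P")

# A large LLR difference suggests the variant's predicted impact is isoform-dependent.
if abs(llr_canonical - llr_short) > 4:
    print("Variant predicted to have isoform-specific impact")
```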

Conclusion

A comprehensive evaluation showed that ESM1b outperforms other state-of-the-art VEP methods both in distinguishing pathogenic from benign variants in ClinVar and HGMD/gnomAD and in predicting the effects measured by DMS experiments. As a protein language model that does not explicitly rely on homology, ESM1b offers several additional advantages for VEP. Being unsupervised, ESM1b carries no risk of information leaking from the training set into the test set of clinical or population genetics datasets, allowing accurate and unbiased evaluation. Its predictions are also simpler and faster to obtain than those of homology-based methods, because once the general model has been trained only a single input sequence is required. Notably, protein language models can provide predictions for every possible amino acid sequence and therefore apply to all coding variants. The study demonstrates this generality of ESM1b for (1) variants outside MSA coverage, (2) variants with different effects on different protein isoforms, (3) in-frame indels, and (4) stop-gain variants.

Resources

Brandes, N., Goldman, G., Wang, C.H. et al. Genome-wide prediction of disease variant effects with a deep protein language model. Nat. Genet. (2023). https://doi.org/10.1038/s41588-023-01465-0
