Yum, tasty mutations...

MutationTaster Documentation

Input

MutationTaster has three different analysis modes that can be selected in the left panel:

Chromosomal position

Input Description
SNV or InDel

MutationTaster predicts the variant effect on all protein-coding Ensembl transcripts and prints a table with summarised results for all of them and transcript-specific details below. The variant can be entered in various formats:

  • Chr2:232526664T>C (default)
  • 2:232.526.664T>C (dot as thousands separator)
  • chr2:g.232,526,664C>T (comma as thousands separator)
  • 2-232526664-C-T (gnomAD)

For InDels always start with the last reference base before the variant. E.g. 1:1048791CAG>C
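
The accepted notations can be normalised before submission with a small script. The following is a minimal sketch, not MutationTaster's own parser: the function name parse_variant and both regular expressions are assumptions made for illustration, and only the formats listed above are covered.

```python
import re

# Hypothetical helper, not part of MutationTaster: normalises the input
# formats listed above into (chromosome, position, ref, alt).
_COLON_FORMAT = re.compile(
    r"^(?:chr)?(?P<chrom>[0-9]{1,2}|X|Y|MT?):(?:g\.)?"
    r"(?P<pos>[0-9][0-9.,]*)(?P<ref>[ACGT]+)>(?P<alt>[ACGT]+)$",
    re.IGNORECASE,
)
_GNOMAD_FORMAT = re.compile(
    r"^(?:chr)?(?P<chrom>[0-9]{1,2}|X|Y|MT?)-(?P<pos>[0-9]+)-"
    r"(?P<ref>[ACGT]+)-(?P<alt>[ACGT]+)$",
    re.IGNORECASE,
)


def parse_variant(text):
    """Return (chromosome, position, ref, alt) for a supported notation."""
    text = text.strip()
    match = _COLON_FORMAT.match(text) or _GNOMAD_FORMAT.match(text)
    if match is None:
        raise ValueError(f"unrecognised variant format: {text!r}")
    # Drop dots or commas used as thousands separators.
    position = int(re.sub(r"[.,]", "", match["pos"]))
    return (match["chrom"].upper(), position,
            match["ref"].upper(), match["alt"].upper())


for example in ("Chr2:232526664T>C", "2:232.526.664T>C",
                "2-232526664-C-T", "1:1048791CAG>C"):
    print(example, "->", parse_variant(example))
```

For the InDel example, the reference allele CAG starts with the last unchanged base (C), so the tuple describes a deletion of AG.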

Specific transcript

Input Description
Gene symbol

You can choose a gene by entering one of the following:

  • HGNC symbol e.g. LEP (case insensitive)
  • Ensembl stable gene ID (e.g. ENSG00000174697)

After entering a valid gene, all protein-coding Ensembl transcripts of your gene will be displayed.

Transcript

You can also directly enter an Ensembl stable transcript ID (e.g. ENST00000308868). In this case, you do not need to enter a gene.

Reference

Choose Coding sequence (c.) if your position refers to the coding sequence of the selected transcript (position 1 is the A of the start ATG).

Choose Transcript (cDNA) if your position refers to the cDNA of the selected transcript (position 1 is the first base of the transcript, i.e. the first 5'UTR base).

Choose Gene (genomic sequence) if your position refers to the gene sequence (position 1 is the first base of the first transcript, including non-protein coding ones).
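
The coding-sequence and cDNA systems differ only by the length of the 5'UTR, so a position inside the coding sequence can be converted with simple arithmetic. This is a minimal sketch based on the definitions above; the function name and the example start ATG position (149) are hypothetical. Gene positions cannot be converted this way because they also depend on the intron lengths.

```python
def cds_to_cdna(cds_position, start_atg_cdna_position):
    """Convert a coding-sequence (c.) position to a cDNA position.

    Both systems are 1-based. CDS position 1 is the A of the start ATG,
    cDNA position 1 is the first base of the transcript.
    Only valid for positions inside the coding sequence (>= 1).
    """
    if cds_position < 1:
        raise ValueError("CDS position must be >= 1")
    return cds_position + start_atg_cdna_position - 1


# Hypothetical transcript whose start ATG begins at cDNA position 149.
print(cds_to_cdna(1204, 149))
```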

Variant by sequence snippet

Choose Variant by sequence snippet if you have a sequence including the variant, e.g. from Sanger sequencing. You can paste the sequence into this field, putting square brackets [ ] around the reference base(s) and the alternative allele (e.g. ACGGTT[A/G]CTCTAAGGA for a base exchange from A to G). Examples of the format are provided when hovering over the question mark (?) in the input field. All entries have to refer to the 5'-3' direction of the sequence.
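
To check a snippet before pasting it, the bracket notation can be split into its parts. This is a minimal sketch that covers only the substitution form shown above; the function name parse_snippet and the returned keys are assumptions, not part of MutationTaster.

```python
import re

# One [ref/alt] block surrounded by plain sequence, written 5'-3'.
_SNIPPET = re.compile(
    r"^(?P<left>[ACGT]*)\[(?P<ref>[ACGT]+)/(?P<alt>[ACGT]+)\](?P<right>[ACGT]*)$"
)


def parse_snippet(snippet):
    """Split a bracketed snippet into reference and variant sequences."""
    match = _SNIPPET.match(snippet.strip().upper())
    if match is None:
        raise ValueError("expected exactly one [ref/alt] block in the snippet")
    left, ref, alt, right = match.group("left", "ref", "alt", "right")
    return {
        "reference": left + ref + right,
        "variant": left + alt + right,
        "offset": len(left),  # 0-based index of the first affected base
        "ref": ref,
        "alt": alt,
    }


print(parse_snippet("ACGGTT[A/G]CTCTAAGGA"))
```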

Note that you must choose the reference system (i.e. CDS, cDNA, gene).

Variant by position / SNV

Choose Variant by position / SNV if your variant is an SNV.

Enter the position of the base exchange.

Note that you must choose the reference system (i.e. CDS, cDNA, gene).

Then fill in the new base. For a base exchange c.1204G>T you would enter a T as new base.

When changing the input, the sequence surrounding the variant will appear at the bottom of the screen. The reference base affected by the base exchange is highlighted in blue.

Variant by position / InDel

Choose Variant by position / InDel if you want to analyse an insertion, a deletion, or a combined insertion/deletion.

Enter the position of the variant:

  • The first line is the position of the last reference base upstream of the variant, i.e. of the base directly preceding the variant.
    Example: For the deletion of three nucleotides c.92_94delGAC, enter 91.
  • The second line is the position of the first reference base downstream of the variant, i.e. of the base directly following the variant.
    Example: For the deletion of three nucleotides c.92_94delGAC, enter 95.
  • The third line is for the inserted bases. Leave empty to indicate a deletion.
    Example: For an insertion of GAGA between nucleotides 51 and 52 of the coding region (c.51_52insGAGA), enter GAGA.

Note that you must choose the reference system (i.e. CDS, cDNA, gene).
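
The three input lines can be derived mechanically from a simple HGVS-style description. The sketch below is an illustration under stated assumptions: the function name indel_fields is hypothetical, and only plain deletions (c.92_94delGAC) and plain insertions (c.51_52insGAGA) are handled, not combined insertion/deletions.

```python
import re

_DELETION = re.compile(r"^c\.(\d+)(?:_(\d+))?del[ACGT]*$")
_INSERTION = re.compile(r"^c\.(\d+)_(\d+)ins([ACGT]+)$")


def indel_fields(hgvs):
    """Return (upstream position, downstream position, inserted bases)."""
    match = _DELETION.match(hgvs)
    if match:
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        # Flanking bases: one before the first and one after the last deleted base.
        return first - 1, last + 1, ""
    match = _INSERTION.match(hgvs)
    if match:
        # An insertion is already described by its two flanking bases.
        return int(match.group(1)), int(match.group(2)), match.group(3)
    raise ValueError(f"unsupported description: {hgvs!r}")


print(indel_fields("c.92_94delGAC"))
print(indel_fields("c.51_52insGAGA"))
```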

When changing the input, the sequence surrounding the variant will appear at the bottom of the screen. The reference bases affected by the InDel are highlighted in blue.

VCF file

Input Description
Project name

Optional. The project name will be displayed in the output and in email notifications. You must not use patient names!

Email address

Optional. When provided, you will receive a notification and a link to the results when the analysis is complete. Results will be kept on our server for three weeks unless you delete them earlier.

VCF file

Input files must be in VCF format and coordinates must refer to GRCh38 (or hg38). While MutationTaster has no size limit for VCF files, only variants within protein-coding genes will be analysed.
This version now supports multi-sample VCF processing. However, the MutationTaster pipeline will merge the samples for analysis and does not provide sample-specific results in the output.

Minimum coverage

You can exclude variants with a low coverage from the analysis by setting this field to a suitable value (e.g. 100 for exome sequencing or 15 for genome sequencing).
If you do not want to exclude poorly covered variants or if there is no DP data in your VCF, fill in 0.
The default is a minimum coverage of 10.
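
To estimate how many variants a given threshold would remove, the same kind of filter can be applied locally. This is a minimal sketch and not the pipeline's implementation: it assumes the depth is stored as DP in the INFO column (column 8) and that variants without DP fail any threshold above 0. The example line is made up.

```python
def info_depth(vcf_line):
    """Return the DP value from the INFO column, or None if it is absent."""
    fields = vcf_line.rstrip("\n").split("\t")
    for entry in fields[7].split(";"):
        key, _, value = entry.partition("=")
        if key == "DP" and value.isdigit():
            return int(value)
    return None


def passes_coverage(vcf_line, minimum=10):
    """Keep a variant if its depth reaches the minimum; 0 disables the filter."""
    if minimum == 0:
        return True
    depth = info_depth(vcf_line)
    # Assumption: a variant without DP cannot pass a threshold above 0.
    return depth is not None and depth >= minimum


# Made-up data line: CHROM POS ID REF ALT QUAL FILTER INFO
line = "2\t232526664\t.\tT\tC\t50\tPASS\tDP=14;AF=0.5"
print(passes_coverage(line), passes_coverage(line, minimum=15))
```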

Search for homozygous variants

Check Analyse homozygous variants only if you are only interested in homozygous variants - heterozygous variants will be neglected.

Filter against gnomAD

You can restrict your analysis to rare variants by filtering against gnomAD. If you wish to exclude variants found 10 or more times in homozygous state (e.g. for a rare recessive disorder) but include all heterozygous variants, you can leave everything as it is (default setting). You can also exclude variants found in individuals from gnomAD with any genotype (e.g. for dominant disorders). If you do not want to filter at all, uncheck all boxes.

Analyse custom regions

If you want to focus on certain genomic regions (e.g. from linkage analysis or homozygosity mapping), choose Analyse custom regions. A text field will open and you can enter your regions of interest in bed-format.
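
BED intervals are 0-based and half-open, whereas VCF positions are 1-based, which is easy to get wrong when preparing regions by hand. The sketch below shows the comparison under that convention; the function names are hypothetical, and it assumes that chromosome names in the BED text and in the VCF are written identically.

```python
def parse_bed(text):
    """Parse BED text into a list of (chromosome, start, end) tuples."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def in_regions(chrom, position, regions):
    """position is 1-based (as in VCF); BED start is 0-based, end is exclusive."""
    return any(c == chrom and start < position <= end
               for c, start, end in regions)


regions = parse_bed("chr2\t232526000\t232527000\n")
print(in_regions("chr2", 232526664, regions))
```

With this convention, the BED line above covers the 1-based positions 232526001 to 232527000.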

You can also exclude regions with the Exclude custom regions option.

Analyse only exons

If you do not want to include intronic variants, choose Only exons and ... bp flanking introns option. Near-exonic intronic variants can be included by setting the flanking introns value to a suitable number, e.g. to 50 bp to cover variants that might affect splicing.

Analyze only on chromosome

Choose Only variants on chromosome to restrict the analysis to your favorite chromosome.

Output

The different elements of the output are named and described below. The first table applies to the analysis of a chromosomal position or specific transcript. The second table and subsequent sections describe the output from the analysis of a VCF file.

Single variants: Chromosomal position/specific transcript

Output Description
Prediction MutationTaster predicts a variant as deleterious or benign. For more details about the classification process, please read the section about our Random Forest classifier.
Summary List of the most prominent features of the variant (e.g. 'at intron-exon boundary', 'spans start ATG', 'homozygous in gnomAD' etc.)
Variant The variant on "physical" i.e. chromosomal level, in HGVS notation (e.g. chr7:91623937_91623938insGGCAAT).
Gene symbol The official HGNC gene symbol.
Gene constraints Transcript-specific gene constraints from gnomAD. This includes the LOEUF score (loss-of-function observed / expected upper bound fraction), as well as the pLOF, missense and synonymous o/e (observed / expected) scores. For a more detailed explanation on the scores refer to the gnomAD documentation.
Ensembl transcript ID Ensembl [1] stable transcript ID, starting with ENST.
Genbank ID(s) NCBI Reference Sequence Database (RefSeq) [2] Transcript ID starting with NM for mRNA transcripts. MANE-select transcripts are highlighted.
UniProt / AlphaMissense peptide Transcript-specific UniProt KB / SwissProt [3] accession ID and links to AlphaMissense[4]. Since AlphaFold structures are not available for each transcript, we provide both transcript and gene-specific links.
Variant type Is either a base exchange, a combination of insertion and deletion, an insertion, or a deletion.
Gene region Is either 5'UTR (untranslated region), CDS (coding sequence), 3'UTR, or intron.
DNA changes Variant on nucleotide level. The gDNA level (g.) position is always displayed, the cDNA level (cDNA.) only for variants located in exons, and the CDS level (c.) only for variants in the coding sequence.
AA changes All amino acid changes are shown here, displaying the original versus the new amino acid as well as the position of the substitution and the Grantham score. The Grantham Matrix [5] is a measure of the difference between the physico-chemical characteristics of two amino acids. Scores may range from 0.0 to 215. These scores are displayed for information only and do not influence MutationTaster's predictions because we use our Random Forest to score the deleteriousness of AA substitutions.
An asterisk (*) stands for a stop codon, a minus (-) indicates an insertion/deletion.
If the start ATG is lost, MutationTaster searches for a potential new, downstream start ATG and informs you about AA changes based on the assumed alternative AA sequence.
Frameshift Can be either yes or no.
Length of protein MutationTaster checks if the resulting protein will be elongated (prolonged), truncated, or whether nonsense-mediated mRNA decay (NMD) is likely to occur. MutationTaster determines the NMD border as last intron/exon junction minus 50 bp and analyses if a given premature termination codon occurs upstream of this border, thus leading to NMD. An elongated protein is referred to as prolonged, i.e. the original termination codon is destroyed and the translation stops later than normal. Truncated is grouped into slightly truncated (if less than 10% of the reference protein length are missing) or strongly truncated (if more than 10% of original protein length are missing). In the two latter cases, the additional information 'might cause NMD' is given, because the '-50 boundary rule' is not fulfilled, but it cannot be ruled out that NMD occurs nevertheless. If MutationTaster concludes that a variant causes NMD, this variant is automatically regarded as a deleterious mutation. The classifier is run nevertheless and the Tree vote value for the prediction is shown.
Pathogenic variant (ClinVar) Indicates if this variant is listed as a disease mutation in ClinVar. If a variant is marked as likely pathogenic or pathogenic in ClinVar, it is automatically predicted to be disease-causing, i.e. deleterious automatic (the Random Forest classifier is run nevertheless and the Tree vote value for the prediction is shown). For known disease mutations, we also display the disorders they cause and link to OMIM whenever possible.
Variant DBs Our database contains all variants from gnomAD [6]. For variants found in this database, MutationTaster provides a table with the following information: homozygous (-/-), heterozygous, allele carriers. Variants with 40+ homozygote individuals in gnomAD are automatically predicted as benign automatic (the Random Forest classifier is run nevertheless and the Tree vote for the prediction is shown).

MutationTaster provides the dbSNP (rs) ID and a link to dbSNP for variants listed in dbSNP. Please note that dbSNP IDs do not consider the alleles of a variant, only the position, so one dbSNP ID may include several variants.

Protein conservation For conservation analysis, homologous isoforms in ten other species (chimp, rhesus macaque, mouse, cat, chicken, clawed frog, pufferfish, zebrafish, fruitfly, and worm) are aligned with the corresponding isoform. Sequences are aligned with blastp [7], which is installed as a stand-alone executable on our server, and analysed by MutationTaster.
The status of evolutionary conservation is classified as all identical (i.e. the same amino acid in the human and the homologous amino acid sequence), (partly) conserved (i.e. similar amino acids in the human and the homologous amino acid sequence) or not conserved (i.e. different amino acids in the human and the homologous amino acid sequence). MutationTaster states when no homologous gene is known or no alignment could be made. Alignments are shown as snippets for each species, including the position of the analysed residue, the alignment and the status.
We deliberately restrict conservation analysis to ten animal species, although sequence data for far more species is available. The inclusion of further species did not have a considerable influence on prediction accuracy, but each additional alignment significantly decreased the speed of MutationTaster.
Protein features The program checks whether any protein features (from SwissProt) are directly or indirectly affected by the variant.
Lost means that the AA exchange invoked by the variant in question is located within the protein feature. A protein feature might get lost if a whole exon is skipped due to splice site changes, or if a protein is shortened because of a premature termination codon - in those cases, protein features are indirectly affected.
Phylogenetic conservation MutationTaster uses phastCons and phyloP to indicate the conservation on DNA level [8].
PhastCons values vary between 0 and 1 and reflect the probability that each nucleotide belongs to a conserved element, based on the multiple alignment of genome sequences of 100 different species (the closer the value is to 1, the more probable the nucleotide is conserved). It considers not just each individual base, but also its flanking bases. By contrast, phyloP separately measures conservation at individual bases, ignoring the effects of their neighbors. Moreover, phyloP can not only measure conservation (slower evolution than expected under neutral drift) but also acceleration (faster than expected). Sites predicted to be conserved are assigned positive scores, while sites predicted to be fast-evolving are assigned negative scores. The scores shown in MutationTaster2025 are based on the 100 vertebrate species alignment. However, there are multiple different versions available. For more information about phyloP and phastCons, please see the cited paper or the description on the UCSC website.
Splice sites MutationTaster uses a custom-built module based on MaxEntScan [9] to analyse possible changes in splice sites.
Splice site analysis is turned off by default for mitochondrial genes.
Kozak consensus sequence altered The Kozak consensus sequence (gccRccAUGG; R = purine) starts upstream of the start codon (AUG) and plays a major role in the initiation of translation. The purine (R) at position -3 as well as the G in position +4 are highly conserved. The program checks whether for a given variant a previously strong consensus sequence has been weakened.
Poly(A) signal MutationTaster uses a locally installed version of polyadq [10] for analysis of polyadenylation signals. More information is available at http://rulai.cshl.org/tools/polyadq/polyadq_form.html
AA sequence altered Can be either yes (AA exchange) or no (no AA exchange)
Chromosome The chromosome the variant is located on.
Strand Is either 1 for forward strand or -1 for reverse strand.
gDNA and cDNA sequence snippet The sequence surrounding the variant (20 bp up- and downstream). The altered bases are highlighted in red.
reference and mutated AA sequence Complete AA sequences, the asterisk (*) indicates STOP. The amino acid highlighted in bold and red in the Mutated Sequence represents the altered AA.
Position of stopcodon in wt / mu CDS Position of the last base of the stop codon (this can either be TGA, TAA or TAG), position 1 refers to the A in the start ATG codon.
Position (AA) of stopcodon in wt / mu AA sequence Position of the stop codon (asterisk, *) in the amino acid sequence, position 1 refers to the first amino acid of the protein.
Position of stopcodon in wt / mu cDNA Position of the last base pair of the stop codon (this can be either TGA, TAA or TAG), position 1 refers to the first base pair of the cDNA.
Position of start ATG in wt / mu cDNA Position of the A in the start ATG, position 1 refers to the first base of the cDNA. If the regular start ATG is changed by a variant, MutationTaster searches for the next downstream ATG and assumes this to be the new start ATG for the variant sequence.
Last intron/exon border The last base of the exon before the last exon.
Theoretical NMD border in CDS MutationTaster determines the NMD border as last intron/exon junction minus 50 bp and analyses if a given premature termination codon occurs upstream of this border, thus leading to NMD. If MutationTaster concludes that a variant causes NMD, this variant is automatically regarded as a deleterious mutation. The classifier is run nevertheless and the Tree vote value for the prediction is shown.
Length of CDS The length of the coding sequence from the A of the initiation codon (ATG) to the last base of the termination codon.
Coding sequence (CDS) position Position of the variant in the coding sequence.
cDNA position Last reference base upstream of the variant and first reference base downstream of the variant in coding DNA sequence context (positions relative to start of transcribed coding DNA reference sequence) e.g. 1203 / 1205, the altered base is at position 1204.
gDNA position Last reference base upstream of the variant and first reference base downstream of the variant in gene context (positions relative to start of genomic DNA reference sequence) e.g. 53,344 / 53,346, the altered base is at position 53,345.
Chromosomal position Last reference base upstream of the variant and first reference base downstream of the variant in chromosomal sequence context (position relative to start of chromosomal reference sequence) e.g. 154,372,337 / 154,372,339, the altered base is at position 154,372,338.
Speed This is the time MutationTaster needed for analysis & prediction - your browser might however need some extra time to display the results.
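
The protein length categories and the NMD rule described in the table above can be summarised in a few lines. This is only a sketch of the stated rules, not MutationTaster's code: the function name is hypothetical, all positions are assumed to be CDS coordinates of the last base of the stop codon, "upstream of the border" is read as strictly smaller, and a loss of exactly 10% is counted as strongly truncated.

```python
def classify_protein_length(wt_stop, mu_stop, last_junction):
    """Classify the effect of a variant on protein length.

    wt_stop / mu_stop: position of the last base of the stop codon in the
    reference and variant CDS; last_junction: last intron/exon border.
    """
    nmd_border = last_junction - 50  # the '-50 boundary rule'
    if mu_stop == wt_stop:
        return "normal"
    if mu_stop > wt_stop:
        return "prolonged"
    if mu_stop < nmd_border:
        return "NMD"
    missing = (wt_stop - mu_stop) / wt_stop
    if missing < 0.10:
        return "slightly truncated (might cause NMD)"
    return "strongly truncated (might cause NMD)"


# Hypothetical transcript: stop codon ends at 1500, last junction at 1300.
for mu_stop in (900, 1260, 1400, 1500, 1560):
    print(mu_stop, classify_protein_length(1500, mu_stop, 1300))
```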

VCF file

MutationTaster is tightly integrated with our disease mutation search engine MutationDistiller. With MutationDistiller, you can easily prioritise potentially disease-causing variants according to the biological role of the affected genes. On the results page, simply click Open MutationDistiller to view your variants in MutationDistiller.

In most cases, MutationTaster will not analyse each and every line of your VCF file, either because you have set certain filters, or because certain variants were not suitable for analysis with MutationTaster.

Statistics Description
Submitted

Number of variants (lines) in VCF file.

Analysed

Number of variants which were analysed with MutationTaster. These will normally be significantly more than the analysable variants, because for most variants, more than one (suitable) transcript will be found.

Analysis problems

The number of problems encountered during the analysis of your VCF.

Details on analysis run

A link to the log file showing the processing of your VCF by MutationTaster in detail.

Discarded variants

A downloadable file listing the variants that were skipped during analysis and why. Variants are ignored for analysis due to presence in gnomAD (applies only if any of the Filter polymorphisms options were set when the VCF file was uploaded).

Additionally, variants can be excluded from analysis because they are

  • extragenic and/or out of/distant from exon (applies only if option for Only exons is set)
  • out of chromosome (applies only if option for Only chromosome is set)
  • out of region (applies only if option for Analyse custom region is set) or
  • inside of region (applies only if option for Exclude custom region is set)

MutationTaster results are stored in our database and can be accessed online on our server. We will store your results only for three weeks. Afterwards, they will automatically be deleted. You can download your results as a zip-archive. The archive contains one TSV file with one variant per line and the following columns per variant:

Chr, Position, Gene, Prediction, Model, Tree vote, Type, AAE, dbSNP, Ref, Alt, Clinvar, Features, Region, Coverage, Hom / Het

We offer to filter out certain variants (e.g. those that were excluded due to presence in 1000G) and to sort the remaining variants according to user-specified criteria (see Sort). Once downloaded and stored on your own machine, you can still re-sort the TSV file with Microsoft Excel or similar spreadsheet programs.
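
A downloaded file can also be re-sorted with a short script instead of a spreadsheet program. The sketch below uses only the Python standard library; it assumes that the TSV starts with a header line using the column names listed above, and the two example rows are made up from variants mentioned in this documentation.

```python
import csv
import io

NUMERIC_COLUMNS = {"Position"}  # sorted as numbers rather than as text


def load_and_sort(handle, keys=("Gene", "Position")):
    """Read a tab-separated file and sort its rows by up to three columns."""
    rows = list(csv.DictReader(handle, delimiter="\t"))

    def sort_key(row):
        return tuple(int(row[k]) if k in NUMERIC_COLUMNS else row[k]
                     for k in keys)

    rows.sort(key=sort_key)
    return rows


# In practice, pass an open file instead of this in-memory example.
example = "Chr\tPosition\tGene\n2\t232526664\tCHRND\n1\t1048791\tAGRN\n"
for row in load_and_sort(io.StringIO(example)):
    print(row["Gene"], row["Chr"], row["Position"])
```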

The option to delete your data as soon as your download is completed will soon be added.

Option Description
Sort & group

The results stored in our database can be sorted by different criteria for either displaying and browsing them directly on our server or for exporting them.

Sort by these attributes; in one, two or three levels:

  • Chromosome (chromosome from 1 to Y)
  • Chromosome Reverse (chromosome from Y to 1)
  • Position (ascending)
  • Position Reverse (descending)
  • Gene (genesymbol from A to Z)
  • Gene Reverse (genesymbol from Z to A)
  • Prediction (prediction from benign to deleterious)
  • Prediction Reverse (prediction from deleterious to benign)
  • Model (model used by the classifier, without_aae-simple_aae-complex_aae-5utr-3utr)
  • Model Reverse (model used by the classifier, 3utr-5utr-complex_aae-simple_aae-without_aae)
  • dbSNP (rs-number ascending)
  • dbSNP Reverse (rs-number descending)
Filter

There are the following options to hide certain variants: All predicted polymorphisms, known polymorphisms (i.e. homozygous > 40 times in gnomAD or ExAC, or > 4 times in 1000G) and prediction problems. Selection of options is valid for both displaying results in the browser as well as downloading them as TSV. In addition, for displaying results only, the number of results per page can be limited, which generally reduces the loading time of each page.

Get the data

The results can either be displayed online in your browser (choose Display) or be downloaded as TSV (choose Download TSV). Filtering and sorting options are applied to both methods.

Please note: The VCF analysis pipeline will process the variants from the submitted VCF file in all suitable Ensembl transcripts. Some transcripts will not be included in the analysis, e.g. transcripts which either (a) have no or too many corresponding NCBI gene ID(s); or (b) are protein-coding but have no correct start codon (ATG) or stop codon (TGA, TAA, TAG). MutationTaster first tries to use protein-coding transcripts and, if there is at least one, it will not search for transcripts of other biotypes. Only if no protein-coding transcripts are available will it try to use transcripts of other biotypes (although certain biotypes are excluded from analysis on principle, e.g. nonsense_mediated_decay, ambiguous_orf, TR_pseudogene etc.).
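
The biotype preference described above can be sketched as follows. This is an illustration only: the function name and transcript IDs are hypothetical, the set of excluded biotypes contains just the three examples named in the text, and the checks for NCBI gene IDs and start/stop codons are left out.

```python
# Only the biotypes named as examples above; the real list is longer.
EXCLUDED_BIOTYPES = {"nonsense_mediated_decay", "ambiguous_orf", "TR_pseudogene"}


def select_transcripts(transcripts):
    """Pick the transcripts to analyse from (transcript_id, biotype) pairs."""
    coding = [t for t in transcripts if t[1] == "protein_coding"]
    if coding:
        # At least one protein-coding transcript: ignore all other biotypes.
        return coding
    return [t for t in transcripts if t[1] not in EXCLUDED_BIOTYPES]


# Hypothetical transcript IDs.
print(select_transcripts([("T1", "protein_coding"), ("T2", "lncRNA")]))
print(select_transcripts([("T3", "lncRNA"), ("T4", "nonsense_mediated_decay")]))
```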

Random Forest classifier

Tree vote

MutationTaster uses Random Forest models for predictions. Tree vote indicates how many decision trees of the Random Forest are suggestive of deleteriousness vs. how many are suggestive of a benign variant. The number before the vertical bar (|) always represents deleterious predictions and the number after the vertical bar benign predictions. E.g., tree vote 82|18 means that 82 decision trees in the Random Forest have indicated deleteriousness and 18 decision trees have indicated a benign variant.

Please note that the tree vote value should NOT be interpreted as the probability of error. Our results show that wrong predictions are usually caused by benign or deleterious variants that show characteristics of the other case, e.g. SNPs that are highly conserved and destroy protein features or disease mutations that appear to have no effect on the protein/gene at all. However, the number of deleterious versus benign votes indicates the confidence of the classifier that this variant belongs to that particular class.
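
When working with exported results, the tree vote string can be split into its two counts. A minimal sketch, assuming the value is written exactly as in the example above (deleterious|benign); the function name is hypothetical.

```python
def parse_tree_vote(vote):
    """Split a tree vote such as '82|18' into counts and a deleterious share."""
    deleterious, benign = (int(part) for part in vote.split("|"))
    total = deleterious + benign
    return deleterious, benign, deleterious / total


deleterious, benign, share = parse_tree_vote("82|18")
print(f"{deleterious} deleterious vs. {benign} benign trees ({share:.0%})")
```

The share expresses the agreement among the trees and, as noted above, should not be read as a probability of error.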

Models

We provide five different models aimed at different types of variants: without_aae, simple_aae, complex_aae, 5utr and 3utr.

All models were trained with all available and suitable common polymorphisms and disease mutations. The number of features and the number of trees were separately optimised for each of the five models.

MutationTaster automatically determines the correct model for each variant.

Training and Performance

We have attempted to find a reasonable trade-off between predictive performance and speed and therefore limited the tree number and tree size within the different Random Forest models.

The final performance scores presented below are calculated on a hold-out dataset that was separated before cross-validation and give an indication of the model performance on real world data.

Metric                                     without_aae  simple_aae  complex_aae  5utr   3utr
PPV                                        0.935        0.895       0.994        0.872  0.820
NPV                                        0.996        0.891       0.857        0.907  0.980
Sensitivity                                0.824        0.903       0.999        0.813  0.697
Specificity                                0.998        0.882       0.618        0.939  0.989
Balanced accuracy                          0.911        0.893       0.808        0.876  0.843
Precision                                  0.935        0.895       0.994        0.872  0.820
Recall                                     0.824        0.903       0.999        0.813  0.697
No of test variants                        438848       85393       198637       5307   7885
No of positive test variants (pathogenic)  64653        61991       197549       4066   2593
No of negative test variants (benign)      3567047      55491       3066         7957   38828

It should be noted that the prediction models used by MutationTaster were explicitly trained for balanced accuracy, i.e. equal predictive performance for benign and deleterious variants. While this increases the number of false-positive predictions, it reduces the risk of missing a true disease mutation compared to predictors trained for specificity.
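
Balanced accuracy is the mean of sensitivity and specificity, so the values in the table can be recomputed from the two rows above them. The sketch below uses the reported figures; small differences in the last digit are expected because the inputs are already rounded to three decimals.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of the true-positive rate and the true-negative rate."""
    return (sensitivity + specificity) / 2


# (sensitivity, specificity, reported balanced accuracy) from the table.
REPORTED = {
    "without_aae": (0.824, 0.998, 0.911),
    "simple_aae": (0.903, 0.882, 0.893),
    "complex_aae": (0.999, 0.618, 0.808),
    "5utr": (0.813, 0.939, 0.876),
    "3utr": (0.697, 0.989, 0.843),
}

for model, (sens, spec, reported) in REPORTED.items():
    computed = balanced_accuracy(sens, spec)
    print(f"{model}: computed {computed:.4f}, reported {reported}")
```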

A more detailed description of the classifier and how the models were trained can be found on the Random Forest Classifier page. The models are available for download on the Supplementary data page.

Special cases

If a variant is a 'common' polymorphism (as confirmed by the existence of at least 40 homozygous individuals in gnomAD), it is automatically predicted to be benign. Variants listed as disease mutation in ClinVar or predicted to cause a premature termination codon (leading to nonsense-mediated mRNA decay, NMD) are automatically predicted as deleterious. In both cases, the Random Forest classifier is run nevertheless and the tree vote for the prediction that was automatically made is shown.
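
The special cases can be read as overrides applied on top of the Random Forest result. The sketch below illustrates that reading; the function and parameter names are hypothetical, and the order in which the overrides are checked is an assumption because the documentation does not state which rule wins if several apply.

```python
def final_prediction(forest_prediction, gnomad_homozygous=0,
                     clinvar_pathogenic=False, causes_nmd=False):
    """Return (prediction, automatic) after applying the special cases."""
    # Assumption: ClinVar and NMD are checked before the gnomAD rule.
    if clinvar_pathogenic or causes_nmd:
        return "deleterious", True
    if gnomad_homozygous >= 40:
        return "benign", True
    # No special case applies: keep the Random Forest prediction.
    return forest_prediction, False


print(final_prediction("benign", clinvar_pathogenic=True))
print(final_prediction("deleterious", gnomad_homozygous=57))
print(final_prediction("deleterious", gnomad_homozygous=3))
```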

Limitations and future plans

Limitations

Future plans

References

  1. Harrison PW, Amode MR, Austine-Orimoloye O, et al. Ensembl 2024. Nucleic Acids Res. 2024;52(D1):D891-D899. doi:10.1093/nar/gkad1049
  2. O'Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, Astashyn A, Badretdin A, Bao Y, Blinkova O, Brover V, Chetvernin V, Choi J, Cox E, Ermolaeva O, Farrell CM, Goldfarb T, Gupta T, Haft D, Hatcher E, Hlavina W, Joardar VS, Kodali VK, Li W, Maglott D, Masterson P, McGarvey KM, Murphy MR, O'Neill K, Pujar S, Rangwala SH, Rausch D, Riddick LD, Schoch C, Shkeda A, Storz SS, Sun H, Thibaud-Nissen F, Tolstoy I, Tully RE, Vatsan AR, Wallin C, Webb D, Wu W, Landrum MJ, Kimchi A, Tatusova T, DiCuccio M, Kitts P, Murphy TD, Pruitt KD. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016 Jan 4;44(D1):D733-45. doi: 10.1093/nar/gkv1189.
  3. Bateman A, Martin MJ, Orchard S, Magrane M, Agivetova R, Ahmad S, Alpi E, Bowler-Barnett EH, Britto R, Bursteinas B, Bye-A-Jee H. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Research. 2020 Nov 25. doi: https://doi.org/10.1093/nar/gkaa1100.
  4. Tordai H, Torres O, Csepi M, Padányi R, Lukács GL, Hegedűs T. Analysis of AlphaMissense data in different protein groups and structural context. Sci Data. 2024 May 14;11(1):495. doi: 10.1038/s41597-024-03327-8. PMID: 38744964; PMCID: PMC11094042.
  5. Grantham, R: Amino acid difference formula to help explain protein evolution. Science 185: 862-864 (1974)
  6. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP, Gauthier LD. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020 May;581(7809):434-43. doi: https://doi.org/10.1038/s41586-020-2308-7.
  7. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10.
  8. Pollard KS, Hubisz MJ, Siepel A: Detection of non-neutral substitution rates on mammalian phylogenies. Genome Res. 2010 Jan;20(1):110-21
  9. Yeo G, Burge CB. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. Journal of computational biology. 2004 Mar 1;11(2-3):377-94.
  10. Tabaska JE, Zhang MQ: Detection of polyadenylation signals in human DNA sequences. Gene 1999;231: 77-86.

MutationTaster Tutorial

Sanger sequencing

Disease mutation via sequence snippet: SOD1_H47R

Imagine a patient suffering from ALS. You decide to sequence the exons of the SOD1 gene. By looking at the sequence reads, you find an A->G point mutation. The sequence after this exchange is TGTTCATGAGTTTGGAGATAATACAGCAGGCTGT.

Open the web interface, enter the gene symbol (SOD1) and select the correct transcript (ENST00000270142). Then select the radio button for coding sequence positions and enter the snippet in the dbSNP format (i.e. [A/G]TGTTCATGAGTTTGGAGATAATACAGCAGGCTGT) into the snippet field. Now click on submit...

This point mutation is known to cause ALS (AMYOTROPHIC LATERAL SCLEROSIS 1, OMIM 147450#0008). The involved amino acid residue is highly conserved and is responsible for the copper-binding activity (displayed as lost protein feature). Also, the phyloP and phastCons values are very high, indicating strong evolutionary conservation.

Please note that this mutation has a dbSNP ID - excluding variants from dbSNP as potential disease candidates is not a good idea!

Polymorphism via gene position: AGRN_g.28670_28671delAG

Imagine a patient with a muscular disease. Among others, you sequence the complete sequence of the agrin (AGRN) gene and find a deletion of 2 bases: AG at position 28670 and 28671. Is this likely to be disease-causing, e.g. by destroying the reading frame? Now imagine that you are too lazy (or too clever) to look up all exons by hand...

Open the web interface, enter the gene symbol (AGRN) and select a transcript (e.g. ENST00000379370). Then select the radio button for gene positions and enter the last reference base upstream of the variant (28669), the first reference base downstream of the variant (28672), and the inserted bases (nothing). You will see the gene sequence below the input fields, helping you to check if you 'delete' the correct bases. Now click on submit...

This example variant is a deletion which was also found in the 1000Genomes Project (1000G) in homozygous state. There is no amino acid change. The altered base is hardly conserved, as conveyed by the low phyloP / phastCons values.

NGS of candidate genes

Disease mutation via chromosomal position: CHRND_L63P (chr2:232526664T>C)

Imagine you sequenced the complete exome of a patient suffering from myasthenic syndrome. You find a mutation in the gene for the delta subunit of the nicotinic acetylcholine receptor on chromosome 2 (chr2:232526664T>C). Imagine that you are too lazy to determine gene-specific positions.

Open the web interface for physical positions, enter the chromosome (2), the physical position (232526664) and reference allele (t) and variant (c). Now click on submit...

Whole exome/genome sequencing

Complete genotypes - a tiny example

Imagine you sequenced a complete genome of a patient with myasthenic syndrome. Amazingly, you end up with just 5 variants.

Go to our VCF analysis pipeline, download the sample file and upload it to the pipeline. Feel free to play around with the filters!

On the results page, click on 'display' to get a closer look at the results. Click on GenDaB links on the gene symbols to check the likelihood of the genes involved to be disease causing.

MutationTaster Examples

All examples can also be accessed from the footer of the main page.

Deleterious variants

CHRND_L63P

This single base exchange in the CHRND gene is listed in NCBI ClinVar as a known disease-causing variant (dbSNP:rs121909508) for MYASTHENIC SYNDROME, CONGENITAL, FAST-CHANNEL (OMIM 100720#0013).

It results in an amino acid exchange from leucin to proline. The involved amino acid residue is part of a functional domain (topo domain) and is highly conserved (amino acids identical in all homologes). As a consequence of the mutation, the topo domain is lost. The phyloP and phastCons values for the changed nucleotide are very high, again indicating strong evolutionary conservation.

Show in MutationTaster: Analyse all transcripts, analyse transcript ENST00000258385

CLDN16_G191R

This variant is a single base exchange that is listed in NCBI ClinVar as a known disease-causing variant (dbSNP:rs104893721) for HOMG3 (HYPOMAGNESEMIA 3, OMIM 603959#0003).

The altered base leads to an amino acid substitution at a highly conserved residue (amino acid is identical in all homologs). Since the amino acid exchange is quite dramatic (a neutral glycine is replaced with a charged arginine), the substitution has a high amino acid exchange score (3.41, extracted from the Grantham matrix).

As reflected by the high phyloP and phastCons scores, the position is also conserved on the DNA level. The predicted splice site changes are only marginal and can therefore be neglected. The classification 'disease causing' is due to the ClinVar entry, but the high probability indicates that MutationTaster would have classified this variant as deleterious anyway.

Show in MutationTaster: Analyse transcript ENST00000264734

SOD1_H47R

This example is a mutation that is listed in NCBI ClinVar as a known disease-causing variant (dbSNP:rs121912443) for ALS (AMYOTROPHIC LATERAL SCLEROSIS 1, OMIM 147450#0008). The involved amino acid residue is highly conserved and lies within the catalytic copper-binding site of the protein, which is displayed as lost protein feature.

The phyloP and phastCons values for this position (4.283 and 0.996) are also very high, indicating strong evolutionary conservation. Again, the prediction is due to the ClinVar entry but has a very high probability anyway.

Show in MutationTaster: Analyse transcript ENST00000270142

Benign variants

AGRN_g.28670_28671delAG

This example is a deletion that was found more than 4 times in the 1000Genomes Project (1000G) in homozygous state. There is no amino acid exchange since the deletion lies within an intronic sequence. As can be seen from the low phyloP / phastCons values, the deleted bases are hardly conserved. The splice site changes should be interpreted with care, especially in this context where the variant is not conserved and has frequently been found homozygous in the 1000G. It can be assumed that the splice site changes are false positive predictions. The classification 'benign' is due to the findings in the 1000G, but the high probability indicates that MutationTaster would have classified this variant as non-disease causing anyway.

Show in MutationTaster: Analyse transcript ENST00000379370

VCF batch queries

Tiny sample file

The VCF file TinyExample38.vcf only contains three lines to illustrate the file format:

3    190122694    .    G    A    116    .    .    GT:DP    0/1:154
1    984171    .    CAG    C    116    .    .    GT:DP    0/1:154
21    33036170    .    A    G    116    .    .    GT:DP    0/1:154
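The first five columns of such a line (chromosome, position, ID, reference allele, alternative allele) already contain everything needed for a single-variant query. As a rough illustration, the Python sketch below turns one VCF data line into variant strings of the form 1:984171CAG>C; the function name vcf_line_to_variants is hypothetical, and the sketch assumes that the VCF anchoring convention for InDels (the last reference base before the variant comes first) can be passed on unchanged.

```python
def vcf_line_to_variants(line):
    """Return 'chrom:posREF>ALT' strings for one VCF data line.

    Hypothetical helper for illustration; header lines and lines with
    fewer than five columns yield an empty list.
    """
    if line.startswith("#"):
        return []
    fields = line.split()  # VCF is tab-delimited; split() also accepts spaces
    if len(fields) < 5:
        return []
    chrom, pos, _variant_id, ref, alt = fields[:5]
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    # A VCF line may list several ALT alleles separated by commas;
    # "." (no variant) and "*" (spanning deletion) are skipped.
    return [
        f"{chrom}:{int(pos)}{ref}>{allele}"
        for allele in alt.split(",")
        if allele not in (".", "*")
    ]


for row in [
    "3\t190122694\t.\tG\tA\t116\t.\t.\tGT:DP\t0/1:154",
    "1\t984171\t.\tCAG\tC\t116\t.\t.\tGT:DP\t0/1:154",
]:
    print(vcf_line_to_variants(row))
```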

Show in MutationTaster

Sample exome (cystic fibrosis)

The zipped file SampleExome_CF38.vcf.gz contains a public VCF file with disease mutations in the CFTR gene spiked in. It is good practice to zip or gzip large VCF files before upload to MutationTaster. All variants must be in VCF format and refer to GRCh38 / hg38.

Show in MutationTaster

MutationTaster API

Automated analysis of chromosomal positions

If you want to integrate queries for chromosomal positions into your own NGS pipelines, there are two ways to do so, depending on the output you prefer.

Text output

https://www.genecascade.org/MutationTaster2025/modperl/API.cgi?variants=2:232526664T>C
https://www.genecascade.org/MutationTaster2025/modperl/API.cgi?variants=21:33039603A>C,2:232526664T>C

If submitted via GET, the character ">" in the commands above is usually encoded as "%3E". That is,

https://www.genecascade.org/MutationTaster2025/modperl/API.cgi?variants=2:232526664T%3EC
https://www.genecascade.org/MutationTaster2025/modperl/API.cgi?variants=21:33039603A%3EC,2:232526664T%3EC

The value of variants can be a single variant or several variants separated by commas. The output is a table with 14 columns: chr, pos, ref, alt, transcript_stable, NCBI_geneid, prediction, model, tree_vote, note, splicesite, distance_from_splicesite, disease_mutation, polymorphism.
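In a pipeline it is usually more convenient to address the 14 columns by name. The Python sketch below shows one possible way to do this. It assumes that the text output is tab-separated and may start with a header row that repeats the column names; neither is stated here, so please check the actual output and adjust the delimiter if necessary. The function name parse_api_table is hypothetical.

```python
COLUMNS = [
    "chr", "pos", "ref", "alt", "transcript_stable", "NCBI_geneid",
    "prediction", "model", "tree_vote", "note", "splicesite",
    "distance_from_splicesite", "disease_mutation", "polymorphism",
]


def parse_api_table(text, delimiter="\t"):
    """Turn the text output into a list of dicts keyed by column name.

    Assumption: one row per line, cells separated by `delimiter`,
    optionally preceded by a header row with the column names.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        cells = line.split(delimiter)
        if len(cells) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, found {len(cells)}"
            )
        if cells == COLUMNS:
            continue  # skip the header row, if there is one
        rows.append(dict(zip(COLUMNS, cells)))
    return rows
```

Each returned row can then be filtered by name, e.g. row["prediction"] or row["transcript_stable"].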

You can specify your preferred reference genome and Ensembl build by adding &genome_version=38&ensembl_version=112 to your query. Currently, MutationTaster2025 only supports Ensembl build 112 of GRCh38, but support for GRCh37 and other Ensembl builds is planned for future releases.

In rare cases where the above URLs do not work, it may additionally be necessary to encode the characters ":" as "%3A" and "," as "%2C".
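The encoding rules above can be applied automatically. The following Python sketch builds a query URL from a list of variants using the standard library only; the function name build_api_url and the strict switch are illustrative assumptions, while the URL and parameter names are the ones shown above.

```python
from urllib.parse import quote

API_URL = "https://www.genecascade.org/MutationTaster2025/modperl/API.cgi"


def build_api_url(variants, genome_version=None, ensembl_version=None,
                  strict=False):
    """Build a GET URL for the text output (hypothetical helper).

    By default only '>' is percent-encoded (as '%3E'). With strict=True,
    ':' and ',' are encoded as well ('%3A' and '%2C').
    """
    safe = "" if strict else ":,"
    query = "variants=" + quote(",".join(variants), safe=safe)
    if genome_version is not None:
        query += f"&genome_version={int(genome_version)}"
    if ensembl_version is not None:
        query += f"&ensembl_version={int(ensembl_version)}"
    return f"{API_URL}?{query}"


print(build_api_url(["21:33039603A>C", "2:232526664T>C"]))
print(build_api_url(["2:232526664T>C"], genome_version=38,
                    ensembl_version=112, strict=True))
```

The resulting URL can be fetched with any HTTP client, for instance urllib.request.urlopen.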

See output

Perl script

An example of how to access the API via POST and print the results can be found in this Perl script:

QueryMutationTasterAPI.pl

HTML output

https://genecascade.org/MutationTaster2025/cgi-bin/ChrPos.cgi?chromosome=2&position=232526664&ref=T&alt=C

The respective values for chromosome, position, ref and alt have to be set according to the variant(s) in question. The output has the same format as the output generated via the MutationTaster Web interface.
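If you generate such links for many variants, e.g. for a report, the four parameters can be filled in programmatically. A minimal Python sketch follows; the function name build_chrpos_url is hypothetical, and converting the alleles to upper case merely mirrors the example URL above.

```python
from urllib.parse import urlencode

CHRPOS_URL = "https://genecascade.org/MutationTaster2025/cgi-bin/ChrPos.cgi"


def build_chrpos_url(chromosome, position, ref, alt):
    """Build the link to the HTML output for one variant (hypothetical helper)."""
    params = {
        "chromosome": chromosome,
        "position": int(position),
        "ref": ref.upper(),
        "alt": alt.upper(),
    }
    # urlencode keeps the parameter order and escapes special characters.
    return f"{CHRPOS_URL}?{urlencode(params)}"


print(build_chrpos_url("2", 232526664, "T", "C"))
```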

See output

Automated analysis of VCF files

POST request with curl

On Unix-based systems, you can use curl to post a VCF file to MutationTaster:
curl \
-F "name=Project_name" \
-F "email=your@email.edu" \
-F "filename=@Your_VCF_file.vcf" \
https://genecascade.org/MutationTaster2025/cgi-bin/MT_VCF_Pipeline.cgi

Replace Project_name, your@email.edu and Your_VCF_file.vcf with your details. The path to the VCF file must be prefixed with "@".

If the request is correct, you will immediately receive an email confirming your submission with a link to monitor the status of the analysis. Otherwise, please check the output of curl for possible errors.
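If you prefer to start the upload from a script, the same curl call can be wrapped in a few lines of Python. This is only a sketch: the function names are hypothetical, curl must be installed and on your PATH, and the form fields are exactly those of the command above.

```python
import subprocess

VCF_PIPELINE_URL = (
    "https://genecascade.org/MutationTaster2025/cgi-bin/MT_VCF_Pipeline.cgi"
)


def build_curl_command(project_name, email, vcf_path):
    """Return the curl call from above as an argument list."""
    return [
        "curl",
        "-F", f"name={project_name}",
        "-F", f"email={email}",
        "-F", f"filename=@{vcf_path}",  # '@' tells curl to upload the file
        VCF_PIPELINE_URL,
    ]


def submit_vcf(project_name, email, vcf_path):
    """Run curl and return its output; raises CalledProcessError on failure."""
    command = build_curl_command(project_name, email, vcf_path)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


print(" ".join(build_curl_command("Project_name", "your@email.edu",
                                  "Your_VCF_file.vcf")))
```

Passing the arguments as a list avoids shell quoting problems with project names or paths that contain spaces.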

Perl script

If you want to avoid the manual upload of VCF files to MutationTaster, we provide a Perl script which automatically sends a VCF file to the VCF analysis pipeline and afterwards retrieves the results. Please find it here:

sendVCF_MutationTaster.pl

If you encounter any problems regarding the API, please write us an email.

MutationTaster Support

Contact

In case you discover bugs, have suggestions or questions, please write an e-mail to Dominik Seelow (dominik.seelow AT charite.de).

We also appreciate hearing about your general experiences using MutationTaster.

Publications

Steinhaus R, Proft S, Schuelke M, Cooper DN, Schwarz JM, Seelow D. MutationTaster2021. Nucleic Acids Research. 2021 Apr 24.

Schwarz JM, Cooper DN, Schuelke M, Seelow D. MutationTaster2: mutation prediction for the deep-sequencing age. Nature methods. 2014 Apr;11(4):361-2.

Schwarz JM, Rödelsperger C, Schuelke M, Seelow D. MutationTaster evaluates disease-causing potential of sequence alterations. Nature methods. 2010 Aug;7(8):575-6.

Error messages

Message | Explanation
InsDel too long | At present, MutationTaster handles only InsDels up to 12 bases.
Your mutation of interest seems to span an exon/intron boundary. | This kind of mutation can only be analysed in gDNA mode.
No transcripts for this gene found! | You might have misspelled the gene symbol or used a protein name, which is not always the correct gene symbol (e.g. protein p53 is gene TP53). Also, in some (rare) cases an NCBI gene could not be mapped to an Ensembl gene. As some external data is based on NCBI while other data is based on Ensembl, MT needs both to make a prediction. Moreover, we filter out protein-coding transcripts (Ensembl biotype protein_coding) without a correct start codon (ATG) and correct stop codon (TGA, TAA, TAG). This might lead to the phenomenon that MutationTaster complains about "no suitable transcripts" or "no transcripts for this gene found" although Ensembl lists one or several. Transcripts of mitochondrial genes are not tested for integrity due to differences in the mitochondrial genetic code.
No internal Ensembl transcript ID found. / No Ensembl gene ID found for transcript. / No stable ID for this gene. | Our database doesn't know the transcript you specified. This might happen if you refer to a newer or older release than the one we use. The release MT uses is mentioned on the query interface.
Ensembl gene XXX not found in ENSEMBL | Our database doesn't know the gene you specified. This might happen if you refer to a newer or older release than the one we use. The release MT uses is mentioned on the query interface.
No NCBI gene ID found. / No NCBI gene ID found for this transcript. | In some (rare) cases an Ensembl gene could not be mapped to an NCBI gene. As some external data is based on NCBI while other data is based on Ensembl, MT needs both to make a prediction.
Too many NCBI gene IDs found. | In some (rare) cases an Ensembl gene could not be mapped to a single NCBI gene. As some external data is based on NCBI while other data is based on Ensembl, MT needs both to make a prediction.
Only invalid NCBI gene IDs found. | In some (very rare) cases an Ensembl gene could not be mapped to a valid NCBI gene, i.e. the NCBI gene Ensembl refers to is 'discontinued' and was replaced by another gene. As some external data is based on NCBI while other data is based on Ensembl, MT needs both to make a prediction. Please contact us if you encounter such a case.
Gene XXX not found on any chromosome. | The gene under scrutiny has no valid positional data. This should not occur at all. Please contact us if you encounter such a case.
Gene XXX (Entrez gene YYY) and transcript ZZZ do not match! | The transcript you entered is not a product of the gene you entered. Please check your input.
Position is out of gene! | You entered a position that is located outside the gene. This may happen when you mapped a genomic position to a gene-specific position using an old genome build. Or, of course, by typos. Please check your input.
Could not retrieve a sequence or sequence is too short. | MT was not able to get the gene sequence from Ensembl. This might be due to network problems, so you should repeat the analysis after some time. Should this not work, please contact us.
No start ATG exon found. | The transcript is not properly annotated: there is no start position of the coding sequence in the database. Please select another transcript of the same gene.
No stop exon found. | The transcript is not properly annotated: there is no stop position of the coding sequence in the database. Please select another transcript of the same gene.
Chosen transcript ENSTXXX has no correct start ATG annotated. | Protein-coding transcripts (Ensembl biotype protein_coding) are tested for transcript integrity, i.e. for presence of a correct start codon (ATG) and correct stop codon (TGA, TAA, TAG). If one is missing, an error message is raised because analysis in corrupt transcripts might lead to a wrong prediction.
Sequence XXX is not unique in your gene! | Please use a longer snippet.
Sequence was not found in your gene. | Please check your input: is there a typo in your snippet? Or do you use a snippet created from the wrong strand? MT always refers to the strand the gene is located on.
Snippet not properly formatted. | Please check your input: snippets must be specified as ACGTACGT[OLDBASES/NEWBASES]ACGTACGT.
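The three snippet-related messages can be avoided by checking a snippet before submission. The Python sketch below validates the ACGTACGT[OLDBASES/NEWBASES]ACGTACGT pattern and counts how often the reference version of a snippet occurs in a gene sequence you already have at hand. The function names are hypothetical, this is not MutationTaster's own check, and whether the web form accepts an empty side inside the brackets (for pure insertions or deletions) is an assumption of this sketch.

```python
import re

SNIPPET = re.compile(r"^([ACGT]*)\[([ACGT]*)/([ACGT]*)\]([ACGT]*)$",
                     re.IGNORECASE)


def parse_snippet(snippet):
    """Split 'LEFT[OLD/NEW]RIGHT' into (left, old, new, right), upper-cased."""
    match = SNIPPET.match(snippet.strip())
    if match is None:
        raise ValueError("Snippet not properly formatted")
    left, old, new, right = (part.upper() for part in match.groups())
    if old == new:
        raise ValueError("reference and alternative allele are identical")
    return left, old, new, right


def occurrences(gene_sequence, snippet):
    """Count (possibly overlapping) matches of the reference snippet.

    0 corresponds to 'not found', more than 1 to 'not unique'.
    """
    left, old, _new, right = parse_snippet(snippet)
    reference = left + old + right
    gene = gene_sequence.upper()
    return sum(
        1
        for start in range(len(gene) - len(reference) + 1)
        if gene.startswith(reference, start)
    )


print(parse_snippet("ACGGTT[A/G]CTCTAAGGA"))
```

Remember that the gene sequence must be given for the strand the gene is located on.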

MutationTaster FAQs

General questions

Can I download pre-computed predictions?

Unfortunately not. Unlike SIFT or PolyPhen, which handle only single amino acid substitutions, MutationTaster works on DNA level and allows insertions and deletions. The exome alone comprises about 30 Mb with 3 possible single base exchanges (SBEs) at each site (let alone introns and InDels). These 30 M x 3 SBEs may affect several different transcripts, leading to about 30,000,000 (bases) x 3 (SBEs) x 5 (transcripts) = 450,000,000 values to pre-compute.

We could of course generate such a list, but it would still not include the InDels and most of the introns. What is more important: such a list would take a very long time to generate and might soon become outdated. We rather spend our efforts on improving MutationTaster!
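The estimate is easy to reproduce and to adapt to other assumptions. In the short Python sketch below, the three figures are the rough values used in this answer (about 5 transcripts per site is an estimate, not a measured number).

```python
# Rough figures from the answer above, not exact counts.
exome_bases = 30_000_000      # about 30 Mb of exome sequence
exchanges_per_base = 3        # each base can change into 3 other bases
transcripts_per_site = 5      # estimated number of affected transcripts

total_predictions = exome_bases * exchanges_per_base * transcripts_per_site
print(f"{total_predictions:,} values to pre-compute")
```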

Why doesn't MutationTaster know my valid transcript ID?

We filter out protein-coding transcripts (Ensembl biotype protein_coding) without a correct start codon (ATG) and correct stop codon (TGA, TAA, TAG). This might lead to the phenomenon that MutationTaster complains about "no suitable transcripts" or "no transcripts for this gene found" although there are some listed in Ensembl. We decided to exclude such transcripts from analysis in MutationTaster due to their bad annotation, which might in the end lead to a wrong prediction. Transcripts of mitochondrial genes are not tested for integrity due to differences in the mitochondrial genetic code.
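If you want to check in advance whether a transcript is likely to pass this filter, you can test its coding sequence yourself. The following Python sketch checks only the two criteria named above, using the standard stop codons TGA, TAA and TAG. The function name has_intact_cds is hypothetical and this is not MutationTaster's own implementation; as explained above, it does not apply to mitochondrial genes.

```python
STOP_CODONS = {"TGA", "TAA", "TAG"}  # stop codons of the standard genetic code


def has_intact_cds(coding_sequence):
    """Return True if the coding sequence starts with ATG and ends with a stop codon."""
    sequence = coding_sequence.strip().upper()
    return (
        len(sequence) >= 6               # room for a start and a stop codon
        and sequence.startswith("ATG")
        and sequence[-3:] in STOP_CODONS
    )


print(has_intact_cds("ATGGCCAAGTAG"))  # start and stop codon present
print(has_intact_cds("ATGGCCAAGATG"))  # ends with ATG, which is not a stop codon
```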

Does a high Tree vote value indicate a high probability for a correct prediction, then?

Unfortunately not. Our results show that wrong predictions are usually not reflected by low Tree vote values but are rather caused by benign or deleterious variants that show characteristics of the other case, e.g. SNPs that are highly conserved and destroy protein features or disease mutations that appear to have no effect on the protein/gene at all.

Why don't you exclude known SNPs as possible deleterious mutations?

Because for many variants listed in dbSNP, the three possible genotypes have never all been observed in unaffected individuals. Some SNPs even appear to have only one allele. And even if both alleles were observed, there should be a sufficient number of healthy individuals who are homozygous for the minor allele to exclude a damaging effect.

If all three genotypes were observed in the HapMap project, or the variant was found homozygously in the 1000Genomes Project more than 4 times, it will automatically be regarded as benign.

Why don't you use allele frequencies to exclude variants as possible deleterious mutations?

We decided not to do this because not all loss-of-function variants are rare (e.g. CFTR mutations). Instead, we rely on healthy homozygous individuals. If they are observed, a variant is unlikely to cause a severe early-onset single gene disorder and is hence predicted to be benign.

The prediction for my favourite variant has changed. Why?

Well, this is a very rare event. However, as the available data such as protein features is increasing, we regularly update our database and re-train the classifier. In some cases, the annotation of a gene improves drastically. This may yield formerly unknown protein features in your gene/protein at your position which can of course influence the prediction of your variant.

Is there any way to learn how a single variant is classified?

Well, yes and no. A Random Forest classifier studies the frequencies of single item statuses (such as 'conservation in cattle - highly conserved' or 'existence of a disulfide bond - no') in both groups of the training set ('polymorphisms' / 'disease mutations'). It compares the statuses of these items in your variant with the known frequencies and then decides which group fits best.

You can of course study the model and hence the frequencies used by the classifier. They can be found in our supplementary data.

What does "InDel variants are limited to 40 bp" mean?

The website states that with MT "InDel variants are limited to 40 bp", however, does that mean the ACTUAL insertion or deletion, or the DESCRIPTION of the insertion or deletion? For example, the description of the ALT variant contains 15 bases, whilst the REF variant contains 13 bases, so that the actual INSERTION (AC) here is only 2 bases but it has been rejected for being too long.

When bases get inserted to / deleted from a stretch of similar bases, as in the given example to a stretch of several 'AC', MutationTaster doesn't know at which position exactly the 'AC' was inserted (or deleted), due to the whole stretch of AC. That's why it has to use the whole 13 bases, although the actual insertion is only 2 bases. This is also the reason why such variants are described that way in your VCF file.
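This ambiguity can be demonstrated with a toy example. The Python sketch below inserts 'AC' at every possible position of a short made-up sequence containing an AC repeat and groups the positions by the resulting sequence; the sequence and the function names are invented for illustration only.

```python
def insert_at(sequence, offset, bases):
    """Return `sequence` with `bases` inserted before position `offset`."""
    return sequence[:offset] + bases + sequence[offset:]


def offsets_by_result(sequence, bases):
    """Group all insertion offsets by the sequence they produce."""
    groups = {}
    for offset in range(len(sequence) + 1):
        groups.setdefault(insert_at(sequence, offset, bases), []).append(offset)
    return groups


groups = offsets_by_result("GACACACT", "AC")
print(groups["GACACACACT"])  # prints [1, 3, 5, 7]
```

Four different insertion positions produce exactly the same sequence, so the variant can only be described unambiguously by including the whole repeat in the reference and the alternative allele.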

Why don't you offer a MutationTaster download version for local installation on my own machine?

We are regularly asked for standalone versions of MutationTaster, our conversion tools or the database. We don't offer these because it is not feasible: we would flood the world with lots of different versions of MutationTaster which we could never maintain. Distributing local installations would probably lead to hundreds of support questions, and we (only 2 people) are already busy with those that concern the version we control and know. We are not able to give support for installation issues, questions like 'how is the conservation internally stored?' or errors that occur only in versions modified by users. Moreover, you would need very powerful hardware and a highly optimised server to reach the same speed as the online version. MutationTaster uses a database which is tens of GBs in size, with parts of Ensembl and the 1000 Genomes data in it. Additionally, we use some external tools for which we have signed disclosure agreements and which we are hence not allowed to share with other groups anyway. If you want to integrate MutationTaster into your own analysis pipeline for Next Generation Sequencing data, we suggest using our VCF analysis pipeline, which can be called via Perl's WWW::Mechanize module and similar approaches.

What does the AA changes score mean and how does it influence the prediction?

The score is taken from the Grantham matrix for amino acid substitutions and reflects the physicochemical difference between the original and the mutated amino acid. It ranges from 0.0 to 215 but does not provide a value for amino acid insertions/deletions. However, the score is only displayed for informational purposes and does not influence the prediction. Instead, MutationTaster uses the frequency of the respective AA exchange in known disease-causing mutations and polymorphisms for the classification.

Why is the same variant classified as benign when there is an amino acid exchange and as deleterious when there is no amino acid exchange?

MutationTaster uses five different models (without_aae, simple_aae, complex_aae, 3utr, 5utr) for its prediction. Depending on the type of variant, MutationTaster automatically determines the correct model. Each model was trained with a suitable set of known polymorphisms / disease mutations, and the prioritisation of the individual parameters differs among the models. Thus the prediction for a variant might not be the same if two different models are applied (e.g. without_aae model and simple_aae model). In some cases with a 'deleterious' prediction due to DNA-related features such as strong conservation, knowledge of the effect of the amino acid substitution can 'weaken' the prediction, e.g. if the difference between the two amino acids is modest and no protein domains are affected. This is a consequence of the different models: if we used only one, all 'silent' mutations would be considered as benign - and we decided to rather risk false positives than to lose any true positives.

Why are the phyloP and phastCons scores for the same variant different in MutationTaster2025 compared to MutationTaster2021 or other sources like gnomAD?

There are several conservation tracks available in UCSC, which are based on alignments of different numbers of species. In MutationTaster2025 we use the 100 vertebrate species version for GRCh38, whereas in MutationTaster2021 we used the 46-species track for GRCh37. Similarly, other data sources may use different alignments (e.g. gnomAD uses the 241-way alignment), so their scores may differ.

The AlphaMissense transcript links sometimes don't show a result, why?

For some transcripts, there simply is no structure available in AlphaMissense. This is why MutationTaster2025 always provides the gene-specific link to AlphaMissense along with the transcript-specific one.

VCF files

Why are there so many cases with no prediction (n/a)?

Most of the n/a cases are due to a missing link between an Ensembl transcript and an NCBI gene (error message: no NCBI gene ID found for this transcript). Ensembl has far more genes and transcripts annotated than NCBI; however, we need to link the Ensembl genes to NCBI in order to get the HGNC gene symbol and SwissProt accession ID. To circumvent this, we plan to fetch the SwissProt ID and gene symbol also via Ensembl in the future, so that the analysis can be conducted nevertheless if the link to NCBI is missing.

Why are there so many variants outside genes, although I have uploaded a VCF file from exome sequencing?

Target enrichment is not 100% perfect, thus it is normal that there are variants outside genes. Moreover, we do not use all available transcripts (see MutationTaster FAQs), because some are not suitable for analysis with MutationTaster. If a variant lies in a gene which only has transcripts not suitable for MutationTaster analysis, it will be counted as outside gene.

MutationTaster Changelog

Welcome to the MutationTaster changelog. This page informs you about the latest developments and modifications to MutationTaster.

20 December 2024

MutationTaster2025

Our latest release of MutationTaster includes the following changes:

Previous updates

Visit https://www.genecascade.org/MutationTaster2021/info/#changelog

MutationTaster Legal

Imprint

Responsible for the purposes of media law for this page

Dominik Seelow
Bioinformatics and Translational Genetics
Berliner Institut für Gesundheitsforschung
Charitéplatz 1
10117 Berlin
email: dominik.seelow (at) charite.de

Data protection

If you use one of our query engines, you may enter an e-mail address. The secret URL pointing at your project will be sent to you directly after the upload and you will receive a second notification when the analysis has finished.
Your e-mail is only visible during the data analysis and the analysis web pages can only be accessed via the secret URL.
Data you uploaded will be automatically deleted within two months unless you delete it yourself or request complete deletion or an extension of that period by e-mail. Please note that your from-address must match the address you specified when uploading the data. Your e-mail address will remain in our database to allow us to determine the number of different users. It will be deleted upon request.

License

MutationTaster 2025 is free and open to all users and there is no login requirement. Please contact us if you want to include pre-computed scores of MutationTaster into your own software.