Yum, tasty mutations...

MutationTaster Documentation

Input

MutationTaster has three different analysis modes that can be selected in the left panel:

Chromosomal position

Input Description
SNV or indel

MutationTaster analyses the submitted variant at the given position in all feasible Ensembl transcripts and puts out a table with summarized results as well as the traditional detailed results. The variant can be entered in various formats:

  • 17:38244559C>T (default)
  • 17:38.244.559C>T (dot as thousands separator)
  • chr17:g.38,244,559C>T (comma as thousands separator)
  • 17-38244559-C-T (gnomAD)
  • 17 38244559 . C T (VCF)
  • 17 38244559 C T (VCF without ID column)

For InDels, use the VCF format, i.e. always start with the last reference base before the variant.
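All of the notations listed above describe the same chromosome, position, reference allele and alternative allele. As a rough illustration only (this is not MutationTaster's own parser), the Python sketch below reduces the six example notations to one tuple. The function name `normalise_variant` and the regular expression are assumptions made for this example.

```python
import re

# Matches 17:38244559C>T, 17:38.244.559C>T and chr17:g.38,244,559C>T.
_COLON_STYLE = re.compile(
    r"^(?:chr)?(?P<chrom>[0-9]{1,2}|X|Y|MT?):(?:g\.)?"
    r"(?P<pos>[0-9][0-9.,]*)(?P<ref>[ACGT]+)>(?P<alt>[ACGT]+)$",
    re.IGNORECASE,
)


def normalise_variant(text):
    """Return (chromosome, position, ref, alt) for one of the listed notations."""
    text = text.strip()
    match = _COLON_STYLE.match(text)
    if match:
        # Drop dots or commas used as thousands separators.
        pos = int(re.sub(r"[.,]", "", match.group("pos")))
        return (match.group("chrom").upper(), pos,
                match.group("ref").upper(), match.group("alt").upper())

    # VCF-like input is whitespace separated; gnomAD style uses dashes.
    fields = text.split()
    if len(fields) == 1:
        fields = text.split("-")
    if len(fields) == 5:
        # VCF with ID column: CHROM POS ID REF ALT
        chrom, pos, _variant_id, ref, alt = fields
    elif len(fields) == 4:
        chrom, pos, ref, alt = fields
    else:
        raise ValueError(f"unrecognised variant notation: {text!r}")
    chrom = re.sub(r"^chr", "", chrom, flags=re.IGNORECASE).upper()
    return chrom, int(pos), ref.upper(), alt.upper()
```

Because the VCF-like branch keeps multi-base alleles as they are, an indel written with its preceding reference base passes through unchanged.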

Specific transcript

Input Description
Gene symbol

You can identify your gene of interest by entering one of the following:

  • HGNC symbol e.g. LEP (case insensitive)
  • Ensembl gene ID (starting with ENSG, e.g. ENSG00000174697)

MutationTaster will automatically recognise the type of input. Upon typing a valid gene symbol, Ensembl transcripts for your gene will be displayed.

Transcript

You can also directly enter the Ensembl transcript ID (starting with ENST, e.g. ENST00000308868) of your gene of interest. In this case, you do not need to fill out the gene symbol field.

If more than one transcript is available, they are ordered by length from top to bottom.

Reference

Choose Coding sequence (c.) if you are working with coding sequence positions / sequence for localising the alteration of interest. Coding sequence (CDS) position 1 refers to the A of the start ATG (and is sometimes also called ORF, for open reading frame).

Choose Transcript (cDNA) if you are working with cDNA positions / sequence for localising the alteration of interest. cDNA position 1 refers to the first base of the transcript.

Choose Gene (genomic sequence) if you are working with gDNA positions / sequence for localising the alteration of interest. Genomic sequence (gDNA) position 1 refers to the first base of the gene.
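The relation between coding sequence and cDNA numbering follows directly from the two definitions of position 1. The Python sketch below assumes that the cDNA position of the A of the start ATG is known for the transcript and that the position of interest lies within the coding sequence. The function names and the example value 61 are made up for illustration. Conversions to gDNA positions additionally require exon coordinates and are not covered here.

```python
def cds_to_cdna(cds_pos, start_atg_cdna_pos):
    """Convert a coding sequence position to a cDNA position.

    start_atg_cdna_pos is the cDNA position of the A of the start ATG.
    """
    if cds_pos < 1 or start_atg_cdna_pos < 1:
        raise ValueError("positions are 1-based")
    return cds_pos + start_atg_cdna_pos - 1


def cdna_to_cds(cdna_pos, start_atg_cdna_pos):
    """Convert a cDNA position to a coding sequence position."""
    cds_pos = cdna_pos - start_atg_cdna_pos + 1
    if cds_pos < 1:
        raise ValueError("position lies upstream of the start ATG (5'UTR)")
    return cds_pos


# Hypothetical transcript whose start ATG begins at cDNA position 61.
print(cds_to_cdna(1204, 61))
```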

Variant by sequence snippet

Choose Variant by sequence snippet if you have a sequence snippet around an alteration that you want to analyse. You can paste this sequence snippet into this field, putting square brackets [ ] around the altered base and the new base (e.g. ACGGTT[A/G]CTCTAAGGA for a base exchange from A to G). Comprehensive examples of the format are provided by hovering the mouse pointer over the question mark (?) on the input field. All entries have to refer to the 5'-3' direction of the transcript sequence.
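To make the bracket notation concrete, the following Python sketch expands a snippet into its wild-type and altered sequence. It handles only the form shown in the example above (bases before the slash are the original, bases after it are the new ones). The function name `expand_snippet` is an assumption, and the other formats shown in the tooltip are not covered.

```python
import re

_SNIPPET = re.compile(r"([ACGT]*)\[([ACGT]*)/([ACGT]*)\]([ACGT]*)")


def expand_snippet(snippet):
    """Return (wild_type_sequence, altered_sequence) for a bracketed snippet."""
    match = _SNIPPET.fullmatch(snippet.strip().upper())
    if match is None:
        raise ValueError(f"not a bracketed snippet: {snippet!r}")
    left, original, new, right = match.groups()
    return left + original + right, left + new + right


wild_type, altered = expand_snippet("ACGGTT[A/G]CTCTAAGGA")
print(wild_type)
print(altered)
```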

Variant by position / SNV

Choose Variant by position / SNV if you are working with a single base exchange, i.e. exactly one base is altered. If the mutation is named according to the HGVS variation nomenclature, its prefix indicates whether you have to work in coding sequence (c.) or gDNA (g.) mode.

Enter the position of the base exchange. Important: in coding sequence mode, position 1 refers to the A in the ATG start codon; in transcript (cDNA sequence) mode, position 1 refers to the first base in the cDNA transcript, which is mainly part of the 5'UTR. Positions must not exceed the length of the sequence.

Upon changing the content of the input field, the sequence snippet surrounding the indicated (exchanged) base of interest will appear at the bottom of the screen. The wild-type base affected by the base exchange is highlighted in blue. Please check if the highlighted base is concordant with the one you wanted to indicate and also whether the surrounding sequence is correct.

Then fill in the new base. For a base exchange c.1204G>T you would enter a T as new base.

Variant by position / Indel

Choose Variant by position / Indel if you are working with an insertion, a deletion or a combination thereof. You do not need to further specify which kind of alteration you are exactly dealing with, since this is automatically determined by the software (and displayed in the output).

Enter the region of the alteration in the order the input fields are arranged on the screen:

  • Position of last wild-type base before alteration refers to the base directly preceding the variant. For a deletion of three nucleotides, e.g. c.92_94delGAC (or c.92_94del3 or c.92_94del), enter 91.
  • Position of first wild-type base after alteration refers to the base directly following the alteration. For a deletion of three nucleotides, e.g. c.92_94delGAC (or c.92_94del3 or c.92_94del), enter 95.
  • Position 1 refers to the first base of the coding sequence / cDNA / gDNA, depending on the chosen mode. Positions must not exceed the length of the gene's coding sequence / cDNA / gDNA sequence.

Upon changing the contents of both input fields, the sequence snippet surrounding the indicated altered region of interest will appear at the bottom of the screen. The wild-type base(s) affected by the alteration is / are highlighted in blue. Please check if the highlighted base(s) is / are concordant with the one(s) you wanted to indicate and also whether the surrounding sequence is correct.

In the case of an insertion, enter the inserted bases. For example in case of an insertion of a GAGA-sequence between nucleotides 51 and 52 of the coding region (c.51_52insGAGA), enter GAGA. For a deletion of three nucleotides like c.92_94delGAC (or c.92_94del3 or c.92_94del) simply enter nothing.
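The mapping from an HGVS-style name to the three input fields can be sketched in Python as follows. This is an illustration under assumptions: the function name `indel_fields` is hypothetical, and only plain deletions and insertions written as in the examples above are recognised (no combined insertion/deletion).

```python
import re

_INSERTION = re.compile(r"^[cg]\.(\d+)_(\d+)ins([ACGT]+)$")
_DELETION = re.compile(r"^[cg]\.(\d+)(?:_(\d+))?del[ACGT0-9]*$")


def indel_fields(name):
    """Return (last base before, first base after, inserted bases)."""
    match = _INSERTION.match(name)
    if match:
        before, after = int(match.group(1)), int(match.group(2))
        if after != before + 1:
            raise ValueError("insertion must lie between two adjacent bases")
        return before, after, match.group(3)

    match = _DELETION.match(name)
    if match:
        first_deleted = int(match.group(1))
        last_deleted = int(match.group(2)) if match.group(2) else first_deleted
        # The flanking wild-type bases sit directly outside the deleted range.
        return first_deleted - 1, last_deleted + 1, ""

    raise ValueError(f"unsupported notation: {name!r}")


print(indel_fields("c.92_94delGAC"))
print(indel_fields("c.51_52insGAGA"))
```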

VCF file

Input Description
Project name

The project name is displayed on the output report and in email notifications.

Email address

Can be provided in order to get notified when your MutationTaster results are ready. These can be browsed for three weeks on our server and will afterwards be deleted.

VCF file

Input files must be in VCF format, and coordinates must refer to GRCh37 (also called hg19). We do not yet offer processing of (merged) VCF files containing variants obtained from sequencing two or more samples. The uploaded VCF file may therefore contain data from only one sample.

Minimum coverage

Positions with very low coverage do not yield reliable data. It is therefore useful to exclude such variants from analysis (if this was not already done during variant calling / pileup). You can skip variants covered below a user-defined threshold by adjusting the number in the corresponding text field. If you do not want to exclude poorly covered variants, enter 0. The default minimum coverage is 10.

Search for homozygous variants

Check Analyse homozygous variants only if you are interested in MutationTaster results for homozygous variants; heterozygous variants will then be ignored. If unchecked, all variants in your VCF will be processed (unless other options are checked).
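The effect of the coverage and zygosity options on a single VCF line can be pictured with the Python sketch below. It rests on assumptions that the documentation does not confirm: coverage is read from the DP entry and zygosity from the GT entry of the single sample column, and the function name `keep_variant` is hypothetical. MutationTaster may derive these values differently.

```python
def keep_variant(vcf_line, min_coverage=10, homozygous_only=False):
    """Decide whether a single-sample VCF data line passes both filters."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("expected FORMAT and one sample column")
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))

    # Missing or non-numeric depth is treated as coverage 0 (assumption).
    depth_text = sample.get("DP", ".")
    depth = int(depth_text) if depth_text.isdigit() else 0
    if depth < min_coverage:
        return False

    if homozygous_only:
        alleles = sample.get("GT", ".").replace("|", "/").split("/")
        # Homozygous for a non-reference allele, e.g. 1/1 or 2|2.
        if len(set(alleles)) != 1 or alleles[0] in ("0", "."):
            return False
    return True
```

Setting `min_coverage=0` mirrors entering 0 in the text field, so no variant is dropped for low coverage.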

Combine neighbouring variants

Sometimes single base exchanges are located very close to each other. If considered separately as single alterations, they might seem harmless, but if they act together, they might be deleterious. For this reason we offer to combine neighbouring variants (only single base exchanges) and treat them as if they were one, but more complex, alteration. Check Combine neighbouring variants if you are interested in this. The analysis of the combined alterations is conducted in addition to the analysis of the single alterations.

Filter against gnomAD, ExAC and 1000G

Here, you may specify filter options to skip analysis of your variants that were also found in gnomAD, ExAC or the 1000 Genomes Project (1000G). By default, variants found 10 or more times (gnomAD or ExAC) or 4 or more times (1000G) in homozygous state are excluded, while all heterozygous variants are included; to keep this behaviour, leave everything as it is. You are free to change the number of cases that have to be present in either source in order to exclude variants from analysis. If you do not want to filter at all, uncheck all boxes.
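The default filter setting amounts to a simple threshold check on homozygous counts. The Python sketch below illustrates this; the dictionary layout and the function name `is_filtered_out` are assumptions for the example. Removing a source from the thresholds corresponds to unchecking its box.

```python
# Default thresholds for homozygous occurrences, as stated in the text.
DEFAULT_HOMOZYGOUS_THRESHOLDS = {"gnomAD": 10, "ExAC": 10, "1000G": 4}


def is_filtered_out(homozygous_counts, thresholds=None):
    """Return True if any source reaches its homozygous-count threshold.

    homozygous_counts maps a source name to the number of homozygous carriers.
    """
    if thresholds is None:
        thresholds = DEFAULT_HOMOZYGOUS_THRESHOLDS
    return any(
        homozygous_counts.get(source, 0) >= limit
        for source, limit in thresholds.items()
    )
```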

Analyse custom regions

If you do not need your complete VCF file to be analysed, you can save time by restricting the analysis to certain regions (for example linkage or homozygous regions). Choose Analyse custom regions (a text field will open) and enter your regions of interest in BED format.

You can also exclude certain regions with the Exclude custom regions option.
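When preparing regions, keep in mind that BED intervals are 0-based and half-open, whereas VCF positions are 1-based. The Python sketch below shows a membership test under that convention. Whether MutationTaster interprets the interval boundaries in exactly this way is an assumption, and the function names are hypothetical.

```python
def _strip_chr(name):
    return name[3:] if name.lower().startswith("chr") else name


def parse_bed(text):
    """Parse BED text into a list of (chromosome, start, end) tuples."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((_strip_chr(chrom), int(start), int(end)))
    return regions


def in_regions(chrom, pos, regions):
    """Check a 1-based position against 0-based, half-open BED intervals."""
    chrom = _strip_chr(chrom)
    return any(c == chrom and start < pos <= end for c, start, end in regions)
```

The same test, negated, describes the Exclude custom regions option.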

Analyse only exons

Some people are interested in variants all over the genome, but mostly in exonic ones. They can use a ready-made set of all suitable Ensembl exons for analysis by ticking the Only exons and ... bp flanking introns option. Since many people are also interested in intronic variants which are however close to exons, you can enter your favorite value between 0 and 99 – this is the number of "flanking" bases adjacent to intron/exon borders which are additionally analysed.

Analyze only on chromosome

If you are interested in all variants on a certain chromosome, choose Only variants on chromosome and enter your favorite chromosome.

Output

The different elements of the output are named and described below. The first table applies to the analysis of a chromosomal position or specific transcript. The second table and subsequent sections describe the output from the analysis of a VCF file.

Chromosomal position/specific transcript

Output Description
Prediction MutationTaster predicts a variant as deleterious or benign. For more details about the classification process, please read the section about our Random Forest classifier.
Summary List of the most prominent features of the analysed alteration (e.g. 'at intron-exon boundary', 'spans start ATG', 'homozygous in 1000G' etc.)
Alteration (phys. location) The alteration on "physical" i.e. chromosomal level (e.g. chr7:91623937_91623938insGGCAAT).
HGNC symbol The official HGNC symbol.
Ensembl transcript ID Ensembl [1] transcript ID, starting with ENST.
UniProt peptide (SwissProt ID) UniProt KB / SwissProt [2] accession ID. Unfortunately, this does not always correctly correspond to the selected product of the transcript.
Alteration type Is either a base exchange, a combination of insertion and deletion, an insertion or a deletion.
Alteration region Is either 5'UTR (untranslated region), CDS (coding sequence), 3'UTR or intron.
DNA changes Alteration on nucleotide level. gDNA level (g.) is displayed always, cDNA level (cDNA.) for alterations located in exons, CDS level (c.) only for alterations residing in an exon in the coding sequence.
AA changes Any amino acid changes are shown here, displaying the original versus the new amino acid as well as the position of the substitution and a score for it. This score is taken from an amino acid substitution matrix (Grantham Matrix [3]) which takes into account the physico-chemical characteristics of amino acids and scores substitutions according to the degree of difference between the original and the new amino acid. Scores may range from 0.0 to 215. Since the Grantham matrix does not provide values for an amino acid insertion/deletion, no score is given in such cases. The score is only displayed for information reasons and does not influence the MutationTaster prediction as generated by our Random Forest classifier. An asterisk (*) stands for a stop codon, a minus (-) means that in the original AA sequence, there was no AA at this position. If the initial Methionine codon (startATG) is lost, MutationTaster searches for a potential new, downstream startATG and informs you about AA changes based on the assumed alternative AA sequence.
Position(s) of altered AA Lists the positions of altered AA. For mutations resulting in a frameshift, the position of the first altered AA is displayed along with the information that due to a frameshift, there are further changes downstream.
Frameshift Can be either yes or no.
Known variant

Indicates if this variant has been found in large-scale sequencing projects or is listed as a disease mutation in ClinVar. Our database contains all variants from the 1000 Genomes Project [4] (abbreviated here as 1000G), ExAC [5] and gnomAD [6]. For variants found in these databases, MutationTaster provides a table with the following information: homozygous (-/-), heterozygous, allele carriers. Variants with 40 or more homozygous individuals in gnomAD or ExAC, or 4 or more homozygous individuals in 1000G, are automatically predicted as benign automatic (the Random Forest classifier is run nevertheless and the Tree vote for the prediction is shown).

If a variant is marked as probable-pathogenic or pathogenic in ClinVar, it is automatically predicted to be disease-causing, i.e. deleterious automatic (the Random Forest classifier is run nevertheless and the Tree vote value for the prediction is shown). For known disease mutations, we also display the disorders they cause. MutationTaster provides the rs ID and a link to dbSNP for all variants listed in dbSNP. Please note that dbSNP IDs do not consider the alleles of a variant, only the position. Moreover, we have integrated the public version of the Human Gene Mutation Database (HGMD) [7]. The data includes the positions of the disease mutations and their HGMD ID. The disease alleles are not included so we cannot use HGMD for automatic predictions. Whenever an HGMD public disease mutation is found at the same position as a variant, this will be written in the summary. We also place a direct hyperlink to the mutation in HGMD into the 'Known variant' field, so you can check whether the HGMD mutation has the same allele as your variant (and whether the disease matches). Please note that you must be logged in at the HGMD site to make the hyperlink work - access to the public version is free but requires registration.

PhyloP / phastCons phastCons and phyloP are both methods to determine the grade of conservation of a given nucleotide [8]. MutationTaster uses values which are precomputed and offered by UCSC. phastCons values vary between 0 and 1 and reflect the probability that each nucleotide belongs to a conserved element, based on the multiple alignment of genome sequences of 46 different species (the closer the value is to 1, the more probable the nucleotide is conserved). It considers not just each individual alignment column, but also its flanking columns. By contrast, phyloP (values between -14 and +6) separately measures conservation at individual columns, ignoring the effects of their neighbors. Moreover, phyloP can not only measure conservation (slower evolution than expected under neutral drift) but also acceleration (faster than expected). Sites predicted to be conserved are assigned positive scores, while sites predicted to be fast-evolving are assigned negative scores. For more information about phyloP and phastCons, please see the cited paper or the description on the UCSC website.
Splice sites MutationTaster uses a custom-built module based on MaxEntScan [9] to analyse possible changes in splice sites. Splice site analysis is turned off by default for mitochondrial genes.
ExAC pLi scores Indicate the tolerance of a gene against loss-of-function variants.
Kozak consensus sequence altered The Kozak consensus sequence (gccRccAUGG; R = purine) starts upstream of the start codon (AUG) and plays a major role in the initiation of translation. The purine (R) at position -3 as well as the G in position +4 are highly conserved. The program checks whether for a given alteration a previously strong consensus sequence has been weakened.
Conservation on AA level For conservation analysis, amino acid or nucleotide sequence homologues of ten other species (chimp, rhesus macaque, mouse, cat, chicken, clawed frog, pufferfish, zebrafish, fruitfly, and worm) are aligned with the corresponding human sequence of the gene in question. Sequences are aligned with blastp [10], which is installed as a stand-alone executable on our server, and analysed by MutationTaster.
The status of evolutionary conservation is classified as all identical (i.e. the same amino acid in the human and the homologous amino acid sequence), (partly) conserved (i.e. similar amino acids in the human and the homologous amino acid sequence) or not conserved (i.e. different amino acids in the human and the homologous amino acid sequence). The status for local nucleotide sequence alignments is either conserved or not conserved. Additionally, MutationTaster states when no homologous gene is known or no alignment could be made. Alignments are shown as snippets for each species, including the position of the analysed residue, the alignment and the status. We deliberately restrict conservation analysis to ten animal species, although sequence data for far more species is available. The inclusion of further species did not have considerable influence on prediction accuracy, but each additional alignment significantly decreased the speed of MutationTaster.
Protein features The program checks whether any protein features are directly or indirectly affected by the alteration. Our database stores all human SwissProt protein features. Some features will not have an influence on the prediction; they are only displayed for information and should not have an impact on the disease-causing potential of the alteration (e.g. CONFLICT or MUTAGEN).
Lost means that the AA exchange invoked by the alteration in question is located within the protein feature. A protein feature might get lost if a whole exon is skipped due to splice site changes, or if a protein is shortened because of a premature termination codon - in those cases, protein features are indirectly affected.
Length of protein MutationTaster checks if the resulting protein will be elongated (prolonged), truncated, or whether nonsense-mediated mRNA decay (NMD) is likely to occur. MutationTaster determines the NMD border as last intron/exon junction minus 50 bp and analyses if a given premature termination codon occurs 5' to this border, thus leading to NMD. An elongated protein is referred to as prolonged, i.e. the original termination codon is destroyed and the translation stops later than normal. A truncated protein is referred to as either slightly truncated (if less than 10% of the wild-type protein length is missing) or strongly truncated (if more than 10% of the original protein length is missing). In the two latter cases, the additional information 'might cause NMD' is given, because the '-55 boundary rule' is not fulfilled, but it cannot be ruled out that NMD occurs nevertheless. If MutationTaster concludes that an alteration causes NMD, this alteration is automatically regarded as a deleterious mutation. The classifier is run nevertheless and the Tree vote value for the prediction is shown.
AA sequence altered Can be either yes (AA exchange) or no (no AA exchange)
Position(s) of altered AA If the alteration in question is located in the CDS, the position on amino acid level is shown here. If the alteration spans two or more amino acids, these are all displayed and separated by a comma.
Position of stopcodon in wt / mu CDS Position of the last base of the stop codon (this can either be TGA, TAA or TAG), position 1 refers to the A in the start ATG codon.
Position (AA) of stopcodon in wt / mu AA sequence Position of the stop asterisk (*) in the amino acid sequence, position 1 refers to the first amino acid of the protein.
Poly(A) signal MutationTaster uses a locally installed version of the program polyadq [11] for analysis of polyadenylation signals. More information at http://rulai.cshl.org/tools/polyadq/polyadq_form.html
Conservation on nucleotide level Conservation on nucleotide level is analysed similarly to AA level: Using bl2seq, homologue DNA sequences of different species are compared to the human DNA sequence. Conservation status can either be all identical (same base(s) in human and species sequence), not conserved (different base(s) in human and species sequence) or no alignment (if no local alignment around the indicated position(s) was found). If no homologue sequences are found, this is indicated by no homologue. Up to now, conservation on nucleotide level is not used for the prediction.
Position of start ATG in wt / mu cDNA Position of the A in the start ATG, position 1 refers to the first base of the cDNA. If the regular start ATG is changed by an alteration, MutationTaster searches for the next most 5'-ATG and assumes this to be the new start ATG for the mutated sequence.
Position of termination codon in wt / mu cDNA Position of the last base pair of the termination codon (this can be either TGA, TAA or TAG), position 1 refers to the first base pair of the cDNA.
Chromosome The chromosome the alteration is located on.
Strand Is either 1 for forward strand or -1 for reverse strand.
Last intron/exon border The last base of the exon before the last exon.
Theoretical NMD border in CDS In order to avoid truncated proteins which might act in a dominant-negative manner, the eukaryotic cell has a surveillance mechanism to ensure that only error-free mRNAs are translated. It was shown that mRNA shorter than a given length is nearly completely degraded. This process is known as nonsense-mediated mRNA decay or NMD. The rule seems to be that a termination codon occurring 50-55 nucleotides upstream of the final intron / exon junction initiates the NMD machinery and the mRNA gets degraded. Therefore, this program determines the NMD border as last intron / exon junction minus 50 bp and analyses if a given premature termination codon occurs 5' to this border thus eventually leading to NMD.
Length of CDS The length of the coding sequence from the A of the initiation codon (ATG) to the last base of the termination codon.
cDNA position Gives the last wild-type base before alteration and first wild-type base after alteration in coding DNA sequence context (positions relative to start of transcribed coding DNA reference sequence) e.g. 1203 / 1205, the altered base is at position 1204.
gDNA position Gives the last wild-type base pair before alteration and first wild-type base pair after alteration in genomic DNA sequence context (positions relative to start of genomic DNA reference sequence) e.g. 53,344 / 53,346, the altered base is at position 53,345.
Chromosomal position Gives the last wild-type base before alteration and first wild-type base after alteration in chromosomal sequence context (position relative to start of chromosomal reference sequence) e.g. 154,372,337 / 154,372,339, the altered base is at position 154,372,338.
gDNA and cDNA sequence snippet The sequence surrounding the alteration (20 bp up- and downstream). The altered bases are highlighted in blue.
Wild-type and mutated AA sequence Complete AA sequences, the asterisk (*) indicates STOP.
Speed This is the time MutationTaster needed for analysis & prediction - your browser might need some extra time to display the results, especially if you include images.
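The protein length and NMD rules described in the table above can be sketched in Python as follows. This is a simplified illustration, not the program's implementation. The function names, the label "unchanged", the use of the last base of the stop codon as its position, and the treatment of exactly 10% missing length as strongly truncated are all assumptions.

```python
def nmd_border(last_junction_cds_pos, offset=50):
    """NMD border: last intron/exon junction minus 50 bp."""
    return last_junction_cds_pos - offset


def protein_length_effect(wt_aa_len, mu_aa_len, stop_cds_pos, last_junction_cds_pos):
    """Classify the effect of an alteration on protein length.

    stop_cds_pos is the CDS position of the (premature) termination codon.
    """
    if mu_aa_len > wt_aa_len:
        return "prolonged"
    if mu_aa_len == wt_aa_len:
        return "unchanged"
    # A premature stop 5' of the border is taken to trigger NMD.
    if stop_cds_pos < nmd_border(last_junction_cds_pos):
        return "NMD"
    missing_fraction = (wt_aa_len - mu_aa_len) / wt_aa_len
    if missing_fraction < 0.10:
        return "slightly truncated (might cause NMD)"
    return "strongly truncated (might cause NMD)"
```

For a hypothetical transcript whose last junction lies at CDS position 300, the border is at position 250, so a stop codon ending at position 153 would be classified as NMD.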

VCF file

MutationTaster is tightly integrated with our disease mutation search engine MutationDistiller. With MutationDistiller, you can easily prioritize potentially disease-causing variants according to the biological role of the affected genes. On the results page, simply Open MutationDistiller to view your variants in MutationDistiller.

Most often, MutationTaster will not analyse each and every line of your VCF file, either because you have set certain filters, or because certain variants were not suitable for analysis with MutationTaster.

Statistics Description
Submitted variants

Number of alterations (lines) in VCF file.

Analysable variants

Number of variants which were suitable for analysis. This can be significantly more than the number of lines in the VCF, because one line in the VCF sometimes contains more than one alternative allele. If you choose to combine neighbouring variants, the number rises further.

Analysed variants

Number of variants which were analysed with MutationTaster. These will normally be significantly more than the analysable variants, because for most variants, more than one (suitable) transcript will be found.

Discarded variants

Variants are ignored for analysis due to presence in gnomAD, 1000 Genomes Project or ExAC (applies only if any of the Filter polymorphisms options were set when the VCF file was uploaded).

Additionally, variants can be excluded from analysis because they are

  • extragenic and/or out of/distant from exon (applies only if option for Only exons is set)
  • out of chromosome (applies only if option for Only chromosome is set)
  • out of region (applies only if option for Analyse custom region is set) or
  • inside of region (applies only if option for Exclude custom region is set)

MutationTaster results are stored in our database and can be accessed online on our server. We will store your results only for three weeks. Afterwards, they will automatically be deleted. You can download your results as a zip-archive. The archive contains one TSV file with one variant per line and the following columns per variant:

Chr, Position, Gene, Prediction, Model, Tree vote, Type, AAE, dbSNP, Ref, Alt, Clinvar, Features, Region, Coverage, Hom / Het

We offer to filter out certain variants (e.g. those that were excluded due to presence in 1000G) and to sort the remaining variants according to user-specified criteria (see Sort). Once downloaded and stored on your own machine, you can still re-sort the TSV file with Microsoft Excel or similar spreadsheet programs.
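If you prefer scripting to a spreadsheet program, the downloaded TSV can also be re-sorted with a few lines of Python. The sketch below assumes that the header row uses the column names Chr and Position exactly as listed above. The three data rows are made up purely to show the call pattern.

```python
import csv
import io

_CHROM_ORDER = {"X": 23, "Y": 24, "MT": 25, "M": 25}


def chromosome_rank(name):
    """Rank chromosomes 1..22 numerically, followed by X, Y and MT."""
    name = name[3:] if name.lower().startswith("chr") else name
    return int(name) if name.isdigit() else _CHROM_ORDER.get(name.upper(), 99)


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def sort_by_location(rows):
    """Sort rows by chromosome, then by ascending position."""
    return sorted(
        rows, key=lambda row: (chromosome_rank(row["Chr"]), int(row["Position"]))
    )


# Made-up rows standing in for a downloaded result file.
example = io.StringIO(
    "Chr\tPosition\tGene\nX\t100\tGENE_C\n2\t500\tGENE_B\n2\t40\tGENE_A\n"
)
rows = sort_by_location(read_tsv(example))
print([row["Gene"] for row in rows])
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")`, where `path` points to the TSV extracted from the zip archive.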

The option to delete your data as soon as your download is completed will soon be added.

Option Description
Sort & group

The results stored in our database can be sorted by different criteria for either displaying and browsing them directly on our server or for exporting them.

Sort by these attributes; in one, two or three levels:

  • Chromosome (chromosome from 1 to Y)
  • Chromosome Reverse (chromosome from Y to 1)
  • Position (ascending)
  • Position Reverse (descending)
  • Gene (genesymbol from A to Z)
  • Gene Reverse (genesymbol from Z to A)
  • Prediction (prediction from benign to deleterious)
  • Prediction Reverse (prediction from deleterious to benign)
  • Model (model used by the classifier, without_aae-simple_aae-complex_aae-5utr-3utr)
  • Model Reverse (model used by the classifier, 3utr-5utr-complex_aae-simple_aae-without_aae)
  • dbSNP (rs-number ascending)
  • dbSNP Reverse (rs-number descending)
Filter

The following options hide certain alterations: all predicted polymorphisms, known polymorphisms (i.e. homozygous > 40 times in gnomAD or ExAC, or > 4 times in 1000G) and prediction problems. The selected options apply both to displaying results in the browser and to downloading them as TSV. In addition, for displaying results only, the number of results per page can be limited, which generally reduces the loading time of each page.

Get the data

The results can either be displayed online in your browser (choose Display) or be downloaded as TSV (choose Download TSV). Filtering and sorting options are applied to both methods.

Please note: The VCF analysis pipeline will process the variants from the submitted VCF-file in all suitable Ensembl transcripts. Some transcripts will not be included in the analysis, e.g. transcripts which either (a) have no or too many corresponding NCBI gene ID(s); or (b) are protein-coding but have no correct start codon (ATG) or stop codon (TGA, TAA, TAG). MutationTaster first tries to use protein-coding transcripts and if there is at least one, it won't search for transcripts of other biotypes. Only if there are no protein-coding transcripts available, it will try to use transcripts of other biotypes (although certain biotypes are straightaway and principally excluded from analysis, e.g. nonsense_mediated_decay,ambiguous_orf,TR_pseudogene etc.).
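The biotype preference described in this note can be pictured with the short Python sketch below. It covers only the biotype step; the checks for NCBI gene IDs and for correct start and stop codons are left out. The dictionary layout, the function name and the exclusion set (which contains only the three biotypes named above, although the real list is longer) are assumptions.

```python
# Only the biotypes named in the text; the actual exclusion list is longer.
EXCLUDED_BIOTYPES = {"nonsense_mediated_decay", "ambiguous_orf", "TR_pseudogene"}


def select_transcripts(transcripts):
    """Prefer protein-coding transcripts; fall back to other allowed biotypes.

    transcripts is a list of dicts with 'id' and 'biotype' keys.
    """
    usable = [t for t in transcripts if t["biotype"] not in EXCLUDED_BIOTYPES]
    coding = [t for t in usable if t["biotype"] == "protein_coding"]
    return coding if coding else usable
```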

Random Forest classifier

Tree vote

MutationTaster uses Random Forest models for predictions. Tree vote indicates how many decision trees of the Random Forest are suggestive of deleteriousness vs. how many are suggestive of a benign alteration. The number before the vertical bar (|) always represents deleterious predictions and the number after the vertical bar benign predictions. E.g., tree vote 82|18 means that 82 decision trees in the Random Forest have indicated deleteriousness and 18 decision trees have indicated a benign alteration.

Please note that the tree vote value should NOT be interpreted as the probability of error. Our results show that wrong predictions are usually not reflected by any particular tree vote configuration but are rather caused by benign or deleterious variants that show characteristics of the other case, e.g. SNPs that are highly conserved and destroy protein features or disease mutations that appear to have no effect on the protein/gene at all.
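When processing results programmatically, the tree vote string can be split into its two counts as in the Python sketch below. The function names are hypothetical. The majority helper merely reports which side has more trees and, in line with the note above, must not be read as an error probability.

```python
def parse_tree_vote(vote):
    """Split a tree vote such as '82|18' into (deleterious, benign) counts."""
    deleterious_text, benign_text = vote.split("|")
    return int(deleterious_text), int(benign_text)


def majority(vote):
    """Report which class received more tree votes."""
    deleterious, benign = parse_tree_vote(vote)
    if deleterious == benign:
        return "tie"
    return "deleterious" if deleterious > benign else "benign"


print(parse_tree_vote("82|18"))
```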

Models

We provide five different models, each aimed at a different type of variant. In the results they are labelled without_aae, simple_aae, complex_aae, 5utr and 3utr.

All models were trained with all available and suitable common polymorphisms and disease mutations. The number of features and the number of trees were separately optimised for each of the five models.

MutationTaster automatically determines the correct model for each variant.

Training

We have attempted to find a reasonable trade-off between predictive performance and speed and therefore limited tree number and tree size within the different Random Forest models. Detailed information about the forests that were trained and tested can be found on the Supplementary data page.

It should be noted that the prediction models used by MutationTaster were explicitly trained for balanced accuracy, i.e. equal predictive performance for benign and deleterious variants. While this increases the number of false-positive predictions, it reduces the risk of missing a true disease mutation compared to predictors trained for specificity.

Special cases

If a variant is a 'common' polymorphism (as confirmed by the existence of at least 40 homozygous individuals in ExAC or gnomAD or 4+ homozygous individuals in the 1000G), it is automatically predicted to be benign. Variants listed as disease mutation in ClinVar or predicted to cause a premature termination codon (leading to nonsense-mediated mRNA decay, NMD) are automatically predicted as deleterious. In both cases, the Random Forest classifier is run nevertheless and the tree vote for the prediction that was automatically made is shown.
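The special cases can be summarised as a small decision function. The Python sketch below is an illustration under assumptions: the function name and parameters are hypothetical, and the order in which the two rules are checked (polymorphism first) is not stated in this documentation and was chosen arbitrarily.

```python
def automatic_prediction(hom_gnomad=0, hom_exac=0, hom_1000g=0,
                         clinvar_disease=False, causes_nmd=False):
    """Return an automatic prediction, or None if the classifier decides."""
    # Common polymorphism: 40+ homozygotes in gnomAD/ExAC or 4+ in 1000G.
    if hom_gnomad >= 40 or hom_exac >= 40 or hom_1000g >= 4:
        return "benign automatic"
    # Known disease mutation or premature termination codon leading to NMD.
    if clinvar_disease or causes_nmd:
        return "deleterious automatic"
    return None
```

A return value of `None` stands for the regular case in which the Random Forest prediction is reported.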

Limitations and future plans

Limitations

Future plans

References

  1. Howe KL, Achuthan P, Allen J, Allen J, Alvarez-Jarreta J, Amode MR, Armean IM, Azov AG, Bennett R, Bhai J, Billis K. Ensembl 2021. Nucleic Acids Research. 2021 Jan 8;49(D1):D884-91. doi: 10.1093/nar/gkaa942.
  2. Bateman A, Martin MJ, Orchard S, Magrane M, Agivetova R, Ahmad S, Alpi E, Bowler-Barnett EH, Britto R, Bursteinas B, Bye-A-Jee H. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Research. 2020 Nov 25. doi: https://doi.org/10.1093/nar/gkaa1100.
  3. Grantham R: Amino acid difference formula to help explain protein evolution. Science 185: 862-864 (1974)
  4. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015 Oct;526(7571):68. doi: https://doi.org/10.1038/nature15393.
  5. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O’Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, Tukiainen T. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016 Aug;536(7616):285-91. doi: https://doi.org/10.1038/nature19057
  6. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP, Gauthier LD. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020 May;581(7809):434-43. doi: https://doi.org/10.1038/s41586-020-2308-7.
  7. Stenson PD, Mort M, Ball EV, Chapman M, Evans K, Azevedo L, Hayden M, Heywood S, Millar DS, Phillips AD, Cooper DN. The Human Gene Mutation Database (HGMD®): optimizing its use in a clinical diagnostic or research setting. Human Genetics. 2020 Jun 28:1-1. doi: https://doi.org/10.1007/s00439-020-02199-3.
  8. Pollard KS, Hubisz MJ, Siepel A: Detection of non-neutral substitution rates on mammalian phylogenies. Genome Res. 2010 Jan;20(1):110-21
  9. Yeo G, Burge CB. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. Journal of computational biology. 2004 Mar 1;11(2-3):377-94.
  10. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10.
  11. Tabaska JE, Zhang MQ: Detection of polyadenylation signals in human DNA sequences. Gene 1999;231: 77-86.

MutationTaster Tutorial

Sanger sequencing

Disease mutation via sequence snippet: SOD1_H47R

Imagine a patient suffering from ALS. You decide to sequence the exons of the SOD1 gene. By looking at the sequence reads, you find an A->G point mutation. The sequence after this exchange is TGTTCATGAGTTTGGAGATAATACAGCAGGCTGT.

Open the web interface, enter the gene symbol (SOD1) and select the correct transcript (ENST00000270142). Then select the radio button for coding sequence positions and enter the snippet in the dbSNP format (i.e. [A/G]TGTTCATGAGTTTGGAGATAATACAGCAGGCTGT) into the snippet field. Now click on submit...
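If you prepare snippets in a script, a small check before submission can catch formatting mistakes. The following Python sketch parses a snippet of the form LEFT[OLD/NEW]RIGHT. The regular expression is an assumption about the accepted syntax (upper-case A, C, G, T only) and the function name is hypothetical; the web interface may accept more or less than this.

```python
import re

# Assumed syntax: optional left flank, [OLD/NEW], optional right flank.
SNIPPET_RE = re.compile(r"^([ACGT]*)\[([ACGT]*)/([ACGT]*)\]([ACGT]*)$")


def parse_snippet(snippet: str) -> dict:
    """Split a snippet into its flanks and alleles and rebuild both sequences."""
    match = SNIPPET_RE.match(snippet.strip().upper())
    if match is None:
        raise ValueError("Snippet not properly formatted")
    left, old, new, right = match.groups()
    if not (left or right):
        raise ValueError("Snippet needs flanking sequence on at least one side")
    return {
        "left": left,
        "old": old,
        "new": new,
        "right": right,
        "wild_type": left + old + right,  # sequence before the alteration
        "mutant": left + new + right,     # sequence after the alteration
    }


sod1 = parse_snippet("[A/G]TGTTCATGAGTTTGGAGATAATACAGCAGGCTGT")
print(sod1["old"], "->", sod1["new"])
```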

This point mutation is known to cause ALS (AMYOTROPHIC LATERAL SCLEROSIS 1, OMIM 147450#0008). The involved amino acid residue is highly conserved and is responsible for the copper-binding activity (displayed as lost protein feature). Also, the phyloP and phastCons values are very high, indicating strong evolutionary conservation.

Please note that this mutation has a dbSNP ID - excluding variants from dbSNP as potential disease candidates is not a good idea!

Polymorphism via gene position: AGRN_g.28670_28671delAG

Imagine a patient with a muscular disease. Among others, you sequence the complete sequence of the agrin (AGRN) gene and find a deletion of 2 bases: AG at position 28670 and 28671. Is this likely to be disease-causing, e.g. by destroying the reading frame? Now imagine that you are too lazy (or too clever) to look up all exons by hand...

Open the web interface, enter the gene symbol (AGRN) and select a transcript (e.g. ENST00000379370). Then select the radio button for gene positions and enter the last reference base upstream of the alteration (28669), the first reference base downstream of the alteration (28672), and the inserted bases (nothing). You will see the gene sequence below the input fields, helping you to check if you 'delete' the correct bases. Now click on submit...
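The two flanking positions follow directly from the deleted range. As a minimal Python sketch (function names are hypothetical, positions are 1-based as in the web form), the following computes them and applies the alteration to a sequence so that you can check the result yourself:

```python
def deletion_flanks(first_deleted: int, last_deleted: int) -> tuple:
    """Return (last base upstream, first base downstream) for a deleted range."""
    if first_deleted > last_deleted:
        raise ValueError("first_deleted must not be larger than last_deleted")
    return first_deleted - 1, last_deleted + 1


def apply_alteration(seq: str, upstream: int, downstream: int, inserted: str = "") -> str:
    """Replace all bases strictly between two 1-based positions by `inserted`."""
    # seq[:upstream] keeps positions 1..upstream,
    # seq[downstream - 1:] keeps positions downstream..end.
    return seq[:upstream] + inserted + seq[downstream - 1:]


print(deletion_flanks(28670, 28671))
```

For g.28670_28671delAG this gives 28669 and 28672, the values entered in the example above.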

This example alteration is a deletion which was also found in the 1000Genomes Project (1000G) in homozygous state. There is no amino acid change. The altered base is hardly conserved, as conveyed by the low phyloP / phastCons values.

NGS of candidate genes

Disease mutation via chromosomal position: CHRND_L63P (chr2:233391374T>C)

Imagine you sequenced the complete exome of a patient suffering from Myasthenic syndrome. You find a mutation in the gene for the delta subunit of the nicotinic acetylcholine receptor on chromosome 2 (chr2:233391374T>C). Imagine that you are too lazy to determine gene-specific positions.

Open the web interface for physical positions, enter the chromosome (2), the physical position (233391374) and reference allele (t) and variant (c). Now click on submit...

Whole exome/genome sequencing

Complete genotypes - a tiny example

Imagine you sequenced a complete genome of a patient with Myasthenic syndrome. Amazingly, you end up with just 5 alterations.

Go to our VCF analysis pipeline, download the sample file and upload it to the pipeline. Feel free to play around with the filters!

On the results page, click on 'display' to get a closer look at the results. Click on GeneDistiller links on the right side to check the likelihood of the genes involved to be disease causing.

MutationTaster Examples

All examples can also be accessed from the footer of the main page.

Deleterious variants

CHRND_L63P

This single base exchange in the CHRND gene is listed in NCBI ClinVar as a known disease-causing variant (dbSNP:rs121909508) for congenital myasthenic syndrome (MYASTHENIC SYNDROME, CONGENITAL, FAST-CHANNEL, OMIM 100720#0013).

It results in an amino acid exchange from leucine to proline. The involved amino acid residue is part of a functional domain (topo domain) and is highly conserved (amino acids identical in all homologs). As a consequence of the mutation, the topo domain is lost. The phyloP and phastCons values for the changed nucleotide are very high, again indicating strong evolutionary conservation.

Show in MutationTaster: Analyse all transcripts, analyse transcript ENST00000258385

CLDN16_G191R

This alteration is a single base exchange that is listed in NCBI ClinVar as a known disease-causing variant (dbSNP:rs104893721) for HOMG3 (HYPOMAGNESEMIA 3, OMIM 603959#0003).

The altered base leads to an amino acid substitution at a highly conserved residue (amino acid is identical in all homologs). Since the amino acid exchange is quite dramatic (a neutral glycine is replaced with a charged arginine), the substitution has a high amino acid exchange score (3.41, extracted from the Grantham matrix).

As reflected by the high phyloP and phastCons scores, the position is also conserved on the DNA level. The predicted splice site changes are only marginal and can therefore be neglected. The classification 'disease causing' is due to the ClinVar entry, but the high probability indicates that MutationTaster would have classified this variant as deleterious anyway.

Show in MutationTaster: Analyse transcript ENST00000264734

SOD1_H47R

This example is a mutation that is listed in NCBI ClinVar as a known disease-causing variant (dbSNP:rs121912443) for ALS (AMYOTROPHIC LATERAL SCLEROSIS 1, OMIM 147450#0008). The involved amino acid residue is highly conserved and lies within the catalytic copper-binding site of the protein, which is displayed as lost protein feature.

The phyloP and phastCons values for this position (4.283 and 0.996) are also very high, indicating strong evolutionary conservation. Again, the prediction is due to the ClinVar entry but has a very high probability anyway.

Show in MutationTaster: Analyse transcript ENST00000270142

Benign variants

AGRN_g.28670_28671delAG

This example is a deletion that was found more than 4 times in the 1000Genomes Project (1000G) in homozygous state. There is no amino acid exchange since the deletion lies within an intronic sequence. As can be seen from the low phyloP / phastCons values, the deleted bases are hardly conserved. The splice site changes should be interpreted with care, especially in this context where the variant is not conserved and has been frequently found homozygous in the 1000G. It can be assumed that the splice site changes are false positive predictions. The classification 'benign' is due to the findings in the 1000G, but the high probability indicates that MutationTaster would have classified this variant as non-disease causing anyway.

Show in MutationTaster: Analyse transcript ENST00000379370

VCF batch queries

Tiny sample file

The VCF file TinyExample.vcf only contains three lines to illustrate the file format:

3	190122694	.	G	A	116	.	.	GT:DP	0/1:154
1	984171	.	CAG	C	116	.	.	GT:DP	0/1:154
21	33036170	.	A	G	116	.	.	GT:DP	0/1:154
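To illustrate how these lines are structured, the following Python sketch reads the first five tab-separated VCF columns (CHROM, POS, ID, REF, ALT) and flags InDels by comparing allele lengths. The class and function names are hypothetical, and multi-allelic ALT fields (several alleles separated by commas) are deliberately not handled.

```python
from typing import NamedTuple


class VcfVariant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str

    @property
    def is_indel(self) -> bool:
        # REF and ALT of different length means bases were inserted or deleted.
        return len(self.ref) != len(self.alt)


def parse_vcf_line(line: str) -> VcfVariant:
    """Parse one VCF data line; only the first five columns are used."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least 5 tab-separated columns")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    return VcfVariant(chrom, int(pos), ref, alt)


TINY_EXAMPLE = (
    "3\t190122694\t.\tG\tA\t116\t.\t.\tGT:DP\t0/1:154\n"
    "1\t984171\t.\tCAG\tC\t116\t.\t.\tGT:DP\t0/1:154\n"
    "21\t33036170\t.\tA\tG\t116\t.\t.\tGT:DP\t0/1:154\n"
)

variants = [
    parse_vcf_line(line)
    for line in TINY_EXAMPLE.splitlines()
    if line and not line.startswith("#")  # skip header lines, if any
]
for v in variants:
    print(v.chrom, v.pos, v.ref, v.alt, "InDel" if v.is_indel else "SNV")
```

Note that the second line follows the VCF convention for InDels: REF starts with the last reference base before the deleted AG.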

Show in MutationTaster

Sample exome (cystic fibrosis)

The zipped file SampleExome_CF.vcf.gz contains a public VCF file with disease mutations in the CFTR gene spiked in. It is good practice to zip or gzip large VCF files before upload to MutationTaster. All variants must be in VCF format and refer to GRCh37 / hg19.
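Compressing a VCF file before upload needs no special tools. As a minimal sketch using only the Python standard library (the function name is hypothetical), the following writes a gzip-compressed copy next to the original file:

```python
import gzip
import shutil
from pathlib import Path


def gzip_vcf(vcf_path) -> Path:
    """Write a gzip-compressed copy of a VCF file and return the new path."""
    src = Path(vcf_path)
    dst = src.with_name(src.name + ".gz")  # e.g. sample.vcf -> sample.vcf.gz
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)      # stream, so large files fit in memory
    return dst
```

The original file is left untouched, so you can still inspect it after the upload.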

Show in MutationTaster

MutationTaster API

Automated analysis of chromosomal positions

If you want to integrate queries for chromosomal positions into your own NGS pipelines, there are two ways to do so, depending on the output you prefer.

Text output

https://genecascade.org/MT2021/MT_API102.cgi?variants=21:33039603A>C
https://genecascade.org/MT2021/MT_API102.cgi?variants=21:33039603A>C,2:233391374T>C

If submitted via GET, the character ">" in the commands above is usually encoded as "%3E". That is,

https://genecascade.org/MT2021/MT_API102.cgi?variants=21:33039603A%3EC
https://genecascade.org/MT2021/MT_API102.cgi?variants=21:33039603A%3EC,2:233391374T%3EC

The value for variants can be one or multiple variants (separated by comma). The output is a table with 14 columns: chr, pos, ref, alt, transcript_stable, NCBI_geneid, prediction, model, tree_vote, note, splicesite, distance_from_splicesite, disease_mutation, polymorphism

In rare cases where the above URLs do not work, it may additionally be necessary to encode the characters ":" as "%3A" and "," as "%2C".
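The encoding rules above can be applied automatically. The following Python sketch builds the query URL from a list of variants with the standard library; the function name and the encode_all switch are hypothetical, while the endpoint and the variants parameter are the ones shown above.

```python
from urllib.parse import quote

API_URL = "https://genecascade.org/MT2021/MT_API102.cgi"


def build_query_url(variants, encode_all: bool = False) -> str:
    """Build a GET URL for one or several variants such as '21:33039603A>C'."""
    joined = ",".join(variants)
    # By default only '>' is percent-encoded (':' and ',' are kept as they are).
    # With encode_all=True, ':' becomes %3A and ',' becomes %2C as well.
    safe = "" if encode_all else ":,"
    return f"{API_URL}?variants={quote(joined, safe=safe)}"


print(build_query_url(["21:33039603A>C", "2:233391374T>C"]))
print(build_query_url(["21:33039603A>C", "2:233391374T>C"], encode_all=True))
```

Trying the default form first and falling back to encode_all=True mirrors the advice given above for the rare cases where the plain URL does not work.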

See output

Perl script

An example of how to access the API via POST and print the results can be found in this Perl script:

QueryMutationTasterAPI.pl

HTML output

https://genecascade.org/MTc2021/ChrPos102.cgi?chromosome=3&position=190122694&ref=G&alt=A

The respective values for chromosome, position, ref and alt have to be set according to the variant(s) in question. The output has the same format as the output generated via the MutationTaster Web interface.
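As a short Python sketch, the four parameters can be assembled into the URL like this (the function name is hypothetical; endpoint and parameter names are taken from the example above):

```python
from urllib.parse import urlencode

HTML_URL = "https://genecascade.org/MTc2021/ChrPos102.cgi"


def build_html_url(chromosome, position, ref: str, alt: str) -> str:
    """Build the URL of the HTML result page for a single variant."""
    params = {
        "chromosome": chromosome,
        "position": position,
        "ref": ref,
        "alt": alt,
    }
    return f"{HTML_URL}?{urlencode(params)}"


print(build_html_url(3, 190122694, "G", "A"))
```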

See output

Automated analysis of VCF files

POST request with curl

On Unix-based systems, you can use curl to post a VCF file to MutationTaster:
curl \
-F "name=Project_name" \
-F "email=your@email.edu" \
-F "filename=@Your_VCF_file.vcf" \
https://www.genecascade.org/QE/MT37_102/MTQE_start.cgi

Replace Project_name, your@email.edu and Your_VCF_file.vcf with your details. The path to the VCF file must be prefixed with "@".

If the request is correct, you will immediately receive an email confirming your submission with a link to monitor the status of the analysis. Otherwise, please check the output of curl for possible errors.
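If curl is not available, the same multipart form can be sent from Python. The sketch below uses only the standard library and mirrors the three form fields of the curl command (name, email, filename); the function names are hypothetical, and how the server reacts to details such as the Content-Type of the file part is an assumption that has not been verified against the pipeline.

```python
import uuid
from urllib.request import Request, urlopen

UPLOAD_URL = "https://www.genecascade.org/QE/MT37_102/MTQE_start.cgi"


def build_upload_request(name: str, email: str, vcf_name: str, vcf_bytes: bytes) -> Request:
    """Build (but do not send) a multipart/form-data POST like the curl call."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields, corresponding to -F "name=..." and -F "email=...".
    for field, value in (("name", name), ("email", email)):
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{field}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # File field, corresponding to -F "filename=@Your_VCF_file.vcf".
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="filename"; filename="{vcf_name}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + vcf_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return Request(UPLOAD_URL, data=b"".join(parts), headers=headers, method="POST")


def upload_vcf(name: str, email: str, vcf_path: str) -> str:
    """Send the file and return the server's response text."""
    with open(vcf_path, "rb") as fh:
        request = build_upload_request(name, email, vcf_path.split("/")[-1], fh.read())
    with urlopen(request, timeout=300) as response:
        return response.read().decode("utf-8", errors="replace")
```

As with curl, check the returned text for error messages if no confirmation email arrives.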

Perl script

If you want to avoid the manual upload of VCF files to MutationTaster, we provide a Perl script which automatically sends a VCF file to the VCF analysis pipeline and afterwards retrieves the results. Please find it here:

sendVCF_MutationTaster.pl

If you encounter any problems regarding the API, please write us an email.

MutationTaster Support

Contact

In case you discover bugs, have suggestions or questions, please write an e-mail to Dominik Seelow (dominik.seelow AT charite.de).

We also appreciate hearing about your general experiences using MutationTaster.

Scientific articles

Steinhaus R, Proft S, Schuelke M, Cooper DN, Schwarz JM, Seelow D. MutationTaster2021. Nucleic Acids Research. 2021 Apr 24.

Schwarz JM, Cooper DN, Schuelke M, Seelow D. MutationTaster2: mutation prediction for the deep-sequencing age. Nature methods. 2014 Apr;11(4):361-2.

Schwarz JM, Rödelsperger C, Schuelke M, Seelow D. MutationTaster evaluates disease-causing potential of sequence alterations. Nature methods. 2010 Aug;7(8):575-6.

Error messages

Message | Explanation
InsDel too long | At present, MutationTaster handles only InsDels up to 12 bases.
Your mutation of interest seems to span an exon/intron boundary. | This kind of mutation can only be analysed in gDNA mode.
No transcripts for this gene found! | You might have misspelled the gene symbol or used a protein name, which is not always the correct gene symbol (e.g. protein p53 is gene TP53). Also, in some (rare) cases an NCBI gene could not be mapped to an Ensembl gene. As some external data is based on NCBI while other data is based on Ensembl, MT needs both to make a prediction. Moreover, we filter out protein-coding transcripts (Ensembl biotype protein_coding) without a correct start codon (ATG) and correct stop codon (TGA, TAA, TAG). This might lead to the phenomenon that MutationTaster complains about "no suitable transcripts" or "no transcripts for this gene found" although Ensembl lists one or several. Transcripts of mitochondrial genes are not tested for integrity due to differences in the mitochondrial genetic code.
No internal Ensembl transcript ID found. / No Ensembl gene ID found for transcript. / No stable ID for this gene. | Our database doesn't know the transcript you specified. This might happen if you refer to a newer or older release than the one we use. The release MT uses is mentioned on the query interface.
Ensembl gene XXX not found in ENSEMBL | Our database doesn't know the gene you specified. This might happen if you refer to a newer or older release than the one we use. The release MT uses is mentioned on the query interface.
No NCBI gene ID found. / No NCBI gene ID found for this transcript. | In some (rare) cases an Ensembl gene could not be mapped to an NCBI gene. As some external data is based on NCBI while other data is based on Ensembl, MT needs both to make a prediction.
Too many NCBI gene IDs found. | In some (rare) cases an Ensembl gene could not be mapped to a single NCBI gene. As some external data is based on NCBI while other data is based on Ensembl, MT needs both to make a prediction.
Only invalid NCBI gene IDs found. | In some (very rare) cases an Ensembl gene could not be mapped to a valid NCBI gene, i.e. the NCBI gene Ensembl refers to is 'discontinued' and was replaced by another gene. As some external data is based on NCBI while other data is based on Ensembl, MT needs both to make a prediction. Please contact us if you encounter such a case.
Gene XXX not found on any chromosome. | The gene under scrutiny has no valid positional data. This should not occur at all. Please contact us if you encounter such a case.
Gene XXX (Entrez gene YYY) and transcript ZZZ do not match! | The transcript you entered is not a product of the gene you entered. Please check your input.
Position is out of gene! | You entered a position that is located outside the gene. This may happen when you mapped a genomic position to a gene-specific position using an old genome build. Or, of course, by typos. Please check your input.
Could not retrieve a sequence or sequence is too short. | MT was not able to get the gene sequence from Ensembl. This might be due to network problems, so you should repeat the analysis after some time. Should this not work, please contact us.
No start ATG exon found. | The transcript is not properly annotated: there is no start position of the coding sequence in the database. Please select another transcript of the same gene.
No stop exon found. | The transcript is not properly annotated: there is no stop position of the coding sequence in the database. Please select another transcript of the same gene.
Chosen transcript ENSTXXX has no correct start ATG annotated. | Protein-coding transcripts (Ensembl biotype protein_coding) are tested for transcript integrity, i.e. for presence of a correct start codon (ATG) and correct stop codon (TGA, TAA, TAG). If one is missing, an error message is shown because analysis in corrupt transcripts might lead to a wrong prediction.
Sequence XXX is not unique in your gene! | Please use a longer snippet.
Sequence was not found in your gene. | Please check your input: is there a typo in your snippet? Or do you use a snippet created from the wrong strand? MT always refers to the strand the gene is located on.
Snippet not properly formatted. | Please check your input: snippets must be specified as ACGTACGT[OLDBASES/NEWBASES]ACGTACGT.

MutationTaster FAQs

General questions

Can I download pre-computed predictions?

Unfortunately not. Unlike SIFT or PolyPhen, which handle only single amino acid substitutions, MutationTaster works on DNA level and allows insertions and deletions. The exome alone comprises about 30 Mb with 3 possible single base exchanges (SBEs) at each site (let alone introns and InDels). These 30 M x 3 SBEs may affect several different transcripts, leading to about 30,000,000 (bases) x 3 (SBEs) x 5 (transcripts) = 450,000,000 values to pre-compute.

We could of course generate such a list, but it would still not include the InDels and most of the introns. What is more important: such a list would take a very long time to generate and might soon become outdated. We rather spend our efforts on improving MutationTaster!

Why doesn't MutationTaster know my valid transcript ID?

We filter out protein-coding transcripts (Ensembl biotype protein_coding) without a correct start codon (ATG) and correct stop codon (TGA, TAA, TAG). This might lead to the phenomenon that MutationTaster complains about "no suitable transcripts" or "no transcripts for this gene found" although there are some listed in Ensembl. We decided to exclude such transcripts from analysis in MutationTaster due to their bad annotation, which might in the end lead to a wrong prediction. Transcripts of mitochondrial genes are not tested for integrity due to differences in the mitochondrial genetic code.
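The integrity test can be pictured as a simple check on the coding sequence. The Python sketch below only illustrates the two criteria named above (start codon and stop codon of the standard genetic code); the function name is hypothetical, and the real filter may apply further checks that are not described here.

```python
STOP_CODONS = {"TGA", "TAA", "TAG"}  # standard (nuclear) genetic code


def has_intact_cds(cds: str) -> bool:
    """True if the coding sequence starts with ATG and ends with a stop codon."""
    cds = cds.upper()
    return len(cds) >= 6 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


print(has_intact_cds("ATGAAATAG"))  # start and stop codon present
print(has_intact_cds("CTGAAATGA"))  # no ATG at the start
```

Mitochondrial transcripts would need a different codon table, which is why they are exempt from the test.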

Does a high Tree vote value indicate a high probability for a correct prediction, then?

Unfortunately not. Our results show that wrong predictions are usually not reflected by low Tree vote values but are rather caused by benign or deleterious alterations that show characteristics of the other case, e.g. SNPs that are highly conserved and destroy protein features or disease mutations that appear to have no effect on the protein/gene at all.

Why don't you exclude known SNPs as possible deleterious mutations?

Because many variants listed in dbSNP have never been observed in all three genotypes in unaffected individuals. Some SNPs even appear to have only one allele. And even if both alleles were observed, there should be a sufficient number of healthy individuals who are homozygous for the minor allele to exclude a damaging effect.

If all three genotypes were observed in the HapMap project, or the alteration was found homozygously in the 1000Genomes Project more than 4 times, it will automatically be regarded as benign.

Why don't you use allele frequencies to exclude variants as possible deleterious mutations?

We decided not to do this because not all loss-of-function variants are rare (e.g. CFTR mutations). Instead, we rely on healthy homozygous individuals. If such individuals are observed, a variant is unlikely to cause a severe early-onset single gene disorder and is hence predicted to be benign.

The prediction for my favourite alteration has changed. Why?

Well, this is a very rare event. However, as the available data such as protein features is increasing, we regularly update our database and re-train the classifier. In some cases, the annotation of a gene improves drastically. This may yield formerly unknown protein features in your gene/protein at your position which can of course influence the prediction of your alteration.

Is there any way to learn how a single alteration is classified?

Well, yes and no. A Random Forest classifier studies the frequencies of single item statuses (such as 'conservation in cattle - highly conserved' or 'existence of a disulfide bond - no') in both groups of the training set ('polymorphisms' / 'disease mutations'). It compares the statuses of these items in your alteration with the known frequencies and then decides which group fits best.

You can of course study the model and hence the frequencies used by the classifier. They can be found in our supplementary data.

What does "InDel alterations are limited to 40 bp" mean?

The website states that with MT "InDel alterations are limited to 40 bp", however, does that mean the ACTUAL insertion or deletion, or the DESCRIPTION of the insertion or deletion? For example, the description of the ALT variant contains 15 bases, whilst the REF variant contains 13 bases, so that the actual INSERTION (AC) here is only 2 bases but it has been rejected for being too long.

When bases get inserted to / deleted from a stretch of similar bases, as in the given example to a stretch of several 'AC', MutationTaster doesn't know at which position exactly the 'AC' was inserted (or deleted), due to the whole stretch of AC. That's why it has to use the whole 13 bases, although the actual insertion is only 2 bases. This is also the reason why such variants are described that way in your VCF file.
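The ambiguity is easy to reproduce. In the Python sketch below, a made-up 13-base reference consisting of one T followed by six AC repeats stands in for the REF allele of the example; inserting AC at any repeat boundary yields the same 15-base sequence, so the exact position of the insertion cannot be determined.

```python
def insert_at(seq: str, offset: int, bases: str) -> str:
    """Insert `bases` after the first `offset` bases of `seq`."""
    return seq[:offset] + bases + seq[offset:]


REF = "T" + "AC" * 6  # hypothetical 13-base REF allele

# Insert 'AC' at every repeat boundary (after 1, 3, 5, ... 13 bases).
candidates = {
    offset: insert_at(REF, offset, "AC")
    for offset in range(1, len(REF) + 1, 2)
}

distinct = set(candidates.values())
print(len(candidates), "insertion positions,", len(distinct), "distinct result(s)")
```

Since all positions give the same ALT sequence, the variant can only be described unambiguously by the whole repeat stretch.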

Why don't you offer a MutationTaster download version for local installation on my own machine?

We are regularly asked for standalone versions of MutationTaster, our conversion tools or the database. We don't offer these because it is not feasible. We would flood the world with lots of different versions of MutationTaster which we could never maintain. Distributing local installations would probably lead to hundreds of support questions, and we (only 2 people) are already busy with those that concern the version we control and know. We are not able to give support concerning installation issues, questions like 'how is the conservation internally stored?', or errors that occur only in versions modified by users.

Moreover, you would need very powerful hardware and a highly optimised server to reach the same speed as the online version. MutationTaster uses a database which is tens of GBs in size, with parts of Ensembl and the 1000 Genomes data in it. Additionally, we use some external tools for which we have signed disclosure agreements and which we are hence not allowed to share with other groups anyway.

If you want to integrate MutationTaster into your own analysis pipeline for Next Generation Sequencing data, we suggest using our VCF analysis pipeline, which can be called via Perl's WWW::Mechanize module and similar approaches.

What does the AA changes score mean and how does it influence the prediction?

The score is taken from the Grantham Matrix for amino acid substitutions and reflects the physicochemical difference between the original and the mutated amino acid. It ranges from 0.0 to 215 but does not provide a value for amino acid insertions/deletions. However, the score is only displayed for informational purposes and does not influence the prediction. Instead, MutationTaster uses the frequency of the respective AA exchange in known disease-causing mutations and polymorphisms for the classification.

Why is the same variant classified as benign when there is an amino acid exchange and as deleterious when there is no amino acid exchange?

MutationTaster uses five different models (without_aae, simple_aae, complex_aae, 3utr, 5utr) for its prediction. Depending on the type of variant, MutationTaster automatically determines the correct model. Each model was trained with a suitable set of known polymorphisms / disease mutations, and the prioritisation of the individual parameters differs among the models. Thus the prediction for a variant might not be the same if two different models are applied (e.g. without_aae model and simple_aae model). In some cases with a 'deleterious' prediction due to DNA-related features such as strong conservation, knowledge of the effect of the amino acid substitution can 'weaken' the prediction, e.g. if the difference between the two amino acids is modest and no protein domains are affected. This is a consequence of the different models: if we used only one, all 'silent' mutations would be considered benign, and we decided to rather risk false positives than to lose any true positives.

VCF files

Why are there so many cases with no prediction (n/a)?

Most of the n/a cases are due to a missing link between an Ensembl transcript and an NCBI gene (error message: no NCBI gene ID found for this transcript). Ensembl has far more genes and transcripts annotated than NCBI. However, we need to link the Ensembl genes to NCBI in order to get the HGNC gene symbol and SwissProt Accession ID. To circumvent this, we plan to fetch the SwissProt ID and gene symbol via Ensembl as well in the future, so that the analysis can be conducted nevertheless if the link to NCBI is missing.

Why are there so many variants outside genes, although I have uploaded a VCF file from Exome Sequencing?

Target enrichment is not 100% perfect, thus it is normal that there are variants outside genes. Moreover, we do not use all available transcripts (see MutationTaster FAQs), because some are not suitable for analysis with MutationTaster. If a variant lies in a gene that only has transcripts not suitable for MutationTaster analysis, it will be counted as outside gene.

MutationTaster Changelog

Welcome to the MutationTaster changelog. This page keeps you up to date about changes of/in MutationTaster.

24 April 2021

New article for MutationTaster2021 has been published:

Steinhaus R, Proft S, Schuelke M, Cooper DN, Schwarz JM, Seelow D. MutationTaster2021. Nucleic Acids Research. 2021 Apr 24.

20 December 2020

MutationTaster2021

Our latest release of MutationTaster includes the following changes:

Previous updates

Visit http://www.mutationtaster.org/info/news.html

MutationTaster Legal

Imprint

Responsible for the purposes of media law for this page

Dominik Seelow
Bioinformatics and Translational Genetics
Berliner Institut für Gesundheitsforschung
Charitéplatz 1
10117 Berlin
email: dominik.seelow (at) charite.de

Data protection

If you use one of our query engines, you may enter an e-mail address. The secret URL pointing at your project will be sent to you directly after the upload and you will receive a second notification when the analysis has finished.
Your e-mail is only visible during the data analysis and the analysis web pages can only be accessed via the secret URL.
Data you uploaded will be automatically deleted within two months unless you delete it yourself or request complete deletion or an extension of that period by e-mail. Please note that your from-address must match the address you specified when uploading the data. Your e-mail address will remain in our database to allow us to determine the number of different users. It will be deleted upon request.

License

MutationTaster 2021 is free and open to all users and there is no login requirement. Please contact us if you want to include pre-computed scores of MutationTaster into your own software.