Discussions regarding the description of sequence variants
This page gives an overview of the discussions raised and suggestions made to describe sequence variations after publication of the latest manuscript on this issue by JT den Dunnen and S Antonarakis (2000, Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Human Mutation 15: 7-12, copy in PDF-format). We invite investigators to send us further remarks on the issues discussed here. Furthermore, we solicit complicated cases not yet covered, with a suggestion regarding how to describe these. We hope these pages will be used as a guide to describe any sequence change, ultimately evolving into a uniformly accepted standard.
For reactions: contact us by E-mail (to: HGVSmn @ JohanDenDunnen.nl) or using the HGVS variant description forum.
As correctly pointed out by Peter Slickers (Clondiag Chip Technologies), providing a database accession number is not sufficient to identify a sequence in the database unambiguously. Several different versions may exist for a given accession. In most cases only the annotation changes while the sequence remains the same, but this is not always the case and one cannot rely on it (compare e.g. NM_000130.1 and NM_000130.2). Therefore one should always use both the accession and the version number to refer to the reference sequence (see Recommendations).
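The accession-plus-version rule can be checked mechanically. The following is a minimal sketch; the regular expression is a simplified assumption about what a versioned accession looks like (letters, optional underscore, digits, a dot, a version number), and the function name is hypothetical.

```python
import re

# Assumed, simplified pattern: accession, a dot, then a numeric version
# (e.g. NM_000130.2 or AC109326.2). Real database formats are more varied.
VERSIONED = re.compile(r"^[A-Z]{1,4}_?\d+\.\d+$")


def has_version(reference):
    """Return True when the reference carries an explicit version suffix."""
    return VERSIONED.match(reference) is not None
```

With this sketch, `has_version("NM_000130.2")` is accepted, while the bare accession `NM_000130` is rejected because it does not say which version of the sequence was used.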
In some cases the description of pathogenic changes in genes started well before there were any mutation nomenclature recommendations (e.g. in thalassemias and cystic fibrosis). When new reports describe variants according to current recommendations instead of using traditional descriptions, experts in the field experience problems "recognizing" these variants. However, nomenclature rules should be universal and thus cannot be tailored to specific situations. The traditional notation only rings a bell for experts in the field; for others it is cryptic.
Although this may be annoying, traditional descriptions should no longer be used. When the recommended descriptions are used, it will only be a matter of time until the experts also become acquainted with them. The recommendation is to list official and traditional names next to each other in separate columns of the variant summary table, like c.88+2T>G and IVS#+2T>G, p.Phe508del and delF508, or c.24dupG and Cd8/9+G.
Several people have requested an extension of the recommendations for numbering nucleotides with a coding DNA reference sequence, so that they include a specific description for untranscribed nucleotides (i.e. 5' of the transcription initiation site (cap-site) or 3' of the polyA-addition site). Thus far, these requests have not been granted. The main reason is that genes often have several transcription initiation sites (promoters/5'-first exons) as well as polyA-addition sites (3'-terminal exons). Furthermore, the transcription initiation site, or cap-site, is often ill-defined (see also Practical problems coding DNA reference sequence). Consequently, the suggested information in the description (indicating that the variant lies in untranscribed sequences) is neither very reliable nor very informative. In addition, it further complicates the already complex description using a coding DNA Reference Sequence.
Recently our knowledge of the genome and its transcription is quickly maturing and transcription initiation and polyA-addition sites have been mapped much more precisely. When, as recommended, a stable LRG-based reference sequence (see Recommendations reference sequence) is used, these uncertainties are less of an issue.
The most mature suggestion is to extend the current recommendations (see Numbering coding DNA reference sequence) with;
In the past, descriptions like c.76A/G and p.36L/I (p.36Lys/Ile) have been used to describe "polymorphic" sequence variants (see Mutation / polymorphism). Note that a description of a variant should be neutral and not include any functional conclusion; consequently, polymorphisms and changes affecting function ("pathogenic") should not be described differently. Note that it will often be very difficult to discriminate between variants affecting function and those that are truly neutral (not affecting function).
Description of so called "silent" changes in the format p.(Leu54Leu) (alternatively p.(L54L)) should not be used; descriptions should be given at DNA level. The description at protein level is not informative and not unequivocal (there are at least five possibilities at DNA level which may underlie p.(Leu54Leu)). A correct description has the format c.162C>G (p.(Leu54=)), with "p.(Leu54=)" indicating that there is no effect on protein level expected.
NOTE: the recommendation for the description of silent protein changes was recently modified (see proposal SVD-WG001 - No change). The recommended format is now c.162C>G p.(Leu54=); the change at DNA level should always be listed.
Initially, the "-"-character (hyphen) was used for two different purposes, i.e. to indicate a range (nucleotides c.12-13delTG) as well as to indicate a negative number (e.g. for intronic sequences like in c.77-2A>G). This description might cause confusion, which should be avoided. For example, when the change is c.12-13del, does this indicate a deletion from coding DNA nucleotide 12 to 13 or a deletion of the intronic nucleotide c.12-13? Since for intronic positions both the "+" and "-" characters are essential, the recommendation is to use the "_"-character (underscore) to indicate a range.
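The effect of reserving the underscore for ranges can be illustrated with a small parser. This is only a sketch under a simplified, assumed grammar (an optional sign, a base position, an optional intronic offset); the function names are hypothetical and 3'-UTR positions are not handled.

```python
import re

# Assumed simplified grammar: an optional sign, a base position, and an
# optional intronic offset such as "+2" or "-13".
POSITION = re.compile(r"^(-?\d+)([+-]\d+)?$")


def parse_position(text):
    """Return (base, offset) for one position, e.g. '77-2' -> (77, -2)."""
    match = POSITION.match(text)
    if match is None:
        raise ValueError(f"not a recognised position: {text!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def parse_location(text):
    """Split on the underscore only; a hyphen never separates two positions."""
    start, separator, end = text.partition("_")
    first = parse_position(start)
    second = parse_position(end) if separator else None
    return first, second
```

Under this reading, `12_13` is a range of two exonic positions, whereas `12-13` can only be the single intronic position 13 nucleotides upstream of position 12.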
As a consequence of the above mentioned change, the ";"-character should not be used to describe changes which affect RNA-processing, i.e. yielding two or more transcripts (den Dunnen & Antonarakis, 2000, Hum.Mut. 15: 7-12). The suggestion is to use the ","-character (comma) instead (see Recommendations). This rule applies to both description at RNA and protein level.
If a deletion is large and the reference sequence is split over several files, list at least once (in order) the respective files containing the overall reference sequence. When describing the change, to prevent confusion, include a reference to the sequence used, e.g. AC109326.2:g.82398_L78833.1:g.80466del. In the "Remarks" column of the summary table the size of the deletion could be mentioned (e.g. 160 kb deletion spanning exons 1-22). Please note that, since the reference sequence is split over several files, this size can not be deduced from the description of the sequence (see also Discussion - Fused genes).
Although duplications can be considered as a special type of insertion, the recommendation is to describe duplications independently from insertions, using the term "dup". This recommendation also applies to a duplicated mono-, di-, tri-, etc. nucleotide stretch. There are several reasons why the recommendation is to describe such changes as a "duplication" (see Triplication, ...).
NOTE: the description "dup" (see Standards) may by definition only be used when the additional copy is directly 3'-flanking of the original copy (tandem duplication). For large duplications (e.g. one or more exons of a gene) there will often be no such experimental proof, the additional copy can be inserted anywhere in the genome. Without experimental evidence, such changes should be described as an insertion.
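The tandem requirement can be made concrete with a short sketch that decides between "dup" and "ins" for a small insertion whose exact location is known. It assumes the insertion is first shifted to the most 3' position possible (the standard rule referred to elsewhere on this page), uses a plain string as a hypothetical reference sequence, and writes the duplicated bases in the older style used in the examples here (e.g. c.24dupG). The function name is an assumption.

```python
def describe_insertion(ref, after, inserted):
    """Describe `inserted` placed after 1-based position `after` of `ref`."""
    n = len(inserted)
    # Shift the insertion as far 3' as the reference sequence allows.
    while after < len(ref) and ref[after] == inserted[0]:
        inserted = inserted[1:] + inserted[0]
        after += 1
    # Tandem copy: the inserted bases equal the bases directly 5' of the site.
    if after >= n and ref[after - n:after] == inserted:
        if n == 1:
            return f"c.{after}dup{inserted}"
        return f"c.{after - n + 1}_{after}dup{inserted}"
    # Otherwise both flanking positions are listed and "ins" is used.
    return f"c.{after}_{after + 1}ins{inserted}"
```

For the toy reference `ATGGCTTGA`, a G added anywhere in the GG run is reported as `c.4dupG`, while an A added after position 4 stays an insertion, `c.4_5insA`.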
Duplications are indicated by the term "dup". The question arose what to do when more copies are involved: should one describe triplications, quadruplications, etc.? There are several possibilities. First, like "dup" for duplications, one could use "tri" for triplications, "qua" for quadruplications, etc. Another possibility is to follow the recommendation for describing sequence repeat variability and to use "3" for a triplication (3 copies), "4" for a quadruplication (4 copies), etc. A variant of this possibility is to use rep3, rep4, etc.
To prevent that more and more specific notations are used, making the overall description of sequence variants increasingly complicated, tri, qua, rep3, rep4, etc. are not recommended.
NOTE: the format "[N]" can only be used when there is experimental evidence that the additional copies (N-1) are in tandem on the same chromosome.
From Pat O'Neill (Burlington, USA):
I especially like the use of "dup" in place of "ins" when the inserted base creates a run of two or more bases. I feel that there should be a parallel term for the loss of a base from a run of two or more bases instead of just "del". This is because of the mechanistic implications of both an ins and a del of a base in a run. Has this been discussed? My only thought for a term in place of "del" is "los" for loss.
Shuji Ogino (Boston, USA) agrees with this suggestion but suggests using the term "dec" for a decrease in length.
Reply (JdD): Basically the "dup" nomenclature was suggested because the description is simpler, shorter and more unequivocal (see Discussion). The potential mechanistic relation is nice but was not decisive. A description should first of all be clear and unequivocal; it need not contain additional information.
The description of insertions has been the subject of some discussion. The first point of discussion was whether both nucleotides (amino acids) flanking the insertion site had to be given. In the past, the description 22insG (or Cys22insGly) was used both to indicate insertion at position 22 and insertion after position 22. This situation becomes even more complex when a "-" character is involved, like in -14insG or 456-13insG. Does the latter mean at or after intronic nucleotide 456-13 and, in addition, is the position after nucleotide 456-13 position 456-12 or 456-14? Consequently, to prevent confusion, both flanking residues have to be listed.
The second point of discussion was which character to use as a separator. The initial suggestion was to use the "^"-character (e.g. p.Q83^C84insQ). However, since a character to indicate a range was already available, it was decided to use this character, i.e. the "_"-character (see above).
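A description that follows this rule names two directly adjacent positions joined by an underscore. The check below is a minimal sketch restricted to positive exonic positions at DNA level; the pattern and the function name are assumptions made for illustration only.

```python
import re

# Assumed simplified pattern: two positive exonic positions joined by "_",
# followed by "ins" and the inserted bases. Intronic and 5'-UTR positions
# are outside the scope of this sketch.
INSERTION = re.compile(r"^c\.(\d+)_(\d+)ins([ACGT]+)$")


def flanks_are_adjacent(description):
    """Return True when both flanking positions are listed and adjacent."""
    match = INSERTION.match(description)
    if match is None:
        return False
    left, right = int(match.group(1)), int(match.group(2))
    return right == left + 1
```

The ambiguous older form `c.22insG` is rejected because only one flanking position is given, while `c.22_23insG` passes.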
The occurrence of a combination of a deletion and insertion, sometimes named "indel", is not rare. Based on existing terminology, a recommendation for their description can be rather straightforward; a combination of a deletion and insertion at the same site is described using the format 112_117delinsTG. On protein level, likewise, as p.Trp33_Lys35delinsArg.
2012-08-31: Based on a new variant reported in the IFITM5 gene (c.-14C>T, generating a new translation initiation codon at position -5), Raymond Dalgleish (Leicester, UK) asked how to describe this variant on protein level.
The recommendation is to describe the generation of new upstream translation initiation codons using the format "p.Met1ext-5", where "-5" is the position of the new translation initiating Methionine.
NOTE: description clarified with the help of Raymond Dalgleish (Leicester, UK).
The recommendation is to designate frame shifting variants by "fs". It is not useful to add much detail in the description of frame shifting variants besides (especially in the case of C-terminal variants) the length of the new, shifted reading frame. Two notations can be used to describe frame shift changes, a short or a long form.
- Short form: "fs" after a description of the first amino acid affected by the change.
- Long form: "fs*#" after a description of the amino acid(s) affected by the change and the change occurring at the site of the frame shift. "*#" indicates at which codon position the new reading frame ends in a stop (*). The position of the stop in the new reading frame is calculated starting at the first amino acid that is changed by the frame shift, and ending at the first stop codon (*#).
NOTE: the shifted reading frame is thus open for '#-1' amino acids.
Please note that the frame shift example given in den Dunnen & Antonarakis, 2000, Hum.Mut. 15: 7-12 contains a mistake; p.R97fs*121 (page 11) should be "p.R97fs*25, indicating a frame-shifting change with Arginine-97 as the first affected amino acid and the new reading frame being open for 24 amino acids".
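The counting rule for "fs*#" can be sketched as follows. The input is assumed to be the one-letter translation of the shifted reading frame, starting at the first changed amino acid and including the stop as "*"; the function names and the placeholder sequence in the usage note are hypothetical.

```python
def stop_position(shifted_frame):
    """Return #: the 1-based position of the first stop in the shifted frame.

    Counting starts at the first amino acid changed by the frame shift.
    """
    index = shifted_frame.find("*")
    if index == -1:
        raise ValueError("no stop codon found in the shifted reading frame")
    return index + 1


def describe_frameshift(first_affected, shifted_frame):
    """Build a description in the style of the example above, e.g. p.R97fs*25."""
    return f"p.{first_affected}fs*{stop_position(shifted_frame)}"
```

A shifted frame of 24 amino acids followed by a stop, for instance `"P" * 24 + "*"`, gives `p.R97fs*25` for first affected residue `R97`: the stop is the 25th codon and the frame is open for 25 - 1 = 24 amino acids.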
As discussed, in some cases it is very difficult to assign a sequence which can be used as a good reference for numbering. When a coding DNA reference sequence is used it should represent the major transcript of the gene. Alternatively spliced exons (5'-first, internal or 3'-terminal) derived from within the gene can then best be numbered as for intronic sequences. Description of variants in transcripts initiating or terminating outside this region is more difficult. The suggestion is to describe these as usual but to precede them with a unique identifier of the alternative transcript and a ":"-character, like c.Dp427c:3G>T. The alternative transcript should be precisely described and refer to a specific database record (Genbank, EMBL, DDBJ), the accession number of which should be provided.
For example, for the DMD-gene, involved in Duchenne Muscular Dystrophy (DMD), the major transcript is that found in muscle, indicated with Dp427m. Other transcripts are initiated from within the Dp427m gene, e.g. that found in Purkinje cells (Dp427p) and in retina (Dp260), but these are all considered "alternative transcripts". Variants in the respective promoter / exon 1 region can thus be described as intronic sequences in relation to the Dp427m coding DNA sequence. However, the brain promoter / exon 1 lies 5' of the Dp427m promoter. Thus, a variant in this region should be described using the format c.Dp427c:3G>T. Note that this transcript encodes a new translation initiation site and that the numbering used starts with nucleotide +1 for the A of the ATG-translation initiation codon of the Dp427c-transcript.
For clarity reasons, e.g. to prevent confusion when in one manuscript
variants in relation to different reference sequences are described, it is
recommended to use unique sequence indicators as part of the
description of each variant. It should be noted however that for every
indicator the respective reference sequence used should always be
mentioned. Unique indicator and sequence description should be separated
by a colon (":") (see
Publications reporting linkage or association studies often use a range of different markers/SNPs. Such publications should contain an unequivocal description of all markers used. An easy way to achieve this is to include in the description a direct, unequivocal reference to the reference sequence used (preferably a GenBank or dbSNP record).
Regarding SNPs and their use in the text of papers, Peter Taschner (LUMC, Leiden, the Netherlands) makes the following remark:
most recommendations for sequence variant nomenclature apply to genotype descriptions in tables. Unfortunately, these are not very useful in the general text of a paper. For instance, the OPRM1:c.118A>G or dbSNP1799971:A>G designation can be used to describe the sequence variant, but in a paper you might like to discuss the phenotypic consequences of different genotypes. In fact the current recommendation is to use OPRM1:c.[118A>G];[=] to describe a heterozygote and [=];[=] and OPRM1:c.[118A>G];[118A>G] for the homozygotes. I would like to suggest to describe the genotypes in the text like;
- OPRM1:c.118AA homozygotes
- OPRM1:c.118GA heterozygotes
- OPRM1:c.118GG homozygotes
The different alleles could then be designated as the OPRM1:c.118A allele and the OPRM1:c.118G allele. In combination with variants of other genes, the genotype descriptions could be OPRM1:c.118AA, GJB2:c.76AC double heterozygotes, etc.
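The two styles discussed in this remark, the bracketed allele notation and the shorter in-text form, can be generated side by side. This is a sketch only: the function name is hypothetical, the in-text form follows the suggestion quoted above rather than an adopted recommendation, and prefixing the gene symbol to the "[=];[=]" genotype is an assumption.

```python
def describe_genotype(gene, position, ref, alt, alt_copies):
    """Return (bracketed description, suggested in-text description)."""
    change = f"{position}{ref}>{alt}"
    if alt_copies == 0:
        # Homozygous for the reference allele.
        return f"{gene}:c.[=];[=]", f"{gene}:c.{position}{ref}{ref}"
    if alt_copies == 1:
        # Heterozygous: one changed allele, one unchanged allele.
        return f"{gene}:c.[{change}];[=]", f"{gene}:c.{position}{alt}{ref}"
    if alt_copies == 2:
        # Homozygous for the changed allele.
        return f"{gene}:c.[{change}];[{change}]", f"{gene}:c.{position}{alt}{alt}"
    raise ValueError("alt_copies must be 0, 1 or 2")
```

For example, `describe_genotype("OPRM1", 118, "A", "G", 1)` yields the pair `OPRM1:c.[118A>G];[=]` and `OPRM1:c.118GA`, matching the heterozygote forms quoted above.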
Haplotypes are a special form of two or more variants in one chromosome (see Recommendations DNA changes). When it is once clearly described (e.g. in the Materials & Methods) what the order of the variants is and which reference sequences were used a rather simple description of a haplotype can be used. Descriptions using "" are of course only used for variants on one chromosome. Examples;
For the description of translocations the format "t(X;4)(p21.2;q34)",
suggested originally by the ISCN
(1985), is already used as a standard.
NOTE: current recommendations in this area are made by the "Standing Committee on Human Cytogenetic Nomenclature" and were published recently as "ISCN 2013".
For a description at the molecular level this notation can be followed, extended with the standard description indicating the exact translocation breakpoint. When due to local similarities the exact breakpoint is uncertain, following standard nomenclature rules, it will be arbitrarily assigned to the most 3' nucleotide. Since the translocation breakpoints can have a complex structure and since a translocation involves two different chromosomal locations, the sequences of the two translocation breakpoints should always be submitted to a sequence database (Genbank, EMBL, DDBJ). The accession numbers of these files should be listed in the report.
Next to the exact location of the translocation breakpoint, its molecular characterisation will yield more details including e.g. deletions/duplications at the junction and the sequence joined, derived from the other chromosome. We believe that a description covering all these details will become too complex. However, when one wants to include these details, the first description should be for the translocated 5' segment of the gene, the second for the translocated 3' segment, separated by a ";"-character. It should also be noted that when a translocation joins genes A and B, the description of the breakpoint in the sequence variation database of gene A is different from that in gene B. The major difference being that the nucleotide numbering is based on that of gene A or gene B respectively.
Due to (large) deletions, translocations or inversions, genetic rearrangements may have one breakpoint far from the gene under study. The breakpoint might lie in 'empty' intergenic sequences or in another gene. Consequently, to describe the breakpoint at a molecular level two Reference Sequences will be required. To describe cases like this, no recommendations have been made yet.
Recommendation: for the breakpoint residing in the gene under study, nucleotide numbering is clear and follows the standard. When the breakpoint lies in another gene, nucleotide numbering for that end should be based on the nucleotide numbering for that gene (accession.version number of the Reference Sequence used should be provided). To indicate that the end lies in another gene, the nucleotide number should be preceded with the gene's official Gene Symbol, like GJB2:c.233. When the breakpoint does not reside in another gene, the accession number of the Reference Sequence will be used instead of the official Gene Symbol, like AC012343.2:g.763 (please note that this is always a genomic Reference Sequence). When the breakpoint ends on the opposite strand (reverse, complementary, non-transcribed or anti-sense strand) of a gene or on the opposite strand of an intergenic sequence, an "o" will precede the official Gene Symbol (like oGJB2:c.233). Please note that the use of a "c" (complementary), "a" (anti-sense) or "r" (reverse) might cause confusion with nucleotides C and A or the "r" indicating description of a change on RNA-level.
Discussions regarding the use of either the one- or three-letter amino acid code to describe variants at protein level are ongoing. Basically, descriptions using the one-letter amino acid code are unequivocal, short and thus preferred. However, since the one-letter amino acid code is not obvious (Ala, Arg, Asn, Asp start with A; Gln, Glu, Gly with G; Leu, Lys with L; Phe, Pro with P; and Thr, Tyr with T), publications often contain mistakes when the one-letter code is used. In addition, the '*' is not only used to indicate a stop codon (translation termination) but also to indicate unknown residues. Consequently, to prevent mistakes, we favour the use of the three-letter amino acid code.
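Because the one-letter code is easy to misread, converting it by lookup rather than by hand avoids the mistakes mentioned above. The sketch below uses the standard amino acid code table; the function name is hypothetical, the input is assumed to be written entirely in the one-letter code, and the "*" for a stop is left unchanged.

```python
import re

ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(description):
    """Expand every one-letter residue in a protein-level description.

    Lower-case keywords such as "del", "ins" and "fs" are left untouched;
    the input must not already contain three-letter codes.
    """
    def expand(match):
        letter = match.group()
        if letter not in ONE_TO_THREE:
            raise ValueError(f"not a standard amino acid letter: {letter!r}")
        return ONE_TO_THREE[letter]

    return re.sub(r"[A-Z]", expand, description)
```

For example, `to_three_letter("p.W33_K35delinsR")` returns `p.Trp33_Lys35delinsArg`, and `p.Q83_C84insQ` becomes `p.Gln83_Cys84insGln`.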
Currently, variants in the translation initiating Methionine (M1) are usually described as a substitution, e.g. p.Met1Val. This is not correct. Either no protein is produced (p.0) or a new translation initiation site up- or downstream is used (e.g. p.Met1ValextMet-12 or p.Met1_Lys45del resp.). Unless experimental proof is available, it is probably best to report the effect on protein level as "p.Met1?" (unknown). When experimental data show that no protein is made, the description "p.0" is recommended (see Examples).
Usually, descriptions at protein level have no experimental proof, i.e. they are predictions only, deduced directly from the DNA sequence. However, when RNA has been analysed, and (unexpected) effects on RNA processing can be excluded, the predicted protein change will usually be correct. Similarly, the variant protein may have been detected using immuno-histochemistry or on Western blot. To indicate whether there is any experimental evidence for a protein description, it is recommended that when neither RNA nor protein has been analysed, the description is given between brackets (e.g. p.(Arg22Ser)).
Question; (Richard Barber) when the nucleotide change is common and well characterised at RNA and protein level, such as CFTR:p.Phe508del, there seems no need to use a description with brackets.
Answer; Agreed. However, please check carefully that such evidence is indeed available and do not fall into the trap of "transitive proof", i.e. reports that only refer to another source for experimental evidence without giving any themselves.
A gene conversion is a nonreciprocal transfer of genetic information between two homologous sequences. As a result of a gene conversion the sequence of (part of) a gene can be copied from a highly similar sequence residing elsewhere in the genome. Usually, the converted segment contains a range of sequence changes, making its description rather complex. In such cases it is recommended to use a specific description using the format "region_changed" con "region of origin". Please note that also here the rule applies to arbitrarily assign the most 3' position possible as the first to have been changed.