Discussions regarding the description of sequence variants


Last modified February 20, 2008

Since references to WWW-sites are not yet acknowledged as citations, please mention den Dunnen JT and Antonarakis SE (2000). Hum.Mutat. 15:7-12 when referring to these pages.




Introduction

This page gives an overview of the discussions raised and suggestions made to describe sequence variations after publication of the latest manuscript on this issue by JT den Dunnen and S Antonarakis (2000, Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Human Mutation 15: 7-12, copy in PDF-format). We invite investigators to send us further remarks on the issues discussed here. Furthermore, we solicit complicated cases not yet covered, with a suggestion regarding how to describe these. We hope these pages will be used as a guide to describe any sequence change, ultimately evolving into a uniformly accepted standard.

For reactions E-mail to: ddunnen@LUMC.nl and Stylianos.Antonarakis@medecine.unige.ch


Discussion / recent modifications


Accession number

As correctly pointed out by Peter Slickers (Clondiag Chip Technologies), a database accession number alone does not identify a sequence in the database unambiguously, because several versions may exist for a given accession. In most cases only the annotation changes while the sequence remains the same, but this cannot be relied upon (compare e.g. NM_000130.1 and NM_000130.2). Therefore one should always give both the accession and the version number when referring to the reference sequence (see Recommendations).
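As an illustration, a submission tool could reject references that lack a version number. The following is a minimal sketch only; the pattern and the function name has_version are assumptions made for this example and are not part of the recommendations.

```python
import re

# Assumed pattern: an accession (letters, digits, underscores), a dot, and a
# numeric version. This is a simplification for illustration only.
ACCESSION_VERSION = re.compile(r"^[A-Z][A-Z0-9_]*\.[0-9]+$")


def has_version(reference: str) -> bool:
    """Return True when the reference carries an explicit version number."""
    return ACCESSION_VERSION.match(reference) is not None


print(has_version("NM_000130.2"))  # accession with version
print(has_version("NM_000130"))    # accession only, hence ambiguous
```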

Traditional descriptions

In some cases the description of pathogenic changes in genes started well before there were any mutation nomenclature recommendations (e.g. in thalassemias and cystic fibrosis). When new reports describe variants according to current recommendations instead of using the traditional descriptions, experts in the field have trouble "recognizing" these variants. However, nomenclature rules should be universal and thus cannot be adapted to specific situations. The traditional notation only rings a bell for experts in the field; for others it is cryptic and confusing.
Although this is inconvenient, traditional descriptions should no longer be used. When the recommended descriptions are used, it will only be a matter of time before the experts become acquainted with them. The recommendation is to list official and traditional names next to each other in separate columns of the variant summary table, like c.88+2T>G and IVS#+2T>G, p.Phe508del and delF508, or c.24dupG and Cd8/9+G.

Polymorphisms

In the past, a specific notation has been used to describe polymorphic sequence variations, i.e. c.76A/G and p.36L/I (p.36Leu/Ile). However, a description of a variant should be neutral and not include any functional conclusion; consequently, polymorphisms and pathogenic changes should not be described differently (see also mutation / polymorphism). Please also note that it will often be very difficult to discriminate between pathogenic and truly neutral (polymorphic) changes.

Descriptions of a range using "_"

Initially, the "-"-character (hyphen) was used for two different purposes, i.e. to indicate a range (nucleotides c.12-13delTG) as well as to indicate a negative number (e.g. for intronic sequences, as in c.77-2A>G). This double use can cause confusion and should be avoided. For example, when the change is c.12-13del, does this indicate a deletion from coding DNA nucleotide 12 to 13, or a deletion of the intronic nucleotide c.12-13? Since for intronic positions both the "+" and "-" characters are essential, the recommendation is to use the "_"-character (underscore) to indicate a range.
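The effect of reserving "_" for ranges can be shown with a small parser. This is a hedged sketch that handles only simple deletions at coding DNA level; the function name parse_deletion and the regular expressions are assumptions made for this example.

```python
import re

# A position is a coding nucleotide number with an optional intronic offset.
POSITION = r"(\d+)([+-]\d+)?"
DELETION = re.compile(rf"^c\.{POSITION}(?:_{POSITION})?del$")


def parse_deletion(description: str) -> list:
    """Return (nucleotide, intronic offset) tuples: one for a single
    position, two for a range written with an underscore."""
    match = DELETION.match(description)
    if match is None:
        raise ValueError(f"not a simple deletion: {description}")
    groups = match.groups()
    positions = []
    for base, offset in (groups[0:2], groups[2:4]):
        if base is not None:
            positions.append((int(base), int(offset) if offset else 0))
    return positions


print(parse_deletion("c.12_13del"))  # range: nucleotides 12 to 13
print(parse_deletion("c.12-13del"))  # single intronic position 12-13
```

With the underscore reserved for ranges, the hyphen can only be read as part of an intronic position, so both descriptions have a single interpretation.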

Two sequence variants in one individual

More transcripts / proteins from one allele

As a consequence of the above mentioned change, the ";"-character should not be used to describe changes which affect RNA-processing, i.e. yielding two or more transcripts (den Dunnen & Antonarakis, 2000, Hum.Mut. 15: 7-12). The suggestion is to use the ","-character (comma) instead (see Recommendations). This rule applies to descriptions at both RNA and protein level.

Large deletions, split reference sequence

If a deletion is large and the reference sequence is split over several files, list at least once (in order) the respective files containing the overall reference sequence. When describing the change, to prevent confusion, include a reference to the sequence used, e.g. AC109326.2:g.82398_L78833.1:g.80466del. In the "Remarks" column of the summary table the size of the deletion could be mentioned (e.g. 160 kb deletion spanning exons 1-22). Please note that, since the reference sequence is split over several files, this size can not be deduced from the description of the sequence. See also Discussion - Fused genes.

Duplication or insertion

Although duplications can be considered a special type of insertion, the recommendation is to describe duplications independently from insertions, using the term "dup". This recommendation also applies to a duplicated mono-, di-, tri-, etc. nucleotide stretch. There are several reasons why the recommendation is to describe such changes as a "duplication".

  • the description is simpler, shorter and more unequivocal
  • it is clearer and prevents confusion regarding the exact position introduced when an insertion is incorrectly reported like "22insG" (see Insertions)
  • it prevents discussions regarding the position of the insertion; in the case of a duplication including the intron/exon border (e.g. c.123-8_143dup), is the "insertion" in the intron or the exon?
  • insertion more or less includes "coming from elsewhere". Mechanistically, a duplication is more likely to be caused by DNA polymerase slippage, duplicating a local sequence.
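The choice between "dup" and "ins" can be sketched as follows. This is a simplified illustration, not a complete implementation: the function name describe_insertion is an assumption, positions are plain 1-based coding nucleotide numbers, only a copy of the bases directly preceding the insertion site is recognised as a duplication, and no shifting to the most 3' position is attempted.

```python
def describe_insertion(reference: str, after: int, inserted: str) -> str:
    """Describe bases placed after 1-based position `after` of `reference`.

    Uses "dup" when the inserted bases copy the bases immediately preceding
    the insertion site; otherwise lists both flanking positions with "ins".
    """
    length = len(inserted)
    start = after - length + 1
    if start >= 1 and reference[start - 1:after] == inserted:
        if length == 1:
            return f"c.{after}dup{inserted}"
        return f"c.{start}_{after}dup{inserted}"
    return f"c.{after}_{after + 1}ins{inserted}"


reference = "ATGGCTTAG"  # made-up sequence, for illustration only
print(describe_insertion(reference, 3, "G"))   # copy of nucleotide 3
print(describe_insertion(reference, 6, "CT"))  # copy of nucleotides 5 to 6
print(describe_insertion(reference, 3, "A"))   # not a local copy
```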

Examples

Triplication, quadruplication, ...

Duplications are indicated by the term "dup". The question arose what to do with triplications, quadruplications, etc. There are several possibilities. First, like "dup" for duplications one could use "tri" for triplications, "qua" for quadruplications, etc. Another possibility is to use the recommendation to describe alleles of variable short sequence repeats and to use [3] for triplication, [4] for quadruplication, etc. A variant of this possibility is to use rep3, rep4, etc.
To avoid a proliferation of specific notations, which would make the overall description of DNA variants increasingly complicated, the recommendation builds on existing recommendations. Thus, triplications, quadruplications, etc. are described like the alleles of variable short sequence repeats, using [3], [4], etc.

Examples

When these amplifications involve imperfect copies of the unit sequence, descriptions quickly become too complex to be meaningful. In such cases the recommendation is to submit the sequence that has been determined to GenBank and to use the accession.version number in the description (see Recommendations).

Loss from a run of nucleotides

From Pat O'Neill (Burlington, USA):
I especially like the use of "dup" in place of "ins" when the inserted base creates a run of 2 or more bases. I feel that there should be a parallel term for the loss of a base from a run of 2 or more bases instead of just "del". This is because of the mechanistic implications of both an ins and a del of a base in a run. Has this been discussed? My only thought for a term in place of "del" is "los" for loss.
Shuji Ogino (Boston, USA) agrees with this suggestion but suggests using the term "dec" for a decrease in length.

Reply (JdD): Basically the "dup" nomenclature was suggested because the description is simpler, shorter and more unequivocal (see Discussion). The potential mechanistic relation is nice but was not decisive. A description should first of all be clear and unequivocal, rather than carry additional information.
Agreed / not agreed - please tell us your opinion (mail to: ddunnen @ lumc.nl and Stylianos.Antonarakis @ medecine.unige.ch)

Insertions

The description of insertions has been the subject of some discussion. The first point was whether both nucleotides (amino acids) flanking the insertion site have to be given. In the past, the description 22insG (or Cys22insGly) was used to indicate both insertion at position 22 and insertion after position 22. The situation becomes even more complex when a "-" character is involved, as in -14insG or 456-13insG. Does the latter mean at or after intronic nucleotide 456-13 and, in addition, is the position after nucleotide 456-13 position 456-12 or 456-14? Consequently, to prevent confusion, both flanking residues have to be listed.
The second point of discussion was which character to use as a separator. The initial suggestion was to use the "^"-character (e.g. p.Q83^C84insQ). However, since a character to indicate a range was already available, it was decided to use this character, i.e. the "_"-character (see above).

Insertion-deletions  (indels)

A combination of a deletion and an insertion at the same site, sometimes named "indel", is not rare. Based on existing terminology, the description is rather straightforward: the change is described in the format c.112_117delinsTG (alternatively c.112_117delAGGTCAinsTG). At the protein level the description is, likewise, p.Trp33_Lys35delinsArg.

Frame shift variants

NOTE: description clarified with the help of Raymond Dalgleish (Leicester, UK).

The recommendation is to designate frame shifting variants by "fs". It is not useful to add much detail in the description of frame shifting variants besides (especially in the case of C-terminal variants) the length of the new, shifted reading frame. Two notations can be used to describe frame shift changes, a short or a long form.

Short description
"fs" after a description of the first amino acid affected by the change.

Long description
"fsX#" after a description of the amino acid(s) affected by the change and the change occurring at the site of the frame shift. "X#" indicates at which codon position the new reading frame ends in a stop (X). The position of the stop in the new reading frame is calculated starting at the first amino acid that is changed by the frame shift, and ending at the first stop codon (X#).
NOTE: the shifted reading frame is thus open for '#-1' amino acids.

Please note that the frame shift example given in den Dunnen & Antonarakis, 2000, Hum.Mut. 15: 7-12 contains a mistake; it should read p.R97PfsX23.
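The long form can be unpacked with a few lines of code. The following is a minimal sketch that assumes the one-letter amino acid code and the "fsX#" format described above; the function name parse_frameshift and the returned field names are hypothetical.

```python
import re

# Long form in one-letter code: original residue, position, new residue, "fsX#".
FRAMESHIFT = re.compile(r"^p\.([A-Z])(\d+)([A-Z])fsX(\d+)$")


def parse_frameshift(description: str) -> dict:
    """Split a long-form frame shift description into its parts."""
    match = FRAMESHIFT.match(description)
    if match is None:
        raise ValueError(f"not a long-form frame shift: {description}")
    original, position, new, stop = match.groups()
    return {
        "original": original,
        "first_changed_position": int(position),
        "new": new,
        "stop_count": int(stop),
        # The shifted reading frame is open for '#-1' amino acids.
        "open_frame_length": int(stop) - 1,
    }


print(parse_frameshift("p.R97PfsX23"))
```

For p.R97PfsX23 the first changed amino acid is Arg97 (changed to Pro), and the shifted reading frame is open for 22 amino acids before the stop.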


New recommendations


Alternative transcripts  (Reference Sequence)

As discussed, in some cases it is very difficult to assign a sequence which can be used as a good reference for numbering. When a coding DNA reference sequence is used, it should represent the major transcript of the gene. Alternatively spliced exons (5'-first, internal or 3'-terminal) derived from within the gene can then best be numbered as for intronic sequences. Description of variants in transcripts initiating or terminating outside this region is more difficult. The suggestion is to describe these as usual but to precede them with a unique identifier of the alternative transcript and a ":"-character, like c.Dp427c:3G>T. The alternative transcript should be precisely described and refer to a specific database record (GenBank, EMBL, DDBJ), the accession number of which should be provided.

For example, for the DMD-gene, involved in Duchenne Muscular Dystrophy (DMD), the major transcript is that found in muscle, indicated with Dp427m. Other transcripts are initiated from within the Dp427m gene, e.g. that found in Purkinje cells (Dp427p) and in retina (Dp260), but these are all considered "alternative transcripts". Variants in the respective promoter / exon 1 region can thus be described as intronic sequences in relation to the Dp427m coding DNA sequence. However, the brain promoter / exon 1 lies 5' of the Dp427m promoter. Thus, a variant in this region should be described using the format c.Dp427c:3G>T. Note that this transcript encodes a new translation initiation site and that the numbering used starts with nucleotide +1 for the A of the ATG-translation initiation codon of the Dp427c-transcript.

Accepted sequence indicators

For clarity reasons, e.g. to prevent confusion when in one manuscript variants in relation to different reference sequences are described, it is recommended to use unique sequence indicators as part of the description of each variant. It should be noted however that for every indicator the respective reference sequence used should always be mentioned. Unique indicator and sequence description should be separated by a colon (":") (see Recommendations).
Examples;
 

Single Nucleotide Polymorphisms (SNP's)

Publications reporting linkage or association studies often use a range of different markers/SNP's. Such publications should contain an unequivocal description of all markers used. An easy way to achieve this is to include in the description a direct, unequivocal reference to the reference sequence used (preferably a GenBank or dbSNP record).
Examples;
 

Homo/heterozygotes

Regarding SNP's and their use in the text of papers, Peter Taschner (LUMC, Leiden, The Netherlands) makes the following remark:
most recommendations for sequence variant nomenclature apply to genotype descriptions in tables. Unfortunately, these are not very useful in the general text of a paper. For instance, the OPRM1:c.118A>G or dbSNP1799971:A>G designation can be used to describe the sequence variant, but in a paper you might like to discuss the phenotypic consequences of different genotypes. In fact the current recommendation is to use OPRM1:c.[118A>G]+[=] to describe a heterozygote and [=]+[=] and OPRM1:c.[118A>G]+[118A>G] for the homozygotes. I would like to suggest to describe the genotypes in the text like;

  • OPRM1:c.118AA homozygotes
  • OPRM1:c.118GA heterozygotes
  • OPRM1:c.118GG homozygotes

The different alleles could then be designated as the OPRM1:c.118A allele and the OPRM1:c.118G allele. In combination with variants of other genes, the genotype descriptions could be OPRM1:c.118AA, GJB2:c.76AC double heterozygotes, etc.
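The relation between the allele-based description used in tables and the shorter text form suggested above can be sketched in code. Both function names are hypothetical, and the sketch covers only a single substitution in a diploid genotype.

```python
def table_genotype(gene: str, position: int, ref: str, alt: str, alt_copies: int) -> str:
    """Allele-based description, e.g. OPRM1:c.[118A>G]+[=] for a heterozygote."""
    if alt_copies not in (0, 1, 2):
        raise ValueError("alt_copies must be 0, 1 or 2")
    change = f"[{position}{ref}>{alt}]"
    alleles = [change] * alt_copies + ["[=]"] * (2 - alt_copies)
    return f"{gene}:c." + "+".join(alleles)


def text_genotype(gene: str, position: int, ref: str, alt: str, alt_copies: int) -> str:
    """Short form suggested for running text, e.g. OPRM1:c.118GA heterozygotes."""
    if alt_copies not in (0, 1, 2):
        raise ValueError("alt_copies must be 0, 1 or 2")
    bases = alt * alt_copies + ref * (2 - alt_copies)
    kind = "heterozygotes" if alt_copies == 1 else "homozygotes"
    return f"{gene}:c.{position}{bases} {kind}"


for copies in (0, 1, 2):
    print(table_genotype("OPRM1", 118, "A", "G", copies),
          "|", text_genotype("OPRM1", 118, "A", "G", copies))
```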

Haplotypes

Haplotypes are a special form of two or more variants in one allele (see Recommendations DNA changes). Once it has been clearly described (e.g. in the Materials & Methods) what the order of the variants is and which reference sequences were used, a rather simple description of a haplotype can be used. Examples;

Translocations

For the description of translocations the format "t(X;4)(p21.2;q34)", suggested originally by the ISCN (1985), is already used as a standard.
NOTE: current recommendations are made by the "Standing Committee on Human Cytogenetic Nomenclature (2001-2006)".
For a description at the molecular level this notation can be followed, extended with the standard description indicating the exact translocation breakpoint. When, due to local similarities, the exact breakpoint is uncertain, it will, following standard nomenclature rules, be arbitrarily assigned to the most 3' nucleotide. Since the translocation breakpoints can have a complex structure and since a translocation involves two different chromosomal locations, the sequences of the two translocation breakpoints should always be submitted to a sequence database (GenBank, EMBL, DDBJ). The accession numbers of these files should be listed in the report.
Next to the exact location of the translocation breakpoint, its molecular characterisation will yield more details, including e.g. deletions/duplications at the junction and the sequence joined, derived from the other chromosome. We believe that a description covering all these details will become too complex. However, when one wants to include these details, the first description should be for the translocated 5' segment of the gene and the second for the translocated 3' segment, separated by a ";"-character. It should also be noted that when a translocation joins genes A and B, the description of the breakpoint in the sequence variation database of gene A differs from that in gene B, the major difference being that the nucleotide numbering is based on that of gene A or gene B respectively.

Fused genes

Due to (large) deletions, translocations or inversions, genetic rearrangements may have one breakpoint far from the gene under study. The breakpoint might lie in 'empty' intergenic sequences or in another gene. Consequently, to describe the breakpoint at a molecular level two Reference Sequences will be required. To describe cases like this, no recommendations have been made yet.
Recommendation: for the breakpoint residing in the gene under study, nucleotide numbering is clear and follows the standard. When the breakpoint lies in another gene, nucleotide numbering for that end should be based on the nucleotide numbering for that gene (the accession.version number of the Reference Sequence used should be provided). To indicate that the end lies in another gene, the nucleotide number should be preceded by the gene's official Gene Symbol, like GJB2:c.233. When the breakpoint does not reside in another gene, the accession number of the Reference Sequence will be used instead of the official Gene Symbol, like AC012343.2:g.763 (please note that this is always a genomic Reference Sequence). When the breakpoint ends on the opposite strand (reverse, complementary, non-transcribed or anti-sense strand) of a gene or on the opposite strand of an intergenic sequence, an "o" will precede the official Gene Symbol (like oGJB2:c.233). Please note that the use of a "c" (complementary), "a" (anti-sense) or "r" (reverse) might cause confusion with the nucleotides C and A or with the "r" indicating the description of a change at RNA-level.

One or three letter amino acid code

Discussions regarding the use of either the one- or three-letter amino acid code to describe variants at protein level are ongoing. Basically, descriptions using the one-letter amino acid code are unequivocal, short and thus preferred. However, since the one-letter amino acid code is not obvious (Ala, Arg, Asn, Asp start with A; Gln, Glu, Gly with G; Leu, Lys with L; Phe, Pro with P; and Thr, Tyr with T), publications often contain mistakes when the one-letter code is used. In addition, the 'X' is not only used to indicate a stop codon (translation termination) but also to indicate unknown residues. Consequently, to prevent mistakes, we favour the use of the three-letter amino acid code.
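Converting a one-letter description to the three-letter code is mechanical, which is one way to avoid such mistakes. The sketch below uses the standard amino acid codes; the function name to_three_letter is hypothetical, the input is assumed to be a protein-level description written entirely in the one-letter code, and 'X' is left untouched because of its double meaning.

```python
import re

# Standard one-letter to three-letter amino acid codes.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(description: str) -> str:
    """Rewrite a one-letter protein description using three-letter codes."""
    if not description.startswith("p."):
        raise ValueError("expected a protein-level description starting with 'p.'")
    body = description[2:]
    # Upper-case letters are residues; lower-case terms such as "del" are kept.
    converted = re.sub(
        r"[A-Z]", lambda m: ONE_TO_THREE.get(m.group(), m.group()), body
    )
    return "p." + converted


print(to_three_letter("p.F508del"))
print(to_three_letter("p.R97PfsX23"))
```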

Translation initiation codon changes

Currently, variants in the translation initiating Methionine (M1) are usually described as a substitution, e.g. M1V. This is not correct. Either no protein is produced or the translation initiation site moves up- or downstream. Unless experimental proof is available, it is probably best to report the effect on protein level as "p.Met1?" (unknown). When experimental data show that no protein is made, the description "p.0" is recommended.

Gene conversions

A gene conversion is a nonreciprocal transfer of genetic information between two homologous sequences. As a result of a gene conversion the sequence of (part of) a gene can be copied from a highly similar sequence residing elsewhere in the genome. Usually, the converted segment contains a range of sequence changes, making its description rather complex. In such cases it is recommended to use a specific description using the format "region_changed" con "region of origin". Please note that here, too, the rule applies to arbitrarily assign the most 3' position possible as the first to have been changed. Examples;



Copyright © HGVS 2007 All Rights Reserved
Website Created by Rania Horaitis, Nomenclature by J.T. Den Dunnen - Disclaimer