Discussions regarding the description of sequence variants
Since references to WWW-sites are not yet acknowledged as citations, please mention den Dunnen JT and Antonarakis SE (2000). Hum.Mutat. 15:7-12 when referring to these pages.
This page gives an overview of the discussions raised and suggestions made to describe sequence variations after publication of the latest manuscript on this issue by JT den Dunnen and S Antonarakis (2000, Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Human Mutation 15: 7-12, copy in PDF-format). We invite investigators to send us further remarks on the issues discussed here. Furthermore, we solicit complicated cases not yet covered, with a suggestion regarding how to describe these. We hope these pages will be used as a guide to describe any sequence change, ultimately evolving into a uniformly accepted standard.
Please send reactions by e-mail to: ddunnen@LUMC.nl and Stylianos.Antonarakis@medecine.unige.ch
As correctly pointed out by Peter Slickers (Clondiag Chip Technologies), providing a database accession number is not sufficient to identify a sequence in the database unambiguously. Several different versions may exist for a given accession. In most cases only the annotation changes while the sequence remains the same, but this is not always so and cannot be relied upon (compare e.g. NM_000130.1 and NM_000130.2). Therefore one should always use both the accession and the version number to refer to the reference sequence (see Recommendations).
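As a rough illustration of this point, a reference can be checked for an explicit version suffix before it is accepted. The pattern and the function name `has_version` are assumptions made for this sketch; the pattern is deliberately simplified and does not cover every accession format in use.

```python
import re

# Simplified, assumed pattern: letters, an optional underscore, digits,
# then a dot and the version number (e.g. NM_000130.2 or AC109326.2).
VERSIONED_ACCESSION = re.compile(r"^[A-Z]{1,4}_?\d+\.\d+$")


def has_version(reference: str) -> bool:
    """Return True when the reference carries an explicit version number."""
    return VERSIONED_ACCESSION.match(reference) is not None


print(has_version("NM_000130.2"))  # True
print(has_version("NM_000130"))    # False
```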
In some cases the description of pathogenic changes in genes started well before there
were any mutation nomenclature recommendations (e.g. in thalassemias and cystic fibrosis).
When new reports describe variants according to current recommendations instead of using traditional
descriptions, experts in the field have trouble "recognizing"
these variants. However, nomenclature rules should be universal and thus cannot be tailored
to specific situations. The traditional notation only rings a bell for experts
in the field; for others it is cryptic and confusing.
Annoying as this may be, traditional descriptions should no longer be used. When the
recommended descriptions are used, it will only be a matter of time until the experts, too, become
familiar with them. The recommendation is to list official and traditional names next to
each other in separate columns of the variant summary table, like c.88+2T>G and
IVS#+2T>G, p.Phe508del and delF508, or c.24dupG and Cd8/9+G.
In the past, a specific notation has been used to describe polymorphic sequence variations, i.e. c.76A/G and p.36L/I (p.36Lys/Ile). However, a description of a variant should be neutral and not include any functional conclusion; consequently, polymorphisms and pathogenic changes should not be described differently (see also mutation / polymorphism). Please also note that it will often be very difficult to discriminate between pathogenic and really neutral (polymorphic) changes.
Initially, the "-"-character (hyphen) was used for two different purposes, i.e. to indicate a range (nucleotides c.12-13delTG) as well as to indicate a negative number (e.g. for intronic sequences like in c.77-2A>G). This dual use can cause confusion and should be avoided. For example, when the change is c.12-13del, does this indicate a deletion from coding DNA nucleotide 12 to 13 or a deletion of the intronic nucleotide c.12-13? Since for intronic positions both the "+" and "-" characters are essential, the recommendation is to use the "_"-character (underscore) to indicate a range.
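The practical effect of reserving the underscore for ranges can be sketched with a small parser. This is only an illustration under simplifying assumptions: the names `parse_position` and `parse_range` are hypothetical, and only plain integer positions with an optional intronic offset are handled.

```python
import re

# A position is a (possibly negative) integer, optionally followed by
# a signed intronic offset such as +2 or -13.
POSITION = re.compile(r"^(-?\d+)([+-]\d+)?$")


def parse_position(text: str) -> tuple[int, int]:
    """Split a position like '77-2' into (coding position, intronic offset)."""
    match = POSITION.match(text)
    if match is None:
        raise ValueError(f"not a recognised position: {text!r}")
    base = int(match.group(1))
    offset = int(match.group(2)) if match.group(2) else 0
    return base, offset


def parse_range(text: str) -> list[tuple[int, int]]:
    """Parse one position, or a range whose ends are joined by an underscore."""
    return [parse_position(part) for part in text.split("_")]


print(parse_range("12_13"))  # [(12, 0), (13, 0)]  a range of two coding nucleotides
print(parse_range("12-13"))  # [(12, -13)]         a single intronic position
```

With the underscore as the only range separator, the hyphen can be read as an intronic offset without any guessing.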
As a consequence of the above-mentioned change, the ";"-character should not be used to describe changes which affect RNA-processing, i.e. yielding two or more transcripts (den Dunnen & Antonarakis, 2000, Hum.Mut. 15: 7-12). The suggestion is to use the ","-character (comma) instead (see Recommendations). This rule applies to descriptions at both the RNA and the protein level.
If a deletion is large and the reference sequence is split over several files, list at least once (in order) the respective files containing the overall reference sequence. When describing the change, to prevent confusion, include a reference to the sequence used, e.g. AC109326.2:g.82398_L78833.1:g.80466del. In the "Remarks" column of the summary table the size of the deletion could be mentioned (e.g. 160 kb deletion spanning exons 1-22). Please note that, since the reference sequence is split over several files, this size can not be deduced from the description of the sequence. See also Discussion - Fused genes.
Although duplications can be considered as a special type of insertion, the recommendation is to describe duplications independently from insertions, using the term "dup". This recommendation also applies to a duplicated mono-, di-, tri-, etc. nucleotide stretch. There are several reasons why the recommendation is to describe such changes as a "duplication".
- the description is simpler, shorter and less ambiguous
- it is clearer and prevents confusion regarding the exact position introduced when an insertion is incorrectly reported like "22insG" (see Insertions)
- it prevents discussions regarding the position of the insertion; in the case of a duplication including the intron/exon border (e.g. c.123-8_143dup), is the "insertion" in the intron or the exon?
- the term "insertion" more or less implies "coming from elsewhere". Mechanistically, a duplication is more likely to be caused by DNA polymerase slippage, duplicating a local sequence.
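To make the distinction concrete, the sketch below decides whether an inserted sequence merely repeats the sequence immediately 5' of it and, if so, reports it as "dup". It is a toy illustration, not a validated tool: the function name `describe_insertion` is hypothetical, the reference is treated as a simple coding DNA string numbered from 1, and the change is first shifted to the most 3' equivalent position.

```python
def describe_insertion(reference: str, after: int, inserted: str) -> str:
    """Describe `inserted` placed after 1-based position `after` of `reference`."""
    pos = after  # number of reference bases to the left of the insertion point
    # Shift the insertion to the most 3' position that yields the same sequence.
    while pos < len(reference) and reference[pos] == inserted[0]:
        inserted = inserted[1:] + inserted[0]
        pos += 1
    start = pos - len(inserted)
    # A duplication: the inserted bases copy the bases directly 5' of them.
    if start >= 0 and reference[start:pos] == inserted:
        if len(inserted) == 1:
            return f"c.{pos}dup{inserted}"
        return f"c.{start + 1}_{pos}dup{inserted}"
    # Otherwise a true insertion, listing both flanking positions.
    return f"c.{pos}_{pos + 1}ins{inserted}"


print(describe_insertion("ACGTTA", 3, "T"))     # c.5dupT
print(describe_insertion("ACGACGT", 3, "ACG"))  # c.4_6dupACG
print(describe_insertion("ACGT", 1, "TT"))      # c.1_2insTT
```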
Examples
Duplications are indicated by the
term "dup". The question arose what to do with triplications, quadruplications,
etc. There are several possibilities. First, like "dup" for duplications one
could use "tri" for
triplications, "qua" for quadruplications, etc. Another possibility is to use
the recommendation to describe alleles of variable short sequence repeats and to use
[3] for triplication, [4] for quadruplication, etc. A variant of this possibility is to
use rep3, rep4, etc.
To avoid an ever-growing set of specific notations, which would make the overall
description of DNA variants increasingly complicated, the recommendation builds
on existing recommendations. Thus, triplications, quadruplications, etc. are described
like the alleles of variable short sequence repeats using [3],
[4], etc.
Examples
When these amplifications involve imperfect copies of the unit sequence, descriptions quickly become too complex to be meaningful. In such cases the recommendation is to submit the sequence that has been determined to GenBank and to use the accession.version number in the description (see Recommendations).
From Pat O'Neill (Burlington, USA):
I especially like the use of "dup" in place of "ins" when the
inserted base creates a run of 2 or more bases. I feel that there should be a parallel
term for the loss of a base from a run of 2 or more bases instead of just
"del". This is because of the mechanistic implications of both an ins and a del
of a base in a run. Has this been discussed? My only thought for a term in place of
"del" is "los" for loss.
Shuji Ogino (Boston, USA) agrees with this suggestion but suggests using the term "dec"
for a decrease in length.
Reply (JdD): The "dup" nomenclature was suggested mainly
because the description is simpler, shorter and less equivocal (see Discussion). The potential mechanistic relation is nice but was not
decisive. A description should above all be clear and unequivocal, rather than carry
additional information.
Agreed / not agreed - please tell us your opinion (mail to: ddunnen
@ lumc.nl and Stylianos.Antonarakis @ medecine.unige.ch)
The description of insertions has been the subject of some discussion. The first point
was whether both nucleotides (amino acids) flanking the insertion site had to be given.
In the past, the description 22insG (or Cys22insGly) was used to indicate both insertion
at position 22 and insertion after position 22. This situation
becomes even more complex when a "-" character is involved, like in -14insG or
456-13insG. Does the latter mean at or after intronic
nucleotide 456-13 and, in addition, is the position after nucleotide 456-13
456-12 or 456-14? Consequently, to prevent confusion, both flanking
residues have to be listed.
The second point of discussion was which character to use as a separator. The initial
suggestion was to use the "^"-character (e.g. p.Q83^C84insQ). However, since a
character to indicate a range was already available, it was decided to use this character,
i.e. the "_"-character (see above).
The occurrence of a combination of a deletion and insertion, sometimes named "indel", is not rare. Based on existing terminology, a recommendation for their description can be rather straightforward; a combination of a deletion and insertion at the same site is described in the format c.112_117delinsTG (alternatively c.112_117delAGGTCAinsTG). At the protein level the format is analogous, e.g. p.Trp33_Lys35delinsArg.
NOTE: description clarified with the help of Raymond Dalgleish (Leicester, UK).
The recommendation is to designate frame shifting variants by "fs". It is not useful to add much detail in the description of frame shifting variants besides (especially in the case of C-terminal variants) the length of the new, shifted reading frame. Two notations can be used to describe frame shift changes, a short or a long form.
Short description
"fs" after a description of the first amino acid affected by the
change.
Long description
"fsX#" after a description of the amino acid(s) affected by the
change and the change occurring at the site of the frame shift. "X#" indicates
at which codon position the new reading frame ends in a stop (X). The position of the stop
in the new reading frame is calculated starting at the first amino acid that is changed by
the frame shift, and ending at the first stop codon (X#).
NOTE: the shifted reading frame is thus open for '#-1' amino acids.
Please note that the frame shift example given in den Dunnen & Antonarakis, 2000, Hum.Mut. 15: 7-12 contains a mistake: p.R97fsX121 (page 11) should be "p.R97fsX25, indicating a frame-shifting change with Arginine-97 as the first affected amino acid and the new reading frame being open for 24 amino acids".
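The counting rule for "fsX#" can be sketched as follows. The function name `frameshift_description` is hypothetical and the peptide used in the example is made-up toy data; the input is assumed to be the translation of the new reading frame, starting at the first changed amino acid and using "*" for the stop codon.

```python
def frameshift_description(first_affected: str, position: int, shifted_peptide: str) -> str:
    """Build a short ('fs') or long ('fsX#') frame shift description.

    `shifted_peptide` starts at the first amino acid changed by the frame
    shift; a '*' marks the first stop codon in the new reading frame.
    """
    stop_index = shifted_peptide.find("*")
    if stop_index == -1:
        # No stop found in the supplied translation: fall back to the short form.
        return f"p.{first_affected}{position}fs"
    # Counting starts at 1 for the first changed amino acid, so the stop is at
    # position stop_index + 1 and the new frame is open for stop_index residues.
    return f"p.{first_affected}{position}fsX{stop_index + 1}"


# Toy example: 24 amino acids in the new frame, followed by a stop.
new_frame = "G" + "A" * 23 + "*"
print(frameshift_description("R", 97, new_frame))  # p.R97fsX25
```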
As discussed, in some cases it is very difficult to assign a sequence which can be used as a good reference for numbering. When a coding DNA reference sequence is used it should represent the major transcript of the gene. Alternatively spliced exons (5'-first, internal or 3'-terminal) derived from within the gene can then best be numbered as for intronic sequences. Description of variants in transcripts initiating or terminating outside this region is more difficult. The suggestion is to describe these as usual but to precede them with a unique identifier of the alternative transcript and a ":"-character, like c.Dp427c:3G>T. The alternative transcript should be precisely described and refer to a specific database record (GenBank, EMBL, DDBJ), the accession number of which should be provided.
For example, for the DMD-gene, involved in Duchenne Muscular Dystrophy (DMD), the major transcript is that found in muscle, indicated with Dp427m. Other transcripts are initiated from within the Dp427m gene, e.g. that found in Purkinje cells (Dp427p) and in retina (Dp260), but these are all considered "alternative transcripts". Variants in the respective promoter / exon 1 region can thus be described as intronic sequences in relation to the Dp427m coding DNA sequence. However, the brain promoter / exon 1 lies 5' of the Dp427m promoter. Thus, a variant in this region should be described using the format c.Dp427c:3G>T. Note that this transcript encodes a new translation initiation site and that the numbering used starts with nucleotide +1 for the A of the ATG-translation initiation codon of the Dp427c-transcript.
For clarity reasons, e.g. to prevent confusion when in one manuscript variants in
relation to different reference sequences are described, it is recommended to use unique
sequence indicators as part of the description of each variant. It should be noted
however that for every indicator the respective reference sequence used should
always be mentioned. Unique indicator and sequence description should be separated by a
colon (":") (see Recommendations).
Examples;
Publications reporting linkage or association studies often use a range of different
markers/SNPs. Such publications should contain an unequivocal description of all
markers used. An easy way to achieve this is to include in the description a
direct, unequivocal reference to the reference sequence used (preferably a GenBank
or dbSNP record).
Examples;
Regarding SNPs and their use in the text of papers, Peter Taschner (LUMC,
Leiden, the Netherlands) makes the following remark:
most recommendations for sequence variant nomenclature apply to genotype descriptions in
tables. Unfortunately, these are not very useful in the general text of a paper. For
instance, the OPRM1:c.118A>G or dbSNP1799971:A>G designation can be used to describe
the sequence variant, but in a paper you might like to discuss the phenotypic consequences
of different genotypes. In fact the current recommendation is to use
OPRM1:c.[118A>G]+[=] to describe a heterozygote and [=]+[=] and
OPRM1:c.[118A>G]+[118A>G] for the homozygotes. I would like to suggest to
describe the genotypes in the text like;
- OPRM1:c.118AA homozygotes
- OPRM1:c.118GA heterozygotes
- OPRM1:c.118GG homozygotes
The different alleles could then be designated as the OPRM1:c.118A allele and the OPRM1:c.118G allele. In combination with variants of other genes, the genotype descriptions could be OPRM1:c.118AA, GJB2:c.76AC double heterozygotes, etc.
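The relation between the two-letter genotypes suggested above and the bracketed allele notation can be sketched in a few lines. The function name `genotype_description` is hypothetical, and prefixing the reference homozygote with the gene symbol is an assumption of this sketch; the variant allele is listed first, as in the heterozygote example in the text.

```python
def genotype_description(gene: str, position: int, ref: str, alt: str, alleles: str) -> str:
    """Turn a two-letter genotype such as 'GA' into the bracketed allele notation."""
    change = f"[{position}{ref}>{alt}]"
    parts = []
    for base in alleles:
        if base == ref:
            parts.append("[=]")  # allele identical to the reference
        elif base == alt:
            parts.append(change)
        else:
            raise ValueError(f"unexpected allele {base!r}")
    # List variant alleles before reference alleles.
    parts.sort(key=lambda part: part == "[=]")
    return f"{gene}:c." + "+".join(parts)


print(genotype_description("OPRM1", 118, "A", "G", "GA"))  # OPRM1:c.[118A>G]+[=]
print(genotype_description("OPRM1", 118, "A", "G", "GG"))  # OPRM1:c.[118A>G]+[118A>G]
print(genotype_description("OPRM1", 118, "A", "G", "AA"))  # OPRM1:c.[=]+[=]
```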
Haplotypes are a special form of two or more variants in one allele (see Recommendations DNA changes). When it is once clearly described (e.g. in the Materials & Methods) what the order of the variants is and which reference sequences were used a rather simple description of a haplotype can be used. Examples;
For the description of translocations the format "t(X;4)(p21.2;q34)",
suggested originally by the ISCN
(1985), is already used as a standard.
NOTE: current recommendations are made by the "Standing Committee on Human Cytogenetic
Nomenclature (2001-2006)".
For a description at the molecular level this notation can be followed, extended with the
standard description indicating the exact translocation breakpoint. When due to local
similarities the exact breakpoint is uncertain, following standard nomenclature rules, it
will be arbitrarily assigned to the most 3' nucleotide. Since the translocation
breakpoints can have a complex structure and since it involves two different chromosomal
locations, the sequences of the two translocation breakpoints should always be submitted
to a sequence database (GenBank, EMBL, DDBJ). The accession numbers of these files should
be listed in the report.
Next to the exact location of the translocation breakpoint, its
molecular characterisation will yield more details including e.g. deletions/duplications
at the junction and the sequence joined, derived from the other chromosome. We believe
that a description covering all these details will become too complex. However, when one
wants to include these details, the first description should be for the translocated 5'
segment of the gene, the second for the translocated 3' segment, separated by a
";"-character. It should also be noted that when a translocation joins
genes A and B, the description of the breakpoint in the sequence variation database of
gene A is different from that in gene B. The major difference is that the nucleotide
numbering is based on that of gene A or gene B respectively.
Due to (large) deletions, translocations or inversions, genetic rearrangements may have
one breakpoint far from the gene under study. The breakpoint might lie in 'empty'
intergenic sequences or in another gene. Consequently, to describe the breakpoint at a
molecular level two Reference Sequences will be required. To describe cases like this, no
recommendations have been made yet.
Recommendation: for the breakpoint residing in the gene under study,
nucleotide numbering is clear and follows the standard. When the breakpoint lies in
another gene, nucleotide numbering for that end should be based on the nucleotide
numbering for that gene (accession.version number of the Reference Sequence used should
be provided). To indicate that the end lies in another gene, the nucleotide number
should be preceded with the gene's official
Gene Symbol, like GJB2:c.233. When the breakpoint does not reside in another gene, the
accession number of the Reference Sequence will be used instead of the official Gene
Symbol, like AC012343.2:g.763 (please note that this is always a genomic Reference
Sequence). When the breakpoint ends on the opposite strand (reverse,
complementary, non-transcribed or anti-sense strand) of a gene or on the opposite
strand of an intergenic sequence, an "o" will precede the official
Gene Symbol (like oGJB2:c.233). Please note that the use of a "c" (complementary),
"a" (anti-sense) or "r" (reverse) might cause confusion
with nucleotides C and A or the "r" indicating description of a change on
RNA-level.
Discussions regarding the use of either the one- or three-letter amino acid code to describe variants at protein level are ongoing. Basically, descriptions using the one-letter amino acid code are unequivocal, short and thus preferred. However, since the one-letter amino acid code is not obvious (Ala, Arg, Asn, Asp start with A; Gln, Glu, Gly with G; Leu, Lys with L; Phe, Pro with P; and Thr, Tyr with T), publications often contain mistakes when the one-letter code is used. In addition, the 'X' is not only used to indicate a stop codon (translation termination) but also to indicate unknown residues. Consequently, to prevent mistakes, we favour the use of the three-letter amino acid code.
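Converting a one-letter description into the three-letter code is a simple table lookup, as the sketch below shows. The function name `to_three_letter` is hypothetical and only simple substitutions are handled; because of the ambiguity of 'X' noted above, the sketch rejects it rather than guessing its meaning.

```python
import re

# Standard one-letter to three-letter codes for the 20 amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def to_three_letter(substitution: str) -> str:
    """Convert a substitution such as 'R97G' into 'p.Arg97Gly'."""
    match = SUBSTITUTION.match(substitution)
    if match is None:
        raise ValueError(f"not a simple substitution: {substitution!r}")
    old, position, new = match.groups()
    if old not in THREE_LETTER or new not in THREE_LETTER:
        # Covers 'X', which may mean either a stop codon or an unknown residue.
        raise ValueError(f"ambiguous or unknown residue in {substitution!r}")
    return f"p.{THREE_LETTER[old]}{position}{THREE_LETTER[new]}"


print(to_three_letter("R97G"))  # p.Arg97Gly
print(to_three_letter("L36I"))  # p.Leu36Ile
```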
Currently, variants in the translation initiating Methionine (M1) are usually described as a substitution, e.g. M1V. This is not correct. Either no protein is produced or the translation initiation site moves up- or downstream. Unless experimental proof is available, it is probably best to report the effect on protein level as "p.Met1?" (unknown). When experimental data show that no protein is made, the description "p.0" is recommended.
A gene conversion is a nonreciprocal transfer of genetic information between two
homologous sequences. As a result of a
gene conversion the sequence of (part of) a gene can be copied from a highly similar
sequence residing elsewhere in the genome. Usually, the converted segment contains
a range of sequence changes, making its description rather complex. In such cases it is
recommended to use a specific description using the format "region_changed"
con "region of origin". Please note that here, too, the rule applies to arbitrarily assign the most 3' position
possible as the first to have been changed. Examples;
Copyright HGVS 2007 All Rights Reserved
Website Created by Rania Horaitis, Nomenclature by J.T. Den Dunnen