Discussions regarding the description of sequence variants
Since references to WWW-sites are not yet acknowledged as citations, please mention den Dunnen JT and Antonarakis SE (2000). Hum.Mutat. 15:7-12 when referring to these pages.
This page gives an overview of the discussions raised and suggestions made to describe sequence variations after publication of the latest manuscript on this issue by JT den Dunnen and S Antonarakis (2000, Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Human Mutation 15: 7-12, copy in PDF-format). We invite investigators to send us further remarks on the issues discussed here. Furthermore, we solicit complicated cases not yet covered, with a suggestion regarding how to describe these. We hope these pages will be used as a guide to describe any sequence change, ultimately evolving into a uniformly accepted standard.
For reactions E-mail to: ddunnen@LUMC.nl and Stylianos.Antonarakis@medecine.unige.ch
As correctly pointed out by Peter Slickers (Clondiag Chip Technologies), providing a database accession number is not sufficient to identify a sequence in the database unambiguously. Several different versions may exist for a given accession. In most cases only the annotation changes while the sequence remains the same, but this is not always the case and one cannot rely on it (compare e.g. NM_000130.1 and NM_000130.2). Therefore one should always use accession AND version number to refer to the reference sequence (see Recommendations).
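The accession-plus-version rule can be checked mechanically. The sketch below is illustrative only: the function name `split_accession` and the regular expression are assumptions made for this example and do not cover every identifier format used by sequence databases.

```python
import re

# Assumed pattern: 1-4 capital letters, an optional underscore, digits,
# then a mandatory ".version" suffix (e.g. NM_000130.2 or AC109326.2).
ACCESSION_VERSION = re.compile(r"^([A-Z]{1,4}_?\d+)\.(\d+)$")


def split_accession(identifier):
    """Return (accession, version); raise ValueError when the version is missing."""
    match = ACCESSION_VERSION.match(identifier)
    if match is None:
        raise ValueError(f"{identifier!r} lacks an explicit version number")
    return match.group(1), int(match.group(2))
```

With this check, `NM_000130.2` is accepted, while a bare `NM_000130` is rejected because it does not say which version of the record was used.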
In some cases the description of pathogenic changes in genes started well
before there were any mutation nomenclature recommendations (e.g. in
thalassemias and cystic fibrosis). When new reports describe variants according
to current recommendations, instead of using traditional descriptions,
experts in the field experience problems "recognizing"
these variants. However, nomenclature rules should be universal and thus cannot
be tailored to specific situations. The traditional notation only rings a
bell for experts in the field; for others it is cryptic and confusing.
Although this is inconvenient, traditional descriptions should not be used anymore. When one
uses the recommended descriptions it will only be a matter of time until
the experts also become acquainted with them. The recommendation is to list in the variant summary
table the official and traditional names next to each other in separate
columns, like c.88+2T>G and IVS#+2T>G, p.Phe508del and delF508, or c.24dupG
and Cd8/9+G.
In the past, a specific notation has been used to describe polymorphic sequence variations, i.e. c.76A/G and p.36L/I (p.36Lys/Ile). However, a description of a variant should be neutral and not include any functional conclusion; consequently, polymorphisms and pathogenic changes should not be described differently (see also mutation / polymorphism). Please also note that it will often be very difficult to discriminate between pathogenic and really neutral (polymorphic) changes.
Initially, the "-"-character (hyphen) was used for two different purposes, i.e. to indicate a range (nucleotides c.12-13delTG) as well as to indicate a negative number (e.g. for intronic sequences like in c.77-2A>G). This dual use might cause confusion, which should be avoided. For example, when the change is c.12-13del, does this indicate a deletion from coding DNA nucleotide 12 to 13 or a deletion of the intronic nucleotide c.12-13? Since for intronic positions both the "+" and "-" characters are essential, the recommendation is to use the "_"-character (underscore) to indicate a range.
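The effect of reserving "_" for ranges can be illustrated with a small parser. This is a simplified sketch: the helper names `parse_position` and `parse_range` are hypothetical, and positions upstream of the coding sequence (such as -14) are deliberately not handled.

```python
import re

# One position: a base number with an optional intronic offset such as "+2" or "-13".
POSITION = re.compile(r"^(\d+)([+-]\d+)?$")


def parse_position(text):
    """Return (base, offset) for a single position like "77-2" or "88+2"."""
    match = POSITION.match(text)
    if match is None:
        raise ValueError(f"not a single position: {text!r}")
    base = int(match.group(1))
    offset = int(match.group(2)) if match.group(2) else 0
    return base, offset


def parse_range(text):
    """Split on "_" only, so "+" and "-" stay free for intronic offsets."""
    start, _, end = text.partition("_")
    return parse_position(start), parse_position(end or start)
```

Under these assumptions `12_13` can only be read as the range from nucleotide 12 to 13, and `12-13` can only be read as the single intronic position 12-13.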
As a consequence of the above mentioned change, the ";"-character should not be used to describe changes which affect RNA-processing, i.e. yielding two or more transcripts (den Dunnen & Antonarakis, 2000, Hum.Mut. 15: 7-12). The suggestion is to use the ","-character (comma) instead (see Recommendations). This rule applies to descriptions at both RNA and protein level.
If a deletion is large and the reference sequence is split over several files, list at least once (in order) the respective files containing the overall reference sequence. When describing the change, to prevent confusion, include a reference to the sequence used, e.g. AC109326.2:g.82398_L78833.1:g.80466del. In the "Remarks" column of the summary table the size of the deletion could be mentioned (e.g. 160 kb deletion spanning exons 1-22). Please note that, since the reference sequence is split over several files, this size cannot be deduced from the description of the sequence. See also Discussion - Fused genes.
Although duplications can be considered as a special type of insertion, the recommendation is to describe duplications independently from insertions, using the term "dup". This recommendation also applies to a duplicated mono-, di-, tri-, etc. nucleotide stretch. There are several reasons why the recommendation is to describe such changes as a "duplication".
- the description is simpler, shorter and more unequivocal
- it is clearer and prevents confusion regarding the exact position introduced when an insertion is incorrectly reported like "22insG" (see Insertions)
- it prevents discussions regarding the position of the insertion; in the case of a duplication including the intron/exon border (e.g. c.123-8_143dup), is the "insertion" in the intron or the exon?
- the term "insertion" more or less implies "coming from elsewhere". Mechanistically, a duplication is more likely to be caused by DNA polymerase slippage, duplicating a local sequence.
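The choice between "dup" and "ins" can be sketched as a simple comparison with the sequence directly upstream of the insertion site. The function `describe_insertion` is hypothetical, and the sketch assumes a plain coding DNA string numbered from 1 and an insertion site that the caller has already shifted to the most 3' position possible.

```python
def describe_insertion(reference, after, inserted):
    """Describe `inserted` placed between positions `after` and `after + 1` (1-based)."""
    n = len(inserted)
    # The copy directly upstream of the insertion site occupies after-n+1 .. after.
    if after >= n and reference[after - n:after] == inserted:
        start, end = after - n + 1, after
        span = str(start) if start == end else f"{start}_{end}"
        return f"c.{span}dup"
    # Otherwise report a true insertion, listing both flanking positions.
    return f"c.{after}_{after + 1}ins{inserted}"
```

For the toy reference `ATGGCA`, inserting G after position 4 is reported as `c.4dup`, whereas inserting T at the same site is reported as `c.4_5insT`.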
Examples
Duplications are indicated by the term "dup". The question arose what to do with
triplications, quadruplications, etc. There are several
possibilities. First, like "dup" for duplications, one could use "tri"
for triplications, "qua" for quadruplications, etc. Another
possibility is to use the recommendation to describe alleles of variable short sequence repeats
and to use [3] for triplication, [4] for quadruplication, etc. A variant of this
possibility is to use rep3, rep4, etc.
Examples
When these amplifications involve imperfect copies of the unit sequence,
descriptions quickly become too complex to be meaningful. In such cases the
recommendation is to submit the sequence that has been determined to GenBank and
to use the accession.version number in the description (see
Recommendations).
From Pat O'Neill (Burlington, USA):
Reply (JdD): Basically the "dup" nomenclature was
suggested because the description is simpler, shorter and more unequivocal
(see Discussion). The
potential mechanistic relation is nice but was not decisive. Basically a
description should be clear/unequivocal and not so much contain additional
information.
The description of insertions has been the subject of some discussion. The first point of discussion
was whether both nucleotides (amino acids) flanking the insertion site had to be given
or not. In the past, the description 22insG (or Cys22insGly) was used both to indicate insertion
at position 22 and insertion after position 22. This situation
becomes even more complex when a "-" character is involved, like in -14insG or
456-13insG. Does the latter mean at or after intronic
nucleotide 456-13 and, in addition, is the position after nucleotide 456-13
456-12 or 456-14? Consequently, to prevent confusion, both flanking residues
have to be listed.
The occurrence of a combination of a deletion and insertion, sometimes named "indel",
is not rare. Based on existing terminology, a recommendation for their description can be
rather straightforward; a combination of a deletion and insertion at the same site is
described in the format c.112_117delinsTG (alternatively c.112_117delAGGTCAinsTG). On protein
level, likewise, as p.Trp33_Lys35delinsArg. NOTE: description clarified with the help of Raymond Dalgleish
(Leicester, UK).
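As a minimal sketch of what a description such as c.112_117delinsTG means at the sequence level, the hypothetical helper below replaces a 1-based, inclusive range of a reference string with the inserted nucleotides. It assumes a plain coding DNA string without intronic positions.

```python
def apply_delins(reference, start, end, inserted):
    """Replace 1-based, inclusive positions start..end of `reference` with `inserted`."""
    if not 1 <= start <= end <= len(reference):
        raise ValueError("range falls outside the reference sequence")
    # Everything before `start` is kept, the range is dropped, `inserted` takes its place.
    return reference[:start - 1] + inserted + reference[end:]
```

Applied to a reference carrying AGGTCA at positions 112 to 117, `apply_delins(reference, 112, 117, "TG")` yields a sequence in which those six nucleotides are replaced by TG.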
The recommendation is to designate
frame shifting variants by "fs". It is not useful to add much detail to the description of frame shifting
variants besides (especially in the case of C-terminal variants) the length of the new, shifted reading frame. Two notations can be
used to describe frame shift changes, a short or a long form.
Short description
Long description
Please note that the frame shift example given in den Dunnen & Antonarakis, 2000, Hum.Mut.
15: 7-12 contains a mistake; it should read p.R97PfsX23.
As discussed, in some cases it is very difficult to
assign a sequence which can be used as a good reference for numbering. When a coding DNA
reference sequence is used it should represent the major transcript of the
gene. Alternatively spliced exons (5'-first, internal or 3'-terminal) derived from within
the gene can then best be numbered as for intronic sequences. Description of variants in
transcripts initiating or terminating outside this region is more difficult. The
suggestion is to describe these as usual but to precede them with a unique identifier of
the alternative transcript and a ":"-character, like c.Dp427c:3G>T. The
alternative transcript should be precisely described and refer to a specific database
record (GenBank, EMBL, DDBJ), the accession number of which should be provided.
For example, for the DMD-gene, involved in Duchenne Muscular Dystrophy (DMD), the major
transcript is that found in muscle, indicated with Dp427m. Other transcripts are initiated
from within the Dp427m gene, e.g. that found in Purkinje cells (Dp427p) and in retina
(Dp260), but these are all considered "alternative transcripts".
Variants in the respective promoter / exon 1 region can thus be described as intronic
sequences in relation to the Dp427m coding DNA sequence. However, the brain promoter / exon 1
lies 5' of the Dp427m promoter. Thus, a variant in this region should be described using
the format c.Dp427c:3G>T. Note that this transcript encodes a new translation
initiation site and that the numbering used starts with nucleotide +1 for the A of the
ATG-translation initiation codon of the Dp427c-transcript.
For clarity, e.g. to prevent confusion when in one manuscript
variants in relation to different reference sequences are described, it is
recommended to use unique sequence indicators as part of the description
of each variant. It should be noted however that for every indicator the
respective reference sequence used should always be mentioned. Unique
indicator and sequence description should be separated by a colon (":")
(see Recommendations).
Publications reporting linkage or association studies often use a range of
different markers/SNP's. Such publications should contain an unequivocal
description of all markers used. An easy way to achieve this is to include in the
description a direct, unequivocal reference to the reference sequence used
(preferably a GenBank or dbSNP record).
Regarding SNP's and their use in the text of papers, Peter
Taschner (LUMC, Leiden, the Netherlands) makes the following remark:
The different alleles could then be designated as the OPRM1:c.118A allele and the
OPRM1:c.118G allele. In combination with variants of other genes, the genotype descriptions could be OPRM1:c.118AA,
GJB2:c.76AC double heterozygotes, etc.
Haplotypes are a special form of two or more variants in one allele (see
Recommendations DNA changes). Once it has been clearly described (e.g. in the
Materials & Methods) what the order of the variants is and which reference
sequences were used, a rather simple description of a haplotype can be used.
Examples;
For the description of translocations the format "t(X;4)(p21.2;q34)",
suggested originally by the ISCN
(1985), is already used as a standard.
Due to (large) deletions, translocations or inversions, genetic rearrangements may
have one breakpoint far from the gene under study. The breakpoint might lie in 'empty'
intergenic sequences or in another gene. Consequently, to describe the breakpoint at a
molecular level two Reference Sequences will be required. To describe cases like this, no
recommendations have been made yet.
Discussions regarding the use of either the one- or three-letter amino acid code
to describe variants at protein level are ongoing. Basically, descriptions using the one-letter amino acid code are unequivocal, short and thus
preferred. However, since the one-letter amino acid code is not obvious (Ala, Arg, Asn,
Asp start with A, Gln, Glu, Gly with G, Leu, Lys
with L, Phe, Pro with P and Thr, Tyr with T)
publications often contain mistakes when the one-letter code is used. In addition, the 'X'
is not only used to indicate a stop codon (translation termination) but also to indicate
unknown residues. Consequently, to prevent mistakes, we favour the use of the three-letter amino acid code.
Currently, variants in the translation initiating Methionine (M1) are usually
described as a substitution, e.g. M1V. This is not correct. Either no protein is produced
or the translation initiation site moves up- or downstream. Unless experimental proof is
available, it is probably best to report the effect on protein level as "p.Met1?"
(unknown). When experimental data show that no protein is made, the description "p.0"
is recommended.
A gene conversion is a nonreciprocal transfer of genetic information between two homologous
sequences. As a result of a
gene conversion the sequence of (part of) a gene can be copied from a highly
similar sequence residing elsewhere in the genome. Usually, the converted
segment contains a range of sequence changes, making its description rather
complex. In such cases it is recommended to use a specific description using the
format "region_changed" con "region of origin".
Please note that also here the rule applies to
arbitrarily assign the most 3' position
possible as the first to have been changed.
To prevent more and more specific notations from being designed, making the
overall description of DNA variants increasingly complicated, the recommendation is
to build further on existing recommendations. Thus, triplications,
quadruplications, etc. are described like the alleles
of variable short sequence repeats
using [3], [4], etc.
p.His5_Cys7[3] (or p.H5_C7[3]) describes a triplication of the amino
acid sequence HQC, changing MKMGHQCC to MKMGHQCHQCHQCC.
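The [3] notation can be read as "this stretch is present three times in tandem". The hypothetical helper below, a sketch for one-letter protein strings only, makes that reading explicit.

```python
def apply_repeat(protein, start, end, copies):
    """Return `protein` with 1-based residues start..end present `copies` times in tandem."""
    unit = protein[start - 1:end]
    # Keep the flanks unchanged and write the unit `copies` times in between.
    return protein[:start - 1] + unit * copies + protein[end:]


# p.H5_C7[3] applied to the example sequence from the text.
print(apply_repeat("MKMGHQCC", 5, 7, 3))  # prints MKMGHQCHQCHQCC
```

Because the same function covers [2], [3], [4] and so on, no separate terms such as "tri" or "qua" are needed.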
Loss from a run of nucleotides
I especially like the use of "dup" in place of "ins" when
the inserted base creates a run of 2 or more bases. I feel that there should
be a parallel term for the loss of a base from a run of 2 or more bases
instead of just "del". This is because of the mechanistic implications
of both an ins and a del of a base in a run. Has this been discussed? My only
thought for a term in place of "del" is "los" for loss.
Shuji Ogino (Boston, USA) agrees with this suggestion but suggests using the term
"dec" for a decrease in length.
Agreed / not agreed - please tell us your opinion (mail to: ddunnen@lumc.nl
and Stylianos.Antonarakis@medecine.unige.ch)
Insertions
The second point of discussion was which character to use as a separator. The initial
suggestion was to use the "^"-character (e.g. p.Q83^C84insQ). However, since a
character to indicate a range was already available, it was decided to use this character,
i.e. the "_"-character (see above).
Insertion-deletions (indels)
Frame shift variants
"fs" after a description of the first amino acid affected by the
change.
"fsX#" after a description of the amino acid(s) affected by the
change and the change occurring at the site of the frame shift. "X#" indicates at which codon position
the new reading frame ends in a stop (X). The position of the stop in the new reading frame is calculated starting at the first amino acid that is changed by the frame shift, and ending at the first stop codon (X#).
NOTE: the shifted reading frame is thus open for '#-1' amino acids.
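The counting rule for "X#" can be sketched as follows. The function `describe_frameshift` is hypothetical; it assumes one-letter protein strings, with the shifted translation written up to and including an "X" at the first stop codon of the new reading frame.

```python
def describe_frameshift(reference, shifted):
    """Build a long-form description of the shape p.<ref><pos><new>fsX<#>."""
    # First position (0-based) at which the two translations differ.
    first = next(i for i, (a, b) in enumerate(zip(reference, shifted)) if a != b)
    # First stop codon in the new reading frame, at or after the first change.
    stop = shifted.index("X", first)
    # Count from the first changed residue (codon 1) up to and including the stop.
    length = stop - first + 1
    return f"p.{reference[first]}{first + 1}{shifted[first]}fsX{length}"
```

For a reference `MKRLPQT` and a shifted translation `MKPWX`, the first changed residue is Arg3 (now Pro) and the stop is the third codon of the new frame, so the sketch returns `p.R3PfsX3`; the shifted reading frame is open for two amino acids, in line with the '#-1' note.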
New recommendations
Alternative transcripts (Reference Sequence)
Accepted sequence indicators
Examples;
NOTE: the "{ }" characters are used to separate gene symbol
and GenBank accession.version number
NOTE: the chromosome build used should always be mentioned (e.g. NCBI Build
36.1)
Single Nucleotide Polymorphisms (SNP's)
Examples;
NOTE: descriptions like dbSNP2306220:A>G should not be used, they
are not unequivocal since it is unknown whether rs2306220 or
ss2306220 is meant.
Homo/heterozygotes
Most recommendations for sequence variant nomenclature apply to genotype descriptions in tables.
Unfortunately, these are not very useful in the general text of a paper. For instance, the OPRM1:c.118A>G or
dbSNP1799971:A>G designation can be used to describe the sequence variant, but in a paper you might like to discuss the phenotypic
consequences of different genotypes. In fact the current recommendation is to use
OPRM1:c.[118A>G]+[=] to describe a heterozygote and [=]+[=] and
OPRM1:c.[118A>G]+[118A>G]
for the homozygotes. I would like to suggest to describe the genotypes in the
text like;
Haplotypes
Translocations
NOTE: current recommendations are made by the "Standing Committee on Human Cytogenetic Nomenclature
(2001-2006)".
For a description at the molecular level this
notation can be followed, extended with the standard description indicating the exact
translocation breakpoint. When due to local similarities the exact breakpoint is
uncertain, following standard nomenclature rules, it will be arbitrarily assigned to the
most 3' nucleotide. Since the translocation breakpoints can have a complex structure and
since it involves two different chromosomal locations, the sequences of the two
translocation breakpoints should always be submitted to a sequence database (GenBank,
EMBL, DDBJ). The accession numbers of these files should be listed in the report.
Next to the exact location of the translocation breakpoint, its
molecular characterisation will yield more details including e.g. deletions/duplications
at the junction and the sequence joined, derived from the other chromosome. We believe
that a description covering all these details will become too complex. However, when one
wants to include these details, the first description should be for the translocated 5'
segment of the gene, the second for the translocated 3' segment, separated by a
";"-character. It should also be noted that when a translocation joins
genes A and B, the description of the breakpoint in the sequence variation database of
gene A is different from that in gene B. The major difference being that the nucleotide
numbering is based on that of gene A or gene B respectively.
Fused genes
Recommendation: for the breakpoint residing in the gene under study,
nucleotide numbering is clear and follows the standard. When the breakpoint lies in
another gene, nucleotide numbering for that end should be based on the nucleotide
numbering for that gene (accession.version number of the Reference Sequence used should be
provided). To indicate that the end lies in another gene, the nucleotide number should
be preceded with the gene's official
Gene Symbol, like GJB2:c.233. When the breakpoint does not reside in another gene, the
accession number of the Reference Sequence will be used instead of the official Gene
Symbol, like AC012343.2:g.763 (please note that this is always a genomic Reference
Sequence). When the breakpoint ends on the opposite strand (reverse,
complementary, non-transcribed or anti-sense strand) of a gene or on the opposite strand of an
intergenic sequence, an "o" will precede the official Gene Symbol
(like oGJB2:c.233). Please note that the use of a "c" (complementary), "a" (anti-sense)
or "r" (reverse) might cause confusion with nucleotides C and A or the
"r" indicating description of a change on RNA-level.
One or three letter amino acid code
Translation initiation codon changes
Gene conversions
Copyright © HGVS 2007 All Rights Reserved |