Recommendations
for the description of DNA sequence variants - v2.0
Last modified February 25, 2013
Since references to WWW-sites are not yet acknowledged as citations,
please mention den Dunnen JT
and Antonarakis SE (2000). Hum.Mutat. 15: 7-12 when referring to these pages.
Contents
- Recommendations
- Explanations / examples
DNA level
(suggestions extending the published
recommendations are
in italics)
- nucleotides
description of nucleotides at DNA level follows the recommendations of the IUPAC-IUBMB. Nucleotides
are designated by the bases, in upper case, A (adenine), C (cytosine), G (guanine), T
(thymine), including those for uncertain nucleotides like Y (pYrimidine) and R (puRine)
(see Standards).
- nucleotide numbering (for details and examples see Reference Sequence discussions)
- coding DNA reference sequence (see Examples
and Figure)
- there is no nucleotide 0
- nucleotide 1 is the A of the ATG-translation initiation codon
- the nucleotide 5' of the ATG-translation initiation codon is -1, the previous -2,
etc.
- the nucleotide 3' of the translation stop codon is *1, the next *2, etc.
- intronic nucleotides (coding DNA reference sequence only)
- beginning of the intron; the number of the last nucleotide of the preceding exon,
a plus sign and the position in the intron, like c.77+1G, c.77+2T, ....
- end of the intron; the number of the first nucleotide of the following exon, a
minus sign and the position upstream in the intron, like ..., c.78-2A, c.78-1G.
- in the middle of the intron, numbering changes from "c.77+.." to
"c.78-.."; for introns with an uneven number of nucleotides the central
nucleotide is the last described with a "+" (see
Discussion)
- NOTE: the format c.IVS1+1G and c.IVS1-2G should not be used (see Discussion)
- genomic reference sequence (see Examples
and Figure)
- nucleotide numbering starts with 1 at the first nucleotide of the sequence
NOTE: the sequence should include all nucleotides covering the sequence
(gene) of interest and should start well 5' of the promoter of a gene
- no +, - or other signs are used
- when the complete genomic sequence is not known, a coding DNA reference sequence should
be used
- for all descriptions the most 3' position possible is arbitrarily assigned to
have been changed (see Exception)
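The numbering rules above can be sketched in a few lines of Python. This is a minimal illustration and not part of the recommendations: the function names cdna_label and intron_labels, and the choice of 0-based transcript indices as input, are assumptions made for the example.

```python
def cdna_label(index, atg_index, stop_end_index):
    """Label a transcript position in coding DNA numbering.

    All arguments are 0-based indices into a hypothetical transcript:
    atg_index is the A of the ATG, stop_end_index is the last
    nucleotide of the translation stop codon.
    """
    if index < atg_index:
        # 5' of the ATG: -1, -2, ... (there is no nucleotide 0)
        return str(index - atg_index)
    if index > stop_end_index:
        # 3' of the stop codon: *1, *2, ...
        return "*" + str(index - stop_end_index)
    # The A of the ATG is nucleotide 1
    return str(index - atg_index + 1)


def intron_labels(last_exon_nt, intron_length):
    """Return the labels of all nucleotides of one intron, 5' to 3'."""
    # For an uneven length the central nucleotide gets a "+" label.
    plus_count = (intron_length + 1) // 2
    minus_count = intron_length - plus_count
    plus = [f"{last_exon_nt}+{i}" for i in range(1, plus_count + 1)]
    minus = [f"{last_exon_nt + 1}-{i}" for i in range(minus_count, 0, -1)]
    return plus + minus
```

For a five-nucleotide intron following coding nucleotide 77, intron_labels(77, 5) yields 77+1, 77+2, 77+3, 78-2, 78-1, so the central nucleotide is the last one described with a "+".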
Substitutions
A nucleotide substitution is a sequence change where one nucleotide is replaced
by one other nucleotide (see Standards - Definition).
Nucleotide substitutions are described using a ">"-character
(indicating "changes to").
NOTE: changes involving two or more consecutive nucleotides are described as
deletion/insertions (indels, see Deletion/insertions).
- c.76A>C denotes that at nucleotide 76 an A is changed to a C
- c.-14G>C denotes a G to C substitution 14 nucleotides 5' of the ATG translation
initiation codon
- c.88+1G>T denotes the G to T substitution at nucleotide +1 of an intron (in the
coding DNA positioned between nucleotides 88 and 89)
- c.89-2A>C denotes the A to C substitution at nucleotide -2 of an intron (in the
coding DNA positioned between nucleotides 88 and 89)
- c.*46T>A denotes a T to A substitution 46 nucleotides 3' of the translation
termination codon
- the description c.76_77delinsTT is preferred over c.[76A>T; 77G>T]
NOTE:
based on the definition of a substitution (see
Standards - Definition; one nucleotide replaced by one other nucleotide) this
change can not be described as a substitution (like c.76_77AG>TT or c.76AG>TT)
NOTE: it is not correct to describe "polymorphisms" as
c.76A/G (see Discussion).
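As an illustration of the substitution format, the sketch below checks a description against a regular expression and splits it into its parts. The pattern and the name parse_substitution are assumptions for this example only. The pattern covers just the simple forms listed above and does not check that signs are absent in genomic (g.) descriptions.

```python
import re

# Hypothetical pattern for single-nucleotide substitutions only.
SUBSTITUTION = re.compile(
    r"^(?P<ref_type>[cg])\."
    r"(?P<position>[-*]?\d+)"      # 76, -14 or *46
    r"(?P<offset>[+-]\d+)?"        # intronic offset such as +1 or -2
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_substitution(description):
    """Return the parts of a substitution, or None if the format differs."""
    match = SUBSTITUTION.match(description)
    return match.groupdict() if match else None
```

Descriptions such as c.76_77AG>TT or c.76A/G do not match the pattern, in line with the notes above that these are not valid substitution descriptions.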
Deletions
A nucleotide deletion is a sequence change where one or more nucleotides are removed
(see Standards - Definition). Deletions are
described using "del" after an indication of the first and last
nucleotide(s) deleted, separated by a "_" (underscore). For all
descriptions the most 3' position possible is arbitrarily assigned to have
been changed.
NOTE: to discriminate known variable sequences from other changes it is
recommended to describe individual alleles differing from the reference sequence like
g.210T[5] (preferred over g.215_216delTT) or g.210T[9] (preferred over g.215_216dupTT) (see Repeated sequences).
- c.76_78del (alternatively c.76_78delACT) denotes an ACT deletion from nucleotides 76 to
78
- deletions with uncharacterised breakpoints (see
Uncertainties)
- c.88-?_923+?del denotes an exonic deletion starting at an unknown position in the intron
5' of coding DNA nucleotide 88 and ending at an unknown position in the intron 3' of
coding DNA nucleotide 923
- c.(?_-30)_(*220_?)del denotes the deletion of the entire gene (coding DNA reference
sequence running from -30 (cap site) to *220 (polyA-addition site))
- c.88+101_oGJB2:c.355-1045del
denotes a deletion which ends in the flanking GJB2 gene at position 355-1045 (in the
intron between nucleotides 354 and 355) on the reverse strand (the genes are thus located
and fused in opposite transcriptional directions, see
Discussion)
- for all descriptions the most 3' position possible is
arbitrarily assigned to have been changed (see FAQ);
- ACTTTGTGCC to ACTTGCC is described as c.5_7del (c.5_7delTGT, not as
c.4_6delTTG)
- ctttagGCATG to cttagGCATG in an intron is described as c.301-3delT (not as
c.301-5delT)
- TCACTGTCTGCGGTAATC to TCACTG CGGTAATC is described as c.7_10del (c.7_10delTCTG)
and not as c.4_7del (c.4_7delCTGT).
- AAAGAAGAGGAG to AAAG GAG is described as c.5_9del (c.5_9delAAGAG) and not as
c.3_7del (c.3_7delAGAAG)
- Exceptions
- using a coding DNA reference sequence there is an exception to the
rule around exon/intron and exon/exon borders when identical nucleotides flank the
exon/intron or exon/exon border;
- when the exon 3/intron 3 border is ..CAGgtg.. and RNA analysis shows no effect on
splicing but a deletion of a G the change ..CAGgtg.. to ..CAgtg.. is described as c.3delG and
not c.3+1delG.
- when exon 3 ends with ..CAA.. and exon 4 starts with ..ACG.. and the sequence of genomic
DNA shows that the last A-nucleotide of exon 3 is deleted (and not the first A-nucleotide
in exon 4), the deletion changing ..CAAACG.. to ..CAACG.. is described as c.3delA and
not c.4delA
- c.1210-12T(5_9) (not c.1210-6T(5_9)) describes the variable stretch of 5 to 9
T-residues in intron 9 of the CFTR gene. The most commonly used CFTR coding DNA reference
sequence contains a stretch of 7 T's (see Repeated sequences).
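The rule that the most 3' position possible is assigned to have been changed can be sketched as a small shifting loop. This is an illustration under simplifying assumptions: positions are plain 1-based integers within one sequence, the function names are hypothetical, and the exon/intron border exceptions described above are not handled.

```python
def shift_deletion_3prime(sequence, start, end):
    """Shift a deletion (1-based, inclusive) to its most 3' position."""
    # Deleting start..end gives the same result as deleting
    # start+1..end+1 when the first deleted nucleotide equals the
    # nucleotide directly 3' of the deleted stretch.
    while end < len(sequence) and sequence[start - 1] == sequence[end]:
        start += 1
        end += 1
    return start, end


def describe_deletion(sequence, start, end, prefix="c."):
    """Format the 3'-shifted deletion, e.g. c.5_7del."""
    start, end = shift_deletion_3prime(sequence, start, end)
    if start == end:
        return f"{prefix}{start}del"
    return f"{prefix}{start}_{end}del"
```

Applied to the examples above, a deletion first observed as positions 4 to 6 of ACTTTGTGCC is reported as c.5_7del, and positions 3 to 7 of AAAGAAGAGGAG as c.5_9del.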
Duplications
Duplications are designated by "dup" after an indication of the
first and last nucleotide(s) duplicated. It should be noted that the description "dup"
(see Standards) may by definition
only be used when the additional copy is directly 3'-flanking the original copy. For
all descriptions the most 3' position possible is arbitrarily assigned to
have been changed. For the addition of more than 1 copy (3, 4, 5, etc.) see
Repeated sequences and see Discussion.
NOTE: to discriminate known variable sequences from other changes it is
recommended to describe individual alleles differing from the reference sequence like
g.210T[5] (preferred over g.215_216del) or g.210T[9] (preferred over g.215_216dup) (see Repeated sequences).
- duplicating insertions should be described as duplications (see
Discussion)
- g.5dupT (or g.5dup, not g.5_6insT) denotes a duplication ("insertion")
of the T nucleotide at position 5 in the genomic reference sequence changing ACTCTGTGCC to
ACTCTTGTGCC
- g.7dupT (or g.7dup, not g.5dupT, not g.7_8insT) denotes a duplication ("insertion")
of the T nucleotide at position 7 in the genomic reference sequence changing AGACTTTGTGCC
to AGACTTTTGTGCC
- g.7_8dup (or g.7_8dupTG, not g.5_6dup, not g.8_9insTG) denotes a TG duplication
in the TG-tandem repeat sequence changing ACTTTGTGCC to ACTTTGTGTGCC
- g.7_8[4] (or g.5_6[4], or g.5TG[4], not g.7_10dup) is the preferred description
of the addition of two extra TG's to the variable TG repeated sequence changing
ACTTTGTGCC to ACTTTGTGTGTGCC (see Repeated
sequences)
- c.77_79dup (or c.77_79dupCTG) denotes that the three nucleotides 77 to 79 are duplicated
(present twice)
- c.88-?_301+?dup denotes the duplication of exons 3 to 4, starting at an unknown
position in intron 2 (upstream of coding DNA nucleotide 88) and ending at an unknown
position in intron 4 (downstream of coding DNA nucleotide 301). Using this description
exons 2 and 5 have been shown not to be duplicated (see
Uncertainties)
NOTE: the description "dup" (see Standards) may by definition only be
used when the additional copy is directly 3'-flanking of the original copy (tandem
duplication). In many cases there will be no experimental proof, the
additional copy may be anywhere in the genome (i.e. inserted / transposed). It has been
suggested that, unless there is experimental evidence, exonic duplications should be
described using the format c.88-?_301+?(2) (see
Recommendations).
- c.1032-?_1357+?(3)
denotes the presence of two additional copies of an exon, starting at an unknown position
in the flanking upstream intron (upstream of coding DNA nucleotide 1032) and ending at an
unknown position in the flanking downstream intron (downstream of coding DNA nucleotide
1357). The description, using "(3)", indicates there is no proof that the three
copies are in tandem (see Repeated sequences, see Uncertainties)
- c.1032-?_1357+?[3]
denotes a direct triplication of an exon, starting at an unknown position in the
flanking upstream intron (upstream of coding DNA nucleotide 1032) and ending at an unknown
position in the flanking downstream intron (downstream of coding DNA nucleotide 1357) (see Repeated sequences)
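The choice between "dup" and "ins" for small changes can be sketched as follows. The function describe_insertion is a hypothetical helper for this example. It assumes a single reference sequence with 1-based integer positions and an insertion that is not at the very end of the sequence, and it does not cover the repeated-sequence format.

```python
def describe_insertion(sequence, after, inserted, prefix="g."):
    """Describe `inserted` placed directly 3' of 1-based position `after`."""
    # Shift the insertion point 3' as long as the result stays the same:
    # if the nucleotide following the insertion point equals the first
    # inserted nucleotide, the insertion can be moved one position 3'
    # by rotating the inserted sequence.
    while after < len(sequence) and sequence[after] == inserted[0]:
        inserted = inserted[1:] + inserted[0]
        after += 1
    size = len(inserted)
    # A copy of the nucleotides directly 5' of the insertion point is a
    # duplication of those nucleotides.
    if after >= size and sequence[after - size:after] == inserted:
        if size == 1:
            return f"{prefix}{after}dup"
        return f"{prefix}{after - size + 1}_{after}dup"
    return f"{prefix}{after}_{after + 1}ins{inserted}"
```

With the sequences used above, inserting TG after position 4 of ACTTTGTGCC is reported as g.7_8dup, not as an insertion and not as g.5_6dup.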
Insertions
Insertions are designated by "ins" after an indication of the
nucleotides flanking the insertion site, followed by a description of the nucleotides
inserted. Duplicating insertions should be described as duplications (see Discussion), not as insertions. For large insertions
the number of inserted nucleotides should be mentioned, together with an accession.version
number referring to a sequence database file containing the complete inserted
sequence.
- c.76_77insT denotes that a T is inserted between nucleotides 76 and 77 of the coding DNA
reference sequence
- c.123+54_123+55insAB012345.2:g.76_420 (or c.123+54_123+55ins345, GenBank AB012345.2)
denotes an intronic insertion (between nucleotides c.123+54 and 123+55) of 345
nucleotides (nucleotides 76 to 420 like in GenBank file AB012345 version 2)
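The relation between the accession.version reference and the number of inserted nucleotides can be checked with a short sketch. The pattern and the name inserted_range are assumptions for this example, and the pattern accepts only the simple "accession.version:g.start_end" form shown above.

```python
import re

# Hypothetical pattern: accession, version and a genomic range.
INSERTED_RANGE = re.compile(
    r"^(?P<accession>[A-Z]+_?\d+)\.(?P<version>\d+)"
    r":g\.(?P<start>\d+)_(?P<end>\d+)$"
)


def inserted_range(reference):
    """Split a reference like AB012345.2:g.76_420 and compute its length."""
    match = INSERTED_RANGE.match(reference)
    if match is None:
        raise ValueError(f"unrecognised reference: {reference!r}")
    start, end = int(match["start"]), int(match["end"])
    return {
        "accession": match["accession"],
        "version": int(match["version"]),
        # Both end points are included in the inserted sequence.
        "length": end - start + 1,
    }
```

For AB012345.2:g.76_420 this gives a length of 345, matching the alternative description c.123+54_123+55ins345.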
Repeated sequences
A frequently occurring sequence change is the variability of repeated sequences. Within
this category we discriminate both small sequences (mono-, di-, tri-, etc nucleotide
repeats) as well as the much larger ones. Such changes are described using the format "position-first-repeat-unit_[number]"
(e.g. g.123_124[4]) where position-first-repeat-unit gives the
location of the first unit of the variable sequence repeat and [number]
the number of units present in the allele described.
- the first unit of the repeat is preferably described based on position, like g.123_124.
For short/simple repeats it is acceptable to include the content of the repeated unit,
using the format "position-first-nucleotide-repeat_content" like g.123TG[4].
Do not use a mix of these descriptions like g.123_124TG[4]. This contains
redundant information (123_124 and TG) with the danger of being in conflict. Note that
including the content of the sequence involved quickly gives descriptions which become too
lengthy.
- the repeated sequences may have complex structures, consisting of a mix of several
repeated sequences directly following each other. These should be described
using the individual repeated elements, like g.456TG[4]TA[9]TG[3] (or
g.456_465[4]466_489[9]490_499[3]).
- the same format can be used to describe the presence of multiple copies (triplication,
quadruplication, etc.) of larger sequences, e.g. exons. Square brackets ("[]")
should only be used when there is experimental evidence that the additional copies are in
tandem on the same allele. When there is no such proof the presence of additional copies
should be described using normal brackets ("()").
NOTE: the description "dup" (see Standards) may by definition only be
used when the additional copy is directly 3'-flanking of the original copy (tandem
duplication). In many cases there will be no experimental proof, the
additional copy may be anywhere in the genome (i.e. inserted / transposed). It has been
suggested that, unless there is experimental evidence, exonic duplications should be
described using the format c.88-?_301+?(2).
Examples
- g.123_124[4] (alternatively g.123TG[4]) describes a sequence variable in the
number of TG repeats where the first unit is present at position g.123_124 in the genomic
reference sequence and the allele described contains 4 units.
- c.1210-12T(5_9) describes a famous variable stretch of 5 to 9 T-residues in intron 9 of
the CFTR gene. The most commonly used CFTR coding DNA reference sequence contains a
stretch of 7 T's. It is recommended to describe individual alleles differing from this
reference sequence as c.1210-12T[5] (not c.1210-7_1210-6delTT) or c.1210-12T[9] (not
c.1210-7_1210-6dupTT).
NOTE: the repeat should not be described as c.1210-6T(5_9)
- c.123+74TG(3_6) (alternatively c.123+74_123+75(3_6)) indicates that a TG
di-nucleotide repeat is present, starting at nucleotide 74 in the intron following cDNA
nucleotide c.123, which is found varying from 3 to 6 copies in the population
- c.123+74TG[4];[5] denotes that a person carries a TG di-nucleotide repeat of length 4
on one allele and of length 5 on the other allele
- based on the FMR1 coding DNA Reference Sequence (GenBank NM_002024.3),
c.-158GGC(1000) describes the presence of an extended GGC-repeat of about 1000 units
NOTE: "(1000)" is used to indicate uncertainties (see Uncertainties)
- c.-158GGC[79] describes the presence of an allele with an extended GGC-repeat of exactly
79 units
- c.1032-?_1357+?(3)
denotes the presence of two additional copies of an exon, starting at an unknown position
in the flanking upstream intron (upstream of coding DNA nucleotide 1032) and ending at an
unknown position in the flanking downstream intron (downstream of coding DNA nucleotide
1357). The description, using "(3)", indicates there is no proof that the three
copies are in tandem (see Uncertainties)
- c.1032-?_1357+?[3]
denotes a direct triplication of an exon, starting at an unknown position in the
flanking upstream intron (upstream of coding DNA nucleotide 1032) and ending at an unknown
position in the flanking downstream intron (downstream of coding DNA nucleotide 1357)
NOTE: the use of tri (for triplication), qua (for quadruplication), etc. is not
recommended (see Discussion)
- g.1209_4523(12_45) denotes that a 3.3 kb repeat sequence of which the first copy is
present in the genomic reference sequence from nucleotides 1209 to 4523 can be found
repeated 12 to 45 times in the population
- g.1209_4523[14];[23] denotes that a person carries a 3.3 kb repeat with 14 copies on
one allele and 23 copies on the other allele
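Counting repeat units and writing the allele in the "[number]" format can be sketched as below. The helper describe_repeat is hypothetical. It assumes the caller already knows the position of the first unit and the unit itself, and it only handles a simple repeat of one unit on a sequence with 1-based integer positions.

```python
def describe_repeat(sequence, start, unit, prefix="g.", by_position=False):
    """Count tandem copies of `unit` beginning at 1-based position `start`."""
    if not unit:
        raise ValueError("the repeat unit must not be empty")
    size = len(unit)
    index = start - 1
    count = 0
    # Walk over the sequence one unit at a time while the unit repeats.
    while sequence[index:index + size] == unit:
        count += 1
        index += size
    if not by_position:
        # Content-based form, e.g. g.5TG[4]
        return f"{prefix}{start}{unit}[{count}]"
    if size == 1:
        return f"{prefix}{start}[{count}]"
    # Position-based form, e.g. g.5_6[4]
    return f"{prefix}{start}_{start + size - 1}[{count}]"
```

The two forms are alternatives. As noted above, a description should use either the position of the first unit or its content, not a mix of both.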
Deletion / insertions (indels)
Deletion/insertions of two or more consecutive nucleotides (indels) are described as a deletion
followed by an insertion (see Discussion).
- c.112_117delinsTG (alternatively c.112_117delAGGTCAinsTG) denotes the replacement of
nucleotides 112 to 117 (AGGTCA) by TG
- c.113delinsTACTAGC (alternatively c.113delGinsTACTAGC) denotes the replacement of
nucleotide 113 by 7 new nucleotides (TACTAGC)
- c.114_115delinsA (alternatively c.[114G>A; 115delT])
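The effect of a deletion/insertion on a sequence can be shown with a one-line sketch. The function apply_delins and the short test sequences are made up for illustration, and positions are 1-based integers within one sequence.

```python
def apply_delins(sequence, start, end, inserted):
    """Replace nucleotides start..end (1-based, inclusive) by `inserted`."""
    # Everything 5' of the deleted stretch, the new nucleotides,
    # then everything 3' of the deleted stretch.
    return sequence[:start - 1] + inserted + sequence[end:]
```

For example, replacing the six nucleotides AGGTCA at positions 3 to 8 of the hypothetical sequence CCAGGTCACC by TG gives CCTGCC.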
Inversions
Inversions are designated by "inv" after an indication of the first
and last nucleotides affected by the inversion.
- c.203_506inv (or 203_506inv304) denotes that the 304 nucleotides from position 203 to
506 have been inverted
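A minimal sketch of applying an inversion is given below. It assumes that "inverted" means the affected stretch is replaced by its reverse complement as read on the reference strand. That is the usual molecular interpretation, but it is not spelled out in the text above. The function names and the test sequence are hypothetical.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def apply_inversion(sequence, start, end):
    """Replace start..end (1-based, inclusive) by its reverse complement."""
    inverted = sequence[start - 1:end].translate(COMPLEMENT)[::-1]
    return sequence[:start - 1] + inverted + sequence[end:]


def inversion_length(start, end):
    """Number of nucleotides affected, both end points included."""
    return end - start + 1
```

The length calculation reproduces the optional suffix in the example above: positions 203 to 506 cover 304 nucleotides.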
Conversions
Conversions are designated by "con" after an indication of the
first and last nucleotides affected by the conversion, followed by a description of the
origin of the new nucleotides (see Discussion).
- c.123_678conNM_004006.1:c.123_678 describes a gene conversion replacing nucleotides
c.123 to c.678 of the coding DNA sequence of the transcript of interest with nucleotides
c.123 to c.678 from a transcript sequence as present in GenBank file NM_004006 (version 1)
Translocations
Translocations are described at the molecular level using the format
"t(X;4)(p21.2;q34)", followed by the usual numbering, indicating the position of the
translocation breakpoint. The sequences of the translocation breakpoints need to be
submitted to a sequence database (GenBank, EMBL, DDBJ) and the accession.version numbers
should be given (see Discussion).
- t(X;4)(p21.2;q34)(c.857+101_857+102) denotes a translocation breakpoint in the intron
between coding DNA nucleotides 857+101 and 857+102, joining chromosome bands Xp21.2 and
4q34
More changes in one individual
Two or more changes in one individual are described by combining the changes, per
allele (chromosome) between square brackets ("[]").
Changes in different alleles (e.g. in recessive diseases) are
described as "[change allele 1];[change allele 2]". Mixed descriptions like c.[76A>C];g.[91C>G]
should not be used.
- c.[76A>C];[76A>C] denotes a homozygous A to C change at nucleotide 76
- c.[76A>C];[(76A>C)] denotes a homozygous A to C change at nucleotide 76, not
confirmed by analysis of both parents, leaving the possibility of non-amplification of the
second allele due to a primer mismatch or a deletion
- c.[76A>C];[?] denotes an A to C change at nucleotide 76 in one allele and an unknown
change in the other allele
- c.[76A>C];[=] denotes an A to C change at nucleotide 76 in one allele and a normal
sequence in the other allele (see FAQ)
- c.[76A>C];[0] denotes an A to C change at nucleotide 76 in one allele and the absence of a sequence from
the other allele (e.g. for a variant in a gene on the X-chromosome in a male where only
one allele is present)
- descriptions of sequence changes in different genes (e.g. for recessive diseases) are listed between
square brackets, separated by a ";"-character and include a reference to the
sequence (gene) changed; [DMD:c.76A>C];[GJB:c.87delG] (see
Discussion)
Two variations in one allele, separated by at least one nucleotide, are described
as "[first change ; second change]". For the description of haplotypes see Discussion.
NOTE: "separated by at least one nucleotide" means the
description c.76_77delinsTT is preferred over c.[76A>T; 77G>T].
- c.[76A>C; 83G>C] denotes two changes in one allele; A to C change at nucleotide 76
and a G to C change at nucleotide 83
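The "separated by at least one nucleotide" rule can be sketched as a grouping step: substitutions at consecutive positions are merged into one delins, all others are listed separately within one pair of square brackets. The function describe_same_allele is a hypothetical helper. It assumes a non-empty list of substitutions at plain integer (exonic) positions on the same allele.

```python
def describe_same_allele(substitutions, prefix="c."):
    """Describe substitutions on one allele.

    `substitutions` is a non-empty list of (position, ref, alt) tuples.
    """
    ordered = sorted(substitutions)
    # Group changes that affect directly adjacent nucleotides.
    groups = [[ordered[0]]]
    for change in ordered[1:]:
        if change[0] == groups[-1][-1][0] + 1:
            groups[-1].append(change)
        else:
            groups.append([change])
    parts = []
    for group in groups:
        if len(group) == 1:
            position, ref, alt = group[0]
            parts.append(f"{position}{ref}>{alt}")
        else:
            # Consecutive nucleotides changed: describe as delins.
            new = "".join(alt for _, _, alt in group)
            parts.append(f"{group[0][0]}_{group[-1][0]}delins{new}")
    if len(parts) == 1:
        return prefix + parts[0]
    return f"{prefix}[{'; '.join(parts)}]"
```

Changes at positions 76 and 77 thus give c.76_77delinsTT, while changes at 76 and 83 give c.[76A>C; 83G>C].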
Mosaicism - two
different nucleotides in one position caused by somatic mosaicism are
described as "[=/nucleotide 2]" (see FAQ).
- c.[=/83G>C] describes a mosaic case where at position 83 besides the normal
sequence (a G, described as '=') also chromosomes are found containing a C (c.83G>C)
Chimerism - two
different nucleotides in one position caused by chimerism are described
as "[=//nucleotide 2]"
- c.[=//83G>C] describes a chimeric case where at position 83 besides the normal
sequence (a G, described as '=') also cells are found containing a C (c.83G>C)
Two sequence changes with
alleles unknown are described as "[change allele 1 (;) change allele 2]" (see FAQ).
- c.[76A>C(;)283G>C] denotes that two changes were identified in one individual
(an A to C change at nucleotide 76 and a G to C change at nucleotide 283), but it is not
known whether these changes are in the same allele or in different alleles
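The bracket formats of this section can be assembled with simple string helpers. This is a sketch only: the function names are hypothetical, each change is passed as text without its "c." prefix, and no check is made that the individual changes are themselves valid descriptions.

```python
def two_alleles(allele1, allele2, prefix="c."):
    """Changes on different alleles: [allele 1];[allele 2]."""
    return f"{prefix}[{'; '.join(allele1)}];[{'; '.join(allele2)}]"


def phase_unknown(changes, prefix="c."):
    """Changes for which the allele is unknown, joined by (;)."""
    return f"{prefix}[{'(;)'.join(changes)}]"


def mixed_cells(change, chimeric=False, prefix="c."):
    """Mosaicism uses '/', chimerism uses '//' after the '=' sign."""
    separator = "//" if chimeric else "/"
    return f"{prefix}[={separator}{change}]"
```

For instance, two_alleles(["76A>C"], ["="]) gives c.[76A>C];[=] and phase_unknown(["76A>C", "283G>C"]) gives c.[76A>C(;)283G>C].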
Complex changes
Sequence changes can be very complex, involving several changes at a specific location.
The description of such changes using the recommendations given above can become rather
complicated and at some point, although literally correct, effectively meaningless. In
such cases the recommendation is to submit the sequence that has been determined to
GenBank and to use the accession.version number in the description.
- c.123_678conNM_004006.1:c.123_678 describes a gene conversion replacing nucleotides
c.123 to c.678 of the coding DNA sequence of the transcript of interest with nucleotides
c.123 to c.678 from a transcript sequence as present in GenBank file NM_004006 (version 1)
- c.88+101_oGJB2:c.355-1045del denotes a deletion which ends in the flanking GJB2 gene
at position 355-1045 (in the intron between nucleotides 354 and 355) on the reverse strand
(the genes are thus located and fused in opposite transcriptional directions, see Discussion)
- c.123+54_123+55insAB012345.2:g.76_420 denotes an intronic insertion (between
nucleotides c.123+54 and 123+55) of 345 nucleotides (nucleotides 76 to 420 like in GenBank
file AB012345 version 2)
Copyright © HGVS 2007
All Rights Reserved
Website Created by Rania Horaitis, Nomenclature by J.T. Den Dunnen