Recommendations for the description of DNA sequence variants
Last modified June 15, 2007
Since references to WWW-sites are not yet acknowledged as citations, please mention den Dunnen JT and Antonarakis SE (2000). Hum. Mutat. 15: 7-12 when referring to these pages.
Contents
- Recommendations
- Explanations / examples
DNA level
(suggestions extending the published recommendations are in italics)
- nucleotides are designated by the bases (in upper case):
- A (adenine)
- C (cytosine)
- G (guanine)
- T (thymine)
- nucleotide numbering (for details and examples see Reference Sequence discussions)
- coding DNA Reference Sequence (see Examples and Figure)
- there is no nucleotide 0
- nucleotide 1 is the A of the ATG-translation initiation codon
- the nucleotide 5' of the ATG-translation initiation codon is -1, the previous -2, etc.
- the nucleotide 3' of the translation stop codon is *1, the next *2, etc.
- intronic nucleotides for a coding DNA reference sequence
- beginning of the intron: the number of the last nucleotide of the preceding exon, a plus sign and the position in the intron, like c.77+1G, c.77+2T, etc.
- end of the intron: the number of the first nucleotide of the following exon, a minus sign and the position upstream in the intron, like c.78-1G.
- in the middle of the intron, numbering changes from "c.77+.." to "c.78-.."; for introns with an uneven number of nucleotides the central nucleotide is the last described with a "+" (see Discussion)
- NOTE: current opinions do not favour descriptions using the format c.IVS1+1G and c.IVS1-2G (see Discussion)
- genomic Reference Sequence (see Examples and Figure)
- nucleotide numbering is purely arbitrary and starts with 1 at the first nucleotide of the database reference file
- no +, - or other signs are used
- the sequence should include all nucleotides covering the sequence (gene) of interest and should start well 5' of the promoter of a gene
- when the complete genomic sequence is not known, a coding DNA reference sequence should be used
- for all descriptions the most 3' position possible is arbitrarily assigned to have been changed (see Exception)
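The coding DNA numbering rules above can be illustrated with a short Python sketch. The function names and the choice of 0-based transcript indices as input are assumptions made for this example only; they are not part of the recommendations.

```python
def coding_position(index, atg_index, stop_end_index):
    """Label a position in a spliced transcript (all arguments are 0-based).

    atg_index is the A of the ATG initiation codon and stop_end_index is the
    last nucleotide of the translation stop codon (hypothetical inputs).
    """
    if index < atg_index:
        # 5' of the ATG: -1, -2, ... (there is no nucleotide 0)
        return f"c.{index - atg_index}"
    if index > stop_end_index:
        # 3' of the stop codon: *1, *2, ...
        return f"c.*{index - stop_end_index}"
    # Within the coding region: the A of the ATG is nucleotide 1
    return f"c.{index - atg_index + 1}"


def intron_position(last_exon_nt, intron_length, k):
    """Label the k-th nucleotide (1-based) of the intron following c.last_exon_nt."""
    # For an intron of uneven length the central nucleotide is the last "+" one.
    plus_side = (intron_length + 1) // 2
    if k <= plus_side:
        return f"c.{last_exon_nt}+{k}"
    return f"c.{last_exon_nt + 1}-{intron_length - k + 1}"
```

For a hypothetical 5-nucleotide intron after c.77, `intron_position(77, 5, 3)` labels the central nucleotide with a "+" and `intron_position(77, 5, 4)` switches to the "c.78-.." form.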
Substitutions
Single nucleotide substitutions are designated by a ">"-character (indicating "changes to"). Changes of two or more consecutive nucleotides are described as deletion/insertions (indels, see Deletion/insertions).
- c.76A>C denotes that at nucleotide 76 an A is changed to a C
- c.-14G>C denotes a G to C substitution 14 nucleotides 5' of the ATG
translation initiation codon
- c.88+1G>T denotes the G to T substitution at nucleotide +1 of an intron (in the
coding DNA positioned between nucleotides 88 and 89)
- c.89-2A>C denotes the A to C substitution at nucleotide -2 of an intron (in the
coding DNA positioned between nucleotides 88 and 89)
- c.*46T>A denotes a T to A substitution 46 nucleotides 3' of the
translation termination codon
- the description c.76_77delinsTT is preferred over c.[76A>T;
77G>T].
NOTE: it is not correct to describe polymorphic variants as c.76A/G (see Discussion).
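As a rough check of the substitution format, the examples above can be matched with a regular expression. This is a minimal sketch covering only single-nucleotide substitutions on a coding DNA reference sequence; the pattern and the field names in the returned dictionary are assumptions for illustration.

```python
import re

# Optional "-" or "*" prefix, a position, an optional intronic offset,
# then the reference nucleotide, ">" and the new nucleotide.
SUBSTITUTION = re.compile(
    r"^c\.(?P<prefix>[-*]?)(?P<pos>\d+)(?P<offset>[+-]\d+)?"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_substitution(description):
    """Return the parts of a substitution description, or None if it does not match."""
    m = SUBSTITUTION.match(description)
    if m is None:
        return None
    return {
        "prefix": m.group("prefix"),            # "", "-" (5' of ATG) or "*" (3' of stop)
        "position": int(m.group("pos")),
        "intron_offset": int(m.group("offset") or 0),
        "ref": m.group("ref"),
        "alt": m.group("alt"),
    }
```

A description such as c.76A/G does not match the pattern, so `parse_substitution` returns `None` for it.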
Deletions
Deletions are designated by "del" after an indication of the first and last nucleotide(s) deleted.
- c.76_78del (alternatively c.76_78delACT) denotes an ACT deletion from nucleotides 76 to 78
- c.7_8del (alternatively c.7_8delTG) denotes a TG deletion in the sequence ACTTTGTGCC
to ACTTTGCC
- c.88-?_923+?del denotes an exonic deletion starting at an unknown position in the
intron 5' of coding DNA nucleotide 88 and ending at an unknown position in the intron 3' of
coding DNA nucleotide 923 (see Uncertainties)
- c.(?_-30)_(*220_?)del denotes the deletion of the entire gene (coding DNA reference sequence running from -30 (cap site) to *220 (polyA-addition site), see Uncertainties)
- c.88+101_oGJB2:c.355-1045del denotes a deletion which ends in the flanking GJB2 gene at position 355-1045 (in the intron between nucleotides 354 and 355) on the reverse strand (the genes are thus located and fused in opposite transcriptional directions, see Discussion)
- for all descriptions the most 3' position
possible is arbitrarily assigned to have been changed (see
FAQ);
- ACTTTGTGCC to ACTTGCC is described as c.5_7delTGT (not as
c.4_6delTTG).
- ctttagGCATG to cttagGCATG in an intron is described as c.301-3delT (not as c.301-5delT)
- TCACTGTCTGCGGTAATC to TCACTG CGGTAATC is
described as c.7_10del (alternatively c.7_10delTCTG) and not as
c.4_7delCTGT
- AAAGAAGAGGAG to AAAG GAG is described as
c.5_9del (alternatively c.5_9delAAGAG) and not as c.3_7delAGAAG
- Exception
- c.1210-12T(5_9), and not c.1210-6T(5_9), describes the variable stretch of 5 to 9 T-residues in intron 9 of the CFTR gene. The most commonly used CFTR coding DNA Reference Sequence contains a stretch of 7 T's (see Variability of short sequence repeats).
NOTE: to discriminate known variable sequences from other
changes it is
recommended to describe individual alleles differing from the reference
sequence like c.1210-12T[5] (preferred over c.1210-7_1210-6delTT) or c.1210-12T[9]
(preferred over c.1210-7_1210-6dupTT).
- using a coding DNA Reference Sequence there
is an exception to this rule when identical nucleotides flank an intron (e.g. exon 3 ends
with ..CAAgt, exon 4 starts with agAAG.., C being nucleotide c.123). When
the genomic sequence shows that the last A-nucleotide of exon 3 is deleted
(and not the A-nucleotide in exon 4), the deletion changing ..CAAAAG.. to
..CA AAG.. is described as c.125delA and not
c.127delA.
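The rule that the most 3' position possible is taken as changed can be sketched as a small shifting loop. The function below is an illustration only: its name is hypothetical, it assumes plain 1-based coding positions in a single contiguous sequence, and it does not handle the exceptions described above (variable repeats, identical nucleotides flanking an intron).

```python
def describe_deletion(ref, start, end):
    """Return the most 3' description of deleting ref[start..end] (1-based, inclusive)."""
    # Moving the deleted window one step 3' yields the same sequence whenever
    # the nucleotide entering the window equals the one leaving it.
    while end < len(ref) and ref[end] == ref[start - 1]:
        start += 1
        end += 1
    if start == end:
        return f"c.{start}del"
    return f"c.{start}_{end}del"
```

Called with the sequences from the examples, `describe_deletion("ACTTTGTGCC", 4, 6)` shifts the TTG deletion to the c.5_7 position, and `describe_deletion("AAAGAAGAGGAG", 3, 7)` shifts to c.5_9.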
Duplications, triplications, ...
Duplications are designated by "dup" after an indication of the first and last nucleotide(s) duplicated.
- c.77_79dup (or c.77_79dupCTG, c.77_79dup3) denotes that the three nucleotides 77 to 79
are duplicated
- duplicating insertions of a mono-nucleotide and of extensions of di-, tri-, etc. nucleotide stretches
should be
described as duplications (see Discussion)
- c.5dupT (or c.5dup) denotes a duplication ("insertion")
of the T nucleotide at position 5 in the
coding DNA sequence ACTCTGTGCC to ACTCTTGTGCC
- c.5dupT (or c.5dup) denotes a duplication ("insertion")
of the T nucleotide at position 5 in the
coding DNA sequence ACTTTGTGCC to ACTTTTGTGCC
- c.7_8dup (or c.7_8dupTG or c.7_8dup2) denotes a TG duplication in the TG-tandem repeat sequence of ACTTTGTGCC to ACTTTGTGTGCC.
NOTE: this change should not be described as an insertion, i.e. c.8_9insTG
- c.(?_-30)_(*220_?)dup denotes the duplication of the entire gene (coding DNA reference sequence running from -30 (cap site) to *220 (polyA-addition site), see Uncertainties)
- c.32-?_357+?[3] denotes a triplication of an exon (coding DNA reference sequence running from nucleotide 32 to 357, see Variability of short sequence repeats)
NOTE: the use of tri (for triplication), qua (for quadruplication), etc. is not recommended (see Discussion)
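Whether an inserted sequence should be reported as a duplication can be checked mechanically. The sketch below is one possible approach, not a prescribed algorithm: it first moves the insertion point as far 3' as possible and then tests whether the inserted nucleotides copy the sequence directly 5' of that point. The function name and its arguments are hypothetical, and only simple positive coding positions are handled.

```python
def describe_insertion(ref, after, inserted):
    """Describe `inserted` placed between nucleotides `after` and `after + 1` (1-based)."""
    p, ins = after, inserted
    # Shift the insertion point 3' while the resulting sequence stays the same,
    # rotating the inserted sequence by one nucleotide per step.
    while p < len(ref) and ref[p] == ins[0]:
        ins = ins[1:] + ins[0]
        p += 1
    n = len(ins)
    if p >= n and ref[p - n:p] == ins:
        # The insertion copies the n nucleotides directly 5' of it: a duplication.
        return f"c.{p}dup" if n == 1 else f"c.{p - n + 1}_{p}dup"
    return f"c.{p}_{p + 1}ins{ins}"
```

With the sequence ACTTTGTGCC, an insertion of TG reported after nucleotide 6 or after nucleotide 8 gives the same c.7_8dup description.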
Insertions
Insertions are designated by "ins" after an indication of the nucleotides flanking the insertion site, followed by a description of the nucleotides inserted. Duplicating insertions should be described as duplications (see Discussion), not as insertions. For large insertions the number of inserted nucleotides should be mentioned, together with an accession.version number referring to a sequence database file containing the complete inserted sequence.
- c.76_77insT denotes that a T is inserted between nucleotides 76 and 77
of the coding DNA reference sequence
- c.123+54_123+55insAB012345.2:g.76_420 (or c.123+54_123+55ins345, GenBank AB012345.2) denotes an intronic insertion (between nucleotides c.123+54 and c.123+55) of 345 nucleotides (nucleotides 76 to 420 as in GenBank file AB012345 version 2)
Variability of short sequence repeats
Variability of a short sequence repeat (e.g. ATGCGATGTGTGCC) is described as c.123+74TG(3_6); c.123+74 indicates the position of the first nucleotide of the variable repeat (not the end, like c.123+79TG) and TG indicates the sequence of the repeat unit.
NOTE: the underscore is used to indicate the range (3 to 6 times); when the repeat sequence becomes too large its size is indicated, not its range.
- c.1210-12T(5_9) describes a famous variable stretch of 5 to 9
T-residues in intron 9 of the CFTR gene. The most commonly used CFTR
coding DNA Reference Sequence contains a stretch of 7 T's. It is
recommended to describe individual alleles differing from this reference
sequence as c.1210-12T[5] (not c.1210-7_1210-6delTT) or c.1210-12T[9]
(not c.1210-7_1210-6dupTT).
NOTE: the repeat should not be described as c.1210-6T(5_9)
- c.123+74TG(3_6) (alternatively c.123+74_123+75(3_6)) indicates that a TG di-nucleotide repeat is present, starting at nucleotide 74 in the intron following cDNA nucleotide c.123, which is found repeated 3 to 6 times in the population
- c.123+74TG[4]+[5] denotes that a person carries a TG di-nucleotide repeat of length 4
on one allele and of length 5 on the other allele
- based on the FMR1 coding DNA Reference Sequence (GenBank NM_002024.3),
c.-158GGC(1000) describes the presence of an extended GGC-repeat of about
1000 units
NOTE: "()" is used to indicate uncertainties (see
Uncertainties)
- c.-158GGC[79] describes the presence of an extended GGC-repeat of exactly
79 units
- c.32-?_357+?[3] denotes a triplication of an exon (coding DNA reference sequence running from nucleotide 32 to 357)
NOTE: the use of tri (for triplication), qua (for quadruplication), etc. is not recommended (see Discussion)
- g.1209_4523(12_45) denotes that a 3.3 kb repeat sequence of which the first copy is present
in the genomic reference sequence from nucleotides 1209 to 4523 can be found
repeated 12 to 45 times in the population
- g.1209_4523[14]+[23] denotes that a person carries a 3.3 kb repeat 14 times on one allele and 23 times on the other allele
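The repeat notation can be illustrated by counting repeat units in a sequence and formatting the allele counts. This is a minimal sketch; the function names are hypothetical, the repeat unit is assumed to be non-empty, and the position label is passed in as text rather than derived from a reference sequence.

```python
def count_repeat_units(seq, start, unit):
    """Count consecutive copies of `unit` beginning at 1-based position `start`."""
    i, count = start - 1, 0
    while seq.startswith(unit, i):
        count += 1
        i += len(unit)
    return count


def describe_repeat(position, unit, counts):
    """Format one or two per-allele repeat counts, e.g. c.123+74TG[4]+[5]."""
    return position + unit + "+".join(f"[{n}]" for n in counts)
```

In the example sequence ATGCGATGTGTGCC the TG repeat starts at the seventh nucleotide, so `count_repeat_units("ATGCGATGTGTGCC", 7, "TG")` counts its units from there.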
Deletion / insertions (indels)
Deletion/insertions (indels) are described as a deletion followed by an insertion after an indication of the nucleotides flanking the site of the deletion/insertion (see Discussion). Changes of two or more consecutive nucleotides are described as deletion/insertions (indels).
- c.112_117delinsTG (alternatively c.112_117delAGGTCAinsTG) denotes the replacement of
nucleotides 112 to 117 (AGGTCA) by TG
- c.114_115delinsA (alternatively c.[114G>A; 115delT]) denotes the replacement of nucleotides 114 and 115 by A
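The choice between a substitution and a deletion/insertion description depends only on how many nucleotides are replaced, which can be sketched as follows. The function name is hypothetical, and both alleles are assumed to be non-empty (pure deletions and insertions are not covered).

```python
def describe_replacement(start, ref_allele, alt_allele):
    """Describe replacing ref_allele, beginning at coding position `start`, by alt_allele."""
    if len(ref_allele) == 1 and len(alt_allele) == 1:
        # A single nucleotide changed into another: a substitution.
        return f"c.{start}{ref_allele}>{alt_allele}"
    # Anything longer is described as a deletion/insertion.
    end = start + len(ref_allele) - 1
    span = f"{start}" if end == start else f"{start}_{end}"
    return f"c.{span}delins{alt_allele}"
```

For instance, replacing AG at positions 76 and 77 by TT yields the single description c.76_77delinsTT rather than two substitutions.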
Inversions
Inversions are designated by "inv" after an indication of the first and last nucleotides affected by the inversion.
- c.203_506inv (or c.203_506inv304) denotes that the 304 nucleotides from position 203 to 506 have been inverted
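The effect of an inversion on a sequence, and the length quoted in the example, can be sketched as below. The sketch assumes that "inverted" means the affected stretch is replaced by its reverse complement, which is the usual reading but is not spelled out here; the function names are hypothetical.

```python
# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def apply_inversion(seq, start, end):
    """Return seq with nucleotides start..end (1-based, inclusive) reverse-complemented."""
    inverted = seq[start - 1:end].translate(COMPLEMENT)[::-1]
    return seq[:start - 1] + inverted + seq[end:]


def inversion_length(start, end):
    """Number of nucleotides affected by an inversion from start to end."""
    return end - start + 1
```

`inversion_length(203, 506)` gives the 304 nucleotides mentioned in the example.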
Conversions
Conversions are designated by "con" after an indication of the first and last nucleotides affected by the conversion, followed by a description of the origin of the new nucleotides (see Discussion).
- c.123_678conNM_004006.1:c.123_678 describes a gene conversion replacing nucleotides c.123 to c.678 of the coding DNA sequence of the transcript of interest with nucleotides c.123 to c.678 from a transcript sequence as present in GenBank file NM_004006 (version 1)
Translocations
Translocations are described at the molecular level using the format "t(X;4)(p21.2;q34)", followed by the usual numbering, indicating the position of the translocation breakpoint. The sequences of the translocation breakpoints need to be submitted to a sequence database (GenBank, EMBL, DDBJ) and the accession.version numbers should be given (see Discussion).
- t(X;4)(p21.2;q34)(c.857+101_857+102) denotes a translocation breakpoint in the intron between coding DNA nucleotides 857+101 and 857+102, joining chromosome bands Xp21.2 and 4q34
More changes in one individual
Two or more changes in one individual are described by combining the changes,
per allele (chromosome) between square brackets ("[]").
Changes in different alleles (e.g. in recessive diseases) are
described as "[change allele 1]+[change allele 2]". Mixed
descriptions like c.[76A>C]+g.[91C>G] should
not be used.
- c.[76A>C]+[76A>C] denotes a homozygous A to C change at nucleotide 76
- c.[76A>C]+[?] denotes an A to C change at nucleotide 76 in one allele and an unknown change in the other allele
- c.[76A>C]+[=] denotes an A to C change at nucleotide 76 in one allele and a normal sequence in the other allele (see FAQ)
- descriptions of sequence changes in different genes (e.g. for recessive diseases) are listed between square brackets, separated by a "+"-character, and include a reference to the sequence (gene) changed: [DMD:c.76A>C]+[GJB:c.87delG] (see Discussion)
Two variations in one allele, separated by at least one nucleotide, are described as "[first
change; second change]". Consequently, the description
c.76_77delinsTT is preferred over c.[76A>T; 77G>T]. For the description of
haplotypes see Discussion.
- c.[76A>C; 83G>C] denotes two changes in one allele; A to C change at nucleotide 76 and a G to C
change at nucleotide 83
Mosaic cases - two different nucleotides in one position are described as "[=, nucleotide 2]" (see FAQ).
- c.[=, 83G>C] describes a mosaic case where at position 83
besides the normal sequence (a G, described as '=') also chromosomes
are found containing a C (c.83G>C).
Two sequence changes with alleles unknown are described as "[change allele
1(+)change allele 2]" (see FAQ).
- c.[76A>C(+)283G>C] denotes that two changes were
identified in one individual (an A to C change at nucleotide 76 and a G to C
change at nucleotide 283), but it is not
known whether these changes are in the same allele or in different
alleles
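The bracket conventions for changes on one or two alleles can be sketched as a small formatting helper. The function name and the representation of each allele as a list of change strings are assumptions for this example; mosaic cases and changes with unknown phase, the "(+)" notation, are not covered.

```python
def format_alleles(allele1, allele2=None, prefix="c"):
    """Combine per-allele lists of changes into one description.

    Each allele is a list of change strings without the "c." prefix;
    use ["="] for a normal sequence and ["?"] for an unknown change.
    """
    def bracket(changes):
        # Changes in one allele are separated by "; " inside square brackets.
        return "[" + "; ".join(changes) + "]"

    text = f"{prefix}." + bracket(allele1)
    if allele2 is not None:
        # Changes in different alleles are joined with "+".
        text += "+" + bracket(allele2)
    return text
```

Passing the same list twice, as in `format_alleles(["76A>C"], ["76A>C"])`, produces the homozygous form, while a single list with two entries produces the two-changes-in-one-allele form.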
Complex changes
Sequence changes can be very complex, involving several changes at a
specific location. The description of such changes using the recommendations
given above can become rather complicated and at some point, although literally
correct, effectively meaningless. In
such cases the recommendation is to submit the sequence that has been determined
to GenBank and to use the accession.version number in the description.
- c.123_678conNM_004006.1:c.123_678 describes a gene conversion replacing nucleotides c.123 to c.678 of the coding DNA sequence of the transcript of interest with nucleotides c.123 to c.678 from a transcript sequence as present in GenBank file NM_004006 (version 1)
- c.88+101_oGJB2:c.355-1045del denotes a deletion which ends in
the flanking GJB2 gene at position 355-1045 (in the intron between
nucleotides 354 and 355) on the reverse strand (the
genes are thus located and fused in opposite transcriptional directions,
see Discussion)
- c.123+54_123+55insAB012345.2:g.76_420 denotes an intronic insertion
(between nucleotides c.123+54
and 123+55) of 345
nucleotides (nucleotides 76 to 420 like in GenBank file AB012345 version
2)
Copyright © HGVS 2007 All Rights Reserved
Website Created by Rania Horaitis, Nomenclature by J.T. Den Dunnen