HGVS recommendations: examples DNA

Description of sequence changes:
examples DNA-level

Last modified November 16, 2015

Since references to WWW-sites are not yet acknowledged as citations, please mention den Dunnen JT and Antonarakis SE (2000). Hum.Mutat. 15: 7-12 when referring to these pages.

Introduction
- Reference Sequence
- numbering
Examples - DNA changes
- general
- seven elementary changes
  - substitution
  - deletion (incl. in single nucleotide stretches)
  - duplication
  - insertion
  - sequence repeat variability (CA, CAG, ...)
  - inversion
  - gene conversion
  - translocation
  - transposition
- complex rearrangements
  - deletion/insertion (indel)
- miscellaneous
  - two changes in a gene on one chromosome
  - two changes in one individual, chromosome unknown
  - recessive diseases - changes in different chromosomes (known and unknown)
    - probably homozygous
    - several changes in two chromosomes
    - changes in different genes
- uncertainties (exact position not known; Southern blot, PCR, arrayCGH, SNP-array, ...)
Examples - RNA changes
Examples - protein changes

Introduction

Within this page examples will be given for the description of sequence variations. The examples will be given independently for descriptions at DNA, RNA and protein level. All examples are described relative to a Reference Sequence, depending on the level a genomic or coding DNA sequence (DNA-level), an mRNA sequence (RNA-level) or an amino acid sequence (protein level).

Reference sequence DNA-level

Within this page examples will be given for the description of sequence variations in a DNA sequence. For other examples go to those describing changes in RNA. Examples for protein level are given at the protein page. All examples are described relative to a Reference Sequence, here a coding DNA sequence.

Part of gene		nucleotide numbering genomic Reference Sequence	nucleotide numbering coding DNA Reference Sequence	nucleotide numbering protein Reference Sequence
5' gene flanking region		1 to 270	(-300 to -31)	-
exon 1	5' UTR	271 to 300	-30 to -1	-
exon 1	coding region	301 to 312	1 to 12	1 to 4
intron 1		313 to 412	12+1 ... 12+50, 13-50 ... 13-1	-
exon 2		413 to 488	13 to 88	5 to 29 (30)
intron 2		489 to 688	88+1 ... 88+100, 89-100 ... 89-1	-
exon 3		689 to 723	89 to 123	30 to 41
intron 3	contains rare alternatively spliced exon from 800 to 859 (coding DNA 123+77 to 123+136)	724 to 1023	123+1 ... 123+150, 124-150 ... 124-1	-
exon 4		1024 to 1200	124 to 300	42 to 100
intron 4		1201 to 1600	300+1 ... 300+200, 301-200 ... 301-1	-
exon 5	coding region	1601 to 1630	301 to 330	101 to 109
exon 5	3' UTR, containing a (CA)₇-stretch from nucleotides 1700 to 1713 (coding DNA *70 to *83); poly-A addition site at 1825 (coding DNA *195)	1631 to 1850	*1 to *220	-
3' gene flanking region		1851 to 2000	(*221 to *370)	-

NOTE: nucleotides in introns in the 5' UTR are numbered like -23+1, -23+2, ..., -22-2, -22-1. Nucleotides in introns in the 3' UTR are numbered like *154+1, *154+2, ..., *155-2, *155-1.

Legend:
Reference sequence of imaginary gene used for the examples given on this page. Nucleotide +1 in the coding DNA reference sequence is the A of the ATG translation initiation codon. Abbreviations used: nt = nucleotide, UTR = untranslated region of the mRNA. For a picture of part of this hypothetical sequence see Figure.
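The relation between the genomic and coding DNA columns of the table can be sketched in a few lines of Python. This is only an illustration for the imaginary gene, not an official tool: the names `EXONS`, `CDS_START`, `CDS_END` and the three functions are assumptions introduced here, and positions in the gene flanking regions are deliberately not handled.

```python
# Exon layout of the imaginary gene, copied from the table above:
# (genomic start, genomic end) of exons 1 to 5.
EXONS = [(271, 312), (413, 488), (689, 723), (1024, 1200), (1601, 1850)]
CDS_START = 301  # A of the ATG translation initiation codon (coding DNA 1)
CDS_END = 1630   # last nucleotide of the termination codon (coding DNA 330)


def transcript_index(g):
    """0-based offset of exonic genomic position g within the spliced transcript."""
    offset = 0
    for start, end in EXONS:
        if start <= g <= end:
            return offset + (g - start)
        offset += end - start + 1
    raise ValueError(f"g.{g} is not an exonic position of this gene")


def exonic_to_c(g):
    """Coding DNA label (without the 'c.' prefix) of an exonic position."""
    t = transcript_index(g)
    t_start = transcript_index(CDS_START)
    t_end = transcript_index(CDS_END)
    if t < t_start:
        return str(t - t_start)       # 5' UTR: -1, -2, ... (there is no position 0)
    if t > t_end:
        return "*" + str(t - t_end)   # 3' UTR: *1, *2, ...
    return str(t - t_start + 1)       # coding region: 1, 2, ...


def genomic_to_c(g):
    """Coding DNA label of any position from the cap site to the end of exon 5."""
    for (_, e1), (s2, _) in zip(EXONS, EXONS[1:]):
        if e1 < g < s2:               # g lies in the intron between two exons
            up, down = g - e1, s2 - g
            if up <= down:            # 5' half: count from the last exon nucleotide
                return f"{exonic_to_c(e1)}+{up}"
            return f"{exonic_to_c(s2)}-{down}"  # 3' half: count back from next exon
    return exonic_to_c(g)
```

For example, `genomic_to_c(490)` gives `88+2` and `genomic_to_c(1458)` gives `301-143`, matching the intron rows of the table.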

General

Publications reporting changes in different sequences (genes) or which report linkage or association studies should prevent any confusion regarding which variant resides in which sequence. An easy way to achieve this is to include an unequivocal identifier to the reference sequence used in the description, e.g. NM_004006.2:c.3G>T or DMD:c.3G>T (see Discussion).

Substitutions

Substitutions are designated by a ">"-character after the number of the affected nucleotide.

5' gene flanking region - T to C substitution of nt 241 (located 30 nucleotides upstream of the transcription initiation site, i.e. in the promoter region)

genomic Reference Sequence	coding DNA Reference Sequence
g.241T>C	-

5' UTR - G to A substitution of nt 289, 12 nucleotides upstream of the ATG translation initiation codon (coding DNA -12). For nucleotide numbering in a case where the ATG is not in exon 1 see here.

genomic Reference Sequence	coding DNA Reference Sequence
g.289G>A	c.-12G>A

translation initiation codon - A to C substitution of nt 301, i.e. nt 1 of the coding region (coding DNA 1)

genomic Reference Sequence	coding DNA Reference Sequence
g.301A>C	c.1A>C

coding region - G to C substitution of nt 423 in exon 2, i.e. nt 23 of the coding region (coding DNA 23)

genomic Reference Sequence	coding DNA Reference Sequence
g.423G>C	c.23G>C

intron (regarding the numbering of intronic nucleotides see Discussion)

5' part intron - T to G substitution of the second nt in the intron (88+2) positioned between coding DNA nucleotides 88 and 89 (intron 2)

genomic Reference Sequence	coding DNA Reference Sequence
g.490T>G	c.88+2T>G

3' part intron - G to T substitution of the last nt of the intron (89-1) positioned between coding DNA nucleotides 88 and 89 (intron 2)

genomic Reference Sequence	coding DNA Reference Sequence
g.688G>T	c.89-1G>T

alternatively spliced exon - C to T substitution of intronic nt 812 (coding DNA 123+89)

genomic Reference Sequence	coding DNA Reference Sequence
g.812C>T	c.123+89C>T

translation termination codon - G to C substitution of nt 1629, i.e. nt 329 of the coding region (coding DNA 329)

genomic Reference Sequence	coding DNA Reference Sequence
g.1629G>C	c.329G>C

3' UTR - T to A substitution of nt 1700 (coding DNA *70), located in the 3' UTR (70 nucleotides downstream of the termination codon)

genomic Reference Sequence	coding DNA Reference Sequence
g.1700T>A	c.*70T>A

3' gene flanking region - C to A substitution of nt 1923 (located 293 nucleotides downstream of the gene, i.e. the polyA-addition site)

genomic Reference Sequence	coding DNA Reference Sequence
g.1923C>A	c.*293C>A
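The substitution format used in the tables above (position, reference nucleotide, ">", new nucleotide) is regular enough to be split with a small pattern. The sketch below is an assumption-laden illustration, not a validator for the full nomenclature: the name `parse_substitution` is invented here, and the pattern only covers the simple forms shown in this section (plain, "-", "*" and intronic "+"/"-" positions).

```python
import re

# Pattern for the substitution forms shown in this section only, e.g.
# g.241T>C, c.-12G>A, c.88+2T>G, c.89-1G>T, c.*70T>A.
SUBSTITUTION = re.compile(
    r"(?P<level>[gc])\."                  # g. = genomic, c. = coding DNA
    r"(?P<position>[-*]?\d+(?:[+-]\d+)?)"  # position, optionally intronic
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])"     # reference nucleotide > new nucleotide
)


def parse_substitution(description):
    """Return the parts of a substitution description as a dict, or None."""
    match = SUBSTITUTION.fullmatch(description)
    return match.groupdict() if match else None
```

A description such as `c.88+2T>G` is split into level `c`, position `88+2`, reference `T` and new nucleotide `G`; anything that is not a simple substitution (for example `g.413del`) returns `None`.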

Deletion

Deletions are designated by "del" after a description of the deleted segment, i.e. the first (and last) nucleotide(s) deleted (see also Discussion). To describe deletions with unknown breakpoints, e.g. based on Southern blotting, PCR, arrayCGH, SNP array data, etc. see Uncertainties.

single nucleotide deletion - deletion of nt 13 of the coding region

genomic Reference Sequence	coding DNA Reference Sequence
g.413del (g.413delG)	c.13del (c.13delG)
g.304del (g.304delG) (not g.303del / g.303delG)	c.4del (c.4delG) (not c.3del / c.3delG)
g.1598delG (not g.1596del / g.1596delG)	c.301-3del (c.301-3delT) (not c.301-5del or c.301-5delT)

deletion of more than 1 nucleotide

deletion of nucleotides -11 to -4 in the 5' UTR

genomic Reference Sequence	coding DNA Reference Sequence
g.290_297del	c.-11_-4del

deletion of nucleotides 92 to 94 (GAC) of the coding region

genomic Reference Sequence	coding DNA Reference Sequence
g.692_694del (g.692_694delGAC)	c.92_94del (c.92_94delGAC)

deletion of nucleotides *8 to *21 in the 3' UTR

genomic Reference Sequence	coding DNA Reference Sequence
g.1638_1651del	c.*8_*21del

deletion across the exon 3 / intron 3 border, nucleotides 120 to 123 of the coding region (exon 3) and the first 48 nucleotides of intron 3 (nucleotides 123+1 to 123+48)

genomic Reference Sequence	coding DNA Reference Sequence
g.720_771del	c.120_123+48del

deletion across the intron 3 / exon 4 border, the last 12 nucleotides of intron 3 (nucleotides 124-12 to 124-1) and nucleotides 124 to 129 of the coding region (exon 4)

genomic Reference Sequence	coding DNA Reference Sequence
g.1012_1029del	c.124-12_129del

deletion of a TG dinucleotide in the sequence ATGTTGTGCC to ATGTTG_CC

genomic Reference Sequence	coding DNA Reference Sequence
g.307_308del (g.307_308delTG) NOT g.305_306del	c.7_8del (c.7_8delTG) NOT c.5_6del

deletion of an A nucleotide in the sequence CAAgt... / ..agAAG to CAgt... / ..agAAG

genomic Reference Sequence	coding DNA Reference Sequence
g.723del (g.723delA)	c.123del (c.123delA) NOT c.125delA
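The "NOT" alternatives in the rows above follow from assigning the change to the most 3' position possible. A minimal sketch of that shift is given below, assuming a single contiguous sequence with 1-based positions; the function names are hypothetical, and the exon/intron border exception illustrated by the CAAgt... / ..agAAG example is not implemented.

```python
def shift_3prime(seq, start, end):
    """Shift the 1-based interval [start, end] as far 3' as possible.

    The interval may move one position 3' whenever its first nucleotide
    equals the nucleotide directly following it, because deleting either
    interval then yields the same resulting sequence.
    """
    while end < len(seq) and seq[start - 1] == seq[end]:
        start += 1
        end += 1
    return start, end


def describe_deletion(seq, start, end, prefix="c."):
    """Describe a deletion of seq[start..end] at its most 3' position."""
    start, end = shift_3prime(seq, start, end)
    positions = str(start) if start == end else f"{start}_{end}"
    return f"{prefix}{positions}del"
```

With the sequence ATGTTGTGCC from the example, `describe_deletion("ATGTTGTGCC", 5, 6)` yields `c.7_8del`, the recommended description, rather than c.5_6del.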

variability in short sequence repeat - see below

(multi) exon deletion

breakpoint not sequenced

deletion of exons 2 to 4 (e.g. detected on Southern blot, see Discussion)

genomic Reference Sequence	coding DNA Reference Sequence
-	c.13-?_300+?del

deletion of the entire gene; coding DNA reference sequence runs from -30 (cap site) to *220 (polyA-addition site); see Recommendations

genomic Reference Sequence	coding DNA Reference Sequence
-	c.(?_-30)_(*220_?)del

breakpoints sequenced - deletion of exons 2 to 4; the sequences of the deletion breakpoints need to be submitted to a sequence database (GenBank, EMBL, DDBJ) and the accession number should be given

genomic Reference Sequence	coding DNA Reference Sequence
g.390_1458del (g.390_1458del1069)	c.13-23_301-143del (c.13-23_301-143del1069)

Duplication

Duplications are designated by "dup" after a description of the duplicated segment, i.e. the first (and last) nucleotide(s) duplicated (even when a mono-nucleotide is duplicated, see Recommendations). To describe duplications with unknown breakpoints, e.g. based on Southern blotting, PCR, arrayCGH, SNP array data, etc. see also Uncertainties.

single nucleotide duplication - duplication of nt 13 of the coding region

genomic Reference Sequence	coding DNA Reference Sequence
g.413dup (g.413dupG)	c.13dup (c.13dupG)

several nucleotide duplication

duplication of nucleotides 92 to 94 (GAC) of the coding region

genomic Reference Sequence	coding DNA Reference Sequence
g.692_694dup (g.692_694dupGAC)	c.92_94dup (c.92_94dupGAC)

duplication across the exon 3 / intron 3 border, nucleotides 120 to 123 of the coding region (exon 3) and the first 48 nucleotides of intron 3 (nucleotides 123+1 to 123+48)

genomic Reference Sequence	coding DNA Reference Sequence
g.720_771dup	c.120_123+48dup

duplication of a TG dinucleotide in the sequence ATGTTGTGCC to ATGTTGTGTGCC

genomic Reference Sequence	coding DNA Reference Sequence
g.307_308dup (g.307_308dupTG) NOT g.305_306dup	c.7_8dup (c.7_8dupTG) NOT c.5_6dup

variability in short sequence repeat - see below

(multi) exon duplication
note that a duplication by definition (see Standards) should only be used when the additional copy is located directly 3'-flanking the original copy (a tandem duplication). In most cases there will be no experimental proof of this, the additional copy may reside anywhere in the gene/genome (i.e. inserted/transposed). It seems thus better to describe exonic duplications using the format for sequence repeat variability, i.e. c.88-?_301+?[2] (see Recommendations and below).

breakpoints not sequenced - duplication of exons 2 to 4 (e.g. detected on Southern blot, see Discussion)

genomic Reference Sequence	coding DNA Reference Sequence
-	c.13-?_300+?[2] or c.13-?_300+?dup

breakpoints not sequenced - duplication of the entire gene; coding DNA reference sequence runs from -30 (cap site) to *220 (polyA-addition site); see Recommendations

genomic Reference Sequence	coding DNA Reference Sequence
-	c.(?_-30)_(*220_?)[2] or c.(?_-30)_(*220_?)dup

breakpoints sequenced - duplication of exons 2 to 4

genomic Reference Sequence	coding DNA Reference Sequence
g.390_1458dup	c.13-23_301-143dup

Insertion

Insertions are designated by "ins" after the nucleotides flanking the insertion. NOTE: duplicating insertions (incl. duplication of a mono-nucleotide) should be described as duplications (see above).

single nucleotide insertion - insertion of a T between nucleotides 51 and 52 of the coding region

genomic Reference Sequence	coding DNA Reference Sequence
g.451_452insT	c.51_52insT

several nucleotide insertion

insertion of a GAGA-sequence between nucleotides 51 and 52 of the coding region

genomic Reference Sequence	coding DNA Reference Sequence
g.451_452insGAGA	c.51_52insGAGA

insertion of a TG dinucleotide in the sequence ATGTTGTGCC to ATGTTGTGTGCC; is a duplication (see above)
variability in short sequence repeat (see Recommendations)

variability in short sequence repeat - see below
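The rule that a duplicating insertion should be described as a duplication can be checked mechanically. The following is a rough sketch under simplifying assumptions (a single contiguous sequence, 1-based positions, no exon/intron border exception); `describe_insertion` is a hypothetical helper name, not part of any existing tool.

```python
def describe_insertion(seq, after, inserted, prefix="c."):
    """Describe `inserted` placed between 1-based positions after and after+1.

    Returns a "dup" description when the inserted sequence copies the
    sequence directly 5' of the (3'-shifted) insertion point, otherwise "ins".
    """
    if not inserted:
        raise ValueError("inserted sequence must not be empty")
    # Slide the insertion point 3' while the resulting sequence stays the same:
    # if the next reference nucleotide equals the first inserted nucleotide,
    # the insertion can be moved one position 3' with its sequence rotated.
    while after < len(seq) and seq[after] == inserted[0]:
        inserted = inserted[1:] + inserted[0]
        after += 1
    n = len(inserted)
    if after >= n and seq[after - n:after] == inserted:
        start, end = after - n + 1, after
        positions = str(start) if start == end else f"{start}_{end}"
        return f"{prefix}{positions}dup"
    return f"{prefix}{after}_{after + 1}ins{inserted}"
```

For the sequence ATGTTGTGCC, inserting TG after position 4, 6 or 8 always gives ATGTTGTGTGCC, and the sketch returns `c.7_8dup` in each case; an unrelated sequence such as GAGA is reported as an insertion.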

large insertion - insertion of a 345 nucleotide sequence in intron 3 from AB012345.1; note that to be able to describe it the sequence of the insertion should be submitted to a sequence database (GenBank, EMBL, DDBJ) and the accession.version number should be given.

genomic Reference Sequence	coding DNA Reference Sequence
g.777_778insAB012345.1	c.123+54_123+55insAB012345.1

Sequence repeat variability

For the recommendations how to describe sequence repeat variability see Recommendations

polymorphic CA-repeat - a person carries a CA di-nucleotide repeat of length 6 on one chromosome (the reference sequence has a repeat length of 7)

genomic Reference Sequence	coding DNA Reference Sequence
g.1700_1701[6] or g.1700CA[6] NOT g.1712_1713del	c.*70_*71[6] or c.*70CA[6] NOT c.*82_*83del

polymorphic CA-repeat - a person carries a CA di-nucleotide repeat of length 8 on one chromosome (the reference sequence has a repeat length of 7)

genomic Reference Sequence	coding DNA Reference Sequence
g.1700_1701[8] or g.1700CA[8] NOT g.1712_1713dup	c.*70_*71[8] or c.*70CA[8] NOT c.*82_*83dup

polymorphic CA-repeat - a person carries a CA di-nucleotide repeat of length 6 on one chromosome and of length 11 on the other

genomic Reference Sequence	coding DNA Reference Sequence
g.1700_1701[6];[11] or g.1700CA[6];[11]	c.*70_*71[6];[11] or c.*70CA[6];[11]
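The repeat format used above (position of the first unit, repeat unit, number of units per allele in square brackets) can be produced with two small helpers. This is a sketch for uninterrupted repeats only; `count_units` and `describe_repeat` are names made up for this illustration, and the test sequence in the usage note is hypothetical.

```python
def count_units(seq, start, unit):
    """Number of consecutive copies of `unit` beginning at 1-based `start`."""
    index, copies = start - 1, 0
    while seq[index:index + len(unit)] == unit:
        copies += 1
        index += len(unit)
    return copies


def describe_repeat(position, unit, *allele_counts, prefix="g."):
    """Format a repeat description; one bracketed count per allele."""
    alleles = ";".join(f"[{count}]" for count in allele_counts)
    return f"{prefix}{position}{unit}{alleles}"
```

On a hypothetical stretch such as `"GG" + "CA" * 7 + "TT"`, `count_units(..., 3, "CA")` returns 7, the repeat length of the reference sequence in the examples above. `describe_repeat(1700, "CA", 6, 11)` returns `g.1700CA[6];[11]`.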

FMR1 GGC-repeat - in literature the Fragile-X tri-nucleotide repeat is known as the CGG-repeat, but based on the coding DNA reference sequence (GenBank NM_002024.5) and the rule that for all descriptions the most 3' position possible should be arbitrarily assigned the repeat has to be described as a GGC-repeat (see Recommendations).
- c.-128_-126[79] describes the presence of an extended repeat of exactly 79 units
- c.-128_-126[(600_800)] describes the presence of an extended repeat with an estimated length between 600 and 800 copies
  NOTE: "()" is used to indicate uncertainties (see Uncertainties), a description like c.-128GGC[(...)] can not be used since the GGC-repeat is probably interrupted by one or more GGA-triplets (see Recommendations)
HD AGC-repeat - based on the HTT (huntingtin) coding DNA reference sequence (GenBank NM_002111.6) the Huntington's Disease tri-nucleotide repeat has to be described as an AGC (not CAG) repeat. The reference sequence represents an allele of 21 AGC repeats, described as c.53AGC[21]. On protein level the reference allele contains 23 Gln's, described as p.Gln[23] (alternatively p.Q[23]). The difference derives from the fact that the AGC repeat is interrupted by an AAC-triplet ("CAA" coding) at position 22.

Inversion

Inversions are designated by "inv" after the nt number of the nucleotides inverted.

short inversion - inversion of nucleotides 177 to 180 of the coding region, changing -CTGA- to -TCAG-

genomic Reference Sequence	coding DNA Reference Sequence
g.1077_1080inv (g.1077_1080invCTGA)	c.177_180inv (c.177_180invCTGA)
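An inversion replaces the segment by its reverse complement, which is why -CTGA- becomes -TCAG- in the example above. A minimal sketch, assuming an uppercase DNA string and 1-based inclusive positions (the function names and the short test sequence are illustrative only):

```python
# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(segment):
    """Reverse complement of an uppercase DNA string."""
    return segment.translate(COMPLEMENT)[::-1]


def apply_inversion(seq, start, end):
    """Return seq with the 1-based interval [start, end] inverted."""
    inverted = reverse_complement(seq[start - 1:end])
    return seq[:start - 1] + inverted + seq[end:]
```

Here `reverse_complement("CTGA")` gives `TCAG`, and `apply_inversion("AACTGATT", 3, 6)` gives `AATCAGTT`.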

large inversion - a large inversion (212,434 nucleotides in length), starting in intron 4, inverts the entire 3' end of the gene and fuses it to position 233+17 (intron) of the XYZ-gene, having an opposite transcriptional orientation (indicated by the "o")

genomic Reference Sequence	coding DNA Reference Sequence
g.1458_oXYZ:457inv	c.301-143_oXYZ:233+17inv

Gene conversion

Gene conversions are designated by "con" after the nt number of the nucleotides converted, followed by a description of the origin on the new sequence; "region_changed" con "region of origin" (see Discussion).

a gene conversion replacing a segment of the coding region of a gene for a segment derived from elsewhere in the genome (as described in another GenBank file; AC096506.5 and NM_004006.1 resp.)

genomic Reference Sequence	coding DNA Reference Sequence
g.415_1655conAC096506.5:g.409_1683	c.15_355conNM_004006.1:c.15_355

Translocation

Translocations are designated in the format "t(X;4)(p21.2;q34)", followed by the usual description, placed between brackets, indicating the exact translocation breakpoint. The sequences of the translocation breakpoints need to be submitted to a sequence database (GenBank, EMBL, DDBJ) and the accession numbers should be given (see Discussion).

a translocation breakpoint in the 3' half of intron 4, between nucleotides 1453 and 1454 (coding DNA 301-148 and 301-147), joining chromosome bands Xp21.2 and 4q35

genomic Reference Sequence	coding DNA Reference Sequence
t(X;4)(p21.2;q35)(g.1453_1454)	t(X;4)(p21.2;q35)(c.301-148_301-147) [t(X;4)(p21.2;q35)(c.IVS4)]

Complex

Complex rearrangements consist of several different types of the six elementary changes: substitution, deletion, duplication, insertion, inversion and translocation. Such rearrangements can be very complex and difficult to describe. Specific recommendations to describe such changes have not been made. Complex rearrangements can best be described as a combination of the elementary changes.

Deletion / insertions ("indels") are described as a deletion ("del"), followed by an insertion ("ins") after a description of the deleted segment, i.e. the first (and last) nucleotide(s) deleted (see Discussion).

deletion of nucleotides 712 to 717 of the genomic region (coding DNA 112 to 117) and a TG-insertion at the same site

genomic Reference Sequence	coding DNA Reference Sequence
g.712_717delinsTG (g.712_717delAGGGCAinsTG)	c.112_117delinsTG (c.112_117delAGGGCAinsTG)

Miscellaneous

Two variants in a gene on one chromosome of an individual are described as "[first change; second change]" (see Discussion)

a gene on one chromosome containing a C to T change at nt 476 (coding DNA nt 76) and a G to C change at nt 483 (coding DNA 83)

genomic Reference Sequence	coding DNA Reference Sequence
g.[476C>T; 483G>C]	c.[76C>T; 83G>C]

Two variants in a gene in one individual with chromosome unknown are described as "[first change (;) second change]" (see Discussion)

one individual containing a C to T change at nt 476 (coding DNA nt 76) and a G to C change at nt 1083 (coding DNA 183) while it is unknown whether these changes are on the same or different chromosomes

genomic Reference Sequence	coding DNA Reference Sequence
g.[476C>T(;)1083G>C]	c.[76C>T(;)183G>C]

Recessive disease - changes in a gene on different chromosomes are described as "[change allele 1];[change allele 2]" (see Discussion)

both variants identified - a homozygous C to T change at nt 76 of the coding region

genomic Reference Sequence	coding DNA Reference Sequence
g.[476C>T];[476C>T]	c.[76C>T];[76C>T]

one variant not yet identified - one chromosome containing a C to T change at nt 76 of the coding region, the other chromosome containing an unknown change

genomic Reference Sequence	coding DNA Reference Sequence
g.[476C>T];[?]	c.[76C>T];[?]

probably homozygous variant - a homozygous C to T change at nt 76 of the coding region, not confirmed by analysis of both parents leaving the possibility of non-amplification of the second chromosome due to a primer mismatch or a deletion

genomic Reference Sequence	coding DNA Reference Sequence
g.[476C>T];[(476C>T)]	c.[76C>T];[(76C>T)]

one variant and a normal chromosome - one chromosome containing a C to T change at nt 76 of the coding region, the other having a reference sequence (wild type)

genomic Reference Sequence	coding DNA Reference Sequence
g.[476C>T];[=]	c.[76C>T];[=]
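The two-allele descriptions in the tables above all share one shape: each chromosome gets its own square brackets, and the two are joined by ";". A small sketch of that composition, where `describe_alleles` is a hypothetical helper and the caller is assumed to supply each change without the "g." or "c." prefix:

```python
def describe_alleles(first, second, prefix="c."):
    """Combine the changes found on two chromosomes into one description.

    `first` and `second` are lists of changes without prefix, e.g. ["76C>T"].
    Use ["="] for a reference (wild type) allele and ["?"] for an allele
    whose variant has not yet been identified.
    """
    def allele(changes):
        return "[" + ";".join(changes) + "]"

    return f"{prefix}{allele(first)};{allele(second)}"
```

For instance, `describe_alleles(["76C>T"], ["="])` returns `c.[76C>T];[=]`, and `describe_alleles(["476C>T"], ["?"], prefix="g.")` returns `g.[476C>T];[?]`.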

several variants in a gene on both chromosomes - both copies of the gene contain a C to G change at nt -5 and a G to C change at nt 183, and one copy also contains a C to T change at nt 76, all in relation to the coding region

genomic Reference Sequence	coding DNA Reference Sequence
g.[296C>G; 476C>T; 1083G>C]; g.[296C>G; 1083G>C]	c.[-5C>G; 76C>T; 183G>C]; c.[-5C>G; 183G>C]

two variants in different genes - the GJB2 gene contains a deletion of a G, the GJB6 gene a T to C substitution

genomic Reference Sequence	coding DNA Reference Sequence
	NM_004004.2:c.[35delG] NM_006783.1:c.[689T>C] (GJB2:c.[35delG] GJB6:c.[689T>C])

Mosaicism - two different sequences in one position caused by somatic mosaicism are described as "[=/variant]" (see FAQ)
NOTE: descriptions modified after acceptance of proposal SVD-WG001

in a tumor some cells contain at position 476 (coding DNA nt 76) the reference sequence while other cells contain a C to T change

genomic Reference Sequence	coding DNA Reference Sequence
g.476C=/>T	c.76C=/>T

in a tumor at position 476 (coding DNA nt 76) both the reference sequence and a C to T change are found but it is not known whether the variant is somatic or germline (inherited)

genomic Reference Sequence	coding DNA Reference Sequence
g.476C(=/)>T	c.76C(=/)>T

Chimerism - two different nucleotides in one position caused by chimerism are described as "[=//nucleotide 2]"

in a chimeric individual some cells contain at position 476 (coding DNA nt 76) the reference sequence while other cells contain a C to T change

genomic Reference Sequence	coding DNA Reference Sequence
g.=//476C>T	c.=//76C>T

one chromosome containing a TG di-nucleotide repeat of length 4, the other chromosome containing a repeat of length 5 (see Recommendation)

genomic Reference Sequence	coding DNA Reference Sequence
g.983TG[4];[5]	c.88+495TG[4];[5]
