HGVS recommendations; Frequently Asked Questions (FAQ)

Frequently asked questions regarding the description of sequence variants

Last modified November 16, 2015

NOTE: this website has been frozen since May 1, 2016. It has been replaced by a new version at http://www.HGVS.org/varnomen. These pages serve as an archival copy only.

Introduction
Reference sequences
- changes in mitochondrial DNA
- changes in non-coding RNA genes
DNA level changes
RNA level changes
Protein level changes
- change in 5'UTR, new initiation codon
- should I use the one- or three-letter amino acid code
- should I incorporate knowledge of the change at DNA-level
- frameshifts
- no stop codon encountered

Introduction

This page gives an overview of the questions we have received regarding the description of sequence variations based on the existing recommendations (published by den Dunnen and Antonarakis, 2000, Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Human Mutation 15: 7-12, copy in PDF-format).

For reactions: E-mail (to: HGVSmn @ JohanDenDunnen.nl) or use the HGVS variant description forum.

DNA level changes

Question (Marco Montagna, Padua, Italia)
Recently, I have been involved in the molecular characterization of BRCA1 gene rearrangements that are becoming more and more frequent in breast/ovarian cancer families. Most often these rearrangements are mediated by Alu sequences with a very high homology that reaches 100% in the breakpoint region. I looked at the reference papers on mutation nomenclature, but I still have some doubts on how to define this kind of mutation. In particular, if a genomic deletion is mediated by Alu sequences that are identical over a large nucleotide stretch containing the breakpoint, what nucleotide should be indicated? Could I indicate the most 3' one (considering the "sense" strand), similarly to the rule for deletions in repeated sequences? Moreover, if more than one genomic sequence is present in GenBank, which one should be considered? For instance, for a rearrangement that deletes a genomic region of 20kb containing exon 1 and the upstream sequence, with a breakpoint occurring over a stretch of nucleotides that are identical in the two recombining sequences, and a genomic reference sequence of the antisense strand, I would suggest the following definition: nt.X (the most 5' in the identity region of the "antisense" reference sequence, i.e. the most 3' in the "sense" strand) -- nt.Y del 20kb (exon > 1). Would that be fine?

Answer
You touch on two subjects; location of the breakpoint and reference sequence.
     Breakpoint; indeed, like you suggest, when breakpoints occur in stretches of identical sequences the most 3' position (considering the sense strand) is used to describe the position of the breakpoint (see Recommendations).
     Reference sequence; any reference sequence would be OK at least when you specify the one you use (database accession.version number, see Reference sequence discussion). When present, it would be best to use the genomic Reference Sequence from the RefSeq database. When such a sequence is not present you should make, annotate and submit one (see Discussion).
     Depending on whether a genomic or a coding DNA reference sequence is used, the final description should have the format g.1234_7245del (alternative g.1234_7245del6012) or c.123+45_955-234del (alternative c.123+45_955-234del6012).
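
The optional length suffix in a genomic deletion description can be checked against its positions, since both the first and the last position belong to the deleted stretch. The following is a minimal Python sketch; the helper name `deleted_length` and the positions in the usage line are made up for illustration and it only handles the simple `g.START_ENDdel` form.

```python
import re

def deleted_length(description: str) -> int:
    """Return the number of nucleotides removed by a simple 'g.START_ENDdel' description."""
    # An optional trailing length (e.g. 'del50') is accepted but not trusted
    match = re.fullmatch(r"g\.(\d+)_(\d+)del\d*", description)
    if match is None:
        raise ValueError(f"unsupported description: {description}")
    start, end = int(match.group(1)), int(match.group(2))
    if end < start:
        raise ValueError("end position lies before start position")
    return end - start + 1  # both positions are part of the deletion

print(deleted_length("g.100_149del"))  # 50
```

A description with a length suffix is internally consistent only when the suffix equals this value.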

Question (Erik-Jan Kamsteeg, Nijmegen, Nederland)
The recommendations to describe unknown breakpoints are not exactly clear to me. For example, PCR analysis of a gene on the X-chromosome shows products for exons 1-3 and no product is detected for exons 4-14 (exon 14 is the last exon of the gene). Since PCR does not work with one primer, we are not sure whether exon 4 and 14 are completely absent, or only partially. Therefore, using the first base of exon 4 and the '-?' (see Recommendations) could be wrong, as could be the last base of exon 14 with a '+?'. Therefore, I would like to use the last base of exon 3 with '+?' and the last base of exon 13 with a '+?'. What are your recommendations?

Answer
Literally speaking you are right and it is best to set the borders as precisely as possible. So when exon 3 is present, the location of the reverse primer can in fact be used to set the most 5' border (and the same for the exon 14 primer). Consequently the description could be something like c.(87+123_88-?)_(923+?_924-98)del. Although precise, one might wonder whether such a description is attractive; c.(87+1_88-1)_(923+1_924-1)del is as clear (see Uncertainties). When it is difficult to give an exact nucleotide position for a specific probe/sequence tested, a rule of thumb is to use the central nucleotide.

NOTE: for simplicity there are more descriptions that are not fully correct. For example, stop codons are reported as p.Cys123* while one could argue that p.Cys123_Met2376del is more precise (Met2376 being the last amino acid of the protein).

Question
Is a description like c.EX17del, indicating a deletion of exon 17, still valid?

Answer
A description like c.EX17del has never been accepted. Descriptions should indicate the nucleotides affected by the change. Note also that for many genes exon numbering is often not clearly defined and/or not described accurately.

Question
How should I describe the change TGT GC CA to TGT TG CA? Can I call it a dinucleotide mutation or is it a deletion/insertion mutation?

Answer
Simply describe it as c.4_5delinsTG (alternatively it can be described as c.[4G>T; 5C>G]). Although c.4_5GC>TG is clear and unequivocal, the description as a deletion/insertion follows the general recommendations more precisely (see Recommendations).

Question
At position c.2077_2078 in the BRCA1 gene I have a TA insertion. The published sequence for c.2076_2077 is TG; however, the individual has a common variant at c.2077 (G>A) and the TA insertion is on that allele. Should I call it c.2076_2077dupTA, since I know that is the description of the change on that specific allele, or should I call it c.2077_2078insTA, which would be the correct description based on the more common sequence at that position?

Summary; the BRCA1 coding DNA reference sequence from position 2074_2080 is ..CATGACA.. A frequent variant in the population is ..CATAACA. and the sequence found in the individual is ..CATA TA ACA.

Answer
The basic rule is to describe variants in relation to a reference sequence. In this respect, the description c.2076_2077dup (c.2076_2077dupTA) is not correct because the reference sequence does not contain a TA dinucleotide at position c.2076_2077 (it has TG). The description c.2077_2078insTA is also not correct because the change c.2077G>A is neglected and all changes should be described. So the correct description is c.2077delinsATA (or c.2077delGinsATA).

NOTE: in cases like the above, where frequent variants are present at the site changed it is allowed to describe these individually. c.[4G>T; 5C>G] in the first case, assuming either c.[4G>T] or c.[5C>G] is a known frequent variant. c.[2077G>A; 2077_2078insTA] in the second case with c.2077G>A known as the frequent variant. Of course it is essential in such cases that the variants reside on one allele.
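
The BRCA1 answer above can be checked mechanically: replacing the G at c.2077 of the reference by ATA should reproduce the sequence found in the individual. The sketch below is illustrative only; `apply_delins` is a hypothetical helper, and the sequences are the seven-nucleotide fragments quoted in the question summary.

```python
def apply_delins(seq, first_pos, start, end, inserted):
    """Replace positions start..end (inclusive, numbered from first_pos) by 'inserted'."""
    i = start - first_pos
    j = end - first_pos + 1
    return seq[:i] + inserted + seq[j:]

reference = "CATGACA"    # c.2074_2080 of the reference, as given in the question
observed = "CATATAACA"   # sequence found in the individual (CATA TA ACA)

# c.2077delGinsATA: the G at c.2077 is replaced by ATA
print(apply_delins(reference, 2074, 2077, 2077, "ATA") == observed)  # True

# A 'dup' of c.2076_2077 would require TA at that position in the reference
print(reference[2076 - 2074:2077 - 2074 + 1])  # TG
```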

Question (Ron Agatep, Toronto, Canada)
Several groups have identified a duplication in the CDKN2A locus that has been labeled in various ways. The mutation is a duplication of the first 24 bp:

normal  = ggcggcggggagcagc atg gag ccG GCG GCG GGG AGC AGC
                           Met Glu Pro Ala Ala Gly Ser Ser
                           ATG GAG CCt tcg gct
                           Met Glu Pro Ser Ala

variant = ggcggcggggagcagc atg gag ccG GCG GCG GGG AGC AGC
                           Met Glu Pro Ala Ala Gly Ser Ser
                           ATG GAG CCG GCG GCG GGG AGC AGC ATG GAG CCt tcg gct
                           Met Glu Pro Ala Ala Gly Ser Ser Met Glu Pro Ser Ala

The ATG translation initiation codon (translational start) is the first atg triplet shown, directly after ggcggcggggagcagc. One group has described the mutation as 23ins24; is this correct? My interpretation of your recent paper suggests I should name it 1_24dup. Could you provide me with the correct nomenclature?

Answer
Correct is c.9_32dup (p.Ala4_Pro11dup) - the description c.1_24dup (p.Met1_Ser8dup) seems correct but please note that for all descriptions the most 3' position possible should be arbitrarily assigned to have been changed (see Recommendations). c.23ins24 is not correct, first because the position of the insertion is not clear (see Discussion), second 'ins24' does not indicate which sequence was inserted.
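
The shift to the most 3' position can be sketched in a few lines of Python. This is a minimal illustration, not an official implementation: `shift_dup_3prime` is a hypothetical helper, positions are 1-based and inclusive, and the coding sequence is the one shown in the question. A duplicated range can move one position 3' whenever the nucleotide directly after the range equals the first nucleotide of the range.

```python
def shift_dup_3prime(seq, start, end):
    """Shift a duplicated range (1-based, inclusive) to its most 3' equivalent position."""
    # seq[start - 1] is the first unit of the range, seq[end] the unit just 3' of it
    while end < len(seq) and seq[end] == seq[start - 1]:
        start += 1
        end += 1
    return start, end

# Coding sequence from the question, starting at the A of the ATG
coding = "ATGGAGCCGGCGGCGGGGAGCAGCATGGAGCCTTCGGCT"
print(shift_dup_3prime(coding, 1, 24))   # (9, 32)

# The same rule applied to the protein (one-letter code)
protein = "MEPAAGSSMEPSA"
print(shift_dup_3prime(protein, 1, 8))   # (4, 11)
```

The results correspond to c.9_32dup and p.Ala4_Pro11dup.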

Question
How should I describe a change where ATCG-ATCGATCGATCG-A-GGGTCCC becomes ATCG-ATCGATCGATCG-A-ATCGATCGATCG-GGGTCCC? The fact that the inserted sequence (ATCGATCGATCG) is present in the original sequence suggests it derives from a duplicative event.

Answer
A correct description of the insertion is c.17_18ins5_16 (see Recommendations). A description using 'dup' is not correct since by definition a duplication is a sequence change where a copy of one or more nucleotides is inserted directly 3' of the original copy (see Standards). Still, the description given makes it clear that the sequence inserted between nucleotides c.17 and c.18 is probably derived from nearby, i.e. position c.5_16, and thus likely derived from a duplicative event.
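
As an illustration, the difference between this insertion and a true duplication can be checked on the sequences from the question. The helper `insert_copied_range` below is hypothetical and only meant as a sketch; positions are 1-based and inclusive.

```python
def insert_copied_range(seq, after_pos, src_start, src_end):
    """Insert a copy of seq[src_start..src_end] directly after position after_pos."""
    copied = seq[src_start - 1:src_end]
    return seq[:after_pos] + copied + seq[after_pos:]

# Sequences from the question, built from the hyphen-separated parts
reference = "ATCG" + "ATCGATCGATCG" + "A" + "GGGTCCC"
observed = "ATCG" + "ATCGATCGATCG" + "A" + "ATCGATCGATCG" + "GGGTCCC"

# c.17_18ins5_16: a copy of positions 5..16 inserted between 17 and 18
print(insert_copied_range(reference, 17, 5, 16) == observed)  # True

# A tandem duplication (c.5_16dup) would place the copy directly after position 16
print(insert_copied_range(reference, 16, 5, 16) == observed)  # False
```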

Question
The 3' end of intron 8 of the CFTR gene contains a variable sequence; IVS8(TG)mTn. The CFTR genomic reference sequence of the end of intron 8 is ...TGTGTGTGTGTTTTTTTAACAG[..exon9..], with a tract of (TG)11 and T7. When we describe this sequence variation as c.1210-14(TG)9-13(T)5-9 and that of the IVS8Tn as c.1210-6(T)5-9, are we right? Is the description of a T5 tract variant as c.1210-14(TG)12T5 correct?

Answer (see Repeated sequences)
A difficult case; please note that following current recommendations it is not a TG11 but a GT11 variant (see Recommendations), overlapping one T-nucleotide with the T7 stretch. However, to prevent confusion it is probably best to use in this exceptional case TG11.
The correct description depends on the reference sequence used. Assuming this reference sequence is as described, i.e. TG11 followed by T7, the TG11 stretch is located at c.1210-34_1210-13 and the T7 stretch at c.1210-12_1210-6. A correct description of the variants is then c.1210-34TG(9_13)T(5_9) (or c.1210-34_1210-33(9_13)T(5_9)). c.1210-34 is used because the variable tract starts at that position.
When only the T stretch is described the correct description is c.1210-12T(5_9). A correct description of the T5 variant is c.1210-12T[5].
NOTE: to indicate the range, "_" must be used and not "-".
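
The intronic positions quoted in this answer follow from counting backwards from the first nucleotide of exon 9 (c.1210). A small Python sketch, assuming the end of intron 8 is exactly (TG)11, T7 and AACAG as stated in the question; `tract_positions` is a hypothetical helper name.

```python
import re

EXON_START = 1210
# Last 34 nucleotides of intron 8 as described in the question: (TG)11, T7, AACAG
intron_tail = "TG" * 11 + "T" * 7 + "AACAG"

def tract_positions(tail, pattern):
    """Return (first, last) offsets of the first match, relative to the exon start."""
    match = re.search(pattern, tail)
    first = match.start() - len(tail)     # the last intron nucleotide is at offset -1
    last = match.end() - 1 - len(tail)
    return first, last

tg_first, tg_last = tract_positions(intron_tail, r"(?:TG)+")
t_first, t_last = tract_positions(intron_tail, r"T{2,}")

print(f"c.{EXON_START}{tg_first}_{EXON_START}{tg_last}")  # c.1210-34_1210-13
print(f"c.{EXON_START}{t_first}_{EXON_START}{t_last}")    # c.1210-12_1210-6
```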

Question
Is the description NM_012345.3:c.123+45_123+51TSDinsL1.603bp acceptable (TSD = target site duplication, L1 indicates the nature of the insert (L1, Alu or SVA) after "ins"; 603bp = the number of inserted base pairs)?

Answer
Following the current recommendations the description should be NM_012345.3:c.123+45_123+51dupinsAB012345.3:g.393_995 (alternatively NM_012345.3:c.123+45_123+51dupins603). So use "dup" (not "TSD") and leave out "bp" (not necessary). The insertion itself is described as AB012345.3:g.393_995, indicating that the inserted sequences are nucleotides 393 to 995 from GenBank file AB012345.3. Adding "(L1)" in the description to indicate the nature of the inserted sequence is not recommended; it might cause confusion. The "Remarks" column of the summary sequence variant Table can be used for this annotation.

Question
How should we, using the most current recommendations, indicate a change in one allele? The notation we envisage should indicate that the other allele has no change compared to the reference sequence. For the unchanged allele "[?]" would not be appropriate since it is not the case that allele 2 has an unknown variant; it simply has no change. The notation "c.[76A>C]" without describing the second allele would be misleading; not enough researchers would be familiar enough with the nomenclature to know that this refers to only one of the two alleles present. Would the description "c.[76A>C];[]" be OK?

Answer
The character used to indicate 'no change' is the '=' (see Recommendations). The recommended description is thus "c.[76A>C];[=]".

Question (Andrew Grimm, Coordinator RettBASE)
When I come across cases where a person has two variants and it isn't known whether or not they are on the same chromosome, how should I describe this?

Answer
Although we do not recommend describing uncertainties, in this case it is clear that to prevent mistakes a recommendation is required. Two changes in one allele should be described as c.[76A>C; 91C>G] and two changes on different alleles as c.[76A>C];[91C>G]. When it is not clear whether the changes are on the same or on different alleles the recommendation is to describe this using the format c.[76A>C(;)91C>G] (see Recommendations).
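
The three formats differ only in how the two changes are joined. A minimal Python sketch of that mapping follows; the function name and the `phase` labels ("cis", "trans", "unknown") are assumptions made for this illustration and are not part of the nomenclature.

```python
def describe_two_variants(first, second, phase):
    """Combine two coding DNA changes according to their allelic phase."""
    if phase == "cis":        # both changes on the same allele
        return f"c.[{first}; {second}]"
    if phase == "trans":      # changes on different alleles
        return f"c.[{first}];[{second}]"
    if phase == "unknown":    # phase not determined
        return f"c.[{first}(;){second}]"
    raise ValueError(f"unknown phase: {phase}")

for phase in ("cis", "trans", "unknown"):
    print(describe_two_variants("76A>C", "91C>G", phase))
```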

Question (Nancy Carson, Ottawa, Canada)
The recommendations for mutation nomenclature give guidelines on the proper nomenclature for recessive diseases where there are two mutations identified in one gene. I have a patient with hearing loss who has a mutation in GJB2 (c.35delG) and a mutation in GJB6 (c.689_690insT). Any suggestions on how I should write this?

Answer
The recommendation is to use the format GJB2:c.[35delG] GJB6:c.[689_690insT] (see Discussion). This format prevents confusion regarding the reference sequence used (i.e. "GJB2:") and combines this with the normal format to describe variants in different alleles. Using the format given it is of course still essential to describe the reference sequence used (GenBank file with version number). Another format, coping with this directly, is to describe the variants as NM_004004.2:c.[35delG] NM_006783.1:c.[689_690insT], i.e. using the GenBank reference sequences instead of the HGNC Gene Symbol.

Question
I study a gene located on the X-chromosome. How should I describe the variants detected in males and females?

Answer
In females the description is straightforward, like "c.[76A>C];[=]". In males there is no second allele (X-chromosome), which can be described as "c.[76A>C];[0]" (see Recommendations).

Question
Detailed analysis of a DMD patient showed that it was a mosaic case; consequently two different nucleotides were found at one position, a G and a C (a G is the normal sequence). How should I describe this?

Answer
Mosaic cases, i.e. two different nucleotides found at one position on one allele (chromosome) should be described as c.[83G=/>C] (see Recommendations).

NOTE: this recommendation was changed twice (Aug. 2010 and Nov. 2015). Initially the suggestion was to describe mosaic cases using c.[=, 83G>C]. This was first changed to c.=/83G>C to follow standards from the ISCN (International System for Human Cytogenetic Nomenclature, see Recent changes), and then to the current format after acceptance of proposal SVD-WG001.

Question (Harriet Meyer, JAMA-archives.org)
The subject of promoter polymorphisms has come up, and I would be grateful for your recommendation of how these should be described.

Answer
For variants in the promoter region it is recommended to describe these in relation to a genomic reference sequence (like L01538.1:g.1407C>T). Describing a promoter variant in relation to a coding DNA reference sequence is possible and should be in relation to the A of the ATG initiation codon, counting backwards to the variant nucleotide; in the example given c.-401C>T indicates a change of the C 401 nucleotides upstream of the ATG (in the promoter) to a T. To be unequivocal, next to the coding DNA reference sequence (to identify the A of the ATG) one should also mention the genomic reference sequence used (to identify the C at -401) or include upstream sequences in the coding DNA reference sequence (see Discussion). This would make it rather complex - one has to retrieve two sequence files. Consequently, it would be much easier to describe the variant directly in relation to the genomic reference sequence. A format which one could use is "L01538.1:g.1407C>T (at -401 of the ATG)".
Please note that it is not correct to provide descriptions in relation to the start site of the mRNA. There is often a debate as to where the RNA exactly starts and one should not describe DNA variants in relation to such a 'variable' site (see Discussion). Of course it is acceptable that the authors mention, between brackets, the approximate position of the change in relation to the promoter.
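
The relation between the two numbering schemes in the promoter example is a simple offset from the A of the ATG, with no position c.0. The sketch below assumes the genomic reference is on the same strand as the gene; the genomic position of the A (1808) is not stated in the answer but derived here from the example pair g.1407 and c.-401, and `upstream_c_position` is a hypothetical helper.

```python
def upstream_c_position(g_pos: int, g_pos_of_atg_a: int) -> str:
    """Coding DNA position of a nucleotide upstream of the ATG (there is no c.0)."""
    if g_pos >= g_pos_of_atg_a:
        raise ValueError("position is not upstream of the ATG")
    return f"c.{g_pos - g_pos_of_atg_a}"

# Derived from the example: g.1407 corresponds to c.-401, so the A of the ATG is at g.1808
G_POS_OF_ATG_A = 1407 + 401

print(upstream_c_position(1407, G_POS_OF_ATG_A))  # c.-401
print(upstream_c_position(1807, G_POS_OF_ATG_A))  # c.-1
```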

RNA level changes

Protein level changes

Question
How should a mutation in the 5'UTR be described that gives rise to a new translation initiation site ?

Answer
Description at the DNA-level should be e.g. c.-23A>T (changing -25 caAggt -20 to caTggt, creating a new ATG-triplet). Description at the RNA-level should be like r.-23a>u and description at the protein level could be like p.Met1extMet-8 (or p.M1extM-8, see Recommendation protein level). This indicates that due to a variant the protein sequence becomes extended N-terminally by the addition of 8 new amino acids. Note that descriptions on RNA and protein level should only be given when this was experimentally verified; if not, changes should be placed between brackets to indicate that it is a prediction only.
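
The number of added amino acids follows from the position of the new ATG: the substituted nucleotide at c.-23 becomes the T of the new triplet, so its A lies at c.-24. A minimal Python sketch, with the hypothetical helper `n_terminal_extension`, that assumes no stop codon lies between the new ATG and the original one:

```python
def n_terminal_extension(new_atg_a_pos):
    """Codons added N-terminally by a new ATG whose A lies at c.new_atg_a_pos (negative).

    Returns None when the new ATG is not upstream or not in frame with the original ATG.
    """
    upstream = -new_atg_a_pos  # nucleotides from the new A up to and including c.-1
    if new_atg_a_pos >= 0 or upstream % 3 != 0:
        return None
    return upstream // 3

print(n_terminal_extension(-24))  # 8
print(n_terminal_extension(-25))  # None
```

With the new A at c.-24 the extension is 8 codons, in line with the Met-8 in the protein description.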

Question (Dean J. Danner, Atlanta, USA)
We are characterizing mutations in nuclear encoded proteins that function in mitochondria. The problem is in proteins that have amino terminal mitochondrial signal peptides. The current rules for proteins say to start numbering with the initiating methionine. However, the functional protein has this target peptide removed and therefore many investigators begin numbering at the amino acid residue of the mature protein. Mutations that result in changes in the targeting peptide suggest that numbering should begin with the Met-1. An alternative would be to give the targeting peptide negative numbers as in the nucleic acids upstream of the transcriptional start site. It would be helpful to have some rules for consistency in the field.

Answer
As already suggested in your question, protein reference sequences should always represent the complete primary translation product, not a processed mature or functional protein (see Recommendations).

Question (Sven Arnold, Austria)
There are several examples you give where changes affecting a series of amino acids are described using the most 3' amino acid. Does this also apply when it is known exactly which amino acid is affected? Example; the sequence ATGTCAAGCTCT codes for MetSerSerSer. An insertion of AGT (c.9_10insAGT) gives ATGTCAAGCAGTTCT, coding for MetSerSerSerSer. Looking at the protein sequence you would describe the change as p.Ser4dup. Knowing the nucleotide changes, it would be accurately described as p.Ser3_Ser4insSer. My question is, do we describe the protein change as it appears, or do we try and describe it according to the (known) underlying DNA change?

Answer
Descriptions at protein level should describe the changes observed on protein level - one should not try to incorporate knowledge regarding the change at DNA-level (see Recommendations). As a consequence, the amino acid change described may be caused by a change which at DNA level lies several nucleotides upstream, like in the example you give. Another example is that where a frame shift deletion at DNA level does not immediately affect the protein sequence.
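
To illustrate why DNA-level knowledge cannot be carried into the protein description, the sketch below translates the example from the question and shows that three different DNA insertions give the same protein, which is then described at the most 3' position. It is an illustration only: the codon table is deliberately partial, the helper names are hypothetical, and residues are numbered with Met as 1, so the serines are residues 2 to 4.

```python
# Partial codon table: only the codons that occur in this example
CODONS = {"ATG": "Met", "TCA": "Ser", "AGC": "Ser", "AGT": "Ser", "TCT": "Ser"}

def translate(dna):
    """Translate a coding sequence whose length is a multiple of three."""
    return [CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3)]

def describe_single_residue_gain(ref, var):
    """Describe one extra residue, placed as far C-terminal (3') as possible."""
    prefix = 0
    while prefix < len(ref) and ref[prefix] == var[prefix]:
        prefix += 1
    extra = var[prefix]
    if var[prefix + 1:] != ref[prefix:]:
        raise ValueError("not a single-residue gain")
    if prefix > 0 and ref[prefix - 1] == extra:
        return f"p.{extra}{prefix}dup"
    if 0 < prefix < len(ref):
        return f"p.{ref[prefix - 1]}{prefix}_{ref[prefix]}{prefix + 1}ins{extra}"
    raise ValueError("terminal insertions are outside this sketch")

reference = "ATGTCAAGCTCT"
dna_variants = [
    reference[:9] + "AGT" + reference[9:],   # c.9_10insAGT, as in the question
    reference[:6] + "TCA" + reference[6:],   # a different DNA insertion
    reference + "TCT",                       # yet another one
]
descriptions = {describe_single_residue_gain(translate(reference), translate(v))
                for v in dna_variants}
print(descriptions)  # {'p.Ser4dup'}
```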

Question
When a protein description does not contain "fs" (frame shift) does this mean there is no frame shift?

Answer
By definition frame shifts are a special type of amino acid deletion/insertion replacing the normal C-terminal sequence with one encoded by another reading frame (specified 2013-03-16, see Describing protein variants). Descriptions at protein level describe the consequences of a change on the protein irrespective of the changes at DNA or RNA level. Translating back from protein to DNA (or RNA) is therefore difficult and usually only works for simple cases like substitutions. Examples of what one might call frame shifts that cannot be seen from the protein description include:

- deletions at DNA level that lead to an immediate stop codon, e.g. c.4delC in 5'-AAT CTG AGC-3' gives a predicted protein change p.Leu2Ter (L2*)
- variants at DNA level that introduce a frame shift which is followed by another variant on the same allele shifting the reading frame back to normal (before the shifted frame encountered a translation termination codon); these are described as a deletion/insertion (delins)
- no-stop changes (see Describing protein variants), i.e. variants affecting the translation termination codon (Ter/*1) introducing a new downstream termination codon extending the C-terminus of the encoded protein; these are described as an extension.
NOTE: since technically there is no reading frame after the translation termination codon there is also no shifted frame.
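
The first case in the list above can be reproduced directly: deleting the C at c.4 of AAT CTG AGC turns the second codon into TGA. The Python sketch below uses a deliberately partial codon table covering only the codons in this example; `translate` is a hypothetical helper that stops at the first termination codon.

```python
# Partial codon table: only the codons that occur in this example
CODONS = {"AAT": "Asn", "CTG": "Leu", "AGC": "Ser", "TGA": "Ter"}

def translate(dna):
    """Translate complete codons, stopping at the first termination codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        residue = CODONS[dna[i:i + 3]]
        protein.append(residue)
        if residue == "Ter":
            break
    return protein

reference = "AATCTGAGC"
variant = reference[:3] + reference[4:]  # c.4del: remove the fourth nucleotide (C)

print(translate(reference))  # ['Asn', 'Leu', 'Ser']
print(translate(variant))    # ['Asn', 'Ter']
```

The variant protein ends at the second codon, so the protein description shows a stop at position 2 without any "fs".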

Question (Giampaolo Trivellin, London)
How should we describe the consequences of a duplication of a G at DNA level that causes a frameshift at the protein level where the shifted frame does not encounter a new stop codon. I was thinking to describe it using the short description (e.g. p.Ile327fs), but we expect that the protein is not formed since the aberrant RNA will be degraded lacking a stop codon.

Answer
The description p.(Ile327fs) can indeed be used (please note the brackets to indicate RNA was not yet analysed). However, it circumvents the problem that the current recommendations did not yet indicate how to describe a frame shift that does not encounter a stop codon. The recommendation is to describe this using "fs*?", so p.(Ile327Argfs*?) (see Recommendations). The "?" indicates uncertainty, in this case that the position of the stop codon is not known. When you have analysed RNA and it is indeed undetectable (degraded) the RNA description would be r.0 (no RNA) and the protein description p.0 (no protein).