Frequently asked questions regarding the description of sequence variants
This page gives an overview of the questions we have received regarding the description of sequence variations based on the existing recommendations, published by den Dunnen and Antonarakis (2000, Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Human Mutation 15: 7-12, copy in PDF-format).
For reactions: E-mail (to: HGVSmn @ JohanDenDunnen.nl) or use the HGVS variant description forum.
Montagna, Padua, Italia)
Recently, I have been involved in the molecular characterization of BRCA1 gene rearrangements that are becoming more and more frequent in breast/ovarian cancer families. Most often these rearrangements are mediated by Alu sequences with a very high homology that reaches 100% in the breakpoint region. I looked at the reference papers on mutation nomenclature, but I still have some doubts on how to define this kind of mutation. In particular, if a genomic deletion is mediated by Alu sequences that are identical over a large nucleotide stretch containing the breakpoint, what nucleotide should be indicated? Could I indicate the most 3' one (considering the "sense" strand), similarly to the rule for deletions in repeated sequences? Moreover, if more than one genomic sequence is present in GenBank, which one should be considered? For instance, for a rearrangement that deletes a genomic region of 20kb containing exon 1 and the upstream sequence, with a breakpoint occurring over a stretch of nucleotides that are identical in the two recombining sequences, and a genomic reference sequence of the antisense strand, I would suggest the following definition: nt.X (the most 5' in the identity region of the "antisense" reference sequence, i.e. the most 3' in the "sense" strand) -- nt.Y del 20kb (exon > 1). Would that be fine?
You touch on two subjects; location of the breakpoint and reference sequence.
Breakpoint: indeed, as you suggest, when breakpoints occur in stretches of identical sequence, the most 3' position (considering the sense strand) is used to describe the position of the breakpoint (see Recommendations).
Reference sequence: any reference sequence is acceptable, provided you specify the one you use (database accession.version number, see Reference sequence discussion). When present, it is best to use the genomic Reference Sequence from the RefSeq database. When such a sequence is not present you should make, annotate and submit one (see Discussion).
Depending on whether a genomic or a coding DNA reference sequence is used, the final description should have the format g.1234_7234del (alternative g.1234_7246del6012) or c.123+45_955-234del (alternative c.123+45_955-234del6012).
Kamsteeg, Nijmegen, Nederland)
The recommendations to describe unknown breakpoints are not exactly clear to me. For example, PCR analysis of a gene on the X-chromosome shows products for exons 1-3 and no product is detected for exons 4-14 (exon 14 is the last exon of the gene). Since PCR does not work with one primer, we are not sure whether exon 4 and 14 are completely absent, or only partially. Therefore, using the first base of exon 4 and the '-?' (see Recommendations) could be wrong, as could be the last base of exon 14 with a '+?'. Therefore, I would like to use the last base of exon 3 with '+?' and the last base of exon 13 with a '+?'. What are your recommendations?
Literally speaking you are right, and it is best to set the borders as precisely as possible. So when exon 3 is present, the location of the reverse primer can in fact be used to set the most 5' border (and the same for the exon 14 primer). Consequently the description could be something like c.(87+123_88-?)_(923+?_924-98)del. Although this is precise, one might wonder whether such a description is attractive; c.(87+1_88-1)_(923+1_924-1)del is as clear (see Uncertainties). When it is difficult to give an exact nucleotide position for a specific probe/sequence tested, a rule of thumb is to use the central nucleotide.
NOTE: for simplicity there are more descriptions that are not fully correct. For example, stop codons are reported as p.Cys123* while one could argue that p.Cys123_Met2376del is more precise (Met2376 being the last amino acid of the protein).
Is a description like c.EX17del, indicating a deletion of exon 17, still valid?
A description like c.EX17del has never been accepted. Descriptions should indicate the nucleotides affected by the change. Note also that for many genes exon numbering is often not clearly defined and/or not described accurately.
How should I describe the change TGT GC CA to TGT TG CA? Can I call it a dinucleotide mutation or is it a deletion/insertion mutation?
Simply describe it as c.4_5delinsTG (alternatively it can be described c.[4G>T; 5C>G]). Although c.4_5GC>TG is clear and unequivocal, the description as a deletion/insertion follows the general recommendations more precisely (see Recommendations).
At position c.2077_2078 in the BRCA1 gene I have a TA insertion. The published sequence for c.2076_2077 is TG however the individual has a common variant at c.2077 (G>A) and the TA insertion is on that allele. Should I call it c.2076_2077dupTA since I know that is the description of the change on that specific allele or should I call it c.2077_2078insTA which would be the correct description based on the more common sequence at that position.
Summary: the BRCA1 coding DNA reference sequence from position c.2074 to c.2080 is ..CATGACA.. A frequent variant in the population is ..CATAACA.. and the sequence found in the individual is ..CATA TA ACA..
The basic rule is to describe variants in relation to a reference sequence. In this respect, the description c.2076_2077dup (c.2076_2077dupTA) is not correct because the reference sequence does not contain a TA dinucleotide at position c.2076_2077 (it has TG). The description c.2077_2078insTA is also not correct because the change c.2077G>A is neglected, and all changes should be described. So the correct description is c.2077delinsATA (or c.2077delGinsATA).
NOTE: in cases like the above, where frequent variants are present at the site changed it is allowed to describe these individually. c.[4G>T; 5C>G] in the first case, assuming either c.[4G>T] or c.[5C>G] is a known frequent variant. c.[2077G>A; 2077_2078insTA] in the second case with c.2077G>A known as the frequent variant. Of course it is essential in such cases that the variants reside on one allele.
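Both cases above reduce to comparing the variant sequence with the reference and reporting only the part that differs. The following Python fragment is a minimal sketch of that idea, not an official tool: the function name `describe_change` and its `first_pos` parameter are hypothetical, it handles substitutions and deletion/insertions only, and it does not apply the 3' rule for pure insertions, deletions or duplications.

```python
def describe_change(ref: str, alt: str, first_pos: int = 1) -> str:
    """Describe how alt differs from ref as a substitution or a delins.

    first_pos is the coding DNA position of ref[0] (an assumption of this
    sketch; a real description also needs the reference sequence accession).
    """
    # Trim the prefix shared by both sequences.
    start = 0
    while start < min(len(ref), len(alt)) and ref[start] == alt[start]:
        start += 1
    # Trim the shared suffix, without running into the trimmed prefix.
    end_ref, end_alt = len(ref), len(alt)
    while end_ref > start and end_alt > start and ref[end_ref - 1] == alt[end_alt - 1]:
        end_ref -= 1
        end_alt -= 1
    deleted, inserted = ref[start:end_ref], alt[start:end_alt]
    if not deleted or not inserted:
        raise ValueError("identity, pure insertion or pure deletion: not handled here")
    first = first_pos + start
    last = first + len(deleted) - 1
    if len(deleted) == 1 and len(inserted) == 1:
        return f"c.{first}{deleted}>{inserted}"
    span = str(first) if first == last else f"{first}_{last}"
    return f"c.{span}delins{inserted}"


# First case: TGT GC CA changed to TGT TG CA, numbered from c.1.
print(describe_change("TGTGCCA", "TGTTGCA"))            # c.4_5delinsTG
# Second case: c.2074_2080 is CATGACA, the individual has CATATAACA.
print(describe_change("CATGACA", "CATATAACA", 2074))    # c.2077delinsATA
```

Because the comparison is made against the reference sequence only, the frequent variant c.2077G>A is automatically included in the description rather than neglected.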
Agatep, Toronto, Canada)
Several groups have identified a duplication in the CDKN2A locus that has been labeled in various ways. The mutation is a duplication of the first 24 bp
The ATG translation initiation codon is underlined (translational start). One group has described the mutation as 23ins24; is this correct? My interpretation of your recent paper suggests I should name it 1_24dup. Could you provide me with the correct nomenclature?
Correct is c.9_32dup (p.Ala4_Pro11dup) - the description c.1_24dup (p.Met1_Ser8dup) seems correct but please note that for all descriptions the most 3' position possible should be arbitrarily assigned to have been changed (see Recommendations). c.23ins24 is not correct, first because the position of the insertion is not clear (see Discussion), second 'ins24' does not indicate which sequence was inserted.
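The shift from c.1_24dup to c.9_32dup follows from the 3' rule: a duplicated range can be moved one position 3' whenever the nucleotide directly after the range equals the first nucleotide of the range, because both descriptions then yield the same sequence. Below is a minimal Python sketch of that shifting step. The function names and the short test sequence are made up for illustration; it is not the CDKN2A sequence.

```python
from typing import Tuple


def apply_dup(ref: str, start: int, end: int) -> str:
    """Return ref with positions start..end (1-based, inclusive) duplicated."""
    return ref[:end] + ref[start - 1:end] + ref[end:]


def shift_dup_3prime(ref: str, start: int, end: int) -> Tuple[int, int]:
    """Shift a duplicated range to its most 3' equivalent position."""
    # ref[end] is the base directly 3' of the range (0-based indexing),
    # ref[start - 1] is the first base of the range.
    while end < len(ref) and ref[end] == ref[start - 1]:
        start += 1
        end += 1
    return start, end


# Hypothetical sequence in which "ATG" at positions 3_5 lies in a repeat.
toy = "CCATGATGATTC"
print(shift_dup_3prime(toy, 3, 5))                      # (8, 10)
print(apply_dup(toy, 3, 5) == apply_dup(toy, 8, 10))    # True
```

In this toy case the duplication would be reported as c.8_10dup, not c.3_5dup, although both produce the same variant sequence.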
How should I describe a change where ATCG-ATCGATCGATCG-A-GGGTCCC becomes ATCG-ATCGATCGATCG-A-ATCGATCGATCG-GGGTCCC? The fact that the inserted sequence (ATCGATCGATCG) is present in the original sequence suggests it derives from a duplicative event.
A correct description of the insertion is c.17_18ins5_16 (see Recommendations). A description using 'dup' is not correct since, by definition, a duplication is a sequence change where a copy of one or more nucleotides is inserted directly 3' of the original copy (see Standards). Still, the description given makes it clear that the sequence inserted between nucleotides c.17 and c.18 is probably derived from nearby, i.e. position c.5_16, and thus likely derived from a duplicative event.
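The distinction between 'dup' and 'ins' made in this answer can be checked mechanically: the inserted sequence is a duplication only when an identical copy lies directly 5' of the insertion site. The Python sketch below illustrates that test. The function name `describe_insertion` is hypothetical, the choice of the nearest upstream copy as the source range is an assumption of this sketch, and no 3' shifting is attempted.

```python
def describe_insertion(ref: str, after: int, inserted: str) -> str:
    """Describe `inserted` placed between 1-based positions after and after+1."""
    n = len(inserted)
    # A duplication requires the copy to sit directly 3' of the original.
    if after >= n and ref[after - n:after] == inserted:
        first = after - n + 1
        span = str(first) if n == 1 else f"{first}_{after}"
        return f"c.{span}dup"
    # Otherwise, for longer inserts, look for the nearest copy upstream
    # of the insertion site and refer to it by its position.
    if n > 1:
        hit = ref.rfind(inserted, 0, after)
        if hit != -1:
            return f"c.{after}_{after + 1}ins{hit + 1}_{hit + n}"
    return f"c.{after}_{after + 1}ins{inserted}"


# The sequence from the question, hyphens removed.
ref = "ATCGATCGATCGATCGAGGGTCCC"
print(describe_insertion(ref, 17, "ATCGATCGATCG"))   # c.17_18ins5_16
```

The nucleotides directly 5' of the insertion site (c.6_17) read TCGATCGATCGA, not ATCGATCGATCG, which is why the sketch falls through to the 'ins' form.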
The 3' end of intron 8 of the CFTR gene contains a variable sequence, IVS8(TG)mTn. The CFTR genomic reference sequence of the end of intron 8 is ...TGTGTGTGTGTTTTTTTAACAG[..exon9..], with a tract of (TG)11 and T7. When we describe this sequence variation as c.1210-14(TG)9-13(T)5-9 and that of the IVS8Tn as c.1210-6(T)5-9, are we right? Is the description of a T5 tract variant as c.1210-14(TG)12T5 correct?
Answer (see Repeated
A difficult case; please note that following current recommendations it is not a TG11 but a GT11 variant (see Recommendations), overlapping one T-nucleotide with the T7 stretch. However, to prevent confusion it is probably best to use in this exceptional case TG11.
The correct description depends on the reference sequence used. Assuming this reference sequence is as described, i.e. TG11 followed by T7, the TG11 stretch is located at c.1210-34_1210-13 and the T7 stretch at c.1210-12_1210-6. A correct description of the variants is then c.1210-34TG(9_13)T(4_8) (or c.1210-34_1210-33(9_13)T(4_8)). c.1210-34 is used because the variable tract starts at that position.
When only the T stretch is described the correct description is c.1210-12T(5_9). A correct description of the T5 variant is c.1210-12T.
NOTE: to indicate the range, "_" must be used and not "-".
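The positions quoted above follow from counting backwards from the last nucleotide of the intron, which is c.1210-1. As a worked check, the Python sketch below recomputes them under the assumption stated in the answer, i.e. a reference with TG11 followed by T7, and the further assumption (taken from the sequence shown in the question) that the five nucleotides AACAG lie between the T tract and exon 9. The function name is hypothetical.

```python
from typing import Dict, List, Tuple


def intron_tail_positions(exon_start: int,
                          segments: List[Tuple[str, int]]) -> Dict[str, str]:
    """Locate consecutive segments that end at the last intron nucleotide.

    segments lists (unit, repeat count) in 5' to 3' order; the final
    segment ends at offset -1 relative to the first exon nucleotide.
    """
    positions = {}
    offset = -1  # the last nucleotide of the intron is c.<exon_start>-1
    for unit, count in reversed(segments):
        length = len(unit) * count
        first = offset - length + 1
        positions[unit] = f"c.{exon_start}{first}_{exon_start}{offset}"
        offset = first - 1
    return positions


tail = intron_tail_positions(1210, [("TG", 11), ("T", 7), ("AACAG", 1)])
print(tail["TG"])   # c.1210-34_1210-13
print(tail["T"])    # c.1210-12_1210-6
```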
Is the description NM_012345.3:c.123+45_123+51TSDinsL1.603bp acceptable (TSD = target site duplication, L1 indicates the nature of the insert (L1, Alu or SVA) after "ins"; 603bp = the number of inserted base pairs) ?.
Following the current recommendations the description should be NM_012345.3:c.123+45_123+51dupinsAB012345.3:g.393_1295 (alternatively NM_012345.3:c.123+45_123+51dupins603). So use "dup" (not "TSD") and leave out "bp" (not necessary). The insertion itself is described as AB012345.3:g.393_1295, indicating that the inserted sequences are nucleotides 393 to 1295 from GenBank file AB012345.3. Adding "(L1)" in the description to indicate the nature of the inserted sequence is not recommended, since it might cause confusion. The "Remarks" column of the summary sequence variant Table can be used for this annotation.
How should we, using the most current recommendations, indicate a change in one allele? The notation we envisage should indicate that the other allele has no change compared to the reference sequence. For the unchanged allele "[?]" would not be appropriate since it is not the case that allele 2 has an unknown variant; it simply has no change. The notation "c.[76A>C]" without describing the second allele would be misleading; not enough researchers would be familiar enough with the nomenclature to know that this refers to only one of the two alleles present. Would the description "c.[76A>C];" be OK?
The character used to indicate 'no change' is the '=' (see Recommendations). The recommended description is thus "c.[76A>C];[=]".
Grimm, Coordinator RettBASE)
When I come across cases where a person has two variants and it isn't known whether or not they are on the same chromosome how should I describe this ?.
Although we do not recommend to describe uncertainties, in this case it is clear that to prevent mistakes a recommendation is required. Two changes in one allele should be described as c.[76A>C; 91C>G] and two changes on different alleles as c.[76A>C];[91C>G]. When it is not clear whether the changes are on the same or on different alleles the recommendation is to describe this using the format c.[76A>C(;)91C>G] (see Recommendations).
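The three formats differ only in how the variants are joined inside the square brackets. A minimal Python sketch of that formatting rule follows; the function name `format_alleles` and the phase labels "cis", "trans" and "unknown" are hypothetical names chosen for this illustration.

```python
from typing import List


def format_alleles(variants: List[str], phase: str) -> str:
    """Combine variant descriptions according to what is known about phase."""
    if phase == "cis":        # all changes on one allele
        return "c.[" + "; ".join(variants) + "]"
    if phase == "trans":      # each change on a different allele
        return "c." + ";".join(f"[{v}]" for v in variants)
    if phase == "unknown":    # phase not determined
        return "c.[" + "(;)".join(variants) + "]"
    raise ValueError(f"unknown phase label: {phase}")


print(format_alleles(["76A>C", "91C>G"], "cis"))      # c.[76A>C; 91C>G]
print(format_alleles(["76A>C", "91C>G"], "trans"))    # c.[76A>C];[91C>G]
print(format_alleles(["76A>C", "91C>G"], "unknown"))  # c.[76A>C(;)91C>G]
print(format_alleles(["76A>C", "="], "trans"))        # c.[76A>C];[=]
```

The last call shows that an unchanged second allele, described with "=", fits the same pattern.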
Carson, Ottawa, Canada)
The recommendations for mutation nomenclature give guidelines on the proper nomenclature for recessive diseases where there are two mutations identified in one gene. I have a patient with hearing loss who has a mutation in GJB2 (c.35delG) and a mutation in GJB6 (c.689_690insT). Any suggestions on how I should write this?
The recommendation is to use the format GJB2:c.[35delG] GJB6:c.[689_690insT] (see Discussion). This format prevents confusion regarding the reference sequence used (i.e. "GJB2:") and combines this with the normal format to describe variants in different alleles. Using the format given it is of course still essential to describe the reference sequence used (GenBank file with version number). Another format, coping with this directly, is to describe the variants as NM_004004.2:c.[35delG] NM_006783.1:c.[689_690insT], i.e. using the GenBank reference sequences instead of the HGNC Gene Symbol.
I study a gene located on the X-chromosome. How should I describe the variants detected in males and females?
In females the description is straightforward, like "c.[76A>C];[=]". In males there is no second allele (X-chromosome), which can be described as "c.[76A>C];" (see Recommendations).
Detailed analysis of a DMD patient showed that it was a mosaic case; consequently two different nucleotides were found at one position, a G and a C (a G is the normal sequence). How should I describe this?
Mosaic cases, i.e. two different nucleotides found at one position on one allele (chromosome) should be described as c.[83G=/>C] (see Recommendations).
NOTE: this recommendation was changed twice (Aug. 2010 and Nov. 2015). Initially the suggestion was to describe mosaic cases using c.[=, 83G>C]. This was first changed to c.=/83G>C to follow standards from the ISCN (International System for Human Cytogenetic Nomenclature, see Recent changes), and changed again after acceptance of proposal SVD-WG001.
The subject of promoter polymorphisms has come up, and I would be grateful for your recommendation of how these should be described.
For variants in the promoter region it is recommended to describe these in relation to a genomic reference sequence (like L01538.1:g.1407C>T). Describing a promoter variant in relation to a coding DNA reference sequence is possible and should be in relation to the A of the ATG initiation codon, counting backwards to the variant nucleotide; in the example given, c.-401C>T indicates a change of the C 401 nucleotides upstream of the ATG (in the promoter) to a T. To be unequivocal, next to the coding DNA reference sequence (to identify the A of the ATG) one should also mention the genomic reference sequence used (to identify the C at -401) or include upstream sequences in the coding DNA reference sequence (see Discussion). This would make it rather complex, since one has to retrieve two sequence files. Consequently, it would be much easier to describe the variant directly in relation to the genomic reference sequence. A format which one could use is "L01538.1:g.1407C>T (at -401 of the ATG)".
Please note that it is not correct to provide descriptions in relation to the start site of the mRNA. There is often a debate as to where the RNA exactly starts and one should not describe DNA variants in relation to such a 'variable' site (see Discussion). Of course it is acceptable that the authors mention, between brackets, the approximate position of the change in relation to the promoter.
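Counting backwards from the A of the ATG has one pitfall: there is no position c.0, so the nucleotide directly 5' of the A is c.-1. The Python sketch below shows the arithmetic. It assumes that no intron lies between the variant and the ATG, and it uses g.1808 as the genomic position of the A; that value is not stated in the answer but is implied by its example (g.1407 corresponding to c.-401). The function name is hypothetical.

```python
def genomic_to_coding(g_pos: int, g_atg_a: int) -> str:
    """Coding DNA position of genomic position g_pos.

    g_atg_a is the genomic position of the A of the ATG initiation codon.
    Valid only when no intron lies between g_pos and the ATG.
    """
    offset = g_pos - g_atg_a
    # The A of the ATG is c.1 and the nucleotide before it is c.-1;
    # position c.0 does not exist.
    return str(offset + 1) if offset >= 0 else str(offset)


# Assumed position of the A of the ATG, derived from the example above.
ATG_A = 1808
print(f"g.1407C>T corresponds to c.{genomic_to_coding(1407, ATG_A)}C>T")
```

The need to know `g_atg_a` is exactly the second sequence file mentioned in the answer, which is why describing the variant directly on the genomic reference sequence is simpler.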
How should a mutation in the 5'UTR be described that gives rise to a new translation initiation site ?
Description at the DNA-level should be e.g. c.-23A>T (changing -25 caGggt -19 to caTggt, creating a new ATG-triplet). Description at the RNA-level should be like r.-23a>u and description at the protein level could be like p.Met1extMet-8 (or p.M1extM-8, see Recommendation protein level). This indicates that due to a variant the protein sequence becomes extended N-terminally by the addition of 8 new amino acids. Note that descriptions on RNA and protein level should only be given when this was experimentally verified; if not, changes should be placed between brackets to indicate that it is a prediction only.
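The "-8" in p.Met1extMet-8 can be derived from the position of the new ATG. A new initiation codon extends the protein only when it is in frame with the original ATG, i.e. when the number of nucleotides from its A up to and including c.-1 is a multiple of three. The Python sketch below performs that check. It assumes that the substituted nucleotide c.-23 is the T of the new ATG, so that its A is at c.-24, and it does not test for an in-frame stop codon between the new and the original ATG. The function name is hypothetical.

```python
from typing import Optional


def n_terminal_extension(new_atg_a: int) -> Optional[int]:
    """Number of amino acids added by a new in-frame upstream ATG.

    new_atg_a is the (negative) coding DNA position of the A of the new
    ATG. Returns None when the new ATG is out of frame.
    """
    if new_atg_a >= 0:
        raise ValueError("expected a position upstream of the ATG (negative)")
    upstream = -new_atg_a  # nucleotides from the new A up to c.-1
    return upstream // 3 if upstream % 3 == 0 else None


added = n_terminal_extension(-24)
print(f"p.Met1extMet-{added}")   # p.Met1extMet-8
```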
Question (Dean J. Danner, Atlanta, USA)
We are characterizing mutations in nuclear encoded proteins that function in mitochondria. The problem is in proteins that have amino terminal mitochondrial signal peptides. The current rules for proteins say to start numbering with the initiating methionine. However, the functional protein has this target peptide removed and therefore many investigators begin numbering at the amino acid residue of the mature protein. Mutations that result in changes in the targeting peptide suggest that numbering should begin with the Met-1. An alternative would be to give the targeting peptide negative numbers as in the nucleic acids upstream of the transcriptional start site. It would be helpful to have some rules for consistency in the field.
As already suggested in your question, protein reference sequences should always represent the complete primary translation product, not a processed mature or functional protein (see Recommendations).
Question (Sven Arnold,
There are several examples you give where changes affecting a series of amino acids are described using the most 3' amino acid. Does this also apply when it is known exactly which amino acid is affected? Example; the sequence ATGTCAAGCTCT codes for MetSerSerSer. An insertion of AGT (c.9_10insAGT) gives ATGTCAAGCAGTTCT, coding for MetSerSerSerSer. Looking at the protein sequence you would describe the change as p.Ser3dup. Knowing the nucleotide changes, it would be accurately described as p.Ser2_Ser3insSer. My question is, do we describe the protein change as it appears, or do we try and describe it according to the (known) underlying DNA change?
Descriptions at protein level should describe the changes observed on protein level - one should not try to incorporate knowledge regarding the change at DNA-level (see Recommendations). As a consequence, the amino acid change described may be caused by a change which at DNA level lies several nucleotides upstream, like in the example you give. Another example is that where a frame shift deletion at DNA level does not immediately affect the protein sequence.
When a protein description does not contain "fs" (frame shift) does this mean there is no frame shift?
By definition frame shifts are a special type of amino acid deletion/insertion replacing the normal C-terminal sequence with one encoded by another reading frame (specified 2013-03-16, see Describing protein variants). Descriptions at protein level describe the consequences of a change on the protein irrespective of the changes at DNA or RNA level. Translating back from protein to DNA (or RNA) is therefore difficult and usually only works for simple cases like substitutions. Examples of what one might call frame shifts that cannot be seen from the protein description include;
How should we describe the consequences of a duplication of a G at DNA level that causes a frameshift at the protein level where the shifted frame does not encounter a new stop codon. I was thinking to describe it using the short description (e.g. p.Ile327fs), but we expect that the protein is not formed since the aberrant RNA will be degraded lacking a stop codon.
The description p.(Ile327fs) can indeed be used (please note the brackets to indicate RNA was not yet analysed). It circumvents, however, the problem that the current recommendations did not yet indicate how to describe a frame shift that does not encounter a stop codon. The recommendation is to describe this using "fs*?", so p.(Ile327Argfs*?) (see Recommendations). The "?" indicates uncertainty, in this case that the position of the stop codon is not known. When you have analysed RNA and it is indeed undetectable (degraded) the RNA description would be r.0 (no RNA) and the protein description p.0 (no protein).