Frequently asked questions regarding the description of sequence variants
Since references to WWW-sites are not yet acknowledged as citations, please mention den Dunnen JT and Antonarakis SE (2000). Hum.Mutat. 15:7-12 when referring to these pages.
This page gives an overview of the questions we have received regarding the description of sequence variations based on the existing recommendations, published by den Dunnen and Antonarakis (2000, Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Human Mutation 15: 7-12, copy in PDF-format).
For reactions, e-mail ddunnen@LUMC.nl and Stylianos.Antonarakis@medecine.unige.ch
Question (Marco Montagna, Padua,
Italia)
Recently, I have been involved in the molecular characterization of BRCA1 gene
rearrangements that are becoming more and more frequent in breast/ovarian cancer families.
Most often these rearrangements are mediated by Alu sequences with a very high homology
that reaches 100% in the breakpoint region. I looked at the reference papers on mutation
nomenclature, but I still have some doubts on how to define this kind of mutation. In
particular, if a genomic deletion is mediated by Alu sequences that are identical over
a large nucleotide stretch containing the breakpoint, what nucleotide should be indicated?
Could I indicate the most 3' one (considering the "sense" strand), similarly to
the rule for deletions in repeated sequences? Moreover, if more than one genomic sequence
is present in GenBank, which one should be considered? For instance, for a rearrangement
that deletes a genomic region of 20kb containing exon 1 and the upstream sequence, with a
breakpoint occurring over a stretch of nucleotides that are identical in the two
recombining sequences, and a genomic reference sequence of the antisense strand, I would
suggest the following definition: nt.X (the most 5' in the identity region of the
"antisense" reference sequence, i.e. the most 3' in the "sense"
strand) -- nt.Y del 20kb (exon > 1). Would that be fine?
Answer
You touch on two subjects: location of the breakpoint and reference
sequence.
Breakpoint: indeed, as you suggest, when
breakpoints occur in stretches of identical sequences the most 3' position (considering
the sense strand) is used to describe the position of the breakpoint (see Recommendations).
Reference sequence: any reference sequence
is acceptable, provided you specify the one you use (database accession.version number, see Reference sequence discussion). When present, it would be
best to use the genomic Reference Sequence from the RefSeq database. When such a sequence is
not present you should make, annotate and submit one (see
Discussion).
Depending on whether a genomic or a coding DNA reference sequence
is used the final description should have the format; g.1234_7234del (alternative
g.1234_7246del6012) or c.123+45_955-234del (alternative c.123+45_955-234del6012).
Question (Erik-Jan Kamsteeg, Nijmegen,
Nederland)
The recommendations to describe
unknown breakpoints are not exactly clear to me. For example, PCR analysis of a gene on
the X-chromosome shows products for exons 1-3 and no products are detected for exons 4-14
(exon 14 is the last exon of the gene). Since PCR does not work with one primer, we are
not sure whether exons 4 and 14 are completely absent, or only partially. Therefore, using
the first base of exon 4 and the '-?' (see Recommendations)
could be wrong, as could be the last base of exon 14 with a '+?'. Therefore, I would like
to use the last base of exon 3 with '+?' and the last base of exon 13 with a '+?'. What
are your recommendations?
Answer
Literally speaking you are
right and it is best to set the borders as precisely as possible. So when exon 3 is present,
the location of the reverse primer can in fact be used to set the most 5' border (and the
same for the exon 14 primer). Consequently the description could be something like
c.(87+123_88-?)_(923+?_924-98)del. Although precise, one might wonder whether such a
description is attractive; c.88-?_923+?del is as clear (see
Uncertainties).
Please note that, for simplicity, many descriptions are not fully correct. For example,
stop codons are reported as p.Cys123* while one could argue that p.Cys123_Met2376del is
more precise (Met2376 being the last amino acid of the protein).
Question
How should I describe the change TGT GC CA to TGT TG CA?
Can I call it a dinucleotide mutation or is it a deletion/insertion mutation?
Answer
Simply describe it as c.4_5delinsTG (alternatively it can be described c.[4G>T;
5C>G]). Although c.4_5GC>TG is clear and unequivocal, the description as a
deletion/insertion follows the general recommendations more precisely (see Recommendations). Unless c.4G>T or c.5C>G
have been reported as allelic variants, we should assume the change occurred as the
consequence of one mutational event and it can be described as a dinucleotide variant.
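As an illustration, the choice between a substitution and a deletion/insertion description can be sketched in Python. This is a minimal sketch, not an official tool: the function name `describe_change` is hypothetical, it only handles sequences of equal length, and it assumes that the first character of the string is position c.1.

```python
def describe_change(ref: str, alt: str) -> str:
    """Describe an equal-length change as a substitution or a delins.

    Assumption: index 0 of both strings corresponds to position c.1.
    """
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length sequences")

    # 0-based indices of all positions where the two sequences differ
    diffs = [i for i, (r, a) in enumerate(zip(ref, alt)) if r != a]
    if not diffs:
        return "c.="  # no change compared to the reference

    first, last = diffs[0], diffs[-1]
    if first == last:
        # a single changed nucleotide is a substitution
        return f"c.{first + 1}{ref[first]}>{alt[first]}"

    # several changed nucleotides: treat as one event, described as delins
    return f"c.{first + 1}_{last + 1}delins{alt[first:last + 1]}"


print(describe_change("TGTGCCA", "TGTTGCA"))  # c.4_5delinsTG
```

The sketch treats everything between the first and the last differing nucleotide as a single event, which matches the assumption in the answer that the change arose from one mutational event.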
Question (Ron Agatep, Toronto,
Canada)
Several groups have identified a duplication in the CDKN2A locus that has been labeled
in various ways. The mutation is a duplication of the first 24 bp.
The ATG translation initiation codon is underlined (translational start). One group has described the mutation as 23ins24. Is this correct? My interpretation of your recent paper suggests I should name it 1_24dup. Could you provide me with the correct nomenclature?
Answer
Correct is c.9_32dup (p.Ala4_Pro11dup) - the description c.1_24dup (p.Met1_Ser8dup)
seems correct but please note that for all descriptions the most 3' position
possible should be arbitrarily assigned to have been changed (see Recommendations). c.23ins24 is not correct, first
because the position of the insertion is not clear (see
Discussion), second 'ins24' does not indicate which sequence was inserted.
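The rule of assigning the most 3' position can be sketched as a simple loop. The Python below is illustrative only: the function names are hypothetical and the toy sequence is invented for the example (it is not the CDKN2A sequence). A duplicated stretch can be moved one position in the 3' direction whenever the nucleotide directly following it equals its first nucleotide.

```python
def apply_dup(seq: str, start: int, end: int) -> str:
    """Return seq with positions start..end (1-based, inclusive) duplicated in tandem."""
    return seq[:end] + seq[start - 1:end] + seq[end:]


def shift_dup_3prime(seq: str, start: int, end: int) -> tuple[int, int]:
    """Shift a duplication to the most 3' position giving the same result."""
    # seq[end] is the nucleotide following the stretch (0-based index),
    # seq[start - 1] is the first nucleotide of the stretch.
    while end < len(seq) and seq[end] == seq[start - 1]:
        start += 1
        end += 1
    return start, end


toy = "ATGGCGGCGTTA"  # hypothetical reference sequence
start, end = shift_dup_3prime(toy, 3, 5)
print(f"c.{start}_{end}dup")  # c.7_9dup

# Both descriptions produce the same variant sequence.
assert apply_dup(toy, 3, 5) == apply_dup(toy, start, end)
```

In the toy sequence, c.3_5dup and c.7_9dup give an identical result, and only the latter follows the 3' rule, in the same way that c.9_32dup is preferred over c.1_24dup above.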
Question
How should I describe a change where ATCG-ATCGATCGATCG-A-GGGTCCC becomes
ATCG-ATCGATCGATCG-A-ATCGATCGATCG-GGGTCCC? The fact that the inserted
sequence (ATCGATCGATCG) is present in the original sequence suggests it derives from a
duplicative event.
Answer
A correct description of the insertion is c.17_18ins5_16 (see
Recommendations). A description using 'dup' might cause confusion since the rule
is that the duplicated region is indicated before the word "dup"
and not after it (like in c.17_18dup5_16). Still, the description given
makes it clear that the sequence inserted between nucleotides c.17 and c.18 is probably
derived from nearby, i.e. position c.5_16, and thus likely derived from a duplicative
event.
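The description can be checked against the two sequences from the question. The short Python sketch below uses a hypothetical helper, `apply_ins_from_range`, and assumes that the first nucleotide shown is position c.1.

```python
def apply_ins_from_range(ref: str, after: int, src_start: int, src_end: int) -> str:
    """Insert a copy of ref[src_start..src_end] between positions `after` and `after + 1`.

    All positions are 1-based and inclusive.
    """
    inserted = ref[src_start - 1:src_end]
    return ref[:after] + inserted + ref[after:]


# Sequences from the question, with the separating hyphens removed
ref = "ATCG" + "ATCGATCGATCG" + "A" + "GGGTCCC"
alt = "ATCG" + "ATCGATCGATCG" + "A" + "ATCGATCGATCG" + "GGGTCCC"

# c.17_18ins5_16: nucleotides 5 to 16 inserted between positions 17 and 18
print(apply_ins_from_range(ref, 17, 5, 16) == alt)  # True
```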
Question
The 3' end of intron 8 of the CFTR gene contains a variable sequence; IVS8(TG)mTn. The
CFTR genomic reference sequence of the end of intron 8 is
...TGTGTGTGTGTTTTTTTAACAG[..exon9..], with a tract of (TG)11 and T7. When we describe this
sequence variation as c.1210-14(TG)9-13(T)5-9 and that of the IVS8Tn as c.1210-6(T)5-9,
are we right? Is the description of a T5 tract variant as c.1210-14(TG)12T5 correct?
Answer
A difficult case; please note that following current recommendations it is not a TG11
but a GT11 variant, overlapping one T-nucleotide with the T7 stretch. However, to
prevent confusion it is probably best to use in this exceptional case TG11.
The correct description depends on the reference sequence used. Assuming this
reference sequence is as described, i.e. TG11 followed by T7, the TG11 stretch is located
at c.1210-34_1210-13 and T7 stretch at c.1210-12_1210-6. A correct description of the
variants is then c.1210-34GT(9_13)T(4_8); c.1210-34 because the variable tract starts
at that position. When only the T stretch is described the correct notation is
c.1210-12T(5_9). A correct description of the T5 variant is c.1210-34GT(9_13)T5. NOTE:
to indicate the range, "_" must be used and not "-".
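The intronic positions given above can be verified with a few lines of Python. This is a sketch under stated assumptions: the flanking sequence is rebuilt from the description in the answer (TG11 followed by T7 and then AACAG directly before exon 9, since the sequence quoted in the question is abbreviated), the first nucleotide of exon 9 is taken to be c.1210, and the function name `intronic_range` is hypothetical.

```python
def intronic_range(flank: str, start: int, length: int, exon_first: int = 1210) -> str:
    """Return the c.<exon>-<offset> range of a stretch in the 3' end of an intron.

    `flank` ends at the last intronic nucleotide (position exon_first-1);
    `start` is the 0-based index of the stretch within `flank`.
    """
    n = len(flank)
    first_offset = n - start                 # distance of the first base to the exon
    last_offset = n - (start + length - 1)   # distance of the last base to the exon
    return f"c.{exon_first}-{first_offset}_{exon_first}-{last_offset}"


# Assumed reference: (TG)11, then T7, then AACAG, then exon 9
flank = "TG" * 11 + "T" * 7 + "AACAG"

tg_tract = intronic_range(flank, 0, 22)   # the 22 nucleotides of (TG)11
t_tract = intronic_range(flank, 22, 7)    # the 7 nucleotides of T7

print(tg_tract)  # c.1210-34_1210-13
print(t_tract)   # c.1210-12_1210-6
```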
Question
Is the description NM_012345.3:c.123+45_123+51TSDinsL1.603bp acceptable (TSD =
target site duplication, L1 indicates the nature of the insert (L1, Alu or SVA) after
"ins"; 603bp = the number of inserted base pairs)?
Answer
Following the current recommendations the description should be NM_012345.3:c.123+45_123+51dupinsAB012345.3:g.393_1295
(alternatively NM_012345.3:c.123+45_123+51dupins603). So use "dup" (not
"TSD") and leave out "bp" (not necessary). The insertion itself is
described as AB012345.3:g.393_1295, indicating that the inserted sequences are nucleotides
393 to 1295 from GenBank file AB012345.3. Adding "(L1)" in the description to
indicate the nature of the inserted sequence is not recommended, as it might cause confusion.
The "Remarks" column of the summary sequence variant Table can be used for this
annotation.
Question
How should we, using the most current recommendations, indicate a change in one
allele? The notation we envisage should indicate that the other allele has no change
compared to the reference sequence. For the unchanged allele "[?]" would
not be appropriate since it is not the case that allele 2 has an unknown variant; it
simply has no change. The notation "c.[76A>C]" without describing the second
allele would be misleading; not enough researchers would be familiar enough with the
nomenclature to know that this refers to only one of the two alleles present. Would the
description "c.[76A>C];[]" be OK?
Answer
The character used to indicate 'no change' is the '=' (see
Recommendations). The recommended description is thus "c.[76A>C];[=]".
Question (Andrew Grimm,
Coordinator RettBASE)
When I come across cases where a person has two variants and it isn't known whether or not
they are on the same chromosome, how should I describe this?
Answer
Although we do not recommend describing uncertainties, in this case a clear
recommendation is required to prevent mistakes. Two changes in one
allele should be described as c.[76A>C; 91C>G] and two changes on different alleles
as c.[76A>C];[91C>G]. When it is not clear whether the changes are on
the same or on different alleles, the recommendation is to describe this using the format c.[76A>C
(;) 91C>G] (see Recommendations).
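The three allele formats differ only in how the variants are joined, which a small Python sketch makes explicit. The function names are hypothetical and the code simply reproduces the formats given above. It does not validate the variant descriptions themselves.

```python
def same_allele(*variants: str) -> str:
    """Several changes on one allele: one pair of brackets, separated by ';'."""
    return "c.[" + "; ".join(variants) + "]"


def different_alleles(allele1: str, allele2: str) -> str:
    """Changes on two alleles: one pair of brackets per allele."""
    return f"c.[{allele1}];[{allele2}]"


def unknown_phase(*variants: str) -> str:
    """Changes for which it is unknown whether they lie on the same allele."""
    return "c.[" + " (;) ".join(variants) + "]"


print(same_allele("76A>C", "91C>G"))        # c.[76A>C; 91C>G]
print(different_alleles("76A>C", "91C>G"))  # c.[76A>C];[91C>G]
print(unknown_phase("76A>C", "91C>G"))      # c.[76A>C (;) 91C>G]
print(different_alleles("76A>C", "="))      # c.[76A>C];[=]
```

The last line shows the case of a change in one allele only, where '=' marks the unchanged second allele.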
Question (Nancy Carson, Ottawa,
Canada)
The recommendations for mutation nomenclature give guidelines on the proper nomenclature
for recessive diseases where there are two mutations identified in one gene. I have a patient
with hearing loss who has a mutation in GJB2 (c.35delG) and a mutation in GJB6
(c.689_690insT). Any suggestions on how I should write this?
Answer
The recommendation is to use the format GJB2:c.[35delG]; GJB6:c.[689_690insT] (see Discussion). This format prevents confusion regarding
the reference sequence used (i.e. "GJB2:") and combines this with the
normal format to describe variants in recessive diseases (format c.[76C>T];[87G>A]).
Using the format given it is of course still essential to describe the reference sequence
used (GenBank file with version number). Another format, coping with this directly, is to
describe the variants as NM_004004.2:c.[35delG]; NM_006783.1:c.[689_690insT], i.e.
using the GenBank reference sequences instead of the HGNC Gene Symbol.
Question
I study a gene located on the X-chromosome. How should I describe the variants detected in
males and females?
Answer
In females the description is straightforward, like "c.[76A>C];[=]".
In males there is no second allele (X-chromosome), which can be described as "c.[76A>C];[0]"
(see DNA recommendations).
Question
Detailed analysis of a DMD patient showed that it was a mosaic case; consequently
two different nucleotides were found at one position, a G and a C (a G is the normal
sequence). How should I describe this?
Answer
Mosaic cases,
i.e. two different nucleotides found at one position on one allele (chromosome), should be described
as c.[=/83G>C] (see General recommendations).
NOTE: this recommendation was recently changed (Aug. 2010). Initially the suggestion was to describe mosaic cases using c.[=, 83G>C]. This recommendation was changed to follow standards from the ISCN (International System for Human Cytogenetic Nomenclature, see Recent changes).
Question (Harriet Meyer,
JAMA-archives.org)
The subject of promoter polymorphisms has come up, and I would be grateful for your
recommendation of how these should be described.
Answer
For variants in the promoter region it is recommended to describe these in
relation to a genomic reference sequence (like L01538.1:g.1407C>T).
Describing a promoter variant in relation to a coding DNA reference sequence is possible
and should be in relation to the A of the ATG initiation codon, counting backwards to the
variant nucleotide (in the example given c.-401C>T, indicating a change of the C 401
nucleotides upstream of the ATG, in the promoter, to a T). To be unequivocal, next to the
coding DNA reference sequence (to identify the A of the ATG) one should also mention the
genomic reference sequence used (to identify the C at -401) or include upstream sequences
in the coding DNA reference sequence (see Discussion).
This would make it rather complex - one has to retrieve two sequence files.
Consequently, it would be much easier to describe the variant directly in relation to the
genomic reference sequence. A format which one could use is "L01538.1:g.1407C>T
(at -401 of the ATG)".
Please note that it is not correct to provide descriptions in relation
to the start site of the mRNA. There is often a debate as to where the RNA exactly starts
and one should not describe DNA variants in relation to such a 'variable' site. Of course
it is acceptable that the authors mention, between brackets, the approximate position of
the change in relation to the promoter.
Question
How should a mutation in the 5'UTR be described that gives rise to a new translation
initiation site ?
Answer
Description at the DNA-level should be e.g. c.-23A>T (changing -25 caGggt
-19 to caTggt, creating a new ATG-triplet). Description at
the RNA-level should be like r.-23a>u and description at the protein level could be
like p.Met1extMet-8 (or p.M1extM-8, see Recommendation
protein level). This indicates that due to a variant the protein sequence becomes extended
N-terminally by the addition of 8 new amino acids. Note that descriptions on RNA and
protein level should only be given when this was experimentally verified; if not, changes
should be placed between brackets to indicate that it is a prediction only.
Question (Dean J. Danner, Atlanta, USA)
We are characterizing mutations in nuclear encoded proteins that function in
mitochondria. The problem is in proteins that have amino terminal mitochondrial signal
peptides. The current rules for proteins say to start numbering with the initiating
methionine. However, the functional protein has this target peptide removed and therefore
many investigators begin numbering at the amino acid residue of the mature protein.
Mutations that result in changes in the targeting peptide suggest that numbering should
begin with the Met-1. An alternative would be to give the targeting peptide negative
numbers as in the nucleic acids upstream of the transcriptional start site. It would be
helpful to have some rules for consistency in the field.
Answer
As already suggested in your question, protein reference sequences should always represent
the complete primary translation product, not a processed mature or
functional protein (see Recommendations).
Question (Sven Arnold, Austria)
There are several examples you give where changes affecting a series of amino acids are
described using the most 3' amino acid. Does this also apply when it is known exactly
which amino acid is affected? Example; the sequence ATGTCAAGCTCT codes for MetSerSerSer.
An insertion of AGT (c.9_10insAGT) gives ATGTCAAGCAGTTCT, coding for MetSerSerSerSer.
Looking at the protein sequence you would describe the change as p.Ser4dup. Knowing the
nucleotide changes, it would be accurately described as p.Ser3_Ser4insSer. My question is,
do we describe the protein change as it appears, or do we try and describe it according to
the (known) underlying DNA change?
Answer
Descriptions at protein level should describe the changes observed on protein level - one
should not try to incorporate knowledge regarding the change at DNA-level
(see Recommendations). As a consequence,
the amino acid change described may be caused by a change which at DNA level lies several
nucleotides upstream, like in the example you give. Another example is that where a frame
shift deletion at DNA level does not immediately affect the protein sequence.
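The example from the question can be followed in a short Python sketch. It is illustrative only: the codon table is deliberately restricted to the five codons that occur in the example, and the names `CODONS` and `translate` are hypothetical.

```python
# Minimal codon table, limited to the codons used in this example
CODONS = {"ATG": "Met", "TCA": "Ser", "AGC": "Ser", "TCT": "Ser", "AGT": "Ser"}


def translate(cds: str) -> list[str]:
    """Translate a coding sequence codon by codon (three-letter amino acid codes)."""
    return [CODONS[cds[i:i + 3]] for i in range(0, len(cds), 3)]


ref = "ATGTCAAGCTCT"
alt = ref[:9] + "AGT" + ref[9:]  # c.9_10insAGT

print(alt)                        # ATGTCAAGCAGTTCT
print("".join(translate(ref)))    # MetSerSerSer
print("".join(translate(alt)))    # MetSerSerSerSer
```

At the protein level the four Ser residues of the variant are indistinguishable, so the protein sequence alone does not reveal which of them was added. This is why the protein description follows what is observed in the protein and not the underlying DNA change.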
Copyright © HGVS 2007 All Rights Reserved