Frequently asked questions regarding the description of sequence variants
This page gives an overview of the questions we have received regarding the description of sequence variants based on the existing recommendations (published by den Dunnen and Antonarakis, 2000, Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Human Mutation 15: 7-12, copy in PDF-format).
For reactions: E-mail (to: HGVSmn @ JohanDenDunnen.nl) or use the HGVS variant description forum.
Question (Marco Montagna, Padua, Italy)
Recently, I have been involved in the molecular characterization of
BRCA1 gene rearrangements that are becoming more and more frequent in
breast/ovarian cancer families. Most often these rearrangements are
mediated by Alu sequences with a very high homology that reaches 100% in
the breakpoint region. I looked at the reference papers on mutation
nomenclature, but I still have some doubts on how to define this kind of
mutation. In particular, if a genomic deletion is mediated by Alu
sequences that are identical over a large nucleotide stretch
containing the breakpoint, which nucleotide should be indicated?
Could I indicate the most 3' one (considering the "sense" strand),
similarly to the rule for deletions in repeated sequences? Moreover, if
more than one genomic sequence is present in GenBank, which one should
be considered? For instance, for a rearrangement that deletes a
genomic region of 20kb containing exon 1 and the upstream sequence, with
a breakpoint occurring over a stretch of nucleotides that are identical
in the two recombining sequences, and a genomic reference sequence of
the antisense strand, I would suggest the following definition:
nt.X (the most 5' in the identity region of the "antisense"
reference sequence, i.e. the most 3' in the "sense" strand) --
nt.Y del 20kb (exon > 1). Would that be fine?
Answer
You touch on two subjects: the location of the breakpoint and the reference
sequence.
Breakpoint: indeed, as you suggest, when breakpoints occur in stretches of
identical sequence the most 3' position (considering the sense strand) is
used to describe the position of the breakpoint (see Recommendations).
Reference sequence: any reference sequence is acceptable, as long as you
specify the one you use (database accession.version number, see
Reference sequence discussion). When present, it is best
to use the genomic Reference Sequence from the RefSeq
database. When such a sequence is not present you should make,
annotate and submit one (see Discussion).
Depending on whether a genomic or a coding DNA
reference sequence is used, the final description should have the format
g.1234_7245del (alternative g.1234_7245del6012) or c.123+45_955-234del
(alternative c.123+45_955-234del6012).
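The length given in the alternative format can be checked against the range, because both positions of a range are inclusive. A minimal sketch in Python (the function name `deletion_length` is ours, used for illustration only):

```python
def deletion_length(start: int, end: int) -> int:
    """Number of nucleotides in the inclusive range start_end."""
    if end < start:
        raise ValueError("end must not precede start")
    return end - start + 1


# A deletion of 6012 nucleotides that starts at g.1234 ends at g.7245.
print(deletion_length(1234, 7245))  # 6012
```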
Question (Erik-Jan Kamsteeg, Nijmegen, the Netherlands)
The recommendations to describe unknown breakpoints are not exactly
clear to me. For example, PCR analysis of a gene on the X-chromosome
shows products for exons 1-3 and no product is detected for exons 4-14
(exon 14 is the last exon of the gene). Since PCR does not work with one
primer, we are not sure whether exon 4 and 14 are completely absent, or
only partially. Therefore, using the first base of exon 4 and the '-?' (see
Recommendations) could be wrong, as could be the last base of exon
14 with a '+?'. Therefore, I would like to use the last base of exon 3
with '+?' and the last base of exon 13 with a '+?'. What are your
recommendations?
Answer
Strictly speaking you are right, and it is best to set the borders as
precisely as possible. So when exon 3 is present, the location of the
reverse primer can in fact be used to set the most 5' border (and the same
for the exon 14 primer). Consequently the description could be something
like c.(87+123_88-?)_(923+?_924-98)del. Although this is precise, one
might wonder whether such a description is attractive;
c.(87+1_88-1)_(923+1_924-1)del is as clear (see
Uncertainties). When it is
difficult to give an exact nucleotide position for a specific
probe/sequence tested, a rule of thumb is to use the central nucleotide.
NOTE: for simplicity there are more
descriptions that are not fully correct. For example, stop codons are
reported as p.Cys123* while one could argue that p.Cys123_Met2376del is
more precise (Met2376 being the last amino acid of the protein).
Question
Is a description like c.EX17del, indicating a deletion of exon 17, still
valid?
Answer
A description like c.EX17del has never been accepted. Descriptions
should indicate the nucleotides affected by the change. Note also that for
many genes exon numbering is often not clearly defined and/or not
described accurately.
Question
How should I describe the change from TGT GC CA to TGT TG
CA? Can I call it a dinucleotide mutation or is it a deletion /
insertion mutation?
Answer
Simply describe it as c.4_5delinsTG (alternatively it can be described as
c.[4G>T; 5C>G]). Although c.4_5GC>TG is clear and unequivocal,
the description as a deletion/insertion follows the general
recommendations more precisely (see
Recommendations).
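The description can be derived mechanically by trimming the nucleotides that the reference and the changed sequence share at both ends; what remains is the deleted and the inserted sequence. A minimal sketch, assuming the sequence starts at c.1 (the function `describe_delins` and its `first_pos` parameter are hypothetical helpers, and pure insertions or deletions are out of scope):

```python
def describe_delins(ref: str, alt: str, first_pos: int = 1) -> str:
    """Describe the difference between ref and alt as a substitution or delins.

    first_pos is the coding DNA position of ref[0].
    """
    # Trim the shared 5' part.
    prefix = 0
    while prefix < min(len(ref), len(alt)) and ref[prefix] == alt[prefix]:
        prefix += 1
    # Trim the shared 3' part without running into the trimmed prefix.
    suffix = 0
    while (suffix < min(len(ref), len(alt)) - prefix
           and ref[-1 - suffix] == alt[-1 - suffix]):
        suffix += 1
    deleted = ref[prefix:len(ref) - suffix]
    inserted = alt[prefix:len(alt) - suffix]
    if not deleted or not inserted:
        raise ValueError("not a substitution or deletion/insertion")
    start = first_pos + prefix
    end = start + len(deleted) - 1
    if len(deleted) == 1 and len(inserted) == 1:
        return f"c.{start}{deleted}>{inserted}"
    position = f"{start}" if start == end else f"{start}_{end}"
    return f"c.{position}delins{inserted}"


print(describe_delins("TGTGCCA", "TGTTGCA"))  # c.4_5delinsTG
```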
Question
At position c.2077_2078 in the BRCA1 gene I have a TA insertion. The
published sequence for c.2076_2077 is TG; however, the individual has a
common variant at c.2077 (G>A) and the TA insertion is on that
allele. Should I call it c.2076_2077dupTA, since I know that is the
description of the change on that specific allele, or should I call it
c.2077_2078insTA, which would be the correct description based on the
more common sequence at that position?
Summary: the BRCA1 coding DNA reference sequence from position 2074_2080 is ..CATGACA.. A frequent variant in the population is ..CATAACA.. and the sequence found in the individual is ..CATA TA ACA..
Answer
The basic rule is to describe variants in relation to a
reference sequence. In this respect, the description
c.2076_2077dup (c.2076_2077dupTA) is not correct because the
reference sequence does not contain a TA dinucleotide at position
c.2076_2077 (it has TG). The description c.2077_2078insTA is also not
correct because the change c.2077G>A is neglected and all
changes should be described. So the correct description
is c.2077delinsATA (or c.2077delGinsATA).
NOTE: in cases like the above, where frequent variants are present at the changed site, it is allowed to describe these individually: c.[4G>T; 5C>G] in the first case (assuming either c.4G>T or c.5C>G is a known frequent variant) and c.[2077G>A; 2077_2078insTA] in the second case (with c.2077G>A known as the frequent variant). Of course it is essential in such cases that the variants reside on one allele.
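A candidate description can be checked by applying it to the reference sequence and comparing the result with the sequence actually observed. The sketch below does this for the BRCA1 example, using the seven reference nucleotides c.2074_2080 from the summary (the helper `apply_delins` is hypothetical):

```python
def apply_delins(ref: str, first_pos: int, start: int, end: int, inserted: str) -> str:
    """Replace coding DNA positions start..end (inclusive) by inserted.

    first_pos is the coding DNA position of ref[0].
    """
    i = start - first_pos
    j = end - first_pos + 1
    return ref[:i] + inserted + ref[j:]


REF = "CATGACA"         # c.2074_2080 of the reference sequence
OBSERVED = "CATATAACA"  # sequence found in the individual, spaces removed

# Replacing the G at c.2077 by ATA reproduces the observed sequence.
print(apply_delins(REF, 2074, 2077, 2077, "ATA") == OBSERVED)  # True

# A duplication of c.2076_2077 copies the reference nucleotides TG, not TA.
dup_result = REF[:4] + REF[2:4] + REF[4:]
print(dup_result == OBSERVED)  # False

# An insertion of TA alone leaves the G at c.2077 in place.
ins_result = REF[:4] + "TA" + REF[4:]
print(ins_result == OBSERVED)  # False
```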
Question (Ron Agatep, Toronto, Canada)
Several groups have identified a duplication in the CDKN2A locus that
has been labeled in various ways. The mutation is a duplication of the
first 24 bp.
The ATG translation initiation codon is underlined (translational start). One group has described the mutation as 23ins24; is this correct? My interpretation of your recent paper suggests I should name it 1_24dup. Could you provide me with the correct nomenclature?
Answer
Correct is c.9_32dup (p.Ala4_Pro11dup). The description c.1_24dup
(p.Met1_Ser8dup) seems correct, but please note that for all descriptions
the most 3' position possible should be arbitrarily assigned to
have been changed (see Recommendations).
c.23ins24 is not correct: first, because the position of the insertion is
not clear (see Discussion); second, because
'ins24' does not indicate which sequence was inserted.
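The 3' rule for duplications can be sketched as follows: a duplicated segment can be moved one position 3' whenever its first nucleotide equals the nucleotide directly following the segment, because both descriptions then yield the same sequence. The example uses a short hypothetical sequence, not the CDKN2A sequence, and the function name `shift_dup_3prime` is ours:

```python
def shift_dup_3prime(ref: str, start: int, end: int) -> tuple[int, int]:
    """Shift a duplicated range (1-based, inclusive) to its most 3' position."""
    # ref[start - 1] is the first duplicated nucleotide,
    # ref[end] is the nucleotide directly 3' of the duplicated segment.
    while end < len(ref) and ref[start - 1] == ref[end]:
        start += 1
        end += 1
    return start, end


# Hypothetical reference ACGTGTGA: a duplication of positions 3_4 (GT)
# gives the same sequence as a duplication of positions 6_7 (TG).
print(shift_dup_3prime("ACGTGTGA", 3, 4))  # (6, 7)
```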
Question
How should I describe a change where ATCG-ATCGATCGATCG-A-GGGTCCC becomes
ATCG-ATCGATCGATCG-A-ATCGATCGATCG-GGGTCCC? The fact that
the inserted sequence (ATCGATCGATCG) is present in the original sequence
suggests it derives from a duplicative event.
Answer
A correct description of the insertion is c.17_18ins5_16 (see
Recommendations). A description using 'dup' is not correct
since by definition a duplication is a sequence change where a copy of one
or more nucleotides is inserted directly 3' of the
original copy (see
Standards). Still, the description given makes it
clear that the sequence inserted between nucleotides c.17 and c.18 is
probably derived from nearby, i.e. position c.5_16, and thus likely
derived from a duplicative event.
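The distinction between 'dup' and 'ins' can be sketched in code: the change is a duplication only when the inserted sequence equals the nucleotides directly 5' of the insertion site. The helper `describe_insertion` is hypothetical; picking the most 3' upstream copy as the source range is our assumption, and no 3' shifting of the insertion site is attempted.

```python
def describe_insertion(ref: str, after: int, inserted: str) -> str:
    """Describe `inserted` placed between positions after and after+1 (1-based)."""
    n = len(inserted)
    # Duplication: the insert copies the n nucleotides directly 5' of the site.
    if after >= n and ref[after - n:after] == inserted:
        start = after - n + 1
        return f"c.{after}dup" if n == 1 else f"c.{start}_{after}dup"
    # Otherwise look for the most 3' copy upstream of the insertion site.
    idx = ref.rfind(inserted, 0, after)
    if idx != -1:
        return f"c.{after}_{after + 1}ins{idx + 1}_{idx + n}"
    # No copy found: give the inserted sequence itself.
    return f"c.{after}_{after + 1}ins{inserted}"


REF = "ATCGATCGATCGATCGAGGGTCCC"  # the original sequence, dashes removed
print(describe_insertion(REF, 17, "ATCGATCGATCG"))  # c.17_18ins5_16
```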
Question
The 3' end of intron 8 of the CFTR gene contains a variable sequence;
IVS8(TG)mTn. The CFTR genomic reference sequence of the end of intron 8
is ...TGTGTGTGTGTTTTTTTAACAG[..exon9..], with a tract of (TG)11 and T7.
When we describe this sequence variation as c.1210-14(TG)9-13(T)5-9 and
that of the IVS8Tn as c.1210-6(T)5-9, are we right? Is the description
of a T5 tract variant as c.1210-14(TG)12T5 correct?
Answer (see Repeated sequences)
A difficult case; please note that following current recommendations it is
not a TG11 but a GT11 variant (see
Recommendations), overlapping one T-nucleotide
with the T7 stretch. However, to prevent confusion it is probably best to
use in this exceptional case TG11.
The correct description depends on the reference sequence used.
Assuming this reference sequence is as described, i.e. TG11 followed by
T7, the TG11 stretch is located at c.1210-34_1210-13 and T7 stretch at
c.1210-12_1210-6. A correct description of the variants is then
c.1210-34TG(9_13)T(4_8) (or c.1210-34_1210-33(9_13)T(4_8)); c.1210-34 is
used because the variable tract starts at that position.
When only the T stretch is described the correct description is
c.1210-12T(5_9). A correct description of the T5 variant is c.1210-12T[5].
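The tract positions quoted above can be recomputed by counting back from the exon. The sketch assumes the 3' end of intron 8 is written out in full as (TG)11 T7 AACAG and that its last nucleotide is c.1210-1; the helper `intronic` is hypothetical.

```python
import re

# Assumption: TG11 followed by T7 and AACAG, ending directly before exon 9.
INTRON_END = "TG" * 11 + "T" * 7 + "AACAG"


def intronic(index: int, exon_first: int = 1210) -> str:
    """Intronic position of a 0-based index in INTRON_END."""
    return f"{exon_first}-{len(INTRON_END) - index}"


match = re.match(r"((?:TG)+)(T+)", INTRON_END)
tg_start, tg_end = match.span(1)  # end index is exclusive
t_start, t_end = match.span(2)

tg_range = f"c.{intronic(tg_start)}_{intronic(tg_end - 1)}"
t_range = f"c.{intronic(t_start)}_{intronic(t_end - 1)}"
print(tg_range)  # c.1210-34_1210-13
print(t_range)   # c.1210-12_1210-6
```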
NOTE: to indicate the range, "_" must be used and not
"-".
Question
Is the description NM_012345.3:c.123+45_123+51TSDinsL1.603bp
acceptable (TSD = target site duplication, L1 indicates the nature of
the insert (L1, Alu or SVA) after "ins"; 603bp = the number of inserted
base pairs)?
Answer
Following the current recommendations the description should be NM_012345.3:c.123+45_123+51dupinsAB012345.3:g.393_1295
(alternatively NM_012345.3:c.123+45_123+51dupins603). So use "dup"
(not "TSD") and leave out "bp" (not necessary). The insertion itself is
described as AB012345.3:g.393_1295, indicating that the inserted sequences
are nucleotides 393 to 1295 from GenBank file AB012345.3. Adding "(L1)" in
the description to indicate the nature of the inserted sequence is not
recommended; it might cause confusion. The "Remarks" column of the summary
sequence variant Table can be used for this annotation.
Question
How should we, using the most current recommendations, indicate a
change in one allele? The notation we envisage should indicate
that the other allele has no change compared to the reference
sequence. For the unchanged allele "[?]" would not be appropriate,
since it is not the case that allele 2 has an unknown variant; it simply
has no change. The notation "c.[76A>C]" without describing the second
allele would be misleading; not enough researchers would be familiar
enough with the nomenclature to know that this refers to only one of the
two alleles present. Would the description "c.[76A>C];[]" be OK?
Answer
The character used to indicate 'no change' is the '=' (see
Recommendations). The recommended description is thus "c.[76A>C];[=]".
Question (Andrew Grimm, Coordinator RettBASE)
When I come across cases where a person has two variants and it isn't
known whether or not they are on the same chromosome how should I
describe this?
Answer
Although we do not recommend describing uncertainties, in
this case it is clear that a recommendation is required to prevent
mistakes. Two changes in one allele should be described as c.[76A>C;
91C>G] and two changes on different alleles as c.[76A>C];[91C>G].
When it is not clear whether the changes are on the same or
on different alleles the recommendation is to describe this using the
format c.[76A>C(;)91C>G] (see
Recommendations).
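The three formats differ only in how the two variants are joined, which a small helper makes explicit. This is a sketch; the function `describe_pair` and its `phase` labels are our own naming, not part of the recommendations.

```python
def describe_pair(first: str, second: str, phase: str) -> str:
    """Combine two variants of one gene.

    phase: 'cis' (same allele), 'trans' (different alleles) or 'unknown'.
    """
    if phase == "cis":
        return f"c.[{first}; {second}]"
    if phase == "trans":
        return f"c.[{first}];[{second}]"
    if phase == "unknown":
        return f"c.[{first}(;){second}]"
    raise ValueError(f"unknown phase: {phase!r}")


for phase in ("cis", "trans", "unknown"):
    print(describe_pair("76A>C", "91C>G", phase))
```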
Question (Nancy Carson, Ottawa, Canada)
The recommendations for mutation nomenclature give guidelines on the
proper nomenclature for recessive diseases where there are two mutations
identified in one gene. I have a patient with hearing loss who
has a mutation in GJB2 (c.35delG) and a mutation in GJB6
(c.689_690insT). Any suggestions on how I should write this?
Answer
The recommendation is to use the format GJB2:c.[35delG]
GJB6:c.[689_690insT] (see Discussion).
This format prevents confusion regarding the reference sequence used (i.e.
"GJB2:") and combines this with the normal format to describe
variants in different alleles. Using the format given it is of course
still essential to describe the reference sequence used (GenBank file with
version number). Another format, coping with this directly, is to describe
the variants as NM_004004.2:c.[35delG] NM_006783.1:c.[689_690insT],
i.e. using the GenBank reference sequences instead of the HGNC
Gene Symbol.
Question
I study a gene located on the X-chromosome. How should I describe the
variants detected in males and females?
Answer
In females the description is straightforward, like "c.[76A>C];[=]".
In males there is no second allele (X-chromosome), which can
be described as "c.[76A>C];[0]" (see
Recommendations).
Question
Detailed analysis of a DMD patient showed that it was a mosaic case;
consequently two different nucleotides were found at one position, a G
and a C (a G is the normal sequence). How should I describe this?
Answer
Mosaic cases,
i.e. two different nucleotides found at one position on one allele
(chromosome), should be described as c.[83G=/>C] (see
Recommendations).
NOTE: this recommendation was changed twice (Aug. 2010 and Nov. 2015). Initially the suggestion was to describe mosaic cases using c.[=, 83G>C]. This was changed to c.=/83G>C to follow standards from the ISCN (International System for Human Cytogenetic Nomenclature, see Recent changes), and changed again after acceptance of proposal SVD-WG001.
Question (Harriet Meyer, JAMA-archives.org)
The subject of promoter polymorphisms has come up, and I would be
grateful for your recommendation of how these should be described.
Answer
For variants in the promoter region it is recommended to
describe these in relation to a genomic reference sequence
(like L01538.1:g.1407C>T). Describing a promoter variant in relation to
a coding DNA reference sequence is possible and should be in relation to
the A of the ATG initiation codon, counting backwards to the variant
nucleotide; in the example given c.-401C>T indicating a change of the C
401 nucleotides upstream of the ATG (in the promoter), to a T. To be
unequivocal, next to the coding DNA reference sequence (to identify the A
of the ATG) one should also mention the genomic reference sequence used
(to identify the C at -401) or include upstream sequences in the coding
DNA reference sequence (see Discussion).
This would make it rather complex, since one has to retrieve two sequence
files. Consequently, it would be much easier to describe the
variant directly in relation to the genomic reference sequence. A format
which one could use is "L01538.1:g.1407C>T (at -401 of the ATG)".
Please note that it is not correct to provide descriptions in relation to
the start site of the mRNA. There is often a debate as to where the RNA
exactly starts and one should not describe DNA variants in relation to
such a 'variable' site (see Discussion).
Of course it is acceptable that the authors mention, between brackets, the
approximate position of the change in relation to the promoter.
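Because there is no nucleotide c.0, the position upstream of the ATG is simply the distance between the variant and the A of the ATG in the genomic reference sequence. A minimal sketch; the position g.1808 for the A of the ATG is not stated in the text but inferred from g.1407 corresponding to c.-401, and the function name is hypothetical.

```python
def upstream_c_position(g_pos: int, g_atg_a: int) -> str:
    """Coding DNA position of a genomic position located 5' of the ATG.

    g_atg_a is the genomic position of the A of the ATG (c.1).
    """
    if g_pos >= g_atg_a:
        raise ValueError("position is not upstream of the ATG")
    return f"c.-{g_atg_a - g_pos}"


# Assumption: the A of the ATG lies at g.1808 in L01538.1 (1407 + 401).
print(upstream_c_position(1407, 1808))  # c.-401
```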
Question
How should a mutation in the 5'UTR be described that gives rise to a new
translation initiation site?
Answer
Description at the DNA-level should be e.g. c.-23G>T (changing
-25 caGggt -19 to caTggt,
creating a new ATG-triplet). Description at the RNA-level should
be like r.-23g>u and description at the protein level could be like
p.Met1extMet-8 (or p.M1extM-8, see
Recommendation protein level). This indicates that due to a
variant the protein sequence becomes extended N-terminally
by the addition of 8 new amino acids. Note that descriptions on RNA and
protein level should only be given when this was experimentally verified;
if not, changes should be placed between brackets to indicate that it is a
prediction only.
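The number of added amino acids follows from the position of the new ATG: in the example its A lies at c.-24, which is 8 codons upstream of c.1. The sketch below assumes the new ATG is in frame and that no stop codon lies between it and the original ATG; the function name is hypothetical, and the result is given between brackets because it is a prediction.

```python
def n_terminal_extension(new_atg_a: int) -> str:
    """Predicted protein description for a new upstream ATG.

    new_atg_a is the (negative) coding DNA position of the A of the new ATG.
    """
    if new_atg_a >= 0:
        raise ValueError("the new ATG must lie upstream of c.1")
    if new_atg_a % 3 != 0:
        raise ValueError("the new ATG is not in frame with the original ATG")
    return f"p.(Met1extMet{new_atg_a // 3})"


print(n_terminal_extension(-24))  # p.(Met1extMet-8)
```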
Question (Dean J. Danner, Atlanta, USA)
We are characterizing mutations in nuclear encoded proteins that
function in mitochondria. The problem is in proteins that have amino
terminal mitochondrial signal peptides. The current rules for proteins
say to start numbering with the initiating methionine. However, the
functional protein has this target peptide removed and therefore many
investigators begin numbering at the amino acid residue of the mature
protein. Mutations that result in changes in the targeting peptide
suggest that numbering should begin with the Met-1. An alternative would
be to give the targeting peptide negative numbers as in the nucleic
acids upstream of the transcriptional start site. It would be helpful to
have some rules for consistency in the field.
Answer
As already suggested in your question, protein reference sequences should
always represent the complete primary translation product,
not a processed mature or functional protein (see
Recommendations).
Question (Sven Arnold, Austria)
There are several examples you give where changes affecting a series of
amino acids are described using the most 3' amino acid. Does this also
apply when it is known exactly which amino acid is affected? Example;
the sequence ATGTCAAGCTCT codes for MetSerSerSer. An insertion of AGT
(c.9_10insAGT) gives ATGTCAAGCAGTTCT, coding for MetSerSerSerSer.
Looking at the protein sequence you would describe the change as
p.Ser3dup. Knowing the nucleotide changes, it would be accurately
described as p.Ser2_Ser3insSer. My question is, do we describe the
protein change as it appears, or do we try and describe it according to
the (known) underlying DNA change?
Answer
Descriptions at protein level should describe the changes observed on
protein level - one should not try to incorporate knowledge
regarding the change at DNA-level (see
Recommendations). As a consequence, the amino acid change
described may be caused by a change which at DNA level lies several
nucleotides upstream, like in the example you give. Another example is
that where a frame shift deletion at DNA level does not immediately affect
the protein sequence.
Question
When a protein description does not contain "fs" (frame
shift) does this mean there is no frame shift?
Answer
By definition frame shifts are a special
type of amino acid deletion/insertion, replacing the
normal C-terminal sequence with one encoded by another reading frame
(specified 2013-03-16,
see Describing protein variants).
Descriptions
at protein level describe the consequences of a change on the protein,
irrespective of the changes at DNA or RNA level. Translating back from
protein to DNA (or RNA) is therefore difficult and usually only works for
simple cases like substitutions. Examples of what one might call frame
shifts that cannot be seen from the protein description include;
Question (Giampaolo Trivellin, London)
How should we describe the consequences of a duplication of a G at DNA
level that causes a frameshift at the protein level where the shifted
frame does not encounter a new stop codon. I was thinking to describe it
using the short description (e.g. p.Ile327fs), but we expect that the
protein is not formed since the aberrant RNA will be degraded lacking a
stop codon.
Answer
The description p.(Ile327fs) can indeed be used (please note the brackets
to indicate RNA was not yet analysed). It circumvents, however, the problem
that the current recommendations did not yet indicate how to describe a
frame shift that does not encounter a stop codon. The recommendation is to
describe this using "fs*?", so p.(Ile327Argfs*?) (see
Recommendations). The "?" indicates uncertainty, in this case
that the position of the stop codon is not known. When you have analysed
RNA and it is indeed undetectable (degraded) the RNA description would be
r.0 (no RNA) and the protein description p.0 (no protein).
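Whether "fs*?" applies can be determined by scanning the shifted reading frame for a stop codon. In this sketch the input is the sequence in the new frame, starting at the first changed codon and running to the end of the known transcript. Counting the first changed amino acid as position 1 is our reading of the convention, and the function name and example sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def frameshift_suffix(shifted: str) -> str:
    """Return 'fs*N' when a stop codon is met in the new frame, else 'fs*?'."""
    for i in range(0, len(shifted) - 2, 3):
        if shifted[i:i + 3] in STOP_CODONS:
            # Codon index 0 is the first changed amino acid (position 1).
            return f"fs*{i // 3 + 1}"
    return "fs*?"


print(frameshift_suffix("AGGCCTTAA"))  # fs*3
print(frameshift_suffix("AGGCCTGCA"))  # fs*?
```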
Copyright © HGVS 2007 All Rights Reserved