A reference sequence - discussions and FAQs
Since references to WWW-sites are not yet acknowledged as citations, please mention den Dunnen JT and Antonarakis SE (2000). Hum.Mutat. 15:7-12 when referring to these pages.
Discussions on a proper reference sequence have been very lively. In general it can be concluded that all suggestions made have their pros and cons, but there is no perfect solution.
Theoretically, a genomic reference sequence is the best choice. By simply numbering nucleotides from 1 to the end of the file no problems occur with complex gene structures like multiple transcription start sites (promoters / 5'-first exons), multiple translation initiation sites (ATG-codons), alternative splicing and the use of different 3'-terminal exons and poly-A addition sites.
In practice a coding DNA reference sequence is mostly preferred. The most important reason is that the description immediately gives some information regarding the location of the variant: exonic or intronic, 5' of the ATG or 3' of the stop codon and, by dividing the nucleotide number by 3 and rounding up, the number of the amino acid residue that is affected (see Nucleotide numbering).
- for a human, a genomic reference sequence does not contain any useful information, a coding DNA reference sequence does.
- a gene can be very large (over 2.0 Mb) - this makes nucleotide numbering based on a genomic reference sequence rather impractical (e.g. g.1567234_1567235insTG). Furthermore, genomic reference sequences based on GenBank NT_ files become increasingly long (e.g. the CFTR gene in NT_007933.15, >77 Mb) and consequently lose their informativeness. Downloading such large files is time consuming, even with good internet connections, and working with them is rather difficult.
- when a genomic reference sequence is taken from a complete genome sequence, e.g. a bacterium or the human X-chromosome, the transcriptional orientation of the gene of interest may be on the minus (-) strand. This makes the description of sequence variants rather complicated, especially when the consequences on RNA and/or protein level need to be described; nucleotides on DNA and RNA level are complementary and numbering goes in different directions - a confusing situation that should be prevented.
- when different genes (partly) overlap, using the same or the minus (-) DNA strand, which reference sequence should one use to describe the variant and to which gene should the change be assigned? (see Recommendations).
- when the gene sequence is incomplete (especially when large introns are present) - a genomic sequence can not be used.
- genes may contain very large introns with many intronic (length) variants present in the population - it is thus very difficult to give THE genomic reference sequence (see Genomic sequence changes regularly).
- the exact transcriptional start site (cap-site) of a gene has often not been determined and/or its assignment is debated - the first nucleotide can thus not be assigned with certainty. The same might be true for the translation initiation site (ATG-codon).
- a gene may have several transcripts, using different promoters / 5'-first exons, alternatively spliced internal exons, different 3'-terminal exons and polyA-addition sites - a complete coding DNA reference sequence can thus not be generated (see Alternatively spliced exons - nucleotide numbering).
- the different transcripts may encode different proteins (isoforms) with, when different promoters are used, different N-terminal sequences and even using different reading frames in one or more exons. A complete protein reference sequence can thus not be assigned.
- when different genes (partly) overlap, using the same or the minus (-) DNA strand, which reference sequence should one use to describe the variant and to which gene should the change be assigned? (see Recommendations).
NOTE: the recommendation is to use an LRG (Locus Reference Genomic sequence, Dalgleish et al. 2010). Information on LRGs, and how to get one for your gene of interest, can be found at the LRG website.
When a genomic reference sequence is used the following recommendations should be followed;
For complex genes, when on the genomic reference sequence all transcripts are annotated properly, computational tools (like Mutalyzer and the Genomic Mutation Consequence Calculator) can easily predict the consequences of a sequence change on all transcripts and their encoded protein isoforms, incl. when they derive from overlapping genes.
When a coding DNA reference sequence is used the following recommendations should be followed;
As discussed, genes can be rather complex and the choice of a good coding DNA reference sequence can be very difficult. Below we will refer to some examples of how experts have tried to resolve the issue.
Do you have other examples? Please let us know (E-mail to: J.T.den_Dunnen @ LUMC.nl).
The HGVS recommendations for the description of sequence variants do not include suggestions for the numbering of exons and introns. The simple reason is that exon/intron numbers are not required for a correct description. When necessary, the exon/intron numbers can be derived from the description at DNA level.
In fact, using exon/intron numbers introduces a lot of confusion, which is undesired; if an exon number conflicts with the description of the variant at DNA level, which of the two should be trusted? In many genes there is no consensus on exon/intron numbering and originally used numbering schemes had to be revised to include newly discovered exons (internal as well as 5' and/or 3' of the gene). This led to all kinds of numbering schemes without consensus or overall logic, making it very difficult for non-experts in the specific gene to keep track of all details (see Dalgleish et al. 2010). With the increasing use of genome browsers, which number exons simply from start to end (1, 2, 3, etc.), legacy numbering schemes have become even more confusing.
The only logical thing to do is to follow the standard set by the genome browsers and to start numbering with 1 for the first exon. Although this is probably difficult for the experts to accept, we can not keep on confusing newcomers by forever using legacy numbering systems. We should realize that at some point wrong assumptions will be made and a patient will end up with an erroneous diagnosis, which is of course unacceptable.
Describe variants at DNA level and do not include exon or intron numbers as part of the description. Exon and intron numbers may be mentioned but only when their use is specified and reference sequences for the exons and introns are given. Since history will leave its tracks, when referring to older data, always mention changed numbering schemes in M&M and in Figure and Table legends to prevent any confusion. For tables, even consider adding an additional column indicating the legacy numbering.
Examples;
Initial recommendations (see e.g. Antonarakis [1998] Hum.Mut. 11: 1-3) suggested two alternative descriptions for variants in intron sequences based on a coding DNA reference sequence; the formats c.88+2T>G / c.89-1G>T and c.IVS2+2T>G / c.IVS2-1G>T. The current recommendation is that the format c.IVS2+2T>G / c.IVS2-1G>T should not be used anymore.
Reason: from the description c.IVS2+2T>G it is difficult to deduce
the position of the intron relative to the coding DNA sequence. In addition,
deducing this position is often problematic. First, many authors fail to
mention the genomic and coding DNA reference sequences that were used as the basis of exon/intron numbering. Second, since on first publication gene sequences are often based on
incomplete sequences, initial exon / intron structure often turns out incomplete and
numbering changes later (see Numbering exons / introns).
Consequently, descriptions using the format c.IVS2+2T>G fail the basic criterion
to be unequivocal and should thus not be used. Descriptions using the format
c.88+2T>G do not suffer from these problems.
NOTE: when intronic variants are described in relation to a coding DNA
reference sequence authors should not forget to mention the genomic reference sequence where the
intron sequence can be found.
A basic recommendation is to use the shortest description as much as possible. Therefore, in the middle of an intron nucleotide numbering changes from + to - (e.g. from "c.88+.." to "c.89-.."). In addition, when a change in an intron is described as c.88+4356A>G (instead of c.89-2A>G) it will not be clear that the change is close to the splice acceptor site, and thus might affect splicing. This is immediately clear when the description c.89-2A>G is used.
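The switch from + to - numbering can be illustrated with a minimal Python sketch. The function name `intronic_position` and its parameters are hypothetical and not part of any recommendation or tool. The sketch assumes the intron length is known, covers only introns flanked by coding or 5'UTR positions (no "*" positions), and assumes that the central nucleotide of an intron with an odd number of nucleotides is described with a +.

```python
def intronic_position(last_exon_nt: int, intron_length: int, offset: int) -> str:
    """Return the shortest c. label for a nucleotide inside an intron.

    last_exon_nt  -- c. number of the last exonic nucleotide 5' of the intron
    intron_length -- number of nucleotides in the intron
    offset        -- 1-based position within the intron, counted from its 5' end
    """
    if intron_length < 1 or not 1 <= offset <= intron_length:
        raise ValueError("offset must lie within the intron")

    # Coding DNA numbering has no nucleotide 0, so -1 is followed by 1.
    next_exon_nt = 1 if last_exon_nt == -1 else last_exon_nt + 1

    from_donor = offset                         # distance from the 5' exon
    from_acceptor = intron_length - offset + 1  # distance from the 3' exon

    # Use whichever flank is closer; a tie goes to "+" (assumption, see above).
    if from_donor <= from_acceptor:
        return f"c.{last_exon_nt}+{from_donor}"
    return f"c.{next_exon_nt}-{from_acceptor}"


# An intron of 4357 nucleotides between c.88 and c.89 (length chosen to match the example):
print(intronic_position(88, 4357, 4356))  # c.89-2
# An intron of 10 nucleotides between c.-15 and c.-14 in the 5'UTR:
print(intronic_position(-15, 10, 10))     # c.-14-1
```

With the intron length of 4357 assumed here, the nucleotide that would otherwise be written c.88+4356 is reported as c.89-2, which makes its proximity to the splice acceptor site visible.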
Question
When a description in relation to a Reference Sequence is problematic, could one
specify the change together with 20 bp of flanking sequence on both sides?
Answer
In many cases this would be OK but for recently duplicated genes or genes which contain
repeated segments even giving 20 nucleotides to either side will not be sufficient.
Furthermore, descriptions will become very long. For problematic cases the best method is probably
to include the raw data, i.e. the sequence file itself.
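Whether a given amount of flanking sequence identifies a position uniquely can be checked directly. The following is a minimal Python sketch on a toy sequence; the function names and the sequence are hypothetical, and a real check would also have to consider the reverse-complement strand and a much larger search space (e.g. a whole genome).

```python
def count_occurrences(reference: str, query: str) -> int:
    """Count (possibly overlapping) occurrences of query in reference."""
    count = 0
    start = reference.find(query)
    while start != -1:
        count += 1
        start = reference.find(query, start + 1)
    return count


def flank_is_unique(reference: str, position: int, flank: int = 20) -> bool:
    """True if the nucleotide at 1-based `position`, together with `flank`
    nucleotides on either side, occurs exactly once in `reference`."""
    reference = reference.upper()
    index = position - 1
    context = reference[max(0, index - flank): index + flank + 1]
    return count_occurrences(reference, context) == 1


# Toy reference containing the segment GATTACA twice (a "repeated segment").
toy = "CC" + "GATTACA" + "TTGG" + "GATTACA" + "AC"
print(flank_is_unique(toy, 6, flank=2))  # False: the context lies inside the repeat
print(flank_is_unique(toy, 6, flank=5))  # True: the context extends beyond the repeat
```

The toy case mirrors the answer above: as long as the flanking sequence stays within a repeated segment, the position remains ambiguous no matter how it is written.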
Question
When I retrieve a cDNA sequence from GenBank nucleotide numbering does not start
with +1 at the A of the ATG translation initiation codon.
Answer
True, but such a file can be simply obtained. When you retrieve the sequence from the
RefSeq-database (i.e. start at EntrezGene, enter the
gene symbol or gene name, select the gene of interest, click the mRNA entry) it will
be annotated extensively (see
Example). Clicking the "CDS" annotation (CoDing
Sequence) opens a window where the nucleotide numbering will start with 1 at
the A of the ATG translation initiation codon (see
Example). To assist those studying or reporting sequence variants a locus specific
database (LSDB, see HGVS - list of
LSDBs) usually provides the coding DNA reference sequence with the nucleotide
numbering (see Example).
Question
The recommendation on numbering genomic and coding DNA variants based on the first
nucleotide of the initiation codon ATG is workable only if the reference sequence in the
database is published as a single file. In the case of the gene CDKN2A, its genomic
sequence is stored as multiple files, each containing one exonic sequence and partial
intronic sequences on both ends of the exon. I can use the above recommendation easily to
number variants in exon 1 where the initiation codon is located. The problem is how should
I number variants in exon 2 which is located in another database file?
Answer
If no database file is available that contains the complete genomic sequence, a coding
DNA Reference Sequence, preferably from the RefSeq database, should be used. Since for
many organisms a genome sequence is freely available, a database curator can easily make a
fully annotated file (genomic and coding DNA) covering the sequence of interest and submit
it to the RefSeq database. This file can then be used as the reference sequence.
Question (Isabelle Touitou, Montpellier, FRANCE)
If the first translation ATG is in exon 2, and we find a variant 5' to exon 1, should
we include intron 1 in the counting process?
NOTE: based on a coding DNA reference sequence intron 1 is located between
nucleotides -15 and -14.
Answer
Nucleotides in introns 5' of the ATG translation initiation codon (i.e. in the 5'UTR) are numbered
as all other nucleotides (see Examples and Figure).
In your example, based on a coding DNA reference sequence, an intron is present between
nucleotides -15 and -14. The nucleotides for this intron are numbered as -15+1,
-15+2, -15+3, ...., -14-3, -14-2, -14-1. Consequently, regarding the question,
when a coding DNA reference sequence is used, these intronic nucleotides are not counted.
Question
The CBS gene was originally thought to contain 16 exons. Later it was recognised that exon
15 does not exist, and recently two additional non-translated 5' exons were
detected. The current gene structure therefore includes 17 exons, of which exons 3 to 17
are translated. Should the exons of a gene be counted from the exon that contains the
start codon rather than the beginning of the cDNA? If so, should exons preceding the
start codon be counted 0, -1, -2, etc. or should the 0 be skipped? Is there an agreement
on how to deal with corrections in exon numbering?
Answer
For the description of sequence changes it does not matter how exons are numbered;
exon (and intron) numbers are not used in the descriptions. In fact this is one reason
why the recommendation is as it is (see Description of intronic variants).
Examples (using a coding DNA reference sequence);
- c.-5G>T: a change 5' of the ATG (in the 5'UTR)
- c.5G>T: a change in the coding sequence (related to a change in amino acid 2)
- c.256+1G>T: a change in the 5' end of an intron
- c.257-1G>T: a change in the 3' end of an intron
- c.*5G>T: a change 3' of the stop codon (in the 3'UTR)
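The information carried by these five formats can be read off mechanically. The following minimal Python sketch uses a hypothetical `classify` helper, which is not an HGVS tool or a Mutalyzer function. It looks only at the first position of a description, does not validate the change itself, and assumes the codon number is the nucleotide number divided by 3, rounded up.

```python
import re

# Anchor: optional "-" (5' of the ATG) or "*" (3' of the stop codon) plus digits.
# Offset: optional "+n" or "-n" indicating an intronic position.
_POSITION = re.compile(r"^c\.(?P<anchor>[-*]?\d+)(?P<offset>[+-]\d+)?")


def classify(description: str) -> dict:
    """Derive region, intronic status and codon number from a c. description."""
    match = _POSITION.match(description)
    if match is None:
        raise ValueError(f"not a coding DNA description: {description!r}")

    anchor, offset = match.group("anchor"), match.group("offset")
    if anchor.startswith("-"):
        region = "5'UTR"
    elif anchor.startswith("*"):
        region = "3'UTR"
    else:
        region = "coding"

    result = {"region": region, "intronic": offset is not None, "amino_acid": None}
    if region == "coding" and offset is None:
        # Nucleotides 1-3 encode residue 1, 4-6 residue 2, and so on.
        result["amino_acid"] = (int(anchor) + 2) // 3
    return result


for example in ("c.-5G>T", "c.5G>T", "c.256+1G>T", "c.257-1G>T", "c.*5G>T"):
    print(example, classify(example))
```

For c.5G>T the sketch reports amino acid 2, in line with the example list, and none of the results depend on an exon or intron number.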
For exon numbering the only logical thing to do is to start with 1 for the first
exon, otherwise eventually problems will emerge. For other numbering schemes only
the experts will know the history; newcomers just blindly
assume that the first exon is exon 1. Consequently, when historic numbering schemes are
used, at some point wrong assumptions will be made and a patient might end up with an
erroneous diagnosis.
However, since history will leave its tracks it
is suggested to always mention changed numbering schemes in M&M and in all Figure and
Table legends to prevent any further confusion. For tables even consider to add an
additional column indicating the historic / old exon number.
Question (Alessandra Splendore, Rio de Janeiro, Brasil)
Recently two previously unidentified exons of the TCOF1 gene were identified, and named
6A and 16A. Exon 6A is present in most of the transcribed isoforms, exon 16A is included
only in minor isoforms. In updating the nomenclature of reported mutations in TCOF1,
should I use a sequence that corresponds to the major isoform (with exon 6A, but without
16A) or the sequence that corresponds to the longest ("most complete") isoform?
Answer
This is the eternal problem of changes in the coordinates of a reference sequence. The best
solution is that the TCOF1-community gets together and decides to use an updated
reference sequence representing the most complete transcript, i.e. including exons 6A and
16A. This updated sequence should be annotated properly, submitted to the RefSeq database and used from then
on.
Question (JM Friedman, Vancouver, CANADA)
We are working on a new locus-specific mutation database for NF1 and NF2, and we have
run into a problem with the standard mutation nomenclature based on the genomic
sequence. The problem is that the canonical genomic sequence (and consequent
numbering) we are using as the basis of the mutation nomenclature has changed
repeatedly since many of the mutations were described, and it is continuing to
change. If we use the names assigned to the mutations on the basis of the version of
the sequence that was used to name the mutations, they do not map to the proper position
in the current version of the sequence. If we change the names to match the new sequence,
they will not match the published names for these mutations and may need to be changed
again the next time the sequence changes. (Actually, the current version of the NF1
sequence is annotated on the wrong strand, so all the numbering would be backwards if we
used the annotated strand instead of its complement, which is really the correct one).
The solution to identifying the
mutation unequivocally is to provide enough of the surrounding sequence to permit a unique
result on a BLAST search, and we are doing this. However, this does not solve the problem
of naming the mutations. What is your recommendation for this?
Answer
Indeed the problems you mention make life very hard. In fact, especially with genes
containing large introns, there will be no one genomic reference sequence since every gene
will be slightly different (see above). The problem of
continuously changing genomic sequences will not settle rapidly. When designating
"THE genomic reference sequence" now one can already foresee future discussions
whether this choice was proper; it will be a "random pick" and might not be the
evolutionarily correct choice. The way to go in our eyes is to declare one sequence
THE genomic reference sequence (starting several kilobase pairs 5' of the
promoter region), annotate it properly, submit it to the RefSeq database and use it from
then on. The RefSeq database has NG_ files specifically made for this purpose (see e.g. NG_000004.2).
These problems are one of the reasons why, for the
LSDBs I curate (i.e. Johan den Dunnen), I prefer a coding DNA Reference
Sequence. In that case the ever changing intronic sequences have only
a marginal effect.
Question
We are preparing an annotated set of Hox genes from the zebrafish for publication. If
the coding DNA sequence is not completely known, but only an EST lacking 5'
sequence and a genomic sequence covering the EST, how do you describe a change in
this sequence; do you number it in relation to the EST or the genomic sequence?
Furthermore, if there is a mismatch between the genomic and the EST sequence, and
you don't know which one is correct, how do you define e.g. whether the genomic sequence
has an insertion or the EST has a deletion?
Answer
First, the reference sequence chosen is always assumed to be the
correct sequence simply because changes are described in relation to this
sequence.
Second, when the EST sequence is incomplete one should describe changes in relation to this sequence like AA010203.2:54_55insG (assuming the reference sequence used is AA010203.2). So do not use a 'c.' or 'g.' prefix, since neither a coding DNA nor a genomic reference sequence is used. However, when a genomic sequence covering this EST is available the recommendation is to use this as a reference sequence.
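A description of this kind (accession.version, colon, change, no 'c.' or 'g.' prefix) can be parsed and applied to the named sequence. The following minimal Python sketch handles simple insertions only. The function name is hypothetical, the accession pattern is a simplifying assumption rather than a full GenBank accession grammar, and the reference used in the usage lines is a made-up toy sequence, not the actual content of AA010203.2.

```python
import re

# Simplified pattern: letters/underscore, digits, ".version", then "N_MinsSEQ".
_INSERTION = re.compile(
    r"^(?P<accession>[A-Za-z_]+\d+\.\d+):"
    r"(?P<start>\d+)_(?P<end>\d+)ins(?P<inserted>[ACGT]+)$"
)


def apply_insertion(description: str, reference: str) -> tuple:
    """Return (accession, variant sequence) for an insertion description."""
    match = _INSERTION.match(description)
    if match is None:
        raise ValueError(f"unsupported description: {description!r}")

    start, end = int(match.group("start")), int(match.group("end"))
    if end != start + 1:
        raise ValueError("an insertion must be placed between two adjacent nucleotides")
    if start < 1 or end > len(reference):
        raise ValueError("position lies outside the reference sequence")

    # Positions are 1-based, so the first `start` nucleotides precede the insertion.
    variant = reference[:start] + match.group("inserted") + reference[start:]
    return match.group("accession"), variant


# Toy reference of 60 nucleotides standing in for the real EST sequence.
toy_reference = "ACGT" * 15
accession, variant = apply_insertion("AA010203.2:54_55insG", toy_reference)
print(accession, len(toy_reference), len(variant))  # AA010203.2 60 61
```

Because the accession and version travel with the description, the numbering stays tied to one specific sequence file even if a later version of the record changes.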
Question
Making a judgment on what is the "wild type" (wt) nucleotide for some
sequences seems arbitrary at best. How would you suggest that the description be presented
for these?
Answer
Changes are always described in relation to a "reference sequence".
This reference sequence is considered to be the "wild type"
sequence and is expected to be the one present in the database (GenBank). Consequently,
reference and wild type sequence can be different. Note however that everybody has
influence on the sequences in the RefSeq
database and thus may request that a variant is changed into the more common
allele. However, the debate about what is wild type can be unsolvable when variants are
very common (near 50%) or differ between populations.
Question (M Paalman, Human Mutation)
How should sequence variants in the mitochondrial DNA (mtDNA) be described?
Answer
The mtDNA genome is rather small, completely sequenced and numbered. According to current
recommendations variants in the mitochondrial DNA should be described in relation to the
full mitochondrial DNA sequence, i.e. the genomic reference sequence
(GenBank NC_001807.4).
Descriptions should be preceded by "m.", like m.8994T>C (see Recommendations). The mtDNA encodes a range of
different proteins. To prevent confusion, changes at protein level should be described
including a reference to the protein changed, like p.ATP6:Leu156Pro (GenBank NP_536848.1).
Copyright © HGVS 20107 All Rights Reserved