A reference sequence - discussions and FAQs
Since references to WWW-sites are not yet acknowledged as citations, please mention den Dunnen JT and Antonarakis SE (2000). Hum.Mutat. 15:7-12 when referring to these pages.
To indicate what type of reference sequence is used for the description of the sequence variant, specific indicators are used;
For details on residue numbering see Standards. Note that recommendations also exist to describe different transcripts/protein isoforms generated from one gene (see Standards).
Discussions on a proper reference sequence have been very lively. In general it can be concluded that all suggestions made have their pros and cons, and that there is no perfect solution.
Theoretically, a genomic reference sequence is the best choice. By simply numbering nucleotides from 1 to the end of the file no problems occur with complex gene structures like multiple transcription start sites (promoters / 5'-first exons), multiple translation initiation sites (ATG-codons), alternative splicing and the use of different 3'-terminal exons and poly-A addition sites.
In practice a coding DNA reference sequence is mostly preferred. The most important reason is that from the description one immediately gets some information regarding the location of the variant; exonic or intronic, 5' of the ATG or 3' of the stop codon and, by dividing the nucleotide number by 3, the number of the amino acid residue that is affected (see Nucleotide numbering).
- for a human, a genomic reference sequence does not contain any useful information (a coding DNA reference sequence does)
- a gene can be very large (over 2.0 Mb) - this makes nucleotide numbering based on a genomic reference sequence rather impractical (e.g. g.1567234_1567235insTG). Furthermore, genomic reference sequences based on GenBank NT_ files become increasingly long (e.g. the CFTR gene in NT_007933.15, >77 Mb) and consequently lose their informativeness. Downloading such large files is time-consuming, even with a good internet connection, and working with these files is rather difficult.
- when a genomic reference sequence is taken from a complete genome sequence, e.g. a bacterium or the human X-chromosome, the transcriptional orientation of the gene of interest may be on the minus (-) strand. This makes the description of sequence variants rather complicated, especially when the consequences on RNA and/or protein level need to be described; nucleotides on DNA and RNA level are complementary and numbering goes in different directions - a confusing situation that should be prevented.
- when different genes (partly) overlap, using the same or the minus (-) DNA strand, which reference sequence should one use to describe the variant and to which gene should the change be assigned? (see Recommendations).
- when the gene sequence is incomplete (especially when large introns are present) - a genomic sequence cannot be used.
- genes may contain very large introns with many intronic (length) variants present in the population - it is thus very difficult to give THE genomic reference sequence (see Genomic sequence changes regularly).
- the exact transcriptional start site (cap-site) of a gene has often not been determined and/or its assignment is debated - the first nucleotide can thus not be assigned with certainty. The same might be true for the translation initiation site (ATG-codon).
- a gene may have several transcripts, using different promoters / 5'-first exons, alternatively spliced internal exons, different 3'-terminal exons and polyA-addition sites - one complete coding DNA reference sequence can thus not be generated (see Alternatively spliced exons - nucleotide numbering).
- the different transcripts may encode different proteins (isoforms) with, when different promoters are used, different N-terminal sequences and even using different reading frames in one or more exons. One complete protein reference sequence can thus not be assigned.
The recommendation is to use an LRG (Locus Reference Genomic sequence, e.g. LRG_329). Information on LRGs (Dalgleish et al. 2010, MacArthur et al. 2014), and how to get one for your gene of interest, can be found at the LRG website. When no LRG is available, one should be requested. In the meantime, a RefSeqGene record is a good alternative (RefSeq database, format NG_008797.2). When neither is available, request an LRG; a RefSeqGene record will be made in parallel. DO NOT use LRGs that are "pending"; they might change before they are officially released.
Reporting using LRGs as a reference is possible for genomic DNA (e.g. LRG_1:g.8463G>C), coding DNA (e.g. LRG_1t1:c.572G>C), non-coding RNA (e.g. LRG_163t1:n.5C>T) and protein (e.g. LRG_1p1:p.Gly191Ala) variants. To describe coding DNA/non-coding RNA variants the transcript must be indicated (e.g. "t1"), for protein variants the protein isoform (e.g. "p1").
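The LRG-based formats listed above can be checked mechanically. The following is a minimal sketch, assuming only the four example formats given here; the regular expression and the function name parse_lrg_description are illustrative and are not part of any official HGVS or LRG tool.

```python
import re

# Illustrative pattern for the LRG-based descriptions shown in the text.
# It is a sketch, not an official HGVS grammar.
LRG_DESCRIPTION = re.compile(
    r"^(?P<lrg>LRG_\d+)"        # LRG record, e.g. LRG_1
    r"(?P<feature>[tp]\d+)?"    # optional transcript (t1) or protein isoform (p1)
    r":(?P<prefix>[gcnp])\."    # sequence type: g., c., n. or p.
    r"(?P<change>.+)$"          # the variant itself, left unparsed
)


def parse_lrg_description(description):
    """Split an LRG-based description into its parts and check that the
    transcript or protein isoform indicator is present where required."""
    match = LRG_DESCRIPTION.match(description)
    if match is None:
        raise ValueError(f"not an LRG-based description: {description!r}")
    parts = match.groupdict()
    feature, prefix = parts["feature"], parts["prefix"]
    # Coding DNA and non-coding RNA variants must name the transcript.
    if prefix in ("c", "n") and not (feature and feature.startswith("t")):
        raise ValueError("c./n. descriptions need a transcript, e.g. 't1'")
    # Protein variants must name the protein isoform.
    if prefix == "p" and not (feature and feature.startswith("p")):
        raise ValueError("p. descriptions need a protein isoform, e.g. 'p1'")
    return parts
```

For example, parse_lrg_description("LRG_1t1:c.572G>C") separates the record (LRG_1), the transcript (t1), the prefix (c) and the change (572G>C), while a description such as "LRG_1:c.572G>C" is rejected because the transcript is missing.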
When a genomic reference sequence is used the following recommendations should be followed;
For complex genes, when on the genomic reference sequence all transcripts are annotated properly, computational tools (like Mutalyzer and the Genomic Mutation Consequence Calculator) can easily predict the consequences of a sequence change on all transcripts and their encoded protein isoforms, incl. when they derive from overlapping genes.
When a coding DNA reference sequence is used the following recommendations should be followed;
NOTE: the numbering suggested can also be applied to amino acid numbering (e.g. p.Ala456+1, Cys456+2, ..., etc. or ..., p.Phe457-2, p.Gln457-1)
Suggestions have been made to extend the recommendations for the nucleotide numbering of coding DNA reference sequences to specifically indicate untranscribed nucleotides (see Discussion).
As discussed, genes can be rather complex and the choice of a good coding DNA reference sequence can be very difficult. Below we will refer to some examples of how experts have tried to resolve the issue.
Do you have other examples? Please let us know (e-mail to: J.T.den_Dunnen @ LUMC.nl).
The HGVS recommendations for the description of sequence variants do not include suggestions for the numbering of exons and introns. The simple reason is that exon/intron numbers are not required for a correct description. When necessary, the exon/intron numbers can be derived from the description at DNA level.
In fact, using exon/intron numbers introduces a lot of unwanted confusion: if an exon number conflicts with the description of the variant at DNA level, which one should be trusted? In many genes there is no consensus on exon/intron numbering, and originally used numbering schemes had to be revised to include newly discovered exons (internal as well as 5' and/or 3' of the gene). This led to all kinds of numbering schemes without consensus or overall logic, making it very difficult for non-experts in the specific gene to keep track of all details (see Dalgleish et al. 2010). With the increasing use of genome browsers, which simply number exons from start to end (1, 2, 3, etc.), legacy numbering schemes have become even more confusing.
The only logical thing to do is to follow the standard set by the genome browsers and to start numbering with 1 for the first exon. Although this is probably difficult for the experts to accept, we cannot keep on confusing newcomers by forever using legacy numbering systems. We should realize that at some point wrong assumptions will be made and a patient will end up with an erroneous diagnosis, which is of course unacceptable.
Describe variants at DNA level and do not include exon or intron numbers as part of the description. Exon and intron numbers may be mentioned, but only when their use is specified and reference sequences for the exons and introns are given. Since history will leave its tracks, when referring to older data, always mention changed numbering schemes in M&M and in Figure and Table legends to prevent any confusion. For tables, even consider adding an additional column indicating the legacy numbering.
Examples;
Initial recommendations (see e.g. Antonarakis [1998] Hum.Mut. 11: 1-3) suggested two alternative descriptions for variants in intron sequences based on a coding DNA reference sequence; the formats c.88+2T>G / c.89-1G>T and c.IVS2+2T>G / c.IVS2-1G>T. The current recommendation is that the format c.IVS2+2T>G / c.IVS2-1G>T should not be used anymore.
Reason: from the description c.IVS2+2T>G it is difficult
to deduce where the intron lies relative to the coding DNA
sequence, and attempts to work out this position
often run into problems. First, many authors fail to mention the genomic +
coding DNA reference sequences that were used as the basis of exon/intron
numbering. Second, since on first publication gene sequences are often
based on incomplete sequences, initial exon / intron structure often turns
out incomplete and numbering changes later (see Numbering
exons / introns). Consequently, descriptions using the
format c.IVS2+2T>G fail the basic criterion to be unequivocal
and should thus not be used. Descriptions using the format c.88+2T>G do
not suffer from these problems.
NOTE: when intronic variants are described in relation to a
coding DNA reference sequence authors should not forget to mention the
genomic reference sequence where the intron sequence can be found.
A basic recommendation is to use the shortest description possible. Therefore, in the middle of an intron nucleotide numbering changes from + to - (e.g. from "c.88+.." to "c.89-.."). In addition, when a change in an intron is described as c.88+4356A>G (instead of c.89-2A>G) it will not be clear that the change is close to the splice acceptor site, and thus might affect splicing. This is immediately clear when the description c.89-2A>G is used.
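The switch from + to - can be illustrated with a short sketch. It assumes the intron length is known and that, for an intron with an odd number of nucleotides, the central nucleotide is the last one numbered with "+" (our reading of the numbering convention, not something stated on this page). The function name and the intron length of 4357 nucleotides used in the example are hypothetical; positions in the 3'UTR (c.*) are not handled.

```python
def next_exonic(nucleotide):
    """Return the c. number following `nucleotide`; there is no c.0."""
    return 1 if nucleotide == -1 else nucleotide + 1


def intronic_position(last_exon_nt, intron_length, offset):
    """Describe the intronic nucleotide at 1-based `offset` in the intron.

    last_exon_nt  -- c. number of the last exonic nucleotide 5' of the intron
    intron_length -- number of nucleotides in the intron (assumed known)
    """
    if not 1 <= offset <= intron_length:
        raise ValueError("offset lies outside the intron")
    # The first half, including the central nucleotide of an odd-length
    # intron (an assumption, see lead-in), is counted from the upstream exon.
    first_half = (intron_length + 1) // 2
    if offset <= first_half:
        return f"c.{last_exon_nt}+{offset}"
    # The second half is counted backwards from the downstream exon.
    distance_to_exon = intron_length - offset + 1
    return f"c.{next_exonic(last_exon_nt)}-{distance_to_exon}"
```

With a hypothetical intron of 4357 nucleotides following c.88, intronic_position(88, 4357, 4356) gives "c.89-2" rather than "c.88+4356", which makes the proximity to the splice acceptor site visible in the description itself.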
Question
When a description in relation to a Reference Sequence is problematic, could
one specify the change flanked by 20 bp of sequence on both
sides?
Answer
In many cases this would be OK but for recently duplicated genes or genes
which contain repeated segments even giving 20 nucleotides to either side
will not be sufficient. Furthermore, descriptions will become very long.
For problematic cases the best method is probably to include the raw data,
i.e. the sequence file itself.
Question
When I retrieve a cDNA sequence from GenBank nucleotide numbering does
not start with +1 at the A of the ATG translation initiation
codon.
Answer
True, but such a file can be simply obtained. When you retrieve the
sequence from the RefSeq-database (i.e. start at EntrezGene,
enter the gene symbol or gene name, select the gene of interest, click
the mRNA entry) it will be annotated extensively (see
Example). Clicking the "CDS" annotation (CoDing
Sequence) opens a window where the nucleotide numbering will
start with 1 at the A of the ATG translation initiation codon
(see
Example). To assist those studying or reporting sequence
variants a locus specific database (LSDB, see
HGVS - list of LSDBs) usually provides the coding DNA
reference sequence with the nucleotide numbering (see
Example).
Question
The recommendation on numbering genomic and coding DNA variants based
on the first nucleotide of the initiation codon ATG is workable only if
the reference sequence in the database is published as a single file. In
the case of the gene CDKN2A, its genomic sequence is stored as
multiple files, each containing one exonic sequence and partial
intronic sequences on both ends of the exon. I can use the above
recommendation easily to number variants in exon 1 where the initiation
codon is located. The problem is how I should number variants in exon 2,
which is located in another database file.
Answer
If no database file is available that contains the complete genomic
sequence, a coding DNA Reference Sequence, preferably
from the RefSeq database,
should be used. Since for many organisms a genome sequence is freely
available, a database curator can easily make a fully annotated file
(genomic and coding DNA) covering the sequence of interest and submit it
to the RefSeq database. This file can then be used as the reference
sequence.
Question (Tracy Lester, Oxford, UK)
We are wondering how to name variants in ZRS, a regulatory sequence
for SHH that lies 1 Mb upstream of SHH in intron 5 of LMBR1. Variants in
ZRS are associated with various limb abnormalities and to-date have been
numbered according to a sequence which does not follow HGVS guidelines.
Should we create a genomic reference sequence for SHH that includes 1 Mb
of upstream sequence to encompass the ZRS, number it according to the
LMBR1 reference sequence, or something else?
Answer
A difficult case. I see 3 options;
Question (Isabelle Touitou, Montpellier,
FRANCE)
If the first translation ATG is in exon 2, and we find a variant 5' to
exon 1, should we include intron 1 in the counting process?
NOTE: based on a coding DNA reference sequence intron 1 is
located between nucleotides -15 and -14.
Answer
Nucleotides in introns 5' of the ATG translation initiation codon (i.e. in
the 5'UTR) are numbered as all other nucleotides (see
Examples and Figure). In your example, based on a coding DNA
reference sequence, an intron is present between nucleotides -15 and -14.
The nucleotides for this intron are numbered as -15+1, -15+2,
-15+3, ...., -14-3, -14-2, -14-1. Consequently, regarding the
question, when a coding DNA reference sequence is used, these intronic
nucleotides are not counted.
Question
The CBS gene was originally thought to contain 16 exons. Later it was
recognised that exon 15 does not exist, and recently two
additional non-translated 5' exons were detected. The current gene
structure therefore includes 17 exons, of which exons 3 to 17 are
translated. Should the exons of a gene be counted from the exon that
contains the start codon rather than the beginning of the cDNA? If so,
should exons preceding the start codon be counted 0, -1, -2, etc. or
should the 0 be skipped? Is there an agreement on how to deal with
corrections in exon numbering?
Answer
For the description of sequence changes it does not matter how
exons are numbered; exon (and intron) numbers are not used
in the descriptions. In fact this is one reason why the recommendation is
as it is (see Description of intronic
variants). Examples (using a coding DNA reference sequence);
- c.-5G>T: a change 5' of the ATG (in the 5'UTR)
- c.5G>T: a change in the coding sequence (affecting the codon for amino acid 2)
- c.256+1G>T: a change in the 5' end of an intron
- c.257-1G>T: a change in the 3' end of an intron
- c.*5G>T: a change 3' of the stop codon (in the 3'UTR)
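The location information carried by these example descriptions can be read off mechanically. The sketch below is an illustration only: the function name locate and the returned labels are our own, it looks only at the first position in a description, and it derives the codon number by dividing the nucleotide number by 3 and rounding up.

```python
import re

# First position of a coding DNA description: an optional '-' (5' of the ATG)
# or '*' (3' of the stop codon), a number, and an optional intronic offset.
C_POSITION = re.compile(r"^c\.(?P<anchor>[-*]?\d+)(?P<offset>[+-]\d+)?")


def locate(description):
    """Return a short label for the location of a c. description."""
    match = C_POSITION.match(description)
    if match is None:
        raise ValueError(f"not a coding DNA description: {description!r}")
    anchor, offset = match.group("anchor"), match.group("offset")
    if offset is not None:
        # '+' counts from the preceding exon, '-' towards the next exon.
        return "intron, 5' end" if offset.startswith("+") else "intron, 3' end"
    if anchor.startswith("-"):
        return "5' of the ATG"
    if anchor.startswith("*"):
        return "3' of the stop codon"
    # Nucleotides 1-3 encode amino acid 1, 4-6 amino acid 2, and so on.
    codon = (int(anchor) + 2) // 3
    return f"coding sequence, codon {codon}"
```

Applied to the five examples, locate returns "5' of the ATG", "coding sequence, codon 2", "intron, 5' end", "intron, 3' end" and "3' of the stop codon" respectively, without any reference to exon or intron numbers.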
For exon numbering the only logical thing to do is to start with 1
for the first exon, otherwise eventually problems will emerge.
For other numbering schemes only the experts will know the
history; newcomers just blindly assume that the first exon
is exon 1. Consequently, when historic numbering schemes are used, at some
point wrong assumptions will be made and a patient might end up with an
erroneous diagnosis.
However, since history will
leave its tracks it is suggested to always mention changed numbering
schemes in M&M and in all Figure and Table legends to prevent any
further confusion. For tables, even consider adding an additional column
indicating the historic / old exon number.
Question (Alessandra Splendore, Rio de Janeiro,
Brasil)
Recently two previously unidentified exons of the TCOF1 gene were
identified, and named 6A and 16A. Exon 6A is present in most of the
transcribed isoforms, exon 16A is included only in minor isoforms. In
updating the nomenclature of reported mutations in TCOF1, should I use a
sequence that corresponds to the major isoform (with exon 6A, but
without 16A) or the sequence that corresponds to the longest ("most
complete") isoform?
Answer
This is the eternal problem of changes in the coordinates of a reference
sequence. The best solution is that the TCOF1-community
gets together and decides to use an updated reference sequence
representing the most complete transcript, i.e. including exons 6A and 16A.
This updated sequence should be annotated properly, submitted to
the RefSeq database
and used from then on.
Question (JM Friedman, Vancouver,
CANADA)
We are working on a new locus-specific mutation database for NF1 and
NF2, and we have run into a problem with the standard mutation
nomenclature based on the genomic sequence. The problem is that
the canonical genomic sequence (and consequent numbering) we are using
as the basis of the mutation nomenclature has changed repeatedly
since many of the mutations were described, and it is continuing to
change. If we use the names assigned to the mutations on the basis
of the version of the sequence that was used to name the mutations, they
do not map to the proper position in the current version of the
sequence. If we change the names to match the new sequence, they will
not match the published names for these mutations and may need to be
changed again the next time the sequence changes. (Actually, the
current version of the NF1 sequence is annotated on the wrong strand, so
all the numbering would be backwards if we used the annotated strand
instead of its complement, which is really the correct one).
The solution to
identifying the mutation unequivocally is to provide enough of the
surrounding sequence to permit a unique result on a BLAST search, and we
are doing this. However, this does not solve the problem of naming the
mutations. What is your recommendation for this?
Answer
Indeed the problems you mention make life very hard. In fact, especially
with genes containing large introns, there will be no one genomic
reference sequence since every gene will be slightly different (see
above). The problem of continuously changing genomic sequences
will not settle rapidly. When designating "THE genomic reference sequence"
now one can already foresee future discussions whether this choice was
proper; it will be a "random pick" and might not be the evolutionarily
correct choice. The way to go in our eyes is to declare one
sequence THE genomic reference sequence (starting several kilo
base pairs 5' of the promoter region), annotate it properly, submit
it to the RefSeq
database and use it from then on. The RefSeq database has NG_
files specifically made for this purpose (see e.g. NG_000004.2).
These problems are one of the reasons why, for the LSDBs
I curate (i.e. Johan den Dunnen), I prefer a coding
DNA Reference Sequence. In that case the ever-changing
intronic sequences have only a marginal effect.
Question
For genes that are on the minus strand of a chromosome (opposite
transcriptional orientation) the description based on chromosome
coordinates may differ significantly from that based on the coding DNA
reference sequence. Say the chromosome sequence is -TGGGGCAT- and one of
the G's is deleted (change to -TGGG_CAT-). Based on chromosome
coordinates the description is g.5delG. However based on the coding DNA
reference sequence (ATGCCCCA) the description is c.7delC. Not only is
the deleted nucleotide different (delG vs. delC), in fact the
descriptions also point to another nucleotide, g.5 vs. g.2 (equal to
c.7delC). Is this correct?
Answer
Yes, this is correct. When genes are on the minus strand of a chromosome
(opposite transcriptional orientation) and the change is located in a
repeated sequence (mono-, di-, tri-, etc. stretches) the rule that for all
descriptions the most 3' position possible
should be assigned (see
General recommendations) has this as a consequence.
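The example from the question can be reproduced with a short sketch. It assumes, as the question does, that the coding DNA reference sequence is the reverse complement of the chromosome fragment and that c.1 is its first nucleotide; the function names are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return sequence.translate(COMPLEMENT)[::-1]


def most_3prime_deletion(sequence, position):
    """Shift a single-nucleotide deletion at 1-based `position` to the
    most 3' position within a run of identical nucleotides."""
    base = sequence[position - 1]
    while position < len(sequence) and sequence[position] == base:
        position += 1
    return position, base


def describe_on_both_strands(chromosome, position):
    """Describe the same deletion on the chromosome (g.) and on a coding
    DNA reference sequence in the opposite orientation (c.)."""
    g_position, g_base = most_3prime_deletion(chromosome, position)
    coding = reverse_complement(chromosome)
    # The same nucleotide, counted from the other end of the fragment.
    mirrored = len(chromosome) - position + 1
    c_position, c_base = most_3prime_deletion(coding, mirrored)
    return f"g.{g_position}del{g_base}", f"c.{c_position}del{c_base}"
```

Whichever of the four G's in TGGGGCAT is taken as deleted, describe_on_both_strands returns g.5delG and c.7delC. Position c.7 corresponds to g.2, so the two descriptions name opposite ends of the same run.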
Question
We are preparing an annotated set of Hox genes from the zebrafish for
publication. If the coding DNA sequence is not completely known,
but only an EST lacking 5' sequence and a genomic sequence covering the
EST, how do you describe a change in this sequence; do you
number it in relation to the EST or the genomic sequence? Furthermore,
if there is a mismatch between the genomic and the EST sequence,
and you don't know which one is correct, how do you define e.g. whether
the genomic sequence has an insertion or the EST has a deletion?
Answer
First, the reference sequence chosen is always assumed
to be the correct sequence simply because changes are
described in relation to this sequence.
Second, when the EST sequence is incomplete one should describe changes in relation to this sequence like AA010203.2:54_55insG (assuming the reference sequence used is AA010203.2). So do not use a 'c.' or 'g.' prefix, since neither a coding DNA nor a genomic reference sequence is used. However, when a genomic sequence covering this EST is available the recommendation is to use this as a reference sequence.
Question
Making a judgment on what is the "wild type" (wt) nucleotide
for some sequences seems arbitrary at best. How would you suggest that
the description be presented for these?
Answer
Changes are always described in relation to a "reference sequence".
This reference sequence is considered to be the "wild type"
sequence and is expected to be the one present in the database (GenBank).
Consequently, reference and wild type sequence can be different. Note
however that everybody has influence on the sequences in the RefSeq
database and thus may request that a variant is changed
into the more common allele. However, the debate about what is wild type
can be unsolvable when variants are very common (near 50%) or differ
between populations.
Question (M Paalman, Human Mutation)
How should sequence variants in the mitochondrial DNA (mtDNA) be
described?
Answer
The mtDNA genome is rather small, completely sequenced and numbered.
According to current recommendations variants in the mitochondrial DNA
should be described in relation to the full mitochondrial DNA sequence,
i.e. for human the Homo sapiens mitochondrion, complete genome (GenBank
NC_012920.1).
Descriptions should be preceded by "m.", like m.8993T>C (see
Recommendations). The mtDNA encodes a range of different
proteins. To prevent confusion, changes at protein level should be
described including a reference to the protein changed, like
ATP6:p.Leu156Pro (GenBank YP_003024031.1,
ATP synthase 6).
NOTE: for issues related to mitochondrial DNA sequences see
MITOMAP.
Question
How should sequence variants be described in genes that produce only
RNA (so no protein), e.g. ncRNA, miRNA, etc.?
Answer
To describe variants in genes that produce an RNA molecule but no protein
a genomic reference sequence can be used ("g."
description). When available, it is also possible to use a NR_
transcript reference sequence (e.g. NR_000020.1
for the small nucleolar RNA, C/D box 33 (SNORD33) gene) using the prefix "n."
(see Standards). Numbering for the
transcript reference sequence starts with position "n.1"
and ends with the last position.
NOTE:
suggested addition, see SVD-WG002
Copyright © HGVS 2010 All Rights Reserved