HGVS recommendations: reference sequence

A reference sequence - discussions and FAQs

Last modified January 11, 2016

Since references to WWW-sites are not yet acknowledged as citations, please mention den Dunnen JT and Antonarakis SE (2000). Hum.Mutat. 15:7-12 when referring to these pages.

Reference sequence descriptions
- reference sequence indicators
Reference sequence - genomic or coding DNA ?
- practical problems genomic reference sequence
- practical problems coding DNA reference sequence
Reference sequence - recommendations
use an LRG (Locus Reference Genomic sequence, Dalgleish et al. 2010), see LRG website
Numbering exons & introns
- discussion & recommendations
Changed recommendations
- description of intronic variants
Frequently Asked Questions (FAQ)

Reference sequence descriptions

To indicate what type of reference sequence is used for the description of the sequence variant, specific indicators are used;

c. = coding DNA reference sequence
- covers the part of the transcript that is translated into protein; numbering starts at the translation initiation site (the A of the ATG codon) and ends at the last nucleotide of the translation termination site (translation stop codon TAG, TAA or TGA)
g. = genomic reference sequence
- numbering starts at the first nucleotide of the reference sequence and ends at the last nucleotide
m. = mitochondrial reference sequence (see details)
- numbering starts at the first nucleotide of the reference sequence and ends at the last nucleotide
n. = non-coding RNA reference sequence (gene producing an RNA transcript but not a protein)
- numbering starts at the first nucleotide of the non-coding transcript and ends at the last nucleotide
r. = RNA reference sequence
- covers the entire transcript, excluding the poly A-tail; numbering starts at the transcription initiation site (cap site) and ends at transcription termination site
p. = protein reference sequence
- covers the entire protein; numbering starts at the translation initiation site (the Methionine) and ends at the translation termination site (the *)

For details on residue numbering see Standards. Note that recommendations also exist to describe different transcripts/protein isoforms generated from one gene (see Standards).
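The indicators listed above can be read mechanically from a description. The following is a minimal sketch in Python, not part of the recommendations themselves; the function name `reference_type` and the dictionary are illustrative assumptions, and the sketch only looks at the prefix, not at the rest of the description.

```python
# Prefix -> reference sequence type, as listed in the section above.
REFERENCE_TYPES = {
    "c": "coding DNA",
    "g": "genomic",
    "m": "mitochondrial",
    "n": "non-coding RNA",
    "r": "RNA",
    "p": "protein",
}


def reference_type(description):
    """Return the reference sequence type indicated by a variant description.

    Accepts descriptions with or without a leading reference identifier,
    e.g. "LRG_1t1:c.572G>C" or "c.88+2T>G".
    """
    # Drop the reference sequence identifier, if one is given before ":".
    variant = description.rsplit(":", 1)[-1]
    prefix, dot, _ = variant.partition(".")
    if dot != "." or prefix not in REFERENCE_TYPES:
        raise ValueError(f"no recognised reference indicator in {description!r}")
    return REFERENCE_TYPES[prefix]


print(reference_type("LRG_1t1:c.572G>C"))
print(reference_type("m.8993T>C"))
```

A description without any indicator, such as one given directly against an EST accession, is rejected by this sketch rather than guessed at.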

Genomic or coding DNA reference sequence ?

Discussions on a proper reference sequence have been very lively. In general it can be concluded that all suggestions made have their pros and cons, and that there is no perfect solution.

Theoretically, a genomic reference sequence is the best choice. By simply numbering nucleotides from 1 to the end of the file no problems occur with complex gene structures like multiple transcription start sites (promoters / 5'-first exons), multiple translation initiation sites (ATG-codons), alternative splicing and the use of different 3'-terminal exons and poly-A addition sites.

In practice a coding DNA reference sequence is mostly preferred. The most important reason is that the description immediately gives some information about the location of the variant: exonic or intronic, 5' of the ATG or 3' of the stop codon and, by dividing the nucleotide number by 3 and rounding up, the number of the amino acid residue that is affected (see Nucleotide numbering).
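The "divide by 3" step can be made explicit. The sketch below assumes a position inside the coding region, numbered from c.1 at the A of the ATG; the function name `codon_for` is a hypothetical helper chosen for illustration.

```python
def codon_for(c_position):
    """Map a coding DNA position (c.1 = A of the ATG) to
    (amino acid number, position within the codon)."""
    if c_position < 1:
        raise ValueError("only positions inside the coding region map to a codon")
    residue = (c_position + 2) // 3      # divide by 3, rounding up
    offset = (c_position - 1) % 3 + 1    # 1, 2 or 3 within the codon
    return residue, offset


print(codon_for(5))    # second nucleotide of codon 2
print(codon_for(572))  # second nucleotide of codon 191
```

This agrees with examples used elsewhere on this page: c.5 relates to amino acid 2, and c.572 falls in codon 191 (compare LRG_1t1:c.572G>C and LRG_1p1:p.Gly191Ala).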

Practical problems genomic reference sequence

for a human reader, a description based on a genomic reference sequence does not convey any directly useful information (a description based on a coding DNA reference sequence does)

a gene can be very large (over 2.0 Mb) - this makes nucleotide numbering based on a genomic reference sequence rather impractical (e.g. g.1567234_1567235insTG). Furthermore, genomic reference sequences based on GenBank NT_ files become increasingly long (e.g. the CFTR gene in NT_007933.15, >77 Mb) and consequently lose their informative value. Downloading such large files is time consuming, even with a good internet connection, and working with them is rather difficult.

when a genomic reference sequence is taken from a complete genome sequence, e.g. a bacterium or the human X-chromosome, the transcriptional orientation of the gene of interest may be on the minus (-) strand. This makes the description of sequence variants rather complicated, especially when the consequences on RNA and/or protein level need to be described; nucleotides on DNA and RNA level are complementary and numbering goes in different directions - a confusing situation that should be prevented.

when different genes (partly) overlap, using the same or the minus (-) DNA strand, which reference sequence should one use to describe the variant and to which gene should the change be assigned ? (see Recommendations).

when the gene sequence is incomplete (especially when large introns are present) - a genomic sequence can not be used.

genes may contain very large introns with many intronic (length) variants present in the population - it is thus very difficult to give THE genomic reference sequence (see Genomic sequence changes regularly).

Practical problems coding DNA reference sequence

the exact transcriptional start site (cap-site) of a gene has often not been determined and/or its assignment is debated - the first nucleotide can thus not be assigned with certainty. The same might be true for the translation initiation site (ATG-codon).

a gene may have several transcripts, using different promoters / 5'-first exons, alternatively spliced internal exons, different 3'-terminal exons and polyA-addition sites - one complete coding DNA reference sequence can thus not be generated (see Alternatively spliced exons - nucleotide numbering).

the different transcripts may encode different proteins (isoforms) with, when different promoters are used, different N-terminal sequences and even using different reading frames in one or more exons. One complete protein reference sequence can thus not be assigned.

when different genes (partly) overlap, using the same or the minus (-) DNA strand, which reference sequence should one use to describe the variant and to which gene should the change be assigned ? (see Recommendations).

Recommendations reference sequence

The recommendation is to use an LRG (Locus Reference Genomic sequence, e.g. LRG_329). Information on LRGs (Dalgleish et al. 2010, MacArthur et al. 2014), and how to get one for your gene of interest, can be found at the LRG website. When no LRG is available, one should be requested. In the meantime, a RefSeqGene record is a good alternative (RefSeq database, format NG_008797.2). When neither is available, request an LRG; a RefSeqGene record will be made in parallel. DO NOT use LRGs that are "pending"; they might change before being officially released.

Reporting using LRGs as a reference is possible for genomic DNA (e.g. LRG_1:g.8463G>C), coding DNA (e.g. LRG_1t1:c.572G>C), non-coding RNA (e.g. LRG_163t1:n.5C>T) and protein (e.g. LRG_1p1:p.Gly191Ala) variants. To describe coding DNA/non-coding RNA variants the transcript must be indicated (e.g. "t1"), for protein variants the protein isoform (e.g. "p1").
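The rule on transcript and protein isoform indicators can be checked automatically. The sketch below is an illustration only: the regular expression and the function `parse_lrg` are assumptions written to fit the four example forms given above, not an official LRG or HGVS grammar.

```python
import re

# Hypothetical pattern covering only the example forms shown in the text:
# LRG_<n>[t<n>|p<n>]:<prefix>.<change>
LRG_PATTERN = re.compile(r"^(LRG_\d+)(?:([tp])(\d+))?:([cgnp])\.(.+)$")

# Which feature indicator each prefix requires: c. and n. need a
# transcript ("t"), p. needs a protein isoform ("p"), g. takes neither.
REQUIRED_FEATURE = {"c": "t", "n": "t", "p": "p", "g": None}


def parse_lrg(description):
    """Split an LRG-based description and check its feature indicator."""
    match = LRG_PATTERN.match(description)
    if match is None:
        raise ValueError(f"not an LRG-based description: {description!r}")
    lrg, kind, number, prefix, change = match.groups()
    if kind != REQUIRED_FEATURE[prefix]:
        raise ValueError(f"{prefix}. description has a wrong or missing t/p indicator")
    feature = f"{kind}{number}" if kind else None
    return {"reference": lrg, "feature": feature, "prefix": prefix, "change": change}


print(parse_lrg("LRG_1t1:c.572G>C"))
print(parse_lrg("LRG_1:g.8463G>C"))
```

A description such as LRG_1:c.572G>C, which omits the transcript, is rejected by this check.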

Genomic reference sequence

When a genomic reference sequence is used the following recommendations should be followed;

it should include all known exons and cover all known transcripts
to facilitate the description of variants in immediate gene flanking regions (e.g. the promoter region), it should contain several kilobases of 5' upstream (~5 kb) and 3' downstream (~2 kb) sequences
NOTE: this demand ensures that when new data indicate that the gene's transcriptional start site or polyA-addition site should be shifted, it is often not necessary to generate a new reference sequence
descriptions in relation to a genomic reference sequence do not include "+", "-" or other signs
when the complete genomic sequence is not known, a coding DNA reference sequence should be used

For complex genes, when on the genomic reference sequence all transcripts are annotated properly, computational tools (like Mutalyzer and the Genomic Mutation Consequence Calculator) can easily predict the consequences of a sequence change on all transcripts and their encoded protein isoforms, incl. when they derive from overlapping genes.

Coding DNA reference sequence

When a coding DNA reference sequence is used the following recommendations should be followed;

the coding DNA reference sequence should be complete and preferably derived from the RefSeq database (format NM_033337.2) or Ensembl (format ENST00000343849). When the RefSeq database does not contain a file covering the sequence of interest, one should make one, annotate it properly and submit it to the RefSeq database
the genomic reference sequence used as its basis should be given; to facilitate the description of variants in immediate gene flanking regions (e.g. the promoter region), it should contain several kilobases of 5' upstream (~5 kb) and 3' downstream (~2 kb) sequences
NOTE: this demand ensures that when new data indicate that the gene's transcriptional start site or polyA-addition site should be shifted, it is often not necessary to generate a new reference sequence
to facilitate the description of variants and to prevent confusion it should cover the major and largest transcript known and include as many exons as possible, even when this transcript has not been proven to actually exist in nature. Exons that disrupt the main reading frame should not be included but annotated in the introns
for difficult cases the best choice is up to those who need to report sequence variants in relation to the sequence, e.g. clinical labs. It is important that the experts:
- come to a consensus and use ONE coding DNA reference sequence only
- clearly annotate in the coding DNA reference sequence which sequences are used in other transcripts and protein isoforms
exonic sequences not covered by the coding DNA reference sequence selected should be numbered in relation to the selected sequence (see Examples), i.e.;
- using the '-' sign when 5' of the exon containing the translation initiation site (ATG-codon)
- when inside the gene like intronic nucleotides (e.g. c.435+574, c.896-56, etc.)
- using the '*' sign when 3' of the exon containing the translation termination site (stop codon)
NOTE: the numbering suggested can also be applied to amino acid numbering (e.g. p.Ala456+1, p.Cys456+2, ..., etc. or ..., p.Phe457-2, p.Gln457-1)

Suggestions have been made to extend the recommendations for the nucleotide numbering of coding DNA reference sequences to specifically indicate untranscribed nucleotides (see Discussion).

As discussed, genes can be rather complex and the choice of a good coding DNA reference sequence can be very difficult. Below we will refer to some examples of how experts have tried to resolve the issue.

Examples

one gene - many promoters/5'-first exons (5' and internal) - the DMD-gene
Since DMD is mainly a muscle-disease the major transcript of the gene found in muscle (Dp427m) was chosen as the coding DNA reference sequence. This includes all differently spliced exons, e.g. exons 71 and 78 reported to be present each in about 50% of the transcripts.
- exon 78 seems to be present only in higher organisms - transcripts lacking exon 78 change the reading frame of exon 79, extending the C-terminal end of the encoded protein beyond that of the normal reading frame. To ensure that this is not overlooked, this feature is clearly annotated in the coding DNA reference sequence (see DMD coding DNA reference sequence).
- the DMD-gene has seven different promoters/5'-first exons. Six of these are internal, i.e. located in an intron of the coding DNA reference sequence transcript. These promoters/5'-first exons are clearly annotated and they are numbered using intronic nucleotides (see e.g. Dp260 located in intron 29). The exception is the promoter/5'-first exon expressed in brain (Dp427c) which starts 128 kb upstream of the muscle promoter. This promoter/5'-first exon is again clearly annotated and is numbered using nucleotides 5' of the ATG (see e.g. Dp427c).
one gene - two promoters/ 5'-first exons - two alternative reading frames - the CDKN2A gene
The CDKN2A gene uses two alternative promoters/5'-first exons (separated by 19 kb) and shared exons 2 and 3. When the two exons 1 are spliced to exons 2 and 3 they are out of frame with each other, i.e. use a different reading frame, one encoding a protein called p14ARF, the other a protein called p16. As a consequence some sequence variants affect only one of the two proteins while other variants have different consequences for both proteins. To prevent confusion both transcripts / proteins are clearly indicated and sequence variants are described in relation to both transcripts/protein isoforms (see CDKN2A sequence variant database).
one gene - alternative splice sites internal exon - the MUTYH gene
The MUTYH gene encodes several transcripts differing by the splice acceptor site used in exon 3 (intron 2), adding 72, 63 or 3 nucleotides 5' of a common exon 3 sequence (adding 14, 11 or 1 amino acids resp.). Although it is not clear in which tissues and to what level these transcripts are expressed, to prevent confusion and to ensure that the entire region is checked for sequence variants, the suggestion has been made to include all nucleotides in the coding DNA reference sequences (see MUTYH coding DNA reference sequence).
one gene - alternative 3'-last exons - the LMNA gene
The LMNA gene encodes two major protein isoforms; lamin-A and lamin-C. These proteins are generated by two major transcripts, one ending after exon 10 (lamin-C, located in intron 10 of lamin-A), one extending further downstream to exon 12 (lamin-A). The coding DNA reference sequence used is based on the largest transcript, i.e. lamin-A (see LMNA). Nucleotide numbering for the shorter lamin-C uses intron 10 nucleotides (1698+1 to 1698+123 see lamin-C), amino acid numbering uses p.Val566+1 to p.Arg566+6.

Do you have other examples? Please let us know (E-mail to: J.T.den_Dunnen @ LUMC.nl).

Numbering exons & introns

The HGVS recommendations for the description of sequence variants do not include suggestions for the numbering of exons and introns. The simple reason is that exon/intron numbers are not required for a correct description. When necessary, the exon/intron numbers can be derived from the description at DNA level.

In fact, using exon/intron numbers introduces a lot of confusion, which is undesired: if an exon number is in conflict with the description of the variant at DNA level, what should one do? In many genes there is no consensus on exon/intron numbering and originally used numbering schemes had to be revised to include newly discovered exons (internal as well as 5' and/or 3' of the gene). This led to all kinds of numbering schemes without consensus or overall logic, making it very difficult for non-experts in the specific gene to keep track of all details (see Dalgleish et al. 2010). With the increasing use of genome browsers, which simply number exons from start to end (1, 2, 3, etc.), legacy numbering schemes have become even more confusing.

The only logical thing to do is to follow the standard set by the genome browsers and to start numbering with 1 for the first exon. Although this is probably difficult for the experts to accept, we can not keep on confusing newcomers by forever using legacy numbering systems. We should realize that at some point wrong assumptions will be made and a patient will end up with an erroneous diagnosis, which is of course unacceptable.

Recommendation exon/intron numbering

Describe variants at DNA level and do not include exon or intron numbers as part of the description. Exon and intron numbers may be mentioned but only when their use is specified and reference sequences for the exons and introns are given. Since history will leave its tracks, when referring to older data, always mention changed numbering schemes in M&M and in Figure and Table legends to prevent any confusion. For tables, even consider adding a column indicating the legacy numbering.

Examples;

c.IVS2+2T>G
- is confusing, use a description like c.88+2T>G
c.IVS12-1G>T
- is confusing, use a description like c.2417-1G>T
a deletion exon 30 to 36 in the DMD-gene
- is confusing, use a description like NM_004006.2:c.4072-?_5154+?del

Changed recommendations

Description of intronic variants

Initial recommendations (see e.g. Antonarakis [1998] Hum.Mut. 11: 1-3) suggested two alternative descriptions for variants in intron sequences based on a coding DNA reference sequence; the formats c.88+2T>G / c.89-1G>T and c.IVS2+2T>G / c.IVS2-1G>T. The current recommendation is that the format c.IVS2+2T>G / c.IVS2-1G>T should not be used anymore.

Reason: from the description c.IVS2+2T>G it is difficult to deduce where the position of the intron relative to the coding DNA sequence is. In addition, when one wants to deduce this position, this is often problematic. First, many authors fail to mention the genomic + coding DNA reference sequences that were used as the basis of exon/intron numbering. Second, since on first publication gene sequences are often based on incomplete sequences, initial exon / intron structure often turns out incomplete and numbering changes later (see Numbering exons / introns). Consequently, descriptions using the format c.IVS2+2T>G fail the basic criterion to be unequivocal and should thus not be used. Descriptions using the format c.88+2T>G do not suffer from these problems.
NOTE: when intronic variants are described in relation to a coding DNA reference sequence authors should not forget to mention the genomic reference sequence where the intron sequence can be found.

A basic recommendation is to use the shortest description possible. Therefore, in the middle of an intron nucleotide numbering changes from + to - (e.g. from "c.88+.." to "c.89-.."). In addition, when a change in an intron is described as c.88+4356A>G (instead of c.89-2A>G) it will not be clear that the change is close to the splice acceptor site, and thus might affect splicing. This is immediately clear when the description c.89-2A>G is used.
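The switch from "+" to "-" in the middle of an intron can be sketched as follows. The function `intron_label` is a hypothetical helper, and it makes two assumptions that are not stated in this section: the flanking exon positions are positive coding positions (so 5'UTR and 3'UTR introns are not handled), and for an intron with an odd number of nucleotides the central nucleotide is the last one described with "+".

```python
def intron_label(last_exon_position, intron_length, index):
    """Label the index-th nucleotide (1-based, counted from the 5' end) of an
    intron that follows coding position `last_exon_position`."""
    if not 1 <= index <= intron_length:
        raise ValueError("index lies outside the intron")
    # Assumption: with an odd length, the central nucleotide still gets "+".
    plus_half = (intron_length + 1) // 2
    if index <= plus_half:
        return f"c.{last_exon_position}+{index}"
    # Count back from the first nucleotide of the next exon.
    return f"c.{last_exon_position + 1}-{intron_length - index + 1}"


# An intron of 4357 nucleotides between c.88 and c.89:
print(intron_label(88, 4357, 2))     # near the splice donor site
print(intron_label(88, 4357, 4356))  # near the splice acceptor site
```

With these inputs the second call gives c.89-2 rather than c.88+4356, which is the shorter and more informative description discussed above.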

NOTE: when an intron immediately follows the last nucleotide of the stop codon (position c.876), nucleotides in the intron are numbered like c.876+1, c.876+2, c.876+3, … c.*1-3, c.*1-2, c.*1-1.

Frequently asked questions

Report with 20 bp on either side

Question
When a description in relation to a Reference Sequence is problematic, could one specify the change together with 20 bp of flanking sequence on both sides?

Answer
In many cases this would be OK but for recently duplicated genes or genes which contain repeated segments even giving 20 nucleotides to either side will not be sufficient. Furthermore, descriptions will become very long. For problematic cases the best method is probably to include the raw data, i.e. the sequence file itself.

Database sequence does not start with 1 at A of ATG

Question
When I retrieve a cDNA sequence from GenBank nucleotide numbering does not start with +1 at the A of the ATG translation initiation codon.

Answer
True, but such a file can be simply obtained. When you retrieve the sequence from the RefSeq-database (i.e. start at EntrezGene, enter the gene symbol or gene name, select the gene of interest, click the mRNA entry) it will be annotated extensively (see Example). Clicking the "CDS" annotation (CoDing Sequence) opens a window where the nucleotide numbering will start with 1 at the A of the ATG translation initiation codon (see Example). To assist those studying or reporting sequence variants a locus specific database (LSDB, see HGVS - list of LSDBs) usually provides the coding DNA reference sequence with the nucleotide numbering (see Example).

Genomic reference sequence split in several files

Question
The recommendation on numbering genomic and coding DNA variants based on the first nucleotide of the initiation codon ATG is workable only if the reference sequence in the database is published as a single file. In the case of the gene CDKN2A, its genomic sequence is stored as multiple files, each containing one exonic sequence and partial intronic sequences on both ends of the exon. I can use the above recommendation easily to number variants in exon 1 where the initiation codon is located. The problem is how I should number variants in exon 2, which is located in another database file.

Answer
If no database file is available that contains the complete genomic sequence, a coding DNA Reference Sequence, preferably from the RefSeq database, should be used. Since for many organisms a genome sequence is freely available, a database curator can easily make a fully annotated file (genomic and coding DNA) covering the sequence of interest and submit it to the RefSeq database. This file can then be used as the reference sequence.

Variant 1 Mb upstream in another gene

Question (Tracy Lester, Oxford, UK)
We are wondering how to name variants in ZRS, a regulatory sequence for SHH that lies 1 Mb upstream of SHH in intron 5 of LMBR1. Variants in ZRS are associated with various limb abnormalities and to-date have been numbered according to a sequence which does not follow HGVS guidelines. Should we create a genomic reference sequence for SHH that includes 1 Mb of upstream sequence to encompass the ZRS, number it according to the LMBR1 reference sequence, or something else?

Answer
A difficult case. I see 3 options;

simply describe the variants using genomic coordinates. Checking the SHH gene variant database, which uses NM_000193.2 as a reference transcript, a change of the A of the ATG codon c.1A>G would then be chr7.hg19:g.155604816T>C (use NM_000193.2:c.1A>G in Mutalyzer).
describe the variants in the LMBR1 gene variant database, which uses NM_022458.3 as a reference transcript. To make the connection with SHH you can add that no variants were found in the SHH gene (description c.=) making sure the case emerges in the SHH database overview.
ask NCBI to extend the RefSeqGene record NG_007504.1 for SHH with the 1 Mb region upstream. Similarly, ask for an LRG at EBI. When this reference sequence is then attached to the SHH gene variant database, variants can be described in relation to that sequence.

RefSeq numbering with introns 5' of the ATG

Question (Isabelle Touitou, Montpellier, FRANCE)
If the first translation ATG is in exon 2, and we find a variant 5' to exon 1, should we include intron 1 in the counting process?
NOTE: based on a coding DNA reference sequence intron 1 is located between nucleotides -15 and -14.

Answer
Nucleotides in introns 5' of the ATG translation initiation codon (i.e. in the 5'UTR) are numbered as all other nucleotides (see Examples and Figure). In your example, based on a coding DNA reference sequence, an intron is present between nucleotides -15 and -14. The nucleotides for this intron are numbered as -15+1, -15+2, -15+3, ...., -14-3, -14-2, -14-1. Consequently, regarding the question, when a coding DNA reference sequence is used, these intronic nucleotides are not counted.

Numbering exons / introns

Question
The CBS gene was originally thought to contain 16 exons. Later it was recognised that exon 15 does not exist, and recently two additional non-translated 5' exons were detected. The current gene structure therefore includes 17 exons, of which exons 3 to 17 are translated. Should the exons of a gene be counted from the exon that contains the start codon rather than the beginning of the cDNA? If so, should exons preceding the start codon be counted 0, -1, -2, etc. or should the 0 be skipped? Is there an agreement on how to deal with corrections in exon numbering?

Answer
For the description of sequence changes it does not matter how exons are numbered; exon (and intron) numbers are not used in the descriptions. In fact this is one reason why the recommendation is as it is (see Description of intronic variants). Examples (using a coding DNA reference sequence):

c.-5G>T: a change 5' of the ATG (in the 5'UTR)

c.5G>T: a change in the coding sequence (related to a change in amino acid 2)

c.256+1G>T: a change in the 5' end of an intron

c.257-1G>T: a change in the 3' end of an intron

c.*5G>T: a change 3' of the stop codon (in the 3'UTR)
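That the location can be read from the description alone, without exon or intron numbers, can be illustrated with a small classifier. This is a sketch under stated assumptions: the regular expression and the function `locate` are hypothetical, and only a single position with a known intronic offset is handled (ranges and "?" offsets are rejected).

```python
import re

# Single position relative to a coding DNA reference sequence, e.g.
# "c.-5", "c.5", "c.256+1", "c.257-1", "c.*5".
POSITION = re.compile(r"^c\.([-*]?)(\d+)(?:([+-])(\d+))?")

REGIONS = {"-": "5' of the ATG", "": "coding region", "*": "3' of the stop codon"}


def locate(description):
    """Return (region of the anchoring exon nucleotide, exonic/intronic side)."""
    match = POSITION.match(description)
    if match is None:
        raise ValueError(f"not a coding DNA position: {description!r}")
    # Refuse ranges and uncertain offsets instead of misclassifying them.
    if description[match.end():][:1] in ("+", "-", "_"):
        raise ValueError("ranges and uncertain offsets are not handled here")
    sign, _base, offset_sign, _offset = match.groups()
    region = REGIONS[sign]
    if offset_sign is None:
        return region, "exonic"
    side = "5' side of intron" if offset_sign == "+" else "3' side of intron"
    return region, side


for example in ("c.-5G>T", "c.5G>T", "c.256+1G>T", "c.257-1G>T", "c.*5G>T"):
    print(example, locate(example))
```

The same function also handles introns in the 5'UTR, such as the position c.-15+1 discussed in an earlier question.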

For exon numbering the only logical thing to do is to start with 1 for the first exon, otherwise eventually problems will emerge. For other numbering schemes only the experts will know the history; newcomers just blindly assume that the first exon is exon 1. Consequently, when historic numbering schemes are used, at some point wrong assumptions will be made and a patient might end up with an erroneous diagnosis.
However, since history will leave its tracks it is suggested to always mention changed numbering schemes in M&M and in all Figure and Table legends to prevent any further confusion. For tables even consider to add an additional column indicating the historic / old exon number.

Alternatively spliced exon

Question (Alessandra Splendore, Rio de Janeiro, Brasil)
Recently two previously unidentified exons of the TCOF1 gene were identified, and named 6A and 16A. Exon 6A is present in most of the transcribed isoforms, exon 16A is included only in minor isoforms. In updating the nomenclature of reported mutations in TCOF1, should I use a sequence that corresponds to the major isoform (with exon 6A, but without 16A) or the sequence that corresponds to the longest ("most complete") isoform?

Answer
This is the eternal problem of changes in the coordinates of a reference sequence. The best solution is that the TCOF1-community gets together and decides to use an updated reference sequence representing the most complete transcript, i.e. including exons 6A and 16A. This updated sequence should be annotated properly, submitted to the RefSeq database and used from then on.

Genomic sequence changes regularly

Question (JM Friedman, Vancouver, CANADA)
We are working on a new locus-specific mutation database for NF1 and NF2, and we have run into a problem with the standard mutation nomenclature based on the genomic sequence. The problem is that the canonical genomic sequence (and consequent numbering) we are using as the basis of the mutation nomenclature has changed repeatedly since many of the mutations were described, and it is continuing to change. If we use the names assigned to the mutations on the basis of the version of the sequence that was used to name the mutations, they do not map to the proper position in the current version of the sequence. If we change the names to match the new sequence, they will not match the published names for these mutations and may need to be changed again the next time the sequence changes. (Actually, the current version of the NF1 sequence is annotated on the wrong strand, so all the numbering would be backwards if we used the annotated strand instead of its complement, which is really the correct one).
The solution to identifying the mutation unequivocally is to provide enough of the surrounding sequence to permit a unique result on a BLAST search, and we are doing this. However, this does not solve the problem of naming the mutations. What is your recommendation for this?

Answer
Indeed the problems you mention make life very hard. In fact, especially with genes containing large introns, there will be no single genomic reference sequence since every copy of the gene will be slightly different (see above). The problem of continuously changing genomic sequences will not be settled rapidly. When designating "THE genomic reference sequence" now one can already foresee future discussions on whether this choice was proper; it will be a "random pick" and might not be the evolutionarily correct choice. The way to go in our eyes is to declare one sequence THE genomic reference sequence (starting several kilobase pairs 5' of the promoter region), annotate it properly, submit it to the RefSeq database and use it from then on. The RefSeq database has NG_ files specifically made for this purpose (see e.g. NG_000004.2). These problems are one of the reasons why for the LSDBs I curate (i.e. Johan den Dunnen), I prefer a coding DNA Reference Sequence. In that case the ever changing intronic sequences have only a marginal effect.

Difference chromosome / coding DNA description

Question
For genes that are on the minus strand of a chromosome (opposite transcriptional orientation) the description based on chromosome coordinates may differ significantly from that based on the coding DNA reference sequence. Say the chromosome sequence is -TGGGGCAT- and one of the G's is deleted (change to -TGGG_CAT-). Based on chromosome coordinates the description is g.5delG. However based on the coding DNA reference sequence (ATGCCCCA) the description is c.7delC. Not only is the deleted nucleotide different (delG vs. delC), in fact the descriptions also point to another nucleotide, g.5 vs. g.2 (equal to c.7delC). Is this correct?

Answer
Yes, this is correct. When genes are on the minus strand of a chromosome (opposite transcriptional orientation) and the change is located in a repeated sequence (mono-, di-, tri-, etc. stretches) the rule that for all descriptions the most 3' position possible should be assigned (see General recommendations) has this as a consequence.
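The example from the question can be reproduced step by step. The sketch below is an illustration only; `reverse_complement` and `shift_3prime` are hypothetical helper names, and the shifting function handles only single-nucleotide deletions inside a run of identical nucleotides.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def shift_3prime(seq, pos):
    """Move a single-nucleotide deletion at 1-based `pos` to the most 3'
    equivalent position within a run of identical nucleotides."""
    while pos < len(seq) and seq[pos] == seq[pos - 1]:
        pos += 1
    return pos


chrom = "TGGGGCAT"                   # chromosome (plus strand) sequence
coding = reverse_complement(chrom)   # coding DNA reference, "ATGCCCCA"

g_observed = 3                                 # any G in the stretch g.2 to g.5
c_observed = len(chrom) - g_observed + 1       # same nucleotide on the coding strand

g_pos = shift_3prime(chrom, g_observed)
c_pos = shift_3prime(coding, c_observed)

print(f"g.{g_pos}del{chrom[g_pos - 1]}")       # g.5delG
print(f"c.{c_pos}del{coding[c_pos - 1]}")      # c.7delC
print(f"c.{c_pos} lies at g.{len(chrom) - c_pos + 1}")  # g.2
```

Because each description is shifted to the 3' end of its own strand, the two descriptions end up at opposite ends of the same stretch of four nucleotides.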

Coding DNA sequence incomplete

Question
We are preparing an annotated set of Hox genes from the zebrafish for publication. If the coding DNA sequence is not completely known, but only an EST lacking 5' sequence and a genomic sequence covering the EST, how do you describe a change in this sequence; do you number it in relation to the EST or the genomic sequence? Furthermore, if there is a mismatch between the genomic and the EST sequence, and you don't know which one is correct, how do you define e.g. whether the genomic sequence has an insertion or the EST has a deletion?

Answer
First, the reference sequence chosen is always assumed to be the correct sequence simply because changes are described in relation to this sequence.

Second, when the EST sequence is incomplete one should describe changes in relation to this sequence like AA010203.2:54_55insG (assuming the reference sequence used is AA010203.2). So do not use a 'c.' or 'g.' prefix, since neither a coding DNA nor a genomic reference sequence is used. However, when a genomic sequence covering this EST is available the recommendation is to use this as a reference sequence.

Frequency / "wild type" sequence

Question
Making a judgment on what is the "wild type" (wt) nucleotide for some sequences seems arbitrary at best. How would you suggest that the description be presented for these?

Answer
Changes are always described in relation to a "reference sequence". This reference sequence is treated as the "wild type" sequence and is expected to be the one present in the database (GenBank). Consequently, the reference sequence and the actual (most common) wild type sequence can be different. Note however that everybody has influence on the sequences in the RefSeq database and thus may request that a variant is changed into the more common allele. However, the debate about what is wild type can be unsolvable when variants are very common (near 50%) or differ between populations.

Changes in mitochondrial DNA

Question (M Paalman, Human Mutation)
How should sequence variants in the mitochondrial DNA (mtDNA) be described?

Answer
The mtDNA genome is rather small, completely sequenced and numbered. According to current recommendations variants in the mitochondrial DNA should be described in relation to the full mitochondrial DNA sequence, i.e. for human the Homo sapiens mitochondrion, complete genome (GenBank NC_012920.1). Descriptions should be preceded by "m.", like m.8993T>C (see Recommendations). The mtDNA encodes a range of different proteins. To prevent confusion, changes at protein level should be described including a reference to the protein changed, like ATP6:p.Leu156Pro (GenBank YP_003024031.1, ATP synthase 6).
NOTE: for issues related to mitochondrial DNA sequences see MITOMAP.

Changes in non-coding RNA (ncRNA) genes

Question
How should sequence variants be described in genes that produce only RNA (so no protein), e.g. ncRNA, miRNA, etc.?

Answer
To describe variants in genes that produce an RNA molecule but no protein, a genomic reference sequence can be used ("g." description). When available, it is also possible to use an NR_ transcript reference sequence (e.g. NR_000020.1 for the small nucleolar RNA, C/D box 33 (SNORD33) gene) using the prefix "n." (see Standards). Numbering for the transcript reference sequence starts with position "n.1" and ends with the last position.
NOTE: suggested addition, see SVD-WG002
