HGVS recommendations - Discussions

Discussions regarding the description of sequence variants

Last modified February 1, 2014

NOTE: this website has been frozen since May 1, 2016. It has been replaced by a new version at http://www.HGVS.org/varnomen. These pages serve as an archival copy only.

Introduction
Discussion / recent modifications
- reference sequences
  - which reference sequence to use (genomic or coding DNA)
  - accession number
- traditional descriptions
- numbering untranscribed nucleotides
- polymorphisms
- silent protein changes
- descriptions of a range using "_"
- two or more sequence variants in one individual
  - two sequence variants in one gene (chromosome)
  - two sequence changes with genes (chromosomes) unknown
  - recessive disease
  - sequence variants in two genes on one chromosome
  - sequence variants in two genes on two different chromosomes
- more transcripts / proteins from one gene
- large deletions, split reference sequence
- deletions with unknown molecular breakpoints
- duplication or insertion, duplication or (2) ??
- triplication, quadruplication, ...
- loss from a run of nucleotides
- insertions
- insertion-deletions (indels)
- translation initiation
- frame shift variants
New recommendations
- alternative transcripts (Reference Sequence)
- accepted sequence indicators
- SNPs - description of SNP's (in Tables and text)
- homo/heterozygotes
- haplotypes
- translocations
- fused gene products
- one or three letter amino acid code (Trp or W)
- translation initiation codon changes (Met1)
- gene conversions
- exact position not known
- protein description between brackets (added 2012-10-12)
Examples to describe changes

Introduction

This page gives an overview of the discussions raised and suggestions made to describe sequence variations after publication of the latest manuscript on this issue by JT den Dunnen and S Antonarakis (2000, Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Human Mutation 15: 7-12, copy in PDF-format). We invite investigators to send us further remarks on the issues discussed here. Furthermore, we solicit complicated cases not yet covered, with a suggestion regarding how to describe these. We hope these pages will be used as a guide to describe any sequence change, ultimately evolving into a uniformly accepted standard.

For reactions: contact us by E-mail (to: HGVSmn @ JohanDenDunnen.nl) or using the HGVS variant description forum.

Discussion / recent modifications

Accession number

As correctly pointed out by Peter Slickers (Clondiag Chip Technologies), providing a database accession number is not sufficient to identify a sequence in the database unambiguously. There may exist several different versions for a given accession. In most cases only the annotation changes, while the sequence remains the same, but this is not always the case and one can not rely on this (compare e.g. NM_000130.1 and NM_000130.2). Therefore one should always use accession AND version number to refer to the reference sequence (see Recommendations).
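
The check for a version number can be sketched as follows. This is a minimal illustration only: the function name and the regular expression are assumptions, modelled on the accession.version identifiers that appear on this page (e.g. NM_000130.2, L78833.1), and do not cover every database format.

```python
import re

# Assumed pattern: one or two capital letters, an optional underscore,
# digits, a dot and a numeric version (e.g. NM_000130.2 or L78833.1).
VERSIONED = re.compile(r"^[A-Z]{1,2}_?\d+\.\d+$")


def has_version(reference: str) -> bool:
    """Return True when the reference carries an explicit version number."""
    return VERSIONED.match(reference) is not None


print(has_version("NM_000130.2"))  # accession AND version
print(has_version("NM_000130"))    # accession only, ambiguous
```

A reference that fails this check identifies a database record but not necessarily one specific sequence.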

Traditional descriptions

In some cases the description of pathogenic changes in genes started well before there were any mutation nomenclature recommendations (e.g. in thalassemias and cystic fibrosis). When new reports describe variants according to current recommendations, instead of using traditional descriptions, experts in the field experience problems "recognizing" these variants. However, nomenclature rules should be universal and thus can not be made to apply for specific situations. The traditional notation only rings a bell for experts in the field, for others it is cryptic and confusing.
Although this is inconvenient, traditional descriptions should no longer be used. When the recommended descriptions are used, it will be only a matter of time until the experts also become familiar with them. The recommendation is to list official and traditional names next to each other in separate columns of the variant summary Table, like c.88+2T>G and IVS#+2T>G, p.Phe508del and delF508, or c.24dupG and Cd8/9+G.

Numbering untranscribed nucleotides

Several people have requested that the recommendations for the numbering of nucleotides using a coding DNA reference sequence be extended to include a specific description for untranscribed nucleotides (i.e. 5' of the transcription initiation site (cap-site) or 3' of the polyA-addition site). Thus far, these requests have not been granted. The main reason is that genes often have several transcription initiation sites (promoters/5'-first exons) as well as polyA-addition sites (3'-terminal exons). Furthermore, the transcription initiation or cap-site is often ill-defined (see also Practical problems coding DNA reference sequence). Consequently, the suggested information in the description (indicating that the variant lies in untranscribed sequences) is neither very reliable nor very informative. In addition, it further complicates the already complex description using a coding DNA Reference Sequence.

Recently, our knowledge of the genome and its transcription has matured quickly, and transcription initiation and polyA-addition sites have been mapped much more precisely. When, as recommended, a stable LRG-based reference sequence (see Recommendations reference sequence) is used, these uncertainties are less of an issue.

The most mature suggestion is to extend the current recommendations (see Numbering coding DNA reference sequence) with;

coding DNA reference sequence

-N-uM = nucleotide M 5' (upstream) of nucleotide -N, the transcription initiation site (e.g. -237-u5A>G)
NOTE: restricted to nucleotides 5' of the transcription initiation site (cap site, i.e. upstream of the gene incl. the promoter)
*N+dM = nucleotide M 3' (downstream) of nucleotide *N, the transcription termination site (e.g. *237+d5A>G)
NOTE: restricted to locations 3' of the polyA-addition site (downstream of the gene)
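
To make the suggested format concrete, the positions above could be split into their parts as sketched below. Note that this notation is only a suggestion discussed on this page; the function name and the returned tuple layout are illustrative assumptions.

```python
import re

# -N-uM : M nucleotides upstream of transcription initiation site -N
# *N+dM : M nucleotides downstream of transcription termination site *N
FLANKING = re.compile(r"^(?:-(\d+)-u(\d+)|\*(\d+)\+d(\d+))")


def parse_flanking(position: str):
    """Return (direction, N, M) for a suggested -N-uM or *N+dM position."""
    match = FLANKING.match(position)
    if match is None:
        raise ValueError(f"not a -N-uM or *N+dM position: {position!r}")
    if match.group(1) is not None:
        return ("upstream", int(match.group(1)), int(match.group(2)))
    return ("downstream", int(match.group(3)), int(match.group(4)))


print(parse_flanking("-237-u5A>G"))
print(parse_flanking("*237+d5A>G"))
```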

Polymorphisms

In the past, descriptions like c.76A/G and p.36L/I (p.36Leu/Ile) have been used to describe "polymorphic" sequence variants (see Mutation / polymorphism). Note that a description of a variant should be neutral and not include any functional conclusion; consequently, polymorphisms and changes affecting function ("pathogenic") should not be described differently. Note that it will often be very difficult to discriminate between variants affecting function and those that are truly neutral (not affecting function).

Silent protein changes

Description of so called "silent" changes in the format p.(Leu54Leu) (alternatively p.(L54L)) should not be used; descriptions should be given at DNA level. The description at protein level is not informative and not unequivocal (there are at least five possibilities at DNA level which may underlie p.(Leu54Leu)). A correct description has the format c.162C>G (p.(Leu54=)), with "p.(Leu54=)" indicating that there is no effect on protein level expected.

NOTE: the recommendation for the description of silent protein changes was recently modified (see proposal SVD-WG001 - No change). The recommended format is now c.162C>G p.(Leu54=); the change at DNA level should always be listed.
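
The "at least five possibilities" can be illustrated with a small sketch. It assumes the standard genetic code and, as can be deduced from c.162C>G, that codon 54 (nucleotides 160 to 162) is CTC, the only leucine codon ending in C. The function name is hypothetical.

```python
# The six leucine codons of the standard genetic code (DNA alphabet).
LEU_CODONS = {"TTA", "TTG", "CTT", "CTC", "CTA", "CTG"}


def silent_alternatives(reference_codon: str):
    """List the other leucine codons that all give p.(Leu54Leu)."""
    if reference_codon not in LEU_CODONS:
        raise ValueError(f"{reference_codon} does not encode leucine")
    return sorted(LEU_CODONS - {reference_codon})


alternatives = silent_alternatives("CTC")
print(alternatives)       # includes CTG, the result of c.162C>G
print(len(alternatives))  # number of DNA-level changes hidden by p.(Leu54Leu)
```

Each of these codons gives the same protein-level description, which is why the change at DNA level should always be listed.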

Descriptions of a range using "_"

Initially, the "-"-character (hyphen) was used for two different purposes, i.e. to indicate a range (nucleotides c.12-13delTG) as well as to indicate a negative number (e.g. for intronic sequences like in c.77-2A>G). This dual use might cause confusion, which should be avoided. For example, when the change is c.12-13del, does this indicate a deletion from coding DNA nucleotide 12 to 13 or a deletion of the intronic nucleotide c.12-13? Since for intronic positions both the "+" and "-" characters are essential, the recommendation is to use the "_"-character (underscore) to indicate a range.
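
The effect of reserving "_" for ranges can be shown with a small parser sketch. It is an illustration under simplifying assumptions: only positive coding DNA positions with an optional intronic offset are handled, and the function names are hypothetical.

```python
import re

# A coding DNA position with an optional intronic offset, e.g. 12 or 77-2.
POSITION = re.compile(r"^(\d+)([+-]\d+)?$")


def parse_position(text: str):
    """Return (coding_position, intronic_offset); the offset is 0 in exons."""
    match = POSITION.match(text)
    if match is None:
        raise ValueError(f"unsupported position: {text!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def parse_range(text: str):
    """Split on '_' only, so '-' can never be mistaken for a range."""
    first, separator, last = text.partition("_")
    start = parse_position(first)
    end = parse_position(last) if separator else start
    return start, end


print(parse_range("12_13"))  # coding DNA nucleotides 12 to 13
print(parse_range("12-13"))  # the single intronic nucleotide c.12-13
```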

Two sequence variants in one individual

Two sequence variants in a gene on one chromosome
den Dunnen & Antonarakis (2000, Hum.Mut. 15: 7-12) suggested describing two sequence variations in a gene on one chromosome as [c.76A>C+c.83G>C], i.e. using a "+"-character to separate the two changes. The previous description was c.[76A>C;83G>C] (Antonarakis, S.E. and the Nomenclature Working Group, 1998, Hum.Mut. 11: 1-3), i.e. using a semicolon (";") to separate the two changes. Since the suggestion of den Dunnen & Antonarakis might introduce confusion with older publications, it was retracted. Furthermore, for consistency and to keep descriptions as short as possible, it was decided to write all indications of the reference sequence used in front of the square brackets. A correct description is thus;

c.[76A>C; 83G>C]

Two sequence changes in a gene (chromosome unknown)
When two variants are identified in a gene but it is not known whether these are on one chromosome (in cis) or on different chromosomes (in trans), the recommendation is to describe these as "[change1(;)change2]" (see FAQ). Of course it is recommended to determine whether the changes are on the same chromosome or not. Still, when this has not yet been done, it is important to make this absolutely clear in the description. A correct description is;
- c.[76A>C(;)83G>C]
Recessive disease - two sequence variants in one gene (different chromosomes)
In recessive diseases, sequence changes are expected in a gene on both chromosomes. The description of the changes found should indicate whether variants were found in the gene on both chromosomes and in which combination. The latter is important since severity might depend on the combination of variants present, while other combinations might not be deleterious.
Examples;

c.[76A>C];[87delG] - description of the changes in a gene on both chromosomes in a heterozygous case with a recessive disease
On protein level, the designation is likewise; p.[Arg175*];[Cys305Ser], p.[(Arg175*)];[(Cys305Ser)] when RNA was not analysed
c.[76A>C];[76A>C] - description of a homozygous change in a recessive disease
c.[76A>C];[?] - a case where only one variant on one chromosome could be identified
when there is no change in the other chromosome the format is c.[76A>C];[=] (see FAQ)
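
The three bracket formats described in this section (in cis, phase unknown, in trans) can be assembled mechanically, as in the sketch below. The function names are illustrative assumptions; the output strings follow the examples given above.

```python
def in_cis(changes, level="c"):
    """Variants known to lie on the same chromosome: c.[76A>C;83G>C]."""
    return f"{level}.[{';'.join(changes)}]"


def phase_unknown(changes, level="c"):
    """Variants whose phase has not been determined: c.[76A>C(;)83G>C]."""
    return f"{level}.[{'(;)'.join(changes)}]"


def in_trans(allele1, allele2, level="c"):
    """One bracket pair per chromosome: c.[76A>C];[87delG]."""
    return f"{level}.[{';'.join(allele1)}];[{';'.join(allele2)}]"


print(in_cis(["76A>C", "83G>C"]))
print(phase_unknown(["76A>C", "83G>C"]))
print(in_trans(["76A>C"], ["87delG"]))
print(in_trans(["76A>C"], ["="]))  # no change on the other chromosome
```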

Two sequence variants in two different genes on one chromosome
The description should be as for variants on one chromosome but should include a reference to the sequence or gene changed.
Examples;
- hg19 chrX:g.[30683643A>G;33038273T>G] - variants in two different genes on the same X-chromosome (GK and DMD gene resp.)
- c.[NM_000167.5:94A>G;NM_004006.2:76A>C] - variants in two different genes on the same X-chromosome (GK and DMD gene resp.). On protein level the description is; GK:p.[Thr32Ala] DMD:p.[Asn26His]
- c.[GK:94A>G;DMD:76A>C] - variants in two different genes on the same X-chromosome, the GK and DMD gene. On protein level the description is; GK:p.[Thr32Ala] DMD:p.[Asn26His]
  NOTE: the reference sequences of the GK and DMD genes should be described elsewhere in the document.
Two sequence variants in two different genes on two different chromosomes
The description should be as for recessive diseases but should include a space (" ") as a separator and a reference to the sequence or gene changed.
Examples;
- hg19 chr1:g.[35227587C>G] chr13:g.[20763083A>T] - variants in two different genes (GJB4 and GJB2) on two different chromosomes (1 and 13). On protein level the description is; GJB4:p.[His244Gln] GJB2:p.[Leu213*]
  NOTE: the reference sequences of the GJB4 and GJB2 genes should be described elsewhere in the document.
- NM_153212.2:c.[732C>G] NM_004004.5:c.[638T>A] - variants in two different genes (GJB4 and GJB2) on two different chromosomes (1 and 13). On protein level the description is; GJB4:p.[His244Gln] GJB2:p.[Leu213*]
  NOTE: the reference sequences of the GJB4 and GJB2 genes should be described elsewhere in the document.

More transcripts / proteins from one gene

As a consequence of the above-mentioned change, the ";"-character should not be used to describe changes which affect RNA-processing, i.e. yielding two or more transcripts (den Dunnen & Antonarakis, 2000, Hum.Mut. 15: 7-12). The suggestion is to use the ","-character (comma) instead (see Recommendations). This rule applies to descriptions at both RNA and protein level.

Large deletions, split reference sequence

If a deletion is large and the reference sequence is split over several files, list at least once (in order) the respective files containing the overall reference sequence. When describing the change, to prevent confusion, include a reference to the sequence used, e.g. AC109326.2:g.82398_L78833.1:g.80466del. In the "Remarks" column of the summary table the size of the deletion could be mentioned (e.g. 160 kb deletion spanning exons 1-22). Please note that, since the reference sequence is split over several files, this size can not be deduced from the description of the sequence (see also Discussion - Fused genes).

Duplication or insertion

Although duplications can be considered a special type of insertion, the recommendation is to describe duplications independently from insertions, using the term "dup". This recommendation also applies to a duplicated mono-, di-, tri-, etc. nucleotide stretch. There are several reasons why the recommendation is to describe such changes as a "duplication" (see Triplication, ...)

- the description is simpler, shorter and less equivocal
- it is clearer and prevents confusion regarding the exact position introduced when an insertion is incorrectly reported like "22insG" (see Insertions)
- it prevents discussions regarding the position of the insertion; in the case of a duplication including the intron/exon border (e.g. c.123-8_137dup), would the "insertion" be in the intron or the exon?
- the term "insertion" more or less implies "coming from elsewhere". Mechanistically, a duplication is more likely to be caused by DNA polymerase slippage, duplicating a local sequence.

Examples

nucleotide level
- c.7dupT (or c.7dup) denotes the duplication (insertion) of a T at position 7 in the sequence ACTTACTGCC to ACTTACTTGCC

protein level
- p.His7dup (or p.H7dup) describes a duplicating insertion in the H repeat sequence, changing MKMGHHHQCC to MKMGHHHHQCC

NOTE: by definition, the description "dup" (see Standards) may only be used when the additional copy directly flanks the original copy on its 3' side (tandem duplication). For large duplications (e.g. one or more exons of a gene) there will often be no such experimental proof; the additional copy could be inserted anywhere in the genome. Without experimental evidence, such changes should be described as an insertion.
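
The two examples above can be reproduced with a short sketch of a tandem duplication, in which the copy is placed directly 3' of the original. The function name is an illustrative assumption; positions are 1-based and inclusive, as in the descriptions.

```python
def apply_dup(sequence: str, start: int, end: int = None) -> str:
    """Insert a copy of sequence[start..end] directly 3' of the original."""
    end = start if end is None else end
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("positions fall outside the sequence")
    copy = sequence[start - 1:end]
    return sequence[:end] + copy + sequence[end:]


print(apply_dup("ACTTACTGCC", 7))  # c.7dup on the nucleotide example
print(apply_dup("MKMGHHHQCC", 7))  # p.His7dup on the protein example
```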

Triplication, quadruplication, ...

Duplications are indicated by the term "dup". The question arose what to do when more copies are involved, i.e. triplications, quadruplications, etc. There are several possibilities. First, like "dup" for duplications, one could use "tri" for triplications, "qua" for quadruplications, etc. Another possibility is to use the recommendation to describe sequence repeat variability and to use "3" for triplication (3 copies), "4" for quadruplication (4 copies), etc. A variant of this possibility is to use rep3, rep4, etc.
To prevent ever more specific notations from being introduced, making the overall description of sequence variants increasingly complicated, tri, qua, rep3, rep4, etc. are not recommended.
NOTE: the format "[N]" can only be used when there is experimental evidence that the additional copies (N-1) are in tandem on the same chromosome.

Examples

nucleotide level
- c.87_93[3] describes the presence of two additional copies (a triplication) of the 7 nucleotides from coding DNA position 87 to 93.
- c.4987-?_5193+?[4] describes the presence of three additional copies (a quadruplication) of exons 17 to 19 of the BRCA1 gene, from an unknown position in intron 16 (c.4987-?) to an unknown position in intron 19 (c.5193+?).
  NOTE: the description implies there is evidence the extra copies are in the BRCA1 locus, in tandem and on the same chromosome.

protein level
- p.(His5_Cys7[3]) (or p.(H5_C7[3]), RNA not analysed) describes a triplication of the amino acid sequence HQC, changing MKMGHQCC to MKMGHQCHQCHQCC
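
The "[N]" format states the total number of tandem copies, not the number of additional copies. A minimal sketch, with a hypothetical function name and 1-based inclusive positions:

```python
def apply_repeat(sequence: str, start: int, end: int, copies: int) -> str:
    """Replace sequence[start..end] by `copies` tandem copies of itself."""
    if not 1 <= start <= end <= len(sequence) or copies < 1:
        raise ValueError("invalid positions or copy number")
    unit = sequence[start - 1:end]
    return sequence[:start - 1] + unit * copies + sequence[end:]


# p.(His5_Cys7[3]): three copies of HQC in total, i.e. two additional copies
print(apply_repeat("MKMGHQCC", 5, 7, 3))
```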

Loss from a run of nucleotides

From Pat O'Neill (Burlington, USA):
I especially like the use of "dup" in place of "ins" when the inserted base creates a run of two or more bases. I feel that there should be a parallel term for the loss of a base from a run of two or more bases instead of just "del". This is because of the mechanistic implications of both an ins and a del of a base in a run. Has this been discussed? My only thought for a term in place of "del" is "los" for loss.
Shuji Ogino (Boston, USA) agrees with this suggestion but suggests using the term "dec" for a decrease in length.

Reply (JdD): Basically the "dup" nomenclature was suggested because the description is simpler, shorter and less equivocal (see Discussion). The potential mechanistic relation is nice but was not decisive. Basically a description should be clear/unequivocal and not so much contain additional information.

Insertions

The description of insertions has been the subject of some discussion. The first point of discussion was whether both nucleotides (amino acids) flanking the insertion site had to be given. In the past, the description 22insG (or Cys22insGly) was used both to indicate insertion at position 22 and insertion after position 22. This situation becomes even more complex when a "-" character is involved, like in -14insG or 456-13insG. Does the latter mean at or after intronic nucleotide 456-13 and, in addition, is the position after nucleotide 456-13 position 456-12 or 456-14? Consequently, to prevent confusion, both flanking residues have to be listed.
The second point of discussion was which character to use as a separator. The initial suggestion was to use the "^"-character (e.g. p.Q83^C84insQ). However, since a character to indicate a range was already available, it was decided to use this character, i.e. the "_"-character (see above).

Insertion-deletions (indels)

The occurrence of a combination of a deletion and insertion, sometimes named "indel", is not rare. Based on existing terminology, a recommendation for their description can be rather straightforward; a combination of a deletion and insertion at the same site is described using the format 112_117delinsTG. On protein level, likewise, as p.Trp33_Lys35delinsArg.
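
How a "delins" description acts on a sequence can be sketched as follows. The example sequence is hypothetical (the page gives no reference sequence for 112_117delinsTG), and the function name is an assumption; positions are 1-based and inclusive.

```python
def apply_delins(sequence: str, start: int, end: int, inserted: str) -> str:
    """Delete sequence[start..end] and insert `inserted` at the same site."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("positions fall outside the sequence")
    return sequence[:start - 1] + inserted + sequence[end:]


# Hypothetical 8-nucleotide sequence: 3_5delinsTG removes GTA and adds TG
print(apply_delins("ACGTACGT", 3, 5, "TG"))
```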

Translation initiation

Date 2012-08-31: based on a new variant reported in the IFITM5 gene (c.-14C>T, generating a new translation initiation codon at position -5), Raymond Dalgleish (Leicester, UK) asked how to describe this variant on protein level.

The recommendation is to describe the generation of new upstream translation initiation codons using the format "p.Met1ext-5", where "-5" is the position of the new translation initiating Methionine.

Argumentation

the description is clear, unequivocal, short and in line/not in conflict with existing recommendations, incl.
- descriptions at protein level should describe the changes observed on protein level (and not try to incorporate knowledge regarding the change at DNA-level)
- amino acids originating from changes introducing upstream translation initiation are numbered like nucleotides (like ..., Gln-2, Thr-1)
p.Met1extMet-5 is an alternative but "Met" in "Met-5" is redundant
describing the variant as an insertion, like p.(Met1_Asp2insAlaLeuGluProMet), is an alternative but the description becomes quite cumbersome when the new initiation codon lies further upstream
the "Recommendations for the description of protein sequence variants (v2.0)" mentioned the example p.Met1ValextMet-12. Based on the new recommendation this example has been changed to p.Met1Valext-12.
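
The relation between the nucleotide position of a new upstream start codon and the number used after "ext" can be sketched as below. It assumes that the new ATG is in frame with the original reading frame and that upstream codons are counted without a position 0, so that c.-3 to c.-1 encode amino acid -1. The function names are hypothetical.

```python
def upstream_codon_number(nucleotide: int) -> int:
    """Map a 5' UTR nucleotide (c.-1, c.-2, ...) to its in-frame codon number."""
    if nucleotide >= 0:
        raise ValueError("only positions upstream of the ATG (negative) apply")
    # Floor division groups c.-3..c.-1 as codon -1, c.-6..c.-4 as codon -2, etc.
    return nucleotide // 3


def describe_new_start(nucleotide: int) -> str:
    """Protein-level description of a new upstream translation initiation."""
    return f"p.Met1ext{upstream_codon_number(nucleotide)}"


print(describe_new_start(-14))  # the IFITM5 example, c.-14C>T
```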

Frame shift variants

NOTE: description clarified with the help of Raymond Dalgleish (Leicester, UK).

The recommendation is to designate frame shifting variants by "fs". It is not useful to add much detail in the description of frame shifting variants besides (especially in the case of C-terminal variants) the length of the new, shifted reading frame. Two notations can be used to describe frame shift changes, a short or a long form.

Short description
"fs" after a description of the first amino acid affected by the change.

p.Arg97fs (alternative p.R97fs) denotes a frame shifting change with Arginine-97 as the first affected amino acid

Long description
"fs*#" after a description of the amino acid(s) affected by the change and the change occurring at the site of the frame shift. "*#" indicates at which codon position the new reading frame ends in a stop (*). The position of the stop in the new reading frame is calculated starting at the first amino acid that is changed by the frame shift, and ending at the first stop codon (*#).
NOTE: the shifted reading frame is thus open for '#-1' amino acids.

p.Arg97Profs*23 (short p.Arg97fs) denotes a frame shifting change with Arginine-97 as the first affected amino acid, changing into a Proline and the new reading frame ending in a stop at position 23

NOTE: the description at protein level does not relate to the change at DNA-level. So a 1 nucleotide deletion (or duplication / insertion) as well as a 100 nucleotide deletion may have the same description at protein level, like p.Arg97fs. The description may differ in the "long description" when the first encoded amino acid in the shifted reading frame differs (e.g. p.Arg97Profs*23, p.Arg97Serfs*23, etc.)

Please note that the frame shift example given in den Dunnen & Antonarakis, 2000, Hum.Mut. 15: 7-12 contains a mistake; p.R97fs*121 (page 11) should be "p.R97fs*25, indicating a frame-shifting change with Arginine-97 as the first affected amino acid and the new reading frame being open for 24 amino acids".
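
The counting rule for "fs*#" can be sketched with a small function that compares a reference protein with the translation of the shifted reading frame. This is an illustration only: the input sequences are hypothetical, one-letter input is assumed, the shifted translation must contain "*" for the new stop, and the function name is not part of the recommendations.

```python
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def describe_frameshift(reference: str, shifted: str, long_form: bool = True) -> str:
    """Describe a frame shift; '*#' counts from the first changed amino acid."""
    for index, (ref_aa, new_aa) in enumerate(zip(reference, shifted)):
        if ref_aa != new_aa:
            break
    else:
        raise ValueError("no differing amino acid found")
    if new_aa == "*":
        raise ValueError("first change is a stop codon, not a frame shift")
    position = index + 1
    if not long_form:
        return f"p.{THREE_LETTER[ref_aa]}{position}fs"
    stop = shifted.index("*", index)
    length = stop - index + 1  # the first changed amino acid counts as 1
    return f"p.{THREE_LETTER[ref_aa]}{position}{THREE_LETTER[new_aa]}fs*{length}"


# Hypothetical sequences: Arg3 is the first affected residue, the new frame
# reads Pro-Gly-Ser and then stops, so it is open for 3 (= 4 - 1) amino acids.
print(describe_frameshift("MARTLQ", "MAPGS*"))
print(describe_frameshift("MARTLQ", "MAPGS*", long_form=False))
```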

New recommendations

Alternative transcripts (Reference Sequence)

As discussed, in some cases it is very difficult to assign a sequence which can be used as a good reference for numbering. When a coding DNA reference sequence is used it should represent the major transcript of the gene. Alternatively spliced exons (5'-first, internal or 3'-terminal) derived from within the gene can then best be numbered as for intronic sequences. Description of variants in transcripts initiating or terminating outside this region is more difficult. The suggestion is to describe these as usual but to precede them with a unique identifier of the alternative transcript and a ":"-character, like c.Dp427c:3G>T. The alternative transcript should be precisely described and refer to a specific database record (GenBank, EMBL, DDBJ), the accession number of which should be provided.

For example, for the DMD-gene, involved in Duchenne Muscular Dystrophy (DMD), the major transcript is that found in muscle, indicated with Dp427m. Other transcripts are initiated from within the Dp427m gene, e.g. that found in Purkinje cells (Dp427p) and in retina (Dp260), but these are all considered "alternative transcripts". Variants in the respective promoter / exon 1 region can thus be described as intronic sequences in relation to the Dp427m coding DNA sequence. However, the brain promoter / exon 1 lies 5' of the Dp427m promoter. Thus, a variant in this region should be described using the format c.Dp427c:3G>T. Note that this transcript encodes a new translation initiation site and that the numbering used starts with nucleotide +1 for the A of the ATG-translation initiation codon of the Dp427c-transcript.

Accepted sequence indicators

For clarity reasons, e.g. to prevent confusion when in one manuscript variants in relation to different reference sequences are described, it is recommended to use unique sequence indicators as part of the description of each variant. It should be noted however that for every indicator the respective reference sequence used should always be mentioned. Unique indicator and sequence description should be separated by a colon (":") (see Recommendations).
Examples;

NM_004006.1:c.3G>T - uses a GenBank file as indicator
GJB2:c.76A>C - uses a HGNC-approved gene symbol as indicator
NM_004006.1(DMD):c.3G>T - uses both a GenBank file and a HGNC-approved gene symbol as indicator
chrX:g.32,218,983_32,984,039del - uses a chromosome indicator (here X)
NOTE: the chromosome build used should always be mentioned (e.g. NCBI Build 36.1)
rs2306220:A>G - using a dbSNP-identifier as indicator
DXS1219:g.CA[18] (or AFM297yd1:g.CA[18]) - uses marker DXS1219 / AFM297yd1 as indicator
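
Separating the indicator from the description at the colon can be sketched as below. The classification is a rough heuristic based only on the example indicators listed above; the patterns, category names and function names are assumptions, not an official grammar.

```python
import re


def split_description(description: str):
    """Split 'indicator:description' at the first colon."""
    indicator, separator, variant = description.partition(":")
    if not separator or not indicator or not variant:
        raise ValueError(f"no sequence indicator found in {description!r}")
    return indicator, variant


def indicator_kind(indicator: str) -> str:
    """Guess the kind of indicator from its shape (heuristic only)."""
    if re.match(r"^rs\d+$", indicator):
        return "dbSNP"
    if re.match(r"^chr(\d{1,2}|X|Y)$", indicator):
        return "chromosome"
    if re.match(r"^[A-Z]{1,2}_?\d+\.\d+(\([A-Za-z0-9-]+\))?$", indicator):
        return "sequence file"
    return "gene symbol or marker"


for example in ["NM_004006.1(DMD):c.3G>T", "GJB2:c.76A>C", "rs2306220:A>G"]:
    indicator, variant = split_description(example)
    print(indicator_kind(indicator), indicator, variant)
```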

Single Nucleotide Polymorphisms (SNP's)

Publications reporting linkage or association studies often use a range of different markers/SNP's. Such publications should contain an unequivocal description of all markers used. An easy way to achieve this is to include in the description a direct, unequivocal reference to the reference sequence used (preferably a GenBank or dbSNP record).
Examples;

NM_004006.1:c.3G>T - uses a GenBank file as reference sequence (coding DNA)
GJB2:c.76A>C - uses a HGNC-approved gene symbol as reference. NOTE: the manuscript should list the GenBank file of the reference sequence of the GJB2 gene (here the coding DNA)
rs2306220:A>G - using a dbSNP-identifier as a reference
NOTE: descriptions like dbSNP2306220:A>G should not be used; they are not unequivocal since it is unknown whether rs2306220 or ss2306220 is meant.
DXS1219:g.CA[18];[21] (or AFM297yd1:g.CA[18];[21]) - uses marker DXS1219 (AFM297yd1) as reference to describe the length of the two CA-repeats. NOTE: although descriptions of marker DXS1219 can be found at several places (e.g. GDB, GenBank), the manuscript should list the GenBank file containing the reference sequence of the marker.
variants in the promoter region (see FAQ) - it is recommended to describe this variant in relation to a genomic reference sequence (like L01538.1:g.1407C>T). Describing a promoter variant in relation to a coding DNA reference sequence (i.e. in relation to the A of the ATG initiation codon) is possible but not very informative. In such cases, next to the coding DNA reference sequence also the genomic reference sequence used should be given (see Discussion). A suggestion is to describe the change as "L01538.1:g.1407C>T (at -401 of the ATG)".

Homo/heterozygotes

Regarding SNP's and their use in the text of papers, Peter Taschner (LUMC, Leiden, the Netherlands) makes the following remark;
most recommendations for sequence variant nomenclature apply to genotype descriptions in tables. Unfortunately, these are not very useful in the general text of a paper. For instance, the OPRM1:c.118A>G or dbSNP1799971:A>G designation can be used to describe the sequence variant, but in a paper you might like to discuss the phenotypic consequences of different genotypes. In fact the current recommendation is to use OPRM1:c.[118A>G];[=] to describe a heterozygote and [=];[=] and OPRM1:c.[118A>G];[118A>G] for the homozygotes. I would like to suggest to describe the genotypes in the text like;

OPRM1:c.118AA homozygotes

OPRM1:c.118GA heterozygotes

OPRM1:c.118GG homozygotes

The different alleles could then be designated as the OPRM1:c.118A allele and the OPRM1:c.118G allele. In combination with variants of other genes, the genotype descriptions could be OPRM1:c.118AA, GJB2:c.76AC double heterozygotes, etc.

Haplotypes

Haplotypes are a special form of two or more variants in one chromosome (see Recommendations DNA changes). Once the order of the variants and the reference sequences used have been clearly described (e.g. in the Materials & Methods), a rather simple description of a haplotype can be used. Descriptions using "[]" are of course only used for variants on one chromosome. Examples;

Haplotype with all variants in relation to one coding DNA reference sequence
- description of the reference haplotype; NM_004006.1:c.[837G>A; 1704+51T>C; 3734C>T; 6438+2669T(16_23); 6571C>T; 7098+13212GT(15_19)]
- description haplotype; [G;C;C;18;T;17]
Haplotype with all variants in relation to several reference sequences, both genomic and coding DNA
- description of the reference haplotype; [M59228.1:g.250G>C; AF209160.1:g.572CA; Z11861.1:g.61T>C; Z16803.1:g.114A
- description haplotype; [C;13;T;21]

Translocations

For the description of translocations the format "t(X;4)(p21.2;q34)", suggested originally by the ISCN (1985), is already used as a standard.
NOTE: current recommendations in this area are made by the "Standing Committee on Human Cytogenetic Nomenclature" and were published recently as "ISCN 2013".

For a description at the molecular level this notation can be followed, extended with the standard description indicating the exact translocation breakpoint. When due to local similarities the exact breakpoint is uncertain, following standard nomenclature rules, it will be arbitrarily assigned to the most 3' nucleotide. Since the translocation breakpoints can have a complex structure and since it involves two different chromosomal locations, the sequences of the two translocation breakpoints should always be submitted to a sequence database (GenBank, EMBL, DDBJ). The accession numbers of these files should be listed in the report.

Next to the exact location of the translocation breakpoint, its molecular characterisation will yield more details including e.g. deletions/duplications at the junction and the sequence joined, derived from the other chromosome. We believe that a description covering all these details will become too complex. However, when one wants to include these details, the first description should be for the translocated 5' segment of the gene, the second for the translocated 3' segment, separated by a ";"-character. It should also be noted that when a translocation joins genes A and B, the description of the breakpoint in the sequence variation database of gene A is different from that in gene B. The major difference is that the nucleotide numbering is based on that of gene A or gene B respectively.

t(X;4)(p21.2;q35)857+101_857+102 denotes a translocation breakpoint located in an intron, between nucleotides 857+101 and 857+102, and joining chromosome bands Xp21.2 and 4q35
t(X;4)(p21.2;q35)IVS7 denotes a translocation breakpoint in intron 7, joining chromosome bands Xp21.2 and 4q35

Fused genes

Due to (large) deletions, translocations or inversions, genetic rearrangements may have one breakpoint far from the gene under study. The breakpoint might lie in 'empty' intergenic sequences or in another gene. Consequently, to describe the breakpoint at a molecular level two Reference Sequences will be required. To describe cases like this, no recommendations have been made yet.
Recommendation: for the breakpoint residing in the gene under study, nucleotide numbering is clear and follows the standard. When the breakpoint lies in another gene, nucleotide numbering for that end should be based on the nucleotide numbering for that gene (accession.version number of the Reference Sequence used should be provided). To indicate that the end lies in another gene, the nucleotide number should be preceded with the gene's official Gene Symbol, like GJB2:c.233. When the breakpoint does not reside in another gene, the accession number of the Reference Sequence will be used instead of the official Gene Symbol, like AC012343.2:g.763 (please note that this is always a genomic Reference Sequence). When the breakpoint ends on the opposite strand (reverse, complementary, non-transcribed or anti-sense strand) of a gene or on the opposite strand of an intergenic sequence, an "o" will precede the official Gene Symbol (like oGJB2:c.233). Please note that the use of a "c" (complementary), "a" (anti-sense) or "r" (reverse) might cause confusion with nucleotides C and A or the "r" indicating description of a change on RNA-level.

c.1431_oXYZ:c.234-15del denotes a deletion starting at position 1431 of the gene analysed, ending on the opposite translational strand in an intron of the XYZ-gene (accession number to be provided)
c.1431_AC012343.2:g.23515del denotes a deletion starting at position 1431 of the gene analysed, ending at position 23,515 of the intergenic sequence reported with accession number AC012343.2

One or three letter amino acid code

Discussions regarding the use of either the one- or three-letter amino acid code to describe variants at protein level are ongoing. Basically, descriptions using the one-letter amino acid code are unequivocal, short and thus preferred. However, since the one-letter amino acid code is not obvious (Ala, Arg, Asn and Asp start with A; Gln, Glu and Gly with G; Leu and Lys with L; Phe and Pro with P; Thr and Tyr with T), publications often contain mistakes when the one-letter code is used. In addition, the '*' is not only used to indicate a stop codon (translation termination) but also to indicate unknown residues. Consequently, to prevent mistakes, we favour the use of the three-letter amino acid code.
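To show how error-prone the one-letter code can be handled mechanically, the sketch below converts a simple one-letter substitution into the three-letter form. The function name to_three_letter is hypothetical, and the sketch assumes a plain substitution of one standard amino acid by another; stop codons, unknown residues and more complex changes are deliberately left out.

```python
import re

# Standard one-letter to three-letter amino acid codes.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

# Assumption: only simple substitutions such as "R22S" or "p.R22S".
_SUBSTITUTION = re.compile(r"^(?:p\.)?([A-Z])(\d+)([A-Z])$")


def to_three_letter(description: str) -> str:
    """Rewrite a one-letter substitution in the three-letter code."""
    m = _SUBSTITUTION.match(description)
    if m is None:
        raise ValueError(f"not a simple substitution: {description!r}")
    ref, position, alt = m.groups()
    for letter in (ref, alt):
        if letter not in ONE_TO_THREE:
            raise ValueError(f"not a standard amino acid code: {letter!r}")
    return f"p.{ONE_TO_THREE[ref]}{position}{ONE_TO_THREE[alt]}"


if __name__ == "__main__":
    print(to_three_letter("R22S"))
    print(to_three_letter("p.M1V"))
```

Letters that are not one of the 20 standard codes are rejected rather than guessed, which is the kind of mistake the three-letter code is meant to prevent.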

Translation initiation codon changes

Currently, variants in the translation-initiating Methionine (M1) are usually described as a substitution, e.g. p.Met1Val. This is not correct: either no protein is produced (p.0) or a new translation initiation site upstream or downstream is used (e.g. p.Met1ValextMet-12 or p.Met1_Lys45del, respectively). Unless experimental proof is available, it is probably best to report the effect at protein level as "p.Met1?" (unknown). When experimental data show that no protein is made, the description "p.0" is recommended (see Examples).

Protein description between brackets

Usually, descriptions at protein level have no experimental proof, i.e. they are predictions only, deduced directly from the DNA sequence. However, when RNA has been analysed, and (unexpected) effects on RNA processing can be excluded, the predicted protein change will usually be correct. Similarly, the variant protein may have been detected using immuno-histochemistry or on Western blot. To indicate whether there is any experimental evidence for a protein description, it is recommended that when neither RNA nor protein has been analysed, the description is given between brackets (e.g. p.(Arg22Ser)).

added 2012-10-12
Question; (Richard Barber) when the nucleotide change is common and well characterised at RNA and protein level, such as CFTR:p.Phe508del, there seems no need to use a description with brackets.

Answer; Agreed. However, please check carefully that such evidence is indeed available and do not fall into the trap of "transitive proof", i.e. reports that only refer to another source for experimental evidence without giving any themselves.
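The bracket rule can be sketched as a small formatting helper. The function name protein_description and its two flags are hypothetical; the flag rna_analysed is a simplification that assumes effects on RNA processing have been excluded.

```python
def protein_description(change: str,
                        rna_analysed: bool = False,
                        protein_analysed: bool = False) -> str:
    """Return the protein-level description, bracketed when it is a prediction only.

    Assumption: rna_analysed means RNA was analysed AND effects on RNA
    processing were excluded; protein_analysed means the variant protein
    was detected experimentally.
    """
    core = change
    if core.startswith("p."):
        core = core[2:]
    # Strip existing brackets so the evidence flags decide the final form.
    if core.startswith("(") and core.endswith(")"):
        core = core[1:-1]
    if rna_analysed or protein_analysed:
        return f"p.{core}"
    return f"p.({core})"


if __name__ == "__main__":
    print(protein_description("Arg22Ser"))
    print(protein_description("p.Phe508del", protein_analysed=True))
```

Whether the evidence really exists still has to be checked by the person setting the flags; the helper only makes the notation follow from that decision.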

Gene conversions

A gene conversion is a nonreciprocal transfer of genetic information between two homologous sequences. As a result of a gene conversion, the sequence of (part of) a gene can be copied from a highly similar sequence residing elsewhere in the genome. Usually, the converted segment contains a range of sequence changes, making its description rather complex. In such cases it is recommended to use a specific description using the format "region_changed" con "region of origin". Please note that the general rule also applies here: the most 3' position possible is arbitrarily assigned as the first to have been changed.

Examples

c.15_355conNM_004006.1:c.15_355 - indicates that nucleotides c.15 to c.355 of the coding DNA sequence of the transcript of interest were converted to nucleotides c.15 to c.355 from a transcript sequence as present in GenBank file NM_004006 (version 1)
g.415_1655conAC096506.5:g.409_1683 - indicates that nucleotides g.415 to g.1655 of the genomic sequence of the gene of interest were converted to nucleotides g.409 to g.1683 from a genomic sequence as present in GenBank file AC096506 (version 5).
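The "region_changed" con "region of origin" format used in these examples can be taken apart as sketched below. The names Conversion and parse_conversion are hypothetical, and the sketch assumes simple positive positions on both sides (no intronic offsets, no positions upstream of the ATG or downstream of the stop codon).

```python
import re
from typing import NamedTuple


class Conversion(NamedTuple):
    target_type: str   # "c" or "g", for the sequence of interest
    target_start: int
    target_end: int
    donor: str         # accession.version of the sequence of origin
    donor_type: str
    donor_start: int
    donor_end: int

    @property
    def target_length(self) -> int:
        return self.target_end - self.target_start + 1

    @property
    def donor_length(self) -> int:
        return self.donor_end - self.donor_start + 1


# Assumption: plain numeric positions only.
_CON = re.compile(
    r"^(?P<ttype>[cg])\.(?P<tstart>\d+)_(?P<tend>\d+)"
    r"con"
    r"(?P<donor>[A-Za-z0-9_.]+):(?P<dtype>[cg])\.(?P<dstart>\d+)_(?P<dend>\d+)$"
)


def parse_conversion(description: str) -> Conversion:
    """Parse e.g. 'g.415_1655conAC096506.5:g.409_1683'."""
    m = _CON.match(description)
    if m is None:
        raise ValueError(f"not a simple gene conversion: {description!r}")
    return Conversion(
        m.group("ttype"), int(m.group("tstart")), int(m.group("tend")),
        m.group("donor"),
        m.group("dtype"), int(m.group("dstart")), int(m.group("dend")),
    )


if __name__ == "__main__":
    conv = parse_conversion("g.415_1655conAC096506.5:g.409_1683")
    print(conv, conv.target_length, conv.donor_length)
```

As the second example shows, the changed region and the region of origin do not need to have the same length.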
