THE HUGO MUTATION DATABASE INITIATIVE (MDI)

Baltimore Oct. 27-28, 1997

Meeting Report

Charles R. Scriver
Piotr M. Nowacki
Richard G.H. Cotton

Science is an assault on ignorance, and its legacies are arrays of concepts, databases and technologies. Databases of mutations in human genomes, for example, are a legacy of research in human genetics and genomics. Mutations are a record of genetic variation and they form an interface between genetics and genomics. Genetic variation is an essential feature of living organisms, and to record the corresponding information is a normal, useful and cultural activity. Over 80 participants from Europe, North America, Japan, and Australia attended the 4th meeting on the theme of mutation databases (1); speakers and presenters are indicated in brackets below.

Gert-Jan van Ommen (President, HUGO) opened with a brief overview of HUGO and progress in the International Genome Project. His own interest in Duchenne Muscular Dystrophy has generated a locus-specific database of DMD mutations and relationships, which receives 500 hits per month. The obvious usefulness of this and analogous resources was a recurring theme during the meeting.

The HUGO Mutation Database Initiative (MDI) is addressing three important issues:

i) Nomenclature: Taxonomy is a classical theme in biology, and the naming of mutations is a new challenge that requires a systematic approach. Recommendations and guidelines are forthcoming (Antonarakis et al., in press) for the naming of the simpler mutation types; those for describing complex mutations are still evolving. Whereas mutation names are only descriptors (attributes) of objects (entities) in the context of bioinformatics, a unique identifier (another attribute) can and should be assigned to the nucleotide change. Disease-causing mutations attract attention but polymorphic alleles should also be recorded; the definition of a polymorphism elicited intense discussion, a definition being one thing, biological significance another.

ii) Protection and integrity of information and intellectual property: Biomedical information in databases is in cyberspace (Wallace). Print publications have fixed rules for copyright and protection; electronic "publications" (in databases) have no corresponding rules. Databases in the public domain contain entities that reflect work by individuals; how is credit assigned and the intellectual property respected? Persons in other fields of informatics (De Riet in Data and Knowledge Engineering 24:69, 1997) have been addressing problems such as masquerading, unauthorized use, disclosure and alterations, acknowledgements, denial of service, and auditing/accounting - all issues relevant to mutation databases. Databases can be protected by keeping a copy offline (the template approach) with date-stamped copies put on-line. Long-term concerns for database integrity include continuity of curatorial functions and funding to maintain and support databases; quality of databases (documentation of development and maintenance; flexibility, integrity and regulation of use) is also an abiding issue.

iii) Is a mutation real? Mutation reports can be validated against a set of standards (Cotton), such as: proof of state on a second PCR product, segregation in family and with trait, frequency of occurrence on 100 "normal" chromosomes, evolutionary significance of the codon involved, likelihood that the mutation (by type) has a phenotypic effect, and expression analysis in vitro (for missense mutations).
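
The validation standards in (iii) lend themselves to a simple checklist attached to each mutation report. The following is a minimal sketch in Python, not a description of any database presented at the meeting: the class name, the field names, and the idea of reporting a count of satisfied criteria are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ValidationEvidence:
    """Hypothetical checklist; each field mirrors one criterion listed in (iii)."""
    confirmed_on_second_pcr_product: bool = False
    segregates_in_family_and_with_trait: bool = False
    frequency_checked_on_100_normal_chromosomes: bool = False
    codon_has_evolutionary_significance: bool = False
    mutation_type_likely_to_affect_phenotype: bool = False
    in_vitro_expression_analysis_done: bool = False  # relevant for missense changes


def criteria_met(evidence: ValidationEvidence) -> List[str]:
    """Return the names of the criteria that are satisfied, in declaration order."""
    return [name for name, satisfied in vars(evidence).items() if satisfied]


def summary(evidence: ValidationEvidence) -> str:
    """Summarize how many of the listed criteria a report satisfies."""
    total = len(vars(evidence))
    return f"{len(criteria_met(evidence))}/{total} criteria met"


if __name__ == "__main__":
    report = ValidationEvidence(
        confirmed_on_second_pcr_product=True,
        segregates_in_family_and_with_trait=True,
    )
    print(summary(report))
```

How many criteria must be met before a mutation is accepted as real is a curatorial decision that the sketch deliberately leaves open.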

Database design begins in a biological context and then moves into the field of informatics (Lehvaslaiho). A centralized approach to the design of mutation databases, as envisioned at the European Bioinformatics Institute (Ashburner), has relevance because locus-specific databases as yet have no common mechanism to distribute information, no shared format, and non-standard contents. Nonetheless the bottom-up approach in the latter is workable and is more flexible than the top-down approach inherent in omnifarious genomic databases. The issue becomes less important as guidelines for universal nomenclature and agreements on the entities comprising core content take hold, and when mirror templates of databases become accessible through search engines (Etzold). An obvious strength of locus-specific databases lies in their variety of content and format; they serve the particular interests of the persons (or consortium) working at the locus. When locus-specific databases can distribute their information efficiently, or make it universally accessible to search engines, the issue of standardization is partly resolved. Central and locus databases can be linked for mutual long-term support and security. Whereas mutation information has hierarchy, there can be agreement about the entities and attributes (descriptors) that are core and shared between genomic and locus-specific databases and those particular to the latter class of databases. Meantime a register of databases and curators would be useful and could be located at HUGO.
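
The distinction between core attributes shared by all databases and attributes particular to a locus-specific database can be sketched as a small data structure. This is an illustrative sketch only: the class names, field names, and the sample values (other than the PAH gene symbol and GenBank accession U49897, which appear in this report) are hypothetical placeholders, and no agreed core content is implied.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict


@dataclass(frozen=True)
class CoreMutationRecord:
    """Hypothetical core attributes shared by genomic and locus-specific databases."""
    uid: str                  # unique identifier assigned to the nucleotide change
    gene_symbol: str
    reference_sequence: str   # accession of the reference nucleotide sequence
    nucleotide_change: str
    mutation_name: str        # a descriptor only; the uid is what identifies the record


@dataclass
class LocusSpecificRecord:
    """Core record plus whatever extra descriptors a locus-specific curator keeps."""
    core: CoreMutationRecord
    extras: Dict[str, str] = field(default_factory=dict)


def export_core(record: LocusSpecificRecord) -> Dict[str, str]:
    """Return only the shared core attributes, e.g. for a central or mirror copy."""
    return asdict(record.core)


if __name__ == "__main__":
    record = LocusSpecificRecord(
        core=CoreMutationRecord(
            uid="EXAMPLE-0001",                 # placeholder value
            gene_symbol="PAH",
            reference_sequence="U49897",
            nucleotide_change="placeholder",    # placeholder value
            mutation_name="placeholder",        # placeholder value
        ),
        extras={"haplotype": "placeholder"},    # locus-specific descriptor, not exported
    )
    print(export_core(record))
```

Keeping the locus-specific descriptors in a separate mapping lets each consortium extend its own records freely while the exported core stays uniform.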

Metabolic Information Network (MIN). Persons are the ultimate "sources" of the naturally-occurring human mutations (disease-causing and neutral polymorphisms) being recorded in databases. MIN is a model register (Mize) of patients (entities), in this case those with inborn errors of metabolism (entities with attributes). A minimum data set was established in 1993 to document 87 different disease phenotypes; voluntary participation rates (physicians and patients) approach 97%; the MIN register contains 8400 independent cases, with an 80% follow-up record through contacts with 400 physicians. Since MIN and existing mutation databases contain mutually interesting entities, they could be linked, and locus-specific consortia could register patients with MIN. Several mutation databases already have "clinical pages" for interaction with patients; MIN is another strand in the network to collect and distribute the relevant information.

Locus-specific mutation databases flourish. The CFTR (Tsui) and PAH (Scriver and Nowacki) databases are established prototypes; each serves a large international consortium. Mitochondrial genome databases (Mitomap: Wallace; MitBASE: Attimonelli) are a different prototype - a combined locus-specific and genomic database. These and other locus-specific databases are driven by biological and medical interests, contain hundreds of mutations at their respective loci, are highly informative about some aspect of human population genetic variation, are curated and, up to now, have been maintained and operated in imaginative entrepreneurial fashion with minimal or no dedicated funding. The databases are generally relational in design although some are object relational (e.g. Mitomap) and all are keyed to nucleotide sequences. PAHdb uses Visual FoxPro as its database management system (DBMS); Mitomap developed an internal DBMS. In their relational design these databases can accommodate large amounts of information (entities and attributes). Curatorial function maintains quality.

Five different X-Linked Immunodeficiency diseases have databases (Vihinen), each formatted on large collaborative registers of patients and driven by consortia of investigators (40 persons for BTK); the five databases account for one tenth of disease-causing human mutations now documented at EBI; they are linked to OMIM. Many locus-specific databases were reported on posters: PAX6 (Brown), AR for the Androgen Receptor (Gottlieb), PIG-A (Nafa), α-mannosidase (Riise), LDLR (Varret), BIODEF-BIOMDB for tetrahydrobiopterin deficiencies at 4 different loci (Blau), FBN1 for Marfan (Collod-Beroud), Collagen Type 4 for Alport syndrome (Gubler), FAA and FAC for Fanconi anemia (Verlander), Globin Gene Server for globin mutations (Hardison), T-cell receptor (Lefranc) and GENATLAS, a phenotype-based database of diseases, genes and markers (Frezal).

Mutation View (Shimizu), a database derived from several locus-specific mutation databases (e.g. PAH, CFTR, and p53), has a common user interface formatted for viewing in real-time on the client server; it is currently written for UNIX but will be in HTML for web deployment. Its editorial function logically belongs with the curators of the imported locus-specific databases; it does not currently interact with curators, and simply takes content and reformats information. Universal software (ACI-4D) is used to record mutations in databases for APC, COL4A5, FBN1, LDLR, P53, RB, VHL, and WT1 (Beroud); version 6 remodels data reports, and allows interrogation, graphical displays and cross-database comparisons. Flexible search engines for locus-specific databases permit decentralization (Etzold). Dominance of the locus-specific "parts" is preferable to dominance of the genomic "center", and SRS software can be the mediator. However, to function, SRS requires standardized mutation nomenclature and some agreement on structure and core content of databases; it will provide restricted access and read-only operation in a network of locus-specific databases where curators do the primary work and place date-stamped copies on the web.

On-line Mendelian Inheritance in Man (OMIM) (McKusick) is the "Old Testament" of human genetic databases; it is a "genomic" type of database with "locus-specific" components. It provides full-text catalogues of phenotypes (entities with 6-digit UIDs) to which mutation information can be appended, and pointers to corresponding locus-specific mutation databases; 708 loci in OMIM record at least one mutant allele (as of Oct. 27, 1997). Whereas OMIM is a primary record of genetic diseases (phenotypes) in the "Mendelian/Garrodian" model, it is fast becoming a resource for the interpretation of complex genetic disease in the "Galtonian/Fisherian" model. At one end of a spectrum of loci and alleles, 4231 loci contain one or more alleles responsible for a single corresponding phenotype; at the other end, there is one locus which harbours mutations causing 9 different and discrete phenotypes. Taken together, 590 loci are responsible for 927 different phenotypes. Each locus in OMIM has a formal name (and gene symbol) and a pointer to the corresponding reference nucleotide sequences and mutation database (e.g. OMIM 261600 (phenylketonuria) points to GenBank U49897 and http://www.mcgill.ca/pahdb).

The Human Gene Mutation Database (HGMD, Cardiff) originated in a series of meta-analyses of human mutation types (Cooper). It lists curated information from MedLine, journal reports, locus-specific databases, and personal communications. Mutations are classified in ten groups by type, one page per gene per mutation type. HGMD has information on 640 genes (460 annotated reference sequences) harbouring over 12000 mutations and it provides MedLine references (via Entrez). If there were locus-specific databases today only for genes known to harbour 25 mutations or more, they would contain only 46% of the currently known mutations and they would cover only 7% of the genes listed in HGMD; hence the need for a genomic database like HGMD to record the many genes with few mutations. However, HGMD and other genomic databases may find it difficult to maintain the dense repertoire of descriptors and other information found in large locus-specific databases. Meantime, OMIM and HGMD function well as directories of locus-specific databases.
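
The coverage argument above can be made concrete with a little arithmetic. The sketch below uses only the figures quoted in this report (640 genes, 12000 mutations taken as a round lower bound for "over 12000", and the 46% and 7% shares); the variable names are illustrative and the results are approximate.

```python
# Figures quoted in the report; "over 12000" is treated here as exactly 12000.
GENES_IN_HGMD = 640
MUTATIONS_IN_HGMD = 12000
PERCENT_OF_MUTATIONS_COVERED = 46   # share held by genes with 25 or more mutations
PERCENT_OF_GENES_COVERED = 7        # share of genes with 25 or more mutations

# Genes and mutations that hypothetical locus-specific databases would cover.
covered_genes = round(GENES_IN_HGMD * PERCENT_OF_GENES_COVERED / 100)
covered_mutations = MUTATIONS_IN_HGMD * PERCENT_OF_MUTATIONS_COVERED // 100

# What would be left for a genomic database such as HGMD to record.
remaining_genes = GENES_IN_HGMD - covered_genes
remaining_mutations = MUTATIONS_IN_HGMD - covered_mutations

if __name__ == "__main__":
    print(f"covered: about {covered_genes} genes, {covered_mutations} mutations")
    print(f"remaining: about {remaining_genes} genes, {remaining_mutations} mutations")
```

On these rounded figures, roughly 45 genes would carry about 5500 mutations, leaving close to 600 genes and about 6500 mutations outside any locus-specific database.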

Bioinformatics is a necessary resource in genomics. The National Center for Biotechnology Information (NCBI) at NIH (Ostell) aligns genomic data (markers and loci), provides computed relationships, and combines the information in chromosomal, genetic and physical maps with the genomic (or cDNA) nucleotide sequence. NCBI maintains GenBank and its annotated reference nucleotide sequences (5-digit accession numbers). Information is linked via Entrez to PubMed and Medline UIDs. Information on genomic variation is stored at the Genome Data Base (GDB), which does not systematically collect mutational data (Cottingham); a new collaborative project between HGMD and GDB is being funded to record mutations.

The power of software to interrogate the information content in wildtype and mutant nucleotide sequences was revealed by an analysis of splice-site modifying mutations (Rogan). All mutations occur in a context of flanking nucleotide sequence; an algorithm has been developed to quantitate information content in the sequence context; whether a missense mutation, for example, would or would not generate a new splice site can be predicted.
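
The report does not give the algorithm itself. As a hedged illustration of the general idea, the sketch below scores a short sequence by an information-theoretic measure often used for binding and splice sites: each position contributes 2 + log2 f(b, l) bits, where f(b, l) is the frequency of base b at position l among known sites. The choice of this formula, the omission of any small-sample correction, the toy 4-position frequency table, and all function names are assumptions for illustration, not the method presented at the meeting.

```python
import math
from typing import Dict, List

# Hypothetical base frequencies at each position of a toy 4-nucleotide site.
TOY_FREQUENCIES: List[Dict[str, float]] = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # uninformative position
]


def individual_information(seq: str, freqs: List[Dict[str, float]] = TOY_FREQUENCIES) -> float:
    """Sum of 2 + log2 f(b, l) over positions, in bits (no small-sample correction)."""
    if len(seq) != len(freqs):
        raise ValueError("sequence length must match the number of positions")
    return sum(2.0 + math.log2(position[base]) for base, position in zip(seq, freqs))


def information_change(wildtype: str, mutant: str,
                       freqs: List[Dict[str, float]] = TOY_FREQUENCIES) -> float:
    """Bits gained (positive) or lost (negative) when wildtype is replaced by mutant."""
    return individual_information(mutant, freqs) - individual_information(wildtype, freqs)


if __name__ == "__main__":
    # A substitution toward the preferred base at position 2 raises the score.
    print(information_change("ACTA", "AGTA"))
```

Under this kind of scoring, a substitution that raises the score of a non-site sequence toward that of known sites suggests a possible new site, while one that lowers the score of a natural site suggests weakening; any threshold for calling a site would have to be calibrated on real data.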

Envoi: Whereas mutation databases are now firmly established as resources for genetics and genomics, problems remain to be addressed (Cotton), among them: nomenclature of complex mutations, guidelines for content and structure of databases (both locus-specific and genomic), quality control of content, protection of intellectual property, copyright, credit and recognition for effort and input, and how to ensure longevity and funding of such resources. Current support of MDI by HUGO and The March of Dimes (USA) means that these issues will be addressed.

(1) To the program and list of participants