Gert-Jan van Ommen (President, HUGO) opened with a brief overview of HUGO and progress in the International Genome Project. His own interest in Duchenne Muscular Dystrophy has generated a locus-specific database of DMD mutations and relationships; it receives 500 hits per month, and the obvious usefulness of this and analogous resources was a recurring theme during the meeting.
The HUGO Mutation Database Initiative (MDI) is addressing three important issues:
ii) Protection and integrity of information and intellectual property: Biomedical information in databases is in cyberspace (Wallace). Print publications have fixed rules for copyright and protection; electronic "publications" (in databases) have no corresponding rules. Databases in the public domain contain entities that reflect work by individuals; how is credit assigned and the intellectual property respected? Persons in other fields of informatics (De Riet in Data and Knowledge Engineering 24:69, 1997) have been addressing problems such as masquerading, unauthorized use, disclosure and alterations, acknowledgements, denial of service, and auditing/accounting - all issues relevant to mutation databases. Databases can be protected by keeping a copy offline (the template approach) with date-stamped copies put on-line. Long-term concerns for database integrity include continuity of curatorial functions and funding to maintain and support databases; quality of databases (documentation of development and maintenance; flexibility, integrity and regulation of use) is also an abiding issue.
iii) Is a mutation real? Mutation reports can be validated against a set of standards (Cotton), such as: proof of state on a second PCR product, segregation in family and with trait, frequency of occurrence on 100 "normal" chromosomes, evolutionary significance of the codon involved, likelihood that the mutation (by type) has a phenotypic effect, and expression analysis in vitro (for missense mutations).
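The validation standards listed above can be recorded as a simple checklist attached to each mutation report. The following is a minimal sketch in Python, not a format used by any of the databases described here; the class name, field names and the idea of counting standards met are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ValidationEvidence:
    """One flag per validation standard; all field names are hypothetical."""
    confirmed_on_second_pcr_product: bool = False
    segregates_with_trait_in_family: bool = False
    # Assumption: absence from 100 "normal" chromosomes counts as support.
    not_found_on_100_normal_chromosomes: bool = False
    codon_evolutionarily_conserved: bool = False
    mutation_type_likely_to_affect_phenotype: bool = False
    in_vitro_expression_altered: bool = False


def criteria_met(evidence: ValidationEvidence) -> List[str]:
    """Return the names of the standards for which evidence is recorded."""
    return [f.name for f in fields(evidence) if getattr(evidence, f.name)]


def summarize(evidence: ValidationEvidence) -> str:
    """Report how many of the listed standards a mutation report satisfies."""
    return f"{len(criteria_met(evidence))}/{len(fields(evidence))} standards met"


report = ValidationEvidence(
    confirmed_on_second_pcr_product=True,
    segregates_with_trait_in_family=True,
)
print(summarize(report))
```

A curator would still have to weigh the standards individually; the count is only a convenience for spotting thinly supported reports.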
Metabolic Information Network (MIN). Persons are the ultimate "sources" of the naturally-occurring human mutations (disease-causing and neutral polymorphisms) being recorded in databases. MIN is a model register (Mize) of patients (entities), in this case those with inborn errors of metabolism (entities with attributes). A minimum data set was established in 1993 to document 87 different disease phenotypes; voluntary participation rates (physicians and patients) approach 97%; the MIN register contains 8400 independent cases, with an 80% follow-up record through contacts with 400 physicians. Since MIN and existing mutation databases contain mutually interesting entities, they could be linked, and locus-specific consortia could register patients with MIN. Several mutation databases already have "clinical pages" for interaction with patients; MIN is another strand in the network to collect and distribute the relevant information.
Locus-specific mutation databases flourish. The CFTR (Tsui) and PAH (Scriver and Nowacki) databases are established prototypes; each serves a large international consortium. Mitochondrial genome databases (Mitomap: Wallace; MitBASE: Attimonelli) are a different prototype - a combined locus-specific and genomic database. These and other locus-specific databases are driven by biological and medical interests, contain hundreds of mutations at their respective loci, are highly informative about some aspect of human population genetic variation, are curated and, up to now, have been maintained and operated in imaginative entrepreneurial fashion with minimal or no dedicated funding. The databases are generally relational in design although some are object relational (e.g. Mitomap) and all are keyed to nucleotide sequences. PAHdb uses Visual FoxPro as its database management system (DBMS); Mitomap developed an internal DBMS. In their relational design these databases can accommodate large amounts of information (entities and attributes). Curatorial function maintains quality.
Five different X-Linked Immunodeficiency diseases have databases (Vihinen), each formatted on large collaborative registers of patients and driven by consortia of investigators (40 persons for BTK); the five databases account for one tenth of disease-causing human mutations now documented at EBI; they are linked to OMIM. Many locus-specific databases were reported on posters: PAX6 (Brown), AR for the Androgen Receptor (Gottlieb), PIG-A (Nafa), α-mannosidase (Riise), LDLR (Varret), BIODEF-BIOMDB for tetrahydrobiopterin deficiencies at 4 different loci (Blau), FBN1 for Marfan (Collod-Beroud), Collagen Type 4 for Alport syndrome (Gubler), FAA and FAC for Fanconi anemia (Verlander), Globin Gene Server for globin mutations (Hardison), T-cell receptor (Lefranc) and GENATLAS, a phenotype-based database of diseases, genes and markers (Frezal).
Mutation View (Shimizu), a database derived from several locus-specific mutation databases (e.g. PAH, CFTR, and p53), has a common user interface formatted for viewing in real-time on the client server; it is currently written for UNIX but will be in HTML for web deployment. Its editorial function logically belongs with the curators of the imported locus-specific databases; it does not currently interact with curators, and simply takes content and reformats information. Universal software (ACI-4D) is used to record mutations in databases for APC, COL4A5, FBN1, LDLR, P53, RB, VHL, and WT1 (Beroud); version 6 remodels data reports, and allows interrogation, graphical displays and cross-database comparisons. Flexible search engines for locus-specific databases permit decentralization (Etzold). Dominance of the locus-specific "parts" is preferable to dominance of the genomic "center" and SRS software can be the mediator. However, to function, SRS requires standardized mutation nomenclature and some agreement on structure and core content of databases; it will provide restricted access and read-only operation in a network of locus-specific databases where curators do the primary work and place date-stamped copies on the web.
On-line Mendelian Inheritance in Man (OMIM) (McKusick) is the "Old Testament" of human genetic databases; it is a "genomic" type of database with "locus-specific" components. It provides full text catalogues of phenotypes (entities with 6 digit UIDs) to which mutation information can be appended and pointers to corresponding locus-specific mutation databases; 708 loci in OMIM record at least one mutant allele (as of Oct. 27/1997). Whereas OMIM is a primary record of genetic diseases (phenotypes) in the "Mendelian/Garrodian" model, it is fast becoming a resource for the interpretation of complex genetic disease in the "Galtonian/Fisherian" model. At one end of a spectrum of loci and alleles, 4231 loci contain one or more alleles responsible for a single corresponding phenotype; at the other end, there is one locus which harbours mutations causing 9 different and discrete phenotypes. Taken together, 590 loci are responsible for 927 different phenotypes. Each locus in OMIM has a formal name (and gene symbol) and a pointer to the corresponding reference nucleotide sequences and mutation database (e.g. OMIM 261600 (phenylketonuria) points to GenBank U49897 and http://www.mcgill.ca/pahdb).
The Human Gene Mutation Database (HGMD, Cardiff) originated in a series of meta-analyses of human mutation types (Cooper). It lists curated information from MedLine, journal reports, locus-specific databases, and personal communications. Mutations are classified in ten groups by type, one page per gene per mutation type. HGMD has information on 640 genes (460 annotated reference sequences) harbouring over 12000 mutations, and it provides MedLine references (via Entrez). If there were locus-specific databases today only for genes known to harbour 25 mutations or more, they would contain only 46% of the currently known mutations and they would cover only 7% of the genes listed in HGMD; hence the need for a genomic database like HGMD to record the many genes with few mutations. However, HGMD and other genomic databases may find it difficult to maintain the dense repertoire of descriptors and other information found in large locus-specific databases. In the meantime, OMIM and HGMD function well as directories of locus-specific databases.
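The coverage argument above (what share of mutations and of genes would be captured if only genes above a mutation-count threshold had locus-specific databases) is a simple calculation once per-gene counts are available. The sketch below shows the arithmetic only; the function name and the gene counts are invented for illustration and are not HGMD data, so the result is not expected to match the 46% and 7% figures reported.

```python
from typing import Dict, Tuple


def coverage_at_threshold(
    mutations_per_gene: Dict[str, int], threshold: int
) -> Tuple[float, float]:
    """Return (share of mutations, share of genes) found in genes that
    harbour at least `threshold` known mutations."""
    total_genes = len(mutations_per_gene)
    total_mutations = sum(mutations_per_gene.values())
    if total_genes == 0 or total_mutations == 0:
        raise ValueError("need at least one gene with at least one mutation")
    selected = [n for n in mutations_per_gene.values() if n >= threshold]
    return sum(selected) / total_mutations, len(selected) / total_genes


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
toy_counts = {"GENE_A": 300, "GENE_B": 40, "GENE_C": 12, "GENE_D": 5, "GENE_E": 3}
mutation_share, gene_share = coverage_at_threshold(toy_counts, threshold=25)
print(f"{mutation_share:.1%} of mutations, {gene_share:.1%} of genes")
```

Running the same function over a range of thresholds would show how quickly gene coverage falls as the threshold rises, which is the point made in favour of a genomic database.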
Bioinformatics is a necessary resource in genomics. The National Center for Biotechnology Information (NCBI) at NIH (Ostell) aligns genomic data (markers and loci), provides computed relationships, and combines the information in chromosomal, genetic and physical maps with the genomic (or cDNA) nucleotide sequence. NCBI maintains GenBank and its annotated reference nucleotide sequences (5 digit accession numbers). Information is linked via Entrez to PubMed and Medline UIDs. Information on genomic variation is stored at the Genome Data Base (GDB) which does not systematically collect mutational data (Cottingham); a new collaborative project between HGMD and GDB is being funded to record mutations.
The power of software to interrogate the information content in wildtype and mutant nucleotide sequences was revealed by an analysis of splice-site modifying mutations (Rogan). All mutations occur in a context of flanking nucleotide sequence; an algorithm has been developed to quantitate information content in the sequence context; whether a missense mutation, for example, would or would not generate a new splice site can be predicted.
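The report does not describe the algorithm itself. As an illustration of how information content in a sequence context can be quantified, the sketch below scores a short window against a table of base frequencies using a Shannon-style per-position measure, 2 + log2 f(base, position) bits, summed over positions. This formula is an assumption about the general approach, not a statement of the method presented; the frequency table is invented, and the small-sample correction used in published information-theory methods is omitted.

```python
import math
from typing import Dict, List

# Hypothetical base frequencies at three positions of a binding-site model.
TOY_FREQS: List[Dict[str, float]] = [
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.125, "C": 0.125, "G": 0.5, "T": 0.25},
    {"A": 0.125, "C": 0.125, "G": 0.25, "T": 0.5},
]


def information_bits(site: str, freqs: List[Dict[str, float]]) -> float:
    """Sum of 2 + log2 f(base, position) over all positions, in bits.

    Higher scores mean the window looks more like the modelled sites.
    """
    if len(site) != len(freqs):
        raise ValueError("site length must match the frequency table")
    total = 0.0
    for base, column in zip(site.upper(), freqs):
        frequency = column[base]
        if frequency <= 0:
            return float("-inf")  # base never observed at this position
        total += 2.0 + math.log2(frequency)
    return total


reference = information_bits("AAT", TOY_FREQS)  # window before the change
mutant = information_bits("AGT", TOY_FREQS)     # same window after A>G
print(f"change in information: {mutant - reference:+.1f} bits")
```

In this toy case the substitution raises the score of the window, which is the kind of change that would flag a candidate new site for further scrutiny.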
Envoi: Whereas mutation databases are now firmly established as resources for genetics and genomics, problems remain to be addressed (Cotton), among them: nomenclature of complex mutations, guidelines for content and structure of databases (both locus-specific and genomic), quality control of content, protection of intellectual property, copyright, credit and recognition for effort and input, and how to ensure longevity and funding of such resources. Current support of MDI by HUGO and The March of Dimes (USA) means that these issues will be addressed.