Posted Dec. 4th, 2000

HUGO MUTATION DATABASE MEETING
PHILADELPHIA, PA, U.S.A
OCTOBER 3, 2000
MEETING REPORT- full version- Reported by Rania Horaitis


Invited abstracts

Discussion

Resolutions


Invited abstracts

The 9th International HUGO Mutation Database Meeting was held in association with ASHG 2000 on the 3rd of October in Philadelphia, Pennsylvania, U.S.A. The city of brotherly love and home of the Liberty Bell provided an excellent setting for a successful meeting. Sixty-two registrants from ten countries attended, ten of them from nine companies. The morning session was devoted to invited abstracts, and the afternoon was set aside for discussion of the sustainability of the MDI and a proposal by industry.

Richard Cotton welcomed the delegates and reviewed the current state of the MDI before the presentations began.

The first speaker, Mauno Vihinen, an LSDB curator for the immunodeficiency databases from the University of Tampere in Finland, spoke about mutation data analysis with mobile devices such as phones. In 2001 there will be ~500 million mobile phone subscribers. The Wireless Application Protocol (WAP) is a standard used for the presentation and delivery of information on mobile phones. A package (BioWAP) has been developed that can provide access to all the major bioinformatics data on a mobile phone, independent of time and place.

Fifteen different analysis services are provided by BioWAP to study sequence, structure and mutation information. BioWAP may be used to search for specific mutation information from immunodeficiency mutation databases, providing patient-specific data. The service is accessible free of charge within Finland. People outside Finland can check the availability of free local WAP gateways at http://www.wapdrive.net to obtain cheaper calls. The BioWAP settings are available at http://www.uta.fi/imt/bioinfo/BioWAP/

Bruce Gottlieb, curator of the androgen receptor database at the Lady Davis Institute for Medical Research, Montreal, Canada, spoke on variable expressivity and mutation databases. The importance and value of mutation databases have been based on the premise that the same gene or allelic variation in a specific gene that has been proven to determine a specific phenotype will always produce the same phenotype. Recent evidence, however, suggests that Mendelian disorders or monogenic traits are often far from simple and exhibit variation in phenotype (variable expressivity) that cannot be explained by a gene or allelic change alone. Specific alterations in DNA sequence in specific genes could be caused by environmental factors, "modifying" genes, or cofactors (i.e. interacting proteins etc.). There are few examples of simple and direct correlations between genes.

Mutations in modifying genes alone could be expected to affect the phenotype, if it is assumed that the modifying genes are part of the normal gene expression pathway. When modifying genes have been identified, they have rarely been found to have such an effect. An example of such phenotypic variable expressivity is the Androgen Insensitivity Syndrome (AIS), which is caused by mutations in the Androgen Receptor (AR). It has always been assumed that different mutations are responsible for phenotypic variability. The reality is that 25 out of 200 mutations listed in the AR gene mutation database have produced different phenotypes. Possible clues to the reason for this are: (a) somatic, as opposed to germline, mutations of the AR gene have been associated with prostate cancer; (b) some genital skin disease tissue has appeared to contain two different types of androgen receptor.

Significantly, somatic mutations are important for databases: they can occur at any time in a lifetime and may occur in different cells. Timing is critical.

Tissues with specific AR gene mutations that appeared to show variable expressivity due to possible somatic mutations were examined. Genotype differences were observed: genital skin fibroblast (GSF) tissue contained both mutant and wild type AR genes, whereas blood leukocytes contained only the mutant gene that the patients had inherited from their mothers. Thus the patients exhibited somatic mosaicism, with mutant AR genes reverting to the wild type AR gene in their GSF.

Future considerations for mutation databases:

1. Realization that the genetic constitution of an organism can vary significantly over its lifetime.
2. Identification of somatic mutations as having a significant effect on phenotypic expression.
3. Identification of the importance of somatic mosaicism in variable expressivity.

Significance for mutation databases:

1. Genetic heterogeneity indicates we need to look at databases in a different way. We need to continuously examine a genome over a lifetime.
2. Sequences from different tissues need to be analysed. Databases need to reflect genetic differences in different tissues for each entry.
3. Genotype dynamism: the timing of somatic mutations may be critical, e.g. in foetal development.
4. The rate of somatic mutation may increase with age.

Suggestions for future databases:

1. Databases need to be able to differentiate between germline and somatic mutations.
2. All phenotype differences to be identified, however subtle.
3. Databases need to be updated over a period of time to reflect changes in the genotype due to somatic mutations.
4. Genetic heterogeneity needs to be accommodated within some databases, particularly those associated with somatic mutations in certain cancers.
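
The suggestions above could translate into a record layout along the following lines. This is only a minimal sketch in Python: the class and field names (MutationObservation, Origin, tissue, observed_on) are hypothetical and are not taken from any existing mutation database. It assumes that one record is stored per tissue and per sampling date, so that germline and somatic changes can be told apart and tissues can be compared.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Origin(Enum):
    GERMLINE = "germline"
    SOMATIC = "somatic"


@dataclass(frozen=True)
class MutationObservation:
    """One allele seen in one tissue at one point in time (hypothetical layout)."""
    gene_symbol: str
    allele: str          # free-text allele description, placeholder values only
    origin: Origin
    tissue: str
    observed_on: date
    phenotype: str


def tissues_differ(observations):
    """Return True if the tissues sampled do not all carry the same set of alleles."""
    alleles_by_tissue = {}
    for obs in observations:
        alleles_by_tissue.setdefault(obs.tissue, set()).add(obs.allele)
    distinct_sets = {frozenset(alleles) for alleles in alleles_by_tissue.values()}
    return len(distinct_sets) > 1


if __name__ == "__main__":
    sampled = date(2000, 10, 3)  # arbitrary illustrative date
    observations = [
        MutationObservation("AR", "mutant", Origin.GERMLINE, "blood leukocytes", sampled, "AIS"),
        MutationObservation("AR", "mutant", Origin.GERMLINE, "genital skin fibroblasts", sampled, "AIS"),
        MutationObservation("AR", "wild type", Origin.SOMATIC, "genital skin fibroblasts", sampled, "AIS"),
    ]
    print(tissues_differ(observations))
```

Keeping the tissue and the date on every record is one way of letting a database be updated over time without overwriting earlier observations.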

Saeed Teebi from the Dept. of Bioinformatics at the Hospital for Sick Children in Toronto, Canada presented a model for an Arab genetic disease database that is under construction. This database will catalogue genetic disorders found in Arab populations that characteristically have high consanguinity, large family size, a high frequency of autosomal recessive disorders, presence of multiple isolates, and high frequency of new syndromes and variation.

There is an apparently increased frequency of new disorders, especially autosomal recessive ones. There is also an increased frequency of homozygosity for autosomal dominant disorders (e.g. familial hypercholesterolaemia) and for X-linked recessive disorders (e.g. G6PD).

There is a lot of interest and research in Arab populations, hence the need for a database of Arab gene disorders. Work in this area started with the book "Genetic Disorders among Arab Populations" (A. S. Teebi and Farag, 1997, OUP), which subsequently led to the formation of The Middle East Genetics Association of America. Potential users of the database are researchers, physicians with Arab patients, physicians in Arab nations, geneticists, counsellors, health care planners, and students.

Data will be collected from (a) the international literature; (b) "local" refereed journals not indexed in PubMed, Index Medicus etc., which is important because they are located in developing countries and the data are not published elsewhere; and (c) the Arab Genetic Disease Consortium (currently 30 investigators in 15 countries). The database's search capabilities will include OMIM, disease name, clinical synopsis, frequency, mutations, haplotype data, and references. Diagnostic tools will help trait groups diagnose a disorder and bank information about unknown cases (as selected by the curators). Links will be made available to OMIM, GDB, LSDBs & repositories (as many as possible), PubMed, HGMD/Cardiff, and others. The database will be Web based and released in early 2001 at http://www.agddb.org/. It is expected to have around 1000 mutation entries at that time.

Sue Povey from the HUGO Gene Nomenclature Committee (HGNC) spoke on the interaction of human gene nomenclature and the mutation databases. The only rule of the HGNC is that all approved symbols are unique; however, guidelines (http://www.gene.ucl.ac.uk/nomenclature/guidelines.html; White et al., Genomics 1997) are followed when new symbols are assigned. The number of approved symbols is increasing. In October 2000 there were approximately 12,000 approved symbols. A gene requires a name when somebody is interested in it and wants to talk about it. Since the introduction of the HGNC website in October 1997, the hits have risen to just over 20,000 per month.

The HGNC maintains and curates its own database of approved symbols: Genew3. This has brought an increased number of fields and better reporting capability, including a new output file at http://www.gene.ucl.ac.uk/public-files/nomen/nomeids.txt which contains LocusLink, RefSeq, GDB identifiers and stable HGNC gene symbol IDs. Thus, imports and exports of data with other databases, e.g. HGMD, are better facilitated, as tracking can be verified via the ID numbers.

If all mutation databases use the HGNC approved symbols then they will be better able to interact with each other. Hester Wain (Email: nome@galton.ucl.ac.uk) should be contacted about the HGNC database and other nomenclature updates.

A summary of the ASHG nomenclature workshop (held on Monday October 2, 2000) covered the definition of a gene; a systematic approach for genes, pseudogenes, open reading frames and "like" genes; symbol consistency between species; co-ordination and collaboration; and high throughput genomic sequence data analysis. This will be placed online shortly at http://www.gene.ucl.ac.uk/nomenclature/ASHG-NW.html. The most controversial issue raised was once again the definition of a gene. At present the HGNC definition is being used: A gene is a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology. [White et al 1997]

Daniel Nebert from the University of Cincinnati Medical Centre spoke about considerations for the naming of human alleles. Nomenclature committees face enormous difficulty in deciding upon the best name for each gene while at the same time using standardized nomenclature rules for the naming of allelic variants. Evolution of Homo sapiens over the last 800,000 years or so has seen divergence of alleles, similar to the way genes diverged on the earth during the past 500 million to 3.8 billion years. As expected, African alleles are older and hence have two or more times as many mutations within any gene as those of Caucasians or Asians. The evolving consensus for naming the alleles of all human genes, based ideally on evolutionarily diverging haplotype patterns, has been described in Nebert DW (2000) Pharmacogenetics 10: 279-90.

High-throughput sequencing and DNA-chip technologies are expected to lead to an explosion in the discovery of new allelic variants. This deluge of new information will overwhelm journals. The best approach could be to place the data on Websites with links between them. Information would have to be frequently updated so that those in all fields of medical and genetic research can remain knowledgeable. The Websites for the cytochrome P450 (CYP) genes and human CYP alleles, UDP glycosyltransferase (UGT) genes and human alleles, human N-acetyltransferase (NAT2, NAT1) alleles, and aldehyde dehydrogenase (ALDH) genes and human alleles are examples of such successful Websites. Many more Websites will be necessary. The curator will need to be responsible, accurate, energetic, very organized and keen to keep the site current. Interactive discussions on these sites should be encouraged, and advisory committees should check frequently to ensure all new information is accurate. The Websites may be linked from the list of LSDBs on the MDI website.

Heikki Lehväslaiho from EBI, UK presented gene and genome wide tools for variation management. The current Mutation Checker is a tool to verify the effects of sequence changes in DNA, RNA and amino acid sequences. The web interface to it is at http://www.ebi.ac.uk/cgi-bin/mutations/check.cgi. The Checker has various problems due to gradual growth over the years; most importantly, it cannot handle genomic reference sequences. A more modular design is being implemented using an object-oriented approach in the open source Bioperl project (http://bio.perl.org/). An extensive integrated gene model in the namespace Bio::LiveSeq has been created to track down mutations. The resulting mutation description is then written into objects in Bio::Variation. These objects handle the analysis as well as the input and output of information in various formats. The code can be used to describe canonical mutations from wild type to mutated, polymorphisms with more than two alleles, and pairwise sequence alignments with multiple differences. The approach can also be extended to handle multiple, overlapping mutations.
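
To illustrate the kind of check described here, the following is a minimal sketch in Python of how the effect of a single base substitution on the encoded amino acid might be verified. It is not the EBI Mutation Checker and does not use the Bio::LiveSeq or Bio::Variation interfaces; the function names (translate, check_substitution) and the returned fields are illustrative assumptions. It handles only a coding sequence in frame, using the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding DNA sequence whose length is a multiple of three."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def check_substitution(cds, position, ref, alt):
    """Describe the effect of replacing `ref` by `alt` at a 1-based coding position."""
    cds, ref, alt = cds.upper(), ref.upper(), alt.upper()
    if len(cds) % 3 != 0 or set(cds) - set(BASES):
        raise ValueError("expected an in-frame coding sequence of A, C, G and T")
    if len(alt) != 1 or alt not in BASES:
        raise ValueError("the new base must be one of A, C, G or T")
    if not 1 <= position <= len(cds):
        raise ValueError("position lies outside the coding sequence")
    if cds[position - 1] != ref:
        raise ValueError("reference base does not match the sequence")

    mutated = cds[:position - 1] + alt + cds[position:]
    codon_number = (position - 1) // 3 + 1
    ref_aa = translate(cds)[codon_number - 1]
    alt_aa = translate(mutated)[codon_number - 1]

    if ref_aa == alt_aa:
        effect = "silent"
    elif alt_aa == "*":
        effect = "nonsense"
    elif ref_aa == "*":
        effect = "stop lost"
    else:
        effect = "missense"
    return {"codon": codon_number, "ref_aa": ref_aa, "alt_aa": alt_aa, "effect": effect}


if __name__ == "__main__":
    # Toy sequence (Met-Glu-stop), not taken from any real gene.
    print(check_substitution("ATGGAGTGA", 4, "G", "A"))
```

A checker that works from genomic reference sequences, as the new design intends, would also need exon coordinates in order to map a genomic position to a codon; that step is left out of this sketch.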

This tool is being used to create HGBASE, a public database of human intra-genic DNA polymorphisms (http://hgbase.cgr.ki.se/). To establish a unified view of human SNPs HGBASE aims to be an accurate, high utility and fully comprehensive catalogue of normal human genome variation. Data in HGBASE is collected from all major public genome databases and is also extracted from published literature. Research groups are also able to submit data. Extensive annotation and internal checking is undertaken to ensure high quality of data.

HGBASE collects SNPs from the literature; v.7 (19/5/'00) has 61,598 human polymorphisms. In contrast, dbSNP expects people to submit data and has shown no increase in entries. HGBASE and dbSNP have started the process of synchronizing their data. The rate of growth of HGBASE is likely to be exponential for some years. Currently 99% of entries are SNPs. Of 13,369 SNPs, 91% are gene related.

HGBASE can be freely accessed using the EBI SRS (Sequence Retrieval System) by keyword search and sequence similarity search using FASTA3 to query HGBASE DNA sequences. The complete database is available for downloading.

EnsEMBL (http://www.ensembl.org/) has been created by EBI, EMBL and the Sanger Centre. It produces and maintains annotation on the human genome automatically, resulting in a database containing genomic sequence with features attached. Map, disease and sequence variation data are added as external databases to be viewed in genomic contig context. The SNP data come from HGBASE and dbSNP. The human genome data are produced in monthly data freezes, on the basis of which genes are located. The genes are predicted using ab initio methods and confirmed using homologies to known genes and ESTs. The July 15th data set contained 29,472 confirmed genes.

Saeed Teebi took to the podium again and presented BiSC's mutation submission WayStation and central mutation database pilot. At the Vancouver meeting it was agreed that BiSC/GDB would develop a prototype WayStation to collect mutations and distribute them to LSDBs and central databases. This would be a component of the pilot project for a central mutation database. The collaborators in this are BiSC, HGBASE, EBI, and dbSNP.

The submission WayStation aims to create a central point so that data are submitted in one place only, and to provide a consistent interface and format for these data. Another function of the WayStation would be to redirect submitted data to the appropriate curators, who are best placed to review them. Draft 4 of the allele variant entry form developed by the MDI in 1999 has been adopted and made into a submission form (http://ariel.ucs.unimelb.edu.au:80/~cotton/entry.htm). Data entered into the WayStation are compliant with gene and nomenclature recommendations.

The major segments of the WayStation are submitter information, source data, general mutation information (including DNA data), RNA, protein, and population data etc. Note that the term "mutation" includes polymorphisms too. The submitter registration has the following fields: contact information, user name and password. Submitters are verified by the WayStation curator. Valid submissions to the WayStation can only be made by a verified submitter, i.e. one at a legal institution.

Data are submitted by registered users only, through a streamlined web-based submission. Two segments are required for a valid submission: (a) the source of the submission, i.e. the submitter, and (b) general mutation data, gene locus etc. It was decided that a third, quality control section, as in draft 4 of the entry form, would be added, as this had been omitted by mistake. The source is a required field and may be (a) a published reference or PubMed ID, or (b) a personal communication to the WayStation, in which case the entry form must be filled out. General mutation information is a required field and covers basic mutation data, e.g. species, locus, gene symbol. DNA data, e.g. name, location, base change type, are included. Additional information, such as RNA and protein data and cross-reference data, is not required to submit a mutation but may be added.
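
The required and optional segments described above can be pictured as a simple validation step. The following Python sketch is only an illustration: the class names, field names and error messages are hypothetical, and the choice of species, locus and gene symbol as the mandatory general fields is an assumption based on the examples given in this report, not the actual WayStation schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Source:
    """Where the mutation comes from: a publication or a personal communication."""
    published_reference: Optional[str] = None
    pubmed_id: Optional[str] = None
    personal_communication: bool = False

    def is_present(self):
        return bool(self.published_reference or self.pubmed_id or self.personal_communication)


@dataclass
class GeneralMutationData:
    species: str = ""
    locus: str = ""
    gene_symbol: str = ""
    dna_change: str = ""  # name, location and type of the base change


@dataclass
class Submission:
    submitter_verified: bool
    source: Source
    general: GeneralMutationData
    # Optional segments such as RNA, protein or cross-reference data.
    additional: dict = field(default_factory=dict)


def validation_errors(submission):
    """Return a list of problems; an empty list means the submission is acceptable."""
    errors = []
    if not submission.submitter_verified:
        errors.append("submitter has not been verified")
    if not submission.source.is_present():
        errors.append("source is required")
    for name in ("species", "locus", "gene_symbol"):  # assumed mandatory fields
        if not getattr(submission.general, name).strip():
            errors.append(f"general mutation field '{name}' is required")
    return errors


if __name__ == "__main__":
    draft = Submission(
        submitter_verified=True,
        source=Source(personal_communication=True),
        general=GeneralMutationData(species="Homo sapiens", locus="", gene_symbol="PAH"),
    )
    print(validation_errors(draft))
```

Returning a list of errors rather than stopping at the first one lets a web form report every missing field in a single pass.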

What happens to data after submission? Data are "pushed" to the WayStation, where a unique identifier is assigned. They are then redirected to the appropriate LSDB. If permission is given by the LSDB curator, the submission is then forwarded to MDI's central mutation database. Curators may modify, query, and approve data. They notify the WayStation of the status of the data and may submit information to the central database. The data, however, stay in the LSDB. A pilot prototype of the central database has been constructed and may be viewed at http://www.centralmutations.org/. It is designed to accept data from diverse sources and stores submissions from front-end forms such as the WayStation. It includes a number of flexible queries. Several databases, e.g. PAH db, CF, Fanconi, HEXA & B, PHEX and CASR, are already included as a demonstration and are now available for query. Links to the original LSDBs are in place. This is still a work in progress, and feedback and suggestions are critical and needed. The prototype was available for demonstrations at the GDB/BiSC booth in the Exhibitors hall at the ASHG meeting.
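
The flow of a submission described above (identifier assigned, redirection to an LSDB, forwarding to the central database only with the curator's permission) might be sketched as follows. This is a hedged illustration in Python, not the BiSC/GDB implementation: the class, the identifier format "WS000001", the routing table and the handling of genes without an LSDB are all assumptions made for the example.

```python
import itertools


class WayStationSketch:
    """Toy model of the submission flow; all names and formats are hypothetical."""

    def __init__(self, lsdb_by_gene):
        self.lsdb_by_gene = dict(lsdb_by_gene)  # gene symbol -> LSDB name
        self.central_db = []                    # copies released by curators
        self._next_number = itertools.count(1)

    def receive(self, gene_symbol, dna_change):
        """Assign a unique identifier and redirect the record to its LSDB."""
        return {
            "id": f"WS{next(self._next_number):06d}",
            "gene_symbol": gene_symbol,
            "dna_change": dna_change,
            # Assumption: records for genes without an LSDB stay unrouted (None).
            "lsdb": self.lsdb_by_gene.get(gene_symbol),
            "status": "pending review",
        }

    def curator_decision(self, record, approved, release_to_central):
        """Record the curator's verdict; forward a copy only if permission is given."""
        record["status"] = "approved" if approved else "rejected"
        if approved and release_to_central:
            # The record itself stays with the LSDB; the central database gets a copy.
            self.central_db.append(dict(record))
        return record


if __name__ == "__main__":
    station = WayStationSketch({"PAH": "PAH db"})
    rec = station.receive("PAH", "placeholder change")
    station.curator_decision(rec, approved=True, release_to_central=True)
    print(rec["id"], rec["lsdb"], rec["status"], len(station.central_db))
```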

Frank Russo from Incyte Genomics presented a summary of Incyte's new Genomics Knowledge Platform (GKP).

GKP is built according to Metcalfe's Law:

1. The cost of a network rises linearly with the number of nodes.
2. The value of a network rises with the square of the number of nodes.

It has reached the point where the network must be joined for anything to happen; Incyte wants to see this in bioinformatics.
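
Metcalfe's Law is commonly formulated in terms of the number of possible pairwise connections between nodes, n(n-1)/2, which grows roughly with the square of the number of nodes while the cost grows in proportion to n. A small Python sketch of the arithmetic follows; the function names and the unit cost per node are illustrative assumptions, not part of the presentation.

```python
def network_cost(nodes, cost_per_node=1.0):
    """Cost grows in direct proportion to the number of nodes."""
    return nodes * cost_per_node


def possible_links(nodes):
    """Number of distinct pairs of nodes that could exchange data."""
    return nodes * (nodes - 1) // 2


if __name__ == "__main__":
    for n in (2, 10, 100):
        print(n, network_cost(n), possible_links(n))
```

On this reading, going from 10 to 100 linked databases multiplies the cost by ten but the number of possible pairwise links by 110 (from 45 to 4,950).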

Incyte's GKP will change the way that researchers collect diverse information from many sources. Often these information searches are slow and individual data sets have to be queried in isolation. GKP is a platform that can process data from a variety of sources and file formats. Data types include gene sequence, expression, polymorphism, and proteomic data, as well as functional data.

Incyte's Unified Object Model, onto which these data are mapped, provides the foundation for linking, communicating, and sharing information about biological functions and interactions. Software applications will combine this information with evidence from clinical trials and animal studies so that researchers can quickly investigate the complex associations that underlie many important biological problems.

The architecture of GKP was explained and may be viewed in detail by anyone interested at: http://www.incyte.com/our_science/gkp/architecture.shtml. Incyte wishes to add value to its own data by giving value to other data; that is why it is undertaking this project.

Francisco de la Vega from Applied Biosystems, Foster City, CA, presented an ontology for SNP assays. In sharing knowledge, different systems use different concepts and terms for describing domains. To promote interoperation between genomic databases, standardization efforts have been advanced, such as the ASN.1 standard, relational database schemata (GATC consortium, NCGR microarray database), object-oriented database schemata/models, OPM libraries (e.g. as used by GDB), XML, and CORBA IDL specifications (OMG LSR-TF). These are platform or technology dependent (e.g. relational vs. object databases, CORBA vs. Java) and therefore suffer because their representational languages lack the expressiveness to define concepts and algorithms. Any users of the particular technology/approach are likely to adopt these standards. They are theoretical for sharing experiences and best practice.

Ontologies are formal representations of the sorts of objects, properties of objects, and relationships between objects that are possible in a specified domain of knowledge. They are used to exchange data among programs, unify disparate representations, embody the representation of a theory, enable K-base (knowledge base) services, and facilitate communication between software and scientists. Their advantages are that they are independent of platform and implementation technology, provide a structure around which knowledge bases can be directly built, are more useful for sharing data models and algorithms, and can be used to drive database development.

The Polymorphism Assay (PA) ontology was described. It represents polymorphisms and the assays used to screen them; it also facilitates communication during requirements gathering. There are three main parts of the representation: (a) methods, covering assays (sequencing, primer extension, PCR, Taqman) and detection (fluorescence, electrophoresis etc.), (b) genomic segments, and (c) reagents. PA has been successfully used for requirements capture in the design and development of database systems to capture high-throughput SNP-genotyping assays under development.
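
As a rough illustration of how the three main parts of such an ontology might be written down independently of any database technology, the following Python sketch stores each term with its parent in a simple "is a" hierarchy. The term names and their grouping are one reading of the list given in this report, not the actual PA ontology, and the helper functions are hypothetical.

```python
# Each term maps to its parent term; None marks a top-level part of the ontology.
PA_SKETCH = {
    "Method": None,
    "Assay": "Method",
    "Detection": "Method",
    "Sequencing": "Assay",
    "PrimerExtension": "Assay",
    "PCR": "Assay",
    "Taqman": "Assay",
    "Fluorescence": "Detection",
    "Electrophoresis": "Detection",
    "GenomicSegment": None,
    "Reagent": None,
}


def ancestors(term, ontology=PA_SKETCH):
    """List the parents of a term, nearest first."""
    chain = []
    parent = ontology[term]
    while parent is not None:
        chain.append(parent)
        parent = ontology[parent]
    return chain


def is_a(term, candidate, ontology=PA_SKETCH):
    """True if `term` is the candidate itself or one of its descendants."""
    return term == candidate or candidate in ancestors(term, ontology)


if __name__ == "__main__":
    print(ancestors("Taqman"))
    print(is_a("Fluorescence", "Method"), is_a("Reagent", "Method"))
```

Because the hierarchy is plain data, the same definitions could in principle be used to generate a relational schema or an object model, which is the sense in which an ontology can drive database development.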

PA is implemented in an object-oriented language used by their object-to-relational middleware (OPM), and the backend was deployed in ORACLE 8i. Future plans for availability of this ontology include sharing it with the pharmacogenomics projects for inclusion in their model and publishing it on the "Ontolingua" KB server at Stanford.

Charles Scriver from McGill University, Montreal, Canada and curator of the PAH database, presented a short summary of the PAH database as a model database after a brief presentation of the history of the MDI.

MDI patronage has been received from HUGO, and financial support has come from the March of Dimes. From a nomenclature meeting in 1994 we have arrived at where we are now. The human genome project (HGP) has multiple players, just as the MDI project has multiple players serving the HGP. The other side of evolution is allelic variation in genes. The MDI serves society by providing knowledge; the people attending these meetings are here because they are interested.

The PAH database includes information on the disease phenylketonuria. It does this because it is information to share with the consortium and is of general interest. Jamie Cuticchia et al from BiSC for example, will take mutation data, distribute, and enlarge it. This is OK. A bigger company can also do this same thing and gain a profit out of it. Why not?

Mutations are highly annotated (population types, expression analysis). A newsletter is a paper version of some of this. Some of this is used to work out the structural biology of PAH. Patients use it: a clinical model of PKU tells parents with a PKU child what they need to know. High school students, for example, also use it and need to know.

There may be a curatorial problem. For example, what will happen when Charles is too old to curate the database? Various people have come and gone and curated the database e.g. Teebi, Hong, Nowacki. Now we have Ziggy Zeng. Ziggy is 17 years old and in her final year of high school. She has interest and enthusiasm in science and is a curator of the PAH database. She has launched a scientific journal for youth called "JOYS" Journal of Youth Science aimed at 15-25 year olds in over 25 countries. Charles then introduced Ziggy who spoke about her perspective as a scientist of the future.

Ziggy Zeng says she has an interest and passion for science. She believes that age is not a prerequisite for success and tells us that scientists must also become educators in the classrooms and to the general public. Scientists must not be seen as a highly educated elite. Politicians should have a sufficient scientific background to understand the important issues and governments should provide larger budgets for research and development.

The MDI encourages communication and cooperation and the depository planned fits in well with the 21st century and globalization. There is a need for consciousness and visionary leaders.

Nicolas Neckelman from Transgenomic Inc., San Jose, CA, discussed MutationDiscovery.com, a mutation database and resource for methods and applications for the DHPLC community. MutationDiscovery.com is an Internet-accessed database that will feature sequence variants found in organisms whose nucleic acids have been analyzed using denaturing high performance liquid chromatography (DHPLC). An up-to-date collection of DHPLC methods and applications will also be available on this website.

Data obtained with any such system may be submitted and posted. Every variation will be catalogued at the cDNA level and genomic DNA level where possible. The exact location of each mutation will be shown at the sequence level while referring to well-defined reference sequences and accession numbers. Where applicable, changes in amino acid sequence will also be shown. Methods files containing the DHPLC chromatograms, PCR primers and conditions, and DHPLC conditions will be provided for each sequence variant entered in the database. Method files posted at one location will be able to be accessed by WAVE® systems at any other location in the world. This interaction between the WAVE® systems, MutationDiscovery.com Website, and individuals, will facilitate publication, exchange, sharing and standardization of DHPLC data obtained by the worldwide community of WAVE® systems users.


Discussion

The afternoon session was devoted to a discussion of various issues.

The Mutation Submission WayStation & Central Mutation Database Pilot was presented by Saeed Teebi in the morning and is summarized above. All agreed that this had moved along nicely since the last meeting in Vancouver and were happy for BiSC/GDB to continue their efforts.

It was proposed that HUGO-MDI form a Society. Mark Paalman and Colette Bean, representatives from John Wiley & Sons Inc., were in attendance and proposed that Human Mutation become the official Society journal. Mark, the Managing Editor, outlined in brief various proposals as to how we could work together. Human Mutation would publish meeting reports and advertise meetings in the journal (it already does so) as well as consider manuscripts on mutation databases. (In fact, a byline, "variation, databases, and disease", was added to the cover earlier this year, which shows their commitment; furthermore, a special issue devoted to HUGO-MDI was published at the beginning of the year.) There were two models: (a) an obligatory model, i.e. charging a fee to members, who would receive a copy of the journal, or (b) a non-obligatory model making subscription optional, in which case subscription would be at a reduced rate compared with normal but not as cheap as under the obligatory model. This is subject to further discussion. There was unanimous agreement that MDI should indeed form a Society and that Human Mutation be the official journal.

Proposal from Industry for funding Discussion

During the summer of 2000 MDI sought to gain funds and approached various sources, industry among them. Incyte Genomics provided MDI with a proposal and a Memorandum of Understanding (MOU) on the 2nd of Oct., the day before this meeting. Much heated discussion took place in the afternoon session, and issues such as exclusivity were brought up. Some were totally against "doing a deal" with a private company. Since the meeting another MOU has been sent by Incyte to MDI and is currently under consideration. This is available on request from Rania (horaitis@ariel.ucs.unimelb.edu.au). The main points that arose out of the discussion are:


Resolutions of the meeting

    1. HUGO-MDI to form an official Society.
    2. HUGO-MDI adopt Human Mutation as official Society journal.
    3. Incyte MOU dated Oct. 2 needs to be thoroughly examined.

The successful meeting ended with a mixer where delegates mingled and ideas were exchanged. We hope to see you all next year on 19th April in Edinburgh, Scotland and/or 12th Oct. in San Diego, U.S.A. Mark it in your diary.


A brief meeting took place on Oct. 4 with representatives of LSDBs and Central DBs. This was to further discuss the MOU and what actions MDI should take. Since the meeting this group has been systematically working on a better deal as well as defining the objectives of the MDI. A second MOU was sent by Incyte and may be obtained by e-mail from Rania. You will be advised of any proposals.