San Francisco HUGO-MDI meeting '99

7th International HUGO Mutation Database Meeting
October 19, San Francisco 1999

Reported by: Ourania Horaitis, Heikki Lehväslaiho and Richard G.H. Cotton

ETHICAL ASPECTS
DATABASE PROTECTION
COPYLEFTING
ECONOMIC ISSUES
GUIDELINES AND RECOMMENDATIONS FOR CONTENT
A WAY STATION FOR MUTATION DATA
APOGI - A SYSTEMATIC APPROACH TO DEPLOYING GENETIC INFORMATION IN CLINICAL PRACTICE
GeneClinics™
MUTbase
A NEW MODEL APC DATABASE
HCForum™
POSTER PRESENTERS
KEY ISSUES
RESOLUTIONS

The 7th International HUGO Mutation Database Meeting was held in association with ASHG '99 on the 19th of October in San Francisco, U.S.A. Warm, sunny weather and the steep hills of the city provided an excellent setting for a successful meeting. Eighty registrants from 12 countries attended, nine of them from eight companies. This year, topics of current importance that had not been covered in previous years were sought, such as ethics, protection of databases and economic issues.

After an initial welcome and review of the current state of the MDI by Richard Cotton, the meeting began with an ethics paper, a topic not covered in previous meetings. Bartha Maria Knoppers together with Claude Laberge from the University of Montreal, Canada were invited to speak about the ETHICAL ASPECTS of variation databases. The main point of the presentation was that it is important to verify whether the statements issued by HUGO's Ethics committee are applicable to the MDI. As the MDI has parallels with epidemiology, relevant HUGO and UNESCO statutes exist that can be adapted to MDI purposes. Key points made in the presentation were that the scientific content of databases must be correct and that consent must be obtained, particularly from the vulnerable. Any new codes should not be retrospective. Confidentiality is an issue: the more clinical data that is included, the greater the ethical problem, because the contributor will be identified and hence the patient may also be identified. Any conflict of interest should also be stated. Because of public access, the purpose and limits of particular mutation databases must be clarified so that they are not confused with other types of genetic database. Interactive questions raise the issue of competency for counseling.

The main conclusions were:

The exchange of information empowers the less well informed.
We should be aware genetic diseases affect families in countless communities in many countries. Information must be protected against misuse.
We should be proactive regarding data security and access.
A common ethical framework from inception of a database is best. Guidelines should be set up immediately.

The next invited speaker, Jerome Reichman from the Vanderbilt School of Law in Nashville, discussed DATABASE PROTECTION. The main point was that scientists need to be active, as overly protective initiatives may compromise research-based institutions in the U.S. For example, the European Parliament has recently passed a law that protects databases.

Unless suitable adjustments are made, the emerging legal infrastructure could have an adverse impact on the users of factual data. Proposed legal protection of databases in the U.S.A., if enacted, would:

(a) Compromise scientific databases.
(b) Diminish the easy access that is a key to U.S. science.
(c) Block upstream access.
(d) Make it costly to aggregate data.
(e) Have a major impact on science.

To continue the theme of protecting databases and information, Heikki Lehväslaiho from the European Bioinformatics Institute (EBI), Cambridge, introduced the idea of "COPYLEFTING" scientific databases. A major trend in biological research is to collect data, publish it and arrange it into databases. However, a major fear is that this freely accessible public data can then be taken over by companies seeking commercial gain. The "copylefting" of databases is being suggested as a solution. The software community has used this approach for software development and documentation for some time now. Copyright legislation is used to guarantee the availability of the information and prevent its inclusion in any commercial applications. Two existing examples are the Linux kernel and the Pfam database (http://www.sanger.ac.uk/Software/Pfam/). A formalised open practice for academic research is the general goal. It was recommended that the MDI agree on a common copyleft/open database license and that locus-specific databases (LSDBs) start using a copyleft license. It was also suggested that the organisation of LSDBs should allow for easy separation of freely distributable general variation and phenotype data from more sensitive data from individuals in need of tighter protection.

Mutation databases are always trying to obtain funds to continue their valuable work. Stephen Maurer, an attorney at law from Berkeley, U.S.A., discussed the ECONOMIC ISSUES of databases. At present there is no real-world example of a self-supporting biology database, despite the interest. It is estimated that a bioinformatics database could usefully employ as many as 10 workers and would therefore need approx. $2M/year, whereas other scientific databases need about $250K to $5M/year. Revenue may be gained from an annual subscription, e.g. SWISS-PROT has a sliding-scale annual subscription of $5K-$90K/yr. The number of users likely to utilize the service must be calculated in order to gain an insight into how to raise revenue, e.g. 200 customers who pay $10K/yr each will yield $2M per year and therefore potentially 10 full-time employees. Aiming at the mass market could lead to potentially similar results e.g. $1000/yr x 20,000 users = $2M/yr. On the other hand, if the high end of the market is pursued there may be a higher gain, e.g. $10K/yr x 300 customers = $3M/yr.
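The subscription arithmetic quoted in the talk can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the $2M target is the approximate budget mentioned above. As printed, $1000/yr x 20,000 users multiplies out to $20M rather than $2M, so the sketch also works out which price or user count would actually reach the $2M target.

```python
# Illustrative subscription arithmetic. Function names are hypothetical;
# the dollar figures are those quoted in the report.

def annual_revenue(price_per_year: float, customers: int) -> float:
    """Revenue from a flat annual subscription."""
    return price_per_year * customers


def customers_needed(target: float, price_per_year: float) -> float:
    """Number of customers required to reach a revenue target at a given price."""
    return target / price_per_year


def price_needed(target: float, customers: int) -> float:
    """Annual price required to reach a revenue target with a given customer base."""
    return target / customers


TARGET = 2_000_000  # approx. yearly budget for about 10 staff, as quoted

scenarios = {
    "200 customers at $10K/yr": annual_revenue(10_000, 200),
    "300 customers at $10K/yr (high end)": annual_revenue(10_000, 300),
    "20,000 users at $1000/yr (as printed)": annual_revenue(1_000, 20_000),
}

for label, revenue in scenarios.items():
    print(f"{label}: ${revenue:,.0f}/yr")

# What would a mass-market offer need in order to hit the $2M target?
print(f"Price for 20,000 users: ${price_needed(TARGET, 20_000):,.0f}/yr")
print(f"Users needed at $1000/yr: {customers_needed(TARGET, 1_000):,.0f}")
```

Either reading of the mass-market example (a lower price or fewer users) gives the "similar result" the speaker described; the report does not say which was intended.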

One strategy envisioned is "discriminatory pricing", i.e. trying to sell our product to some users as well as give it away to others! The "airline model" may be used. In this model there are two markets that have very different preferences, i.e. the business traveler will not want to stay the weekend whereas the vacationer will, so substantially different fares can be applied to each. This model could be used to distinguish free academic use of a mutation database from use by a company that is willing to pay for the data. Characteristics of our product must be chosen so that there are different products for different users and hence the corporate user may be charged. A tiered structure may be used. For example, to gain information about the stock exchange you may buy a newspaper for 25 cents, pay $10,000/yr to get trade information online, or even pay $25,000 a year to get a "real time" update.

Several complementary business strategies may be implemented in the case of databases:

Updates

Some companies may want an updated database every night and be prepared to pay for it. Academic free users may not care if the database they access is 1 or 2 months old.

Metering strategies

The user can be charged depending on how many bytes and what information are downloaded.

Custom products

The curator knows the database best and may be able to produce information that people are willing to pay for.

Alert strategies

People may pay for an automatic alert whenever new data is available.
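The "Updates" and "Metering" strategies above can be combined into a simple tier definition. The following is a minimal sketch only: the tier names, fees, delay and per-megabyte rate are all hypothetical values chosen for illustration, not figures from the talk.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Tier:
    name: str
    annual_fee: float          # flat subscription, in dollars
    release_delay_days: int    # how old the visible data may be
    fee_per_megabyte: float    # metered download charge, in dollars


# Hypothetical tiers: free academic access to older data,
# paid corporate access to the current data with metered downloads.
ACADEMIC = Tier("academic", annual_fee=0.0, release_delay_days=60, fee_per_megabyte=0.0)
CORPORATE = Tier("corporate", annual_fee=10_000.0, release_delay_days=0, fee_per_megabyte=0.50)


def newest_visible_release(tier: Tier, today: date) -> date:
    """Latest release date a user on this tier is allowed to see."""
    return today - timedelta(days=tier.release_delay_days)


def metered_charge(tier: Tier, bytes_downloaded: int) -> float:
    """Charge for a download, based on its size."""
    megabytes = bytes_downloaded / 1_000_000
    return round(megabytes * tier.fee_per_megabyte, 2)


today = date(1999, 10, 19)
print(newest_visible_release(ACADEMIC, today))   # data about two months old
print(newest_visible_release(CORPORATE, today))  # current data
print(metered_charge(CORPORATE, 5_000_000))
```

The point of the design is that the same database serves both markets; only the delay and the charge differ between tiers.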

Charles Scriver of McGill University in Montreal, Canada, updated us on the GUIDELINES AND RECOMMENDATIONS FOR CONTENT, structure and deployment of mutation databases that were published in "Human Mutation" 13:344-350 and are now available on the web at: http://www.wiley.com/products/subject/life/genetics/genetics_humu_article1.pdf . An updated document will be published in the January 2000 issue of "Human Mutation" (Hum. Mut. (2000) 15:in press). The main point made was that genomic or central databases are a mile wide and an inch deep, whereas locus specific databases are a mile deep and an inch wide. Many LSDBs are better described as knowledge bases, as their content and use go well beyond lists of mutations. For example, the phenylalanine hydroxylase (PAH) mutation database has a curatorial team of 10, phenotype data on 600 patients, a PAH structure component, a mouse model section and a resource booklet for families.

Jamie Cuticchia of the Genome Data Base (GDB) in Toronto, Canada, spoke regarding the creation of A WAY STATION FOR MUTATION DATA. Several groups such as the NCBI, GDB, HGMD, EBI and others have produced or are producing databases that collect and store variation information. It was proposed that a way station be created as a central site for the collection of variations. This way station should have a unique URL so there is no "real" ownership of it (e.g. by GDB). Variation data will be collected by direct submission and from the literature. This information will then be reviewed and channeled to the appropriate databases, e.g. dbSNP, HGMD, OMIM and MUTATION VIEW, and/or via an LSDB if one exists, for them to use as they wish. The way station will act as a central transfer point for mutation databases to exchange information, i.e. information may be exchanged both ways between the databases and the way station.

The way station will not present data to the public; it will act only as the submission and transfer point. Links and advertising for relevant variation databases will be included on the site.

The submission of variation data will be based on the HUGO-MDI form, implemented in Java or another web-based tool. The submitter will be provided with a report to verify that the submission has been received and to show where the information is being sent. Where appropriate, links to recipient databases will be provided within the report.

A way station or central entry point provides several advantages: one URL (HUGO-MDI) for submission; one site for advertising mutation databases; curation kept in the hands of those who are expert in the gene(s) and willing to do it (this is the best possible curation); a wider source of submissions; the ability of multiple databases to curate in particular areas; and a single place to go for those who may not be aware of where to submit such data (e.g. clinicians).
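The proposed flow (one submission point, forwarding to recipient databases, and a report back to the submitter) might be sketched as follows. This is an illustration under stated assumptions, not a description of any implemented system: the class names, the registry and the rule that every central database receives every submission are hypothetical, and the proposal itself calls for a review step that is not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, List

# Recipient databases named in the proposal.
CENTRAL_DATABASES = ["dbSNP", "HGMD", "OMIM", "Mutation View"]

# Hypothetical registry mapping a gene symbol to its locus-specific database.
LSDB_REGISTRY: Dict[str, str] = {"PAH": "PAH mutation database"}


@dataclass
class Submission:
    gene: str
    variant: str      # free-text description of the variation
    submitter: str


@dataclass
class Receipt:
    """Report returned to the submitter: what was received and where it went."""
    submission: Submission
    forwarded_to: List[str]


def route(submission: Submission, registry: Dict[str, str] = LSDB_REGISTRY) -> Receipt:
    """Forward a submission to the LSDB for its gene (if one exists) and to the
    central databases. The way station itself keeps nothing for public display."""
    recipients: List[str] = []
    lsdb = registry.get(submission.gene)
    if lsdb is not None:
        recipients.append(lsdb)
    recipients.extend(CENTRAL_DATABASES)
    return Receipt(submission, recipients)


receipt = route(Submission(gene="PAH", variant="(placeholder variant)", submitter="clinician"))
print(receipt.forwarded_to)
```

A submitter only needs to know the one entry point; the registry lookup is what lets curation stay with the experts for each gene.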

Matthew Darlison of the Centre for Health Informatics and Multiprofessional Education and Department of Primary Care & Population Sciences, University College London, U.K., spoke of APOGI - A SYSTEMATIC APPROACH TO DEPLOYING GENETIC INFORMATION IN CLINICAL PRACTICE. Delivery of large-scale services to the public is irregular because of failings in infrastructure. Many failures result from a lack of awareness or information somewhere in the health system. ApoGI is a model for the delivery of up-to-date genetic counseling information to professionals and their clients. Molecular variation data, phenotypic data and epidemiological data are systematically integrated to produce text that supports services for haemoglobin variants in the U.K. Consultation is ongoing with patients, parents, health professionals and other interested parties. The resource is freely available on the web (http://www.chime.ucl.ac.uk/ApoGI/) or on CD-ROM and is in everyday use in centres around the U.K. The ultimate aim of the initiative is to automate the provision of high-quality, mutation-specific written information to all clients regardless of the rarity of the diagnosis or their cultural background.

Aravinda Chakravarti discussed the issue of content of variation databases. Most current databases concentrate on a gene locus and its variation. Databases with genomic context are needed to manage data for complex, multifactorial diseases, e.g. Hirschsprung's disease. The essence of the presentation was that description of canonical mutations will not be enough; detailed information about each patient, such as haplotypes, geographic variation, ethnic background and phenotype, should be included.

Peter Tarczy-Hornoch of the University of Washington, U.S.A., presented an application of genetic testing to diagnosis, management and genetic counseling, i.e. the GeneClinics™ database (http://www.geneclinics.org/). One of the first applications for most new genetic discoveries is genetic testing. Clinicians in all domains will need to have up-to-date information about the operation and existence of genetic testing as the genetic basis of common disorders is elucidated. Two complementary databases have been created to address these needs: GeneTests™, a directory of genetic tests, and GeneClinics™, a database that contains information on the application of clinical genetic testing. Currently the GeneClinics™ database contains authored and peer-reviewed entries of genetic testing information for 47 diseases as well as 4 overviews covering groups of related diseases that have entries in the database. The information presented in this database is for care providers, not patients. The database contains data (genes, loci, prevalence, mode of inheritance) to permit bi-directional linkages to primary genomic databases, and free text structured as a series of questions and answers clinicians may ask. Tools have been developed to acquire hybrid data and text from the authors, which is then converted to XML. After tagging by editorial staff the document is loaded into an ObjectStore database as a few hundred objects. Reviewers use the web to comment on the draft entry and, after peer review, the entry is released in HTML format.

Mauno Vihinen from the University of Tampere, Finland presented the MUTbase (http://www.uta.fi/laitokset/imt/bioinfo/mutdatbas.html#idmdb) software package for the maintenance and analysis of mutation databases on the web. MUTbase is a suite of programs that provides easy, interactive and quality-controlled submission of detailed patient and mutation information and ways to present the data on the web. The programs are implemented in the Perl programming language. The system is designed to make it easy to transfer data to other systems and has been used to maintain 10 different consortium-maintained immunodeficiency databases containing information on 16 different genes.

The current submission interface differs from the MDI form; however, this will be modified to conform once the form is finalized. After submission, quality checks are done on the data by the curators. Common data elements from database to database are present on the submission forms, i.e. the reference sequence, accession number and disease information. No data is added to the public databases without curatorial checking. The curator may ask for further details from the submitter. When enough new mutations have been collected a new release of the database is made. Dr Vihinen will provide programs, customization help, and disk space for those who need it. Server maintenance of the database is also offered. (ltmavi@uta.fi)

Robert Pomponio of Genzyme, U.S.A., spoke of A NEW MODEL APC DATABASE. This database reflects the HUGO-MDI recommendations, i.e. the Scriver et al guidelines for content, structure and deployment document and the Antonarakis et al nomenclature paper. The database was created for research use only but in the context of a diagnostic testing laboratory.

To create the database, a commercially available platform, "FileMaker Pro" (FMP4.1), was used because it was readily available and easy to use. Data from Excel as well as ODBC, Oracle, SQL Server or Access may be converted to FMP4.1, and data may be shared between Mac and Windows over the network. FileMaker Pro also has built-in web publishing software.

It is envisioned the bulk of information will flow from the clinician/researcher via the entry form to the central, offline APC server. This data is then validated and assigned a unique accession number by the curator. The information is then sent to a searchable web database that is not modifiable by the user and may be used by the clinicians/researchers and also flows to the other databases such as OMIM, Medline, GenBank and Entrez. Data may also be sent from the central server to an electronic journal or to a consortium.
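The envisioned flow (entry form, curator validation, unique accession number, read-only public database) could be sketched roughly as below. All names and the accession format are hypothetical; the report does not describe how accession numbers are formed, and the actual system was built in FileMaker Pro rather than Python.

```python
import itertools
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, Optional


@dataclass
class Entry:
    submitter: str
    description: str               # free-text description of the mutation
    accession: Optional[str] = None


class CentralServer:
    """Offline central server: holds curated entries and hands out accessions."""

    def __init__(self, prefix: str = "APC") -> None:
        self._prefix = prefix                  # hypothetical accession prefix
        self._counter = itertools.count(1)
        self._public: Dict[str, Entry] = {}

    def curate(self, entry: Entry, approved: bool) -> Optional[str]:
        """Assign a unique accession only to entries the curator has validated."""
        if not approved:
            return None
        entry.accession = f"{self._prefix}{next(self._counter):06d}"
        self._public[entry.accession] = entry
        return entry.accession

    def public_view(self):
        """Read-only mapping for the searchable web database; users cannot add
        or remove records through it."""
        return MappingProxyType(self._public)


server = CentralServer()
accession = server.curate(Entry("clinic", "(placeholder description)"), approved=True)
print(accession, len(server.public_view()))
```

Keeping accession assignment on the curator's side is what guarantees that nothing reaches the public database unvalidated.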

Dr Pomponio wishes to propose an APC consortium for the open reporting and discussion of mutations in this gene. Please contact him if you are interested. (Rob.Pomponio@genzyme.com)

Olivier Cohen of the Medical School of Grenoble, France, presented HCForum™ (http://HCForum.imag.fr), a website dedicated to structural abnormalities of human chromosomes. Inherited structural abnormalities are caused by chromosomal breakpoints followed by abnormal sticking and are present in around 2.4% of individuals in the general population. Ten years of effort were needed to create this database that carries information on about 4500 carrier families. This information is gained from genetic centres and the literature. Accurate location of breakpoints may be used as a tool for diagnosis. The information is used for genetic counseling and statistics are included to calculate the risk of disease. There are currently 300 users from 35 countries.

POSTER PRESENTERS briefly presented a summary of their work.

Brage Storstein Andersen of Aarhus University hospital, Denmark announced a new database under construction for MCAD and VLCAD. This database was constructed using the UMD software available from Christophe Beroud and contains standardized data as well as clinical information on patients allowing phenotype/genotype studies to be carried out. It will become publicly available shortly.

Alastair Brown of the MRC Human Genetics Unit, Edinburgh, U.K., updated the status of the MuStaR™ software. A curation program (stand-alone MS-Access) is now available together with a suite of cgi programs to automatically create a database website free to the academic community. The entire system is based around recommendations made by the HUGO-MDI working groups and requirements of users.

Mary-Pat Reeve of the Walter and Eliza Hall Institute, Melbourne, Australia discussed updates to Variation View, an applet to record and manage variation data applied to tuberous sclerosis genes. This applet provides a graphical interface for entering insertions, deletions and single base changes in genomic and coding DNA context. The spectrum of variations may be viewed from the exon or sequence level and the effect of change on the DNA and coding level may also be seen for each variation. Index and correct nomenclature are generated automatically. Data is stored in MiniSQL and accessed from the applet through JDBC.

Steve Sherry from NCBI discussed the current and future developments for dbSNP. dbSNP is moving in two directions. Firstly, together with an NHGRI advisory panel, dbSNP is developing a set of quality measures and standards to be used when reporting experimental results. Secondly, users will be able to submit information on multi-locus haplotypes and some associated general phenotype data.

Barbara Trülzsch of the University of Leipzig, Germany discussed a new TSH receptor database. Data is presented as HTML and cgi programs are used. Pedigrees are included for all germline mutations, links are provided where applicable and there is also a graphical representation with links to comprehensive descriptions of each mutation.

Rachel Kreisberg-Zakarin of the Computational Biology Unit, Tel-Aviv University, Israel presented a new database named GeneDis. This is a searchable database of human genetic diseases. Biochemical data and known mutations are incorporated in the primary sequence of the genes and proteins involved, using links. At the moment GeneDis includes Gaucher and Tay-Sachs diseases, which are prevalent in Ashkenazi Jews. Other diseases found in Jewish, Israeli and Mediterranean populations are in the process of being included in the database.

Nobuyoshi Shimizu of Keio University, Japan provided an update on the Keio Mutation Database (KMDB) and Mutation View software. KMDB now contains data from 39 different genes for 35 different diseases that are involved in eye, heart, ear, brain, cancer and autoimmunity. The Mutation View software is now available to all interested research groups on the condition that users actively participate in the establishment of a worldwide distributed database system for disease gene mutations. Dr Shimizu may be contacted at (shimizu@dmb.med.keio.ac.jp).

Ample time was allowed in the afternoon for a discussion of KEY ISSUES that have been identified by MDI members.

Software

There are two major problems: (a) some software is not freely available and (b) licensing and authorship are required for use. It was suggested that the MDI should recommend choices of software; however, some argued that this is premature, as we cannot be dictatorial and doing so may prevent creativity in individual databases. We can, however, state which software is available to those that wish to use it. At present the MDI community has several databases maintained in FileMaker Pro. This is a commercially available, easy-to-use database management system and may be good for starting a small database, which can be converted later when the database is larger. Four database systems designed specifically for mutations are now available to MDI members from their respective authors: Mutation View, MuStaR™, MUTbase and UMD. According to Chuanbo Xu, Genzyme is now developing commercial Oracle- and Java-based databases and analysis packages.

Copyright and Intellectual Property

It was agreed that the MDI must closely watch what happens in the U.S. Congress in relation to data protection laws.

Polymorphisms

A question was posed as to whether polymorphisms are relevant to the MDI. It was agreed that the term "polymorphism" is still a problem in certain contexts and that ALL variations in a gene should be recorded, not labeled as pathogenic or non-pathogenic, but accompanied by the data needed for a conclusion to be drawn. This is stated on page 345 of the Scriver et al guidelines document (Hum. Mut. 1999: 13:344-350). It cannot be said whether polymorphisms contribute to disease or not, thus the data should be included.

Heikki Lehväslaiho reminded us that while dbSNP has become a repository for SNPs, HGBASE in Europe has been collecting gene-associated polymorphisms. In addition to direct submissions, HGBASE has been curating data from published literature. HGBASE is now going fully into the public domain. After redesign it will be freely downloadable from the EBI website.

National and Ethnic databases

The question was posed whether the term "ethnic" in national and ethnic databases is ethical. It was agreed these databases are critical. Ethnic as a term was thought appropriate. It was strongly argued that the haplotype is like the phenotype and that "population bases" are needed. The term "origin", however, is difficult. Is this to mean geographic or ethnic, or should broad groups such as Asian, African, African-American, North European etc. be used? The nation may also be used. However, Charles Scriver provided the example of ethnic Turks living in Germany and ethnic Germans living in Germany, information that would have a medical significance. There is a place on the MDI allele variant entry form for this type of data. It was agreed that the population from which variation data comes should be defined.

Patient Aspects

It was agreed that public education is needed. The word "mutation" is misunderstood by the public and the word "abnormal" should never be used. Unless we are careful we will end up like genetically modified foods and be regarded with suspicion by the public! This is an information and education process and we must start soon.

Spectral databases

These concern the recording of somatic mutations. There needs to be a reason for it. The p53 databases are the best examples of spectral databases, as they are collections of mainly somatic mutations.

Sustainability

Two ideas were put forward to make the MDI self-sustaining:

Approach NIH, MRC etc to get funding as the HUGO-MDI for databases and not as individuals.
Propose to raise funds using Stephen Maurer's model.

Hence a resolution was proposed to raise activity to a new level, i.e. work as a group to generate funding out of our activities under the name of HUGO-MDI. Serious and heated discussion followed. Bruce Gottlieb and Michael Krawczak stated that no one talks about mutation databases in the bioinformatics arena and that we must make it known that mutation databases are crucial, as it is our bioinformatics peers who turn our grants down. Arleen Auerbach stated that the "idea of the MDI" of linking all databases must not be dropped. Olga Blumenfeld suggested we contact the NIH and tell them of the importance of these databases. Mary Fujiwara stressed that the USER must not be ignored and that one centralised database with a core is needed, while all other databases continue to exist and do what they want. Glenn Miller made a point about private companies: to undertake genetic testing they do not require a database, and for therapeutic development, where they have a proprietary holding on a gene, it is nice to have a database but it is not really necessary when the information may be freely obtained. Hence the problem is how we can offer a service in order to capitalize on it. Selling the information is wrong and probably won't happen.

Lisa Brooks from the NIH stated that funding in Europe is a problem; however, the situation in the U.S. is not so bad. The National Human Genome Research Institute has too small a budget to fund more databases, but there is more interest now and she is happy to talk with those interested. Often when asking for money, biologists have great ideas but are too naïve regarding how to sustain their work. A central black box collection point will probably be funded and not be shut down, as this would be seen as a good part of the business plan. NIH would possibly fund such a central collection point. She suggested we try to raise funds across institutions.

Allele Variant Entry Form

It was generally agreed that the form is now finalised. Questions were raised as to whether it should be published. Publishing may assign copyright to the publisher; the legality of this was unclear. Heikki Lehväslaiho said the form is an idea and so is not copyrightable or copyleftable. Mary Fujiwara stated that publishing is unnecessary but advertising of the form is necessary. Dissemination of information is the important thing here. Resolutions were passed that the form should be disseminated but not copyrighted. The form will be disseminated by letters to "Human Mutation", "Human Genetics", "Nature Genetics" and "Science", and the URL of the form will be given without actually publishing it. Robert Pomponio stated that the caption "this is what you should use for a database" should be added; Matthew Darlison stated that we should publish "the why" and not "the what".

The following RESOLUTIONS were made in conclusion of the meeting:

There is a need for ethical guidelines: an ethics committee was formed, comprising Bartha Knoppers, Catherine Boileau, Mary Fujiwara, Sylvia Spengler and Janet Warrington.
The population that the variation data comes from must be well defined.
Every example of each mutation is to be recorded.
Continue with "HUGO-Affiliated" labeling of databases to those databases following the MDI recommendations.
Raise activity to a new level, i.e. work as a group to generate funding out of our activities under the name of HUGO-MDI.
Accept the current draft of the allele variant entry form.
Disseminate the allele variant entry form.
Do not copyright the allele variant entry form.
Wiley should be asked to do what "Nucleic Acids Research" used to do as an incentive to publish mutations, i.e. publish lists of mutations each year in a special issue.
The next meeting should be held in Vancouver, Canada April 9, 2000 in association with HGM 2000. It is a key time in the planning of the central depository as the final plan will be approved.

Posted 19th November 1999
