Posted: 10 Apr 2011
Data Sharing: A Priority of the Enterprise 2010 Scientific Strategic Plan

Keystone Symposia, Whistler, Canada
2:30-4:30, Fitzsimmons Room
Wednesday, 23 March, 2011

Co-chairs:
Doug Richman, University of California, San Diego
Lynn Morris, National Institute for Communicable Diseases

Speakers:
Alan Bernstein, Global HIV Vaccine Enterprise
Rick Bushman, University of Pennsylvania School of Medicine
Tom Denny, Duke University
Brandon Keele, SAIC-Frederick, Inc.
Rick Koup, National Institutes of Health
Amalio Telenti, Centre Hospitalier Universitaire Vaudois

The 2010 Scientific Strategic Plan of the Enterprise identified data management and data sharing as key issues for the field. Therefore, the Enterprise Secretariat organized a workshop at the recent Keystone Symposia on “Protection from HIV: Targeted Prevention Strategies”. The workshop was designed to provide general guidance for the Data Sharing Initiative of the Enterprise, and specifically aimed to identify field-wide issues that need to be urgently addressed and discuss possible ways in which the Enterprise can move the initiative forward. With that in mind, the workshop was structured to facilitate discussion among the audience through informal presentations by the speakers. The discussion was moderated by Doug Richman and Lynn Morris, co-chairs of the workshop.

The meeting was opened by Alan Bernstein, executive director of the Enterprise, who highlighted the importance of a strategic approach to data management and reminded the participants that the 2010 Plan called on the research community to “seek consensus on the principle of rapid access to data and develop the infrastructure to annotate, deposit and analyze large amounts of data.”

Doug Richman started the discussion by pointing out that several initiatives in the HIV field, including the LANL HIV database and a number of collaborations, can serve as examples of success in data sharing. Nevertheless, in his opinion there remain many opportunities for improvement, as large amounts of data generated in the field are not currently being shared and analyzed to the fullest extent. Making data openly available for further analysis can benefit the field on multiple levels:

  • Access to data stimulates the development of novel approaches to data analysis;
  • Combining multiple datasets increases the statistical power of analyses;
  • Integrating multiple types of data makes it possible to ask new scientific questions.
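
The second point above can be illustrated with a back-of-the-envelope calculation: pooling comparable datasets shrinks the standard error of an estimate roughly as 1/√n. A minimal sketch (the cohort sizes and standard deviation are illustrative, not taken from any study):

```python
import math

def standard_error(sd: float, n: int) -> float:
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

# One cohort of 100 participants vs. three pooled cohorts of 100 each.
single = standard_error(sd=1.0, n=100)   # 0.100
pooled = standard_error(sd=1.0, n=300)   # ~0.058: a sqrt(3)-fold gain
```

This shrinkage is why a meta-analysis across groups can detect effects that no single dataset can.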

In the discussion following this presentation, participants highlighted the fact that while scientific journals and funding agencies require that data be made available before or after publication, the mechanisms for data sharing are frequently not provided, placing the burden of developing the necessary tools and infrastructure on the researcher. As a result, the current system is not conducive to data sharing: it is very time-consuming and leads to unnecessary duplication of effort.

Rick Bushman shared his experience with the Human Microbiome Project (HMP), in which his lab had to deposit data into the Sequence Read Archive (SRA) at NCBI. SRA was built around a very rigid data model which did not align with user needs. Furthermore, deposition of data was extremely complicated and time-consuming, and data retrieval did not work properly. As a result, NCBI recently closed SRA to submission of any new datasets. Lessons learned from the closure of SRA are:

  • Keep it simple. The database should be designed to make annotation and deposition easy for users.
  • Share raw (not processed) data, targeting an audience knowledgeable enough to make use of raw data (the MG-RAST database is a good example).
  • Invest in careful design of metadata tables and publish the standards.
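
The last two lessons can be made concrete with a small sketch: a published metadata standard is, at minimum, an explicit list of required fields plus a validator that submitters can run before deposition. The field names below are hypothetical illustrations, not the actual SRA or MG-RAST standards:

```python
# Hypothetical minimal metadata standard for a raw-sequence submission.
# Field names are illustrative assumptions, not an official schema.
REQUIRED_FIELDS = {
    "sample_id": str,         # submitter-assigned stable identifier
    "platform": str,          # sequencing instrument used
    "collection_date": str,   # ISO 8601 date, e.g. "2011-03-23"
    "raw_reads_uri": str,     # location of the unprocessed reads
}

def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems
```

Publishing the schema and the validator together keeps the standard simple for users while still enforcing consistent annotation.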

The discussion then moved from database design to the ethical and legal aspects of data sharing. Amalio Telenti pointed out that data sharing crucially depends on the legal framework around patient-related data. Consent forms that specify only a narrow use for patient-related data limit researchers’ ability to do exploratory analyses. Lack of legal agreements often prevents or complicates transfer of data and samples between institutions, especially across international borders. Patient privacy also needs to be carefully considered, taking into account the changing landscape of technologies and assays. For example, 75 SNPs on average are sufficient to identify a person, and with high-throughput assays this information can be easily obtained. Viral sequence data, which many journals require to be deposited in open-access databases before publication, can also be used to identify a specific individual. The field of GWAS can serve as a positive example in this respect. GWAS data are not openly accessible, and access is handled on an individual basis, ensuring the validity of requests for information. When a meta-analysis of multiple studies is performed, representatives from each group are present on the analysis team.
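
The 75-SNP figure can be checked with a rough entropy calculation: singling out one person among roughly seven billion requires about 33 bits of information, and a common biallelic SNP carries on the order of 1.3–1.5 bits. A sketch, assuming independent SNPs in Hardy–Weinberg equilibrium (real SNPs are correlated, which is why more than the theoretical minimum are needed):

```python
import math

def snp_entropy_bits(maf: float) -> float:
    """Shannon entropy (bits) of one biallelic SNP genotype under
    Hardy-Weinberg equilibrium with minor allele frequency maf."""
    p, q = 1.0 - maf, maf
    genotype_freqs = [p * p, 2 * p * q, q * q]   # AA, Aa, aa
    return -sum(g * math.log2(g) for g in genotype_freqs if g > 0)

target_bits = math.log2(7e9)                 # ~32.7 bits to single out one person
per_snp = snp_entropy_bits(0.3)              # ~1.34 bits for a common SNP
min_snps = math.ceil(target_bits / per_snp)  # idealized minimum, a few dozen
```

The idealized minimum is a few dozen SNPs; the 75-SNP average cited in the discussion leaves room for linkage between markers and for genotyping error.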

Brandon Keele highlighted the importance of close collaboration between the people who generate data and data-management experts. As a “data producer”, he appreciated the help from the LANL HIV DB in annotating, analyzing and making available the data set used to identify sequences of transmitted HIV variants. The collaboration was mutually beneficial, because it served as a test case for LANL, and the LANL HIV DB now helps other researchers with similar data sets. Still, the field is changing quickly, and next-generation sequencing will bring its own set of challenges. Making raw data available will continue to be important, but it will be equally important to provide information on how these data have been analyzed previously.

In the discussion that followed, Lynn Morris asked Steve Self (SCHARP, Seattle) to comment on SCHARP’s future plans for facilitating data sharing. He agreed with previous speakers that it is critically important to develop flexible data models that allow researchers to adapt the database to their specific needs, and SCHARP is currently working on this. In addition, SCHARP aims to provide some basic tools for data analysis on its website.

The discussion then turned to flow cytometry data, which are expected to become vastly more complex. For example, novel technologies that make use of antibodies labeled with heavy metals will allow detection of hundreds of cell surface markers in a single experiment. Thus, these datasets will be larger and more complex than sequencing data, due to the greater number of experimental variables and the more involved data processing and interpretation.

Rick Koup, speaking about the situation at the CAVD, noted that currently there are no specific rules or mechanisms for data sharing. Rather, the results are shared and the supporting data are reported. With respect to flow cytometry data, he agreed with other participants about the significant data handling and analysis challenges that will arise as a result of next-generation technology in this area. Raw data will likely become uninterpretable; it will have to be carefully curated, annotated and, at least partially, processed. Harmonized annotation will become critically important, because otherwise data analysis will be impossible. Data currently generated at the VRC are very complex, and each lab requires a full-time person knowledgeable in databases to assist with data management. Still, even within the VRC, researchers sometimes annotate data differently due to errors or lack of communication. Therefore, harmonization across the field is becoming a timely and critical issue.
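
The harmonization problem described here can be sketched as a controlled vocabulary plus a strict mapping step applied before datasets are pooled. The synonym table below is hypothetical and purely illustrative:

```python
# Hypothetical synonym table mapping lab-specific flow cytometry marker
# labels onto one canonical vocabulary; the entries are illustrative only.
CANONICAL_MARKERS = {
    "cd4": "CD4",
    "cd4+": "CD4",
    "t4": "CD4",
    "cd8": "CD8",
    "cd8+": "CD8",
}

def harmonize(label: str) -> str:
    """Map a lab-specific marker label to its canonical name.

    Unknown labels raise an error: they must be resolved by a curator
    rather than silently carried into a pooled analysis."""
    key = label.strip().lower()
    if key not in CANONICAL_MARKERS:
        raise ValueError(f"unmapped marker label: {label!r}")
    return CANONICAL_MARKERS[key]
```

Failing loudly on unmapped labels is the point: annotation discrepancies surface at harmonization time instead of corrupting a cross-lab analysis.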

Tom Denny pointed out that during the formation of CHAVI, data sharing was expected to be a priority from the beginning. However, the logistical complexity of sharing data and of linking diverse datasets together prevented the quick establishment of data sharing procedures and creation of the necessary infrastructure. CHAVI is only now working on a pilot project to link several datasets into a single database. This initiative is proving challenging because annotation was not harmonized from the beginning. Going forward, it is hoped that the procedures developed and lessons learned in the course of this pilot project will make data sharing much more efficient.

This concluded the presentation-driven part of the workshop and the floor was open for general discussion. When asked for advice on how the Data Sharing Initiative of the Enterprise should target the identified issues, participants suggested several strategic approaches:

  • Avoid committing to very long-term strategies. The field is very fluid, and efforts to devise a single all-encompassing system for data sharing will likely be futile as new tools, technologies and standards appear and make past planning obsolete.
  • Use a modular approach: identify a particular issue, devise a solution, and find funding to implement it.
  • Provide a catalog of existing available resources: databases, reference datasets, tools, and guidelines.
  • Consider diverse strategies for supporting data management experts and placing them in the labs that produce the data.
  • Learn from past successes. For example, in the area of microarrays the factors contributing to success were: adoption of common standards, maturation of technology, investment in databases, journal policy of data sharing upon publication, involvement of industry (commercial data analysis software includes data annotation and deposition tools).
  • Learn from successes in other fields. For example, cancer research is very far advanced with respect to data sharing and has addressed many of the issues facing HIV research. Our field should not reinvent the wheel, but rather should take advantage of the lessons learned, as well as the procedures, software and strategies that have already been worked through in the cancer field.