Posted In: Managing deep sequencing data

As you may have heard, GenBank is closing its Short Read Archive.

At the recent Keystone meeting, Brian Foley (from LANL) and I organized an informal meeting to discuss the impact this will have on everyone who uses next-gen sequencing, and especially on people doing deep sequencing experiments. The summary of the meeting can be found here:

Any comments on this issue or suggestions on how to proceed would be much appreciated.

Posted by: drone, 14 Apr 2011 1:02 pm

Hi Yegor,

First of all, very nice job with the HIVe. I hope that this forum becomes a lively clearinghouse for discussions, hopefully involving a larger community than would attend any single meeting.

In that spirit, I'll start off with a provocative assertion that may catalyze some discussion. Some of this I shared during the meeting in Keystone, but perhaps not as loudly as I could have:

Given the enormous diversity of NGS data formats (454, Illumina, SOLiD, Ion Torrent, PacBio, plus whatever is coming next), analysis tools (too many to count), analysis parameters for each of these tools, pipelines that string these tools together in different orders, plus the use of customized scripts that are purpose-written to glue these together, it is hard to envision a scenario where polished, analyzed data can be warehoused in a way that will support robust reanalysis without an enormous investment in data polishing and preparation.

Instead, I propose that a system to share NGS data from HIV studies warehouse two items per instrument run: (1) FASTQ files (normalized to a single quality-score encoding system) and (2) a structured file that describes the sequences in the FASTQ file. I have very limited experience working with XML, but it may be worth developing an XML schema for describing HIV NGS data (or cribbing / modifying one from metagenomics, if a good one already exists). Such a schema might look something like this (fields chosen to illustrate concepts, not as an actual schema suggestion):

-date run
-run metrics
-passed reads
-failed reads
-failed read descriptions

-multiplex identifier tags
-multiplex barcode sequence
-sample identifier
-patient id
-sample date
-sample type
-target sequence
-amplicon / library identifier
-sample preparation details
-viral RNA extraction details
-PCR details

-primary analysis

Such a schema would, ideally, require investigators to submit a very small amount of REQUIRED information about each dataset, but would support the submission of a large amount of optional information. Required information might be as limited as the instrument type (so that error profiles in the data can be identified) and sample identifiers. Optional information on analysis pipelines, experiment setup, etc. would be added at the discretion of the investigator. Some people may want to add a lot of data, especially if this facilitates information sharing with their collaborators.
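
To make the required-versus-optional idea concrete, here is a minimal Python sketch of how a per-run record might be checked and written out as XML. The element names (run, instrument_type, sample_identifier, sample_type) and the sample values are hypothetical placeholders chosen only for illustration, not a proposed schema; the only assumption taken from the post is that instrument type and sample identifier are the required fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names; the thread does not define an actual schema.
REQUIRED_FIELDS = ("instrument_type", "sample_identifier")


def build_run_record(fields):
    """Return an XML element for one instrument run.

    Raises ValueError if any required field is missing or empty.
    """
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))

    run = ET.Element("run")
    # Required fields go first, in a fixed order.
    for name in REQUIRED_FIELDS:
        ET.SubElement(run, name).text = str(fields[name])
    # Optional fields are whatever else the submitter chose to provide.
    for name in sorted(fields):
        if name not in REQUIRED_FIELDS:
            ET.SubElement(run, name).text = str(fields[name])
    return run


record = build_run_record({
    "instrument_type": "454",
    "sample_identifier": "sample-001",  # placeholder value
    "sample_type": "plasma",            # optional, placeholder value
})
print(ET.tostring(record, encoding="unicode"))
```

A submission with only the two required fields would pass the same check, which keeps the minimum effort per dataset small while leaving room for richer records.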

As someone who generates a good amount of 454 data in my lab (and is now getting Illumina data from collaborators), I find that my willingness to contribute data to a community-wide system is inversely proportional to the amount of time it takes to make the information available. I believe that any sharing system that requires more than 15-30 minutes of prep time per submission will have a hard time gaining traction.
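
One prep step that need not cost the submitter any time is the quality-score normalization mentioned in item (1), since it can be scripted. The following is a minimal sketch, assuming the input uses Phred+64 quality encoding (as some older Illumina pipelines did) and that the shared archive standardizes on Phred+33. The function names are hypothetical, and the code assumes strict four-line FASTQ records with no wrapped sequence lines.

```python
def phred64_to_phred33(quality_line):
    """Re-encode one FASTQ quality string from Phred+64 to Phred+33."""
    converted = []
    for ch in quality_line:
        score = ord(ch) - 64
        if score < 0:
            # A character below '@' cannot be a valid Phred+64 score.
            raise ValueError("character %r is below the Phred+64 range" % ch)
        converted.append(chr(score + 33))
    return "".join(converted)


def normalize_fastq(lines):
    """Yield FASTQ lines with each quality line re-encoded.

    Assumes four lines per record: header, sequence, '+', quality.
    """
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        if index % 4 == 3:
            yield phred64_to_phred33(line)
        else:
            yield line


# Made-up record for illustration only.
example = ["@read1", "ACGT", "+", "hh@B"]
print(list(normalize_fastq(example)))
```

Detecting which encoding an incoming file uses is a separate problem that this sketch does not attempt; here the encoding is assumed to be known from the instrument type.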

Hopefully these points will stimulate at least some discussion. 

dave o'connor, uw-madison

Posted by: Yegor Voronin, 14 Apr 2011 5:03 pm


I agree with you, and the general consensus at the meeting was also that unprocessed data should be shared, along with information on how the primary analysis was done and what conclusions were reached. We are about to start developing the ontology you suggest, together with the standardized metadata terms and formats to use in it, so that people don't label the same things in different ways.
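
As a rough illustration of what standardized terms could buy in practice, here is a minimal Python sketch that maps free-text labels onto a single agreed term. The vocabulary below is entirely hypothetical and stands in for whatever terms the ontology ends up defining.

```python
# Hypothetical controlled vocabulary: each standard term lists the
# free-text labels that should be treated as equivalent to it.
SAMPLE_TYPE_SYNONYMS = {
    "plasma": {"plasma", "blood plasma"},
    "pbmc": {"pbmc", "pbmcs", "peripheral blood mononuclear cells"},
}


def standardize_sample_type(label):
    """Return the standard term for a submitted sample-type label."""
    # Ignore case and collapse runs of whitespace before matching.
    cleaned = " ".join(label.lower().split())
    for term, synonyms in SAMPLE_TYPE_SYNONYMS.items():
        if cleaned in synonyms:
            return term
    raise ValueError("unrecognized sample type: %r" % label)


print(standardize_sample_type("  Blood  Plasma "))
print(standardize_sample_type("PBMCs"))
```

Rejecting unrecognized labels, rather than storing them as-is, is one possible design choice; it forces new terms to be added to the vocabulary deliberately.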