Data formats Gabor T. Marth Boston College

For folks developing data standards for 1000G analysis
1000 Genomes Meeting, Philadelphia, November 10-11, 2008

Why have standard formats?
slide courtesy of Richard Durbin

Standard formats aggregate data from different platforms on a common footing:
- ABI/capillary
- 454 FLX
- 454 GS20
- Illumina

Standard formats provide algorithms with a well-defined input and output:
- plug alternate tools into the pipeline
- compare performance
- integrate results across different algorithms
- capture “checkpoints” in the analysis pipeline

Data types with standard formats

Read data formats – SRF and FASTQ
What is the data? Trace information, base calls, base qualities
Produced by base callers, used by read mappers/aligners
Standard formats: SRF, FASTQ

Read data formats – SRF and FASTQ
SRF (Sequence Read Format):
- designed to store machine-specific trace information, alternative base calls, extended base quality value schemes
- complex format
- used mostly for archival
FASTQ:
- only stores base calls + 1 Q-value per base
- simple format
- the same for all platforms
- the de facto format for downstream analysis
Is there information in SRF (but not in FASTQ) that is required by downstream analysis?
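To make "base calls + 1 Q-value per base" concrete, here is a minimal Python sketch of a FASTQ reader. It is an illustration under stated assumptions, not a reference parser: it assumes four-line records with no line wrapping, and it assumes quality characters are Phred values offset by 33 (other offsets, such as 64, have also been used, so `offset` is a parameter). The names `FastqRecord` and `parse_fastq` are made up for this example.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: its name, base calls, and one quality value per base."""
    name: str
    bases: str
    quals: List[int]


def parse_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream.

    Assumption: no line wrapping, and qualities encoded as chr(Q + offset).
    """
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.rstrip("\n")
        if not header:          # tolerate blank lines between records
            continue
        bases = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        qual_line = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(bases) != len(qual_line):
            raise ValueError(f"{header[1:]}: base and quality lengths differ")
        quals = [ord(ch) - offset for ch in qual_line]
        yield FastqRecord(header[1:], bases, quals)


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nIIII\n")
    for record in parse_fastq(example):
        print(record.name, record.bases, record.quals)
```

The record carries nothing beyond the name, the bases, and the qualities, which is why the same reader works for every platform, and also why trace-level information has to live elsewhere (for example in SRF).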

Alignment formats
What is the data?
- generated by read mappers / aligners / assemblers
- used by e.g. allele callers, SV callers

Alignment formats
A standard format (SAM, TAM, BAM) is being defined (Heng Li [Sanger], Bob Handsaker [Broad], etc.)… a standard is within reach
- Compatible with all technologies (AB?), allows aggregation of data from different individuals, different platforms
- “Lean and mean”: cannot be all-encompassing
- Remaining issues: gapped / padded alignments, read pairs, compression, indexing
- Extremely high priority for 1000G data analysis
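As an illustration of what a "lean" text alignment record can look like, the Python sketch below parses one tab-separated alignment line. It assumes the 11 mandatory columns of the SAM specification as later published (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, mate reference, mate position, insert size, SEQ, QUAL); the draft under discussion at the time of this talk may have differed. `SamRecord` and `parse_sam_line` are hypothetical names for this example, and optional tag columns are ignored.

```python
import re
from typing import List, NamedTuple, Tuple


class SamRecord(NamedTuple):
    """Subset of the mandatory SAM columns needed by most callers."""
    qname: str                     # read name
    flag: int                      # bitwise flags (paired, reverse strand, ...)
    rname: str                     # reference sequence name
    pos: int                       # 1-based leftmost mapping position
    mapq: int                      # mapping quality
    cigar: List[Tuple[int, str]]   # (length, operation) pairs
    seq: str                       # base calls
    qual: str                      # per-base qualities, still encoded


# One CIGAR element: a run length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line: str) -> SamRecord:
    """Parse a single alignment line; header lines ('@...') are not handled."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    cigar_text = fields[5]
    if cigar_text == "*":          # '*' means the CIGAR is unavailable
        cigar = []
    else:
        cigar = [(int(n), op) for n, op in CIGAR_OP.findall(cigar_text)]
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=cigar,
        seq=fields[9],
        qual=fields[10],
    )


if __name__ == "__main__":
    line = "r1\t99\tchr1\t100\t60\t3M1I2M\t=\t250\t156\tACGTAC\tIIIIII"
    record = parse_sam_line(line)
    print(record.rname, record.pos, record.cigar)
    print("paired:", bool(record.flag & 0x1))
```

The gapped-alignment and read-pair issues listed above show up directly here: gaps are carried by the CIGAR operations, and pairing is carried by the flag bits and the mate columns.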

SNP / short-INDEL allele calling
Data: SNP probability, individual genotype probabilities
Produced by SNP caller, used by downstream analysis
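The slide does not say how these probabilities would be encoded. As one possible convention, the Python sketch below Phred-scales the probability that a SNP call is wrong and normalizes per-genotype weights for one individual into probabilities that sum to 1. The function names, the cap of 99, and the choice of Phred scaling are all assumptions made for illustration, not part of any format described in the talk.

```python
import math
from typing import Dict


def phred_from_error(p_error: float, cap: int = 99) -> int:
    """Convert an error probability to a Phred-scaled integer quality.

    Q = -10 * log10(p_error), capped so that p_error == 0 stays finite.
    The cap value is an arbitrary choice for this sketch.
    """
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be a probability")
    if p_error == 0.0:
        return cap
    return min(cap, round(-10.0 * math.log10(p_error)))


def normalize_genotypes(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative genotype weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("genotype weights must have a positive sum")
    return {genotype: w / total for genotype, w in weights.items()}


if __name__ == "__main__":
    snp_probability = 0.999                      # P(site is polymorphic)
    print(phred_from_error(1.0 - snp_probability))
    print(normalize_genotypes({"AA": 2.0, "AG": 1.0, "GG": 1.0}))
```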

Other data types that need a standard format?
