Data Formats & QC Analysis for NGS
Rosana O. Babu
8/19/2015
Sequence Formats
- Most sequence formats are ASCII text containing the sequence ID, quality scores, annotation details, comments, and other descriptions of the sequence (binary formats such as SFF and BAM are the exceptions)
- Formats are designed to hold sequence data and other information about the sequence
Why so many formats?
- Supply the required information for each step of analysis
- Efficient data management: moving data across file systems takes time
- Data formats vary in the information they contain
Five types of sequence file formats:
- Raw sequence files
- Coordinate files
- Parameter files
- Annotation files
- Metadata files
Sequencers & Sequence Analysis Packages
Read output formats
- 454
- Solexa/Illumina
- SOLiD
454 output formats
- .sff
- .fna
- .qual
Illumina output formats
- .seq.txt
- .prb.txt
- Illumina FASTQ (ASCII code minus 64 is the Illumina score)
- Qseq (ASCII code minus 64 is the Phred score)
- Illumina single line format
- SCARF
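The "ASCII code minus 64" rule can be illustrated with a minimal Python sketch. The function names and the example quality strings are made up for illustration, and the sketch treats every score as a plain integer, ignoring any difference in how older Solexa scores were scaled.

```python
ILLUMINA_OFFSET = 64  # offset quoted on the slide for Illumina FASTQ and Qseq
SANGER_OFFSET = 33    # offset used by "standard" Sanger FASTQ


def decode_quality(quality_string, offset):
    """Turn a quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def illumina_to_sanger(quality_string):
    """Re-encode an offset-64 quality string with offset 33."""
    return "".join(
        chr(ord(ch) - ILLUMINA_OFFSET + SANGER_OFFSET) for ch in quality_string
    )


# Hypothetical quality strings, not taken from a real run.
print(decode_quality("hhhB", ILLUMINA_OFFSET))  # [40, 40, 40, 2]
print(decode_quality("III#", SANGER_OFFSET))    # [40, 40, 40, 2]
print(illumina_to_sanger("hhhB"))               # III#
```

Decoding with the wrong offset shifts every score by 31, so it is worth confirming which encoding a file uses before any filtering.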
SOLiD output format(s)
- CSFASTA
If reads should be deposited in a public repository:
- SRA (Sequence Read Archive, formerly Short Read Archive) at NCBI
- ENA (European Nucleotide Archive) at EMBL-EBI
Alignment/Assembly Format
Common ("standard") format for read alignments:
- SAM
- BAM (= binary SAM)
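To show what a SAM alignment line carries, here is a minimal Python sketch that splits one line into its 11 mandatory tab-separated fields. The `SamRecord` and `parse_sam_line` names and the example read are hypothetical; real pipelines would normally use an established SAM/BAM library rather than hand parsing.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(
        qname, int(flag), rname, int(pos), int(mapq), cigar,
        rnext, int(pnext), int(tlen), seq, qual, fields[11:],
    )


# Made-up alignment line for illustration.
example = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(example)
is_reverse_strand = bool(record.flag & 0x10)  # flag bit 0x10: read on reverse strand
is_unmapped = bool(record.flag & 0x4)         # flag bit 0x4: read unmapped
print(record.rname, record.pos, is_reverse_strand, is_unmapped)
```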
Formats for Genome/Gene annotation
- BED format (genome-browser tracks)
- GFF format (gene/genome features)
- BioXSD (XML) (any annotation; under development)
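BED and GFF differ in their coordinate conventions: BED intervals are 0-based and half-open, while GFF intervals are 1-based and inclusive. The sketch below converts one BED line into a GFF-style line under that rule. The function name, the `source` and `feature_type` defaults, and the example feature are assumptions for illustration only.

```python
def bed_to_gff(bed_line, source="example", feature_type="region"):
    """Convert one tab-separated BED line into a nine-column GFF-style line."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else "."
    score = fields[4] if len(fields) > 4 else "."
    strand = fields[5] if len(fields) > 5 else "."
    attributes = f"Name={name}" if name != "." else "."
    # BED start is 0-based, so add 1; the BED end is exclusive and therefore
    # already equals the 1-based inclusive GFF end.
    gff_fields = [
        chrom, source, feature_type, str(start + 1), str(end),
        score, strand, ".", attributes,
    ]
    return "\t".join(gff_fields)


# Made-up feature for illustration.
print(bed_to_gff("chr1\t99\t200\tgeneA\t0\t+"))
```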
Deposit genome/metagenome in a public repository:
- INSDC (International Nucleotide Sequence Database Collaboration) databases: GenBank, EMBL, DDBJ
Deposit genome/metagenome metadata:
- MIGS/MIMS standard by the GSC (Genomic Standards Consortium)
MIGS: Minimum Information about a Genome Sequence
MIMS: Minimum Information about a Metagenome Sequence/Sample
Points to remember on Data Formats
- Use the raw sequencing data format when possible
- For base-call data, use "standard" FASTQ (Sanger, Phred)
- For read alignments, use SAM/BAM format
- For annotation results, use a standard annotation format (e.g. GFF or BED)
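As an illustration of the "standard" FASTQ layout, the following Python sketch reads records of exactly four lines each (header, sequence, separator, quality). It assumes sequences are not wrapped across lines; the function name and the two example reads are made up.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a four-line FASTQ stream."""
    while True:
        header = handle.readline()
        if header == "":
            return  # end of file
        if header.strip() == "":
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].strip(), sequence, quality


# Two hypothetical reads held in memory instead of a file on disk.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nII5#\n")
records = list(read_fastq(example))
print(records)
```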
QC analysis
Need for QC & Preprocessing
- QC analysis of sequence data is extremely important for meaningful downstream analysis
- To analyze problems in the quality scores/statistics of sequencing data
- To check whether further analysis with the sequences is possible
- To remove redundancy (filtering)
- To remove low-quality reads from analysis
- Highly efficient and fast processing tools are required to handle large volumes of data
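The two filtering goals above (removing redundancy and removing low-quality reads) can be sketched in a few lines of Python. This is only one possible interpretation: it drops exact duplicate sequences and reads whose mean score falls below a cutoff. The cutoff of 20, the offset of 33, and the example reads are assumptions, not values from the slides.

```python
def mean_quality(quality, offset=33):
    """Mean score of one quality string (assumes offset-33 encoding)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(records, min_mean_quality=20, offset=33):
    """Keep the first copy of each sequence whose mean quality passes the cutoff."""
    seen = set()
    kept = []
    for read_id, sequence, quality in records:
        if not quality or mean_quality(quality, offset) < min_mean_quality:
            continue  # empty or low-quality read
        if sequence in seen:
            continue  # exact duplicate of a read already kept
        seen.add(sequence)
        kept.append((read_id, sequence, quality))
    return kept


# Hypothetical reads: (id, sequence, quality string).
reads = [
    ("r1", "ACGT", "IIII"),  # mean 40, kept
    ("r2", "ACGT", "IIII"),  # duplicate of r1, dropped
    ("r3", "GGCA", "####"),  # mean 2, dropped
    ("r4", "TTGA", "5555"),  # mean 20, kept
]
print([read_id for read_id, _, _ in filter_reads(reads)])  # ['r1', 'r4']
```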
FastQC and FastX Toolkit
- Use FastQC for preliminary analysis
- Use the FASTX-Toolkit to clean up each dataset (e.g. trimming and quality filtering), then visualize the results with FastQC
FastQC output
- Basic statistics
- Quality per base position
- Per sequence quality distribution
- Nucleotide content per position
- Per sequence GC distribution
- Per base GC distribution
- Per base N content
- Length distribution
- Overrepresented/duplicated sequences
- K-mer content
FastQC (box-and-whisker plot)
- Y axis: quality score
- X axis: base position
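The numbers behind such a plot can be approximated with a short Python sketch that summarizes the scores at each base position. It is not FastQC's own implementation: the quartile method is whatever `statistics.quantiles` uses by default, and the offset of 33 and the example quality strings are assumptions.

```python
import statistics


def per_position_quality(qualities, offset=33):
    """Return one summary dict per base position (X axis of the plot)."""
    longest = max(len(q) for q in qualities)
    summary = []
    for position in range(longest):
        scores = [ord(q[position]) - offset for q in qualities if position < len(q)]
        if len(scores) >= 2:
            q1, _, q3 = statistics.quantiles(scores, n=4)
        else:
            q1 = q3 = scores[0]  # a single read covers this position
        summary.append({
            "position": position + 1,
            "min": min(scores),
            "q1": q1,
            "median": statistics.median(scores),
            "q3": q3,
            "max": max(scores),
            "mean": statistics.mean(scores),
        })
    return summary


# Hypothetical quality strings for three reads.
summary = per_position_quality(["II5", "I5#", "II#"])
for row in summary:
    print(row)
```

A drop in the median toward the later positions is the pattern this plot is usually inspected for.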
1. Basic Statistics
Contains information about:
- File type
- ASCII encoding of the quality values
- Total sequences, filtered sequences
- Sequence length
- Percentage GC content
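A few of these values are simple to compute directly from the reads. The Python sketch below derives the sequence count, length range, and percentage GC; the function name and example sequences are hypothetical, and it does not attempt to detect the file type or quality encoding.

```python
def basic_statistics(sequences):
    """Summarize read count, length range, and overall GC percentage."""
    lengths = [len(s) for s in sequences]
    total_bases = sum(lengths)
    gc_bases = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return {
        "total_sequences": len(sequences),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "percent_gc": 100.0 * gc_bases / total_bases,
    }


# Made-up reads for illustration.
stats = basic_statistics(["ACGT", "GGCC", "AT"])
print(stats)
```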
2. Quality per base position
3. Per sequence quality distribution
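This distribution counts how many reads have each mean quality score. A minimal Python sketch, assuming offset-33 quality strings and truncating each mean to an integer bin (a choice made here for simplicity, not necessarily what FastQC does):

```python
from collections import Counter


def mean_quality_histogram(qualities, offset=33):
    """Map each integer mean quality to the number of reads with that mean."""
    histogram = Counter()
    for quality in qualities:
        if not quality:
            continue  # skip empty reads
        mean = sum(ord(ch) - offset for ch in quality) / len(quality)
        histogram[int(mean)] += 1
    return dict(sorted(histogram.items()))


# Hypothetical quality strings for four reads.
histogram = mean_quality_histogram(["IIII", "II5#", "####", "IIII"])
print(histogram)  # {2: 1, 25: 1, 40: 2}
```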
4. Nucleotide content per position
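The per-position nucleotide content is the percentage of each base among all reads that cover a given position. The following Python sketch is one way to compute it; the function name and example reads are hypothetical. Because it also tracks N, the same table gives a per-base N content.

```python
from collections import Counter


def nucleotide_content(sequences):
    """Return, per position, the percentage of A, C, G, T and N."""
    longest = max(len(s) for s in sequences)
    table = []
    for position in range(longest):
        counts = Counter(s[position].upper() for s in sequences if position < len(s))
        total = sum(counts.values())
        table.append({base: 100.0 * counts.get(base, 0) / total for base in "ACGTN"})
    return table


# Made-up reads; the third is shorter and contains an N.
table = nucleotide_content(["ACGT", "AGGT", "ANG"])
for position, row in enumerate(table, start=1):
    print(position, row)
```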
5. Per sequence GC distribution
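The per-sequence GC distribution counts reads by their individual GC percentage. A minimal Python sketch under the assumption that each read is binned by its GC percentage rounded to the nearest integer (the example reads are made up):

```python
from collections import Counter


def gc_distribution(sequences):
    """Map each rounded GC percentage to the number of reads with that value."""
    histogram = Counter()
    for sequence in sequences:
        if not sequence:
            continue  # skip empty reads
        seq = sequence.upper()
        gc = seq.count("G") + seq.count("C")
        histogram[round(100 * gc / len(seq))] += 1
    return dict(sorted(histogram.items()))


distribution = gc_distribution(["ACGT", "GGCC", "ATAT", "ACGA"])
print(distribution)  # {0: 1, 50: 2, 100: 1}
```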
6. Per base GC distribution
7. Per base N content
8. Length distribution
9. K-mer content
10. Overrepresented/duplicate sequences
- Too many duplicated sequences usually indicate a technical problem, such as PCR over-amplification during library preparation or a sequencing artifact
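A rough duplication check can be sketched in Python by counting exact copies of each read. The function name, the example reads, and the `min_fraction` threshold (0.1% of all reads) are assumptions for illustration; dedicated tools may sample reads or truncate them before counting.

```python
from collections import Counter


def duplication_summary(sequences, min_fraction=0.001):
    """Report the share of duplicate reads and the most repeated sequences."""
    counts = Counter(sequences)
    total = len(sequences)
    duplicate_reads = total - len(counts)  # reads beyond the first copy of each
    overrepresented = [
        (seq, n) for seq, n in counts.most_common()
        if n > 1 and n / total > min_fraction
    ]
    return {
        "percent_duplicates": 100.0 * duplicate_reads / total,
        "overrepresented": overrepresented,
    }


# Made-up reads: one sequence appears three times.
result = duplication_summary(["ACGT", "ACGT", "ACGT", "GGCA", "TTGA"])
print(result)
```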
FASTX Toolkit
- fastx_quality_stats.txt
- fastq_quality_boxplot_graph.png
- fastx_nucleotide_distribution.png
- QC report.txt
QC Report
Sequence statistics:
- Total no. of sequences
- Avg. sequence length: 54
- Max sequence length: 54
- Min sequence length: 54
- Total sequence length
- Total N bases
- % N bases
- No. of sequences with Ns
- % sequences with Ns
Quality statistics:
- Total HQ bases
- % HQ bases: 88.78
- Total HQ reads
- % HQ reads
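The report fields can be reproduced in outline with the Python sketch below. The slides do not define "HQ", so the thresholds here are assumptions: a base counts as high quality at a score of 20 or more, and a read counts as high quality when at least 70% of its bases pass. The offset of 33 and the example reads are also assumptions.

```python
def qc_report(records, offset=33, hq_score=20, hq_read_fraction=0.7):
    """Compute N-base and high-quality (HQ) statistics for (sequence, quality) pairs."""
    total_bases = n_bases = reads_with_n = hq_bases = hq_reads = 0
    for sequence, quality in records:
        scores = [ord(ch) - offset for ch in quality]
        n_count = sequence.upper().count("N")
        good = sum(1 for score in scores if score >= hq_score)
        total_bases += len(sequence)
        n_bases += n_count
        reads_with_n += 1 if n_count else 0
        hq_bases += good
        if scores and good / len(scores) >= hq_read_fraction:
            hq_reads += 1
    total_reads = len(records)
    return {
        "total_sequences": total_reads,
        "total_sequence_length": total_bases,
        "total_n_bases": n_bases,
        "percent_n_bases": 100.0 * n_bases / total_bases,
        "sequences_with_n": reads_with_n,
        "percent_sequences_with_n": 100.0 * reads_with_n / total_reads,
        "hq_bases": hq_bases,
        "percent_hq_bases": 100.0 * hq_bases / total_bases,
        "hq_reads": hq_reads,
        "percent_hq_reads": 100.0 * hq_reads / total_reads,
    }


# Made-up reads: one contains an N, one is entirely low quality.
report = qc_report([("ACGT", "IIII"), ("ACNT", "II#I"), ("GGCA", "####")])
for field, value in report.items():
    print(field, value)
```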
quality_boxplot_graph & nucleotide_distribution
Thank you