Data Formats & QC Analysis for NGS. Rosana O. Babu, 8/19/2015.

1 Data Formats & QC Analysis for NGS. Rosana O. Babu, 8/19/2015

2 Sequence Formats
- Most sequence formats are ASCII text containing a sequence ID, quality scores, annotation details, comments, and other descriptions of the sequence (a few, such as .sff and BAM, are binary)
- Formats are designed to hold sequence data and other information about the sequence

3 Why so many formats?
- Supply the required information for each step of analysis
- Efficient data management: moving data across the file system takes time
- Data formats vary in the information they contain
Five types of sequence file formats:
- Raw sequence files
- Coordinate files
- Parameter files
- Annotation files
- Metadata files

4 Sequencers & Sequence Analysis Packages

5 Read output formats
- 454
- Solexa/Illumina
- SOLiD

6 454 output formats: .sff, .fna, .qual

7 Illumina output formats
- .seq.txt and .prb.txt
- Illumina FASTQ (ASCII code minus 64 gives the Illumina quality score)
- Qseq (ASCII code minus 64 gives the Phred score)
- SCARF (Illumina single-line format)
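The "ASCII minus 64" rule on this slide can be illustrated with a short sketch. The helper name `decode_quality` is hypothetical, and the sketch assumes Phred-scaled scores; the earliest Solexa pipelines used a different scale that allows negative values, which this sketch simply reports as an error. The offset of 33 (Sanger-style FASTQ) is shown for comparison.

```python
def decode_quality(quality_string, offset=64):
    """Convert an ASCII-encoded quality string into a list of integer scores.

    offset=64 follows the "ASCII minus 64" rule on the slide (older Illumina files);
    offset=33 is the Sanger-style FASTQ convention.
    """
    scores = [ord(ch) - offset for ch in quality_string]
    if scores and min(scores) < 0:
        # Assumption: scores are Phred-scaled, so a negative value means a wrong offset.
        raise ValueError("negative score: the offset is probably wrong for this file")
    return scores


# 'h' is ASCII 104, so it encodes a score of 40 under the 64 offset.
print(decode_quality("hhdB"))             # [40, 40, 36, 2]
# 'I' is ASCII 73, so it encodes a score of 40 under the 33 offset.
print(decode_quality("II5#", offset=33))  # [40, 40, 20, 2]
```

Decoding a Sanger-encoded string with the 64 offset usually produces negative numbers, which is a quick way to spot a mismatched encoding.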

8 SOLiD output format(s): CSFASTA
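CSFASTA stores reads in colour space: a known primer base followed by digits 0 to 3, where each digit encodes the transition between two adjacent bases. The sketch below shows how such a read could be translated back to bases. The function name `decode_colorspace` is hypothetical, and the sketch assumes a read with no missing colour calls.

```python
# For each previous base, the string gives the next base for colours 0, 1, 2, 3.
# Colour 0 keeps the base, 1 swaps A<->C and G<->T, 2 swaps A<->G and C<->T,
# 3 swaps A<->T and C<->G.
_NEXT_BASE = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}


def decode_colorspace(read):
    """Translate a colour-space read such as 'T0120' into bases.

    The first character is the primer base; it anchors the decoding
    but is not part of the returned sequence.
    """
    base = read[0].upper()
    if base not in _NEXT_BASE:
        raise ValueError("read must start with a primer base (A, C, G or T)")
    decoded = []
    for colour in read[1:]:
        if colour not in "0123":
            # Assumption: missing calls (often written '.') are not handled here.
            raise ValueError(f"unsupported colour call: {colour!r}")
        base = _NEXT_BASE[base][int(colour)]
        decoded.append(base)
    return "".join(decoded)


print(decode_colorspace("T0120"))  # TGAA
```

Because every base depends on the one before it, a single wrong colour call changes all downstream bases, which is why colour-space reads are normally aligned in colour space rather than decoded first.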

9 If reads should be deposited in a public repository:
- SRA (Sequence Read Archive, formerly Short Read Archive) at NCBI
- ENA (European Nucleotide Archive) at EMBL-EBI

10 Alignment/Assembly Format
Common (“standard”) formats for read alignments:
- SAM
- BAM (= binary SAM)
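A SAM alignment line is tab-separated text with eleven mandatory fields, which makes it easy to inspect. The sketch below splits one line into those fields. `SamRecord` and `parse_sam_line` are hypothetical names, and the example read is made up for illustration; optional tag fields after the eleventh column are ignored.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory fields of a SAM alignment line."""
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 tab-separated fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


# Made-up example record: flag bit 16 marks a read aligned to the reverse strand.
record = parse_sam_line("read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII")
print(record.rname, record.pos, bool(record.flag & 16))  # chr1 100 True
```

BAM holds the same records in compressed binary form, so it needs a dedicated reader rather than plain text splitting.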

11 Formats for Genome/Gene annotation
- BED format (genome-browser tracks)
- GFF format (gene/genome features)
- BioXSD (XML) (any annotation; under development)
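BED and GFF describe the same kind of interval but count positions differently: BED starts are 0-based with an exclusive end, while GFF positions are 1-based with an inclusive end. The sketch below converts one BED line into a GFF-style line. The function name and the `source` and `feature_type` defaults are placeholders chosen for illustration.

```python
def bed_to_gff_line(bed_line, source="example", feature_type="region"):
    """Convert one tab-separated BED line into a nine-column GFF-style line."""
    fields = bed_line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else "."
    score = fields[4] if len(fields) > 4 else "."
    strand = fields[5] if len(fields) > 5 else "."
    attributes = f"Name={name}" if name != "." else "."
    # BED start is 0-based, so add 1; the exclusive BED end equals the inclusive GFF end.
    return "\t".join([chrom, source, feature_type, str(start + 1), str(end),
                      score, strand, ".", attributes])


print(bed_to_gff_line("chr1\t99\t200\tgeneA\t0\t+"))
```

Forgetting the start offset is a common source of off-by-one errors when annotation files are moved between tools.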

12 Deposit genome/metagenome in a public repository:
- INSDC (International Nucleotide Sequence Database Collaboration) databases: GenBank, EMBL, DDBJ
Deposit genome/metagenome metadata:
- MIGS/MIMS standard by the GSC (Genomic Standards Consortium)

13 MIGS: Minimum Information about a Genome Sequence
MIMS: Minimum Information about a Metagenome Sequence/Sample

14 Points to remember on Data Formats
- Use the raw sequencing data format when possible
- For base-call data, use “standard” FASTQ (Sanger, Phred)
- For read alignments, use SAM/BAM format
- For annotation results, use GFF or BED format

15 QC analysis

16 Need for QC & Preprocessing
QC analysis of sequence data is extremely important for meaningful downstream analysis:
- To detect problems in the quality scores and statistics of the sequencing data
- To check whether further analysis of the sequences is possible
- To remove redundancy (filtering)
- To remove low-quality reads from the analysis
Highly efficient and fast processing tools are required to handle large volumes of data
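The two filtering goals listed here, removing redundant reads and removing low-quality reads, can be sketched in a few lines. This is a minimal in-memory illustration, not how production tools work: the function names are hypothetical, the mean-quality threshold of 20 is an assumed cut-off, and the quality strings are assumed to use the Sanger offset of 33.

```python
def mean_quality(quality_string, offset=33):
    """Average Phred score of one read (0.0 for an empty string)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores) if scores else 0.0


def filter_reads(records, min_mean_quality=20, offset=33):
    """Keep reads that pass the quality threshold, dropping exact duplicate sequences.

    records: iterable of (read_id, sequence, quality_string) tuples.
    """
    seen = set()
    kept = []
    for read_id, sequence, quality in records:
        if mean_quality(quality, offset) < min_mean_quality:
            continue  # low-quality read
        if sequence in seen:
            continue  # redundant read: same sequence already kept
        seen.add(sequence)
        kept.append((read_id, sequence, quality))
    return kept


reads = [
    ("r1", "ACGT", "IIII"),  # mean quality 40
    ("r2", "ACGT", "IIII"),  # duplicate of r1
    ("r3", "TTTT", "####"),  # mean quality 2
    ("r4", "GGCC", "5555"),  # mean quality 20
]
print([read_id for read_id, _, _ in filter_reads(reads)])  # ['r1', 'r4']
```

Holding every sequence in a set does not scale to full datasets, which is one reason the slide stresses efficient dedicated tools.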

17 FastQC and FASTX-Toolkit
- Use FastQC for preliminary analysis
- Use the FASTX-Toolkit to preprocess (filter and trim) the datasets, then visualize the results with FastQC

18 FastQC output
- Basic statistics
- Quality per base position
- Per sequence quality distribution
- Nucleotide content per position
- Per sequence GC distribution
- Per base GC distribution
- Per base N content
- Length distribution
- Overrepresented/duplicated sequences
- K-mer content

19 FastQC (Box-Whisker plot)
Y axis: Quality score
X axis: Base position
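The numbers behind a box-whisker plot of quality per base position can be sketched as follows. This is a generic box-plot summary (minimum, quartiles, median, maximum), not FastQC's own computation. The function name is hypothetical and the quality strings are assumed to use the Sanger offset of 33.

```python
from statistics import median, quantiles


def per_position_quality(quality_strings, offset=33):
    """Return one summary dict per base position (positions are 1-based)."""
    longest = max((len(q) for q in quality_strings), default=0)
    summary = []
    for position in range(longest):
        # Only reads long enough to cover this position contribute a score.
        scores = [ord(q[position]) - offset for q in quality_strings if len(q) > position]
        if len(scores) >= 2:
            q1, _, q3 = quantiles(scores, n=4, method="inclusive")
        else:
            q1 = q3 = scores[0]  # a single value has no spread
        summary.append({
            "position": position + 1,
            "min": min(scores),
            "q1": q1,
            "median": median(scores),
            "q3": q3,
            "max": max(scores),
        })
    return summary


for row in per_position_quality(["II5", "I55", "I#5"]):
    print(row)
```

Each dict corresponds to one box on the plot: the X axis is `position` and the Y axis is the quality score.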

20 Basic Statistics
Contains information about:
- File type
- Quality value encoding (ASCII offset)
- Total sequences and filtered sequences
- Sequence length
- Percentage GC content
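Several of these values are simple counts over the reads. A minimal sketch, assuming the sequences have already been read into a list of strings (the function name and dictionary keys are hypothetical):

```python
def basic_statistics(sequences):
    """Compute read count, length range and overall GC percentage."""
    lengths = [len(s) for s in sequences]
    total_bases = sum(lengths)
    gc_bases = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return {
        "total_sequences": len(sequences),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "percent_gc": 100.0 * gc_bases / total_bases if total_bases else 0.0,
    }


print(basic_statistics(["ACGT", "GGCC", "AATT"]))
```

For the three example reads, 6 of the 12 bases are G or C, so the GC content is 50%.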

21 2. Quality- Per base position

22 2. Quality- Per base position

23 3. Per Sequence Quality Distribution

24 3. Per Sequence Quality Distribution

25 4. Nucleotide content per position

26 4. Nucleotide content per position

27 5. Per sequence GC distribution

28 5. Per sequence GC distribution

29 6. Per base GC distribution

30 6. Per base GC distribution

31 7. Per base N content
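Per base N content is the percentage of reads with an uncalled base (N) at each position. A minimal sketch, assuming the reads are available as a list of strings (the function name is hypothetical):

```python
def per_base_n_content(sequences):
    """Return the percentage of N calls at each position (index 0 = first base)."""
    longest = max((len(s) for s in sequences), default=0)
    percentages = []
    for position in range(longest):
        # Only reads long enough to cover this position are counted.
        bases = [s[position] for s in sequences if len(s) > position]
        n_count = sum(1 for base in bases if base in "Nn")
        percentages.append(100.0 * n_count / len(bases))
    return percentages


print(per_base_n_content(["ACGN", "NCGT", "ACNT", "ACGT"]))  # [25.0, 0.0, 25.0, 25.0]
```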

32 8. Length Distribution

33 9. K-mer content

34 10. Overrepresented/duplicate sequences
A high proportion of duplicated sequences often points to a technical problem in library preparation or sequencing (for example, PCR over-amplification)
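Counting exact duplicates gives a rough picture of how overrepresented sequences are found. In the sketch below, the function name and the `min_fraction` reporting threshold are assumptions made for illustration; real tools typically sample the file rather than count every read.

```python
from collections import Counter


def overrepresented_sequences(sequences, min_fraction=0.001):
    """List (sequence, count, percent) for sequences seen more than once
    that make up at least min_fraction of all reads, most frequent first."""
    counts = Counter(sequences)
    total = len(sequences)
    return [
        (sequence, count, 100.0 * count / total)
        for sequence, count in counts.most_common()
        if count > 1 and count / total >= min_fraction
    ]


reads = ["ACGT"] * 3 + ["TTTT"] + ["GGGG"] * 2
for sequence, count, percent in overrepresented_sequences(reads):
    print(sequence, count, round(percent, 1))
```

In this toy input, "ACGT" accounts for half of the reads; "TTTT" occurs once and is therefore not reported.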

35 FASTX Toolkit
- fastx_quality_stats.txt
- fastq_quality_boxplot_graph.png
- fastx_nucleotide_distribution.png
- QC report.txt

36 QC Report
Sequence Statistics
  Total No. of Sequences: 6970943
  Avg. Sequence Length: 54
  Max Sequence Length: 54
  Min Sequence Length: 54
  Total Sequence Length: 376430922
  Total N bases: 14254521
  % N bases: 3.78676
  No. of Sequences with Ns: 278635
  % Sequences with Ns: 3.99709
Quality Statistics
  Total HQ bases: 334195496
  % HQ bases: 88.78
  Total HQ reads: 6350256
  % HQ reads: 91.0961
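The percentages in this report follow directly from the counts. The short check below recomputes them using only the numbers on the slide; the variable names and the `percent` helper are illustrative.

```python
def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


total_sequences = 6970943
sequence_length = 54                      # every read has the same length
total_bases = total_sequences * sequence_length

n_bases = 14254521
sequences_with_n = 278635
hq_bases = 334195496
hq_reads = 6350256

print("Total sequence length:", total_bases)                                  # 376430922
print("% N bases:", round(percent(n_bases, total_bases), 5))                  # 3.78676
print("% sequences with Ns:", round(percent(sequences_with_n, total_sequences), 5))  # 3.99709
print("% HQ bases:", round(percent(hq_bases, total_bases), 2))                # 88.78
print("% HQ reads:", round(percent(hq_reads, total_sequences), 4))            # 91.0961
```

The recomputed values agree with the report, so the percentages are taken over total bases (for base-level rows) and total reads (for read-level rows).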

37 quality_boxplot_graph & nucleotide_distribution

38 Thank you

