EMC Galaxy Course November 24-25, 2014


1 Introduction to Data Processing and Variant Detection for NGS DNA Sequencing
EMC Galaxy Course, November 24-25, 2014
Youri Hoogstrate, David van Zessen, Saskia Hiltemann, Guido Jenster, Andrew Stubbs

2 How does next-gen sequencing work?

3 Instruments generate short reads that need to be mapped to the reference

4 High-level overview of NGS data processing

5 Aligned reads In Galaxy, you can view your data in the built-in genome browser, Trackster

6 Challenge: distinguishing variants from noise
Possible reasons for a mismatch:
- True SNP
- Error generated in library prep
- Base calling error
- Misalignment (mapping error)
- Error in reference genome

7 Genotyping
- What is the set of alleles at this locus? What are their frequencies?
- Genotypers begin with a model of prior knowledge about the likelihood (and types) of errors, and the likelihood of observing real variants.
- Error models depend on the sequencing technology.
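To make the idea of "prior plus error model" concrete, here is a minimal sketch of a Bayesian genotype calculation for a single diploid, biallelic site. It is not the model used by any of the callers named in this course. The function name `genotype_posteriors`, the default priors, and the assumption that an erroneous base is equally likely to be any of the other three bases are all illustrative choices.

```python
def genotype_posteriors(observations, ref, alt, priors=None):
    """Posterior probability of 0, 1 or 2 copies of `alt` at one site.

    observations: list of (base, error_prob) pairs, one per read covering the site.
    priors: dict {0: P(hom ref), 1: P(het), 2: P(hom alt)}. The defaults below
            are hypothetical values chosen only for illustration.
    """
    if priors is None:
        priors = {0: 0.998, 1: 0.001, 2: 0.001}

    unnormalized = {}
    for n_alt in (0, 1, 2):
        alt_fraction = n_alt / 2  # chance that a read was sampled from the alt allele
        likelihood = 1.0
        for base, err in observations:
            # Assumed error model: a miscalled base is any of the 3 other bases
            # with equal probability.
            p_given_ref = (1 - err) if base == ref else err / 3
            p_given_alt = (1 - err) if base == alt else err / 3
            likelihood *= (1 - alt_fraction) * p_given_ref + alt_fraction * p_given_alt
        unnormalized[n_alt] = priors[n_alt] * likelihood

    total = sum(unnormalized.values())
    return {n_alt: value / total for n_alt, value in unnormalized.items()}


if __name__ == "__main__":
    # 19 reads support the reference base, 1 read shows a mismatch (all at Q20).
    noisy = [("A", 0.01)] * 19 + [("G", 0.01)]
    # 10 reads support each allele.
    balanced = [("A", 0.01)] * 10 + [("G", 0.01)] * 10
    print(genotype_posteriors(noisy, ref="A", alt="G"))
    print(genotype_posteriors(balanced, ref="A", alt="G"))
```

Under these assumptions a single mismatching read among many is explained as noise, while a balanced mix of both bases favors the heterozygous genotype. Real callers work in log space and use richer error models that also account for mapping quality.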

8 What we know about NGS technology
- Relatively high per-base error rate
- Reads are higher quality in the middle than at the ends
- Some technologies perform poorly on homopolymers and GC-rich regions
- Indels confuse alignment
- Sequence coverage is not uniform
- Alignments are probabilistic
Quality Control
- Local realignment
- Remove duplicate reads
- Filter low-quality reads
- Recalibrate base qualities
- Read trimming
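Two of the quality-control steps listed above, read trimming and filtering of low-quality reads, can be sketched in a few lines. This is a simplified illustration, not the algorithm of any particular trimming tool. The function names, the Q20 threshold, and the minimum length are assumptions, and the quality string is assumed to use the Phred+33 encoding.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end until the last base has quality >= min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def mean_quality(qual, offset=33):
    """Average Phred score of a quality string (0.0 for an empty string)."""
    if not qual:
        return 0.0
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def passes_filter(seq, qual, min_len=4, min_mean_q=20):
    """Keep a read only if it is long enough and good enough on average."""
    return len(seq) >= min_len and mean_quality(qual) >= min_mean_q


if __name__ == "__main__":
    # Made-up read: four high-quality bases ('I' = Q40) followed by two
    # low-quality bases ('#' = Q2).
    seq, qual = trim_3prime("ACGTAC", "IIII##")
    print(seq, qual, passes_filter(seq, qual))
```

Trimming before filtering means a read with a good core and a poor tail is shortened rather than discarded outright.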

9 Quality Score
Fastq: raw reads with per-base quality scores
Quality character = ASCII character with code (Phred score + 33), so that all characters are printable
Q = -10 log10 P (P = base-calling error probability)
Q=10: error rate 10%
Q=20: error rate 1%
Q=30: error rate 0.1%
etc.
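The relationship between quality characters, Phred scores, and error probabilities can be checked with a short sketch. The function names are illustrative, and the offset of 33 is the one stated on the slide; older encodings with a different offset would need another value.

```python
import math

PHRED_OFFSET = 33  # offset stated on the slide (Phred+33 encoding)


def decode_quality_string(qual):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in qual]


def phred_to_error_prob(q):
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # '!' has ASCII code 33 (Q0) and 'I' has ASCII code 73 (Q40).
    for ch, q in zip("!+5?I", decode_quality_string("!+5?I")):
        print(ch, q, phred_to_error_prob(q))
```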

10 Quality Control Tool: FastQC

11 Sequencing Depth
Towards accurate detection and genotyping of expressed variants from whole transcriptome sequencing data. BMC Genomics 2012, 13(Suppl 2):S6

12 Tools
Popular tools:
- SAMtools mpileup (practical)
- GATK Unified Genotyper (practical extra part)
- FreeBayes (practical extra part)
- MAQ
- VarScan 2
All available in the Galaxy Tool Shed.
There is always a trade-off between sensitivity and specificity, that is, between false negatives and false positives.
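The trade-off between sensitivity and specificity can be quantified when a trusted set of variants is available for comparison. The sketch below assumes that variants are represented as hashable keys (plain positions here) and that the number of evaluated sites is known. The function names and the toy numbers are made up for illustration.

```python
def confusion_counts(called, truth, n_sites):
    """Count true/false positives and negatives over n_sites evaluated positions."""
    tp = len(called & truth)   # called and real
    fp = len(called - truth)   # called but not real
    fn = len(truth - called)   # real but missed
    tn = n_sites - tp - fp - fn
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of real variants that were called."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of non-variant sites that were correctly left uncalled."""
    return tn / (tn + fp) if (tn + fp) else 0.0


if __name__ == "__main__":
    truth = {1, 2, 3, 4}       # hypothetical true variant positions
    called = {2, 3, 4, 5, 6}   # hypothetical caller output
    tp, fp, fn, tn = confusion_counts(called, truth, n_sites=100)
    print(tp, fp, fn, tn, sensitivity(tp, fn), specificity(tn, fp))
```

Loosening a caller's thresholds typically moves calls from the false-negative column to the true-positive column at the cost of more false positives, and tightening them does the reverse.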

13 Practical
- Raw data (fastq files)
- QC with FastQC
- Map with BWA
- Visualize with Trackster
- Call variants with Mpileup
- Annotate variants with ANNOVAR
- Time permitting: call variants with FreeBayes and GATK Unified Genotyper and compare the three callers
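For the last step of the practical, comparing the three callers, one possible approach outside Galaxy is to reduce each VCF file to a set of (chromosome, position, ref, alt) keys and compare the sets. The sketch below relies only on the standard VCF layout (tab-separated columns beginning CHROM, POS, ID, REF, ALT, with header lines starting with `#`). The function names and the example records are hypothetical.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from the lines of a VCF file."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # multi-allelic sites list several ALT alleles
            keys.add((chrom, int(pos), ref, allele))
    return keys


def compare_callers(calls):
    """calls: dict mapping caller name -> set of variant keys.

    Returns (variants shared by all callers, dict of variants unique to each caller).
    """
    sets = list(calls.values())
    shared = set.intersection(*sets) if sets else set()
    unique = {}
    for name, keys in calls.items():
        others = set().union(*(k for n, k in calls.items() if n != name))
        unique[name] = keys - others
    return shared, unique


if __name__ == "__main__":
    # Made-up records for illustration only.
    header = ["##fileformat=VCFv4.1", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"]
    calls = {
        "mpileup": variant_keys(header + ["chr1\t100\t.\tA\tG\t50\tPASS\t.",
                                          "chr1\t200\t.\tC\tT\t40\tPASS\t."]),
        "freebayes": variant_keys(header + ["chr1\t100\t.\tA\tG\t60\tPASS\t.",
                                            "chr1\t300\t.\tG\tA,C\t30\tPASS\t."]),
        "gatk": variant_keys(header + ["chr1\t100\t.\tA\tG\t55\tPASS\t."]),
    }
    shared, unique = compare_callers(calls)
    print(shared)
    print(unique)
```

Exact key matching is a simplification: callers can represent the same indel differently, so agreement on indels may be underestimated unless the variants are normalized first.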

14 Practical Session
Learn by doing it yourself!
Servers:
- galaxy-training1.trait-ctmm.cloudlet.sara.nl
- galaxy-training2.trait-ctmm.cloudlet.sara.nl
- galaxy-training3.trait-ctmm.cloudlet.sara.nl
Log in to your account.
All handouts and slides can be found under Shared Data → Data Libraries.
Manual: [Course Manual] EMC Galaxy Training 2: Introduction to Galaxy.pdf

