Resequencing Genome Timothee Cezard EBI NGS workshop 16/10/2012.

1 Resequencing Genome Timothee Cezard EBI NGS workshop 16/10/2012

2 NGS Course – Data Flow
Sessions:
- Overview
- DNA Sequencing
- RNA Sequencing
- Sequence archives: ENA/SRA submission and retrieval
- Data compression
- Resequencing & assembly
- Genome variation & disease
- Gene regulation: ChIP-seq analysis
- Gene annotation: RNA-Seq, Ensembl gene build
- Gene expression: RNA-Seq, Transcriptome analysis
Speakers: Elizabeth Murchison; Jon Teague / Adam Butler / Simon Forbes; Guy Cochrane; Ensembl / John Collins; Myrto Kostadima / Remco Loos; Remco Loos / Myrto Kostadima; Rajesh Radhakrishnan; Rasko Leinonen; Arnaud Oisel; Marc Rossello; Vadim Zalunin; Timothee Cezard; Laura Clarke; Karim Gharbi

3 NGS Course – Data Flow (same diagram as slide 2)
Slides and tutorials are available at: https://www.wiki.ed.ac.uk/display/GenePoolExternal/NGS+workshop at+EBI

5 Overview
DNA (Re)sequencing
- Sequencing technologies
- Sequencing output
- Quality control
Mapping
- Mapping programs
- Sam/Bam format
- Mapping improvements
Variant calling
- Types of variants
- SNPs/indels
- VCF format

7 Resequencing genomes
[Diagram: DNA extraction, then library prep]

8 Sequencing data
GATGGGAAGA GCGGTTCAGC AGGAATGCCG AGACCGATAT CGTATGCCGT
Sequence data:
- Precise
- Fairly unbiased
- Easy to QC
Coverage depth data:
- Can be biased
- Hard to know what's true

9 Sequencer-specific errors
- Homopolymer runs create false indels
- Specific sequence patterns can create phasing issues

11 Sequencing output (Fastq format)
Example fastq:
GATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAG
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDADACBCCCDADBDDCBCD;BBDBDBBBB%%%%%
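A Fastq record is four lines: a name line starting with "@", the sequence, a "+" separator and one quality character per base. The sketch below reads such records and decodes the quality string. It assumes the Phred+33 encoding; the record used in the usage note is hypothetical, and the function names are illustrative rather than part of any library.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) tuples from Fastq lines.

    Simplification: all non-blank lines are loaded into memory and
    grouped four at a time.
    """
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("Fastq input must contain 4 lines per record")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed Fastq record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual


def phred_scores(qual: str, offset: int = 33) -> List[int]:
    """Convert a quality string to Phred scores (offset 33 is assumed)."""
    return [ord(char) - offset for char in qual]


def error_probability(score: int) -> float:
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-score / 10)
```

For a hypothetical record with quality string "CCD;%", phred_scores returns 34, 34, 35, 26 and 4, so the trailing "%" characters in the example above mark very low-quality bases.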

14 Quality control
Questions you should ask (yourself or your sequencing provider):
Sequencing QC:
- How much sequencing?
- What's the sequencing quality?
Library QC:
- What's the base profile across the reads?
- Is there an unexpected GC bias?
- Are there any library preparation contaminants?
Post-mapping QC:
- What is the fragment length distribution? (for paired end)
- Is there an unexpected duplicate rate?

15 Example with FastQC

18 Mapping reads to a reference genome
Problems:
- How to find the best match of a short sequence on a large genome (high sensitivity)
- How to not find a match when …
- How to do this for 100,000,000,000 reads in a reasonable amount of time
Solutions:
- Hashing-based algorithms: BLAST, Eland, MAQ, SHRiMP, GSNAP, Stampy. More sensitive when there are SNPs/indels.
- Suffix trie + Burrows-Wheeler Transform algorithms: Bowtie, SOAP, BWA. Faster.
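The hashing idea can be illustrated with a toy seed-and-verify mapper: index every k-mer of the reference in a hash table, look up the k-mers of each read, and check candidate positions by counting mismatches. This is only a sketch of the principle, not the algorithm of any mapper listed above. The function names, the value of k and the mismatch threshold are assumptions, and indels are not handled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Hash table from each k-mer to the reference positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read: str, reference: str, index: Dict[str, List[int]],
             k: int, max_mismatches: int = 2) -> List[Tuple[int, int]]:
    """Return (position, mismatches) candidates, best first.

    Each k-mer of the read is used as a seed; every seed hit proposes a
    start position that is verified by counting mismatches (no indels).
    """
    hits: Dict[int, int] = {}
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(1 for a, b in zip(read, window) if a != b)
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items(), key=lambda item: (item[1], item[0]))
```

A read carrying a SNP is still found as long as at least one of its k-mers is error free, which is why seed length trades sensitivity against speed.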

19 Different software for different applications
- Transcriptome to genome: GSNAP, Tophat
- Mapping to distant reference: Stampy, SHRiMP
- Very fast mapping: Bowtie, BWA

20 Different software for different applications
- Transcriptome to genome: GSNAP, Tophat
- Mapping to distant reference: Stampy, SHRiMP
- Very fast mapping: Bowtie, BWA
- Others: SMALT, SplitSeek, mrFAST, mrsFAST, SSAHA2, CLC bio, Partek, Genomatix, BWA-SW

21 Different software for different applications (same mappers as the previous slide)
Mapper: Fastq → Sam/Bam

22 SAM/BAM format
SAM: Sequence Alignment/Map format v1.4, The SAM Format Specification Working Group (Sept 2011)
- Standardized format for alignments
- Bam: binary equivalent of SAM
- Bam can be indexed for fast record retrieval
- Manipulate Sam/Bam files using samtools and others
2 parts:
- Header: contains metadata about the sample
- Alignment: one line per read, with the columns listed on the next slide

23 SAM/BAM format
COLUMNS:
 1  QNAME  String  Query template NAME
 2  FLAG   Int     bitwise FLAG
 3  RNAME  String  Reference sequence NAME
 4  POS    Int     1-based leftmost mapping POSition
 5  MAPQ   Int     MAPping Quality
 6  CIGAR  String  CIGAR string
 7  RNEXT  String  Ref. name of the mate/next fragment
 8  PNEXT  Int     Position of the mate/next fragment
 9  TLEN   Int     observed Template LENgth
10  SEQ    String  fragment SEQuence
11  QUAL   String  ASCII of Phred-scaled base QUALity+33
Example: R… … ref 37 30 9M = 7 -39 CAGCGCAT … TAG
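The eleven mandatory columns are tab separated, so an alignment line can be split directly into typed fields. The sketch below is a minimal illustration; in practice a library such as pysam would normally be used. The record in the usage note is hypothetical: it reuses the RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT and TLEN values of the example above, while its QNAME, FLAG, SEQ, QUAL and tag are assumptions.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int        # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]  # optional TAG:TYPE:VALUE fields, left unparsed


def parse_sam_line(line: str) -> SamRecord:
    """Split one SAM alignment line into its 11 mandatory fields plus tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        rnext=fields[6], pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10], tags=fields[11:],
    )
```

For the hypothetical line "r001\t83\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1", the parsed record has pos 37, tlen -39 and a single tag.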

24 Bitwise flag
Bit     Integer  Description
0x1     1        template having multiple segments in sequencing
0x2     2        each segment properly aligned according to the aligner
0x4     4        segment unmapped
0x8     8        next segment in the template unmapped
0x10    16       SEQ being reverse complemented
0x20    32       SEQ of the next segment in the template being reversed
0x40    64       the first segment in the template
0x80    128      the last segment in the template
0x100   256      secondary alignment
0x200   512      not passing quality controls
0x400   1024     PCR or optical duplicate
83 = 1010011 in binary format

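25 Bitwise flag
83 = 1010011 in binary format = 64 + 16 + 2 + 1: the first segment in the template, SEQ being reverse complemented, each segment properly aligned, template having multiple segments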
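Decoding a flag is a matter of testing each bit with a bitwise AND. A minimal sketch, using the bit values from the SAM flag table (the names FLAG_BITS and decode_flag are illustrative):

```python
from typing import List

# (bit, description) pairs from the SAM flag table.
FLAG_BITS = [
    (0x1, "template having multiple segments in sequencing"),
    (0x2, "each segment properly aligned according to the aligner"),
    (0x4, "segment unmapped"),
    (0x8, "next segment in the template unmapped"),
    (0x10, "SEQ being reverse complemented"),
    (0x20, "SEQ of the next segment in the template being reversed"),
    (0x40, "the first segment in the template"),
    (0x80, "the last segment in the template"),
    (0x100, "secondary alignment"),
    (0x200, "not passing quality controls"),
    (0x400, "PCR or optical duplicate"),
]


def decode_flag(flag: int) -> List[str]:
    """Return the descriptions of all bits set in a SAM flag."""
    return [description for bit, description in FLAG_BITS if flag & bit]


if __name__ == "__main__":
    print(format(83, "b"))
    for description in decode_flag(83):
        print(description)
```

The same test answers single questions quickly, for example `flag & 0x4` to check whether a read is unmapped.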

26 CIGAR alignment
Ref:   AGGTCCATGGACCTG
       || ||||X|||||||
Query: AG-TCCACGGACCTG
2M1D12M or 2=1D4=1X7=

Ref:   CTTATGTGATC
       |||||||||||
Query: CTTATGTGATCCCTG
11M4S

M  alignment match (can be a sequence match or mismatch)
I  insertion to the reference
D  deletion from the reference
N  skipped region from the reference
S  soft clipping (clipped sequences present in SEQ)
H  hard clipping (clipped sequences NOT present in SEQ)
P  padding (silent deletion from padded reference)
=  sequence match
X  sequence mismatch
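A CIGAR string is a run of (length, operation) pairs. Summing the operations that consume query bases (M, I, S, =, X) gives the length of SEQ, and summing those that consume reference bases (M, D, N, =, X) gives the span on the reference. A minimal sketch, with illustrative function names:

```python
import re
from typing import List, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")  # operations that consume bases of SEQ
REF_OPS = set("MDN=X")    # operations that consume reference positions


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]
    # Rebuilding the string catches characters the pattern skipped.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError("invalid CIGAR string: " + cigar)
    return ops


def cigar_lengths(cigar: str) -> Tuple[int, int]:
    """Return (query length, reference span) implied by a CIGAR string."""
    ops = parse_cigar(cigar)
    query = sum(length for length, op in ops if op in QUERY_OPS)
    ref = sum(length for length, op in ops if op in REF_OPS)
    return query, ref
```

For the first alignment, both 2M1D12M and 2=1D4=1X7= describe 14 query bases spanning 15 reference bases; for the soft-clipped one, 11 aligned plus 4 clipped bases give the 15-base query.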

27 Mapping enhancement
Each read is mapped independently:
- Can borrow knowledge from neighboring reads to improve mapping
Picard Marking Duplicates: duplicate read pairs are two or more read pairs that have the same coordinates.
Samtools BAQ: hidden Markov model that downweights mismatches that are close to an indel.
GATK Indel realignment: takes all the reads around a potential indel and performs a more sensitive alignment.
GATK Base recalibration: looks at several pieces of contextual information, such as position in the read or dinucleotide composition, to identify covariates of sequencing errors.
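The duplicate definition above can be sketched as a grouping problem: read pairs sharing the same coordinates fall into one group, and all but one are flagged. This is a simplified illustration and not Picard's actual implementation. The dictionary keys, the choice of grouping key and the rule of keeping the pair with the highest summed base quality are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set


def mark_duplicates(pairs: List[Dict]) -> Set[str]:
    """Return the names of read pairs flagged as duplicates.

    Each pair is a dict with the hypothetical keys: name, chrom, start,
    mate_start, strand and base_quality_sum.
    """
    groups = defaultdict(list)
    for pair in pairs:
        key = (pair["chrom"], pair["start"], pair["mate_start"], pair["strand"])
        groups[key].append(pair)

    duplicates: Set[str] = set()
    for group in groups.values():
        # Keep the pair with the highest summed base quality, flag the rest.
        best = max(group, key=lambda pair: pair["base_quality_sum"])
        for pair in group:
            if pair is not best:
                duplicates.add(pair["name"])
    return duplicates
```

Flagged pairs are not deleted from a Bam file; they are marked with the 0x400 flag so that downstream tools can ignore them.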

28 Indel realignment AACAATATCTATGGA/TTTCG/TTTTG

32 The whole pipeline
Raw data → Alignment → Realignment → Mark duplicates → Base recalibration ? → Final bam file(s)

33 The whole pipeline
Raw data → Alignment → Realignment → Mark duplicates → Base recalibration ? → Final bam file(s)
Final bam file(s) →
- SNPs/Indels calling
- CNV calling
- Structural variant calling
- Pool analysis

35 SNPs and indels calling
Samtools mpileup + bcftools versus GATK UnifiedGenotyper:
- Algorithm: Bayesian based (both)
- Multiple samples calling: yes (both)
- Input: bam file(s); output: vcf file (both)
- Runtime: rather fast (Samtools); slow but multithreaded (GATK)
- Multi-allelic: up to 2 alleles (Samtools); 3 by default (GATK)
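To give a feel for what "Bayesian based" calling rests on, the sketch below computes diploid genotype likelihoods at one position from the observed bases and their Phred qualities, then rescales them the way the PL field of a VCF file does. It is a textbook simplification and not the model implemented by Samtools or GATK. It assumes independent reads, an error that is equally likely to produce any of the three other bases, and no prior; the function names are illustrative.

```python
import math
from itertools import combinations_with_replacement
from typing import Dict, Sequence


def genotype_likelihoods(bases: str, quals: Sequence[int],
                         alleles: Sequence[str] = ("A", "C", "G", "T")) -> Dict[str, float]:
    """Return log10 P(observed bases | genotype) for every diploid genotype."""
    result: Dict[str, float] = {}
    for a1, a2 in combinations_with_replacement(alleles, 2):
        log_likelihood = 0.0
        for base, qual in zip(bases, quals):
            error = 10 ** (-qual / 10)
            # Probability of seeing this base if it came from allele a1 or a2.
            p1 = 1 - error if base == a1 else error / 3
            p2 = 1 - error if base == a2 else error / 3
            # Each read comes from either chromosome copy with probability 0.5.
            log_likelihood += math.log10(0.5 * p1 + 0.5 * p2)
        result[a1 + a2] = log_likelihood
    return result


def phred_scaled(likelihoods: Dict[str, float]) -> Dict[str, int]:
    """Phred-scale the likelihoods relative to the best genotype (best = 0)."""
    best = max(likelihoods.values())
    return {gt: round(-10 * (value - best)) for gt, value in likelihoods.items()}
```

With only four reads, all showing C at quality 30, the homozygous CC genotype scores 0 and the heterozygous AC genotype is only about 12 Phred units worse, which illustrates why low depth makes heterozygous calls uncertain.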

36 VCF format
Variant Call Format, designed for the 1000 Genomes Project. It can describe:
- SNPs
- Insertions
- Deletions
- Duplications
- Inversions
- Copy number variation

37 VCF format
Header: defines the optional fields
##INFO=
##FORMAT=
Variants:
- 8 mandatory columns describing the variant
- 1 column defining the genotype format
- 1 column per sample describing the genotype for that SNP for that sample

38 VCF format
HEADER
##fileformat=VCFv4.1
##samtoolsVersion= (r982:295)
##INFO=
##FORMAT=
DATA
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT germline tumor
chr T C DP=2;AF1=1;AC1=4;DP4=0,0,0,1;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:38,3,0:1:0:3
chr G T DP=2;AF1=1;AC1=4;DP4=0,0,0,1;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 0/1:33,3,0:1:0:4
chr T C 44. DP=2;AF1=1;AC1=4;DP4=0,0,1,1;MQ=60;FQ=-30.8 GT:PL:DP:SP:GQ 1/1:40,3,0:1:0:8 1/1:37,3,0:1:0:8
chr G A DP=2;AF1=0.5011;AC1=2;DP4=1,0,0,1;MQ=60;FQ=-5.67;PV4=1,1,1,1 GT:PL:DP:SP:GQ 0/1:34,0,23:2:0:28 0/0:0,0,0:0:0:3
chr A T DP=1;AF1=1;AC1=4;DP4=0,0,1,0;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:40,3,0:1:0:4

39 VCF format
Columns:
- CHROM: Chromosome name
- POS: SNP position
- ID: SNP identifier
- REF: Reference base
- ALT: Alternate base(s)
- QUAL: SNP quality
- FILTER: Filtering reasons
- INFO: SNP information
- FORMAT: Genotype format
- germline: Genotype information
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT germline
chr T C 8.65 . DP=2;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3
chr G T 4.77 . DP=2;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3
chr T C 44 . DP=2;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 1/1:40,3,0:1:0:8
chr G A 5.47 . DP=2;AF1=0.5011;AC1=2;… GT:PL:DP:SP:GQ 0/1:34,0,23:2:0:28
chr A T 10.4 . DP=1;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3
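Because the columns are tab separated and the FORMAT column names the per-sample fields, a VCF data line can be unpacked with plain string operations. The sketch below is a minimal illustration, not a full parser (libraries such as pysam or PyVCF handle the header and typing properly). The function name is illustrative, and the record in the usage note is hypothetical: its chromosome name and position are placeholders.

```python
from typing import Dict, List


def parse_vcf_record(line: str, sample_names: List[str]) -> Dict:
    """Split one VCF data line into fixed columns, INFO and per-sample fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    info_dict: Dict[str, object] = {}
    for item in info.split(";"):
        if item and item != ".":
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True  # flags carry no value

    samples: Dict[str, Dict[str, str]] = {}
    if len(fields) > 9:
        keys = fields[8].split(":")  # FORMAT column, e.g. GT:PL:DP:SP:GQ
        for name, column in zip(sample_names, fields[9:]):
            samples[name] = dict(zip(keys, column.split(":")))

    return {
        "chrom": chrom, "pos": int(pos), "id": variant_id, "ref": ref,
        "alt": alt.split(","), "qual": None if qual == "." else float(qual),
        "filter": filt, "info": info_dict, "samples": samples,
    }
```

For the hypothetical line "chr1\t12345\t.\tT\tC\t8.65\t.\tDP=2;AF1=1;AC1=4\tGT:PL:DP:SP:GQ\t0/1:0,0,0:0:0:3" with sample name "germline", the genotype is available as record["samples"]["germline"]["GT"].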

40 Variant Filtering
- Depth of coverage: confident het call = 10X-20X
- SNPs quality: depends on the caller
- Genotype quality: 20
- Strand bias
- Biological interpretation
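The filters above can be combined into a simple pass/fail check per variant. In the sketch below the depth and genotype quality defaults come from this slide, while the strand bias rule (rejecting calls where more than 90% of the alternate-allele reads come from one strand) and the function signature are assumptions made for illustration.

```python
def passes_filters(depth: int, genotype_quality: int,
                   forward_alt: int, reverse_alt: int,
                   min_depth: int = 10, min_gq: int = 20,
                   max_strand_fraction: float = 0.9) -> bool:
    """Return True if a call passes the depth, GQ and strand bias filters.

    forward_alt and reverse_alt are the numbers of reads supporting the
    alternate allele on each strand (as in the last two DP4 values).
    """
    if depth < min_depth:
        return False
    if genotype_quality < min_gq:
        return False
    alt_total = forward_alt + reverse_alt
    if alt_total > 0:
        # Crude strand bias check: alternate reads should come from both strands.
        if max(forward_alt, reverse_alt) / alt_total > max_strand_fraction:
            return False
    return True
```

A call with depth 2 and genotype quality 3, like those in the example VCF, fails the first two filters, whereas one with depth 15, genotype quality 30 and alternate reads on both strands passes.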

