Presentation on theme: "Whole Exome Sequencing for Variant Discovery and Prioritisation"— Presentation transcript:
1 Whole Exome Sequencing for Variant Discovery and Prioritisation
2 First, a recap. What have we learned?
- NGS platforms – short and long reads
- What the data looks like
- How to QC data
- General procedures in processing data
- How to find biological signal in data - RNA-Seq lectures + practical (in progress)
There’s a LOT more, but it’s not necessarily more complex or very different!
5 Variant Discovery Application: Disease
Printed out, the genome would fill almost 2000 books of 1.5 million letters each (average books of 200 pages)!
This information is contained in any single cell of the body.
6 Monogenic Diseases
- Single mutation
- How do we find it in all those ‘books’?
- A bioinformatics challenge
- NGS sequencers can only read small portions
- So, the library is fragments of pages of the books!
9 Opportunities and Challenges
Enabling technologies: NGS machines, open-source algorithms, capture reagents, falling costs, big sample collections
Exomes are more cost-effective: sequence patient DNA and filter out common SNPs; compare parent-child trios; compare paired normal/cancer samples
Challenges:
- Still can’t interpret many Mendelian disorders
- Rare variants need large sample sizes
- Exome might miss regions (e.g. novel non-coding genes)
Shendure, Genome Biol 2011
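The two filtering strategies on this slide (dropping common SNPs, then comparing a child against both parents) can be sketched in a few lines of Python. This is only an illustration: the `Variant` record, the `population_af` field and the 1% frequency cut-off are assumptions made for the example, not something the slides specify.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    population_af: float  # hypothetical population allele frequency


def rare_variants(variants: List[Variant], max_af: float = 0.01) -> List[Variant]:
    """Keep variants rarer than max_af (assumed cut-off for 'common')."""
    return [v for v in variants if v.population_af < max_af]


def de_novo_candidates(child: List[Variant],
                       mother: List[Variant],
                       father: List[Variant]) -> List[Variant]:
    """Keep child variants that are seen in neither parent."""
    parental = {(v.chrom, v.pos, v.ref, v.alt) for v in mother + father}
    return [v for v in child
            if (v.chrom, v.pos, v.ref, v.alt) not in parental]


if __name__ == "__main__":
    # Made-up toy data, for illustration only.
    child = [
        Variant("1", 100, "A", "G", 0.30),    # common SNP
        Variant("2", 200, "C", "T", 0.0001),  # rare, inherited from mother
        Variant("3", 300, "G", "A", 0.0),     # rare, absent from both parents
    ]
    mother = [Variant("2", 200, "C", "T", 0.0001)]
    father: List[Variant] = []
    print(de_novo_candidates(rare_variants(child), mother, father))
```

In practice the parental comparison also has to account for coverage: a variant can look de novo simply because the parent was poorly sequenced at that position.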
10 Why exome sequencing?
- WGS still too costly & added value of intergenic mutations is low
- WES: targeted sequencing of coding regions (~1% of human genome)
- Mendelian disorders disrupt protein-coding sequences (mostly)
- Large fraction of rare non-synonymous variants in human genome are predicted to be deleterious
- Splice sites also enriched for highly functional variation
The exome represents a highly enriched subset of the genome in which to search for variants with large effect sizes
11 A representation of the relationship between the size of the mutational target and the frequency of disease for disorders caused by de novo mutations
Gilissen, Genome Biol 2011
13 Maximizing chances of finding disease-causing rare variants using exome sequencing
Bamshad, Nat Rev Genet 2011
14 Example: Comparative Sequencing
- Somatic mutation detection between normal / cancer pairs
- Higher mutation yield and better causal gene identification than for Mendelian disorders
Meyerson et al, Nat Rev Genet 2010
15 BUT exome analysis for a single patient can be informative
Perrault syndrome (HSD17B4)
Pierce, Am J Hum Genet 2010
17 Read Mapping
- Mapping hundreds of millions of reads to the reference genome is CPU and RAM intensive, and ‘slow’
- Read quality decreases with length (small single nucleotide mismatches or indels – real or artifact?)
- Very few mappers appropriately deal with indels
- Mapping output: SAM (BAM) or BED
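To make the idea of mapping concrete, here is a toy seed-and-verify mapper in Python. It is a deliberately naive sketch: it indexes the reference by k-mers in a hash table and tolerates mismatches only, so it ignores indels entirely. Production mappers such as BWA and Bowtie use a compressed FM-index instead; the function names and the tiny reference string below are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index(reference: str, k: int = 4) -> Dict[str, List[int]]:
    """Record every 0-based start position of each k-mer in the reference."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read: str, reference: str, index: Dict[str, List[int]],
             k: int = 4, max_mismatches: int = 1) -> List[Tuple[int, int]]:
    """Seed with the read's first k-mer, then verify the full read.

    Returns (0-based position, mismatch count) for every acceptable hit.
    """
    hits = []
    for start in index.get(read[:k], []):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


if __name__ == "__main__":
    reference = "ACGTTGCAACGTAGCT"  # toy reference, illustration only
    index = build_index(reference)
    print(map_read("ACGTAGC", reference, index))  # [(0, 1), (8, 0)]
```

The toy read maps to two places, once with a mismatch and once perfectly, which is the "real or artifact?" question from the slide in miniature.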
18 Mapped Data: SAM specification
- Generic sequence alignment format
- Describes alignment of reads to a reference
- Flexible - stores all the alignment information
- Simple enough to be easily generated or converted from other existing alignment formats
- Keeps track of chromosome position, alignment quality and alignment features (extended CIGAR)
- Includes mate pair / paired end information
- Original FASTQ data can be reproduced from SAM (and BAM)
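As a small illustration of what a SAM record holds, the sketch below splits one alignment line into its 11 mandatory tab-separated fields and uses the extended CIGAR string to work out how many reference bases the read covers. The helper names and the example read are invented for this example; for real data a dedicated library such as pysam is the usual choice.

```python
import re
from typing import Dict

# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one alignment line into named fields (optional tags kept as strings)."""
    cols = line.rstrip("\n").split("\t")
    rec: Dict[str, object] = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        rec[key] = int(rec[key])
    rec["tags"] = cols[11:]
    return rec


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)


if __name__ == "__main__":
    # Made-up record: 10-base read with a 1-base insertion, first mate of a pair.
    line = "read1\t99\tchr1\t100\t60\t5M1I4M\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII\tNM:i:1"
    rec = parse_sam_line(line)
    print(rec["rname"], rec["pos"], reference_span(rec["cigar"]))  # chr1 100 9
    print("paired:", bool(rec["flag"] & 0x1), "reverse:", bool(rec["flag"] & 0x10))
```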
20 BAM format
Binary version of SAM - more compact
- Makes downstream analysis independent from the mapping program
- Allows most operations on the alignment to work on a stream without loading the whole alignment into memory
- Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus
21 VCF format
Emerging standard for storing variant data
- Originally designed for SNPs and short INDELs, it also works for structural variations
- Consists of header and data sections
- The data section is TAB delimited, with each line starting with 8 mandatory fields (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), optionally followed by FORMAT and per-sample columns
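A data line in a VCF file can be unpacked with very little code. The sketch below is a minimal reader for a single line, assuming the 8 fixed columns of the VCF specification; the function names and the example record are invented for illustration, and real pipelines would normally rely on a dedicated parser.

```python
from typing import Dict, Union

# The 8 fixed VCF columns, in order.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Turn 'DP=30;DB' into {'DP': '30', 'DB': True}; '.' means no INFO."""
    out: Dict[str, Union[str, bool]] = {}
    if info == ".":
        return out
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True  # keys without '=' are flags
    return out


def parse_vcf_line(line: str) -> Dict[str, object]:
    """Parse one tab-delimited data line (not a '#' header line)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    rec: Dict[str, object] = dict(zip(VCF_COLUMNS, cols[:8]))
    rec["POS"] = int(cols[1])
    rec["ALT"] = cols[4].split(",")  # several ALT alleles are comma-separated
    rec["QUAL"] = None if cols[5] == "." else float(cols[5])
    rec["INFO"] = parse_info(cols[7])
    rec["SAMPLES"] = cols[8:]  # FORMAT and per-sample columns, if present
    return rec


if __name__ == "__main__":
    # Made-up record, illustration only.
    rec = parse_vcf_line("chr1\t12345\t.\tA\tG,T\t50.0\tPASS\tDP=30;DB")
    print(rec["CHROM"], rec["POS"], rec["ALT"], rec["INFO"])
```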
27 Structural Variation
BreakDancer (Chen et al, Nat Meth 2009)
Only looks at anomalous read pairs
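What "anomalous" means for a read pair can be illustrated with a simple classifier. This is not BreakDancer's actual algorithm, just a sketch of the general idea: pairs are flagged when the mates land on different chromosomes, have an unexpected orientation, or sit much closer together or further apart than the library's insert size predicts. The insert-size mean, standard deviation and 3-SD cut-off are assumed values, and the span is approximated by the distance between the mates' leftmost positions.

```python
def classify_pair(chrom1: str, pos1: int, reverse1: bool,
                  chrom2: str, pos2: int, reverse2: bool,
                  mean_insert: float = 300.0, sd_insert: float = 30.0,
                  n_sd: float = 3.0) -> str:
    """Label a read pair by the kind of structural variant it hints at.

    Assumes a standard paired-end library where a normal pair has the
    left mate on the forward strand and the right mate on the reverse.
    """
    if chrom1 != chrom2:
        return "interchromosomal"
    # Order the two mates by position along the chromosome.
    (left_pos, left_rev), (right_pos, right_rev) = sorted(
        [(pos1, reverse1), (pos2, reverse2)])
    if left_rev == right_rev:
        return "inversion-like"  # both mates on the same strand
    if left_rev and not right_rev:
        return "everted"  # mates point away from each other
    span = right_pos - left_pos  # crude proxy for insert size
    if span > mean_insert + n_sd * sd_insert:
        return "deletion-like"  # mates further apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion-like"  # mates closer together than expected
    return "normal"


if __name__ == "__main__":
    print(classify_pair("chr1", 1000, False, "chr1", 1300, True))  # normal
    print(classify_pair("chr1", 1000, False, "chr1", 5000, True))  # deletion-like
```

A single anomalous pair proves little; callers of this kind look for clusters of pairs that support the same event.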
28 Copy Number Variation Detection
Change in read coverage
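One common way to turn "change in read coverage" into numbers is a per-target log2 ratio between a sample and a control, after scaling each by its total read count. The sketch below assumes that approach; the ±0.5 thresholds and the depth values are arbitrary illustration values, not recommendations from the slides.

```python
import math
from typing import List, Optional


def log2_ratios(sample_depth: List[float],
                control_depth: List[float]) -> List[Optional[float]]:
    """Per-target log2(sample / control) after library-size normalisation.

    Targets with zero depth in either sample give None (ratio undefined).
    """
    sample_total = sum(sample_depth)
    control_total = sum(control_depth)
    ratios: List[Optional[float]] = []
    for s, c in zip(sample_depth, control_depth):
        if s == 0 or c == 0:
            ratios.append(None)
        else:
            ratios.append(math.log2((s / sample_total) / (c / control_total)))
    return ratios


def call_cnv(ratio: Optional[float], gain: float = 0.5, loss: float = -0.5) -> str:
    """Label a target using assumed, purely illustrative thresholds."""
    if ratio is None:
        return "no-call"
    if ratio >= gain:
        return "gain"
    if ratio <= loss:
        return "loss"
    return "neutral"


if __name__ == "__main__":
    sample = [100, 100, 200, 50]    # made-up read depths per exon target
    control = [100, 100, 100, 100]
    print([call_cnv(r) for r in log2_ratios(sample, control)])
    # ['neutral', 'neutral', 'gain', 'loss']
```

Exome capture efficiency varies a lot between targets, which is why the comparison is made target by target against a control rather than against a flat expected depth.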
29 Example WES-based variant discovery workflow
- Map the reads to a reference genome
  - Index the reference genome
  - Map (BWA, Bowtie, Novoalign, etc.)
- Sort BAM file
- Remove PCR duplicates
- Realign around indels (‘optional’)
- Call variants
- Recalibrate quality scores (‘optional’)
- Filter variants
- Basic variant annotation
Biological interpretation only starts here
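To show what one of these steps does under the hood, here is a simplified sketch of PCR duplicate removal in Python. It assumes that reads sharing chromosome, start position and strand are duplicates and keeps the one with the best mapping quality. Real duplicate-marking tools are more careful (they work on unclipped 5' ends, consider both mates, and usually rank by base quality), so treat the dictionary keys and the selection rule here as assumptions made for the example.

```python
from typing import Dict, List, Tuple


def remove_duplicates(reads: List[dict]) -> List[dict]:
    """Keep one read per (chrom, pos, strand), preferring higher MAPQ.

    Each read is a dict with hypothetical keys:
    'name', 'chrom', 'pos', 'reverse' and 'mapq'.
    The result is sorted by coordinate, as a sorted BAM would be.
    """
    best: Dict[Tuple[str, int, bool], dict] = {}
    for read in reads:
        key = (read["chrom"], read["pos"], read["reverse"])
        if key not in best or read["mapq"] > best[key]["mapq"]:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r["chrom"], r["pos"]))


if __name__ == "__main__":
    # Made-up reads, illustration only.
    reads = [
        {"name": "r1", "chrom": "chr1", "pos": 500, "reverse": False, "mapq": 30},
        {"name": "r2", "chrom": "chr1", "pos": 500, "reverse": False, "mapq": 60},
        {"name": "r3", "chrom": "chr1", "pos": 500, "reverse": True, "mapq": 40},
        {"name": "r4", "chrom": "chr1", "pos": 100, "reverse": False, "mapq": 20},
    ]
    print([r["name"] for r in remove_duplicates(reads)])
```

Duplicates matter for variant calling because a PCR error copied many times can look like strong evidence for a variant.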