EDACC Primary Analysis Pipelines Cristian Coarfa Bioinformatics Research Laboratory Molecular and Human Genetics.


1 EDACC Primary Analysis Pipelines
Cristian Coarfa
Bioinformatics Research Laboratory
Molecular and Human Genetics

2 Data Levels

3 Data Types Submitted To EDACC
ChIP-Seq
Shotgun Bisulfite Sequencing
  –Methyl-C
Reduced Representation Bisulfite Sequencing
  –RRBS
MRE-Seq
MeDIP-Seq
Chromatin Accessibility
small RNA-Seq
mRNA-Seq

4 Read Mapping
Common processing step to all pipelines
High throughput
  –Sequence space: Illumina
  –Color space: SOLiD
Quick and accurate anchoring
Read size varies from 36 to 76 bp
Short read aligners
  –1st generation: Maq, soap
    –Ungapped alignment
  –2nd generation: bowtie, bwa, soap2
    –Trade off speed against sensitivity; good enough for many applications
Mapping tools
  –Robust to indels
  –Sensitive to variable number of mismatches

5 Pash 3.0
Positional Hashing
Regular reads mapping
Bisulfite sequencing mapping
Integrate basepair variation with epigenetic variation
SAM output, easy integration with other analysis tools
Accuracy without sacrificing efficiency

6 Bisulfite Sequencing
Current tools: BSMAP, RMAP-BS, mrsFAST, ZOOM
Pash 3.0
  –Integrate mutation discovery with basepair-level methylation discovery
  –Speedup
General approach
  –Convert C’s to T’s in reads and/or reference
  –Use mappings, reads and reference to determine methylated sites
Pash 3
  –Generate and hash all possible kmers for reads
  –CTT: CCC, CCT, CTC, CTT
  –Map against forward and reverse complement chromosome strands
Superior sensitivity to other tools, without loss of efficiency
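The k-mer expansion on this slide (CTT: CCC, CCT, CTC, CTT) can be sketched in a few lines of Python. After bisulfite treatment an unmethylated C is read as T, so every T in a read may correspond to either C or T in the reference, while a C in the read stays C. The function name `bisulfite_kmer_variants` is hypothetical, and this is only an illustration of the enumeration for forward-strand reads, not Pash's actual implementation.

```python
from itertools import product


def bisulfite_kmer_variants(kmer):
    """Enumerate reference k-mers a bisulfite-treated read k-mer could come from.

    Assumption: forward-strand read, so each T may stand for a reference
    C (unmethylated, converted) or T; every other base is kept as-is.
    """
    choices = [("C", "T") if base == "T" else (base,) for base in kmer.upper()]
    return ["".join(combo) for combo in product(*choices)]


variants = bisulfite_kmer_variants("CTT")
print(variants)
```

A k-mer with n T's yields 2^n candidates, which is why hashing the expanded read k-mers grows quickly for T-rich reads.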

7 Galaxy/Genboree
Developed at Penn State University
Benefits
  –Rapid deployment tool
  –Share pipelines w/ others
Alan Harris, Sriram Raghuram
  –Deployed Galaxy/Genboree
  –Integration w/ Genboree API for upload/download
  –Adaptors for LFF file format support
  –EDACC XML validation tools
Sriram Raghuram, Andrew Jackson, Cristian Coarfa
  –Integration with compute clusters
Arpit Tandon, Sriram Raghuram
  –Deployed analysis tools
http://genboree.org/galaxy

8 Primary Analysis Pipelines
Implemented & exposed via Galaxy/Genboree
  –Read mapping
  –Bisulfite Sequencing read mapping
  –Peak calling (ChIP-Seq, MeDIP-Seq)
    –MACS (Harvard), FindPeaks (UBC)
  –Chromatin accessibility
    –HotSpot (UW)
  –Small RNA-seq
Coming soon
  –mRNA seq
  –Expression, alternative splicing
  –Gene fusion
Typical user interaction
  –Use Galaxy for user input
  –Submit jobs to a cluster
  –Upload results to Genboree

9 Reads Mapping

10 ChIP-Seq
Select uniquely mapping reads
Build read density maps
  –Extend each read 200bp along the mapping strand
  –Remove monoclonal reads
  –Generate WIG data
  –Can be visualized in Genboree and UCSC
Peak calling
  –FindPeaks, MACS
Interpret peaks
  –Overlap with genomic features of interest: gene promoters, etc.
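The read density steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the EDACC pipeline code: reads are `(chrom, start, end, strand)` tuples with 1-based inclusive coordinates, "extend 200bp" is taken to mean extending each read to a total length of 200 bp in its mapping direction, and "monoclonal" reads are taken to be reads with identical coordinates and strand. All function names are hypothetical; the output uses the UCSC `variableStep` WIG layout.

```python
from collections import Counter

FRAGMENT_LENGTH = 200  # assumed total read length after extension


def extend_read(start, end, strand, length=FRAGMENT_LENGTH):
    """Extend a read to `length` bp along its mapping strand."""
    if strand == "+":
        return start, max(end, start + length - 1)
    return max(1, min(start, end - length + 1)), end


def remove_monoclonal(reads):
    """Keep one copy of reads that share chromosome, coordinates and strand."""
    seen = set()
    kept = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            kept.append(read)
    return kept


def read_density(reads, length=FRAGMENT_LENGTH):
    """Count how many extended, de-duplicated reads cover each position."""
    depth = Counter()
    for chrom, start, end, strand in remove_monoclonal(reads):
        ext_start, ext_end = extend_read(start, end, strand, length)
        for pos in range(ext_start, ext_end + 1):
            depth[(chrom, pos)] += 1
    return depth


def to_wig_lines(depth, chrom):
    """Format the density of one chromosome as variableStep WIG lines."""
    lines = [f"variableStep chrom={chrom}"]
    positions = sorted(pos for (c, pos) in depth if c == chrom)
    lines.extend(f"{pos} {depth[(chrom, pos)]}" for pos in positions)
    return lines


# Made-up example reads: two identical "+" reads and one "-" read.
reads = [("chr1", 100, 135, "+"), ("chr1", 100, 135, "+"), ("chr1", 400, 435, "-")]
density = read_density(reads)
wig = to_wig_lines(density, "chr1")
```

A per-base dictionary is fine for a sketch; a genome-scale run would need interval arithmetic or arrays.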

11 MeDIP-Seq
Select uniquely mapping reads
Build read density maps
Determine methylated CpGs
  –FindPeaks

12 Finding methylated CpGs

13 MeDIP-Seq Signal Visualization

14 MRE-Seq
Select uniquely mapping reads
Determine unmethylated CpGs

15 Bisulfite Sequencing
Shotgun Bisulfite Sequencing
  –Methyl-C
  –Genome wide
Reduced Representation Bisulfite Sequencing
  –RRBS
  –Enzyme cocktail
Map using Pash
Build methylation maps

16 Bisulfite Sequencing Read Mapping

17 Methylation Maps
Position  Strand  CHHStatus  Methylation  Unmethylated  TotalReads
50100242  +       CG         1            0             1
50100243  -       CG         40           11            51
50100250  +       CG         1            0             1
50100251  -       CG         37           8             46
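A methylation map like the one above can be turned into per-site methylation levels with a short script. This is a sketch under assumptions: rows are whitespace-separated in the column order shown on the slide, the level is computed as methylated reads divided by TotalReads as given (the sketch does not assume the two counts add up to the total), and the function names are hypothetical.

```python
# Rows copied from the slide: position, strand, context, methylated, unmethylated, total.
ROWS = """\
50100242 + CG 1 0 1
50100243 - CG 40 11 51
50100250 + CG 1 0 1
50100251 - CG 37 8 46
"""


def parse_methylation_map(text):
    """Parse whitespace-separated methylation map rows into dictionaries."""
    sites = []
    for line in text.splitlines():
        if not line.strip():
            continue
        position, strand, context, meth, unmeth, total = line.split()
        sites.append({
            "position": int(position),
            "strand": strand,
            "context": context,
            "methylated": int(meth),
            "unmethylated": int(unmeth),
            "total": int(total),
        })
    return sites


def methylation_level(site):
    """Fraction of reads supporting methylation; None when there is no coverage."""
    if site["total"] == 0:
        return None
    return site["methylated"] / site["total"]


sites = parse_methylation_map(ROWS)
levels = [methylation_level(site) for site in sites]
for site, level in zip(sites, levels):
    print(site["position"], site["strand"], f"{level:.3f}")
```

Sites covered by a single read, such as the first and third rows, give a level of 1.0 with little statistical weight, so a minimum coverage filter is a natural next step.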

18 Small RNA-Seq
Trim adapters
Map reads onto target genome
  –up to 100 locations per read
Interpret
  –Overlap w/ miRNAs, piRNAs, sno/scaRNAs
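The adapter trimming step can be illustrated with a simple exact-match trimmer. This is a sketch, not the tool used in the pipeline: it assumes a 3' adapter, allows no mismatches, and the adapter sequence and the names `trim_adapter` and `min_overlap` are placeholders chosen for the example.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter from a read using exact matching only.

    First look for the full adapter anywhere in the read; otherwise look for
    the longest adapter prefix (at least `min_overlap` bases) at the read end.
    """
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    for overlap in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


ADAPTER = "TCGTATGCCG"  # placeholder sequence, not a real kit adapter

full = trim_adapter("ACGTACGT" + ADAPTER, ADAPTER)
partial = trim_adapter("ACGTACGT" + ADAPTER[:5], ADAPTER)
untouched = trim_adapter("ACGTACGT", ADAPTER)
```

Small RNAs are often shorter than the read length, so most reads run into the adapter; that is why trimming comes before mapping in this pipeline.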

19 Exercise
Download the input MeDIP-Seq file from the workshop wiki
Analyze it using FindPeaks in Galaxy
  –Obtain results in Genboree LFF format
Upload the results to Genboree database
View the results in a tabular view
Find the largest peaks
Explore them in the Genboree browser
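For the "find the largest peaks" step, the ranking can also be done offline on the LFF output. The sketch below rests on assumptions that should be checked against the actual file: LFF records are taken to be tab-delimited with the columns class, name, type, subtype, chromosome, start, stop, strand, phase, score in that order, and "largest" is taken to mean highest score. The function name and the example records are made up for illustration.

```python
def largest_peaks(lff_text, top_n=5):
    """Return the top_n highest-scoring records from LFF-formatted text.

    Assumed column order: class, name, type, subtype, chrom, start, stop,
    strand, phase, score. Comment lines and short lines are skipped.
    """
    peaks = []
    for line in lff_text.splitlines():
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 10:
            continue
        peaks.append({
            "name": fields[1],
            "chrom": fields[4],
            "start": int(fields[5]),
            "stop": int(fields[6]),
            "score": float(fields[9]),
        })
    return sorted(peaks, key=lambda peak: peak["score"], reverse=True)[:top_n]


# Made-up example records, not real FindPeaks output.
EXAMPLE = "\n".join([
    "Peaks\tpeak1\tMeDIP\tFindPeaks\tchr1\t1000\t1400\t+\t.\t12.5",
    "Peaks\tpeak2\tMeDIP\tFindPeaks\tchr1\t5000\t5600\t+\t.\t48.0",
    "Peaks\tpeak3\tMeDIP\tFindPeaks\tchr2\t200\t450\t+\t.\t30.1",
])

top = largest_peaks(EXAMPLE, top_n=2)
```

Sorting by peak width instead only requires changing the sort key to `peak["stop"] - peak["start"]`.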

