Presentation transcript: "Advanced ChIP-seq: Identification of consensus binding sites for the LEAFY transcription factor"
1 Advanced ChIP-seq: Identification of consensus binding sites for the LEAFY transcription factor
Explain that you can use your own data. Explain that data is pre-staged for today's work. This is public data from a recent paper.
Goal: Can we replicate this finding?
2 Workflow: Import from SRA -> Export to FASTQ -> Align to genome: BWA -> Merge replicates -> Find peaks: PeakRanger -> Inspect results: IGV 2.0 -> Filter best peaks -> Expand to 100 bp windows -> Extract FASTA -> Find motifs: DREME
4 The NCBI SRA
NCBI SRA is a repository for NGS sequence reads. Data is stored with basic metadata describing the experimental technique and inter-sample relationships. The data format is the NCBI-specific SRA and SRA-lite format, a "universal" lossless format. Upload and download are offered via FTP and HTTP, and also via Aspera ASCP, a fast, parallel protocol similar in performance to the iRODS iput/iget commands used in the iPlant Data Store. Use NCBI SRA Import to rapidly copy SRA accession SRP over ASCP into the iPlant Data Store.
If you are doing this import live, it takes about 20 minutes for everyone to get their data into the system. This factors in data transfer time, monkeying with learning SRA, etc.
5 Can go to this live and browse around the SRA to familiarize users with its interface
6 NCBI SRA Toolkit
The SRA data format is a universal format, but no downstream apps can accept it natively. SRA files need to be exported to FASTQ, SFF, etc., which are the standard file formats for representing sequence. Use the NCBI SRA Toolkit fastq-dump to export FASTQ sequence files from SRA files so we can process them.
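If someone asks what a FASTQ file actually contains, a minimal Python sketch of a reader may help. It assumes the common four-lines-per-record layout (header starting with @, sequence, separator starting with +, quality string). The read names and sequences below are made up for illustration and are not from the LEAFY data set.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical reads, for illustration only.
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIHHHH\n"
    "@read2\nTTGACCAA\n+\nFFFFFFFF\n"
)
records = list(read_fastq(example))
```

Each record pairs a base call with one quality character per base, which is why the two strings must have the same length.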
7 Import SRA data from NCBI SRA. Extract FASTQ files from the downloaded SRA archives.
The translation step takes ~30 minutes, accounting for the repetition of users doing the work. Recommend giving users the import and fastq-dump steps to do together, then taking a pause while everyone syncs up. Start up work again when ~50% of the class is getting data emitted as FASTQ files.
8 BWA
BWA is one of many applications whose objective is to efficiently align short sequence reads to a reference genome sequence. Alternatives include Bowtie, MAQ, TopHat, Stampy, Novoalign, etc. BWA is used by the 1000 Genomes Project due to its speed and accuracy.
9 Outputs from BWA
BWA emits alignments in the SAM format. SAM is a universal system for describing next-gen sequences and their corresponding genome alignments. SAMtools is a suite of applications for manipulating SAM files: sort, merge, index, and more, including emitting a binary BAM file.
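To make the SAM format concrete, here is a small Python sketch that splits one alignment line into a few of its eleven mandatory tab-separated fields and decodes two flag bits (0x4 for unmapped, 0x10 for reverse strand). The example line is invented for illustration; in practice you would use SAMtools or a dedicated library rather than parsing by hand.

```python
from typing import NamedTuple, Optional


class SamAlignment(NamedTuple):
    read_name: str
    flag: int
    reference: str
    position: int  # 1-based leftmost mapping position
    mapq: int
    cigar: str
    sequence: str

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> Optional[SamAlignment]:
    """Parse one SAM line; header lines (starting with @) return None."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamAlignment(
        read_name=fields[0],
        flag=int(fields[1]),
        reference=fields[2],
        position=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        sequence=fields[9],
    )


# Hypothetical alignment line, for illustration only.
example_line = "read1\t16\tChr1\t25986\t37\t8M\t*\t0\t0\tACGTACGT\tIIIIHHHH"
alignment = parse_sam_line(example_line)
```

A flag of 16 means the read aligned to the reverse strand, so `alignment.is_reverse` is true while `alignment.is_unmapped` is false.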
10 Align FASTQ files to the Arabidopsis genome using BWA. Merge and index BAM files using SAMtools apps.
Another fine spot for a sync break. Alignments for each data set take ~35 minutes. Estimate an hour for all users to complete them. Give them the task of setting up the alignment, then merging the replicate result files together.
11 PeakRanger
PeakRanger is a fast, optimized algorithm for detecting enrichment peaks in ChIP-seq data sets. PeakRanger was developed at OICR in partnership between modENCODE and iPlant and is now maintained at UTSW. It's not the only option for peak finding: MACS, ChIPseq Peak Finder, CisGenome, FindPeaks.
You want everyone on the same page when we start PeakRanger: this is the interesting part of the class. Good time to point out that if you don't like the selection of peak finders you can push one in yourself!
12 Please read the PeakRanger paper yourself if you are recommending that others read it.
Use PeakRanger with the BAM files from the Control and Sample assays to find LEAFY enrichment. NOTE: There are many parameters to tweak. Reading the PeakRanger paper is recommended.
13 Outputs from PeakRanger
Wiggle (.wig) files: Density map of sequence reads across the reference genome for control and sample BAM alignments.
Region (.bed) file: Feature file containing the significantly enriched domains in the genome.
Summit (.bed) file: Feature file containing the single-base maximum of each peak.
This is pure tutorial material: What's a BED file? What's a WIG file?
14 Wiggle file and BED file
Wiggle files represent a value over spatial resolution and are generally continuous. BED files represent annotated intervals, similar to GFF but less expressive. These are used in downstream applications for analysis.
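For the "What's a BED file? What's a WIG file?" question, a short Python sketch can show the difference in practice. It assumes BED's 0-based, half-open coordinates and a wiggle file in the fixedStep layout with 1-based start positions. Whether PeakRanger writes fixedStep or variableStep wiggle is not stated here, so treat the layout as an assumption; the feature name and values are invented.

```python
from typing import List, NamedTuple, Tuple


class BedFeature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def parse_bed_line(line: str) -> BedFeature:
    """Parse the first four columns of a tab-separated BED line."""
    fields = line.rstrip("\n").split("\t")
    name = fields[3] if len(fields) > 3 else ""
    return BedFeature(fields[0], int(fields[1]), int(fields[2]), name)


def parse_fixed_step_wig(text: str) -> List[Tuple[str, int, float]]:
    """Return (chrom, 1-based position, value) for each data line."""
    points: List[Tuple[str, int, float]] = []
    chrom, position, step = "", 0, 1
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("fixedStep"):
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            position = int(params["start"])
            step = int(params.get("step", 1))
            continue
        points.append((chrom, position, float(line)))
        position += step
    return points


# Hypothetical inputs, for illustration only.
summit = parse_bed_line("Chr1\t150\t151\tsummit_1")
density = parse_fixed_step_wig(
    "track type=wiggle_0\n"
    "fixedStep chrom=Chr1 start=101 step=10\n"
    "2\n5\n3\n"
)
```

The BED feature is one annotated interval of length `end - start`, while the wiggle data is a series of values laid out along the chromosome.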
15 Integrative Genomics Viewer
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations. Use IGV to inspect outputs from PeakRanger.
IGV: Make sure you know how to run IGV yourself. Work the example. Play with configuring tracks. You don't NEED to run IGV in Atmosphere. If that product is flaking out, show users how to do the same thing on their OWN desktop!
16 Using IGV in Atmosphere
Launch an instance of NGS Viewers from the Atmosphere App list. Use VNClient to connect to your remote desktop.
Make sure everyone has gotten PeakRanger to complete. It takes about 30 min to run on our data set.
17 Using IGV in Atmosphere
Configure iDrop. Copy .wig and .bed files from the PeakRanger output to your Atmosphere instance desktop.
18 Using IGV in Atmosphere
Launch IGV (Integrative Genomics Viewer). Change the current genome to A. thaliana (TAIR10).
19 Using IGV in Atmosphere
Open igvtools and convert the .wig file to .tdf. Load the .tdf and .bed files into the IGV window. Inspect loci by entering their names into the search box.
20 Using IGV in Atmosphere
Enrichment region and alignment peak at the promoter region of APETALA1 (AP1).
If you are looking for a field trip while people are fiddling with Atmo/IGV, throw up the next slide. Otherwise just skip it.
21 AP1 (APETALA1) mutant
Image labels: Wild-type, ap1.
ap1 is a mutant in which the APETALA1 gene is not active.
Why do we even care about LEAFY? Well, it activates AP1. If AP1 is not active, Arabidopsis can't make flowers and instead makes cauliflowers!
22 Some known LEAFY targets
Gene name         Locus
APETALA1 (AP1)    AT1G
AGAMOUS (AG)      AT4G
LMI2              AT3G
LMI3              AT5G
LMI4              AT5G
LMI5              AT1G
Look for LEAFY enrichment at these loci in IGV 2.0.
23 Filtering the PeakRanger summits file
The statistically best summits from PeakRanger have p-values of zero. If you look at the summits.bed file you can see that the p-value is embedded in the name of each feature. So, if we filter summits.bed for only the lines matching pval_0, we generate a BED file containing the summits most likely to be near true LEAFY binding sites.
DE app: Find Lines Matching a Regular Expression. This is identical to running
egrep "pval_0" peakranger_summit.bed > peakranger_summit_best.bed
on a command line.
Back to the bioinformatics. This slide is designed to highlight the convergence between command line and DE tools.
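The same filter can be sketched in Python for anyone who wants to see what the regular-expression match is doing. The feature names below are invented; the real layout of PeakRanger summit names is not shown in these slides, so the naming scheme is an assumption.

```python
import re
from typing import Iterable, List


def filter_lines(lines: Iterable[str], pattern: str) -> List[str]:
    """Keep lines where the pattern matches anywhere, as egrep does."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Hypothetical summit lines; the name column layout is an assumption.
summit_lines = [
    "Chr1\t25986\t25987\tsummit_1_pval_0\n",
    "Chr1\t30012\t30013\tsummit_2_pval_3.2e-05\n",
    "Chr4\t10382\t10383\tsummit_3_pval_0\n",
]
best_summits = filter_lines(summit_lines, r"pval_0")
```

One thing to check on real output: the pattern pval_0 matches any line containing that text, so if a name ever encoded a non-zero value starting with a zero (for example pval_0.01), it would also pass. Anchoring the pattern, for instance `pval_0$` when the p-value ends the line, would be a stricter choice under that assumption.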
24 BEDTools for Interval Operations
The BEDTools utilities allow one to address common genomics tasks such as finding feature overlaps and computing coverage. The utilities are largely based on four widely-used file formats: BED, GFF/GTF, VCF, and SAM/BAM. Using BEDTools, one can develop sophisticated pipelines that answer complicated research questions by "streaming" several BEDTools together.
slopBed: Expand the coordinates of features in a BED file by a defined number of bases.
fastaFromBed: Extract a multi-FASTA file from a reference sequence using a BED file of features.
Make sure to draw pictures, etc. to make sure people understand that you are going from
Single base peak -> 100 bp window -> FASTA file from those windows
using these two tools.
* The entire BEDTools suite is slated for integration into the iPlant DE. Follow us to learn when new tools become available.
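If a picture is not enough, the coordinate arithmetic behind the slop step can be sketched in a few lines of Python. This is an illustration of the idea, not the BEDTools implementation; the function name and the chromosome length used below are hypothetical.

```python
from typing import Tuple


def slop(start: int, end: int, flank: int, chrom_length: int) -> Tuple[int, int]:
    """Extend a 0-based, half-open interval by `flank` bases on both sides.

    The result is clipped so it never runs off either end of the chromosome.
    """
    new_start = max(0, start - flank)
    new_end = min(chrom_length, end + flank)
    return new_start, new_end


# A single-base summit at position 1000 on a hypothetical 30 Mb chromosome.
window = slop(1000, 1001, flank=50, chrom_length=30_000_000)
window_length = window[1] - window[0]
```

Padding a single-base feature by 50 bp on each side gives 101 bases (the summit plus 50 on either side), which the slides round to a 100 bp window.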
25 Objective
Go from a BED file of single-base peak summits to a FASTA file containing the 100 bp surrounding those summits that can be used for motif hunting.
Filter summits.bed on pval_0 -> Best summits BED file (single-base-pair features) -> BEDTools slopBed, 50 bp equidistant -> 100 bp region BED file (100 bp centered on peak centers) -> BEDTools fastaFromBed, Arabidopsis genome -> FASTA file of 100 bp regions (likely to contain consensus motifs) -> DREME
Why are we doing this? To get the sequences that are MOST LIKELY to contain consensus motifs when we input them into DREME.
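The last hop, from windows to FASTA, is just slicing the reference sequence. A minimal Python sketch, assuming the genome fits in a dictionary of chromosome name to sequence and that windows use 0-based, half-open BED coordinates. The toy genome and the chrom:start-end header style are illustrative choices, not taken from the class data.

```python
from typing import Dict, List, Tuple


def extract_fasta(genome: Dict[str, str], windows: List[Tuple[str, int, int]]) -> str:
    """Build a multi-FASTA string with one record per (chrom, start, end) window."""
    records = []
    for chrom, start, end in windows:
        sequence = genome[chrom][start:end]  # BED coordinates map directly to a slice
        records.append(f">{chrom}:{start}-{end}\n{sequence}\n")
    return "".join(records)


# Toy genome and windows, for illustration only.
toy_genome = {"Chr1": "ACGTACGTACGGTACCAATGGTTT"}
toy_windows = [("Chr1", 4, 12), ("Chr1", 14, 21)]
fasta_text = extract_fasta(toy_genome, toy_windows)
```

Because BED intervals are half-open, a window's length is simply `end - start`, and no off-by-one adjustment is needed when slicing.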
26 DREME
Run DREME on the 100 bp windows surrounding LEAFY peaks. Download the entire result folder to your desktop or Atmo VM. Load dreme_out/dreme.html in your browser.
27 DREME results
Success! CCANTG(G/T)! This motif says the same thing as the published consensus. Notice it's the first one found by DREME! You have replicated a scientific finding using INDEPENDENT BIOINFORMATIC METHODS.
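To show what a consensus like CCANTG(G/T) means as a pattern, here is a hedged Python sketch that writes it in IUPAC codes as CCANTGK (N is any base, K is G or T), converts it to a regular expression, and scans both strands of a sequence. The test sequences are made up, and this is plain pattern matching, not what DREME does internally.

```python
import re
from typing import List, Tuple

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif: str) -> str:
    return "".join(IUPAC[base] for base in motif)


def reverse_complement(sequence: str) -> str:
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_motif(sequence: str, motif: str = "CCANTGK") -> List[Tuple[int, str]]:
    """Return (0-based start on the forward strand, strand) for every match."""
    # The lookahead lets overlapping occurrences be reported too.
    pattern = re.compile(f"(?=({motif_to_regex(motif)}))")
    forward = [(m.start(), "+") for m in pattern.finditer(sequence)]
    flipped = reverse_complement(sequence)
    reverse = [
        (len(sequence) - m.start() - len(motif), "-")
        for m in pattern.finditer(flipped)
    ]
    return sorted(forward + reverse)


# Made-up sequences, for illustration only.
plus_hits = find_motif("GGCCAGTGTAA")
minus_hits = find_motif("TTACACTGGCC")
```

The first sequence carries the motif on the forward strand, the second carries it on the reverse strand, and both report a start of 2 in forward-strand coordinates.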
28 Potential Next Steps
Identify all consensus LEAFY sites in the genome that fall in promoters. Extract all the promoters where LEAFY has significant binding and associate them with genes. Generate a simple gene list and run Ontology Term enrichment analysis to find classes of genes influenced by LEAFY.
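One way to start on the promoter step is sketched below in Python. It assumes a promoter is the 1000 bp immediately upstream of a gene's start, taking strand into account; that window size is an arbitrary choice for illustration, and the gene names and coordinates are hypothetical. A real analysis would read gene models from an annotation file and would likely use BEDTools for the overlaps.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    strand: str  # "+" or "-"


def promoter_interval(gene: Gene, upstream: int = 1000) -> Tuple[int, int]:
    """Return the half-open interval upstream of the gene's transcription start."""
    if gene.strand == "+":
        return max(0, gene.start - upstream), gene.start
    return gene.end, gene.end + upstream


def genes_with_promoter_summits(
    genes: List[Gene], summits: List[Tuple[str, int]], upstream: int = 1000
) -> List[str]:
    """List genes whose promoter contains at least one (chrom, position) summit."""
    hits = []
    for gene in genes:
        p_start, p_end = promoter_interval(gene, upstream)
        if any(
            chrom == gene.chrom and p_start <= position < p_end
            for chrom, position in summits
        ):
            hits.append(gene.name)
    return hits


# Hypothetical genes and summits, for illustration only.
toy_genes = [
    Gene("geneA", "Chr1", 5000, 7000, "+"),
    Gene("geneB", "Chr1", 10000, 12000, "-"),
    Gene("geneC", "Chr2", 5000, 7000, "+"),
]
toy_summits = [("Chr1", 4500), ("Chr1", 12500), ("Chr1", 8000), ("Chr2", 9000)]
bound_genes = genes_with_promoter_summits(toy_genes, toy_summits)
```

The resulting gene list is the kind of input an ontology term enrichment tool would take.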
29 Cyberinfrastructure Overview
Component: iPlant Data Store. What we did: Imported data from SRA. Stored results of analyses. Downloaded results. Why we used it: Fast, flexible storage for large bioinformatics data.
Component: Discovery Environment. What we did: Data import. NGS alignment. Peak finding. Data organization. Why we used it: One interface. Multiple bioinformatics applications. Easy to manage work products.
Component: Atmosphere. What we did: Loaded results into desktop client application. Why we used it: Avoid downloading large files to personal computer. Easy access to powerful desktop environment.