Max-Planck-Institut für molekulare Genetik Software Praktikum, 1.2.2013 Folie 1 Comparing Methods for Identifying Transcription Factor Target Genes Alena.

Presentation transcript:

Max-Planck-Institut für molekulare Genetik, Software Praktikum, Slide 1
Comparing Methods for Identifying Transcription Factor Target Genes
Alena van Bömmel (R ), Matthew Huska (R )
Max Planck Institute for Molecular Genetics

Transcriptional Regulation
o TF not bound = no gene expression
o TF bound = gene expression
Problem: There are many genes and many TFs. How do we identify the targets of a TF?

Methods for Identifying TF Target Genes
o Microarray
o PWM Genome Scan
o ChIP-seq

PWM Genome Scan
Purely computational method
Input:
o position weight matrix (PWM) for your TF
o genomic region(s) of interest
Pros:
o No need to do wet lab experiments
Cons:
o Many false positives
o Not able to take biological conditions into account
[Figure: score threshold]
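
The hit-based idea behind a PWM genome scan can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular tool: the count matrix, the uniform background of 0.25, the pseudocount of 1 and the threshold of 4.0 are all hypothetical values chosen for the example.

```python
import math

# Hypothetical count matrix for a 4-bp motif (one dict per motif position).
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]


def counts_to_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn base counts into log2-odds weights (assumed uniform background)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, site, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other symbols
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


PWM = counts_to_pwm(COUNTS)
HITS = scan("TTACGTGGACGTAA", PWM, threshold=4.0)
for start, site, score in HITS:
    print(start, site, round(score, 2))
```

Lowering the threshold reports more windows, which is where the many false positives listed under the cons come from.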

PWM genome scan
1) Download the PWMs of your TF of interest from the database (they might include more than one motif)
2) Define the sequences to analyze (promoter sequences)
3) Run the PWM genome scan (hit-based method or affinity prediction method)
4) Rank the genomic sequences by the affinity signal
Suggested Reading:
o Roider et al. Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics (2007).
o Thomas-Chollier et al. Transcription factor binding predictions using TRAP for the analysis of ChIP-seq data and regulatory SNPs. Nature Protocols (2011).

PWM-PSCM

TRAP
1) Convert the PSSM (position-specific scoring matrix) to a PSEM (position-specific energy matrix)
2) Scan the sequences of interest with TRAP
3) Results in one score per sequence (the binding affinity)
4) Does not report the exact TF binding sites (easier for ranking)
5) Sequences must have the same length!
ANNOTATE = /project/gbrowse/Pipeline/ANNOTATE_v3.02/Release
TRAP: trap.molgen.mpg.de/cgi-bin/home.cgi
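
To illustrate what "one score per sequence" means for ranking, here is a simplified stand-in in Python. It is not TRAP's biophysical model and does not use a PSEM: it simply sums an exponentiated log-odds score over every window of each sequence. The matrix values and the three promoter sequences are hypothetical.

```python
# Hypothetical log2-odds matrix for a 4-bp motif (one dict per position).
LOG_ODDS = [
    {"A": 1.4, "C": -1.8, "G": -0.8, "T": -0.8},
    {"A": -1.8, "C": 1.5, "G": -1.8, "T": -0.8},
    {"A": -0.8, "C": -1.8, "G": 1.5, "T": -1.8},
    {"A": -1.8, "C": -0.8, "G": -1.8, "T": 1.5},
]

# Hypothetical promoter sequences, all of the same length.
PROMOTERS = {
    "geneA": "TTACGTGGACGTAA",  # two motif matches
    "geneB": "TTACGTGGTTTTAA",  # one motif match
    "geneC": "TTTTTTGGTTTTAA",  # no motif match
}


def sequence_score(sequence, matrix):
    """Sum 2**score over all windows: one number per sequence, no site positions."""
    width = len(matrix)
    total = 0.0
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total += 2.0 ** sum(matrix[i][base] for i, base in enumerate(window))
    return total


def rank_sequences(sequences, matrix):
    """Rank sequence names by summed score; refuse unequal lengths."""
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) > 1:
        raise ValueError("sequences must have the same length to be comparable")
    return sorted(sequences,
                  key=lambda name: sequence_score(sequences[name], matrix),
                  reverse=True)


RANKED = rank_sequences(PROMOTERS, LOG_ODDS)
print(RANKED)
```

Because the score is a sum over all windows, longer sequences would accumulate more signal by construction, which is one way to see why the sequences must have the same length.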

Matrix-scan
1) Uses the PSSM directly
2) Finds all TFBS which exceed a predefined threshold (e.g. p-value)
3) More complicated to create ranked lists of genomic sequences (a sequence can contain several hits)
4) Exact location of the binding site is reported
Tool: matrix-scan

Finding the target genes
o Target genes will be the top-ranked genes (promoters)
o Which are the top-ranked genes? (top 100, top 500, ?)
o There is no exact definition of a promoter; usually 2000 bp upstream and 500 bp downstream of the TSS
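
The promoter convention above (2000 bp upstream, 500 bp downstream of the TSS) depends on the strand of the gene. A minimal Python sketch, assuming 0-based coordinates and hypothetical function names, together with a helper that picks the top-N genes from a score table:

```python
def promoter_region(tss, strand, upstream=2000, downstream=500):
    """Return (start, end) of the promoter around a TSS, strand-aware."""
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, "upstream" means larger coordinates.
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end  # do not run off the start of the chromosome


def top_ranked(scores, n):
    """Return the n gene names with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


print(promoter_region(10000, "+"))
print(promoter_region(10000, "-"))
print(top_ranked({"geneA": 1.0, "geneB": 3.0, "geneC": 2.0}, 2))
```

The cutoff `n` is exactly the open question on the slide; trying several values (100, 500, ...) and comparing the resulting lists is a reasonable way to see how sensitive the conclusions are.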

Microarrays
Analysis with R/Bioconductor (details later)

Microarrays (2)
Pros:
o There is a lot of microarray data already available (you might not have to generate the data yourself)
o Inexpensive and not very difficult to perform
o Computational workflow is well established
Cons:
o Cannot distinguish between indirect regulation and direct regulation

ChIP-seq
o Map reads to the genome
o Call peaks to determine the most likely TF binding locations

ChIP-seq (2)
Pros:
o Direct measure of genome-wide protein-DNA interaction(*)
Cons:
o Don't know whether binding causes changes in gene expression
o More complicated experimentally and in terms of computational analysis
o Most expensive
o Need an antibody against your protein of interest
o Biases are not as well understood as with microarrays

ChIP-seq analysis
1) Download the reads from the given source (experiments and controls)
2) Quality control of the reads and statistics (fastqc)
3) Mapping the reads to the reference genome (bwa/Bowtie)
4) Peak calling (MACS)
5) Visualization of the peaks in a genome browser (genome browser, IGV)
6) Finding the closest genes to the peaks (Bioconductor/ChIPpeakAnno)
[Figure: visualised peaks in a genome browser]
Suggested Reading:
o Bailey et al. Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. PLoS Comput Biol (2013).
o Thomas-Chollier et al. A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nature Protocols (2012).

Sequencing data
o raw data = reads
o usually a very large file (a few GB)
o format: fastq (ENCODE) or SRA (Sequence Read Archive of NCBI)
Analysis:
1) Quality control with fastqc
2) Filtering of reads with adapter sequences
3) Mapping of the reads to the reference genome (bwa or Bowtie)
[Figure: example of a fastq data file]

Quality control with fastqc
o per-base quality
o sequence quality (avg. > 20)
o sequence length
o sequence duplication level (duplication by PCR)
o overrepresented sequences/k-mers (adapter sequences)
o produces an HTML report
o manual (read it!)
Software at the MPI: FASTQC = /scratch/ngsvin/bin/chip-seq/fastqc/FastQC/fastqc
[Figure: example of per-base sequence quality scores]
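
To make the fastq format and the "average quality > 20" criterion concrete, here is a small Python sketch. It is not a replacement for fastqc. It assumes four-line fastq records and Phred+33 quality encoding (an assumption that should be checked for your data), and the two example reads are made up.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record fastq file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Average Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


# Two made-up reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n"

RECORDS = list(read_fastq(io.StringIO(EXAMPLE)))
PASSING = [name for name, _seq, qual in RECORDS if mean_phred(qual) > 20]
print(PASSING)
```

For real data the same generator can be fed a file opened with `gzip.open(path, "rt")`, since fastq files are usually distributed compressed.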

Mapping with bwa
o mapping the sequencing reads to a reference genome
o manual (read it!)
o map the experiments and the controls
1) reference genome in fasta format (hg19)
2) create an index of the reference file for faster mapping (only if not available)
3) align the reads (specify parameters, e.g. number of mismatches, read trimming, threads used...)
4) generate alignments in the SAM format (different commands for single-end and paired-end reads!)
Software and data at the MPI:
o BWA = /scratch/ngsvin/bin/executables/bwa
o hg19: /scratch/ngsvin/MappingIndices/hg19.fa
o bwa index: /scratch/ngsvin/MappingIndices/BWA/hg19
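
After mapping, a quick sanity check is the fraction of reads that actually aligned. The Python sketch below reads SAM text lines and uses the FLAG field (column 2): bit 0x4 marks an unmapped read, and bits 0x100 and 0x800 mark secondary and supplementary alignments, which are skipped so that each read is counted once. The example records are invented.

```python
def mapping_stats(sam_lines):
    """Return (total_reads, mapped_reads) from an iterable of SAM lines."""
    total = mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800) alignment
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return total, mapped


# Invented example: two mapped reads (forward and reverse) and one unmapped read.
EXAMPLE_SAM = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

TOTAL, MAPPED = mapping_stats(EXAMPLE_SAM)
print(f"{MAPPED}/{TOTAL} reads mapped")
```

The same function works on a real file by passing the open file handle, because it only iterates over lines.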

File manipulation with samtools
o utilities that manipulate SAM/BAM files
o manual (read it!)
1) merge the replicates into one file (still keep experiment and control separate)
2) convert the SAM file into a BAM file (binary version of SAM, smaller)
3) sort and index the BAM file
Now the sequencing files are ready for further analysis.
Software at the MPI: SAMTOOLS = /scratch/ngsvin/bin/executables/samtools

Peak finding with MACS
o find the peaks, i.e. the regions with a high density of reads, where the studied TF was bound
o manual (read it!)
1) call the peaks using the experiment (treatment) data vs. control
2) set the parameters, e.g. fragment length, treatment of duplicate reads
3) analyse the MACS results (BED file with peaks/summits)
Software at the MPI: MACS = /scratch/ngsvin/bin/executables/macs

Finding the target genes
o find the genes which are closest to the (significant) peaks
o how to define the closest distance? (+- X kb)
o use ChIPpeakAnno in Bioconductor or bedtools
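
The peak-to-gene assignment can be sketched in plain Python to show what ChIPpeakAnno or bedtools do conceptually. This is only an illustration: the gene and peak coordinates are invented, the peak midpoint is used as a stand-in for the summit, and the 10 kb window is an arbitrary choice for the "+- X kb" parameter.

```python
def nearest_gene(peak_chrom, position, genes, max_distance=10_000):
    """Return (gene_name, distance) of the closest TSS within max_distance, or None."""
    best = None
    for name, chrom, tss in genes:
        if chrom != peak_chrom:
            continue
        distance = abs(position - tss)
        if distance <= max_distance and (best is None or distance < best[1]):
            best = (name, distance)
    return best


# Invented annotation: (gene name, chromosome, TSS position).
GENES = [("geneA", "chr1", 5000), ("geneB", "chr1", 50000), ("geneC", "chr2", 5200)]

# Invented peaks in BED-like form: (chromosome, start, end).
PEAKS = [("chr1", 4000, 4400), ("chr1", 30000, 30200), ("chr2", 5100, 5500)]

TARGETS = {}
for chrom, start, end in PEAKS:
    midpoint = (start + end) // 2
    hit = nearest_gene(chrom, midpoint, GENES)
    if hit is not None:
        gene, distance = hit
        TARGETS.setdefault(gene, []).append((chrom, start, end, distance))

print(TARGETS)
```

The second peak lies about 20 kb from the nearest TSS and is therefore not assigned to any gene, which shows how strongly the choice of X shapes the target list.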

Methods for Identifying TF Target Genes
o Microarray
o PWM Genome Scan
o ChIP-seq
Each method needs its own thresholds.

Bioinformatics
ChIP-seq:
o Read mapping (Bowtie/bwa)
o Peak calling (MACS/Bioconductor)
o Peak-target analysis (Bioconductor)
Microarray:
o Microarray data analysis (Bioconductor)
o Differential genes (R)
o GSEA
PWM:
o PWM genome scan (TRAP/MatScan)
o Statistics (R)
Combined:
o Data integration (R/Python/Perl)
o Statistical analysis (R)
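
For the data integration step, a simple first comparison is how much the target gene lists from the three methods overlap. The Python sketch below uses the Jaccard index on made-up gene lists. It is one possible comparison, not a prescribed analysis, and the gene names are placeholders.

```python
from itertools import combinations


def jaccard(first, second):
    """Size of the intersection divided by the size of the union."""
    first, second = set(first), set(second)
    union = first | second
    return len(first & second) / len(union) if union else 0.0


# Made-up target lists, one per method.
TARGET_LISTS = {
    "pwm": ["g1", "g2", "g3", "g4"],
    "microarray": ["g2", "g3", "g5"],
    "chipseq": ["g3", "g4", "g5", "g6"],
}

OVERLAPS = {
    (a, b): jaccard(TARGET_LISTS[a], TARGET_LISTS[b])
    for a, b in combinations(TARGET_LISTS, 2)
}

# Genes supported by all three methods.
CONSENSUS = set.intersection(*(set(genes) for genes in TARGET_LISTS.values()))

for pair, value in OVERLAPS.items():
    print(pair, round(value, 2))
print("consensus:", sorted(CONSENSUS))
```

Since every list depends on its own threshold, it is worth recomputing the overlaps for several cutoffs before drawing conclusions about which methods agree.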

Bioinformatics tools
o Bowtie: bowtie-bio.sourceforge.net/manual.shtml
o bwa: bio-bwa.sourceforge.net/bwa.shtml
o MACS: github.com/taoliu/MACS/blob/macs_v1/README.rst
o TRAP: trap.molgen.mpg.de/cgi-bin/home.cgi
o matrix-scan
o Bioconductor (more info in the R course)
READ THE MANUALS!
Databases
o GEO
o ENCODE: genome.ucsc.edu/ENCODE/
o SRA
o JASPAR

Schedule
o Introduction lecture, R course
o R & Bioconductor homework submission
o Presentation of the detailed plan of each group (which TF, cell line, tools, data, data integration, team work), 10:30am, 11:30am
o Progress meetings every Tuesday, 10:30am, 11:30am
o Final report deadline (tentative)
o Presentations
o Final meeting, discussion of final reports

GR Group
o Expression and ChIP-seq data: Luca F, Maranville JC, et al., PLoS ONE, 2013
o PWM database: jaspar.genereg.net

c-Myc Group
o Expression data: Cappellen, Schlange, Bauer et al., EMBO reports, 2007; Musgrove et al., PLoS One, 2008
o ChIP-seq data: ENCODE Project
o PWM database: jaspar.genereg.net

Additional analysis
Binding motifs:
o Are the overrepresented motifs in the ChIP-peak regions different?
o Do we find any co-factors?
Recommended tool: RSAT, rsat.ulb.ac.be
[Figure: binding motifs]