Detecting binding sites for transcription factors by correlating sequence data with expression. Erik Aurell Adam Ameur Jakub Orzechowski Westholm in collaboration.


Detecting binding sites for transcription factors by correlating sequence data with expression.
Erik Aurell, Adam Ameur, Jakub Orzechowski Westholm, in collaboration with AstraZeneca

Outline of the talk
- Introduction
- Data description
- The REDUCE method
- Results
- Applications and Conclusions
Ameur, Orzechowski 11/3 2003

Introduction - the REDUCE method
The aim is to find binding sites for transcription factors (motifs) in the human genome, using a method developed at Rockefeller University (Bussemaker, Li & Siggia 2001). The method is called REDUCE and has previously been applied only to yeast data; we apply it to human data. The idea is to find motifs by correlating sequence data with expression data.
Input: expression data, sequence data and a set of putative motifs.
Output: a list of significant motifs with the columns consensus, id, description, , F, probes and hits (the values of the last four columns are not preserved):

consensus              id      description
NNNRRCCAATSRGNNN       M00287  NF-Y
NNNCGGCCATCTTGNCTSNW   M00069  YY
NNRACAGGTGYAN          M00060  Sn
NNNRGGNCAAAGKTCANNN    M00134  HNF
TWTTTAATTGGTT          M00424  NKX
KNNKNNTYGCGTGCMS       M00235  AhR/Arnt
NANCACGTGNNW           M00123  c-Myc/Max
NNBTNTNCTATTTNTT       M00092  BR-CZ
NNGAATATKCANNNN        M00136  Oct

Expression data
Expression data is provided by AstraZeneca. It consists of 81 samples of human cerebral cortex stem cells undergoing various treatments. Expression levels are measured on an Affymetrix U133 chip. We visualize the expression data in a heatmap, in which regions of correlated genes can be identified.

Sequence data
In the REDUCE model, expression levels are explained by the number of times the motifs occur in the upstream sequences of human genes. For this, sequences around the transcription start sites are extracted; we take the range from 1000 bp upstream to 100 bp downstream. Transcription start sites and genome data are provided by AstraZeneca. The upstream sequences are masked for repeats with the program RepeatMasker, and the putative motifs are matched to the resulting sequences. In the example, the motif TKAAA and its reverse complement TTTMA are matched.
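The reverse-complement step and the counting of motif occurrences can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the motif is given as an IUPAC consensus string, and masked positions are assumed to be written as N in the sequence so that they never match.

```python
# IUPAC nucleotide codes mapped to the concrete bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement of each IUPAC code (K = G/T pairs with M = A/C, and so on).
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def reverse_complement(motif):
    """Return the reverse complement of an IUPAC consensus string."""
    return "".join(COMPLEMENT[code] for code in reversed(motif))


def matches_at(motif, seq, pos):
    """True if the motif matches seq at pos; a masked base (N) never matches."""
    window = seq[pos:pos + len(motif)]
    if len(window) != len(motif):
        return False
    return all(base in IUPAC[code] for code, base in zip(motif, window))


def count_occurrences(motif, seq):
    """Count positions where the motif or its reverse complement matches."""
    rc = reverse_complement(motif)
    hits = 0
    for pos in range(len(seq) - len(motif) + 1):
        if matches_at(motif, seq, pos) or matches_at(rc, seq, pos):
            hits += 1
    return hits


if __name__ == "__main__":
    print(reverse_complement("TKAAA"))              # TTTMA
    print(count_occurrences("TKAAA", "CCTGAAACC"))  # 1
```

A position that matches on both strands is counted once here; whether the original analysis counted such positions once or twice is not stated in the slides.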

Motifs
Motifs are represented as weight matrices: w(i,B) is the probability that position i of the motif M is the nucleotide B. We generate the set of putative motifs as weight matrices. This can be done in several ways:
- One possibility is to use the matrices (about 300) in the TransFac database.
- Another possibility is to generate matrices of our own, for example for all sequences of a certain length. Since the number of possible sequences grows exponentially with the length, this is only feasible for sequences up to length 7 or 8.
We have implemented a method based on Gibbs sampling to match weight matrices to upstream regions.
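To make the exponential growth concrete, the sketch below enumerates all sequences of a given length and turns one of them into a weight matrix. The helper names and the pseudo parameter are assumptions for illustration; the slides do not say how fixed sequences were converted to matrices.

```python
from itertools import product

BASES = "ACGT"


def all_kmers(k):
    """Yield every DNA sequence of length k (there are 4**k of them)."""
    for letters in product(BASES, repeat=k):
        yield "".join(letters)


def kmer_to_matrix(kmer, pseudo=0.01):
    """Turn a fixed sequence into a weight matrix (one dict per position).

    The base of the k-mer gets probability 1 - 3*pseudo and every other
    base gets pseudo, so each row sums to 1 and no entry is zero.
    The pseudo value is a hypothetical choice, not taken from the slides.
    """
    return [
        {b: (1.0 - 3 * pseudo) if b == base else pseudo for b in BASES}
        for base in kmer
    ]


if __name__ == "__main__":
    for k in (5, 7, 8):
        print(k, 4 ** k)   # 5 1024, 7 16384, 8 65536
```

Avoiding zero entries matters if the matrix is later scored with logarithms, since the logarithm of zero is undefined.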

Matching motifs to the upstream sequences
A weight matrix W is matched to a sequence s_1 s_2 … s_n in the following way: for each of the bases s_1, s_2, …, s_n we extract the corresponding weight matrix entry w(i,s_i) and compute a sum over the positions i [equation not preserved]. Here b_{s_i} is the background frequency of base s_i. The score is then compared to a threshold value [threshold not preserved].
An example: assume we have the sequence AATCG and a weight matrix [matrix not preserved]. If all background frequencies are 0.25, this gives the score [value not preserved].
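Since the scoring formula itself is missing, the sketch below assumes the standard log-odds score, the sum over positions i of log(w(i,s_i) / b_{s_i}), which is consistent with the symbols defined above but is not confirmed by the slides. The matrix values, the natural logarithm and the threshold are all hypothetical choices for illustration.

```python
import math


def log_odds_score(matrix, window, background):
    """Sum of log(w(i, s_i) / b_{s_i}) over the positions of the window.

    All matrix entries must be positive, otherwise the logarithm fails.
    """
    return sum(
        math.log(row[base] / background[base])
        for row, base in zip(matrix, window)
    )


def scan(matrix, seq, background, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for pos in range(len(seq) - width + 1):
        window = seq[pos:pos + width]
        # Skip windows containing masked or ambiguous bases such as N.
        if any(base not in background for base in window):
            continue
        score = log_odds_score(matrix, window, background)
        if score >= threshold:
            hits.append((pos, score))
    return hits


# Hypothetical matrix favouring AATCG: 0.7 for the preferred base, 0.1 otherwise.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
MATRIX = [
    {b: 0.7 if b == preferred else 0.1 for b in "ACGT"}
    for preferred in "AATCG"
]

if __name__ == "__main__":
    print(log_odds_score(MATRIX, "AATCG", BACKGROUND))
    print(scan(MATRIX, "GGAATCGGG", BACKGROUND, threshold=3.0))
```

With uniform background frequencies of 0.25, a matrix whose entries are all 0.25 scores exactly zero on any window, so positive scores indicate windows that look more like the motif than like background.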

Pre-processing and REDUCE

REDUCE output
(The values of the , F, probes and hits columns are not preserved.)

consensus              id      description
NNNRRCCAATSRGNNN       M00287  NF-Y
NNNCGGCCATCTTGNCTSNW   M00069  YY
NNRACAGGTGYAN          M00060  Sn
NNNRGGNCAAAGKTCANNN    M00134  HNF
TWTTTAATTGGTT          M00424  NKX
KNNKNNTYGCGTGCMS       M00235  AhR/Arnt
NANCACGTGNNW           M00123  c-Myc/Max
NNBTNTNCTATTTNTT       M00092  BR-CZ
NNGAATATKCANNNN        M00136  Oct

consensus - A consensus sequence for the motif.
id - A unique id for each motif.
description - The transcription factor name.
 - The significance of the motif.
F - The effect. A positive value indicates activation and a negative value repression.
probes - Number of probes with occurrences of the motif in their upstream regions.
hits - Total number of motif occurrences.
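How an effect F with a sign can arise from motif counts and expression levels can be illustrated with a single-motif least-squares fit. The sketch assumes a linear model of the form A_g = C + F * N_g, where N_g is the number of motif occurrences upstream of gene g and A_g its expression level; this follows the general idea of REDUCE but is a simplified illustration, not the published algorithm, and the fraction of variance explained is only a stand-in for the significance column. All names are hypothetical.

```python
def fit_single_motif(counts, expression):
    """Least-squares fit of expression = C + F * counts for one motif.

    Returns (F, C, explained), where explained is the fraction of the
    expression variance accounted for by the motif counts.
    """
    n = len(counts)
    if n != len(expression) or n == 0:
        raise ValueError("counts and expression must have equal, non-zero length")
    mean_n = sum(counts) / n
    mean_a = sum(expression) / n
    s_nn = sum((c - mean_n) ** 2 for c in counts)
    s_na = sum((c - mean_n) * (a - mean_a) for c, a in zip(counts, expression))
    if s_nn == 0:
        raise ValueError("motif count is the same for every gene")
    effect = s_na / s_nn                 # F: positive = activation, negative = repression
    intercept = mean_a - effect * mean_n
    total = sum((a - mean_a) ** 2 for a in expression)
    explained = (effect * s_na) / total if total > 0 else 0.0
    return effect, intercept, explained


if __name__ == "__main__":
    # Toy data: expression drops as the motif count rises, so F is negative.
    print(fit_single_motif([0, 1, 2, 3], [1.0, 0.5, 0.0, -0.5]))
```

In the toy data the fitted F is negative, which under the legend above would be read as a repressing motif.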

Visualizing REDUCE output
REDUCE output can be visualized in a heatmap. The motifs in this heatmap are taken from TransFac. Green dots indicate repressing motifs and red dots indicate activating motifs. The heatmap clusters the samples by motif.

Analyzing REDUCE output
Validation: The pictures below show the samples clustered on expression and on motifs.
Analysis of significant motifs: By analyzing the motifs found by REDUCE we hope to find motifs that explain clusters of correlated genes. For example, REDUCE found a TransFac motif in the samples associated with the red area in the picture. It matches 18% of the 109 genes in the picture, and 4% of the other genes.
Finding new motifs: One iteration of REDUCE was run on all sequences of length 5.

Applications
- Identify coregulated genes with potentially different expression profiles, using the motifs found by REDUCE.
- Predict previously unknown motifs, or new properties of known ones.
Conclusions
Our results on human data had somewhat lower significance than the previous results on yeast presented in (Bussemaker, Li & Siggia, 2001). There are several possible causes for this:
- Data quality: expression data, upstream regions.
- Findings are hard to validate.
- Gene regulation is probably more complicated in human.
Even so, our results suggest that the REDUCE method might give useful information about transcription factor binding sites in humans. This probably requires prior knowledge about motifs and other methods such as clustering.