Pattern Discovery and Recognition for Understanding Genetic Regulation Timothy L. Bailey Institute for Molecular Bioscience University of Queensland.


Recent Work: identifying statistically significant regulatory modules; computing motif statistics; evaluation of motif discovery algorithms. Future directions: motif discovery in sets of orthologous sequences.

Identifying Statistically Significant Regulatory Modules: overview of the problem; previous research; the MCAST algorithm; validation; discussion.

Problem Statement: Given a set of one or more motifs, can we identify the genes that they regulate by searching a genomic database?

The Problem is Hard: The futility theorem: the vast majority of potential TF binding sites are false positives (Wasserman). This is because TF binding sites are short and degenerate, so they occur frequently at random in DNA.
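A back-of-the-envelope sketch of why short sites occur frequently at random. It assumes an i.i.d. background model and counts exact matches on one strand only; the site, base frequencies, and sequence length are hypothetical values chosen for illustration, not taken from the talk.

```python
from math import prod

def expected_random_hits(site, background, seq_len):
    """Expected number of exact matches to `site` on one strand of
    random DNA drawn i.i.d. from `background` (an assumed model)."""
    # Probability that the site matches at one fixed position.
    p_match = prod(background[base] for base in site)
    # Number of positions where a site of this width can start.
    positions = seq_len - len(site) + 1
    return positions * p_match

# Hypothetical inputs: a 6-bp site in 1 Mb of random sequence.
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
hits = expected_random_hits("GAGTCA", background, seq_len=1_000_000)
print(f"expected chance matches: {hits:.1f}")
```

Under these assumptions a single 6-bp site is expected to match by chance a couple of hundred times per megabase, and a degenerate motif matches more often still, which is the intuition behind the futility theorem.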

The Approach: Groups of transcription factors often operate in concert, binding near each other. Multiple binding sites for the same TF often occur close together. Whereas individual binding sites cannot be statistically significant, clusters may be.

MCAST: A hybrid of Cisanalyst and COMET, based on Meta-MEME (Grundy et al., CABIOS 13: , 1997). MCAST has two input parameters: the motif p-value threshold (p) and the maximum gap size (L). MCAST builds a motif-based HMM and uses the Viterbi algorithm to find clusters.

Definition of a Motif Cluster: A “cluster” is a collection of “hits” (matches to motifs) with no gaps longer than L. Hits are shown schematically as beads on a string. The number is the motif identifier. +/- indicates which DNA strand the hit is on.
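The cluster definition can be illustrated with a simple greedy grouping. This is only a sketch of the definition, not MCAST's method (MCAST uses an HMM and the Viterbi algorithm); the `Hit` type, its field names, and the example coordinates are hypothetical.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    start: int   # 0-based start of the motif match
    end: int     # exclusive end of the motif match
    motif: int   # motif identifier
    strand: str  # "+" or "-"

def group_hits(hits, max_gap):
    """Group hits so that no gap inside a cluster is longer than max_gap."""
    clusters = []
    cluster_end = None  # right-most end seen in the current cluster
    for hit in sorted(hits, key=lambda h: h.start):
        if clusters and hit.start - cluster_end <= max_gap:
            clusters[-1].append(hit)
            cluster_end = max(cluster_end, hit.end)
        else:
            clusters.append([hit])
            cluster_end = hit.end
    return clusters

# Hypothetical hits: gaps of 10, 160 and 5 bases between neighbours.
hits = [Hit(10, 20, 1, "+"), Hit(30, 40, 2, "-"),
        Hit(200, 210, 1, "+"), Hit(215, 225, 3, "+")]
clusters = group_hits(hits, max_gap=50)
print([[h.motif for h in c] for c in clusters])
```

With L = 50 the 160-base gap splits the four hits into two clusters of two hits each.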

Cluster Scoring Function: [Figure: one cluster on genomic DNA, labeled with hit scores h1 to h4, gap widths d2 to d4, and a gap penalty.]
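The slide names the ingredients of the score (hit scores, gap widths, and a gap penalty) but the transcript does not preserve how they are combined. The sketch below assumes a simple linear form, the sum of hit scores minus a per-base gap penalty times the total gap width; that form, the function name, and the numbers are assumptions for illustration only.

```python
def cluster_score(hit_scores, gap_widths, gap_penalty):
    """Assumed linear score: sum of hit scores minus penalised gap width.

    hit_scores  -- one score per hit, in genomic order (h1, h2, ...)
    gap_widths  -- width of each gap between consecutive hits (d2, d3, ...)
    gap_penalty -- hypothetical cost per base of gap
    """
    if len(gap_widths) != len(hit_scores) - 1:
        raise ValueError("need exactly one gap between each pair of hits")
    return sum(hit_scores) - gap_penalty * sum(gap_widths)

# Hypothetical cluster with four hits and three gaps, as in the figure.
score = cluster_score([8.0, 6.5, 7.0, 9.5], [12, 40, 5], gap_penalty=0.1)
print(score)
```

Under this assumed form, widely spaced hits contribute less than tightly packed ones, so a cluster scores well only when its hits are both strong and close together.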

Performance metrics: ROC50 measures the area under a curve that plots true positive rate as a function of false positive rate, up to the 50th false positive. KB60 is the average number of kilobases per false positive at a threshold that yields 60% sensitivity. For both metrics, larger is better.
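A minimal sketch of a truncated ROC score such as ROC50, computed from a ranked list of results. The normalization (dividing by n times the number of positives so that a perfect ranking scores 1) and the treatment of lists with fewer than n false positives are assumptions; the talk does not state which convention was used.

```python
def roc_n(ranked_labels, n=50):
    """Truncated ROC area up to the n-th false positive.

    ranked_labels -- booleans ordered from best to worst score;
                     True marks a true positive, False a false positive.
    """
    total_pos = sum(ranked_labels)
    if total_pos == 0:
        return 0.0
    tp_seen = 0
    fp_seen = 0
    area = 0
    for is_positive in ranked_labels:
        if is_positive:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen  # positives ranked above this false positive
            if fp_seen == n:
                break
    # Assumption: if the list holds fewer than n false positives, the
    # missing ones are treated as ranking below every positive.
    area += (n - fp_seen) * total_pos
    return area / (n * total_pos)

example = roc_n([True, True, False, True, False], n=2)
print(round(example, 3))
```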

Four Data sets: (1) Drosophila Eve regulators (Bcd, Cad, Hb, Kr, Kni): 19 positives and 2039 putative negatives. (2) Human LSF-regulated promoters (LSF, Sp1, Ets, TATA): 9 positives and 2005 putative negatives. (3) Human muscle-specific promoters (Mef-2, Myf, SRF, Tef, Sp1): 27 positives and 2005 putative negatives. (4) Muscle*: motifs generated without muscle-specific genes.

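Comparison with COMET: [Table: KB60 and ROC50 for MCAST and for COMET on the Drosophila, LSF, muscle, and muscle* data sets; the numeric values are not preserved in this transcript.] Red indicates better performance.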

Computing motif statistics: Looking for fast ways to compute the probability of a local, multiple alignment. Objective function of the latest version of the MEME algorithm.

Computing the statistics of random alignments: Knowing the statistical significance of motifs makes it possible to distinguish “real” motifs from patterns that can be explained by chance. Computing motif significance is therefore critical to any motif discovery approach.

Measuring the goodness of DNA regulatory motifs: IC
Sequences:
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
Alignment (rows 1, 2, …, N; columns 1, 2, …, w; indices i and j):
1 GACATCGAAA
2 GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
N TGTGAAGCAC
Counts: n_ij. Frequencies: f_ij = n_ij / N. Information content: IC = IC_1 + … + IC_w.
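The slide gives IC as a sum over columns but does not spell out the per-column term. The sketch below assumes the common relative-entropy form, IC_j = sum over bases of f * log2(f / background), with a uniform background and no small-sample correction; those choices are assumptions. The rows are the alignment rows visible on the slide, which may not be the complete alignment.

```python
from collections import Counter
from math import log2

def column_ic(column, background):
    """Relative entropy (in bits) of one alignment column vs. background."""
    n = len(column)
    counts = Counter(column)  # n_ij: count of each base in this column
    # f_ij = n_ij / N; bases with zero count contribute nothing.
    return sum((c / n) * log2((c / n) / background[base])
               for base, c in counts.items())

def alignment_ic(rows, background):
    """IC = IC_1 + ... + IC_w for an ungapped alignment of equal-width rows."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same width")
    return sum(column_ic([row[j] for row in rows], background)
               for j in range(width))

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed background
rows = ["GACATCGAAA", "GCACTTCGGC", "GAGTCATTAC",
        "GTAAATTGTC", "CCACAGTCCG", "TGTGAAGCAC"]
print(f"IC = {alignment_ic(rows, uniform):.2f} bits")
```

A perfectly conserved column scores 2 bits under a uniform background, and a column that matches the background frequencies scores 0.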

POP (product of IC p-values): IC is the sum of the information contents of the motif columns. POP is an alternative measure of motif quality: the product of the p-values of the column information contents.

Statistics of IC scores: A large deviation method for computing the distribution of the IC of random alignments is known (Hertz and Stormo, Bioinformatics, 15: , 1999). The time to compute the p-value of one IC score is O(N^2). MEME computes O(w^2 N) IC scores per motif, so the total time, O(w^2 N^3), is prohibitive. POP p-values can be computed efficiently.

Correction factor for POP p-values: The p-value of a POP score, p, is roughly: [equation not preserved in this transcript]. Because of the discrete nature of IC p-values, it is necessary to correct the POP p-values. Empirically, letting x = ln(p), the p-value error for a POP score p is about: [equation not preserved in this transcript].
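The slide's own formulas are missing from the transcript. As a hedged stand-in, the sketch below uses the standard result for a product of w independent p-values that are uniform on (0, 1): P(product <= p) = p * sum_{i=0}^{w-1} (-ln p)^i / i!. Whether this is exactly the approximation shown on the slide is an assumption, and it does not include the discreteness correction the slide describes.

```python
from math import factorial, log

def product_pvalue(p, w):
    """P-value of observing a product <= p from w independent,
    continuous uniform(0, 1) p-values (standard result, assumed here)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if w < 1:
        raise ValueError("w must be at least 1")
    x = -log(p)
    return p * sum(x ** i / factorial(i) for i in range(w))

# Hypothetical example: product of two column p-values equal to 0.05.
print(product_pvalue(0.05, w=2))
```

For w = 1 the function returns p unchanged, and for larger w the p-value grows because a small product becomes easier to obtain by chance.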

Estimating the POP p-value correction factor parameters: To estimate the correction factor parameters we (1) estimate the right tail of the distribution using a convolution method, and (2) fit the (non-linear) correction function to the tail of the distribution using a least squares approach. The CPU time per motif to compute POP p-values is negligible once the correction factor parameters are known.

[Figure: CPU time per motif using LD method to compute p-values, w=16.]

[Figure: CPU time to estimate correction factor parameters, w=16.]

[Figure: Speedup using POP statistic.]

Discovering regulatory elements in orthologous genes: De novo discovery of most known regulatory elements in yeast has been demonstrated using four closely related yeast genomes (Kellis et al., Nature 423: , 2003). We are exploring the possibility of extending their approach to the human genome using orthologous genes from mouse.

Evaluation of motif discovery algorithms: Joint work with Martin Tompa and others. Eighteen motif discovery algorithms were evaluated on DNA regulatory motifs in four organisms. Each algorithm was run by experts in that particular algorithm. The ability of the algorithm to discover motifs in sets of DNA sequences was measured.

[Figure: Performance of Motif Discovery Algorithms Finding Regulatory Motifs.]

Conservation of known regulatory elements in sets of orthologous genes: [Figure: two panels, Human vs. Mouse and Four yeast species, each comparing regulatory elements with background sequences.] Source: Liu et al., Genome Res 14: ,

Large-scale discovery of human regulatory elements: Compared with yeast, regulatory elements make up less of human intergenic DNA (3% vs. 15%). The relative difference in conservation rate (window percent identity) between human and mouse regulatory elements and background sequence is higher than among the four yeast species. Large-scale motif discovery should be possible using human and mouse orthologous genes.
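A minimal sketch of a window percent identity calculation over a pairwise alignment. The talk does not define the measure precisely, so the definition used here (matching columns divided by window width, with gap columns counted as mismatches) and the example sequences are assumptions.

```python
def window_percent_identity(aligned_a, aligned_b, window, step=1):
    """Percent identity in sliding windows over two aligned sequences.

    aligned_a, aligned_b -- equal-length aligned strings; "-" marks a gap
    window               -- window width in alignment columns
    step                 -- distance between consecutive window starts
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identities = []
    for start in range(0, len(aligned_a) - window + 1, step):
        pairs = zip(aligned_a[start:start + window],
                    aligned_b[start:start + window])
        # A column counts as identical only if both bases match and
        # neither is a gap (an assumed convention).
        matches = sum(a == b and a != "-" for a, b in pairs)
        identities.append(100.0 * matches / window)
    return identities

# Hypothetical aligned fragments from two species.
identity = window_percent_identity("ACGTACGT", "ACGTTCGT", window=4, step=4)
print(identity)
```

Comparing such window scores between annotated regulatory elements and surrounding background sequence gives the kind of relative conservation difference the slide refers to.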