Pattern Discovery and Recognition for Understanding Genetic Regulation Timothy L. Bailey Institute for Molecular Bioscience University of Queensland.

Recent Work Identifying statistically significant regulatory modules Computing motif statistics Evaluation of motif discovery algorithms Future directions: motif discovery in sets of orthologous sequences

Identifying Statistically Significant Regulatory Modules Overview of the problem Previous research The MCAST algorithm Validation Discussion

Problem Statement Given a set of one or more motifs, can we identify the genes that they regulate by searching a genomic database?

The Problem is Hard The futility theorem: the vast majority of potential TF binding sites are false positives (Wasserman). This is because TF binding sites are short and degenerate, so they occur frequently at random in DNA.

The Approach Groups of transcription factors often operate in concert, binding near each other. Multiple binding sites for the same TF often occur close together. Whereas individual binding sites cannot be statistically significant, clusters may be.

MCAST Hybrid of Cisanalyst and COMET. Based on Meta-MEME (Grundy et al., CABIOS 13:397-406, 1997). MCAST has two input parameters: motif p-value threshold (p) and maximum gap size (L). MCAST builds a motif-based HMM and uses the Viterbi algorithm to find clusters.

Definition of a Motif Cluster A “cluster” is a collection of “hits” (matches to motifs) with no gaps longer than L. Hits are shown schematically as beads on a string. The number is the motif identifier. +/- indicates which DNA strand the hit is on. Example: +3 -2 +1
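The cluster definition above can be illustrated with a minimal Python sketch. This is not MCAST itself, which finds clusters with a motif-based HMM and the Viterbi algorithm; it is a simple greedy grouping that only enforces the "no gaps longer than L" rule. The `Hit` type, its fields, and the function name `group_hits` are hypothetical names chosen for illustration.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One motif match (hypothetical record layout, for illustration only)."""
    start: int   # 0-based start position on the sequence
    end: int     # end position, exclusive
    motif: int   # motif identifier
    strand: str  # "+" or "-"


def group_hits(hits: List[Hit], max_gap: int) -> List[List[Hit]]:
    """Greedily group hits so that no gap inside a cluster exceeds max_gap (L)."""
    clusters: List[List[Hit]] = []
    cluster_end = 0  # rightmost end position seen in the current cluster
    for hit in sorted(hits, key=lambda h: h.start):
        if clusters and hit.start - cluster_end <= max_gap:
            clusters[-1].append(hit)
            cluster_end = max(cluster_end, hit.end)
        else:
            clusters.append([hit])  # gap too large: start a new cluster
            cluster_end = hit.end
    return clusters


example = [Hit(0, 10, 3, "+"), Hit(15, 25, 2, "-"), Hit(100, 110, 1, "+")]
clusters = group_hits(example, max_gap=20)
```

With a maximum gap of 20, the first two hits fall into one cluster and the third starts a new one.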

Cluster Scoring Function [Figure: one cluster on genomic DNA, labeled with hit scores h1, h2, h3, h4, gap widths d2, d3, d4, and the gap penalty.]

Performance metrics ROC50 measures the area under a curve that plots true positive rate as a function of false positive rate, up to the 50th false positive. KB60 is the average number of kilobases per false positive at a threshold that yields 60% sensitivity. For both metrics, larger is better.
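As a rough illustration of the ROC50 metric, the sketch below computes a truncated ROC score from a ranked list of true/false labels. It assumes the commonly used normalization, in which the number of true positives ranked above each of the first n false positives is summed and divided by n times the total number of positives. It also assumes that if fewer than n false positives exist, the missing ones are treated as ranked below everything. Neither convention is stated in the slides, and the function name `roc_n` is hypothetical.

```python
from typing import Sequence


def roc_n(labels_ranked: Sequence[bool], n: int = 50) -> float:
    """Truncated ROC score.

    labels_ranked: labels ordered from best score to worst; True marks a positive.
    n: number of false positives at which the curve is truncated (50 for ROC50).
    """
    total_pos = sum(1 for label in labels_ranked if label)
    if total_pos == 0 or n <= 0:
        raise ValueError("need at least one positive and n > 0")
    true_pos = 0
    false_pos = 0
    area = 0
    for is_positive in labels_ranked:
        if is_positive:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this false positive
            if false_pos == n:
                break
    # Assumption: absent false positives count as ranked after every item.
    area += (n - false_pos) * true_pos
    return area / (n * total_pos)


score = roc_n([True, False, True, False], n=2)
```

A perfect ranking scores 1.0 and a ranking that places every negative first scores 0.0, so larger is better, as stated above.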

Four Data sets Drosophila Eve regulators (Bcd, Cad, Hb, Kr, Kni). 19 positives and 2039 putative negatives. Human LSF-regulated promoters (LSF, Sp1, Ets, TATA). 9 positives and 2005 putative negatives. Human muscle-specific promoters (Mef-2, Myf, SRF, Tef, Sp1). 27 positives and 2005 putative negatives. Muscle*: motifs generated without muscle-specific genes.

Comparison with COMET

            MCAST            COMET
Data set    KB60    ROC50    KB60    ROC50
Drosoph     >4041   0.68     1010    0.61
LSF         167     0.44     85      0.35
muscle      30      0.38     69      0.46
muscle*     14      0.16     6       0.25

In the original slide, red indicates better performance.

Computing motif statistics Looking for fast ways to compute the probability of a local, multiple alignment. Objective function of the latest version of the MEME algorithm.

Computing the statistics of random alignments Knowing the statistical significance of motifs makes it possible to distinguish “real” motifs from patterns that can be explained by chance. Computing motif significance is therefore critical to any motif discovery approach.

Measuring the goodness of DNA regulatory motifs: IC

Sequences:
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3

Alignment (rows 1 … N, columns i = 1, 2, …, w):
1  GACATCGAAA
2  GCACTTCGGC
   GAGTCATTAC
   GTAAATTGTC
   CCACAGTCCG
N  TGTGAAGCAC

Counts: n_ij
Frequencies: f_ij = n_ij / N
Information Content: IC = IC_1 + … + IC_w

POP: product of IC p-values IC is the sum of the information contents of the motif columns. POP is an alternative measure of motif quality: the product of the p-values of the column information contents.
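The IC computation described above (counts, then frequencies, then a sum over columns) can be sketched in a few lines of Python. The slides do not give the per-column formula, so this sketch assumes the usual relative-entropy form, IC_i = sum over letters j of f_ij * log2(f_ij / b_j), with a uniform background b_j = 0.25 and logarithms in base 2. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Optional, Sequence

UNIFORM_DNA = {base: 0.25 for base in "ACGT"}  # assumed background frequencies


def column_ic(column: Sequence[str],
              background: Optional[Dict[str, float]] = None) -> float:
    """Information content (bits) of one alignment column."""
    bg = background or UNIFORM_DNA
    n = len(column)
    counts = Counter(column)  # n_ij for this column
    # Letters with zero count contribute nothing, so only observed letters are summed.
    return sum((c / n) * math.log2((c / n) / bg[base])
               for base, c in counts.items())


def information_content(sites: Sequence[str],
                        background: Optional[Dict[str, float]] = None) -> float:
    """IC = IC_1 + ... + IC_w for N aligned sites of equal width w."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    return sum(column_ic([site[i] for site in sites], background)
               for i in range(width))


conserved_ic = information_content(["ACGT", "ACGT", "ACGT"])
mixed_column_ic = column_ic(["A", "A", "C", "C"])
```

Under these assumptions a perfectly conserved column contributes 2 bits, and a column split evenly between two letters contributes 1 bit.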

Statistics of IC scores A large deviation method for computing the distribution of the IC of random alignments is known (Hertz and Stormo, Bioinformatics, 15:563-577, 1999). The time to compute the p-value of one IC score is O(N^2). MEME computes O(w^2 N) IC scores per motif, so the total time, O(w^2 N^3), is prohibitive. POP p-values can be computed efficiently.

Correction factor for POP p-values The p-value of a POP score, p, is roughly: [equation not preserved in this transcript]. Because of the discrete nature of IC p-values, it is necessary to correct the POP p-values. Empirically, the p-value error for a POP score p, letting x = ln(p), is about: [equation not preserved in this transcript].
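The expressions from this slide are not available here, so the following is only a hedged illustration of why a product of p-values can be scored cheaply. For w independent p-values that are continuous and uniform on (0, 1), the probability that their product is at most p has the standard closed form p * sum_{i=0}^{w-1} (-ln p)^i / i!. Whether this is the exact expression shown on the slide is an assumption, and the sketch applies no discreteness correction. The function name `product_pvalue` is hypothetical.

```python
import math
from typing import Sequence


def product_pvalue(p_values: Sequence[float]) -> float:
    """P-value of the product of w independent, continuous uniform p-values.

    Uses the closed form p * sum_{i=0}^{w-1} (-ln p)^i / i!, where p is the
    observed product. No correction for discrete p-values is applied.
    """
    w = len(p_values)
    if w == 0:
        raise ValueError("need at least one p-value")
    product = math.prod(p_values)
    if product <= 0.0:
        return 0.0
    x = -math.log(product)
    term = 1.0   # (x^i / i!) for i = 0
    total = 1.0
    for i in range(1, w):
        term *= x / i
        total += term
    return min(1.0, product * total)


pop_two_columns = product_pvalue([0.1, 0.1])
```

The cost is linear in the number of columns w, which is consistent with the claim that POP p-values can be computed efficiently.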

Estimating the POP p-value correction factor parameters To estimate the correction factor parameters we: estimate the right tail of the distribution using a convolution method, and fit the (non-linear) correction function to the tail of the distribution using a least squares approach. The CPU time per motif to compute POP p-values is negligible once the correction factor parameters are known.

CPU time per motif using LD method to compute p-values w=16

CPU time to estimate correction factor parameters w=16

Speedup using POP statistic

Discovering regulatory elements in orthologous genes De novo discovery of most known regulatory elements in yeast has been demonstrated using four closely related yeast genomes (Kellis et al., Nature 423:241-254, 2003). We are exploring the possibility of extending their approach to the human genome using orthologous genes from mouse.

Evaluation of motif discovery algorithms Joint work with Martin Tompa and others. Eighteen motif discovery algorithms were evaluated on DNA regulatory motifs in four organisms. Each algorithm was run by experts in that particular algorithm. The ability of the algorithm to discover motifs in sets of DNA sequences was measured.

Performance of Motif Discovery Algorithms Finding Regulatory Motifs

Conservation of known regulatory elements in sets of orthologous genes [Figure panels: Human vs. Mouse; Four yeast species. Each panel compares regulatory elements with background sequences.] Source: Liu et al., Genome Res 14:451-458, 2004.

Large-scale discovery of human regulatory elements Compared with yeast, regulatory elements make up less of human intergenic DNA (3% vs. 15%). The relative difference in conservation rate (window percent identity) between human and mouse regulatory elements and background sequence is higher than among the four yeast species. Large-scale motif discovery should be possible using human and mouse orthologous genes.
