Pattern Discovery and Recognition for Genetic Regulation Tim Bailey UQ Maths and IMB.


Research Goals
We are studying algorithms for discovering regulatory elements in DNA. Our research includes developing fast and accurate methods for computing the statistics of random alignments, and discovering regulatory elements in the upstream regions of orthologous genes.

Recent Work
Developed a new way of computing statistics for DNA regulatory motif scores.
Participated in the evaluation of most extant motif discovery algorithms.
Studied prediction of subcellular localization.
Studied prediction of protein accessible surface area.
Developing algorithms for motif discovery in sets of orthologous sequences.

Collaborations
Algorithm evaluation: Martin Tompa (University of Washington).
Protein accessible surface area: Zheng Yuan (IMB).
Subcellular localization: Rohan Teasdale, Melissa Davis (IMB).

Computing the statistics of random alignments
Knowing the statistical significance of motifs makes it possible to distinguish “real” motifs from patterns that can be explained by chance. Computing motif significance is therefore critical to any motif discovery approach.

Measuring the goodness of DNA regulatory motifs: IC
[Figure panels: Sequences, Alignment, Counts, Frequencies, Information Content]
Sequences (labeled …HIS7, …ARO4, …ILV6, …THR4, …ARO1, …HOM2, …PRO3 in the figure):
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA
Alignment (rows 1 … N, columns 1 … w):
1 GACATCGAAA
2 GCACTTCGGC
  GAGTCATTAC
  GTAAATTGTC
  CCACAGTCCG
N TGTGAAGCAC
Counts: n_ij
Frequencies: f_ij = n_ij / N
Information Content: IC = IC_1 + … + IC_w
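As an illustration of the counts, frequencies, and IC quantities named on this slide, here is a minimal Python sketch. The slide does not give the per-column formula, so the sketch assumes the common relative-entropy definition, IC_j = sum over letters a of f_aj * log2(f_aj / p_a), with a uniform background p_a = 0.25. The function names `column_counts` and `information_content` are illustrative, and the six sites are the alignment rows visible in the figure (the full alignment has N rows, most of which are not shown).

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def column_counts(sites):
    """Return one Counter per column: how many sites carry each letter there."""
    width = len(sites[0])
    assert all(len(s) == width for s in sites), "sites must share one width"
    return [Counter(s[j] for s in sites) for j in range(width)]

def information_content(sites, background=None):
    """Return (IC, per-column ICs) in bits for an ungapped alignment.

    Assumes the relative-entropy definition of column IC; the slide itself
    only states IC = IC_1 + ... + IC_w.
    """
    if background is None:
        background = {a: 0.25 for a in ALPHABET}  # assumption: uniform background
    n_sites = len(sites)
    per_column = []
    for counts in column_counts(sites):
        ic = 0.0
        for a in ALPHABET:
            f = counts[a] / n_sites  # frequency = count / N
            if f > 0:
                ic += f * math.log2(f / background[a])
        per_column.append(ic)
    return sum(per_column), per_column

# The six alignment rows that are legible in the figure.
sites = ["GACATCGAAA", "GCACTTCGGC", "GAGTCATTAC",
         "GTAAATTGTC", "CCACAGTCCG", "TGTGAAGCAC"]
total, cols = information_content(sites)
print(f"IC = {total:.3f} bits over {len(cols)} columns")
```

Under these assumptions a perfectly conserved column scores 2 bits and a column with all four letters equally frequent scores 0 bits.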

POP: product of IC p-values
IC is the sum of the information contents of the motif columns. POP is an alternative measure of motif quality: the product of the p-values of the column information contents.
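The definition above can be sketched in Python for small alignments. This is an illustration, not the authors' implementation: it assumes the null model draws each of the N letters in a column independently from a uniform background, defines the column p-value as the probability that a random column has IC at least as large as the observed one, and computes it by brute-force enumeration of all count vectors. All function names are hypothetical.

```python
import math

BACKGROUND = (0.25, 0.25, 0.25, 0.25)  # assumption: uniform A, C, G, T

def column_ic(counts, background=BACKGROUND):
    """Relative-entropy information content (bits) of one column."""
    n = sum(counts)
    return sum((c / n) * math.log2((c / n) / p)
               for c, p in zip(counts, background) if c > 0)

def count_vectors(n, k=4):
    """Yield every tuple of k non-negative counts that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in count_vectors(n - first, k - 1):
            yield (first,) + rest

def multinomial_prob(counts, background=BACKGROUND):
    """Probability of a count vector when letters are drawn i.i.d."""
    coef = math.factorial(sum(counts))
    for c in counts:
        coef //= math.factorial(c)
    return coef * math.prod(p ** c for p, c in zip(background, counts))

def column_pvalue(counts, background=BACKGROUND):
    """P(IC of a random column >= observed IC), by exact enumeration."""
    observed = column_ic(counts, background)
    return sum(multinomial_prob(v, background)
               for v in count_vectors(sum(counts))
               if column_ic(v, background) >= observed - 1e-12)

def pop_score(columns, background=BACKGROUND):
    """POP: product of the column IC p-values."""
    return math.prod(column_pvalue(c, background) for c in columns)

# Two columns of four sites: one fully conserved, one maximally mixed.
print(pop_score([(4, 0, 0, 0), (1, 1, 1, 1)]))
```

Enumeration is only practical for small N; it is used here because it makes the definition explicit, not because it is efficient.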

Statistics of IC scores
A large deviation method for computing the distribution of the IC of random alignments is known (Hertz and Stormo, Bioinformatics 15, 1999). The time to compute the p-value of one IC score is O(N^2). MEME computes O(w^2 N) IC scores per motif, so the total time per motif, O(w^2 N) × O(N^2) = O(w^2 N^3), is prohibitive. POP p-values can be computed efficiently.

Discovering regulatory elements in orthologous genes
De novo discovery of most known regulatory elements in yeast has been demonstrated using four closely related yeast genomes (Kellis et al., Nature 423, 2003). We are exploring the possibility of extending their approach to the human genome using orthologous genes from mouse.

Speedup using POP statistic

Evaluation of motif discovery algorithms
Eighteen motif discovery algorithms were evaluated on DNA regulatory motifs in four organisms. Each algorithm was run by experts in that particular algorithm. The ability of each algorithm to discover motifs in sets of DNA sequences was measured.

Performance of Motif Discovery Algorithms Finding Regulatory Motifs

Conservation of known regulatory elements in sets of orthologous genes
[Figure: two panels, "Human vs. Mouse" and "Four yeast species", each showing regulatory elements and background sequences. Source: Liu et al., Genome Res 14.]

Large-scale discovery of human regulatory elements
Regulatory elements make up a smaller fraction of intergenic DNA in human than in yeast (3% vs. 15%). The relative difference in conservation rate (window percent identity) between regulatory elements and background sequence is higher for human and mouse than for the four yeast species. Large-scale motif discovery should be possible using human and mouse orthologous genes.

Estimating the POP p-value correction factor parameters
To estimate the correction factor parameters we first estimate the right tail of the distribution using a convolution method, then fit the (non-linear) correction function to the tail of the distribution using a least squares approach. The CPU time per motif to compute POP p-values is negligible once the correction factor parameters are known.
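The convolution step can be illustrated with a small Python sketch. It is one plausible reading of "a convolution method", not the authors' code: it assumes a uniform background, enumerates the exact null distribution of a single column's IC p-value for N sites, and then convolves the distribution of ln(column p-value) over w independent columns to obtain the exact null distribution of ln(POP). The least-squares fit is not shown because the correction function itself is not preserved in this transcript. All names are hypothetical.

```python
import math
from collections import defaultdict

BACKGROUND = (0.25, 0.25, 0.25, 0.25)  # assumption: uniform A, C, G, T

def count_vectors(n, k=4):
    """Yield every tuple of k non-negative counts that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in count_vectors(n - first, k - 1):
            yield (first,) + rest

def column_null(n, background=BACKGROUND):
    """Return {column p-value: probability} for one random column of n letters."""
    by_ic = defaultdict(float)
    for v in count_vectors(n):
        coef = math.factorial(n)
        for c in v:
            coef //= math.factorial(c)
        prob = coef * math.prod(p ** c for p, c in zip(background, v))
        ic = sum((c / n) * math.log2((c / n) / p)
                 for c, p in zip(v, background) if c > 0)
        by_ic[round(ic, 9)] += prob  # merge count vectors with equal IC
    dist, tail = {}, 0.0
    for ic in sorted(by_ic, reverse=True):
        tail += by_ic[ic]            # p-value of this IC level: P(IC >= level)
        dist[tail] = by_ic[ic]
    return dist

def pop_null(n, width, background=BACKGROUND):
    """Convolve ln(column p-value) over `width` independent columns."""
    column = defaultdict(float)
    for p, mass in column_null(n, background).items():
        column[round(math.log(p), 12)] += mass
    total = {0.0: 1.0}
    for _ in range(width):
        nxt = defaultdict(float)
        for x, px in total.items():
            for y, py in column.items():
                nxt[round(x + y, 12)] += px * py
        total = dict(nxt)
    return total  # {ln(POP): probability}

def pop_pvalue(ln_pop, null):
    """Exact P(ln POP <= ln_pop) under the enumerated null distribution."""
    return sum(mass for x, mass in null.items() if x <= ln_pop + 1e-9)

null = pop_null(n=4, width=3)
print(pop_pvalue(3 * math.log(4 / 256), null))  # three fully conserved columns
```

The exact tail computed this way is what a correction function would be fitted against; the enumeration is feasible only for small N and w.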

Correction factor for POP p-values
The p-value of a POP score, p, is roughly: [equation not preserved in this transcript]
Because of the discrete nature of IC p-values, it is necessary to correct the POP p-values. Empirically, the error in the POP p-value, letting x = ln(p), is about: [equation not preserved in this transcript], where a and b are parameters that must be estimated.
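For reference, a standard result that may or may not match the expression shown on the slide: if w p-values are independent and exactly uniform on (0, 1), the probability that their product is at most p is p * sum_{i=0}^{w-1} (-ln p)^i / i!. Column IC p-values are discrete rather than uniform, which is why a correction is needed on top of an approximation of this kind. The sketch below implements only the uncorrected formula; the function name `product_pvalue` is illustrative, and the slide's correction term with parameters a and b is not reproduced because it is not preserved here.

```python
import math

def product_pvalue(p, w):
    """P(product of w independent Uniform(0,1) values <= p).

    Uses the closed form p * sum_{i=0}^{w-1} (-ln p)^i / i!.
    Assumes continuous, independent p-values (no discreteness correction).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if w < 1:
        raise ValueError("w must be at least 1")
    x = -math.log(p)
    term, total = 1.0, 1.0
    for i in range(1, w):
        term *= x / i   # builds x^i / i! incrementally
        total += term
    return p * total

# A product of 0.01 over two columns is much less surprising than 0.01 for one.
print(product_pvalue(0.01, 1), product_pvalue(0.01, 2))
```

With w = 1 the function returns p unchanged, and the result grows with w because a small product becomes easier to reach by chance when more factors are multiplied.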

CPU time per motif using LD method to compute p-values w=16

CPU time to estimate correction factor parameters w=16