Cis-regulatory Modules and Module Discovery

Presentation transcript:

Special Topics in Genomics
Cis-regulatory Modules and Phylogenetic Footprinting

Cis-regulatory Modules and Module Discovery
The slides for module discovery are provided by Prof. Qing Zhou, UCLA.

Motif Discovery
[Figure: a sequence modeled as a mixture of background and motif (weight matrix, positions 1 to 5).]

Difficulties in motif discovery in higher organisms
Upstream sequences are longer.
Motifs are less conserved and shorter.
Background sequence structures are more complicated.
To address these difficulties, build more biological knowledge into the model:
1) module structure
2) multiple-species conservation

Cis-regulatory module
Combinatorial control of genes: cis-regulatory modules.
[Figure: a cis-regulatory module.]

CisModule: modeling module structure (Zhou and Wong, PNAS 2004)
Module structure: consider co-localization of motif sites.
Hierarchical mixture modeling, with K the number of motifs.
[Figure labels: B, M, S; Motif 1, Motif 2, Motif 3.]

Parameters and missing data
Missing data problem.
Given
  K   # of motifs
  l   Module length
Observed data
  S   Set of sequences
Missing data
  M   Indicators for a module start
  A   Indicators for a motif site start
Parameters Ψ
  Θ   Background model and weight matrices for motifs
  W   Motif widths
  r   Probability of a module start
  q   Probability of starting a motif site

Bayesian inference by posterior sampling
Module-motif detection: given Θ, r, q, and W,
1) Sample modules (M = 1 at a module start, M = 0 otherwise).
2) Within each module, sample motif sites.
Parameter update: given M and A,
1) Infer Θ from aligned sites.
2) Update r, q and W.
Aligned sites (example): TTTGC, TATCC, CTTGC, TTTAC, GTTGC
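The "infer Θ from aligned sites" step can be illustrated with the five aligned sites shown on this slide. The slides do not say how Θ is inferred, so the following is only a minimal sketch that assumes a count-plus-pseudocount estimate per column. The function name `weight_matrix` and the pseudocount of 0.5 are illustrative choices, not taken from CisModule.

```python
from collections import Counter

ALPHABET = "ACGT"

def weight_matrix(sites, pseudocount=0.5):
    """Estimate per-column nucleotide frequencies from equal-width aligned sites.

    Assumption: a simple smoothed frequency estimate; the pseudocount value
    is illustrative and not specified in the slides.
    """
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("aligned sites must all have the same width")
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for j in range(width):
        counts = Counter(site[j] for site in sites)  # missing bases count as 0
        matrix.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return matrix

# Aligned sites from the slide.
sites = ["TTTGC", "TATCC", "CTTGC", "TTTAC", "GTTGC"]
theta = weight_matrix(sites)
for j, column in enumerate(theta, start=1):
    print(j, {b: round(p, 3) for b, p in column.items()})
```

Each row of the printout is one motif column. A fully Bayesian sampler would typically draw each column from a posterior distribution instead of using a point estimate like this one.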

Module sampling
To sample from P(M | S, Ψ), first compute the required sums by forward summation, with separate terms for module and background positions.
[Equations not preserved in the transcript.]

Module sampling
Backward sampling
How to calculate: [equations not preserved in the transcript.]
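The equations for forward summation and backward sampling did not survive in this transcript, so the following is a hedged sketch of the general technique and not a reconstruction of the slide's formulas. It assumes a simplified model: every module has fixed length l, a module starts at a position with probability r, and the per-base background probabilities `bg` and per-window module likelihoods `mod` are already computed. All names here are hypothetical.

```python
import random

def forward_sums(bg, mod, r, l):
    """F[i] = probability of the first i bases, summed over all module placements.

    bg[i]  : background probability of base i (assumed precomputed)
    mod[i] : likelihood of window [i, i + l) as one module (assumed precomputed)
    """
    n = len(bg)
    F = [0.0] * (n + 1)
    F[0] = 1.0
    for i in range(1, n + 1):
        F[i] = (1 - r) * bg[i - 1] * F[i - 1]      # base i-1 is background
        if i >= l:
            F[i] += r * mod[i - l] * F[i - l]      # a module covers bases i-l .. i-1
    return F

def backward_sample(bg, mod, r, l, F, rng=random):
    """Draw module start positions from right to left using the forward sums."""
    starts = []
    i = len(bg)
    while i > 0:
        p_module = r * mod[i - l] * F[i - l] / F[i] if i >= l else 0.0
        if rng.random() < p_module:
            starts.append(i - l)
            i -= l
        else:
            i -= 1
    return sorted(starts)

# Toy example: 12 bases, module length 4, one window that looks module-like.
bg = [0.25] * 12
mod = [0.25 ** 4] * 9
mod[4] = 0.1
F = forward_sums(bg, mod, r=0.1, l=4)
rng = random.Random(0)
draws = [backward_sample(bg, mod, 0.1, 4, F, rng) for _ in range(1000)]
print(sum(4 in d for d in draws) / len(draws))  # fraction of draws with a module at 4
```

For sequences of realistic length the raw products underflow, so a practical version would rescale F at each step or work in log space.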

Posterior inference
Motif sites: positions whose marginal posterior probability of being a motif start position is > 0.5.
Modules: positions whose marginal posterior probability of being within a module is > 0.5.
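The 0.5 rule above can be sketched as follows, assuming the sampler's output is stored as one 0/1 indicator vector per posterior draw. The function names and the toy draws are illustrative only.

```python
def marginal_posterior(samples):
    """Average 0/1 indicator vectors (one per posterior draw), position by position."""
    n_draws = len(samples)
    length = len(samples[0])
    return [sum(draw[i] for draw in samples) / n_draws for i in range(length)]

def call_positions(samples, threshold=0.5):
    """Report positions whose marginal posterior probability exceeds the threshold."""
    probs = marginal_posterior(samples)
    return [i for i, p in enumerate(probs) if p > threshold]

# Four hypothetical posterior draws over five positions.
posterior_draws = [
    [0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1],
]
print(marginal_posterior(posterior_draws))
print(call_positions(posterior_draws))
```

The comparison is strict, so a position selected in exactly half of the draws is not reported.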

Simulation study
Generate 30 data sets independently, each containing:
1) 20 sequences, each of length 1000;
2) 25 modules, each of length 150;
3) each module contains 1 E2F site, 1 YY1 site, and 1 cMyc site.
         CisModule               Do not consider module
Motif    Fail    TP      FP      Fail    TP      FP
E2F      0.03    17.9    7.5     0.37    17.1    11.6
YY1      0.07    16.0    8.7     0.20    11.0    ?
cMyc     ?       15.7    9.9     0.63    13.6    12.4
(? = value missing from the transcript; for YY1 the surviving 11.0 may be either TP or FP.)

Example: Discovery of tissue-specific modules in Ciona (Sidow lab)
Collected 21 genes that are co-expressed during the development of muscle tissue in Ciona.
Goal: find motifs and modules in the upstream sequences (average length = 1330) of these genes.
Found 3 motifs in 28 modules (4860 bp).
Are they real motifs that determine gene expression?

Experimental validation
Positive element: the shortest sufficient, non-overlapping sequence that drives strong expression in muscle. Average length: 289 bp.

Experimental validation
70% of our predicted motif sites are located in the positive elements!

Other tools
Gibbs Module Sampler (Thompson et al., Genome Res. 2004)
EMCMODULE (Gupta and Liu, PNAS 2005)

Phylogenetic Footprinting

Functional elements tend to be conserved across species
For example, exons are conserved due to selection pressure. Introns and intergenic regions are less likely to be conserved.

Phylogenetic footprinting
(Miller et al., Annu. Rev. Genomics Hum. Genet. 2004)

Incorporating cross-species conservation into motif discovery
A threshold method (Wasserman et al., Nature Genetics 2000)
STEP 1: construct cross-species alignment
STEP 2: compute conservation measure from the alignment
STEP 3: filter out non-conserved regions
STEP 4: apply a Gibbs motif sampler to the conserved regions of the target genome
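Step 3 of the threshold method can be sketched as below, assuming a per-base conservation score is already available from steps 1 and 2. The threshold, the minimum segment length, and the toy scores are placeholders, since the slides give no specific values.

```python
def conserved_segments(sequence, conservation, threshold, min_length=1):
    """Return (start, end, subsequence) for maximal runs with score >= threshold.

    Assumes one conservation score per base of the target sequence.
    """
    segments = []
    start = None
    for i, score in enumerate(conservation):
        if score >= threshold and start is None:
            start = i                                # a conserved run begins
        elif score < threshold and start is not None:
            if i - start >= min_length:
                segments.append((start, i, sequence[start:i]))
            start = None                             # the run has ended
    if start is not None and len(sequence) - start >= min_length:
        segments.append((start, len(sequence), sequence[start:]))
    return segments

# Toy target sequence with hypothetical conservation scores.
seq = "ACGTTTGCAAGG"
cons = [0.2, 0.3, 0.9, 0.8, 0.9, 0.95, 0.9, 0.1, 0.2, 0.8, 0.9, 0.85]
segments = conserved_segments(seq, cons, threshold=0.7, min_length=3)
print(segments)
```

Only the returned segments would then be passed to the motif sampler in step 4.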

Phylogenetic footprinting & motif discovery
CompareProspector (Liu Y. et al., Genome Res. 2004)
STEP 1: construct cross-species alignment
STEP 2: compute conservation measure (window percent identity, WPID) from the alignment
STEP 3: multiply the likelihood ratio at each position by the corresponding WPID, so that the likelihood landscape favors conserved sites
STEP 4: apply an algorithm based on a Gibbs motif sampler
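Steps 2 and 3 can be illustrated with a small sketch. The slides do not give the window size or the exact WPID definition used by CompareProspector, so this assumes WPID is the fraction of identical, non-gap columns in a window centered on each alignment column. The function names and the `half_window` parameter are hypothetical.

```python
def window_percent_identity(aligned_a, aligned_b, half_window=2):
    """WPID per alignment column: share of identical, non-gap columns in a centered window."""
    n = len(aligned_a)
    identical = [a == b and a != "-" for a, b in zip(aligned_a, aligned_b)]
    wpid = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        wpid.append(sum(identical[lo:hi]) / (hi - lo))  # window is clipped at the ends
    return wpid

def conservation_weighted(likelihood_ratios, wpid):
    """Scale each position's likelihood ratio by its WPID, favoring conserved sites."""
    return [lr * w for lr, w in zip(likelihood_ratios, wpid)]

# Toy pairwise alignment and a flat, hypothetical likelihood-ratio landscape.
wpid = window_percent_identity("ACGTAC", "ACGTTT", half_window=1)
weighted = conservation_weighted([2.0] * 6, wpid)
print(wpid)
print(weighted)
```

In this toy case the flat landscape becomes one that drops toward the non-conserved right end of the alignment.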

Phylogenetic footprinting & motif discovery
Approaches based on an evolutionary model:
EMnEM (Moses et al. 2004)
PhyME (Sinha et al. 2004)
PhyloGibbs (Siddharthan et al. 2005)
Tree Sampler (Li and Wong 2005)

Incorporating cross-species conservation into motif discovery
PhyloCon (Wang and Stormo, Bioinformatics 2003)
STEP 1: construct alignment among orthologous sequences;
STEP 2: convert conserved regions into profiles;
STEP 3: use profiles in the first sequence as seeds;
STEP 4: find matches of each seed in the second sequence;
STEP 5: update seeds;
STEP 6: repeat steps 2 and 3 for all sequences.

Phylogenetic footprinting & module discovery
MultiModule (Zhou and Wong, The Annals of Applied Statistics, 2007)

MultiModule
Module structure of each sequence is modeled by an HMM.
Couple HMMs via multiple alignment: aligned states are coupled and collapsed into one common state.
Uncoupled states: similar to the single-species model.
Coupled states: evolutionary model.

Comparing with other methods
Three data sets with previously reported experimental validation, containing 9 known motifs with 152 validated sites.
CompareProspector (Liu et al. 2004): conservation score
PhyloCon (Wang and Stormo 2003): progressive alignment of profiles
EMnEM (Moses et al. 2004): phylogenetic motif discovery
CisModule (Zhou and Wong 2004): single-species module discovery

Comparing with other methods
# of known sites = 152. The last four columns refer to the motifs correctly identified by each method.
Method              # known motifs identified   # predicted sites   # overlaps   Sensitivity (%)   Specificity (%)
CompareProspector   7                           75                  36           24                48
PhyloCon            3                           50                  26           17                52
EMnEM               6                           130                 44           29                34
CisModule           5                           110                 35           23                32
MultiModule         8                           157                 79
(Sensitivity and specificity for MultiModule are missing from the transcript.)
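The slides do not define the two percentage columns. A sketch under the assumption that sensitivity = overlaps / known sites and specificity = overlaps / predicted sites (a quantity more often called positive predictive value) reproduces every complete row of the table. The same assumed formulas can then be applied to the MultiModule counts; the result is a derived figure, not a value quoted from the slides.

```python
KNOWN_SITES = 152  # number of validated sites stated on the slide

def site_level_metrics(predicted, overlaps, known=KNOWN_SITES):
    """Return (sensitivity %, specificity %) rounded to whole percents.

    Assumed definitions (not stated in the slides):
      sensitivity = overlaps / known sites
      specificity = overlaps / predicted sites
    """
    sensitivity = 100 * overlaps / known
    specificity = 100 * overlaps / predicted
    return round(sensitivity), round(specificity)

# (# predicted sites, # overlaps) as listed in the table.
rows = {
    "CompareProspector": (75, 36),
    "PhyloCon": (50, 26),
    "EMnEM": (130, 44),
    "CisModule": (110, 35),
    "MultiModule": (157, 79),
}
for method, (predicted, overlaps) in rows.items():
    print(method, site_level_metrics(predicted, overlaps))
```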