Computational detection of cis-regulatory modules Stein Aerts, Peter Van Loo, Ger Thijs, Yves Moreau and Bart De Moor Katholieke Universiteit Leuven, Belgium.

1 Computational detection of cis-regulatory modules. Stein Aerts, Peter Van Loo, Ger Thijs, Yves Moreau and Bart De Moor, Katholieke Universiteit Leuven, Belgium. Slides by Chulyun Kim. Presented by Saurabh Sinha.

2 Contents Introduction Methods Methodology overview Score functions ModuleSearch algorithm Results Conclusions

3 Contents Introduction Methods Methodology overview Score functions ModuleSearch algorithm Results Conclusions

4 Motivation
The transcriptional regulation of a metazoan gene depends on the cooperative action of multiple transcription factors.
These factors bind to cis-regulatory modules (CRMs) located in the neighborhood of the gene.
By integrating multiple signals, CRMs confer an organism specific spatial and temporal rate of transcription.

5 Related Works
Yuh et al., 1998: working with combinations of factors makes it possible to integrate multiple inputs, and this further provides cross-coupling of signal transduction and gene regulatory pathways.
Bray et al., 2003: AVID, an alignment algorithm designed to identify functional non-coding segments.
Aerts et al., 2003: delineation of putative regions containing CRMs in large intergenic sequences.
Thijs et al., 2002: detecting DNA motifs by their statistical over-representation in a set of sequences.
Aerts et al., 2003: detecting over-represented hits of known TFBSs.
Recently, exploiting colocalization to find true binding sites in a particular gene has yielded valuable hypotheses regarding transcriptional regulation.

6 Problem
To find the best combination of transcription factor binding sites (TFBSs) that occurs several times across multiple coregulated human genes, specifically within regions that are syntenic with the respective mouse orthologous genes.

7 Contents Introduction Methods Methodology overview Score functions ModuleSearch algorithm Results Conclusions

8 Methodology Overview

9 Data
Human-mouse orthologous pairs.
10 kb of sequence upstream of the coding sequence of the human and mouse genes, from Ensembl release 9.
18,778 pairs with successful selection.

10 Alignment and Parsing
Alignment: each 10 kb pair was aligned with AVID.
Parsing: the alignment output was parsed using VISTA, selecting regions with at least 75% identity in windows of 100 bp. 33,282 regions in total (syntenic FASTA database).
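
The 75%-identity, 100 bp window criterion can be illustrated with a small sliding-window filter. This is only a sketch: the function name `conserved_windows`, the gap handling, and the merging of overlapping windows are assumptions, and the actual VISTA procedure may differ.

```python
def conserved_windows(aligned_a, aligned_b, window=100, min_identity=0.75):
    """Return merged (start, end) alignment-column ranges covered by
    windows whose fraction of identical columns is >= min_identity."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aligned_a)
    # 1 where both rows hold the same nucleotide; gap columns never match.
    match = [int(a == b and a != "-")
             for a, b in zip(aligned_a.upper(), aligned_b.upper())]
    regions = []
    if n < window:
        return regions
    count = sum(match[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            count += match[start + window - 1] - match[start - 1]
        if count / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # merge with previous
            else:
                regions.append((start, end))
    return regions
```

Coordinates are half-open alignment columns; mapping them back to genomic positions would need the gap bookkeeping of the aligner's output, which is not shown here.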

11 Background Model and MotifScanner
Background model: a 3rd-order Markov model is calculated from the syntenic FASTA database. It is used for scoring and for generating artificial datasets.
MotifScanner: all syntenic regions are scanned to predict transcription factor binding sites (TFBSs), using TRANSFAC frequency matrices. All occurrences are stored in GFF format in the syntenic GFF database.
Example frequency matrix:
PO  A   C   G   T
01  12  4   3   1   A
02  3   2   11  4   G
03  11  2   4   3   A
…
GFF (Gene-Finding Format or General Feature Format): a protocol for the transfer of feature information. Fields are:
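
To make the background model concrete, here is a minimal sketch of estimating a 3rd-order Markov model from a set of sequences. The function names, the pseudocount, and the uniform fallback for unseen contexts are assumptions made for illustration, not details taken from the slides.

```python
from collections import defaultdict

BASES = "ACGT"

def train_markov(sequences, order=3, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from raw sequences."""
    counts = defaultdict(lambda: dict.fromkeys(BASES, pseudocount))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguity codes such as N.
            if set(context + base) <= set(BASES):
                counts[context][base] += 1
    model = {}
    for context, c in counts.items():
        total = sum(c.values())
        model[context] = {b: v / total for b, v in c.items()}
    return model

def transition_prob(model, context, base):
    """Look up P(base | context); unseen contexts fall back to uniform."""
    return model.get(context, {}).get(base, 0.25)
```

With `order=3` the model has at most 64 contexts, each holding a distribution over the four nucleotides.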

12 Coregulated Genes
Sets of coexpressed genes, from the SOURCE database, for cyclin B2.
Dataset of gene expression during the cell cycle in a human cancer cell line.
44 genes might share a common cis-regulatory element.
Of these, 34 had an Ensembl identifier.
Among them, 13 genes have at least one syntenic region with the respective mouse gene; 32 regions in total.

13 Contents Introduction Methods Methodology overview Score functions ModuleSearch algorithm Results Conclusions

14 Scoring single TFBSs
Combines a position-specific frequency matrix (PSFM) Θ with a higher-order background model B_m.
The score expresses how likely it is that the segment was generated by the motif model rather than by the background.
x is a segment [b_1, b_2, …, b_w].
b_j is the nucleotide found at position j in x.
Θ(b_j, j) is the probability of finding b_j at position j according to the PSFM.
P(b_j | s, B_m) is the probability of finding b_j in the sequence according to the background model.
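
The formula itself did not survive in the transcript. A common way to combine these two quantities is a log-likelihood ratio, summing log(Θ(b_j, j) / P(b_j | s, B_m)) over the positions of the segment; the sketch below assumes that form. The names `site_score`, `scan`, and `uniform_background` are hypothetical, and the uniform order-0 background is a simplification used only to keep the demo self-contained.

```python
import math

def uniform_background(sequence, position):
    # Assumption for the demo only: every base has probability 0.25.
    # A higher-order model would condition on the preceding bases.
    return 0.25

def site_score(sequence, start, psfm, background=uniform_background):
    """Log-likelihood ratio of the segment starting at `start`.
    psfm is a list of dicts, one per motif position, base -> probability.
    All PSFM entries must be > 0 (add pseudocounts beforehand)."""
    score = 0.0
    for j, column in enumerate(psfm):
        base = sequence[start + j]
        score += math.log(column[base] / background(sequence, start + j))
    return score

def scan(sequence, psfm, threshold, background=uniform_background):
    """Return (start, end, score) for every segment scoring >= threshold."""
    w = len(psfm)
    hits = []
    for start in range(len(sequence) - w + 1):
        s = site_score(sequence, start, psfm, background)
        if s >= threshold:
            hits.append((start, start + w, s))
    return hits
```

A positive score means the segment is better explained by the motif than by the background; the threshold in `scan` is a free parameter.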

15 Matrix similarity
Redundancy of motif models: there can be multiple matrices describing the same TF, and there can be distinct TFs with similar PSFMs.
Similarity is measured with the Kullback-Leibler distance between two motif models.
Θ_1(j, b) is the probability of finding base b at position j in motif 1.
w is the length of the motif.
A is the set of all possible alignments for an allowed shift.
The motif models can be grouped into classes depending on a threshold on this average distance.
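
Since the distance formula is missing from the transcript, the following is only one plausible reading: the Kullback-Leibler divergence per column, averaged over the aligned positions, and minimized over the allowed shifts. The function names and the `max_shift` parameter are assumptions, and the original may use a symmetrized variant.

```python
import math

def kl_column(p, q):
    """KL divergence between two base distributions (all entries > 0)."""
    return sum(p[b] * math.log(p[b] / q[b]) for b in "ACGT")

def motif_distance(m1, m2, max_shift=1):
    """Smallest average per-column KL divergence over allowed shifts.
    m1 and m2 are lists of dicts, one per position, base -> probability."""
    best = math.inf
    for shift in range(-max_shift, max_shift + 1):
        # Pair column j of m1 with column j + shift of m2 where both exist.
        pairs = [(m1[j], m2[j + shift])
                 for j in range(len(m1)) if 0 <= j + shift < len(m2)]
        if not pairs:
            continue
        avg = sum(kl_column(p, q) for p, q in pairs) / len(pairs)
        best = min(best, avg)
    return best
```

Two matrices could then be placed in the same class when `motif_distance` falls below a chosen threshold.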

16 Module Score Function
A binding site and a motif model (a frequency matrix) → CRMs and CRM models.
CRMs: clusters of actual binding sites on a sequence.
CRM models: sets of motif models.
The score of a CRM model m is defined on a set of sequences s = (s_1, …, s_n).

17 The score of a CRM model m on a sequence s
m is a collection of motif models Θ_1, …, Θ_l.
t is the combination of matching binding sites. For each motif model Θ_i it holds a count over the occurring TFBSs of Θ_i in sequence s: if the number of occurrences is q, this count can take any value in 0, …, q, and a nonzero value k refers to the kth instance of Θ_i on sequence s.
Each selected instance contributes the score of a single TFBS.
b(t) is a Boolean function expressing whether the given combination of TFBSs is valid. It considers overlap between different TFBSs and whether the sites lie within the specified window length (the distance constraint).
p(t) is the penalization function of CRMs: the number of occurring sites divided by the number of motif models l.
The score does not take the motif order into account.
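
The pieces described on this slide can be sketched as a brute-force maximization. This assumes that the module score is the maximum, over valid combinations t, of p(t) times the sum of the selected single-site scores, and that overlapping sites are invalid; the exact formula is not preserved in the transcript. The hit tuple layout and the name `module_score` are hypothetical.

```python
from itertools import product

def module_score(model, hits, window=100):
    """model: list of motif names.
    hits: list of (motif, start, end, score) with half-open coordinates.
    Picks at most one hit per motif model (None means the motif is absent)."""
    options = [[None] + [h for h in hits if h[0] == name] for name in model]
    best = 0.0
    for combo in product(*options):
        sites = sorted((h for h in combo if h is not None),
                       key=lambda h: h[1])
        if not sites:
            continue
        # b(t): reject overlapping sites and clusters wider than the window.
        overlaps = any(a[2] > b[1] for a, b in zip(sites, sites[1:]))
        span = max(h[2] for h in sites) - sites[0][1]
        if overlaps or span > window:
            continue
        penalty = len(sites) / len(model)  # p(t): occurring sites / l
        best = max(best, penalty * sum(h[3] for h in sites))
    return best
```

Enumerating every combination is exponential in the number of motif models, which is acceptable for small modules such as the four-motif case but is not meant as an efficient implementation.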

18 Contents Introduction Methods Methodology overview Score functions ModuleSearch algorithm Results Conclusions

19 ModuleSearch
Since the order of sites is not considered, CRM models can be kept sorted in alphabetical order.
n_Θ, the number of sites a module should contain, is given.
The goal is to search for the best CRM model on a set of coregulated genes.
A typical best-first / branch-and-bound search is used.
Starting from the empty model, incomplete models are expanded by adding a motif model from a different class, until there are no incomplete models whose overestimate heuristic score is greater than the score of the current best complete model.
The model with the best heuristic score is expanded first.
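
A generic best-first branch-and-bound loop along these lines is sketched below. It is not the authors' implementation: the `score` callable, the `motif_class` mapping, and the bound (current score plus the best remaining single-motif scores) are assumptions, and the bound is only a valid overestimate if adding a motif never raises the score by more than that motif's single-model score.

```python
import heapq

def module_search(motifs, motif_class, score, n_sites):
    """motifs: motif names; motif_class: name -> class label;
    score: function from a tuple of names to a float (score(()) allowed).
    Returns (best_model, best_score) among models with n_sites motifs."""
    motifs = sorted(motifs)
    single = {m: score((m,)) for m in motifs}
    best_model, best_score = None, float("-inf")

    def bound(model):
        # Optimistic estimate: add the best singles from unused classes.
        used = {motif_class[m] for m in model}
        rest = sorted((single[m] for m in motifs
                       if motif_class[m] not in used), reverse=True)
        return score(model) + sum(rest[:n_sites - len(model)])

    heap = [(-bound(()), ())]
    while heap:
        neg_bound, model = heapq.heappop(heap)
        if -neg_bound <= best_score:
            break  # no remaining model can beat the best complete one
        if len(model) == n_sites:
            s = score(model)
            if s > best_score:
                best_model, best_score = model, s
            continue
        used = {motif_class[m] for m in model}
        for m in motifs:
            # Keep models alphabetically sorted and classes distinct.
            if motif_class[m] in used or (model and m <= model[-1]):
                continue
            child = model + (m,)
            heapq.heappush(heap, (-bound(child), child))
    return best_model, best_score
```

Keeping each model as an alphabetically sorted tuple means every set of motifs is generated exactly once, which is the point of the ordering remark on the slide.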

20 Heuristic Score
The heuristic score uses two quantities: the score function of m without penalization, and an overestimate heuristic value of the rise in score from CRM model m to its best child CRM model.
→ [Θ_i] is a CRM model containing one matrix Θ_i.
→ t = ( ) ∪ (Θ_{l+1}, …, Θ_e)
→ A Boolean function expresses whether the motif models added to m all belong to different classes.

21 Contents Introduction Methods Methodology overview Score functions ModuleSearch algorithm Results Conclusions

22 Semi-Artificial Sequences Artificial sequences were generated by sampling symbols from the background model
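
Sampling symbols from a Markov background model can be sketched as follows. The model layout (a dict from context string to a base-probability dict), the uniform seed for the first few bases, and the uniform fallback for unseen contexts are assumptions made for this illustration.

```python
import random

BASES = "ACGT"

def sample_sequence(model, length, order=3, rng=None):
    """Draw a sequence of `length` bases from a Markov model.
    model maps a context of `order` bases to {base: probability}."""
    rng = rng or random.Random()
    # The first `order` bases have no full context; draw them uniformly.
    seq = [rng.choice(BASES) for _ in range(min(order, length))]
    uniform = {b: 0.25 for b in BASES}
    while len(seq) < length:
        context = "".join(seq[len(seq) - order:])
        probs = model.get(context, uniform)
        weights = [probs[b] for b in BASES]
        seq.append(rng.choices(BASES, weights=weights)[0])
    return "".join(seq)
```

Passing a seeded `random.Random` instance makes the generated dataset reproducible.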

23 Detecting Modules in Microarray Clusters
Selected gene cluster around cyclin B2.
The best module model in the cluster selected by ModuleSearcher, with window = 100 bp and n_Θ = 4: [NFY, STAF, TCF4, CEBPA].

24 Contents Introduction Methods Methodology overview Score functions ModuleSearch algorithm Results Conclusions

25 Scoring functions for modules in syntenic regions, and an algorithm to find the best-scoring module, were proposed.
The authors tested the proposed algorithm on artificial data and showed that it could find the hidden modules with high sensitivity.
They predicted a module in a set of coexpressed genes and validated the prediction using the same approach.

