
Localising regulatory elements using statistical analysis and shortest unique substrings of DNA

Nora Pierstorff 1, Rodrigo Nunes de Fonseca 2, Thomas Wiehe 1
1 - Institute for Genetics, University of Cologne, Germany, Email: nora.pierstorff@uni-koeln.de
2 - Institute for Developmental Biology, University of Cologne

ABSTRACT:
Several computational methods for predicting regulatory regions have been developed in the last few years. Most of the available methods require transcription factor binding site matrices to achieve reasonable results. To avoid the need for such biological information, we developed a program named SHUREG that predicts regulatory regions without any extrinsic information beyond the sequence itself. By calculating shustrings (shortest unique substrings), we find statistically overrepresented motifs, which are assumed to be indicators of regulatory elements. [3]

SHUREG - ALGORITHM:
1. Calculation of shustrings (shortest unique substrings) at every position, relative to a surrounding window, on the forward and backward strands.
2. Counting of neighbours (exact repeats in the surroundings).
3. Calculation of P-values for each shustring.
4. Smoothing of P-values.

WHY SHORTEST UNIQUE SUBSTRINGS?
Analyzing the human (mouse) genome, we found 255 (293) global shustrings of length 11 bp. [4] Of these shustrings, 29 (22) are positioned in 1000 bp upstream regions. The probability of this distribution is 3.3 x 10^-24 (5.0 x 10^-18).

RESULTS:
We applied our program to different well-explored regions of the Drosophila melanogaster genome. Our dataset includes segmentation and dorsal-ventral genes. We compare our predictions to the results of AHAB [1], a program that uses PWMs (Position Weight Matrices).

Figure 1 shows two predictions for the giant region: 1a is computed using SHUREG, and 1b is the result of the AHAB program applied to the same sequence. Figure 2a shows the SHUREG prediction for the regulatory regions of the hairy gene; 2b shows the corresponding AHAB prediction. Figure 3 is partitioned into three predictions. Figure 3a is the SHUREG prediction for the Dorsal-regulated enhancer of the sog gene. Figure 3b shows the AHAB prediction using only the PWM of the Dorsal binding site. Figure 3c shows the AHAB prediction using all known PWMs, for the hypothetical case in which the factors actually responsible for regulating this gene are unknown.

INTRODUCTION:
To localize regulatory regions, three basic computational approaches have been followed:
1. Search for binding sites of known transcription factors using Position Weight Matrices. [1]
2. Search for conserved motifs in upstream regions of homologous or coregulated genes. [2]
3. Search for statistically overrepresented motifs. [3]

Our program SHUREG follows the third approach, which is supported by two hypotheses:
1. Degenerate binding sites guide the transcription factor to the binding site.
2. New binding sites can be created easily from degenerate binding sites through a few mutations, allowing the organism to adapt to environmental changes.

DISCUSSION:
Localizing regulatory regions without any extrinsic information is a hard problem. Using the number of overrepresented patterns in a region as an indicator of regulatory regions is a reasonable measure and can lead to reasonable results. However, it also leads to many false-positive predictions, because we find additional overrepresented patterns that cannot be correlated with binding sites. To improve the predictions of our method, we need to find more features that distinguish true-positive from false-positive predictions. We are currently investigating the conservation of overrepresented motifs between species.

Figure 1a: SHUREG prediction in the giant region
Figure 1b: AHAB prediction in the giant region
Figure 2a: SHUREG prediction in the hairy region
Figure 2b: AHAB prediction in the hairy region
Figure 3a: SHUREG prediction in the sog region
Figure 3c: AHAB prediction in the sog region using all known PWMs

References:
[1] Rajewsky, N., Vergassola, M., Gaul, U., Siggia, E. D. (2002). Computational detection of genomic cis-regulatory modules, applied to body patterning in the early Drosophila embryo. BMC Bioinformatics, 3:30.
[2] Bussemaker, H., Li, H., Siggia, E. (2000). Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis. PNAS, Aug 2000; 97.
[3] Nazina, A., Papatsenko, D. (2003). Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency. BMC Bioinformatics, 4:65.
[4] Haubold, B., Pierstorff, N., Moeller, F., Wiehe, T. (2005). Genome comparison without alignment using shortest unique substrings. BMC Bioinformatics, 6:123.
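
The first step of the SHUREG algorithm, calculating a shustring at every position relative to a surrounding window, can be illustrated with a minimal brute-force sketch. This is not the SHUREG implementation: the function names and the half_window parameter are assumptions made for illustration, only one strand is handled, and an efficient implementation would likely use a suffix-tree-like index rather than repeated string searches.

```python
from typing import List, Optional


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def shustring_lengths(seq: str) -> List[Optional[int]]:
    """For each start position, return the length of the shortest substring
    beginning there that occurs exactly once in seq (None if none exists)."""
    lengths: List[Optional[int]] = []
    for i in range(len(seq)):
        found = None
        # Grow the substring until it becomes unique within seq.
        for length in range(1, len(seq) - i + 1):
            if count_occurrences(seq, seq[i:i + length]) == 1:
                found = length
                break
        lengths.append(found)
    return lengths


def local_shustring_length(seq: str, pos: int, half_window: int) -> Optional[int]:
    """Shustring length at pos, with uniqueness judged only inside a window
    of up to half_window bases on each side (hypothetical parameter)."""
    start = max(0, pos - half_window)
    window = seq[start:pos + half_window + 1]
    return shustring_lengths(window)[pos - start]


if __name__ == "__main__":
    print(shustring_lengths("AACA"))
    print(local_shustring_length("AACA", 1, 1))
```

A position whose shustring is short relative to its window is locally distinctive, while a long shustring means the surrounding sequence repeats nearby. Under the poster's approach, repeats of this kind feed the neighbour counting and P-value steps, which this sketch does not attempt.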

