Identification of Regulatory Binding Sites Using Minimum Spanning Trees Pacific Symposium on Biocomputing, pp. 327-338, 2003 Reporter: Chu-Ting Tseng Advisor:


Identification of Regulatory Binding Sites Using Minimum Spanning Trees Pacific Symposium on Biocomputing, pp. 327-338, 2003 Reporter: Chu-Ting Tseng Advisor: Prof. Chang-Biau Yang Date: Apr. 2, 2004

Outline: Introduction; Minimum Spanning Tree (MST); Binding Site Identification by MST; Distance Scoring Function; Position-specific Information Content; Applications

Introduction Computationally, the binding-site identification problem is often defined as finding short “conserved” fragments, in a set of genomic sequences, that cover many (or all) of the provided sequences.

Minimum spanning trees (MST) An MST may be defined on points in Euclidean space or on a graph. G = (V, E): a weighted, connected, undirected graph. Spanning tree: S = (V, T), T ⊆ E, an undirected tree. Minimum spanning tree (MST): a spanning tree with the smallest total weight.

An example of MST A graph and one of its minimum-cost spanning trees (sum = 105)

Prim’s algorithm for finding MST
Step 1: Choose any x ∈ V. Let A = {x}, B = V - {x}.
Step 2: Select (u, v) ∈ E, u ∈ A, v ∈ B, such that (u, v) has the smallest weight among the edges between A and B.
Step 3: Put (u, v) in the tree. A = A ∪ {v}, B = B - {v}.
Step 4: If B = ∅, stop; otherwise, go to Step 2.
(see the example on the next page)
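The four steps above can be sketched in Python as follows. This is a minimal illustration, not the presenters' code: it assumes the graph is connected and given as a dense symmetric weight matrix (with `math.inf` for missing edges), and the function name `prim_mst` and its return values are hypothetical. It also records the order in which vertices are selected, since that order is used later for the edge-distance plot.

```python
import math

def prim_mst(weights, start=0):
    """Prim's algorithm on a dense symmetric weight matrix (assumed connected).

    Returns (order, parent, dist):
      order  - vertices in the order they joined the tree (set A),
      parent - parent[v] is the tree vertex v was attached to (None for start),
      dist   - dist[v] is the weight of the edge that attached v (0 for start).
    """
    n = len(weights)
    in_tree = [False] * n          # membership in A
    dist = [math.inf] * n          # cheapest known edge from A to each vertex of B
    parent = [None] * n
    dist[start] = 0.0
    order = []
    for _ in range(n):
        # Step 2: pick the vertex of B with the smallest edge to A.
        v = min((u for u in range(n) if not in_tree[u]), key=lambda u: dist[u])
        # Step 3: move v from B to A.
        in_tree[v] = True
        order.append(v)
        # Update the cheapest edges from the enlarged A to the rest of B.
        for u in range(n):
            if not in_tree[u] and weights[v][u] < dist[u]:
                dist[u] = weights[v][u]
                parent[u] = v
    return order, parent, dist
```

Scanning all vertices at every step gives O(|V|^2) time, which is adequate for the dense all-pairs distance graphs considered here; a heap-based variant would suit sparse graphs better.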

An example for Prim’s algorithm

Binding Site Finding by MST (1) Conceptually, we map all the fragments collected from the provided genomic sequences into a space so that fragments that are similar at the sequence level are mapped to nearby positions and dissimilar fragments to distant positions.

An Example of Mapping

Binding Site Finding by MST (2) Because conserved binding sites appear with relatively high frequency in the targeted genomic sequence regions, a group of such sites should form a “dense” cluster against a sparsely distributed background. If C is a cluster in the data set D, then C’s data points form a subtree of the MST of D.

Binding Site Finding by MST (3) We plot the MST edge distances in the order in which Prim’s algorithm selects them: the x-axis is the linear representation L(D) of D, and the y-axis is the distance of the corresponding MST edge. Each cluster should form a “valley” in this plot. A substring S of L(D) represents a cluster if and only if (a) S’s elements form a subtree, T_S, of D’s MST, and (b) both boundary edges of S have larger distances than any edge of T_S.

Edge-distance Plot Example

Binding Site Finding by MST (4) For every substring of L(D), check whether it is a cluster; each check can be done in time linear in the number of vertices. Total time: O(||D||^3)
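The substring test can be sketched as below. This is an illustrative reading of conditions (a) and (b), not the presenters' ClusterIdentification procedure. It assumes L(D) is the Prim selection order, that the left boundary edge of a substring is the edge that attached its first element, that the right boundary edge is the edge that attached the element just after it, and that a missing boundary at either end of L(D) counts as infinitely long. The name `find_valley_clusters` and the `min_size` parameter are hypothetical.

```python
import math

def find_valley_clusters(order, parent, dist, min_size=2):
    """Enumerate substrings of the Prim order that satisfy (a) and (b).

    order, parent, dist follow the usual Prim bookkeeping: order is the
    selection order, parent[v] the vertex v was attached to, dist[v] the
    weight of that attaching edge.
    """
    n = len(order)
    pos = {v: i for i, v in enumerate(order)}
    # edge[i]: distance of the MST edge that attached order[i] (none for i = 0).
    edge = [math.inf] + [dist[order[i]] for i in range(1, n)]
    # ppos[i]: position in the order of the parent of order[i].
    ppos = [-1] + [pos[parent[order[i]]] for i in range(1, n)]

    clusters = []
    for i in range(n):
        for j in range(i + min_size - 1, n):
            if j - i + 1 == n:
                continue  # skip the trivial substring covering all of D
            inner = range(i + 1, j + 1)
            # (a) every element after the first attaches to an element inside
            #     the substring, so the elements form a subtree of the MST.
            if any(ppos[k] < i for k in inner):
                continue
            # (b) both boundary edges are longer than every edge of that subtree.
            top = max(edge[k] for k in inner)
            left = edge[i]
            right = edge[j + 1] if j + 1 < n else math.inf
            if left > top and right > top:
                clusters.append(order[i:j + 1])
    return clusters
```

There are O(n^2) substrings and each check is linear in the substring length, which matches the cubic total stated above.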

Distance Scoring Function For two k-mers A = a_1…a_k, B = b_1…b_k ∊ S, we define their distance ρ(A, B) = Σ_{i=1..k} σ_i M(a_i, b_i), where M(x, y) = 0 if x = y and 1 otherwise. Initially, every σ_i is set to 1/K, where K is the number of sequences containing at least one of the k-mers A or B.
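A small sketch of this distance is given below. It assumes that ρ is the σ-weighted sum of per-position mismatch scores, which is how the definitions of σ_i and M(x, y) on this slide are read here. The function name `kmer_distance` is hypothetical, and the default of unit weights is only a convenience for illustration; the slide's initial choice σ_i = 1/K has to be supplied by the caller.

```python
def kmer_distance(a, b, sigma=None):
    """Weighted mismatch distance between two equal-length k-mers.

    sigma is a list of per-position weights; if omitted, every position
    gets weight 1.0 (an assumption made for this sketch).
    """
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    if sigma is None:
        sigma = [1.0] * len(a)
    # M(x, y) = 0 if x == y, otherwise 1
    return sum(s * (0 if x == y else 1) for s, x, y in zip(sigma, a, b))
```

For example, with K = 4 sequences containing A or B, passing `sigma=[0.25] * k` reproduces the initial weighting described above.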

Method
1. Break the sequences into k-mers.
2. Calculate the distance between each pair of k-mers.
3. Apply the ClusterIdentification procedure to identify all clusters.
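The first two steps can be sketched as follows. The helper names `extract_kmers` and `pairwise_distances` are hypothetical, and the sketch uses a plain mismatch count as the distance, which is a simplification of the weighted scoring function. The resulting matrix can serve as the weighted graph on which the MST is built.

```python
def extract_kmers(sequences, k):
    """Step 1: slide a window of length k over every sequence.

    Returns a list of (sequence index, start offset, k-mer) tuples.
    """
    kmers = []
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            kmers.append((seq_id, start, seq[start:start + k]))
    return kmers

def pairwise_distances(kmers):
    """Step 2: symmetric matrix of mismatch counts between all k-mers."""
    n = len(kmers)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a, b = kmers[i][2], kmers[j][2]
            d = sum(1 for x, y in zip(a, b) if x != y)
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

A sequence of length n contributes n - k + 1 overlapping k-mers, so the distance matrix grows quadratically with the total sequence length.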

Conditions for a Cluster to Be a Binding Site
1. The position-specific information content of the gapless multiple-sequence alignment of all the sequence fragments represented by a cluster should be relatively high.
2. Elements of an identified cluster should not be among long, simple repeats.
3. The data density within a cluster should be relatively higher than that of the overall background.

Position-Specific Information Content I = Σ_b f_b log2(f_b / p_b), summed over the bases b ∈ {A, C, G, T} at each position, where f_b is the observed frequency of each base in the collection of sites and p_b is the fraction of each base in the genome.

An Example of PSIC (1)

An Example of PSIC (2) Observed base frequencies f_b at one position: A = 11/23, C = 1/23, G = 2/23, T = 9/23. The term log2(f_b / p_b) is then computed for each base A, C, G, T.
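The per-position computation can be sketched as below, taking the information content to be the sum over bases of f_b log2(f_b / p_b). The function name `position_information` is hypothetical, and the uniform background p_b = 0.25 in the usage lines is an assumption for illustration only; the real p_b values come from the base composition of the genome.

```python
import math

def position_information(counts, background):
    """Information content (in bits) of one alignment column.

    counts     - dict mapping base to its count in the column
    background - dict mapping base to its genome-wide fraction p_b
    Bases with a zero count contribute nothing to the sum.
    """
    total = sum(counts.values())
    info = 0.0
    for base, count in counts.items():
        if count == 0:
            continue
        f = count / total
        info += f * math.log2(f / background[base])
    return info

# Counts from the example column (11, 1, 2, 9 out of 23 sites).
example_counts = {"A": 11, "C": 1, "G": 2, "T": 9}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed background
example_info = position_information(example_counts, uniform)
```

Under a uniform background a perfectly conserved column scores 2 bits, and a column whose frequencies match the background scores 0, so a mixed column such as the example falls in between.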

An Example of PSIC (3)

Scoring Function using PSIC (1) After a cluster is identified, we measure its position-specific information content. If the overall information content is lower than some threshold, we discard this cluster from further consideration. Otherwise…

Scoring Function using PSIC (2) For each position i, we use its information content as σ_i in the next iteration. Set M(a_i, b_i) = 2 - (p_i(a_i) + p_i(b_i)) + |p_i(a_i) - p_i(b_i)|, where p_i(x) is the frequency of letter x among all letters at position i.
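The refined mismatch score can be sketched as below. The helper names `position_frequencies` and `refined_mismatch` are hypothetical, and the sketch assumes that p_i is estimated from the fragments of the current cluster. Since x + y - |x - y| = 2·min(x, y), the expression is equivalent to 2·(1 - min(p_i(a_i), p_i(b_i))), so a pair is penalized according to its rarer letter at that position.

```python
from collections import Counter

def position_frequencies(cluster):
    """Per-position letter frequencies p_i(x) over equal-length fragments."""
    k = len(cluster[0])
    n = len(cluster)
    freqs = []
    for i in range(k):
        column = Counter(fragment[i] for fragment in cluster)
        freqs.append({letter: count / n for letter, count in column.items()})
    return freqs

def refined_mismatch(p_i, x, y):
    """M(x, y) = 2 - (p_i(x) + p_i(y)) + |p_i(x) - p_i(y)| at one position.

    Letters never seen at this position are given frequency 0.
    """
    px = p_i.get(x, 0.0)
    py = p_i.get(y, 0.0)
    return 2 - (px + py) + abs(px - py)
```

For a fully conserved position, two copies of the conserved letter score 0, while any pair involving an unseen letter scores the maximum of 2.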

Applications-- CRP (1) CRP: cyclic AMP receptor protein. CRP binding sites: 18 sequences, each 105 bp long, with 23 experimentally verified CRP binding sites (22-mers). The only cluster identified consists of 24 fragments, of which 20 are known CRP sites.

Applications-- CRP (2)

Applications-- CRP (3)

Applications-- Yeast (1) Yeast binding Sites: There are 8 regulatory sequences, each containing 1000 bp. By using 9-mers, our method identified several clusters. The most populated cluster is TTACCACCG.

Applications-- Yeast (2)

Applications-- Yeast (3)

Applications-- Human (1) Human binding sites: 113 sequences containing regulatory regions. Each sequence is 300 bp long, with 250 bp upstream and 50 bp downstream of the transcriptional start site.

Applications-- Human (2) The GCAGCC motif, allowing at most one mismatch, appears in 96 regulatory sequences. This is more frequent than the TATAAA motif, which appears in 66 regulatory sequences with at most one mismatch.

Applications-- Human (3)

Reference Stormo, G. D. and Hartzell III, G. W. “Identifying protein-binding sites from unaligned DNA fragments.” Proceedings of the National Academy of Sciences USA, Vol. 86, pp , 1989.

END