Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.



2 Motif consensus
The consensus is the true underlying motif, which is expressed imperfectly in real genes because of mutations across organisms.
A motif instance is a particular realization of the motif consensus in a given gene; it differs from the consensus in a small number of positions.

3 Motif data example (made up)
Motif instances:
– AAAAACAC
– CAAAACAA
– ACACAAAA
– CAAAAAAC
– AAAGAACA
– GACAAAAA
– AAGAGAAA
Motif consensus: AAAAAAAA
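The relationship between the instances and the consensus on this slide can be checked with a short sketch. It assumes the simplest possible rule, a per-column majority vote, which is only one of several ways to define a consensus; the function names `consensus` and `mismatches` are illustrative and not from the slides.

```python
from collections import Counter

# The made-up motif instances from the slide.
instances = [
    "AAAAACAC", "CAAAACAA", "ACACAAAA", "CAAAAAAC",
    "AAAGAACA", "GACAAAAA", "AAGAGAAA",
]

def consensus(seqs):
    """Most frequent letter in each column (ties go to the letter seen first)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def mismatches(seq, ref):
    """Number of positions where an instance differs from the reference."""
    return sum(a != b for a, b in zip(seq, ref))

motif = consensus(instances)
distances = [mismatches(s, motif) for s in instances]
print(motif, distances)
```

Under this rule every instance differs from the consensus in only a few positions, which is the property the previous slide describes.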

4 Motif data example (real)
Positions 3-9 (out of about 22) of the cyclic AMP receptor protein transcription factor binding site in 20 samples:
– TTGTGGC
– TTTTGAT
– AAGTGTC
– ATTTGCA
– CTGTGAG
– ATGCAAA
– GTGTTAA
– ATTTGAA
– TTGTGAT
– ATTTATT
– ACGTGAT
– ATGTGAG
– TTGTGAG
– CTGTAAC
– CTGTGAA
– TTGTGAC
– GCCTGAC
– TTGTGAT
– GTGTGAA
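With real data, a single consensus string hides how strongly each position is conserved. A minimal sketch of a per-position count profile follows, using the binding-site fragments exactly as they appear in this transcript; the helper name `count_profile` is an assumption made for illustration.

```python
from collections import Counter

# Binding-site fragments as listed in the transcript above.
sites = [
    "TTGTGGC", "TTTTGAT", "AAGTGTC", "ATTTGCA", "CTGTGAG",
    "ATGCAAA", "GTGTTAA", "ATTTGAA", "TTGTGAT", "ATTTATT",
    "ACGTGAT", "ATGTGAG", "TTGTGAG", "CTGTAAC", "CTGTGAA",
    "TTGTGAC", "GCCTGAC", "TTGTGAT", "GTGTGAA",
]

def count_profile(seqs, alphabet="ACGT"):
    """One dict of letter counts per alignment column."""
    profile = []
    for col in zip(*seqs):
        counts = Counter(col)
        profile.append({base: counts[base] for base in alphabet})
    return profile

profile = count_profile(sites)
for position, counts in enumerate(profile, start=1):
    print(position, counts)
```

Columns dominated by a single letter are the strongly conserved positions; columns with mixed counts are where instances tend to deviate.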

5 Phylogenetic footprinting
A phylogenetic tree organizes related (orthologous) sequences from different species.
The sequences appear as leaves.
Internal nodes indicate evolutionary divergence between species.
A footprint is a highly conserved region across species.

6 Identifying footprints
Main assumption: functional DNA changes more slowly than other DNA.
Therefore, closely related regions in different species are
– more likely to be functional sequences
– a basis for grouping species together
Footprints are DNA motifs.

7 Phylogenetic footprinting example
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACAG... (Rabbit)
GAACGGAGTACTG... (Mouse)
TCGTGACGGTGAT... (Rat)


9 Phylogenetic footprinting example
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACAG... (Rabbit)
GAACGGAGTACTG... (Mouse)
TCGTGACGGTGAT... (Rat)
ACGT

10 Phylogenetic footprinting example
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACAG... (Rabbit)
GAACGGAGTACTG... (Mouse)
TCGTGACGGTGAT... (Rat)
ACGT ACGG

11 Phylogenetic footprinting example
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACAG... (Rabbit)
GAACGGAGTACTG... (Mouse)
TCGTGACGGTGAT... (Rat)
ACGT ACGG ACG[TG]

12 Phylogenetic footprinting example
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACAG... (Rabbit)
GAACGGAGTACTG... (Mouse)
TCGTGACGGTGAT... (Rat)
ACGT ACGG ACGT T→G mutation ACGT
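The footprint in this example can be located mechanically once the pattern ACG[TG] is known. The sketch below only scans for that pattern with a regular expression; it does not discover the motif or use the tree, and it works on the visible sequence prefixes because the slides truncate the sequences with "...".

```python
import re

# Visible prefixes of the five orthologous sequences from the slides.
sequences = {
    "Human": "AGTCGTACGTGAC",
    "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACAG",
    "Mouse": "GAACGGAGTACTG",
    "Rat": "TCGTGACGGTGAT",
}

# The footprint from the slides: ACG followed by T or G.
footprint = re.compile(r"ACG[TG]")

# For each species, record (0-based start position, matched text).
hits = {
    species: [(m.start(), m.group()) for m in footprint.finditer(seq)]
    for species, seq in sequences.items()
}

for species, found in hits.items():
    print(species, found)
```

The primate and rabbit sequences carry ACGT while the rodent sequences carry ACGG, consistent with the single T→G mutation marked on the last slide of the example.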

13 Finding motifs
Start with a number of related genes (or proteins).
In regulatory motif finding,
– the related genes are co-expressed
Recall our discussion of DNA micro-arrays.

14 Finding motifs: Start
The red part is a gene, the green line is the associated regulatory region in non-coding DNA, and the yellow boxes are the motif instances (unknown).

15 Finding motifs: Goal
The red part is a gene, the green line is the associated regulatory region in non-coding DNA, and the yellow boxes are the motif instances (unknown).

16 How does this relate to what we have discussed before?
Motif finding is a clear instance of a data mining problem.
Motif finding is equivalent to local alignment across multiple sequences.
Typically hundreds of sequences are aligned, sometimes thousands.
There are also corresponding biological problems for global alignment of multiple sequences.

17 Multiple sequence alignment
Protein families
– Sets of proteins with similar structure (3D shape), function, or evolutionary history
– Usually the above properties are correlated
– Given several families, where to assign a new protein?
DNA repeating sequences
– ALU sequence in humans (300 bp, appears more than 1 million times, 10% of our DNA)
– Estimated 60% of the “junk” in human genome consists of such sequences

18 Optimal alignment
We define the multiple global alignment as an extension of strings $S_1, S_2, \ldots, S_k$ to $S'_1, S'_2, \ldots, S'_k$ that may contain spaces, with
– $|S'_1| = |S'_2| = \cdots = |S'_k|$
– Removing all spaces from each $S'_i$ leaves $S_i$
– No position has a space in all $S'_i$
We need to extend our similarity function to handle multiple strings.
The optimal alignment is the one that maximizes the similarity function.
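The three conditions in this definition translate directly into a validity check. The following is a minimal sketch, assuming spaces are written as "-"; the function name `is_valid_alignment` is illustrative.

```python
def is_valid_alignment(originals, rows, gap="-"):
    """Check the three conditions of a multiple global alignment."""
    if not rows or len(originals) != len(rows):
        return False
    # 1. All extended strings have the same length.
    same_length = len({len(r) for r in rows}) == 1
    # 2. Removing the spaces from each row gives back the original string.
    restores = all(r.replace(gap, "") == s for r, s in zip(rows, originals))
    # 3. No column consists only of spaces.
    no_empty_column = all(any(ch != gap for ch in col) for col in zip(*rows))
    return same_length and restores and no_empty_column

originals = ["ACGT", "ACT", "AGT"]
print(is_valid_alignment(originals, ["ACGT", "AC-T", "A-GT"]))
print(is_valid_alignment(originals, ["ACGT-", "AC-T-", "A-GT-"]))
```

The second call fails only the third condition: its last column is a space in every row.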

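19 Multiple string similarity
There are many ways to do so. The most common is the sum of pairwise similarities: for every pair of aligned strings, add up the character-level similarity $\sigma$ over all columns $c$, giving $\sum_{i<j} \sum_{c} \sigma(S'_i[c], S'_j[c])$.
This assumes a symmetric similarity.
We need to account for $\sigma(-,-)$ (usually 0).
Alternatively, we can use distances between strings and minimize the sum of the pairwise distances.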
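A minimal sketch of the sum-of-pairs similarity for an alignment that is already given. The numeric scores (match +1, mismatch -1, character against space -1) are assumptions chosen for illustration; only the convention that a space against a space scores 0 comes from the slide.

```python
from itertools import combinations

GAP = "-"

def sigma(x, y):
    """Hypothetical symmetric character similarity."""
    if x == GAP and y == GAP:
        return 0          # space against space, as on the slide
    if x == GAP or y == GAP:
        return -1         # assumed penalty for a character against a space
    return 1 if x == y else -1

def pair_similarity(a, b):
    """Similarity of two aligned rows of equal length."""
    return sum(sigma(x, y) for x, y in zip(a, b))

def sum_of_pairs(rows):
    """Sum of the similarities of all unordered pairs of rows."""
    return sum(pair_similarity(a, b) for a, b in combinations(rows, 2))

rows = ["AC-T", "ACGT", "A-GT"]
print(sum_of_pairs(rows))
```

Because `sigma` is symmetric, each unordered pair is counted once, which is why `combinations` is used rather than all ordered pairs.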

20 Dynamic programming for multiple sequence alignment
In pairwise alignment, we used a two-dimensional matrix to record three choices at each cell: {01}, {10}, and {11}, where 1 means consume a character from the corresponding string.

21 DP for multiple alignment
For $k$ strings we need a $k$-dimensional table.
Each dimension has as many elements as the length of the corresponding string plus one (for gaps at the start).
Assuming the same length $n$, the matrix has $(n+1)^k$ cells.
At each cell, we consider $2^k - 1$ choices.
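The choices at each cell are the nonzero 0/1 vectors of length k, generalizing {01}, {10}, {11} from the pairwise case. A short sketch that enumerates them and counts table cells; the function names are illustrative.

```python
from itertools import product

def moves(k):
    """All nonzero 0/1 vectors of length k; 1 means consume a character
    from the corresponding string, 0 means insert a space."""
    return [m for m in product((0, 1), repeat=k) if any(m)]

def table_cells(n, k):
    """Number of cells in the k-dimensional table for k strings of length n."""
    return (n + 1) ** k

print(moves(2))               # the three pairwise choices
print(len(moves(5)))          # 2**5 - 1 choices for five strings
print(table_cells(100, 3))    # cells for three strings of length 100
```

The all-zero vector is excluded because it would produce a column containing only spaces.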

22 Multiple alignment complexity
$(n+1)^k = O(n^k)$ entries need to be filled, each in $O(2^k)$ time.
Total time: $O(n^k 2^k) = O((2n)^k)$.
Total space: $O(n^k)$.
Typically $n$ is a few thousand and $k$ a few hundred, making this approach impractical.
Independently of whether DP is used, for the sum of pairwise similarities the problem is provably NP-complete.
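For very small inputs the exact dynamic program is still usable. The sketch below computes the optimal sum-of-pairs score with memoized recursion over the k-dimensional table. The scoring values in `sigma` are assumptions for illustration, and the function returns only the score, not the alignment itself.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = "-"

def sigma(x, y):
    """Hypothetical scores: match +1, mismatch -1, space penalty -1, two spaces 0."""
    if x == GAP and y == GAP:
        return 0
    if x == GAP or y == GAP:
        return -1
    return 1 if x == y else -1

def column_score(column):
    """Sum of pairwise similarities within one alignment column."""
    return sum(sigma(x, y) for x, y in combinations(column, 2))

def msa_score(strings):
    """Optimal sum-of-pairs score of a global alignment of all strings."""
    k = len(strings)
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    @lru_cache(maxsize=None)
    def best(pos):
        # pos[i] is the number of characters of strings[i] aligned so far.
        if not any(pos):
            return 0
        candidates = []
        for m in moves:
            prev = tuple(p - d for p, d in zip(pos, m))
            if min(prev) < 0:
                continue  # this move would consume a character that is not there
            column = [strings[i][pos[i] - 1] if m[i] else GAP for i in range(k)]
            candidates.append(best(prev) + column_score(column))
        return max(candidates)

    return best(tuple(len(s) for s in strings))

print(msa_score(["AC", "AC", "AC"]))
print(msa_score(["AC", "A"]))
```

The recursion visits up to $(n+1)^k$ positions and tries up to $2^k - 1$ moves at each, so it is only practical for a handful of short strings, which is the point of the slide.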

23 What to do for NP-complete problems?
Use exact methods (such as DP) for small inputs only.
Use approximate methods with polynomial time and a provable error bound.
Use heuristic approaches that follow plausible choices but have no guaranteed error bound
– specific to the problem (such as FASTA)
– general (optimization, estimation via statistical sampling such as MCMC)

24 Center star algorithm for multiple sequence global alignment
$T$ is the set of strings that we want to align.
Pick the center string $S \in T$ that minimizes the sum of its distances to all other strings in $T$, $\sum_{S' \in T,\, S' \neq S} D(S, S')$, where $D$ is the pairwise distance.
The initial alignment starts with $S$ ($\equiv S_1$).
Suppose we have already aligned $S_1, S_2, \ldots, S_i$ as $S'_1, S'_2, \ldots, S'_i$. We add the remaining strings one at a time: align $S_{i+1}$ with $S'_1$, obtaining $S'_{i+1}$ and $S''_1$; then replace $S'_1$ with $S''_1$ and add spaces to $S'_2, \ldots, S'_i$ wherever spaces were added to $S'_1$.
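The procedure on this slide can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes unit-cost edit distance as the pairwise distance, treats a space already present in the center as matching a new space at no cost, and uses the illustrative names `align_pair`, `pad`, and `center_star`.

```python
GAP = "-"

def cost(x, y):
    """Unit edit cost; a space against a space is free."""
    return 0 if x == y else 1

def align_pair(a, b):
    """Globally align a (which may already contain spaces) with b.
    Returns (distance, aligned_a, aligned_b, inserted), where inserted[c]
    is True if column c is a space newly added to a."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost(a[i - 1], GAP)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost(GAP, b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + cost(a[i - 1], b[j - 1]),
                          d[i - 1][j] + cost(a[i - 1], GAP),
                          d[i][j - 1] + cost(GAP, b[j - 1]))
    out_a, out_b, inserted = [], [], []
    i, j = n, m
    while i > 0 or j > 0:  # trace back from the last cell
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); inserted.append(False)
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + cost(a[i - 1], GAP):
            out_a.append(a[i - 1]); out_b.append(GAP); inserted.append(False)
            i -= 1
        else:
            out_a.append(GAP); out_b.append(b[j - 1]); inserted.append(True)
            j -= 1
    return d[n][m], "".join(out_a[::-1]), "".join(out_b[::-1]), inserted[::-1]

def pad(row, inserted):
    """Add a space to row wherever a space was newly added to the center."""
    chars = iter(row)
    return "".join(GAP if new else next(chars) for new in inserted)

def center_star(strings):
    """Return the aligned rows in the same order as the input strings."""
    def total_distance(c):
        return sum(align_pair(strings[c], t)[0]
                   for k, t in enumerate(strings) if k != c)
    c = min(range(len(strings)), key=total_distance)  # index of the center
    center, rows = strings[c], {}
    for k, s in enumerate(strings):
        if k == c:
            continue
        _, center, new_row, inserted = align_pair(center, s)
        rows = {idx: pad(row, inserted) for idx, row in rows.items()}
        rows[k] = new_row
    rows[c] = center
    return [rows[k] for k in range(len(strings))]

print(center_star(["AT", "ACT", "AGT"]))
```

Each string is aligned against the center only, so the method needs a number of pairwise alignments that is quadratic in the number of strings (to pick the center) rather than a k-dimensional table.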