Multiple Sequence Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.

Presentation transcript:

Multiple Sequence Alignment Vasileios Hatzivassiloglou University of Texas at Dallas

2 Center star algorithm for multiple sequence global alignment
T is the set of strings that we want to align.
Pick $S \in T$ that minimizes $\sum_{S' \in T \setminus \{S\}} d(S, S')$.
The initial alignment starts with $S$ ($\equiv S_1$).
Suppose we have already aligned $S_1, S_2, \ldots, S_i$ as $S'_1, S'_2, \ldots, S'_i$. We then add the remaining strings one at a time: align $S_{i+1}$ with $S'_1$, obtaining $S'_{i+1}$ and $S''_1$. Replace $S'_1$ with $S''_1$, and add spaces to $S'_2, \ldots, S'_i$ wherever spaces were added to $S'_1$.

3 Finding S
S is the best representative of the set T in terms of the distance metric d.
If T is considered as a cluster of strings, then S is the centroid of the cluster.
To find S, align each string with every other string ($\binom{k}{2}$ pairs, where $k = |T|$) and calculate the sum of distances for each candidate. Pick the candidate that minimizes this sum.
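The selection of S can be sketched in Python as follows. This is a minimal sketch that assumes d is the unit-cost edit distance (the slides do not fix a particular d); the names `edit_distance` and `pick_center` are illustrative and do not come from the original material.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (an assumed choice of d), O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # delete ca
                            curr[j - 1] + 1,              # insert cb
                            prev[j - 1] + (ca != cb)))    # match or substitute
        prev = curr
    return prev[-1]


def pick_center(strings, dist=edit_distance):
    """Return (index, totals): the index of the string whose summed distance
    to all other strings is smallest, plus the sum for every candidate."""
    k = len(strings)
    totals = [sum(dist(strings[i], strings[j]) for j in range(k) if j != i)
              for i in range(k)]
    best = min(range(k), key=totals.__getitem__)
    return best, totals


if __name__ == "__main__":
    # The three strings used in the example slides.
    print(pick_center(["GTA", "CGT", "CAG"]))  # (1, [5, 4, 5]): CGT is the center
```

Under this assumed distance, the candidate sums are 5, 4, and 5, so CGT is selected, which agrees with the choice made in the example.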

4 Example
Three strings: GTA, CGT, CAG
Step 1: Calculate all three pairwise distances and pick the string that minimizes the total distance; let’s say it’s CGT.
Step 2-1: Align CGT with GTA:
  CGT-
  -GTA
Step 2-2: Extend uninvolved, processed strings with spaces (not needed now).

5 Example (continued)
Step 3-1: Align CGT- with CAG:
  C-GT-
  CAG--
Step 3-2: Extend uninvolved, processed strings with spaces (-GTA becomes --GTA):
  C-GT-
  --GTA
  CAG--
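The merging steps of the example can be sketched in Python as below. This is an illustrative sketch, not the lecture's own implementation: it assumes unit costs (0 for a match or for a space against a space, 1 otherwise), and the names `align_pair`, `extend`, and `center_star` are hypothetical. Because optimal alignments are not unique, tie-breaking in the traceback may produce a different but equally good alignment than the one shown on the slide.

```python
GAP = "-"


def cost(x, y):
    """Assumed unit-cost scheme: identical symbols (including gap/gap) cost 0."""
    return 0 if x == y else 1


def align_pair(a, b):
    """Globally align a (which may already contain gaps) with the raw string b.
    Returns (a_aligned, b_aligned, inserted), where inserted[c] is True when
    column c is a new gap added to a."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + cost(a[i - 1], GAP)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + cost(a[i - 1], b[j - 1]),
                          D[i - 1][j] + cost(a[i - 1], GAP),
                          D[i][j - 1] + 1)
    # Trace back from the bottom-right corner, building columns in reverse.
    ra, rb, inserted = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + cost(a[i - 1], b[j - 1]):
            ra.append(a[i - 1]); rb.append(b[j - 1]); inserted.append(False)
            i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + cost(a[i - 1], GAP):
            ra.append(a[i - 1]); rb.append(GAP); inserted.append(False)
            i -= 1
        else:
            ra.append(GAP); rb.append(b[j - 1]); inserted.append(True)
            j -= 1
    return "".join(reversed(ra)), "".join(reversed(rb)), inserted[::-1]


def extend(row, inserted):
    """Add a space to an already aligned row wherever one was added to S'_1."""
    chars = iter(row)
    return "".join(GAP if new_gap else next(chars) for new_gap in inserted)


def center_star(strings, center_index):
    """Return aligned rows: the center first, then the others in input order."""
    rows = [strings[center_index]]
    for idx, s in enumerate(strings):
        if idx == center_index:
            continue
        new_center, new_row, inserted = align_pair(rows[0], s)
        rows = [new_center] + [extend(r, inserted) for r in rows[1:]] + [new_row]
    return rows


if __name__ == "__main__":
    for row in center_star(["GTA", "CGT", "CAG"], center_index=1):
        print(row)
```

Recording the inserted columns during the traceback, rather than comparing the old and new versions of the center row afterwards, avoids ambiguity when a new space lands next to an existing one.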

6 Algorithm complexity – Finding S
To find S, we consider k candidates.
For each candidate, we calculate the sum of k-1 terms, so there are $O(k^2)$ such terms in total.
If the maximum string length is n, then each term can be calculated in $O(n^2)$ time.
The total for finding S is $O(k^2 n^2)$.

7 Algorithm complexity – Subsequent alignments
Each subsequent alignment at step i+1 aligns a string $S'_1$ of length at most $i \cdot n$ with a string $S_{i+1}$ of length at most n.
Each alignment can be found in time $O(i n \cdot n)$.
The total time for these alignments is $\sum_{i=1}^{k-1} O(i n^2) = O(k^2 n^2)$.

8 Algorithm complexity – Extensions with spaces
At step i+1 there is an extension of i-1 strings, each of length at most $i \cdot n$.
For each such string, we need to consider a total of n new space positions.
The time required for these extensions stays within the same bound, so the overall total time for the algorithm is $O(k^2 n^2)$.
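As a rough numerical check of the two alignment phases, the sketch below counts dynamic-programming table cells in the worst case and compares the total with $k^2 n^2$. The function name `worst_case_cells` is hypothetical, table borders are ignored, and the cost of extending strings with spaces is not modeled.

```python
def worst_case_cells(k, n):
    """Approximate worst-case DP cell counts for k strings of length at most n."""
    # Finding S: one n-by-n table for each of the k(k-1)/2 pairs.
    finding_center = (k * (k - 1) // 2) * n * n
    # Step i+1 (i = 1 .. k-1): one (i*n)-by-n table.
    later_alignments = sum((i * n) * n for i in range(1, k))
    return finding_center, later_alignments


if __name__ == "__main__":
    k, n = 5, 10
    first, later = worst_case_cells(k, n)
    print(first, later, first + later <= k * k * n * n)  # 1000 1000 True
```

Both counts equal $n^2 k(k-1)/2$, so their sum $n^2 k(k-1)$ never exceeds $k^2 n^2$.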

9 Error bounds
It is useful to know how far the solution found by an approximate algorithm is from the true optimal solution.
Sometimes (but not always) it is possible to provide error bounds, that is, to give upper and lower bounds for the ratio of the score of the approximate solution to the score of the optimal solution.
Bounds may depend on n and k.

10 Error analysis assumptions
Sometimes we need additional assumptions in order to derive useful bounds.
For the approximate algorithm for multiple string alignment, we assume the triangle inequality for the measure d: $d(x,y) \le d(x,z) + d(z,y)$.

11 Background on distances
A distance or metric d is formally defined as a function A×A→ℜ on a set A (called a metric space) with the following properties:
– d(x,y)≥0 (non-negativity)
– d(x,y)=0 iff x=y (identity of indiscernibles)
– d(x,y)=d(y,x) (symmetry)
– d(x,y)≤d(x,z)+d(z,y) (triangle inequality)
Metric spaces include ℜ (with d(x,y)=|x-y|), all Euclidean spaces, the L_p spaces, and inner product spaces.

12 Background on distances (relaxing the properties)
The four properties of a metric, numbered in the same order as before:
– 1. d(x,y)≥0: follows from properties 2, 3, and 4
– 2. d(x,y)=0 iff x=y: if distinct points may be at distance 0, d is a pseudometric
– 3. d(x,y)=d(y,x): without symmetry, d is a quasimetric
– 4. d(x,y)≤d(x,z)+d(z,y): without the triangle inequality, d is a semimetric
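The four properties can be checked by brute force on a small finite sample of points, as in the sketch below. The helper name `metric_axioms` is hypothetical. Passing the check on a sample does not prove that d is a metric on the whole space, but a failure does show which property is violated.

```python
from itertools import product


def metric_axioms(points, d):
    """Report which metric properties hold for d on a finite sample of points."""
    pts = list(points)
    pairs = list(product(pts, repeat=2))
    return {
        "non_negativity": all(d(x, y) >= 0 for x, y in pairs),
        "identity": all((d(x, y) == 0) == (x == y) for x, y in pairs),
        "symmetry": all(d(x, y) == d(y, x) for x, y in pairs),
        "triangle": all(d(x, y) <= d(x, z) + d(z, y)
                        for x, y, z in product(pts, repeat=3)),
    }


if __name__ == "__main__":
    sample = range(-3, 4)
    # |x - y| satisfies all four properties on the sample.
    print(metric_axioms(sample, lambda x, y: abs(x - y)))
    # (x - y)^2 fails only the triangle inequality, e.g. d(0,2)=4 > d(0,1)+d(1,2)=2.
    print(metric_axioms(sample, lambda x, y: (x - y) ** 2))
```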

13 Deriving an error bound
Let $v_0$ be the score for the optimal alignment and $v^*$ the score for the alignment produced by the center star algorithm.
Let $d_0(i,j)$ and $d^*(i,j)$ be the corresponding induced distances on strings $S_i$ and $S_j$.

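14 Lower bound for $v_0$
Taking the alignment score to be the sum of the induced pairwise distances:
$v_0 = \frac{1}{2} \sum_{i=1}^{k} \sum_{j \ne i} d_0(i,j) \ge \frac{1}{2} \sum_{i=1}^{k} \sum_{j \ne i} d(S_i, S_j)$ (because the induced distance can be no less than the distance between the strings themselves)
$\ge \frac{1}{2} \sum_{i=1}^{k} \sum_{j=2}^{k} d(S_1, S_j) = \frac{k}{2} \sum_{j=2}^{k} d(S_1, S_j)$ (choice of $S_1$)

15 Upper bound for $v^*$
$v^* = \frac{1}{2} \sum_{i=1}^{k} \sum_{j \ne i} d^*(i,j) \le \frac{1}{2} \sum_{i=1}^{k} \sum_{j \ne i} \left[ d^*(i,1) + d^*(1,j) \right]$ (triangle inequality)
$= (k-1) \sum_{j=2}^{k} d^*(1,j)$ (symmetry)
$= (k-1) \sum_{j=2}^{k} d(S_1, S_j)$ (each string is aligned with $S_1$ optimally; there may be additional spaces in matching positions, which do not change the distance)

16 Combining the bounds
$\frac{v^*}{v_0} \le \frac{(k-1) \sum_{j=2}^{k} d(S_1, S_j)}{\frac{k}{2} \sum_{j=2}^{k} d(S_1, S_j)} = \frac{2(k-1)}{k} < 2$
Better bound for low k: for example, $k = 3$ gives $v^*/v_0 \le 4/3$.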
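To see how the guarantee tightens for small k, the sketch below tabulates the ratio bound exactly. It assumes the standard center star guarantee $v^*/v_0 \le 2(k-1)/k$ under a sum-of-pairs score and the triangle inequality; the function name `center_star_ratio_bound` is hypothetical.

```python
from fractions import Fraction


def center_star_ratio_bound(k):
    """Upper bound 2(k-1)/k on v*/v0 for k >= 2 strings (assumed guarantee)."""
    if k < 2:
        raise ValueError("need at least two strings")
    return Fraction(2 * (k - 1), k)


if __name__ == "__main__":
    for k in (2, 3, 4, 10, 100):
        bound = center_star_ratio_bound(k)
        print(k, bound, float(bound))
```

For k = 2 the bound is 1, which matches the fact that a single optimal pairwise alignment is already optimal; as k grows, the bound approaches 2 from below.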

17 Motif data notation
A motif is denoted by three parameters:
– Its length l
– The number of allowed spaces g
– The number of allowed changes d
– (l, d, g) notation
Changes and gaps are allowed because of mutations across organisms.
In a “good” motif, g and d are small compared to l.
Most work assumes g = 0.
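For the common case g = 0, a window of length l counts as a motif instance when it differs from the consensus in at most d positions. The sketch below illustrates this with a Hamming-distance scan; the function names and the example sequence are made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_instances(sequence, consensus, d):
    """Start positions of windows within d changes of the consensus (g = 0)."""
    l = len(consensus)
    return [i for i in range(len(sequence) - l + 1)
            if hamming(sequence[i:i + l], consensus) <= d]


if __name__ == "__main__":
    seq = "TTACGTAAACCTT"                      # made-up example sequence
    print(find_instances(seq, "ACGT", d=0))    # [2]: exact match only
    print(find_instances(seq, "ACGT", d=1))    # [2, 8]: also ACCT, one change
```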

18 Finding the motif consensus
Assume known motif instance positions and length (e.g., via multiple alignment).
Also known as the known site problem.
Input: A set of motif instances.
Output: What is the motif consensus? Further, is the consensus a valid motif, or is it statistically indistinguishable from what we would expect from other randomly chosen regions?
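One simple way to answer the first question is a majority vote per position, sketched below. This assumes gap-free instances of equal length, and the function name `consensus` and the sample instances are illustrative only. It does not address the second question, statistical significance.

```python
from collections import Counter


def consensus(instances):
    """Most frequent symbol in each column of equal-length motif instances."""
    if len({len(s) for s in instances}) != 1:
        raise ValueError("need a non-empty set of equal-length instances")
    # zip(*instances) yields one tuple of symbols per alignment column.
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*instances))


if __name__ == "__main__":
    print(consensus(["ACGT", "ACGA", "TCGT"]))  # ACGT
```

When two symbols are equally frequent in a column, `most_common` returns the one encountered first, so a real application would need an explicit tie-breaking rule.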

19 Statistical estimation
An important approach to many data mining and machine learning tasks.
Requirement: The problem must be expressed as a probability function that depends on a number of modeled parameters whose values are unknown.
The estimation task: Find the optimal values for these parameters.
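As a small illustration of the estimation task, the sketch below uses maximum likelihood, which is one common notion of "optimal" and an assumption here, since the slide does not name a criterion. The model is a single unknown success probability p, and the best value is found by a plain grid search; the function names are hypothetical.

```python
import math


def log_likelihood(p, successes, trials):
    """Log-probability of the observed counts under success probability p."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


def grid_mle(successes, trials, steps=100):
    """Grid value of p in (0, 1) that maximizes the log-likelihood."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: log_likelihood(p, successes, trials))


if __name__ == "__main__":
    print(grid_mle(7, 10))  # close to 7/10, the observed relative frequency
```

For this model the maximizer is known in closed form (the relative frequency), so the grid search serves only to make the "find the optimal parameter values" step concrete.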

20 Estimation example
Estimation can be performed without an explicit probabilistic model.
Example: Futures markets are exchanges where contracts are traded for future execution.
The contract price reflects the probabilities of events.

21 Obama contract at intrade.com