CS262 Lecture 9, Win07, Batzoglou Real-world protein aligners MUSCLE  High throughput  One of the best in accuracy ProbCons  High accuracy  Reasonable.

Slides:



Advertisements
Similar presentations
Algorithm Analysis Input size Time I1 T1 I2 T2 …
Advertisements

Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
CS262 Lecture 9, Win07, Batzoglou History of WGA 1982: -virus, 48,502 bp 1995: h-influenzae, 1 Mbp 2000: fly, 100 Mbp 2001 – present  human (3Gbp), mouse.
BNFO 602 Multiple sequence alignment Usman Roshan.
Sequence Similarity. The Viterbi algorithm for alignment Compute the following matrices (DP)  M(i, j):most likely alignment of x 1 …x i with y 1 …y j.
6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.
Methods to CHAIN Local Alignments Sparse Dynamic Programming O(N log N)
CS262 Lecture 9, Win07, Batzoglou Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11 sections4-7 Lecturer:
CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
1 Protein Multiple Alignment by Konstantin Davydov.
CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 14.10: Common Multiple.
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
Expected accuracy sequence alignment
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
(Regulatory-) Motif Finding
CS262 Lecture 9, Win07, Batzoglou Multiple Sequence Alignments.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
BNFO 602 Lecture 2 Usman Roshan. Sequence Alignment Widely used in bioinformatics Proteins and genes are of different lengths due to error in sequencing.
CS262 Lecture 9, Win07, Batzoglou Phylogeny Tree Reconstruction
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments.
BNFO 602, Lecture 3 Usman Roshan Some of the slides are based upon material by David Wishart of University.
Multiple Sequence Alignments. Lecture 12, Tuesday May 13, 2003 Reading Durbin’s book: Chapter Gusfield’s book: Chapter 14.1, 14.2, 14.5,
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
CS262 Lecture 14, Win06, Batzoglou Multiple Sequence Alignments.
Multiple Sequence alignment Chitta Baral Arizona State University.
BNFO 602 Multiple sequence alignment Usman Roshan.
Finding the optimal pairwise alignment We are interested in finding the alignment of two sequences that maximizes the similarity score given an arbitrary.
Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006.
More Graph Algorithms Weiss ch Exercise: MST idea from yesterday Alternative minimum spanning tree algorithm idea Idea: Look at smallest edge not.
Marina Sirota CS374 October 19, 2004 P ROTEIN M ULTIPLE S EQUENCE A LIGNMENT.
Practical multiple sequence algorithms Sushmita Roy BMI/CS 576 Sushmita Roy Sep 23rd, 2014.
Important Problem Types and Fundamental Data Structures
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Multiple Sequence Alignment
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Gene expression & Clustering (Chapter 10)
Multiple Sequence Alignment. Definition Given N sequences x 1, x 2,…, x N :  Insert gaps (-) in each sequence x i, such that All sequences have the.
CSCE350 Algorithms and Data Structure Lecture 17 Jianjun Hu Department of Computer Science and Engineering University of South Carolina
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Graph Algorithms. Definitions and Representation An undirected graph G is a pair (V,E), where V is a finite set of points called vertices and E is a finite.
Lars Arge Presented by Or Ozery. I/O Model Previously defined: N = # of elements in input M = # of elements that fit into memory B = # of elements per.
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Progressive multiple sequence alignments from triplets by Matthias Kruspe and Peter F Stadler Presented by Syed Nabeel.
MUSCLE An Attractive MSA Application. Overview Some background on the MUSCLE software. The innovations and improvements of MUSCLE. The MUSCLE algorithm.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.
Expected accuracy sequence alignment Usman Roshan.
Algorithm Design Methods (II) Fall 2003 CSE, POSTECH.
Analysis of Algorithms CS 477/677 Instructor: Monica Nicolescu Lecture 7.
EVOLUTIONARY HMMS BAYESIAN APPROACH TO MULTIPLE ALIGNMENT Siva Theja Maguluri CS 598 SS.
CS38 Introduction to Algorithms Lecture 10 May 1, 2014.
Sequence Alignment Abhishek Niroula Department of Experimental Medical Science Lund University
Expected accuracy sequence alignment Usman Roshan.
Sequence Similarity. PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins INSERTXINSERTY MATCH xixixixi yjyjyjyj ― yjyjyjyj xixixixi―
Multiple Sequence Alignment
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Multiple Sequence Alignments. The Global Alignment problem AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC x y z.
9/8/10CS 6463: AT Computational Geometry1 CS 6463: AT Computational Geometry Fall 2010 Triangulations and Guarding Art Galleries Carola Wenk.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Multiple sequence alignment (msa)
Multiple Sequence Alignment
Dynamic Programming 1/15/2019 8:22 PM Dynamic Programming.
Dynamic Programming II DP over Intervals
Presentation transcript:

CS262 Lecture 9, Win07, Batzoglou Real-world protein aligners MUSCLE  High throughput  One of the best in accuracy ProbCons  High accuracy  Reasonable speed

CS262 Lecture 9, Win07, Batzoglou MUSCLE at a glance 1.Fast measurement of all pairwise distances between sequences D DRAFT (x, y) defined in terms of # common k-mers (k~3) – O(N 2 L logL) time 2.Build tree T DRAFT based on those distances, with UPGMA 3.Progressive alignment over T DRAFT, resulting in multiple alignment M DRAFT Only perform alignment steps for the parts of the tree that have changed 4.Measure new Kimura-based distances D(x, y) based on M DRAFT 5.Build tree T based on D 6.Progressive alignment over T, to build M 7.Iterative refinement; for many rounds, do: Tree Partitioning: Split M on one branch and realign the two resulting profiles If new alignment M’ has better sum-of-pairs score than previous one, accept

CS262 Lecture 9, Win07, Batzoglou PROBCONS at a glance 1.Computation of all posterior matrices M xy : M xy (i, j) = Prob(x i ~ y j ), using a HMM 2.Re-estimation of posterior matrices M’ xy with probabilistic consistency M’ xy (i, j) = 1/N  sequence z  k M xz (i, k)  M yz (j, k);M’ xy = Avg z (M xz M zy ) 3.Compute for every pair x, y, the maximum expected accuracy alignment A xy : alignment that maximizes  aligned (i, j) in A M’ xy (i, j) Define E(x, y) =  aligned (i, j) in Axy M’ xy (i, j) 4.Build tree T with hierarchical clustering using similarity measure E(x, y) 5.Progressive alignment on T to maximize E(.,.) 6.Iterative refinement; for many rounds, do: Randomized Partitioning: Split sequences in M in two subsets by flipping a coin for each sequence and realign the two resulting profiles

CS262 Lecture 9, Win07, Batzoglou Some Resources Genome Resources Annotation and alignment genome browser at UCSC Specialized VISTA alignment browser at LBNL ABC—Nice Stanford tool for browsing alignments Protein Multiple Aligners CLUSTALW – most widely used MUSCLE – most scalable PROBCONS – most accurate

CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 9, Win07, Batzoglou

Motivation Genomic sequences are very long:  Human genome = 3 x 10 9 –long  Mouse genome = 2.7 x 10 9 –long Aligning genomic regions is useful for revealing common gene structure  It is useful to compare regions > 1,000,000-long

CS262 Lecture 9, Win07, Batzoglou The UCSC Browser

CS262 Lecture 9, Win07, Batzoglou Main Idea Genomic regions of interest contain islands of similarity, such as genes 1.Find local alignments 2.Chain an optimal subset of them 3.Refine/complete the alignment Systems that use this idea to various degrees: MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, & others

CS262 Lecture 9, Win07, Batzoglou Saving cells in DP 1.Find local alignments 2.Chain -O(NlogN) L.I.S. 3.Restricted DP

CS262 Lecture 9, Win07, Batzoglou Methods to CHAIN Local Alignments Sparse Dynamic Programming O(N log N)

CS262 Lecture 9, Win07, Batzoglou The Problem: Find a Chain of Local Alignments (x,y)  (x’,y’) requires x < x’ y < y’ Each local alignment has a weight FIND the chain with highest total weight

CS262 Lecture 9, Win07, Batzoglou Quadratic Time Solution Build Directed Acyclic Graph (DAG):  Nodes: local alignments [(x a,x b )  (y a,y b )] & score  Directed edges: local alignments that can be chained edge ( (x a, x b, y a, y b ), (x c, x d, y c, y d ) ) x a < x b < x c < x d y a < y b < y c < y d Each local alignment is a node v i with alignment score s i

CS262 Lecture 9, Win07, Batzoglou Quadratic Time Solution Initialization: Find each node v a s.t. there is no edge (u, v a ) Set score of V(a) to be s a Iteration: For each v i, optimal path ending in v i has total score: V(i) = ma x j s.t. there is edge (v j, v i ) ( weight(v j, v i ) + V(j) ) Termination: Optimal global chain: j = argmax ( V(j) ); trace chain from v j Worst case time: quadratic

CS262 Lecture 9, Win07, Batzoglou Sparse Dynamic Programming Back to the LCS problem: Given two sequences  x = x 1, …, x m  y = y 1, …, y n Find the longest common subsequence  Quadratic solution with DP How about when “hits” x i = y j are sparse?

CS262 Lecture 9, Win07, Batzoglou Sparse Dynamic Programming Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead

CS262 Lecture 9, Win07, Batzoglou Sparse Dynamic Programming – L.I.S. Longest Increasing Subsequence Given a sequence over an ordered alphabet  x = x 1, …, x m Find a subsequence  s = s 1, …, s k  s 1 < s 2 < … < s k

CS262 Lecture 9, Win07, Batzoglou Sparse Dynamic Programming – L.I.S. Let input be w: w 1,…, w n INITIALIZATION: L:last LIS elt. array L[0] = -inf L[1] = w 1 L[2…n] = +inf B:array holding LIS elts; B[0] = 0 P:array of backpointers // L[j]: smallest j th element w i of j-long LIS seen so far ALGORITHM for i = 2 to n { Find j such that L[j – 1] < w[i] ≤ L[j] L[j]  w[i] B[j]  i P[i]  B[j – 1] } That’s it!!! Running time?

CS262 Lecture 9, Win07, Batzoglou Sparse LCS expressed as LIS Create a sequence w Every matching point (i, j), is inserted into w as follows: For each column j = 1…m, insert in w the points (i, j), in decreasing row i order The 11 example points are inserted in the order given a = (y, x), b = (y’, x’) can be chained iff  a is before b in w, and  y < y’ x y

CS262 Lecture 9, Win07, Batzoglou Sparse LCS expressed as LIS Create a sequence w w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10) Consider now w’s elements as ordered lexicographically, where (y, x) < (y’, x’) if y < y’ Claim: An increasing subsequence of w is a common subsequence of x and y x y

CS262 Lecture 9, Win07, Batzoglou Sparse Dynamic Programming for LIS Example: w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10) L = [L1] [L2] [L3] [L4] [L5] … 1.(4,2) 2.(3,3) 3.(3,3) (10,5) 4.(2,5) (10,5) 5.(2,5) (8,6) 6.(1,6) (8,6) 7.(1,6) (3,7) 8.(1,6) (3,7) (4,8) 9.(1,6) (3,7) (4,8) (7,9) 10.(1,6) (3,7) (4,8) (5,9) 11.(1,6) (3,7) (4,8) (5,9) (9,10) Longest common subsequence: s = 4, 24, 3, 11, x y

CS262 Lecture 9, Win07, Batzoglou Sparse DP for rectangle chaining 1,…, N: rectangles (h j, l j ): y-coordinates of rectangle j w(j):weight of rectangle j V(j): optimal score of chain ending in j L: list of triplets (l j, V(j), j)  L is sorted by l j : smallest (North) to largest (South) value  L is implemented as a balanced binary tree y h l

CS262 Lecture 9, Win07, Batzoglou Sparse DP for rectangle chaining Main idea: Sweep through x- coordinates To the right of b, anything chainable to a is chainable to b Therefore, if V(b) > V(a), rectangle a is “useless” for subsequent chaining In L, keep rectangles j sorted with increasing l j - coordinates  sorted with increasing V(j) score V(b) V(a)

CS262 Lecture 9, Win07, Batzoglou Sparse DP for rectangle chaining Go through rectangle x-coordinates, from lowest to highest: 1.When on the leftmost end of rectangle i: a.j: rectangle in L, with largest l j < h i b.V(i) = w(i) + V(j) 2.When on the rightmost end of i: a.k: rectangle in L, with largest l k  l i b.If V(i) > V(k): i.INSERT (l i, V(i), i) in L ii.REMOVE all (l j, V(j), j) with V(j)  V(i) & l j  l i i j k Is k ever removed?

CS262 Lecture 9, Win07, Batzoglou Example x y a: 5 c: 3 b: 6 d: 4 e: When on the leftmost end of rectangle i: a.j: rectangle in L, with largest l j < h i b.V(i) = w(i) + V(j) 2.When on the rightmost end of i: a.k: rectangle in L, with largest l k  l i b.If V(i) > V(k): i.INSERT (l i, V(i), i) in L ii.REMOVE all (l j, V(j), j) with V(j)  V(i) & l j  l i abcde V 5 L lili V(i) i 5 5 a c b d

CS262 Lecture 9, Win07, Batzoglou Time Analysis 1.Sorting the x-coords takes O(N log N) 2.Going through x-coords: N steps 3.Each of N steps requires O(log N) time: Searching L takes log N Inserting to L takes log N All deletions are consecutive, so log N per deletion Each element is deleted at most once: N log N for all deletions Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree

CS262 Lecture 9, Win07, Batzoglou Examples Human Genome Browser ABC