Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Slides:



Advertisements
Similar presentations
Multiple Sequence Alignment
Advertisements

BLAST Sequence alignment, E-value & Extreme value distribution.
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
CS262 Lecture 9, Win07, Batzoglou History of WGA 1982: -virus, 48,502 bp 1995: h-influenzae, 1 Mbp 2000: fly, 100 Mbp 2001 – present  human (3Gbp), mouse.
Sequence Similarity. The Viterbi algorithm for alignment Compute the following matrices (DP)  M(i, j):most likely alignment of x 1 …x i with y 1 …y j.
6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.
Genomic Sequence Alignment. Overview Dynamic programming & the Needleman-Wunsch algorithm Local alignment—BLAST Fast global alignment Multiple sequence.
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA exon intron intergene Find Gene Structures in DNA Intergene State First Exon State Intron State.
Methods to CHAIN Local Alignments Sparse Dynamic Programming O(N log N)
Heuristic alignment algorithms and cost matrices
CS262 Lecture 9, Win07, Batzoglou Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm.
Lecture 8: Multiple Sequence Alignment
CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
Introduction to Bioinformatics Algorithms Multiple Alignment.
CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments.
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
(Regulatory-) Motif Finding
CS262 Lecture 9, Win07, Batzoglou Multiple Sequence Alignments.
CS262 Lecture 9, Win07, Batzoglou Phylogeny Tree Reconstruction
Sequence Alignment.
Alignments and Comparative Genomics. Welcome to CS374! Today: Serafim: Alignments and Comparative Genomics Omkar: Administrivia.
Multiple Sequence Alignments. Lecture 12, Tuesday May 13, 2003 Reading Durbin’s book: Chapter Gusfield’s book: Chapter 14.1, 14.2, 14.5,
Aligning Alignments Soni Mukherjee 11/11/04. Pairwise Alignment Given two sequences, find their optimal alignment Score = (#matches) * m - (#mismatches)
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
CS262 Lecture 14, Win06, Batzoglou Multiple Sequence Alignments.
CS262 Lecture 9, Win07, Batzoglou Real-world protein aligners MUSCLE  High throughput  One of the best in accuracy ProbCons  High accuracy  Reasonable.
Bioinformatics Workshop, Fall 2003 Algorithms in Bioinformatics Lawrence D’Antonio Ramapo College of New Jersey.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Multiple sequence alignment methods 1 Corné Hoogendoorn Denis Miretskiy.
Phylogenetic Tree Construction and Related Problems Bioinformatics.
Introduction to Bioinformatics Algorithms Multiple Alignment.
CISC667, F05, Lec8, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms.
Sequence alignment, E-value & Extreme value distribution
Sequence comparison: Local alignment
Introduction to Bioinformatics Algorithms Multiple Alignment.
Developing Pairwise Sequence Alignment Algorithms
Multiple Sequence Alignments
Multiple Alignment Modified from Tolga Can’s lecture notes (METU)
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
Multiple Sequence Alignment. Definition Given N sequences x 1, x 2,…, x N :  Insert gaps (-) in each sequence x i, such that All sequences have the.
1 Generalized Tree Alignment: The Deferred Path Heuristic Stinus Lindgreen
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Lecture 6. Pairwise Local Alignment and Database Search Csc 487/687 Computing for bioinformatics.
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Chapter 3 Computational Molecular Biology Michael Smith
Introduction to Bioinformatics Algorithms Multiple Alignment Lecture 20.
Multiple Sequence Alignment
Construction of Substitution matrices
1 MAVID: Constrained Ancestral Alignment of Multiple Sequence Author: Nicholas Bray and Lior Pachter.
Multiple Sequence Alignment
CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Tools (BLAST) Multiple Sequence Alignment.
Multiple Sequence Alignments. The Global Alignment problem AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC x y z.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Multiple sequence alignment (msa)
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Sequence comparison: Local alignment
Sequence Alignment 11/24/2018.
Pairwise sequence Alignment.
Multiple Sequence Alignment
Intro to Alignment Algorithms: Global and Local
Sequence alignment, E-value & Extreme value distribution
Presentation transcript:

Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003

Rapid global alignment Genomic regions of interest contain ordered islands of similarity –E.g. genes 1.Find local alignments 2.Chain an optimal subset of them

Lecture 10, Thursday May 1, 2003 Suffix Trees Suffix trees are a method to find all maximal matches between two strings (and much more) Example: x = dabdac d ab d a c c a b d a c c c c a d b

Lecture 10, Thursday May 1, 2003 Application: Find all Matches Between x and y 1.Build suffix tree for x, mark nodes with x 2.Insert y in suffix tree, mark all nodes y “passes from” with y –The path label of every node marked both 0 and 1, is a common substring

Lecture 10, Thursday May 1, 2003 Example of Suffix Tree Construction for x, y 1 x = d a b d a $ y = a b a d a $ d ab d a $ 1. Construct tree for x a b d a $ 2 $ a d b 3 $ 4 $ 5 $ 6 x x x 6. Insert a $ Insert $ 4. Insert a d a $ d a $ 3 5. Insert d a $ y 4 2. Insert a b a d a $ a y d a $ 1 y y x 3. Insert b a d a $ a d y 2 a $ x

Lecture 10, Thursday May 1, 2003 Application: Online Search of Strings on a Database Say a database D = { s 1, s 2, …s n } (eg. proteins) Question: given new string x, find all matches of x to database 1.Build suffix tree for {s 1,…, s n } 2.All new queries x take O( |x| ) time (somewhat like BLAST)

Lecture 10, Thursday May 1, 2003 Application: Common Substrings of k Strings Say we want to find the longest common substring of s 1, s 2, …s n 1.Build suffix tree for s 1,…, s n 2.All nodes labeled {s i1, …, s ik } represent a match between s i1, …, s ik

Methods to CHAIN Local Alignments Sparse Dynamic Programming O(N log N)

Lecture 10, Thursday May 1, 2003 The Problem: Find a Chain of Local Alignments (x,y)  (x’,y’) requires x < x’ y < y’ Each local alignment has a weight FIND the chain with highest total weight

Lecture 10, Thursday May 1, 2003 Quadratic Time Solution Build Directed Acyclic Graph (DAG): –Nodes: local alignments (x a,x b )  (y a,y b ) Directed Edges: any two local alignments that can be chained Each local alignment is a node v i with alignment score s i

Lecture 10, Thursday May 1, 2003 Quadratic Time Solution (cont’d) Dynamic programming: Initialization: Find each node v a s.t. there is no edge (u,v 0 ) Set score of V(a) to be s a Iteration: For each v i, optimal path ending in v i has total score: V(i) = max ( weight(v j, v i ) + V(j) ) Termination: Optimal global chain: j = argmax ( V(j) ); trace chain from v j Worst case time: quadratic

Lecture 10, Thursday May 1, 2003 Faster Solution: Sparse DP 1,…, N: rectangles (h j, l j ): y-coordinates of rectangle j w(j):weight of rectangle j V(j): optimal score of chain ending in j L: list of triplets (l j, V(j), j) L is sorted by l j y h l

Lecture 10, Thursday May 1, 2003 Sparse DP Go through rectangle x-coordinates, from lowest to highest: 1.When on the leftmost end of i: a.j: rectangle in L, with largest l j < h i b.V(i) = w(i) + V(j) 2.When on the rightmost end of i: a.j: rectangle in L, with largest l j  l i b.If V(i) > V(j): i.INSERT (l i, V(i), i) in L ii.REMOVE all (l k, V(k), k) with V(k)  V(i) & l k  l i x y

Lecture 10, Thursday May 1, 2003 Example x y 1: 5 3: 3 2: 6 4: 4 5:

Lecture 10, Thursday May 1, 2003 Time Analysis 1.Sorting the x-coords takes O(N log N) 2.Going through x-coords: N steps 3.Each of N steps requires O(log N) time: Searching L takes log N Inserting to L takes log N All deletions are consecutive, so log N

Lecture 10, Thursday May 1, 2003 Putting it All Together: Fast Global Alignment Algorithms 1.FIND local alignments 2.CHAIN local alignments FINDCHAIN GLASS: k-mers hierarchical DP MumMer:Suffix TreeSparse DP Avid:Suffix Treehierarchical DP LAGANCHAOSSparse DP

LAGAN: Pairwise Alignment 1.FIND local alignments 2.CHAIN local alignments 3.DP restricted around chain

Lecture 10, Thursday May 1, 2003 LAGAN 1.Find local alignments 2.Chain -O(NlogN) L.I.S. 3.Restricted DP

Lecture 10, Thursday May 1, 2003 LAGAN: recursive call What if a box is too large? –Recursive application of LAGAN, more sensitive word search

Lecture 10, Thursday May 1, LAGAN: memory “necks” have tiny tracebacks …only store tracebacks

Multiple Sequence Alignments

Lecture 10, Thursday May 1, 2003

Overview Definition Scoring Schemes Algorithms

Multiple Sequence Alignments Definition and Motivation

Lecture 10, Thursday May 1, 2003 Definition Given N sequences x 1, x 2,…, x N : –Insert gaps (-) in each sequence x i, such that All sequences have the same length L Score of the global map is maximum Pairwise alignment: a hypothesis on the evolutionary relationship between the letters of two sequences Same for a multiple alignment!

Lecture 10, Thursday May 1, 2003 Motivation A faint similarity between two sequences becomes very significant if present in many Protein domains Motifs responsible for gene regulation

Lecture 10, Thursday May 1, 2003 Another way to view multiple alignments Pairwise alignment: Good alignment  Evolutionary relationship Multiple alignment: Something in common  Sequence in common “inverse” to pairwise alignment

Multiple Sequence Alignments Scoring Function

Lecture 10, Thursday May 1, 2003 Scoring Function Ideally: –Find alignment that maximizes probability that sequences evolved from common ancestor x y z w v ?

Lecture 10, Thursday May 1, 2003 Scoring Function (cont’d) Unfortunately: too many parameters Compromises: –Ignore phylogenetic tree –Statistically independent columns: S(m) = G(m) +  i S(m i ) m: alignment matrix G: function penalizing gaps

Lecture 10, Thursday May 1, 2003 Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x:AC-GCGG-C y:AC-GC-GAG z:GCCGC-GAG Induces : x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

Lecture 10, Thursday May 1, 2003 Sum Of Pairs (cont’d) The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments S(m) =  k<l s(m k, m l ) s(m k, m l ):score of induced alignment (k,l)

Lecture 10, Thursday May 1, 2003 Sum Of Pairs (cont’d) Heuristic way to incorporate evolution tree: Human Mouse Chicken Weighted SOP: S(m) =  k<l w kl s(m k, m l ) w kl : weight decreasing with distance Duck

Lecture 10, Thursday May 1, 2003 Consensus -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Find optimal consensus string m * to maximize S(m) =  i s(m *, m i ) s(m k, m l ):score of pairwise alignment (k,l)

Multiple Sequence Alignments Algorithms

Lecture 10, Thursday May 1, 2003 Multidimensional Dynamic Programming Generalization of Needleman-Wunsh: S(m) =  i S(m i ) (sum of column scores) F(i 1,i 2,…,i N )= max (all neighbors of cube) (F(nbr)+S(nbr))

Lecture 10, Thursday May 1, 2003 Example: in 3D (three sequences): 7 neighbors/cell F(i,j,k) = max{ F(i-1,j-1,k-1)+S(x i, x j, x k ), F(i-1,j-1,k )+S(x i, x j, - ), F(i-1,j,k-1)+S(x i, -, x k ), F(i-1,j,k )+S(x i, -, - ), F(i,j-1,k-1)+S( -, x j, x k ), F(i,j-1,k )+S( -, x j, x k ), F(i,j,k-1)+S( -, -, x k ) } Multidimensional Dynamic Programming

Lecture 10, Thursday May 1, 2003 Multidimensional Dynamic Programming Running Time: 1.Size of matrix:L N ; Where L = length of each sequence N = number of sequences 2.Neighbors/cell: 2 N – 1 Therefore………………………… O(2 N L N )