Pairwise Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 4, 2004 ChengXiang Zhai Department of Computer Science University.


1 Pairwise Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 4, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

2 Outline
Variations of the basic global/local alignment algorithms
Basic information theory concepts
Significance of alignment scores (to be continued in the next class)

3 Dynamic Programming Equations
Global alignment: F(0,0)-F(n,m)
Local alignment: 0-F(i,j)
We can vary both the model and the alignment strategies
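The equations on this slide did not survive in the transcript. As a minimal sketch of the two standard recurrences (global and local alignment with a linear gap penalty d), assuming a simple match/mismatch score for s rather than a real substitution matrix; the function name and default values are illustrative only:

```python
def align_score(x, y, match=1, mismatch=-1, d=2, local=False):
    """Best alignment score of x and y with a linear gap penalty d.

    local=False: global alignment, path runs from F(0,0) to F(n,m).
    local=True:  local alignment, path runs from a 0 cell to the best F(i,j).
    The match/mismatch scores stand in for s(x_i, y_j) (an assumption).
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, n + 1):
            F[i][0] = -i * d
        for j in range(1, m + 1):
            F[0][j] = -j * d
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align x_i with y_j
                          F[i - 1][j] - d,      # x_i aligned to a gap
                          F[i][j - 1] - d)      # y_j aligned to a gap
            if local:
                F[i][j] = max(F[i][j], 0)       # option to start a new alignment
                best = max(best, F[i][j])
    return best if local else F[n][m]
```

Only the initial values, the extra 0 option, and the cell where the score is read off differ between the two modes, which is the point the slide makes about varying the alignment strategy.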

4 In general, we can vary
Initial values
Recursive functions
Start and end of paths
The model: s, d

5 Variation I: Repeated Matches
Find non-overlapping copies of sections of Y in X:
X= HEAGAWGHEE
Y= HEA. AW –HE.
[Figure labels: unmatched regions, matched regions]
Alignment: (0,0)-(n+1,0)

6 Variation II: Overlap Matches
X is contained in Y; “overhanging ends” are not penalized
Alignment: (0,0)-maximum{(i,m), (n,j)}
Ignore overhanging prefix
Ignore overhanging suffix
[Figure label: matched regions]
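A minimal sketch of the overlap variation, assuming a match/mismatch score and a linear gap penalty d (neither value comes from the slide). The top row and left column are initialised to 0 so an overhanging prefix is free, and the score is read from the best cell in the last row or last column so an overhanging suffix is free:

```python
def overlap_score(x, y, match=1, mismatch=-1, d=2):
    """Best overlap-match score: overhanging ends of x or y are not penalised."""
    n, m = len(x), len(y)
    # F(i,0) = F(0,j) = 0: skipping a prefix of either sequence costs nothing.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # The path may end anywhere on the bottom row (n, j) or right column (i, m).
    last_column = max(F[i][m] for i in range(n + 1))
    last_row = max(F[n][j] for j in range(m + 1))
    return max(last_column, last_row)
```

For example, `overlap_score("GGGACGT", "ACGTCCC")` scores the shared `ACGT` where the end of one sequence meets the start of the other, and `overlap_score("ACGT", "TTACGTTT")` scores the case where one sequence is contained in the other.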

7 Variation III: A general gap model
Alignment: F(0,0)-F(n,m)
Gap-open penalty
Gap-extension penalty
This can be more easily described as a Finite State Automaton (FSA) with 3 states: Match, Insertion in X, Insertion in Y
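One common way to realise the three-state description is the affine-gap recurrence with one score table per state. The sketch below assumes gap-open penalty d, gap-extension penalty e, and a match/mismatch score; the table names M, Ix, Iy and all default values are illustrative, not taken from the slide:

```python
NEG_INF = float("-inf")

def affine_score(x, y, match=1, mismatch=-1, d=3, e=1):
    """Global alignment score with affine gaps: a gap of length k costs d + (k-1)*e."""
    n, m = len(x), len(y)
    # One table per state: M (x_i aligned to y_j), Ix (x_i aligned to a gap),
    # Iy (y_j aligned to a gap).
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Enter the Match state from any of the three states.
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # Open a gap from Match (cost d) or extend an existing one (cost e).
            # This simplified form has no direct Ix <-> Iy transition.
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)
    return max(M[n][m], Ix[n][m], Iy[n][m])
```

With the defaults, `affine_score("AAACCC", "AAATTCCC")` charges the two-residue gap once for opening and once for extending, rather than twice the full gap penalty.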

8 Heuristic alignment algorithms
Motivation: Complexity of alignment algorithms is O(nm)
–Current protein DB: 100 million base pairs
–Imagine matching each sequence with a 1,000 base pair query
–Takes about 3 hours!
Heuristic algorithms aim at speeding up the search at the price of possibly missing the best-scoring alignment
Two well-known programs
–BLAST: Basic Local Alignment Search Tool
–FASTA: Fast Alignment
–Both find high-scoring local alignments between a query sequence and a target database
–Basic idea: first locate high-scoring short stretches and then extend them
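The "about 3 hours" figure can be reproduced with simple arithmetic. The sketch below assumes a throughput of ten million dynamic programming cells per second; that rate is an assumption chosen to match the slide's estimate and is not stated in the slide itself:

```python
def full_dp_cost(db_size=100_000_000, query_len=1_000, cells_per_second=10_000_000):
    """Return (number of DP cells, estimated hours) for an O(nm) database scan.

    cells_per_second is an ASSUMED throughput, not a figure from the slide.
    """
    cells = db_size * query_len              # one cell per (database, query) position pair
    hours = cells / cells_per_second / 3600  # seconds converted to hours
    return cells, hours

cells, hours = full_dp_cost()
print(f"{cells:.0e} cells, roughly {hours:.1f} hours")
```

Under that assumption the scan fills 10^11 cells and takes a little under three hours per query, which is the motivation for the heuristic methods.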

9 BLAST (Basic Local Alignment Search Tool)
Three steps
–Compiling a list of high-scoring “words” of fixed length
–Scanning database to find occurrences of these words
–Extend each word occurrence
Basic BLAST only finds ungapped alignments; newer versions can find gapped alignments (PSI-BLAST)
Visit BLAST (need some help!)
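The three steps can be sketched as a toy seed-and-extend search. This is a simplified illustration, not BLAST itself: it uses only exact words from the query (real BLAST also scores similar "neighbourhood" words), a match/mismatch score, and a drop-off rule `x_drop` for stopping the extension. All names and defaults are hypothetical:

```python
from collections import defaultdict

def _extend(q, t, i, j, step, match, mismatch, x_drop):
    """Extend without gaps from (i, j) in direction step; return (best gain, its length)."""
    run = best = best_len = length = 0
    while 0 <= i < len(q) and 0 <= j < len(t):
        run += match if q[i] == t[j] else mismatch
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > x_drop:   # score fell too far below the best seen so far
            break
        i += step
        j += step
    return best, best_len

def seed_and_extend(query, target, w=3, match=1, mismatch=-1, x_drop=2):
    """Return (score, query_start, target_start, length) of the best ungapped hit, or None."""
    # Step 1: compile the list of words (here: exact words of length w in the query).
    words = defaultdict(list)
    for i in range(len(query) - w + 1):
        words[query[i:i + w]].append(i)
    best = None
    # Step 2: scan the target for occurrences of those words.
    for j in range(len(target) - w + 1):
        for i in words.get(target[j:j + w], ()):
            # Step 3: extend each word occurrence in both directions.
            right, r_len = _extend(query, target, i + w, j + w, 1, match, mismatch, x_drop)
            left, l_len = _extend(query, target, i - 1, j - 1, -1, match, mismatch, x_drop)
            hit = (w * match + left + right, i - l_len, j - l_len, l_len + w + r_len)
            if best is None or hit[0] > best[0]:
                best = hit
    return best
```

Calling `seed_and_extend("ACGTACGT", "TTTACGTACGTTTT", w=4)` finds the full eight-letter match starting at position 3 of the target, seeded by any of the four-letter words inside it.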

10 FASTA (Fast Alignment)
Quite similar to BLAST
Multi-step procedure
–Locate all identically matching words of a fixed length (1-2 for proteins, 4-6 for DNA)
–Look for diagonals with many mutually supporting word matches
–The best diagonals are selected as “seeds” for extension
–Extend a seed word to find maximal scoring ungapped regions (possibly joining several seeds)
–Check to see if adjacent ungapped matches can be joined by a gapped region, allowing for gap costs
–Finally, the full dynamic programming algorithm is run on the regions of best matching alignments
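The first two steps of the procedure (word matching and diagonal selection) can be sketched as follows. This is an illustration of the idea only, with hypothetical names; it counts exact word matches per diagonal i - j and does not perform the later extension, joining, or dynamic programming stages:

```python
from collections import Counter, defaultdict

def best_diagonals(x, y, k=2, top=1):
    """Return the `top` diagonals (i - j) with the most exact k-word matches.

    Each result is a (diagonal, number_of_word_matches) pair.
    """
    # Index every k-word of x by its start position.
    index = defaultdict(list)
    for i in range(len(x) - k + 1):
        index[x[i:i + k]].append(i)
    # A word at position i in x and j in y lies on diagonal i - j;
    # many matches on one diagonal support the same ungapped alignment.
    diagonals = Counter()
    for j in range(len(y) - k + 1):
        for i in index.get(y[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals.most_common(top)
```

For `best_diagonals("TTTACGTACG", "ACGTACG", k=4)` the four word matches that lie on diagonal 3 outvote the single stray match on diagonal -1, so diagonal 3 would be the seed for extension.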

11 Significance of Scores
How do we assess the significance of an alignment score?
Two basic approaches
–The classical approach: Extreme value distribution
  Assume a null (random) model M0 for scores
  P(Score > s|M0, x, y)=?
–The Bayesian approach: Model comparison
  Assume two models for (x,y): random M0; aligned M1
  P(M1|x,y)/P(M0|x,y)=?
  Log posterior odds = log-odds score of the alignment + log prior odds
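The two questions on this slide can be written out as follows. This is a sketch under stated assumptions: the first line is the usual extreme value approximation for the best local alignment score of random sequences of lengths n and m, where K and \lambda are parameters of the scoring system that the slide does not give; the second line is Bayes' rule applied to the two models, which splits the log posterior odds into the log-odds score of the alignment and a prior term.

```latex
\begin{align}
P(S > s \mid M_0) &\approx 1 - \exp\!\left(-K\, m\, n\, e^{-\lambda s}\right) \\
\log \frac{P(M_1 \mid x, y)}{P(M_0 \mid x, y)}
  &= \underbrace{\log \frac{P(x, y \mid M_1)}{P(x, y \mid M_0)}}_{\text{log-odds score of the alignment}}
   + \underbrace{\log \frac{P(M_1)}{P(M_0)}}_{\text{prior}}
\end{align}
```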

