Pair-wise Sequence Alignment
What happened to the sequences of similar genes? Random mutation, deletion, insertion.
Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG
            EVI   ++ P++ ++DV+SY
Seq. 2: 451 EVI---EHKPYNHKADVFSYA
Homology vs. similarity
What is pair-wise sequence alignment?
Why pair-wise alignment?

Some concepts
Optimal alignment
Global alignment
Local alignment
Gaps
Gap penalty
Substitution matrix

Dotplot
A simplified representation
What a dotplot shows
What a dotplot does not show

Sequence Alignment
Dynamic programming: a method for some optimization problems
  determine a scoring scheme
  find the best solution based on that scoring scheme
Total number of possible alignments of two sequences of length n: $\binom{2n}{n} \approx 2^{2n} / \sqrt{\pi n}$
Needleman-Wunsch: global alignment

Questions How does it work? How to come up with a DP approach to an exponential problem? How to implement a DP approach?

Dynamic Programming Algorithm
Break a problem into subproblems
Solve each subproblem separately
\[
F(i,j) = \max \begin{cases}
F(i-1,j-1) + s(x_i, y_j) \\
F(i,j-1) + g \\
F(i-1,j) + g
\end{cases}
\]
$s(x_i, y_j)$: substitution score for aligning $x_i$ with $y_j$
$g$: gap penalty
$F(i,j)$: the maximum score for aligning the first $i$ symbols of sequence 1 with the first $j$ symbols of sequence 2

Example
Sequences: ACTCG and ACAGTAG
Scores: match 1, mismatch 0, gap -1
Steps: initialization, matrix filling (scoring), trace back

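Filled matrix F(i,j) for the example (rows i = 0..5 follow ACTCG, columns j = 0..7 follow ACAGTAG):
             A   C   A   G   T   A   G
         0  -1  -2  -3  -4  -5  -6  -7
    A   -1   1   0  -1  -2  -3  -4  -5
    C   -2   0   2   1   0  -1  -2  -3
    T   -3  -1   1   2   1   1   0  -1
    C   -4  -2   0   1   2   1   1   0
    G   -5  -3  -1   0   2   2   1   2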
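
The three steps of the example (initialization, matrix filling, trace back) can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `needleman_wunsch` and its parameter names are assumptions, and the traceback returns only one of the possibly many optimal alignments.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, gap=-1):
    """Global alignment with a linear gap penalty (illustrative sketch)."""
    m, n = len(x), len(y)
    # Initialization: first row and column hold cumulative gap penalties.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    # Matrix filling: F(i,j) = max(diagonal + s, left + g, up + g).
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i][j - 1] + gap, F[i - 1][j] + gap)
    # Trace back from (m, n) to (0, 0), following any move that explains F(i,j).
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, aligned_x, aligned_y = needleman_wunsch("ACTCG", "ACAGTAG")
print(F[-1][-1])  # optimal global score, the bottom-right cell
print(aligned_x)
print(aligned_y)
```

The bottom-right cell holds the optimal score; the ties encountered during traceback are what give rise to multiple optimal alignments.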

Local Alignment: Smith-Waterman
Biological significance
\[
F(i,j) = \max \begin{cases}
0 \\
F(i-1,j-1) + s(x_i, y_j) \\
F(i,j-1) + g \\
F(i-1,j) + g
\end{cases}
\]
$O(n^2)$ time

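Local Alignment
Matrix for GCGATATA (rows) against AACCTATAGCT (columns):
           A  A  C  C  T  A  T  A  G  C  T
        0  0  0  0  0  0  0  0  0  0  0  0
    G   0  0  0  0  0  0  0  0  0  1  0  0
    C   0  0  0  1  1  0  0  0  0  0  2  1
    G   0  0  0  0  0  0  0  0  0  1  0  1
    A   0  1  1  0  0  0  1  0  1  0  0  0
    T   0  0  0  0  0  1  0  2  1  0  0  1
    A   0  1  1  0  0  0  2  0  3  2  1  0
    T   0  0  0  0  0  1  1  3  2  2  1  2
    A   0  1  1  0  0  0  2  2  4  3  2  1
Best local alignment (largest entry, 4):
    AACCTATAGCT
        ||||
    GCGATATA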
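
A minimal Smith-Waterman sketch for the two sequences above. The scoring values (match +1, mismatch -1, gap -1) are an assumption, since the slides do not state the scheme used for this matrix; the function name `smith_waterman` is likewise illustrative. The only changes from the global version are the extra 0 in the maximum, starting the traceback at the largest cell, and stopping when a 0 is reached.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-1):
    """Local alignment with a linear gap penalty (illustrative sketch)."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay at 0: a local alignment may start anywhere.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i][j - 1] + gap, F[i - 1][j] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the largest cell until a cell with score 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


score, local_x, local_y = smith_waterman("AACCTATAGCT", "GCGATATA")
print(score, local_x, local_y)
```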

Issues in alignment
Different ways to fill the table
Multiple optimal alignments
$s(x_i, y_j)$: taken from a substitution matrix
Gap penalty:
  linear: $w(k) = gk$
  affine: $w(k) = h + gk$ for $k \ge 1$, and $w(0) = 0$

Gap models
New gap vs. gap extension
A gap of length k vs. k gaps of length 1
1 insertion / deletion event vs. k events
Gap penalty:
  linear: $w(k) = gk$
  affine: $w(k) = h + gk$ for $k \ge 1$, and $w(0) = 0$
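
A small sketch of the two penalty functions, to make the "one gap of length k vs. k gaps of length 1" comparison concrete. The values h = -3 and g = -1 are arbitrary illustrative choices, not taken from the slides, and the function names are assumptions.

```python
def linear_gap(k, g=-1):
    """Linear model: w(k) = g * k."""
    return g * k


def affine_gap(k, h=-3, g=-1):
    """Affine model: w(k) = h + g * k for k >= 1, and w(0) = 0."""
    return 0 if k == 0 else h + g * k


k = 3
# Linear: one gap of length k costs the same as k gaps of length 1.
print(linear_gap(k), k * linear_gap(1))
# Affine: one long gap (one event) is cheaper than k separate gaps (k events),
# because the opening term h is paid only once.
print(affine_gap(k), k * affine_gap(1))
```

Under the affine model a single insertion or deletion event is penalized less than several independent ones, which is the motivation given on the slide.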

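Affine Gap Penalty
Aligning the first i symbols of x with the first j symbols of y
What is wrong with the F(i,j) formula if an affine gap penalty (AGP) is used? A single F(i,j) does not record whether the previous column ended in a gap, so it cannot tell opening a new gap (cost h + g) from extending an existing one (cost g).
Three matrices:
  $M(i, j)$: best score when $x_i$ is aligned with $y_j$
  $I_x(i, j)$: best score when $x_i$ is aligned with a gap
  $I_y(i, j)$: best score when $y_j$ is aligned with a gap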

DP for global alignment for AGP
\[
M(i,j) = \max \begin{cases}
M(i-1,j-1) + s(x_i, y_j) \\
I_x(i-1,j-1) + s(x_i, y_j) \\
I_y(i-1,j-1) + s(x_i, y_j)
\end{cases}
\]
\[
I_x(i,j) = \max \begin{cases}
M(i-1,j) + h + g \\
I_y(i-1,j) + h + g \\
I_x(i-1,j) + g
\end{cases}
\]
\[
I_y(i,j) = \max \begin{cases}
M(i,j-1) + h + g \\
I_x(i,j-1) + h + g \\
I_y(i,j-1) + g
\end{cases}
\]

DP for global alignment using AGP
Initialization:
  $M(0, 0) = 0$
  $I_x(i, 0) = h + gi$
  $I_y(0, j) = h + gj$
  all other cases: $-\infty$
Traceback: start at the largest of $M(m, n)$, $I_x(m, n)$, $I_y(m, n)$ and trace back to $(0, 0)$
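
The three-matrix recurrences and the initialization above can be sketched as follows. This computes the optimal score only (no traceback); the function name, the simple match/mismatch scoring, and the default values h = -3, g = -1 are illustrative assumptions. Penalties are written as negative numbers because the recurrences add them.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=2, mismatch=-3, h=-3, g=-1):
    """Optimal global alignment score with gap penalty w(k) = h + g*k (sketch)."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    # Initialization as on the slide; every other boundary cell stays at -infinity.
    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = h + g * i
    for j in range(1, n + 1):
        Iy[0][j] = h + g * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # x_i aligned with y_j
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # x_i aligned with a gap: open (h + g) or extend (g)
            Ix[i][j] = max(M[i - 1][j] + h + g, Iy[i - 1][j] + h + g, Ix[i - 1][j] + g)
            # y_j aligned with a gap: open (h + g) or extend (g)
            Iy[i][j] = max(M[i][j - 1] + h + g, Ix[i][j - 1] + h + g, Iy[i][j - 1] + g)
    # The optimal score is the largest of the three matrices at (m, n).
    return max(M[m][n], Ix[m][n], Iy[m][n])


# Six matches (6 * 2) and one gap of length 2 (h + 2g = -5).
print(affine_global_score("AAATTT", "AAAGGTTT"))
```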

DP for local alignment for AGP
\[
M(i,j) = \max \begin{cases}
M(i-1,j-1) + s(x_i, y_j) \\
I_x(i-1,j-1) + s(x_i, y_j) \\
I_y(i-1,j-1) + s(x_i, y_j) \\
0
\end{cases}
\]
\[
I_x(i,j) = \max \begin{cases}
M(i-1,j) + h + g \\
I_y(i-1,j) + h + g \quad \text{(ignored)} \\
I_x(i-1,j) + g
\end{cases}
\]
\[
I_y(i,j) = \max \begin{cases}
M(i,j-1) + h + g \\
I_x(i,j-1) + h + g \quad \text{(ignored)} \\
I_y(i,j-1) + g
\end{cases}
\]

DP for Local Alignment for AGP
Initialization:
  $M(0, 0) = 0$
  $I_x(i, 0) = 0$
  $I_y(0, j) = 0$
  all other cases: $-\infty$
Traceback: start at the largest of $M(i, j)$, $I_x(i, j)$, $I_y(i, j)$ and trace back until $M(i, j) = 0$

Database searching methods
Need more efficient methods
Dynamic programming: $O(n^2 L)$, where L is the size of the database
Why is DP slow?
Ideas:
  Regions that are similar are likely to share short identical subsequences
  Search quickly for such regions, then check them carefully locally

FASTA related methods
Heuristic approximation
Word, word size (2, 6), sensitivity vs. speed
1. Quick initial "guess": common subsequences
   Which words in the query also occur in the target?
   Pre-computed table that stores the locations of words ("hashing")
   An example

FASTA related methods
2. Find the region with a high population of common words
   Process diagonals, rescore, join regions using gaps
3. Local alignment (DP) in the region identified
   Use the Smith-Waterman method in a band, 32 aa wide, around the best score
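
The first two FASTA steps (word lookup table, then counting hits per diagonal) can be sketched as below. This is only an illustration of the idea, not the FASTA program itself: the function names are assumptions, the rescoring and joining steps are omitted, and the two sequences are the pair from the opening slide with the gap characters removed from the second one.

```python
from collections import Counter, defaultdict


def word_table(seq, k):
    """Map every word of length k in seq to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_hits(query, target, k=2):
    """Count identical-word hits on each diagonal d = (query pos) - (target pos)."""
    table = word_table(query, k)
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], ()):
            hits[i - j] += 1
    return hits


query = "EVIRMQDNNPFSFQSDVYSYG"
target = "EVIEHKPYNHKADVFSYA"
hits = diagonal_hits(query, target, k=2)
for diagonal, count in sorted(hits.items()):
    print(diagonal, count)
```

Diagonals with many hits mark candidate regions; hits on two nearby diagonals (here 0 and 3) suggest a region that can be joined with a gap before the banded DP step.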

Limitation of FASTA
Speed vs. sensitivity
The initial step relies on identical words
Can miss biologically significant similarity:
  some proteins do not share identical a.a.
  different codons can encode the same protein

BLAST
Previous two kinds of approaches:
  theoretically sound
  search for common subsequences
1. Word list
   Incorporate a similarity measurement for words: PAM120, e.g. ACDE
   Scan for word occurrences: hash table, finite state machine
(Stephen F. Altschul et al, Nucleic Acids Research 1997, 25(17) 3389-3402)
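
A toy sketch of the word-list idea: instead of requiring identical words, collect every word whose similarity score against the query word reaches a threshold. Everything here is an assumption made for illustration: the four-letter alphabet, the scoring function `toy_score` (a stand-in for a real matrix such as PAM120), the threshold, and the function names.

```python
from itertools import product


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix entry."""
    return 2 if a == b else -1


def neighborhood(word, alphabet, score, threshold):
    """All words of the same length whose summed pair score with word is >= threshold."""
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            result.append("".join(candidate))
    return result


words = neighborhood("ACDE", "ACDE", toy_score, threshold=5)
print(len(words), words[:5])
```

With this toy scoring the list contains the word itself plus every word that differs from it in exactly one position; with a real substitution matrix, conservative substitutions would be admitted and unlikely ones excluded.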

BLAST
2. Extend words to HSPs (high-scoring segment pairs; locally optimal pairs)
   Find additional words within threshold
   Merge within distance A
3. Select significant HSPs, use DP in a banded region

Mini Presentations
1. Previous BLAST
2. Major concepts in BLAST
3. Statistical issue
4. Gapped local alignment - Gapped
5. Position-specific scoring matrix (PSSM): overall idea, architecture, multiple-alignment construction
6. PSSM: target frequency estimation, application to BLAST
(Stephen F. Altschul et al, Nucleic Acids Research 1997, 25(17) 3389-3402)

Multiple Sequence Alignment
Motivation
What is MSA?
How do we extend knowledge of pair-wise alignment?
An example: AGAC, AC, AG
Pair-wise alignments:
    AGAC    AGAC    AC
    --AC    AG--    AG
Some possibilities:
    AG--
    --AC
    AGAC
Fix pair-wise alignment and then add?
Evaluate all the possible alignments of N sequences?

Sum of pairs (SP) scoring methods
Scoring MSA
Given an alignment of N sequences, each of length L, in the L x N alignment: take the pair-wise sum for each column, then sum over all columns
Example (c(match) = 1, c(mismatch) = -1, c(gap) = -2, c(gap, gap) = 0):
    AQPILLLV
    ALR-LL--
    AK-ILLL-
    CPPVLILV
    SP_4 = SP(I, -, I, V) = -2 + 1 - 1 - 2 - 2 - 1 = -7
    SP = SP_1 + SP_2 + ... + SP_8
SP tends to overweight a single mutation: SP(A, A, A, C) = 0, SP(A, A, A, A) = 6
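
The column-by-column computation can be sketched as follows, using the costs and the four-sequence alignment from the example above. The function names are illustrative assumptions.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Cost of one aligned pair of symbols; a gap against a gap costs 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum of pair scores over all unordered pairs in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(rows):
    """SP score of a whole alignment: sum of the column scores."""
    return sum(sp_column(column) for column in zip(*rows))


alignment = ["AQPILLLV", "ALR-LL--", "AK-ILLL-", "CPPVLILV"]
print(sp_column("I-IV"))   # the fourth column of the example
print(sp_column("AAAC"), sp_column("AAAA"))
print(sp_score(alignment))
```

The last two column scores show the overweighting noted on the slide: changing one of four residues moves the column score from 6 to 0, because the mutated residue is counted once against every other sequence.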

DP of N dimensions using SP
Extension of DP for N sequences
Extend F(i,j) to N dimensions
Time: on the order of $L^N (2^N - 1) N^2 \sim O((2L)^N N^2)$

STAR method
DP provides an optimal solution but is costly
Heuristic methods: STAR, CLUSTALW, ...
Progressive alignment
STAR:
  - pair-wise alignment
  - build a similarity matrix
  - find a "star" sequence
  - use the "star" to align the other sequences
  - once a gap, always a gap

STAR method Example

CLUSTAL family
What are the disadvantages of the STAR method?
Build a similarity tree ("clustering")
Alignment starts at the most similar sequences
1. Pair-wise alignment --> distance matrix
   Fast approximate approach or DP

CLUSTALW
2. Construct a similarity tree, "the guide tree"
   UPGMA (unweighted pair-group method using arithmetic average)
3. Progressive alignment
   Start with the most similar sequences
   Align group with group using pair-wise alignment
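
A minimal sketch of building a guide tree with UPGMA from a pair-wise distance matrix. The four sequence names and their distances are hypothetical values chosen for illustration, and the function name `upgma` is an assumption; the tree is returned as nested tuples and branch lengths are not tracked.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster names by UPGMA; dist maps (a, b) pairs to distances (sketch)."""
    sizes = {name: 1 for name in names}            # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Merge the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted arithmetic average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Hypothetical distances: A and B are close, C and D are close.
distances = {
    ("A", "B"): 2, ("C", "D"): 4,
    ("A", "C"): 10, ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 10,
}
print(upgma(["A", "B", "C", "D"], distances))
```

Reading the nested tuples from the innermost pair outward gives the order in which a progressive aligner would combine sequences and groups.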
