Pairwise Sequence Alignment:

Outline
Summary of previous lesson.
Global alignment.
Local alignment.
BLAST.
FASTA.
Scoring matrices: PAM, BLOSUM.
Multiple alignment (if time allows).

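Global Alignment: Needleman-Wunsch (1970)
Idea: build up an optimal alignment from optimal alignments of subsequences.
Starting from the partial alignment HEAG / --P- with score -25, there are three ways to extend it by one column, each adding a score from the table:
Align the next top letter with the next bottom letter: HEAGA / --P-A, score -20.
Align the next top letter with a gap: HEAGA / --P--, score -33.
Align the next bottom letter with a gap: HEAG- / --P-A, score -33.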

Global Alignment Notation
xi: ith letter of string x.
yj: jth letter of string y.
x1..i: prefix of x consisting of letters 1 through i.
F: matrix of optimal scores. F(i,j) is the optimal score for aligning x1..i with y1..j.
d: gap penalty.
s: scoring matrix.

Global Alignment
The work is to build up F.
Initialize: F(0,0) = 0, F(i,0) = i*d, F(0,j) = j*d, where d is a negative number.
F(i,0) is the score of aligning x1..i to all gaps, and F(0,j) is the score of aligning y1..j to all gaps.
Fill the rest of the matrix from top left to bottom right using the recursive relation.

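Global Alignment
F(i,j) is computed from its three neighbors F(i-1,j-1), F(i-1,j) and F(i,j-1):
F(i,j) = max { F(i-1,j-1) + s(xi,yj), F(i-1,j) + d, F(i,j-1) + d }
F(i-1,j-1) + s(xi,yj): move ahead in both strings, aligning xi with yj.
F(i-1,j) + d: xi aligned to a gap.
F(i,j-1) + d: yj aligned to a gap.
x represents the top string, y the bottom string.
While building the table, keep track of where each optimal score came from by storing an arrow pointing back to the source cell.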

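Completed Table
Top string x = HEAGAWGHEE (columns), bottom string y = PAWHEAE (rows), d = -8.

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1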

Traceback
Trace the arrows back from the lower right to the top left of the completed table.
Diagonal: use one letter from both strings.
Up: one letter from the bottom string and a gap on top.
Left: one letter from the top string and a gap on the bottom.
Resulting alignment:
HEAGAWGHE-E
--P-AW-HEAE
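The fill and traceback steps can be sketched in Python as follows. This is a minimal illustration, not the slides' own implementation: the substitution scores are an assumption (a subset of BLOSUM50 restricted to the residues of the example, since the slides do not name the matrix used here), and ties are broken in the order diagonal, gap in the bottom string, gap in the top string.

```python
# Assumed substitution scores: BLOSUM50 entries for the residues in the example.
_SCORES = {
    ("A", "A"): 5, ("A", "E"): -1, ("A", "G"): 0, ("A", "H"): -2,
    ("A", "P"): -1, ("A", "W"): -3, ("E", "E"): 6, ("E", "G"): -3,
    ("E", "H"): 0, ("E", "P"): -1, ("E", "W"): -3, ("G", "G"): 8,
    ("G", "H"): -2, ("G", "P"): -2, ("G", "W"): -3, ("H", "H"): 10,
    ("H", "P"): -2, ("H", "W"): -3, ("P", "P"): 10, ("P", "W"): -4,
    ("W", "W"): 15,
}


def s(a, b):
    """Substitution score; each unordered pair is stored once."""
    key = (a, b) if (a, b) in _SCORES else (b, a)
    return _SCORES[key]


def needleman_wunsch(x, y, d=-8):
    """Return (score, aligned x, aligned y) for a global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # x1..i aligned to all gaps
        F[i][0] = i * d
    for j in range(1, m + 1):          # y1..j aligned to all gaps
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # xi with yj
                F[i - 1][j] + d,                          # xi with gap
                F[i][j - 1] + d,                          # yj with gap
            )
    # Traceback from the lower right cell, re-deriving each arrow.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top_row, bottom_row = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(score)
print(top_row)
print(bottom_row)
```

Under these assumed scores the sketch gives a score of 1 and the alignment HEAGAWGHE-E / --P-AW-HEAE, matching the traceback slide. A different tie-breaking order could return another alignment with the same score.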

Local Alignment Problem
First formulated: Smith and Waterman (1981).
Problem definition: find subsequences in S1 and S2 whose similarity is maximum over all pairs of subsequences from S1 and S2.

Motivation
Searching for unknown domains or motifs within proteins from different families. Example: proteins encoded by Homeobox genes are conserved in only one region, the Homeodomain (60 amino acids long, 50-95% alignment across certain insect and mammalian genes).
Identifying active sites of enzymes.
Comparing long stretches of anonymous DNA.
Querying databases where the query word is much smaller than the sequences in the database.

Repeat and Overlap Matches
Repeat matches allow sections of a sequence to match repeatedly, for example a repeated domain or motif.
Overlap matches apply when the two sequences x and y overlap. They do not penalize overhanging ends.

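Changes to DP algorithm
Interpretation of array values: S(i,j) = score of the best alignment of a suffix of G1(1..i) and a suffix of G2(1..j).
Recurrence relation: S(i,j) = max { 0, S(i-1,j-1) + s(G1(i), G2(j)), S(i-k,j) + d(k), S(i,j-k) + d(k) }, where k ranges over the possible gap lengths.
Empty substrings value: 0, the first option in the recurrence.
Restriction on scoring scheme: some scores must be negative (as with the mismatch and gap scores in the example), otherwise the 0 option is never taken.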

Changes to DP algorithm
Initialization of matrix: fill the first row and first column with 0's.
Traceback: find the maximum value of S(i,j), then follow the traceback pointers until you hit a cell with value 0.

Example
Let d = -2. Let s(a,b) = 1 if a = b, and -1 otherwise.
Alignment:
GT
GT
[Table: matrix V(i,j) for the example. Only the labels C, G, T, A and the values 1 and 2 survive.]
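The local alignment changes can be sketched as below, using the example's scores (d = -2, match 1, mismatch -1). The two input strings "CGTA" and "AGTC" are hypothetical, chosen only so that the best local alignment is GT against GT; the original example sequences are not recoverable. The sketch also assumes a linear gap penalty, which is a special case of the general d(k) in the recurrence.

```python
def smith_waterman(s1, s2, d=-2, match=1, mismatch=-1):
    """Return (best score, aligned part of s1, aligned part of s2)."""
    n, m = len(s1), len(s2)
    # First row and first column stay 0.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            S[i][j] = max(0, S[i - 1][j - 1] + sub, S[i - 1][j] + d, S[i][j - 1] + d)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Trace back from the maximum cell until a cell with value 0 is reached.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            a.append(s1[i - 1]); b.append(s2[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + d:
            a.append(s1[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


local_score, local_a, local_b = smith_waterman("CGTA", "AGTC")
print(local_score, local_a, local_b)
```

For these hypothetical inputs the best cell has value 2 and the traceback yields GT aligned with GT.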

Gap cost or penalty functions
Observation: one gap of length k is more probable than k gaps of length 1. A single gap might be caused by one mutational event, whereas separated gaps probably arose from different events.
Gap penalty functions:
Linear cost: treats both cases uniformly.
Affine cost: it is common to use a higher cost h for opening a gap and a lower cost g for extending a gap.
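A small sketch of the two penalty functions makes the difference concrete. The exact affine form is an assumption: here a gap of length k costs h for its first position and g for each further position. Other texts write it as h + g*k. The default values 12 and 2 are illustrative only.

```python
def linear_gap_cost(k, g=2):
    """Linear cost: every gap position costs the same."""
    return g * k


def affine_gap_cost(k, h=12, g=2):
    """Assumed affine form: h to open the gap, g for each extension."""
    if k == 0:
        return 0
    return h + g * (k - 1)


# One gap of length 3 versus three separate gaps of length 1.
print(linear_gap_cost(3), 3 * linear_gap_cost(1))
print(affine_gap_cost(3), 3 * affine_gap_cost(1))
```

With the linear cost both cases cost the same, while the affine cost makes the single long gap cheaper than three separate gaps, in line with the observation above.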

Pairwise Alignment on the Web
The ALIGN global alignment program is available at several servers.
LFASTA uses FASTA for local alignment of 2 sequences.
BLAST 2 Sequences (NCBI).

End of Summary
Now to a new discussion on alignment: first, the alignment algorithms available on the net, mainly FASTA and BLAST.

FASTA-Stages
Find k-tups in the two sequences (k = 1 or 2 for proteins, 4 to 6 for DNA sequences).
Score and select the top 10 scoring "local diagonals". For proteins, each k-tup found is scored using the PAM250 matrix. For DNA, the score is the number of k-tups found. Intervening gaps are penalized.

Finding k-tups
[Figure: protein 1 (n c s p t a) and protein 2 (a c s p r k) laid out against positions 1 to 11, with a table giving, for each amino acid (a, c, k, n, p, r, s, t), its position in protein A, its position in protein B, and the offset pos A - pos B.]
Note the common offset for the 3 amino acids c, s and p. A possible alignment is thus quickly found:
protein 1  n c s p t a
             | | |
protein 2  a c s p r k
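The offset bookkeeping can be sketched as follows. This is an illustration under assumptions: both proteins are taken to start at position 1 (the slide's original position layout is not recoverable), k = 1, and the function name `ktup_hits` is hypothetical.

```python
from collections import Counter, defaultdict


def ktup_hits(a, b, k=1):
    """List (k-tup, pos in a, pos in b, offset) for every shared k-tup.

    Positions are 1-based and offset = pos in a - pos in b.
    """
    index = defaultdict(list)                 # k-tup -> positions in a
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i + 1)
    hits = []
    for j in range(len(b) - k + 1):
        word = b[j:j + k]
        for i in index.get(word, []):
            hits.append((word, i, j + 1, i - (j + 1)))
    return hits


hits = ktup_hits("ncspta", "acsprk")
diagonals = Counter(offset for _, _, _, offset in hits)
print(hits)
print(diagonals.most_common(1))
```

K-tups sharing an offset lie on the same diagonal of a dot plot, so the most frequent offset points at the best candidate region. Under the stated assumption, c, s and p all fall on offset 0.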

FASTA, K-tups with common offset

FASTA-Stages
Rescan the top 10 regions and score them with PAM250 (proteins) or a DNA scoring matrix. Trim off the ends of the regions to achieve the highest scores.
Try to join regions with gapped alignments. Join if the similarity score is one standard deviation above the average expected score.
After finding the best initial region, FASTA performs a global alignment of a 32-residue-wide region centered on the best initial region, and uses the score as the optimized score.

FASTA
FastA is a family of programs: FastA, TFastA, FastX, FastY.
[Figure: which program compares a DNA or protein query against a DNA or protein database.]

FastA
BLOSUM50 is the default matrix. Use a lower PAM or higher BLOSUM matrix to detect close sequences, and a higher PAM or lower BLOSUM matrix to detect distant sequences.
Gap opening penalty: -12 for proteins and -16 for DNA by default.
Gap extension penalty: -2 for proteins and -4 for DNA by default.
The larger the word length, the less sensitive but faster the search will be.
The maximum number of scores and alignments is 100.

FastA Output
Database code, hyperlinked to the SRS database at EBI.
Accession number.
Description.
Length.
Initn, init1, opt and z-score, calculated during the run.
E score: the expectation value, that is, how many hits are expected to be found by chance with such a score when comparing this query to this database. E() does not represent the % similarity.

FASTA Output

BLAST: Basic Local Alignment Search Tool
Altschul et al. 1990, 1994, 1997.
Heuristic method for local alignment.
Designed specifically for database searches.
Idea: good alignments contain short lengths of exact matches.

Blast – Basic Local Alignment Search Tool
Blast uses a heuristic search algorithm and the statistical methods of Karlin and Altschul (1990).
Blast programs were designed for fast database searching with minimal sacrifice of sensitivity for distantly related sequences.

Low-complexity and Gapped Blast Algorithm
The SEG program has been implemented as part of the blast routine in order to mask low-complexity regions. Low-complexity regions are denoted by strings of Xs in the query sequence.
In 1997 a modification introduced the generation of gapped alignments.
The gapped algorithm seeks only ONE ungapped alignment that makes up a significant match, and hence speeds up the initial database search.
Dynamic programming is used to extend a central pair of aligned residues in both directions to yield the final gapped alignment.
Gapped blast is 3 times faster than ungapped blast.

Mathematical Basis of BLAST
Model matches as a sequence of coin tosses. Let p be the probability of a "head". For a "fair" coin, p = 0.5.
(Erdös-Rényi) If there are n throws, then the expected length R of the longest run of heads is R = log_{1/p}(n).
Example: suppose n = 20 for a "fair" coin. Then R = log_2(20) = 4.32.
The trick is how to model DNA (or amino acid) sequence alignments as coin tosses.

To model random sequence alignments, replace a match with a "head" and a mismatch with a "tail". For DNA, the probability of a "head" is 1/4. What is it for amino acid sequences?
Example:
AATCAT
ATTCAG
HTHHHT

So, for one particular alignment, the Erdös-Rényi property can be applied. What about all possible alignments? Consider that the sequences are being shifted back and forth against each other, as in a dot matrix plot.
The expected length of the longest match is R = log_{1/p}(mn), where m and n are the lengths of the two sequences.
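Both formulas are easy to evaluate numerically. The sketch below uses hypothetical function names, and the sequence lengths m = n = 100 in the second call are arbitrary illustration values.

```python
import math


def expected_longest_run(n, p):
    """R = log_{1/p}(n): expected longest run of heads in n tosses."""
    return math.log(n, 1 / p)


def expected_longest_match(m, n, p):
    """R = log_{1/p}(mn): expected longest match over all shifts."""
    return math.log(m * n, 1 / p)


# Fair coin, 20 throws: the example from the slides.
fair_coin_run = expected_longest_run(20, 0.5)
# Two random DNA sequences of length 100 each, with p = 1/4.
dna_match = expected_longest_match(100, 100, 0.25)
print(round(fair_coin_run, 2), round(dna_match, 2))
```

The first value reproduces the 4.32 of the coin example. The second shows that even unrelated DNA sequences of modest length are expected to share a match run of several bases by chance.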

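Steps of BLAST
Filter out low-complexity regions. The complexity of a sequence is
K = (1/L) log_N ( L! / (n_1! n_2! ... n_N!) )
where L is the length, N is the alphabet size, and n_i is the number of times letter i appears in the sequence.
Example: for AAAT, K = (1/4) log_4 ( 24 / (3! * 1! * 0! * 0!) ) = (1/4) log_4(4) = 0.25.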
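The complexity measure can be computed directly from the letter counts. This is a minimal sketch with a hypothetical function name. It takes the alphabet size as a parameter rather than inferring it from the sequence, since letters that never occur contribute 0! = 1.

```python
import math
from collections import Counter


def complexity(seq, alphabet_size):
    """K = (1/L) * log_N( L! / prod(n_i!) ) for the sequence seq."""
    L = len(seq)
    arrangements = math.factorial(L)
    for count in Counter(seq).values():
        arrangements //= math.factorial(count)   # exact integer division
    return math.log(arrangements, alphabet_size) / L


k_aaat = complexity("AAAT", 4)
k_aaaa = complexity("AAAA", 4)
print(k_aaat, k_aaaa)
```

AAAT gives 0.25 as in the worked example, and a homopolymer such as AAAA gives 0, the lowest possible complexity.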

Steps of BLAST
Query words of length 3 (for proteins) or 11 (for DNA) are created from the query sequence using a sliding window.
Example: the query MEFPGLGSLGTSEPLPQFVDPALVSS yields the words MEF, EFP, FPG, PGL, GLG, and so on.
Expected run length in sequences of ~90 for proteins and 64 for DNA. The values 90 and 64 can be obtained from the expected run length formula shown earlier.
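The sliding window can be sketched in one line of Python. The function name `query_words` is hypothetical, and the query is the protein fragment from the slide.

```python
def query_words(seq, w=3):
    """All overlapping words of length w, moving the window by one letter."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


query = "MEFPGLGSLGTSEPLPQFVDPALVSS"
words = query_words(query)
print(words[:5])
```

A query of length L produces L - w + 1 words, so the number of words grows linearly with the query length.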

Steps of BLAST
Using BLOSUM62 (for proteins) or scores of +5/-4 (DNA, PAM40), score all possible words of length 3 or 11, respectively, against a query word.
Select a neighborhood word score threshold (T) so that only the most significant words are kept. This gives approximately 50 hits per query word.
Repeat the scoring and thresholding for each query word produced by the sliding window. The total number of high-scoring words is approximately 50 * sequence length.

Steps of BLAST
Organize the high-scoring words into a search tree.
Scan each database sequence for matches to the high-scoring words. Each match is a seed for an ungapped alignment.
[Figure: search tree over the letters M, E, F, G, P.]

Steps of BLAST
(Original BLAST) Extend matching words to the left and right using ungapped alignments. Extension continues as long as the score increases or stays the same. The result is an HSP (high-scoring pair).
(BLAST2) Matches along the same diagonal (think dot plot) within a distance A of each other are joined, and then the longer sequence is extended as before. This requires a lower T.
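The ungapped extension rule stated above can be sketched as follows. This follows the slide's simplified rule literally (stop as soon as a step would lower the score); actual BLAST implementations tolerate a bounded drop below the best score seen so far. The function name, the +5/-4 DNA scores and the two test strings are illustrative assumptions.

```python
def extend_seed(query, subject, qi, si, w, match=5, mismatch=-4):
    """Extend a seed of length w at query[qi:], subject[si:] without gaps.

    Returns (query start, subject start, HSP length, HSP score).
    """
    def pair(a, b):
        return match if a == b else mismatch

    score = sum(pair(query[qi + k], subject[si + k]) for k in range(w))
    left = 0
    while (qi - left - 1 >= 0 and si - left - 1 >= 0
           and pair(query[qi - left - 1], subject[si - left - 1]) >= 0):
        score += pair(query[qi - left - 1], subject[si - left - 1])
        left += 1
    right = 0
    while (qi + w + right < len(query) and si + w + right < len(subject)
           and pair(query[qi + w + right], subject[si + w + right]) >= 0):
        score += pair(query[qi + w + right], subject[si + w + right])
        right += 1
    return qi - left, si - left, w + left + right, score


# Hypothetical sequences sharing "ACGTAC"; the seed is "GT" at index 4.
hsp = extend_seed("TTACGTACGG", "CCACGTACAA", 4, 4, 2)
print(hsp)
```

In this example the seed grows by two letters on each side and stops at the first mismatch in either direction, giving an HSP of length 6.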

Steps of BLAST
Using a cutoff score S, keep only the extended matches that have a score of at least S.
Determine the statistical significance of each remaining match (from last time).
Try to extend the HSPs if possible.
Show Smith-Waterman local alignments.

Blast Application
Blast is a family of programs: BlastN, BlastP, BlastX, tBlastN, tBlastX.
BlastN: nt versus nt database.
BlastP: protein versus protein database.
BlastX: translated nt versus protein database.
tBlastN: protein versus translated nt database.
tBlastX: translated nt versus translated nt database.

Statistical Significance of Blast
Probability (P): a score of the likelihood of the match having arisen by chance. The closer the P value is to zero, the greater the confidence that the match is real. The closer the value is to unity, the greater the chance that the match is spurious.
Expected frequency (E) value: the number of hits one can expect to see by chance (noise) when searching a database of a particular size. An E value of 1 means one match with a similar score is expected by chance. An E value of 0 means no matches are expected by chance.
