Sequence Alignment We assume a link between the linear information stored in DNA, RNA or amino-acid sequence and the protein function determined by its.


1 Sequence Alignment
We assume a link between the linear information stored in a DNA, RNA or amino-acid sequence and the protein function determined by its three-dimensional structure.
We want to compare the linear sequences of various genes in order to deduce function, phylogeny, structure, origin…
A high level of similarity suggests homology.

2 The Problem
Biological problem: Finding a way to compare and represent similarity or dissimilarity between biomolecular sequences (DNA, RNA or amino acid)
Computational problem: Finding a way to perform inexact or approximate matching of subsequences within strings of characters

3 Homology
Similarity due to descent from a common ancestor
Homologous sequences can be identified through sequence alignment
Thus, it is possible to predict/infer structure or function from primary sequence analysis

4 Gaps
Sequences may have diverged from a common ancestor through mutations:
  Substitution (AAGC → AAGT)
  Insertion (AAG → AAGT)
  Deletion (AAGC → AAG)
The latter two operations result in gaps ( _ )
K contiguous spaces = gap of length K ( K > 0 )

5 Similarity and Alignment
Similarity has two aspects:
Quantitative aspect: Similarity measure
  A number that represents the degree of similarity
  Example: a score indicating a 10% match between 2 DNA sequences.
Qualitative aspect: An alignment
  A mutual arrangement of two sequences that shows where the two sequences are similar, and where they differ.
  An optimal alignment is one that exhibits the most correspondences and the fewest differences.
Example:
  a b c d e _ h z
  a b w d e f h _

6 The Edit Distance between two strings
Definition: The edit distance between two strings is the minimum number of edit operations (insertions, deletions, and substitutions) needed to transform the first string into the second. For emphasis, note that matches are not counted.
Example: AATT and AATG
Distance = 1 (one substitution)
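The definition above can be turned into a small dynamic-programming routine. This is a minimal sketch in Python, not code from the slides; the function name `edit_distance` is chosen here for illustration, and every operation is assumed to cost 1.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))  # empty prefix of a: j insertions
    for i, ca in enumerate(a, start=1):
        curr = [i]  # a[:i] against the empty string: i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1  # matches are not counted
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


print(edit_distance("AATT", "AATG"))
```

Only two rows of the table are kept at any time, which is enough when the distance itself is wanted and the edit transcript is not.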

7 String alignment
An edit transcript is a way to represent a particular transformation of one string into another
  Emphasizes point mutations in the model
An alignment displays a relationship between two strings
Global alignment means that, for each string, the entire string is involved in the alignment
Examples:
(1) A A G C A
    A A _ C _
(2) GSAQVKGHGKKVADAL ….
    H+ KV ….
    NNPELQAHAGKVFKLV ….

8 Alignment vs. Edit Transcript
Essentially equivalent:
  Two opposing characters that mismatch in an alignment → a substitution in the edit transcript
  A gap or space in the first string of an alignment → an insertion of the opposing character
  A gap or space in the second string → a deletion of the opposing character
The distinction is one of product vs. process

9 Gap cost or penalty functions
Observation: A gap of length k is more probable than k gaps of length 1
  The cause might be a single mutational event
  Separated gaps probably arose from different events
Gap penalty functions:
  Linear cost: Treats both cases uniformly
  Affine cost: Common to use a higher cost h for opening a gap and a lower cost g for extending a gap
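The difference between the two penalty schemes can be seen with a few lines of Python. This is a sketch under an assumed convention: the opening cost h is charged once per gap and the extension cost g once per space, so a gap of length k costs h + g·k. Other texts charge the opening cost for the first space and the extension cost only for the rest. The values h = 8 and g = 3 are illustrative only.

```python
def linear_gap_cost(k: int, g: int) -> int:
    """Linear scheme: every space costs g, however the spaces are grouped."""
    return g * k


def affine_gap_cost(k: int, h: int, g: int) -> int:
    """Affine scheme (assumed convention): h once per gap plus g per space."""
    return h + g * k if k > 0 else 0


h, g = 8, 3  # hypothetical costs, chosen only for the demonstration

# One gap of length 3 versus three separate gaps of length 1.
print(linear_gap_cost(3, g), 3 * linear_gap_cost(1, g))        # equal under linear cost
print(affine_gap_cost(3, h, g), 3 * affine_gap_cost(1, h, g))  # one long gap is cheaper
```

Under the affine scheme the single long gap is cheaper than three scattered ones, which matches the observation that a long gap is more likely to come from one mutational event.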

10 Pairwise Sequence Alignment
Example: sequences HEAGAWGHEE and PAWHEAE
Here we have two different possible alignments. How do we determine which one is better?
  HEAGAWGHE-E
  P-A--W-HEAE

  HEAGAWGHE-E
  --P-AW-HEAE

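11 Example
Gap penalty: -8 Gap extension: -3
  HEAGAWGHE-E
  --P-AW-HEAE
The substitution scores are from the BLOSUM50 matrix: s(A,P) = -1, s(A,A) = 5, s(W,W) = 15, s(H,H) = 10, s(E,E) = 6, s(H,P) = -2
Charging -8 for every gap column:
(-8)+(-8)+(-1)+(-8)+5+15+(-8)+10+6+(-8)+6 = 1
Exercise: Calculate for
  HEAGAWGHE-E
  P-A--W-HEAE
The answer is (-2)+(-8)+5+(-8)+(-8)+15+(-8)+10+6+(-8)+6 = 0
(If the second column of each two-column gap is charged the extension penalty -3 instead of -8, the totals become 6 and 5.)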
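The column-by-column sum for this example can be checked with a short script. This is a sketch and not part of the slides: `score_alignment` and `BLOSUM50_SUBSET` are names chosen here, the dictionary holds only the BLOSUM50 entries this example needs, and a gap column is assumed to count as an extension only when the previous column had a gap in the same row.

```python
# BLOSUM50 entries needed for the two example alignments (symmetric pairs).
BLOSUM50_SUBSET = {
    ("A", "A"): 5, ("A", "P"): -1, ("H", "P"): -2,
    ("W", "W"): 15, ("H", "H"): 10, ("E", "E"): 6,
}


def pair_score(a, b):
    # The matrix is symmetric, so look the pair up in either order.
    if (a, b) in BLOSUM50_SUBSET:
        return BLOSUM50_SUBSET[(a, b)]
    return BLOSUM50_SUBSET[(b, a)]


def score_alignment(top, bottom, gap_open=-8, gap_extend=-8):
    """Sum the column scores of a given alignment ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    prev_gap_row = None  # row that held a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row is an extension.
            total += gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            total += pair_score(a, b)
            prev_gap_row = None
    return total


print(score_alignment("HEAGAWGHE-E", "--P-AW-HEAE"))                 # every gap column -8
print(score_alignment("HEAGAWGHE-E", "--P-AW-HEAE", gap_extend=-3))  # with extension -3
```

Leaving `gap_extend` equal to `gap_open` gives the linear scheme, in which every gap column is charged -8; passing `gap_extend=-3` applies the extension penalty stated on the slide.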

12 Formal Description
Problem: PairSeqAlign
Input:
  Two sequences x, y
  Scoring matrix s
  Gap penalty d
  Gap extension penalty e
Output: The optimal sequence alignment

13 How Difficult Is This?
Given two sequences of length m and n:
How many alignments are there? f(m,n)
How many non-equivalent alignments are there? g(m,n)
Let's take a break from PowerPoint to wake you up. Lights on please.

14 Back to Power-point

15 F(n,m)
f(n,m) = f(n-1,m) + f(n,m-1) + f(n-1,m-1)
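The recurrence is easy to evaluate with memoization. This is a sketch in Python; the base case f(n,0) = f(0,m) = 1 is an assumption (one sequence aligned entirely against gaps), since the slides do not state it. For comparison the script also prints the binomial coefficient C(40,20), which is of the same order as the "over 120 billion" quoted for n = 20; reading that figure as the non-equivalent count g is an inference, not something the slides state.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_alignments(n: int, m: int) -> int:
    """f(n,m) from the recurrence f(n,m) = f(n-1,m) + f(n,m-1) + f(n-1,m-1)."""
    if n == 0 or m == 0:
        return 1  # assumed base case: the remaining letters all face gaps
    return (count_alignments(n - 1, m)        # last column: letter over gap
            + count_alignments(n, m - 1)      # last column: gap over letter
            + count_alignments(n - 1, m - 1)) # last column: letter over letter


for n in (1, 2, 3, 5, 10, 20):
    print(n, count_alignments(n, n))

print(comb(40, 20))  # for comparison with the figure quoted for n = 20
```

Either count grows far too quickly for alignments to be enumerated one by one, which is the motivation for dynamic programming.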

16 F(n,m)
[Diagram: cell f(n,m) shown with its neighbouring cells f(n-1,m-1), f(n,m-1) and f(n-1,m)]

17 G(n,m)

18 G(n,m)
[Diagram: cell g(n,m) shown with its neighbouring cells g(n-1,m-1), g(n,m-1) and g(n-1,m)]

19 So what?
So at n = 20, we have over 120 billion (120,000,000,000) possible alignments
We want to be able to align much, much longer sequences
  Some proteins have 1000 amino acids
  Genes can have several thousand base pairs

20 Dynamic Programming
General algorithmic development technique
Reuses the results of previous computations
  Store intermediate results in a table for reuse
  Look up an earlier result in the table to build from

21 Global Alignment
Needleman-Wunsch 1970
Idea: Build up the optimal alignment from optimal alignments of subsequences
  HEAG
  --P-     score -25
Add the score from the table for each of the three possible extensions:
  Gap with top:     HEAG-  /  --P-A   score -33
  Top and bottom:   HEAGA  /  --P-A   score -20
  Gap with bottom:  HEAGA  /  --P--   score -33

22 Global Alignment Notation
x_i – ith letter of string x
y_j – jth letter of string y
x_1..i – prefix of x from letters 1 through i
F – matrix of optimal scores
  F(i,j) represents the optimal score for lining up x_1..i with y_1..j
d – gap penalty
s – scoring matrix

23 Global Alignment
The work is to build up F
Initialize: F(0,0) = 0, F(i,0) = id, F(0,j) = jd
Fill from top left to bottom right using the recursive relation
F(i,0) and F(0,j) represent aligning the first i letters of x to all gaps and the first j letters of y to all gaps, respectively.

24 Global Alignment
F(i,j) is reached from three neighbouring cells:
  F(i-1,j-1) + s(xi,yj): move ahead in both strings
  F(i-1,j) + d: xi aligned to a gap
  F(i,j-1) + d: yj aligned to a gap
F(i,j) takes the best (maximum) of the three.
x represents the top string, y the bottom string
While building the table, keep track of where the optimal score came from (reverse arrows)
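The initialization, fill and traceback steps can be put together as follows. This is a minimal sketch of the Needleman-Wunsch procedure with a linear gap penalty, not the presenter's code. The names `needleman_wunsch`, `blosum50` and `BLOSUM50_ROWS` are chosen here, the score table holds only the BLOSUM50 entries needed for the running example, and ties in the traceback are broken in favour of the diagonal move, which is an arbitrary choice.

```python
# BLOSUM50 entries for the running example: rows are letters of y,
# columns are the letters H, E, A, G, W that occur in x.
COLS = "HEAGW"
BLOSUM50_ROWS = {
    "P": (-2, -1, -1, -2, -4),
    "A": (-2, -1, 5, 0, -3),
    "W": (-3, -3, -3, -3, 15),
    "H": (10, 0, -2, -2, -3),
    "E": (0, 6, -1, -3, -3),
}


def blosum50(a, b):
    """Score for letter a of x against letter b of y (example subset only)."""
    return BLOSUM50_ROWS[b][COLS.index(a)]


def needleman_wunsch(x, y, score, d):
    n, m = len(x), len(y)
    # F[i][j]: optimal score for aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * d
    for j in range(1, m + 1):
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] + d,   # x_i aligned to a gap
                          F[i][j - 1] + d)   # y_j aligned to a gap
    # Traceback from the bottom-right cell to F[0][0].
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, top_row, bottom_row = needleman_wunsch("HEAGAWGHEE", "PAWHEAE", blosum50, -8)
print(best)
print(top_row)
print(bottom_row)
```

Here d is negative (-8) and is added, matching the initialization F(i,0) = id. Instead of storing arrows, the traceback recomputes which neighbour produced each cell, which keeps the sketch short at the cost of a few extra comparisons.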

25 Example
             H    E    A    G    A    W    G    H    E    E
        0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
  P    -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
  A
  W
  H
  E
  A
  E
Exercise: fill in the rest of the table

26 Completed Table
Score: Gap –8, Error –2, Fit +6
             H    E    A    G    A    W    G    H    E    E
        0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
  P    -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
  A   -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
  W   -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
  H   -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
  E   -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
  A   -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
  E   -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

27 Traceback
Trace arrows back from the lower right to the top left
  Diagonal – both
  Up – upper gap
  Left – lower gap
(The completed table from slide 26 is shown again here.)
Diagonal means use one letter from both, up means one letter from the bottom and a gap on top, left means one letter from the top and a gap on the bottom.
  HEAGAWGHE-E
  --P-AW-HEAE

28 Summary
Uses recursion to fill in the intermediate results table
Uses O(nm) space and time, where m and n are the lengths of the strings
  An O(n^2) algorithm
Feasible for moderate-sized sequences, but not for aligning whole genomes.

29 Local Alignment Smith-Waterman (1981)
Another dynamic programming solution

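30 Example
          H   E   A   G   A   W   G   H   E   E
      0   0   0   0   0   0   0   0   0   0   0
  P   0   0   0   0   0   0   0   0   0   0   0
  A   0   0   0   5   0   5   0   0   0   0   0
  W   0   0   0   0   2   0  20  12   4   0   0
  H   0  10   2   0   0   0  12  18  22  14   6
  E   0   2  16   8   0   0   4  10  18  28  20
  A   0   0   8  21  13   5   0   4  10  20  27
  E   0   0   6  13  18  12   4   0   4  16  26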

31 Traceback
Start at the highest score and trace back to the first 0
  AWGHE
  AW-HE
(The table from slide 30 is shown again here.)

32 Summary
Similar to the global alignment algorithm
For this to work, the expected match with a random sequence must have a negative score.
  Behavior is like global alignment otherwise
Similar extensions for repeated and overlap matching
Care must be given to gap penalties to maintain O(nm) time complexity
  Affine gap scores have an efficient implementation.
  Otherwise the algorithm has cubic time complexity.

33 Repeat and Overlap Matches
Repeat matches allow for sections of a sequence to match repeatedly
  Repeated domain or motif
Overlap matches
  Matching when the two sequences overlap
  Does not penalize overhanging ends
[Diagram: overlapping arrangements of sequences x and y]

34 BLAST
O(n^2) algorithms are too slow for large-scale searches
BLAST developed by Altschul et al. (1990)
Uses a probabilistic approach to searching
Idea: True alignments will have a short stretch of identities (perfect match)

35 BLAST Overview
Make a list of neighborhood words
  Length 3 for proteins, 11 for nucleic acids
  Words that match the query with a score higher than some threshold
  Usually 2 bits per residue
Scan the database for words
When a hit is obtained, extend the match in both directions as an ungapped alignment

36 FASTA
Pearson & Lipman (1988)
Find all matching words of length ktup
  1 or 2 for proteins, 4 or 6 for DNA
Look for diagonals supporting word matches
Extend with ungapped alignment
Join ungapped regions with gaps
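The first two FASTA steps (finding matching words and grouping them by diagonal) can be sketched in a few lines of Python. This is an illustration of the idea only, not the FASTA program: the function name `word_hits_by_diagonal` is hypothetical, and the later steps (scoring diagonals, ungapped extension, joining regions with gaps) are left out.

```python
from collections import defaultdict


def word_hits_by_diagonal(query, target, ktup=2):
    """Group exact word matches of length ktup by diagonal (i - j)."""
    # Index every word of length ktup in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    # Scan the target; each shared word is a hit at (i, j) on diagonal i - j.
    diagonals = defaultdict(list)
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], []):
            diagonals[i - j].append((i, j))
    return dict(diagonals)


hits = word_hits_by_diagonal("HEAGAWGHEE", "PAWHEAE", ktup=2)
for diagonal, positions in sorted(hits.items()):
    print(diagonal, positions)
```

Hits that share a diagonal can be part of the same ungapped alignment, which is why diagonals with several hits are the ones worth extending.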

37 Pairwise Sequence Alignment: Local alignment, Scoring Matrices

38 Topics of Discussion
Last class we covered:
  Dynamic programming for global alignment
This class:
  Local alignment algorithms
  Database searching and scoring matrices
Next class:
  Summaries due for BLAST paper

39 Local Alignment Problem
First formulated: Smith and Waterman (1981)
Problem definition: Find subsequences in S1 and S2 whose similarity is maximum over all pairs of subsequences from S1 and S2

40 Motivation
Searching for unknown domains or motifs within proteins from different families
  Proteins encoded by Homeobox genes (only conserved in 1 region called the Homeodomain: 60 amino acids long, 50-95% alignment across certain insect and mammalian genes)
  Identifying active sites of enzymes
Comparing long stretches of anonymous DNA
Querying databases where the query word is much smaller than the sequences in the database

41 Changes to DP algorithm
Interpretation of array values:
  V(i,j) = score of the best alignment of a suffix of S1(1..i) and a suffix of S2(1..j)
Recurrence relation:
  V(i,j) = max { 0, V(i-1,j-1) + s(S1(i), S2(j)), V(i-1,j) + g, V(i,j-1) + g }
Empty substrings value:
Restriction on scoring scheme

42 Changes to DP algorithm
Initialization of matrix:
  First row and column filled with 0’s
Traceback:
  Find the maximum value of V(i,j)
  Follow traceback pointers until you hit a cell with value 0
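The changes listed above (the extra 0 in the recurrence, zero initialization, traceback from the maximum cell) can be sketched as follows. This is a minimal Python illustration, not the presenter's code: it assumes a simple match/mismatch score with a linear gap penalty g, the function name `smith_waterman` and the test strings are chosen here, and when several cells share the maximum it reports the first one found in row order.

```python
def smith_waterman(s1, s2, match=1, mismatch=-1, g=-2):
    n, m = len(s1), len(s2)
    # First row and column stay 0: an empty suffix alignment scores 0.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            V[i][j] = max(0,
                          V[i - 1][j - 1] + s,
                          V[i - 1][j] + g,
                          V[i][j - 1] + g)
            if V[i][j] > best:  # strict: keeps the first maximum found
                best, best_pos = V[i][j], (i, j)
    # Traceback from the maximum cell until a cell with value 0 is reached.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and V[i][j] > 0:
        s = match if s1[i - 1] == s2[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + s:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + g:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


print(smith_waterman("CGTA", "TGTC"))
```

Swapping the match/mismatch test for a lookup in a substitution matrix turns this into the protein version of the algorithm.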

43 Example
Let g = -2
Let s(a,b) = 1 if a = b, and -1 otherwise
Alignment:
  G T
  G T
[Table of V(i,j) values; only the labels C G T A and the values 1 and 2 survive in this transcript]

44 Gap cost or penalty functions
Observation: A gap of length k is more probable than k gaps of length 1
  The cause might be a single mutational event
  Separated gaps probably arose from different events
Gap penalty functions:
  Linear cost: Treats both cases uniformly
  Affine cost: Common to use a higher cost h for opening a gap and a lower cost g for extending a gap

45 Review of Optimal Alignment Methods
Needleman-Wunsch: best-path strategy
Smith-Waterman: best-score local alignment – reports first one
  Note: best local score could be best global score
Refinements to the Smith-Waterman algorithm proposed for detecting the k best nonintersecting local alignments
  Altschul & Erickson, 1986
  Waterman & Eggert, 1987

46 Substitution Matrix
Scoring system last discussed:
  Simple match/mismatch scheme
Well known:
  Amino acids substitute easily for one another due to similar physicochemical properties
  Isoleucine for Valine (both small, hydrophobic)
  Serine for Threonine (both polar)
  Such changes are “conservative”
Thus, we need a way to increase the sensitivity of the alignment algorithm
Solution – substitution matrix

47 Scoring scheme
Identical amino acids > Conservative substitutions > Nonconservative substitutions
Therefore, we need a range of values that depend on the nature of the sequences being compared
Substitution matrix – flexible lookup scheme for any pair of amino acids

48 PAM Matrices
First substitution matrices widely used
Based on the point-accepted-mutation (PAM) model of evolution (Dayhoff et al., 1978)
PAMs are relative measures of evolutionary distance
  1 PAM = 1 accepted mutation per 100 AAs
Does this mean that after 100 PAMs every AA will be different? Why or why not?

49 PAM Matrices
If changes were purely random:
  Frequency of each possible substitution is proportional to background frequencies
In related proteins:
  Observed substitution frequencies, called the target (replacement) frequencies, are biased toward those that do not seriously disrupt the protein’s function
  These point mutations are “accepted” during evolution
Log-odds approach:
  Scores proportional to the natural log of the ratio of target frequencies to background frequencies

50 The Math
Score matrix entry for time t given by:
  s(a,b|t) = log [ P(b|a,t) / q_b ]
  P(b|a,t): conditional probability that a is substituted by b in time t
  q_b: frequency of amino acid b

51 PAM Matrices Construction
Pairs of very closely related sequences used to collect mutation frequencies corresponding to 1 PAM – explicit model
Extrapolation of the data to a distance of 250 PAMs
  PAM250 was the original Dayhoff matrix
Family of matrices – PAM10 … PAM200
  Matrix multiplication using PAM-1
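The extrapolation step and the log-odds score can be illustrated on a toy model. This is a sketch under loud assumptions: the alphabet has only two hypothetical letters with equal background frequencies, and the 1-PAM matrix `PAM1` (1% change per step) is invented for the demonstration and is not Dayhoff's data. Only the mechanics are real: PAM-n is the n-th power of the 1-PAM matrix, and the score is log(P(b|a,t) / q_b).

```python
import math


def mat_mul(A, B):
    """Plain matrix product of two square lists-of-lists."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(M, n):
    """M multiplied by itself n times (n = 0 gives the identity)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Hypothetical two-letter model: each letter changes with probability 0.01 per PAM.
PAM1 = [[0.99, 0.01],
        [0.01, 0.99]]
Q = [0.5, 0.5]  # assumed background frequencies


def pam_scores(n):
    """Log-odds scores s(a,b|n) = log(P(b|a,n) / q_b) for the toy model."""
    P = mat_pow(PAM1, n)
    return [[math.log(P[a][b] / Q[b]) for b in range(2)] for a in range(2)]


P100 = mat_pow(PAM1, 100)
print(P100[0][0])      # probability a letter is unchanged after 100 PAMs
print(pam_scores(100))
```

In this toy model a bit more than half of the positions still carry their original letter after 100 PAMs, because repeated and reverse changes hit the same positions. That is one way to see why 100 PAMs does not mean every residue differs.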

52 PAM Matrices: salient points
Derived from global alignments of closely related sequences.
Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
Does not take into account different evolutionary rates between conserved and non-conserved regions.

53 BLOSUM Matrices
Henikoff, S. & Henikoff, J.G. (1992)
Use blocks of protein sequence fragments from different families (the BLOCKS database)
Amino acid pair frequencies calculated by summing over all possible pairs in a block
Different evolutionary distances are incorporated into this scheme with a clustering procedure (identity over a particular threshold = same cluster)

54 BLOSUM Matrices
Similar idea to PAM matrices
Probabilities estimated from blocks of sequence fragments
  Blocks represent structurally conserved regions

55 BLOSUM Matrices
Target frequencies are identified directly instead of by extrapolation.
Sequences more than x% identical within the block where substitutions are being counted are grouped together and treated as a single sequence
  BLOSUM 50: >= 50% identity
  BLOSUM 62: >= 62% identity

56 BLOSUM Matrices: Salient points
Derived from local, ungapped alignments of distantly related sequences
All matrices are directly calculated; no extrapolations are used – no explicit model
The number after the matrix (BLOSUM62) refers to the minimum percent identity of the blocks used to construct the matrix; greater numbers are lesser distances.
The BLOSUM series of matrices generally perform better than PAM matrices for local similarity searches (Proteins 17:49).

57 BLOSUM Example PSC Tutorial - BLOSUM example