Sequence Alignments and Dynamic Programming BIO/CS 471 – Algorithms for Bioinformatics.


1 Sequence Alignments and Dynamic Programming BIO/CS 471 – Algorithms for Bioinformatics

2 Module II: Sequence Alignments
• Problem: SequenceAlignment
  ◦ Input: Two or more strings of characters
  ◦ Output: The optimal alignment of the input strings, possibly including gaps, and an alignment score.
• Example
  ◦ Input:
      The calico cat was very sleepy.
      The cat was sleepy.
  ◦ Output:
      The calico cat was very sleepy.
      The ------ cat was ---- sleepy.

3 Biological Sequence Alignment
• Input: DNA or amino acid sequences
• Output: Optimal alignment
• Example
  ◦ Input:
      ACTATAGCTATAGCTATAGCGTATCG
      ACGATTACAGGTCGTGTCG
  ◦ Output:
      ACTATAGCTATAGCTATAGCGTATCG
      || ||   ||*|| |    |||*|||
      ACGAT---TACAGGT----CGTGTCG
• Score: +8

4 Why align sequences?
• Why would we want to align two sequences?
  ◦ Compare two genes/proteins, e.g. to infer function
  ◦ Database searches
  ◦ Protein structure prediction
• Why would we want to align multiple sequences?
  ◦ Conservation analysis
  ◦ Molecular evolution

5 How do we decide on a scoring function?
• Objective: Sequences that are homologous (evolutionarily related) should align with high scores. Less related sequences should score lower.
• Question: How do homologous sequences arise?

6 The Central Dogma
• DNA → RNA (transcription) → Proteins (translation)

7 DNA Replication
• Prior to cell division, all the genetic instructions must be "copied" so that each new cell will have a complete set
• DNA polymerase is the enzyme that copies DNA
  ◦ Reads the old strand in the 3′ to 5′ direction

8 Over time, genes accumulate mutations
• Environmental factors
  ◦ Radiation
  ◦ Oxidation
• Mistakes in replication or repair
• Deletions, duplications
• Insertions
• Inversions
• Point mutations

9 Deletions
• Codon deletion: ACG ATA GCG TAT GTA TAG CCG…
  ◦ Effect depends on the protein, position, etc.
  ◦ Almost always deleterious
  ◦ Sometimes lethal
• Frame shift mutation:
      ACG ATA GCG TAT GTA TAG CCG…
      ACG ATA GCG ATG TAT AGC CG?…
  ◦ Almost always lethal

10 Indels
• When comparing two genes, it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless ancestry is known:
      ACGTCTGATACGCCGTATCGTCTATCT
      ACGTCTGAT---CCGTATCGTCTATCT

11 Substitutions
[Figure: The Genetic Code]
• Substitutions are mutations accepted by natural selection.
  ◦ Synonymous: CGC → CGA
  ◦ Non-synonymous: GAU → GAA

12 Two kinds of homologs
• Orthologs
  ◦ Species A has a particular gene
  ◦ Species A diverges into species B and C, each with the same gene
  ◦ The two genes accumulate substitutions independently

13 Orthologs and Paralogs
• Paralogs
  ◦ A gene duplication event within a species results in two copies of a gene
  ◦ One of the copies is now under less selective constraint
  ◦ The copies can accumulate substitutions independently, before and after speciation

14 Scoring a sequence alignment
• Match score: +1
• Mismatch score: 0
• Gap penalty: −1
      ACGTCTGATACGCCGTATAGTCTATCT
          ||||| |||   || ||||||||
      ----CTGATTCGC---ATCGTCTATCT
• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Gaps: 7 × (−1)
  Score = +11
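The arithmetic on this slide can be checked mechanically. The following is a minimal sketch, assuming the alignment is given as two equal-length rows with `-` marking a gap; the function name `score_alignment` is illustrative and not from the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score a finished alignment column by column (simple gap penalty)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # any column containing a gap
        elif a == b:
            total += match        # identical characters
        else:
            total += mismatch     # different characters
    return total


print(score_alignment("ACGTCTGATACGCCGTATAGTCTATCT",
                      "----CTGATTCGC---ATCGTCTATCT"))  # 11
```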

15 Origination and length penalties
• We want to find alignments that are evolutionarily likely.
• Which of the following alignments seems more likely to you?
      ACGTCTGATACGCCGTATAGTCTATCT
      ACGTCTGAT-------ATAGTCTATCT

      ACGTCTGATACGCCGTATAGTCTATCT
      AC-T-TGA--CG-CGT-TA-TCTATCT
• We can achieve this by penalizing a new gap more than the extension of an existing gap

16 Scoring a sequence alignment (2)
• Match/mismatch score: +1/0
• Origination/length penalty: −2/−1
      ACGTCTGATACGCCGTATAGTCTATCT
          ||||| |||   || ||||||||
      ----CTGATTCGC---ATCGTCTATCT
• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Origination: 2 × (−2)
• Length: 7 × (−1)
  Score = +7
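A sketch of the same calculation with origination and length penalties, assuming a gap "originates" whenever a gap column is not a continuation of a gap in the same row. The function name and the `-` gap convention are illustrative assumptions.

```python
def score_with_origination(top, bottom, match=1, mismatch=0,
                           origination=-2, length=-1):
    """Score an alignment, charging extra for each newly opened gap."""
    total = 0
    prev_gap_row = None  # row ("top"/"bottom") holding the previous column's gap
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            if gap_row != prev_gap_row:
                total += origination   # a new gap starts here
            total += length            # every gap column pays the length penalty
            prev_gap_row = gap_row
        else:
            total += match if a == b else mismatch
            prev_gap_row = None
    return total


s = "ACGTCTGATACGCCGTATAGTCTATCT"
print(score_with_origination(s, "----CTGATTCGC---ATCGTCTATCT"))  # 7
```

Applied to the two candidate alignments from the previous slide, the same function gives the single long gap a much higher score than the seven gap characters scattered across six short gaps, which is the behaviour the penalties are meant to produce.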

17 Optimal Substructure in Alignments
• Consider the alignment:
      ACGTCTGATACGCCGTATAGTCTATCT
          ||||| |||   || ||||||||
      ----CTGATTCGC---ATCGTCTATCT
• Is it true that the alignment in the boxed region must be optimal?

18 A Greedy Strategy
• Consider this pair of sequences:
      GAGC
      CAGC
  Gap = −1, Match = +1, Mismatch = −2
• Greedy approach: pick the best-scoring first column from
      G      G      -
      C  or  -  or  C
• Leads to:
      GAGC---
      ---CAGC
• Better:
      GAGC
      CAGC

19 Breaking apart the problem
• Suppose we are aligning:
      ACTCG
      ACAGTAG
• First position choices (first column, its score, then what remains to be aligned):
      A   +1   CTCG
      A        CAGTAG

      A   −1   CTCG
      -        ACAGTAG

      -   −1   ACTCG
      A        CAGTAG

20 A Recursive Approach to Alignment
• Choose the best alignment based on these three possibilities:
      align(seq1, seq2) {
        if (both sequences empty) { return 0; }
        if (one sequence empty) {
          return (gapscore * num chars in nonempty seq);
        } else {
          score1 = score(firstchar(seq1), firstchar(seq2)) + align(tail(seq1), tail(seq2));
          score2 = align(tail(seq1), seq2) + gapscore;
          score3 = align(seq1, tail(seq2)) + gapscore;
          return (max(score1, score2, score3));  // higher scores are better
        }
      }
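The pseudocode above translates almost line for line into Python. This is a sketch under the deck's simple scoring scheme (match +1, mismatch 0, gap −1, passed as assumed default parameters); it returns only the score and deliberately keeps the exponential running time discussed on the next slide.

```python
def recurse_align(seq1, seq2, match=1, mismatch=0, gap=-1):
    """Best global alignment score, by plain (exponential) recursion."""
    if not seq1 and not seq2:
        return 0
    if not seq1 or not seq2:
        # Only one sequence has characters left: align them all with gaps.
        return gap * (len(seq1) + len(seq2))
    pair = match if seq1[0] == seq2[0] else mismatch
    # Option 1: align the two first characters with each other.
    score1 = pair + recurse_align(seq1[1:], seq2[1:], match, mismatch, gap)
    # Option 2: align the first character of seq1 with a gap.
    score2 = gap + recurse_align(seq1[1:], seq2, match, mismatch, gap)
    # Option 3: align the first character of seq2 with a gap.
    score3 = gap + recurse_align(seq1, seq2[1:], match, mismatch, gap)
    return max(score1, score2, score3)


print(recurse_align("ACTCG", "ACAGTAG"))  # 2
```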

21 Time Complexity of RecurseAlign
• What is the recurrence equation for the time needed by RecurseAlign?
[Figure: recursion tree in which every call makes 3 further calls, so successive levels hold 3, 9, 27, …, 3^n calls over n levels]

22 RecurseAlign repeats its work

23 How can we find an optimal alignment?
• Finding the alignment is computationally hard:
      ACGTCTGATACGCCGTATAGTCTATCT
      CTGAT---TCG-CATCGTC--T-ATCT
• C(27,7) gap positions = ~888,000 possibilities
• It's possible, as long as we don't repeat our work!
• Dynamic programming: the Needleman & Wunsch algorithm
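Two quick checks of the claims on this slide, as a sketch: `math.comb` confirms the count of gap placements, and caching the recursive function (here with `functools.lru_cache`, one possible way to avoid repeated work, not necessarily the one the course used) makes each pair of suffix positions get solved only once. The name `align_score` and the default scores are assumptions.

```python
from functools import lru_cache
from math import comb


def align_score(s, t, match=1, mismatch=0, gap=-1):
    """Best global alignment score; each (i, j) subproblem is solved once."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # best(i, j) = best score for aligning s[i:] with t[j:]
        if i == len(s) or j == len(t):
            return gap * ((len(s) - i) + (len(t) - j))
        pair = match if s[i] == t[j] else mismatch
        return max(pair + best(i + 1, j + 1),   # align s[i] with t[j]
                   gap + best(i + 1, j),        # s[i] against a gap
                   gap + best(i, j + 1))        # t[j] against a gap

    return best(0, 0)


print(comb(27, 7))                      # 888030
print(align_score("ACTCG", "ACAGTAG"))  # 2
```

With the cache there are at most (len(s) + 1) × (len(t) + 1) distinct subproblems, which is the same table that the Needleman & Wunsch algorithm fills in explicitly.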

24 What is the optimal alignment?
• Sequences:
      ACTCG
      ACAGTAG
• Match: +1
• Mismatch: 0
• Gap: −1

25 Needleman-Wunsch: Step 1
• Each sequence along one axis
• Multiples of the gap penalty in the first row/column
• 0 in [1,1] (or [0,0] for the CS-minded)

26 Needleman-Wunsch: Step 2
• Vertical/horizontal move: score + (simple) gap penalty
• Diagonal move: score + match/mismatch score
• Take the MAX of the three possibilities

27 Needleman-Wunsch: Step 2 (cont'd)
• Fill out the rest of the table likewise…

28 Needleman-Wunsch: Step 2 (cont'd)
• Fill out the rest of the table likewise…
• The optimal alignment score ends up in the lower-right corner
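Steps 1 and 2 can be sketched as a table fill. This assumes one sequence runs along the top (columns) and the other down the left (rows), with the simple scores from the "What is the optimal alignment?" slide; the function and parameter names are illustrative.

```python
def needleman_wunsch_table(top, left, match=1, mismatch=0, gap=-1):
    """Fill the Needleman-Wunsch score table for a simple gap penalty."""
    rows, cols = len(left) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    # Step 1: multiples of the gap penalty in the first row and column.
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        table[i][0] = i * gap
    # Step 2: each cell is the MAX of the diagonal, vertical and horizontal moves.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if left[i - 1] == top[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + pair,  # diagonal
                              table[i - 1][j] + gap,       # vertical
                              table[i][j - 1] + gap)       # horizontal
    return table


table = needleman_wunsch_table("ACTCG", "ACAGTAG")
print(table[-1][-1])  # 2, the optimal score in the lower-right corner
```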

29 But what is the optimal alignment?
• To reconstruct the optimal alignment, we must determine where the MAX at each step came from…

30 A path corresponds to an alignment
• Vertical move = GAP in top sequence
• Horizontal move = GAP in left sequence
• Diagonal move = ALIGN both positions
• One path from the previous table:
• Corresponding alignment (start at the end):
      AC--TCG
      ACAGTAG
  Score = +2
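A sketch of the traceback, assuming the same table layout as before (one sequence along the top, the other down the left). Starting in the lower-right corner, it asks which of the three moves could have produced each cell and walks back to the origin. When several moves tie, this version prefers the diagonal, so it may return a different but equally good alignment from the one drawn on the slide.

```python
def global_align(top, left, match=1, mismatch=0, gap=-1):
    """Return (aligned_top, aligned_left, score) for a simple gap penalty."""
    rows, cols = len(left) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        table[i][0] = i * gap
        for j in range(1, cols):
            pair = match if left[i - 1] == top[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + pair,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)

    # Traceback: start at the end and find where each MAX came from.
    i, j = len(left), len(top)
    top_row, left_row = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if left[i - 1] == top[j - 1] else mismatch
            if table[i][j] == table[i - 1][j - 1] + pair:
                top_row.append(top[j - 1])      # diagonal: align both
                left_row.append(left[i - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and table[i][j] == table[i - 1][j] + gap:
            top_row.append("-")                 # vertical: gap in top sequence
            left_row.append(left[i - 1])
            i -= 1
        else:
            top_row.append(top[j - 1])          # horizontal: gap in left sequence
            left_row.append("-")
            j -= 1
    return "".join(reversed(top_row)), "".join(reversed(left_row)), table[-1][-1]


aligned_top, aligned_left, score = global_align("ACTCG", "ACAGTAG")
print(aligned_top)
print(aligned_left)
print(score)  # 2
```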

31 Practice Problem
• Find an optimal alignment for these two sequences:
      GCGGTT
      GCGT
• Match: +1
• Mismatch: 0
• Gap: −1

32 Practice Problem
• Find an optimal alignment for these two sequences:
      GCGGTT
      GCGT
• Answer:
      GCGGTT
      GCG-T-
  Score = +2

33 Semi-global alignment
• Suppose we are aligning:
      GCG
      GGCG
• Which do you prefer?
      G-CG      -GCG
      GGCG      GGCG
• Semi-global alignment allows gaps at the ends for free.

34 Semi-global alignment
• Semi-global alignment allows gaps at the ends for free.
• Initialize first row and column to all 0's
• Allow free horizontal/vertical moves in last row and column
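A minimal sketch of the two changes, assuming the simple scores used earlier (match +1, mismatch 0, gap −1). Leaving the first row and column at 0 makes leading gaps free; taking the best value anywhere in the last row or last column has the same effect as allowing free moves along them, so trailing gaps are free too. The function name is illustrative.

```python
def semiglobal_score(top, left, match=1, mismatch=0, gap=-1):
    """Best alignment score when gaps at either end of either sequence are free."""
    rows, cols = len(left) + 1, len(top) + 1
    # First row and first column stay 0: leading gaps cost nothing.
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if left[i - 1] == top[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + pair,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)
    # Free moves in the last row and column: take the best cell on either edge.
    last_row_best = max(table[-1])
    last_col_best = max(row[-1] for row in table)
    return max(last_row_best, last_col_best)


print(semiglobal_score("GCG", "GGCG"))  # 3
```

For the example on the previous slide this scores the second alignment (leading gap, then three matches) at +3, whereas a global alignment of the same pair would score +2.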

35 Local alignment
• Global alignments: score the entire alignment
• Semi-global alignments: allow unscored gaps at the beginning or end of either sequence
• Local alignment: find the best matching subsequence
      CGATG
      AAATGGA
• This is achieved by allowing a 4th alternative at each position in the table: zero.

36 Local alignment
• Mismatch = −1 this time
      CGATG
      AAATGGA
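A sketch of the "fourth alternative", assuming match +1, mismatch −1 as on this slide, and a gap penalty of −1 carried over from the earlier examples (the slide does not restate it). Every cell may restart at zero, and the answer is the largest cell anywhere in the table rather than the lower-right corner.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best local alignment score; cells may restart at 0."""
    best = 0
    prev = [0] * (len(t) + 1)          # first row is all zeros
    for a in s:
        cur = [0]                      # first column is all zeros
        for j, b in enumerate(t, 1):
            pair = match if a == b else mismatch
            cell = max(0,                      # 4th alternative: start afresh
                       prev[j - 1] + pair,     # diagonal
                       prev[j] + gap,          # vertical
                       cur[j - 1] + gap)       # horizontal
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


print(local_alignment_score("CGATG", "AAATGGA"))  # 3, from ATG aligned with ATG
```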

37 Food for thought…
• What is the asymptotic time complexity of the Needleman & Wunsch algorithm? Is this good or bad?
• How about the space complexity? Why might this be a problem?

38 Saving Space
• Note that we can throw away the previous rows of the table as we fill it in:
[Figure: table with two adjacent rows marked; "this row" is based only on "this one", the row above it]

39 Saving Space (2)
• Each row of the table contains the scores for aligning a prefix of the left-hand sequence with all prefixes of the top sequence:
[Figure: one table row highlighted, holding the scores for aligning aca with all prefixes of actcg]
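The two "Saving Space" slides can be sketched as a fill that keeps only the previous row. The function below (name and default scores are assumptions) returns the final row, that is, the scores for aligning all of `left` with every prefix of `top`.

```python
def last_row_scores(left, top, match=1, mismatch=0, gap=-1):
    """Scores for aligning all of `left` with each prefix of `top`, in two rows of memory."""
    prev = [j * gap for j in range(len(top) + 1)]   # row for the empty prefix of left
    for i, a in enumerate(left, 1):
        cur = [i * gap]                             # first column of this row
        for j, b in enumerate(top, 1):
            pair = match if a == b else mismatch
            cur.append(max(prev[j - 1] + pair,      # diagonal
                           prev[j] + gap,           # vertical
                           cur[j - 1] + gap))       # horizontal
        prev = cur                                  # older rows are discarded
    return prev


print(last_row_scores("ACA", "ACTCG"))  # [-3, -1, 1, 2, 1, 0]
```

The last entry of the returned row is the global alignment score, but the path through the table is no longer available, which is why the divide-and-conquer approach on the following slides is needed to recover the alignment itself.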

40 Divide and Conquer
• By using a recursive approach, we can use only two rows of the matrix at a time:
  ◦ Choose the middle character of the top sequence, i
  ◦ Find out where i aligns to the bottom sequence (needs two vectors of scores)
  ◦ Recursively align the sequences before and after the fixed positions
      ACGCTATGCTCATAG
             ^ i
      CGACGCTCATCG

41 Finding where i lines up
• Find out where i aligns to the bottom sequence
  ◦ Needs two vectors of scores
• Assuming i lines up with a character:
      alignscore = align(ACGCTAT, prefix(t)) + score(G, char from t) + align(CTCATAG, suffix(t))
• Which character is best?
  ◦ Can quickly find out the score for aligning ACGCTAT with every prefix of t.
      s: ACGCTATGCTCATAG
                ^ i
      t: CGACGCTCATCG

42 Finding where i lines up
• But i may also line up with a gap
• Assuming i lines up with a gap:
      alignscore = align(ACGCTAT, prefix(t)) + gapscore + align(CTCATAG, suffix(t))
      s: ACGCTATGCTCATAG
                ^ i
      t: CGACGCTCATCG

43 Recursive Call
• Fix the best position for i
• Call align recursively for the prefixes and suffixes:
      s: ACGCTATGCTCATAG
                ^ i
      t: CGACGCTCATCG
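The last four slides describe what is usually called Hirschberg's algorithm. Below is a compact sketch under the deck's simple scoring (match +1, mismatch 0, gap −1); the function names are assumptions. One score vector comes from aligning the part of `s` before `i` with every prefix of `t`, the other from aligning the reversed part after `i` with the reversed `t`, which gives the scores against every suffix.

```python
def last_row(s, t, match, mismatch, gap):
    """Scores for aligning all of s with every prefix of t (two rows of memory)."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * gap]
        for j, b in enumerate(t, 1):
            pair = match if a == b else mismatch
            cur.append(max(prev[j - 1] + pair, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev


def align(s, t, match=1, mismatch=0, gap=-1):
    """Return an optimal global alignment (aligned_s, aligned_t)."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    mid = len(s) // 2
    before, ch, after = s[:mid], s[mid], s[mid + 1:]
    n = len(t)
    pre = last_row(before, t, match, mismatch, gap)              # pre[j]: before vs t[:j]
    suf = last_row(after[::-1], t[::-1], match, mismatch, gap)   # suf[k]: after vs last k chars of t
    best = None
    for j in range(n):            # case 1: i lines up with the character t[j]
        pair = match if ch == t[j] else mismatch
        cand = pre[j] + pair + suf[n - j - 1]
        if best is None or cand > best[0]:
            best = (cand, j, True)
    for j in range(n + 1):        # case 2: i lines up with a gap after t[:j]
        cand = pre[j] + gap + suf[n - j]
        if cand > best[0]:
            best = (cand, j, False)
    _, j, paired = best
    # Fix the best position for i, then recurse on what lies before and after it.
    left_s, left_t = align(before, t[:j], match, mismatch, gap)
    rest = t[j + 1:] if paired else t[j:]
    right_s, right_t = align(after, rest, match, mismatch, gap)
    return left_s + ch + right_s, left_t + (t[j] if paired else "-") + right_t


a, b = align("ACGCTATGCTCATAG", "CGACGCTCATCG")
print(a)
print(b)
```

Ties between equally good positions are broken arbitrarily here, so the printed alignment is one optimal alignment rather than a unique answer.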

44 Complexity
• Let len(s) = m and len(t) = n
• Space: 2m
• Time:
  ◦ Each call to build similarity vector = m′n′
  ◦ First call + recursive call:
      s: ACGCTATGCTCATAG   (i marks the middle character of s)
      t: CGACGCTCATCG      (j marks the position in t where i lines up)
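The usual argument for the running time can be written out as follows. This is a sketch, assuming each call costs about m′n′ for its two score vectors, that s is split in half, and that t is split at some position j; constant factors and rounding are ignored.

```latex
\begin{align*}
T(m, n) &= mn + T\!\left(\tfrac{m}{2},\, j\right) + T\!\left(\tfrac{m}{2},\, n - j\right) \\
        &\le mn + \tfrac{mn}{2} + \tfrac{mn}{4} + \cdots \\
        &< 2mn
\end{align*}
```

At each level of the recursion the pieces of t still add up to n while the pieces of s halve in length, so the total work per level halves and the whole procedure costs at most about twice the single table fill.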

45 General Gap Penalties
• Suppose we are no longer using simple gap penalties:
  ◦ Origination = −2
  ◦ Length = −1
• Consider the last position of the alignment for ACGTA with ACG
• We can't determine the score for a gap column such as G over - or - over G unless we know the previous positions!

46 Scoring Blocks
• Now we must score a block at a time
  ◦ A block is a pair of characters, or a maximal group of gaps paired with characters
  ◦ To score a position, we need to either start a new block or add it to a previous block
[Figure: example alignments divided into blocks]

47 The Algorithm
• Three tables
  ◦ a: scores for alignments ending in char-char blocks
  ◦ b: scores for alignments ending in gaps in the top sequence (s)
  ◦ c: scores for alignments ending in gaps in the left sequence (t)
• Scores no longer depend on only three positions, because we can put any number of gaps into the last block

48 The Recurrences

49 The Optimal Alignment
• The optimal alignment is found by looking at the maximal value in the lower right of all three arrays
• The algorithm runs in O(n³) time
  ◦ Uses O(n²) space
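The slide's own recurrences are not reproduced in this transcript, so the following is only one common way to realize the three-table idea, offered as a sketch. It assumes a gap of k characters scores origination + k × length (−2 − k with the values above), that `i` indexes s and `j` indexes t, and that match/mismatch scores are +1/0; all names are illustrative. Because each b and c cell looks back over every possible length of the final gap block, the fill takes cubic time.

```python
def general_gap_score(s, t, match=1, mismatch=0, gap_score=lambda k: -2 - k):
    """Best global alignment score when a gap of length k scores gap_score(k)."""
    NEG = float("-inf")
    m, n = len(s), len(t)
    # a: last block is a char-char pair; b: last block is gaps in s;
    # c: last block is gaps in t. Index [i][j] covers s[:i] and t[:j].
    a = [[NEG] * (n + 1) for _ in range(m + 1)]
    b = [[NEG] * (n + 1) for _ in range(m + 1)]
    c = [[NEG] * (n + 1) for _ in range(m + 1)]
    a[0][0] = 0  # the empty alignment
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                pair = match if s[i - 1] == t[j - 1] else mismatch
                a[i][j] = pair + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            for k in range(1, j + 1):   # last block: k chars of t against gaps in s
                b[i][j] = max(b[i][j], max(a[i][j - k], c[i][j - k]) + gap_score(k))
            for k in range(1, i + 1):   # last block: k chars of s against gaps in t
                c[i][j] = max(c[i][j], max(a[i - k][j], b[i - k][j]) + gap_score(k))
    # The optimum is the largest lower-right value across all three tables.
    return max(a[m][n], b[m][n], c[m][n])


print(general_gap_score("ACGTA", "ACG"))  # -1: three matches, then one gap of length 2
```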

