1 Genomic Sequence Alignment

2 Overview
- Dynamic programming & the Needleman-Wunsch algorithm
- Local alignment: BLAST
- Fast global alignment
- Multiple sequence alignment
- Rearrangements in genomic sequences

3 Biology in One Slide
Twentieth Century…
…and today:
…ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

4 Complete DNA Sequences About 300 complete genomes have been sequenced

5 Evolution

6 Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
Sequence edits: mutation, deletion
Rearrangements: inversion, translocation, duplication

7 Evolutionary Rates
[Figure: mutations passed to the next generation; labels: OK, X, X, Still OK?]

8 Sequence conservation implies function
Alignment is the key to:
- Finding important regions
- Determining function
- Uncovering the evolutionary forces

9 Sequence Alignment
Input sequences:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
An alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: Given two strings $x = x_1 x_2 \ldots x_M$ and $y = y_1 y_2 \ldots y_N$, an alignment is an assignment of gaps to positions $0, \ldots, M$ in $x$ and $0, \ldots, N$ in $y$, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

10 What is a good alignment?
Alignment: the “best” way to match the letters of one sequence with those of the other. How do we define “best”?
Alignment: a hypothesis that the two sequences come from a common ancestor through sequence edits.
Parsimonious explanation: find the minimum number of edits that transform one sequence into the other.

11 Scoring Function
Sequence edits, starting from AGGCCTC:
- Mutations: AGGACTC
- Insertions: AGGGCCTC
- Deletions: AGG.CTC
Scoring function: match $+m$, mismatch $-s$, gap $-d$
Score: $F = (\#\,\text{matches}) \times m - (\#\,\text{mismatches}) \times s - (\#\,\text{gaps}) \times d$
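
The scoring function on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score`, the default values m = s = d = 1, and the choice to count every column containing a `-` as one gap are assumptions, not something the slides specify.

```python
def alignment_score(a, b, m=1, s=1, d=1):
    """Score two already-aligned rows of equal length.

    m, s and d are the match reward, mismatch penalty and gap penalty
    (all given as positive numbers, as on the slide).
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for p, q in zip(a, b):
        if p == "-" or q == "-":
            gaps += 1          # a letter lined up with a gap
        elif p == q:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d


print(alignment_score("AGGCCTC", "AGGACTC"))    # one mutation
print(alignment_score("AGG-CCTC", "AGGGCCTC"))  # one insertion
```

With the default parameters, the mutated pair has 6 matches and 1 mismatch, and the insertion pair has 7 matches and 1 gap.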

12 How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Too many possible alignments: $O(2^N)$

13 Dynamic Programming
Given two sequences $x = x_1 \ldots x_M$ and $y = y_1 \ldots y_N$, let $F(i, j)$ be the score of the best alignment of $x_1 \ldots x_i$ to $y_1 \ldots y_j$. Then $F(M, N)$ is the score of the best alignment.
Idea:
- Compute $F(i, j)$ for all $i$ and $j$
- Do this by using $F(i-1, j)$, $F(i, j-1)$, and $F(i-1, j-1)$

14 Dynamic Programming (cont’d)
Notice three possible cases:
1. $x_i$ aligns to $y_j$:
   $x_1 \ldots x_{i-1} \; x_i$
   $y_1 \ldots y_{j-1} \; y_j$
   $F(i, j) = F(i-1, j-1) + m$ if $x_i = y_j$, and $F(i, j) = F(i-1, j-1) - s$ if not
2. $x_i$ aligns to a gap:
   $x_1 \ldots x_{i-1} \; x_i$
   $y_1 \ldots y_j \; -$
   $F(i, j) = F(i-1, j) - d$
3. $y_j$ aligns to a gap:
   $x_1 \ldots x_i \; -$
   $y_1 \ldots y_{j-1} \; y_j$
   $F(i, j) = F(i, j-1) - d$

15 Dynamic Programming (cont’d)
How do we know which case is correct?
Inductive assumption: $F(i, j-1)$, $F(i-1, j)$, and $F(i-1, j-1)$ are optimal. Then
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$
where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ if not.
[Figure: cell $(i, j)$ and its three predecessors $(i-1, j-1)$, $(i-1, j)$, $(i, j-1)$]

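16 Example
$x$ = AGTA, $y$ = ATA; $m = 1$, $s = 1$, $d = 1$ (match $+1$, mismatch $-1$, gap $-1$)
$F(i, j)$:
            i = 0    1    2    3    4
                     A    G    T    A
j = 0           0   -1   -2   -3   -4
j = 1   A      -1    1    0   -1   -2
j = 2   T      -2    0    0    1    0
j = 3   A      -3   -1   -1    0    2
Optimal alignment, $F(4, 3) = 2$:
AGTA
A-TA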

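17 The Needleman-Wunsch Algorithm
1. Initialization.
   a. $F(0, 0) = 0$
   b. $F(0, j) = -j \times d$
   c. $F(i, 0) = -i \times d$
2. Main iteration. Filling in partial alignments:
   a. For each $i = 1 \ldots M$, for each $j = 1 \ldots N$:
      $$F(i, j) = \max \begin{cases} F(i-1, j) - d & \text{[case 1]} \\ F(i, j-1) - d & \text{[case 2]} \\ F(i-1, j-1) + s(x_i, y_j) & \text{[case 3]} \end{cases}$$
      $$\mathrm{Ptr}(i, j) = \begin{cases} \text{UP} & \text{if [case 1]} \\ \text{LEFT} & \text{if [case 2]} \\ \text{DIAG} & \text{if [case 3]} \end{cases}$$
3. Termination. $F(M, N)$ is the optimal score, and from $\mathrm{Ptr}(M, N)$ we can trace back the optimal alignment.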
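
The three steps above (initialization, main iteration, termination with traceback) can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function name, the defaults m = s = d = 1, and the tie-breaking order (DIAG, then UP, then LEFT) are assumptions made for illustration.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # 1. Initialization: leading gaps cost d each.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"

    # 2. Main iteration: fill in partial alignments.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            diag = F[i - 1][j - 1] + (m if x[i - 1] == y[j - 1] else -s)
            up = F[i - 1][j] - d      # x_i aligned to a gap
            left = F[i][j - 1] - d    # y_j aligned to a gap
            best = max(diag, up, left)
            F[i][j] = best
            # Tie-breaking order is an arbitrary choice for this sketch.
            ptr[i][j] = "DIAG" if best == diag else "UP" if best == up else "LEFT"

    # 3. Termination: trace back from (M, N) to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AGTA", "ATA"))
```

On the sequences AGTA and ATA with unit scores, this returns the score 2 and the alignment AGTA over A-TA.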

18 Performance Time: O(NM) Space: O(NM)

19 Alignment on a Large Scale
Given a gene that we care about, how can we compare it to all existing DNA?
Assume we use dynamic programming: with a gene of interest of ~$10^5$ letters and an entire genomic database of ~$10^{11}$ letters, the $O(NM)$ table has on the order of $10^{16}$ cells.

20 Index-based Local Alignment
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running time:
- Theoretical worst case: $O(MN)$
- Fast in practice
[Figure: query and DB]

21 Index-based Local Alignment: BLAST
Dictionary: all words of length $k$ (~11). Alignment is initiated between exact-matching words (more generally, between words of alignment score $\geq T$).
Alignment: ungapped extensions until the score falls below a statistical threshold.
Output: all local alignments with score > statistical threshold.
[Figure: the query is scanned against the DB]

22 Index-based Local Alignment: BLAST
Example: $k = 4$, $T = 4$
- The matching word GGTC initiates an alignment
- Extension to the left and right with no gaps until the alignment falls below 50%
- Output:
  GTAAGGTCC
  GTTAGGTCC
[Figure: dot matrix of the query against the DB; axis letters as extracted: A C G A A G T A A G G T C C A G T C C C T T C C T G G A T T G C G A]
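
The seed-and-extend idea on this slide can be sketched as below. This is a simplified illustration, not BLAST itself. The function names are hypothetical, and the stopping rule is a swapped-in "X-drop" rule (stop once the running score falls more than `x_drop` below the best seen) rather than the slide's 50% criterion. The toy database string is an assumption, chosen to contain the slide's output.

```python
from collections import defaultdict


def build_word_index(query, k):
    """Dictionary of every length-k word in the query -> start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def extend_one_way(query, db, q, d, step, match, mismatch, x_drop):
    """Ungapped extension in one direction; returns (score gain, length)."""
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= d < len(db):
        gain += match if query[q] == db[d] else -mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break  # fell too far below the best extension so far
        q += step
        d += step
    return best_gain, best_len


def seed_and_extend(query, db, k=4, match=1, mismatch=1, x_drop=2):
    """Return sorted (score, query_start, db_start, length) local hits."""
    index = build_word_index(query, k)
    hits = set()
    for di in range(len(db) - k + 1):
        for qi in index.get(db[di:di + k], ()):
            r_gain, r_len = extend_one_way(query, db, qi + k, di + k, 1,
                                           match, mismatch, x_drop)
            l_gain, l_len = extend_one_way(query, db, qi - 1, di - 1, -1,
                                           match, mismatch, x_drop)
            hits.add((k * match + l_gain + r_gain,
                      qi - l_len, di - l_len, l_len + k + r_len))
    return sorted(hits, reverse=True)


query = "ACGAAGTAAGGTCCAGT"
db = "AGCGTTAGGTCCTTCCC"   # toy database string (assumption)
for score, qs, ds, n in seed_and_extend(query, db):
    print(score, query[qs:qs + n], db[ds:ds + n])
```

Several seed words (AGGT, GGTC, GTCC) lie on the same diagonal here and extend to the same ungapped alignment, so collecting hits in a set keeps the output to one entry per distinct alignment.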

23 Gapped BLAST
Added features:
- Pairs of words can initiate alignment
- Nearby alignments are merged
- Extensions with gaps until the score drops $T$ below the best score so far
Output:
  GTAAGGTCCAGT
  GTTAGGTC-AGT
[Figure: dot matrix of the query against the DB; axis letters as extracted: A C G A A G T A A G G T C C A G T C T G A T C C T G G A T T G C G A]

24 Example
Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
1,726,556 sequences; 8,074,398,388 total letters
>gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence
  Length = 144487
  Score = 34.2 bits (17), Expect = 4.5
  Identities = 20/21 (95%)
  Strand = Plus / Plus
  Query: 4      tacaccccgattacaccccga 24
                ||||||| |||||||||||||
  Sbjct: 125138 tacacccagattacaccccga 125158
  Score = 34.2 bits (17), Expect = 4.5
  Identities = 20/21 (95%)
  Strand = Plus / Plus
  Query: 4      tacaccccgattacaccccga 24
                ||||||| |||||||||||||
  Sbjct: 125104 tacacccagattacaccccga 125124
>gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence
  Length = 139823
  Score = 34.2 bits (17), Expect = 4.5
  Identities = 20/21 (95%)
  Strand = Plus / Plus
  Query: 4    tacaccccgattacaccccga 24
              ||||||| |||||||||||||
  Sbjct: 3891 tacacccagattacaccccga 3911
http://www.ncbi.nlm.nih.gov

25 Efficient global alignment

26 Global alignment with the chaining approach
1. Find local alignments
2. Chain them into a rough global map
3. Align the regions in between
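
Step 2 (chaining) can be illustrated with a simple quadratic-time dynamic program over local alignments. This is a sketch under assumptions: each local alignment is represented by a hypothetical tuple `(x_start, x_end, y_start, y_end, score)` with half-open, non-empty intervals, the chain score is the plain sum of anchor scores, and no gap cost between anchors is charged. LAGAN's actual chaining procedure may differ.

```python
def chain_anchors(anchors):
    """Pick the highest-scoring chain of anchors that is increasing in both
    sequences. Each anchor is (x_start, x_end, y_start, y_end, score)."""
    order = sorted(anchors)                 # sorted by x_start
    best = [a[4] for a in order]            # best chain score ending at each anchor
    prev = [None] * len(order)
    for i, (xs, xe, ys, ye, sc) in enumerate(order):
        for j in range(i):
            pxs, pxe, pys, pye, _ = order[j]
            # Anchor j may precede anchor i only if it ends before i starts
            # in both sequences.
            if pxe <= xs and pye <= ys and best[j] + sc > best[i]:
                best[i] = best[j] + sc
                prev[i] = j
    if not order:
        return []
    i = max(range(len(order)), key=best.__getitem__)
    chain = []
    while i is not None:
        chain.append(order[i])
        i = prev[i]
    return chain[::-1]


anchors = [
    (0, 10, 0, 10, 10),
    (12, 20, 30, 38, 8),    # off-diagonal hit, inconsistent with the rest
    (15, 25, 12, 22, 10),
    (28, 33, 5, 10, 5),     # another inconsistent hit
    (30, 40, 25, 35, 10),
]
for anchor in chain_anchors(anchors):
    print(anchor)
```

The regions between consecutive anchors of the returned chain are what step 3 would then align with a restricted dynamic program.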

27 LAGAN: 1. Find Local Alignments
1. Find Local Alignments
2. Chain Local Alignments
3. Restricted DP
Mike Brudno, Chuong Do, et al.

28 LAGAN: 2. Chain Local Alignments

29 LAGAN: 3. Restricted DP

30 Multiple Alignment

32 Definition
Given $N$ sequences $x_1, x_2, \ldots, x_N$:
- Insert gaps (-) in each sequence $x_i$, such that
  - all sequences have the same length $L$
  - the score of the global map is maximum
A faint similarity between two sequences becomes significant if present in many.
Multiple alignments can help improve the pairwise alignments.

33 Scoring Function: Sum Of Pairs
Definition: induced pairwise alignment. A pairwise alignment induced by the multiple alignment.
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
Given sequences $x_1, \ldots, x_N$ aligned in a multiple alignment $m$, the sum-of-pairs score is
$$S(m) = \sum_{k<l} w_{kl}\, s(x_k, x_l)$$
where $s(x_k, x_l)$ scores the pairwise alignment of $x_k$ and $x_l$ induced by $m$.
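
A small Python sketch of induced pairwise alignments and the sum-of-pairs score follows. The function names, the unit scores m = s = d = 1, and the default weights $w_{kl} = 1$ are illustrative assumptions. Columns where both rows hold a gap are dropped, which is what "induced" means in the example above.

```python
from itertools import combinations


def induced_pair(a, b):
    """Pairwise alignment induced by two rows of a multiple alignment:
    columns in which both rows have a gap are removed."""
    kept = [(p, q) for p, q in zip(a, b) if not (p == "-" and q == "-")]
    return "".join(p for p, _ in kept), "".join(q for _, q in kept)


def pair_score(a, b, m=1, s=1, d=1):
    """Match/mismatch/gap score of an induced pairwise alignment."""
    score = 0
    for p, q in zip(*induced_pair(a, b)):
        if p == "-" or q == "-":
            score -= d
        elif p == q:
            score += m
        else:
            score -= s
    return score


def sum_of_pairs(rows, weights=None):
    """S(m) = sum over k < l of w_kl * s(x_k, x_l); weights default to 1."""
    total = 0
    for k, l in combinations(range(len(rows)), 2):
        w = 1 if weights is None else weights[(k, l)]
        total += w * pair_score(rows[k], rows[l])
    return total


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(induced_pair(rows[0], rows[1]))
print(sum_of_pairs(rows))
```

With unit scores the three induced pairs score 2, -1 and 4, so the sum-of-pairs score of the example alignment is 5.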

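34 A Profile Representation
- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G
Given a multiple alignment $M = m_1 \ldots m_n$:
- Replace each column $m_i$ with a profile entry $p_i$
  - Frequency of each letter in $\Sigma$
  - Number of gaps
- Can think of this as a “likelihood” of each letter in each position
Profile of the alignment above, by column:
      1    2    3    4    5    6    7    8    9   10   11   12   13   14
A     0    1    0    0    0    0    1    0    0   .8    0    0    0    0
C    .6    0    0    0    1    0    0   .4    1    0   .6   .2    0    0
G     0    0    1   .2    0    0    0    0    0   .2    0    0   .4    1
T    .2    0    0    0    0    1    0   .6    0    0    0    0   .2    0
-    .2    0    0   .8    0    0    0    0    0    0   .4   .8   .4    0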
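
The profile on this slide can be recomputed with a short Python sketch. The function name `profile` and the representation (one dictionary of frequencies per column, with `-` treated as a fifth symbol) are assumptions made for illustration.

```python
from collections import Counter


def profile(rows, alphabet="ACGT-"):
    """Per-column frequency of each symbol in a multiple alignment."""
    n = len(rows)
    columns = []
    for column in zip(*rows):          # iterate over alignment columns
        counts = Counter(column)
        columns.append({c: counts[c] / n for c in alphabet})
    return columns


rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
for i, entry in enumerate(profile(rows), start=1):
    print(i, {c: f for c, f in entry.items() if f > 0})
```

Each printed entry lists only the nonzero frequencies of one column, matching the nonblank cells of the slide's table.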

35 Multiple Sequence Alignment Algorithms

36 Multidimensional DP
Generalization of Needleman-Wunsch:
- $S(m) = \sum_i S(m_i)$ (sum of column scores)
- $F(i_1, i_2, \ldots, i_N)$: optimal alignment up to $(i_1, \ldots, i_N)$
- $F(i_1, i_2, \ldots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$

37 Multidimensional DP
Example: in 3D (three sequences), 7 neighbors/cell:
$$F(i, j, k) = \max \begin{cases} F(i-1, j-1, k-1) + S(x_i, x_j, x_k) \\ F(i-1, j-1, k) + S(x_i, x_j, -) \\ F(i-1, j, k-1) + S(x_i, -, x_k) \\ F(i-1, j, k) + S(x_i, -, -) \\ F(i, j-1, k-1) + S(-, x_j, x_k) \\ F(i, j-1, k) + S(-, x_j, -) \\ F(i, j, k-1) + S(-, -, x_k) \end{cases}$$
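
The three-sequence recurrence can be sketched directly in Python. The slide leaves the column score $S$ generic, so this sketch assumes a sum-of-pairs column score with unit match, mismatch and gap values. The function names are hypothetical and only the optimal score is returned (no traceback).

```python
from itertools import product


def column_score(column, m=1, s=1, d=1):
    """Assumed column score S: sum of pairwise scores within one column."""
    a, b, c = column
    total = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue               # gap against gap contributes nothing
        if p == "-" or q == "-":
            total -= d
        elif p == q:
            total += m
        else:
            total -= s
    return total


def align3_score(x, y, z):
    """Optimal three-way alignment score by filling the 3D table F(i, j, k)."""
    # The 7 neighbors of a cell: every nonzero 0/1 step vector.
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    F = {(0, 0, 0): 0}
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        candidates = []
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if min(pi, pj, pk) < 0:
                continue           # neighbor falls outside the table
            column = (x[i - 1] if di else "-",
                      y[j - 1] if dj else "-",
                      z[k - 1] if dk else "-")
            candidates.append(F[pi, pj, pk] + column_score(column))
        F[i, j, k] = max(candidates)
    return F[len(x), len(y), len(z)]


print(align3_score("ACG", "ACG", "ACG"))
print(align3_score("ACGT", "AGT", "ACT"))
```

The table has one cell per index triple, so memory and time grow with the product of the three lengths, which is why this approach is only practical for very short inputs.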

38 Multidimensional DP
Running time:
1. Size of matrix: $L^N$, where $L$ = length of each sequence and $N$ = number of sequences
2. Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$
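
To get a feel for this bound, here is a rough worked instance. The values $L = 100$, $N = 3$ and $N = 10$ are illustrative assumptions, not figures from the slides.

```latex
\begin{align*}
N = 3:\quad  & L^N (2^N - 1) = 100^{3} \times 7 = 7 \times 10^{6} \text{ cell updates} \\
N = 10:\quad & L^N (2^N - 1) = 100^{10} \times 1023 \approx 10^{23} \text{ cell updates}
\end{align*}
```

Even for short sequences, the exponent $N$ makes exact multidimensional DP impractical beyond a handful of sequences.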

40 Progressive Alignment
When the evolutionary tree is known:
- Align closest first, in the order of the tree
- In each step, align two sequences $x$, $y$, or profiles $p_x$, $p_y$, to generate a new alignment with associated profile $p_{\text{result}}$
Weighted version:
- Tree edges have weights, proportional to the divergence in that edge
- New profile is a weighted average of two old profiles
[Figure: tree over sequences x, y, z, w with profiles $p_{xy}$, $p_{zw}$, $p_{xyzw}$]

42 Progressive Alignment
When the evolutionary tree is unknown:
- Perform all pairwise alignments
- Define a distance matrix $D$, where $D(x, y)$ is a measure of evolutionary distance, based on pairwise alignment
- Construct a tree
- Align on the tree
[Figure: sequences x, y, z, w with an unknown tree]
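
The first three steps (pairwise comparison, distance matrix, tree) can be sketched as follows. The slide does not say which distance or tree-building method to use. As stand-ins, this sketch uses plain edit distance for $D(x, y)$ and average-linkage (UPGMA-style) clustering for the tree. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost edit distance, used here as a stand-in for D(x, y)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def guide_tree(seqs):
    """Build a tree (nested tuples of names) by repeatedly merging the two
    closest clusters under average pairwise distance."""
    D = {frozenset(p): edit_distance(seqs[p[0]], seqs[p[1]])
         for p in combinations(seqs, 2)}
    clusters = {name: (name,) for name in seqs}   # subtree -> member names

    def dist(t1, t2):
        m1, m2 = clusters[t1], clusters[t2]
        return sum(D[frozenset((a, b))] for a in m1 for b in m2) / (len(m1) * len(m2))

    while len(clusters) > 1:
        t1, t2 = min(combinations(clusters, 2), key=lambda p: dist(*p))
        members = clusters.pop(t1) + clusters.pop(t2)
        clusters[(t1, t2)] = members
    return next(iter(clusters))


seqs = {"x": "ACGTACGT", "y": "ACGTACGA", "z": "TTGTCCGT", "w": "TTGTCCAA"}
print(guide_tree(seqs))
```

The nested tuples give the order in which sequences or profiles would be aligned in the final step: innermost pairs first, then the merged groups.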

43 Some useful sites
Genome browsers
- Ensembl: www.ensembl.org
- UCSC: genome.ucsc.edu/cgi-bin/hgGateway
Genomic alignment
- LAGAN: lagan.stanford.edu
- MAVID: baboon.math.berkeley.edu/mavid
Protein multiple alignment
- MUSCLE: www.drive5.com
- ProbCons: probcons.stanford.edu

44 Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCACCA…
Sequence edits: mutation, deletion
Rearrangements: inversion, translocation, duplication

45 Local & Global Alignment
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
[Figure: the same pair of sequences shown twice, once aligned locally and once aligned globally]

46 Glocal Alignment Problem
Find the least-cost transformation of one sequence into another using shuffle operations:
- Sequence edits
- Inversions
- Translocations
- Duplications
- Combinations of the above
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

47 SLAGAN: 1. Find Local Alignments
1. Find Local Alignments
2. Build Rough Homology Map
3. Globally Align Consistent Parts
Mike Brudno, Sanket Malde, et al.

48 SLAGAN: 2. Build Homology Map

49 SLAGAN: 3. Global Alignment

50 SLAGAN Example: Chromosome 20
Human Chromosome 20 versus Mouse Chromosome 2
- 270 segments of conserved synteny
- 70 inversions

51 SLAGAN example: HOX cluster
- 10 paralogous genes
- Conserved order in Human/Mouse/Rat

53 Examples of shuffled regions
[Figures: Hum/Mus and Hum/Rat panels]

58 More DNA is coming…

