Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.

Similar presentations


Presentation on theme: "CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time."— Presentation transcript:

1 CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

2 CS262 Lecture 9, Win07, Batzoglou

3 Saving cells in DP 1.Find local alignments 2.Chain -O(NlogN) L.I.S. 3.Restricted DP

4 CS262 Lecture 9, Win07, Batzoglou Methods to CHAIN Local Alignments Sparse Dynamic Programming O(N log N)

5 CS262 Lecture 9, Win07, Batzoglou The Problem: Find a Chain of Local Alignments (x,y)  (x’,y’) requires x < x’ y < y’ Each local alignment has a weight FIND the chain with highest total weight

6 CS262 Lecture 9, Win07, Batzoglou Sparse Dynamic Programming 15324162042431118 4 20 24 3 11 15 11 4 18 20 Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead

7 CS262 Lecture 9, Win07, Batzoglou Sparse Dynamic Programming – L.I.S. Longest Increasing Subsequence Given a sequence over an ordered alphabet  x = x 1, …, x m Find a subsequence  s = s 1, …, s k  s 1 < s 2 < … < s k

8 CS262 Lecture 9, Win07, Batzoglou Sparse Dynamic Programming – L.I.S. Let input be w: w 1,…, w n INITIALIZATION: L:last LIS elt. array L[0] = -inf L[1] = w 1 L[2…n] = +inf B:array holding LIS elts; B[0] = 0 P:array of backpointers // L[j]: smallest j th element w i of j-long LIS seen so far ALGORITHM for i = 2 to n { Find j such that L[j – 1] < w[i] ≤ L[j] L[j]  w[i] B[j]  i P[i]  B[j – 1] } That’s it!!! Running time?

9 CS262 Lecture 9, Win07, Batzoglou Sparse LCS expressed as LIS Create a sequence w Every matching point (i, j), is inserted into w as follows: For each column j = 1…m, insert in w the points (i, j), in decreasing row i order The 11 example points are inserted in the order given a = (y, x), b = (y’, x’) can be chained iff  a is before b in w, and  y < y’ 15324162042431118 6 4 27 18 10 9 5 11 3 4 20 24 3 11 15 11 4 18 20 x y

10 CS262 Lecture 9, Win07, Batzoglou Sparse LCS expressed as LIS Create a sequence w w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10) Consider now w’s elements as ordered lexicographically, where (y, x) < (y’, x’) if y < y’ Claim: An increasing subsequence of w is a common subsequence of x and y 15324162042431118 6 4 27 18 10 9 5 11 3 4 20 24 3 11 15 11 4 18 20 x y

11 CS262 Lecture 9, Win07, Batzoglou Sparse Dynamic Programming for LIS Example: w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10) L = [L1] [L2] [L3] [L4] [L5] … 1.(4,2) 2.(3,3) 3.(3,3) (10,5) 4.(2,5) (10,5) 5.(2,5) (8,6) 6.(1,6) (8,6) 7.(1,6) (3,7) 8.(1,6) (3,7) (4,8) 9.(1,6) (3,7) (4,8) (7,9) 10.(1,6) (3,7) (4,8) (5,9) 11.(1,6) (3,7) (4,8) (5,9) (9,10) Longest common subsequence: s = 4, 24, 3, 11, 18 15324162042431118 6 4 27 18 10 9 5 11 3 4 20 24 3 11 15 11 4 18 20 x y

12 CS262 Lecture 9, Win07, Batzoglou Sparse DP for rectangle chaining 1,…, N: rectangles (h j, l j ): y-coordinates of rectangle j w(j):weight of rectangle j V(j): optimal score of chain ending in j L: list of triplets (l j, V(j), j)  L is sorted by l j : smallest (North) to largest (South) value  L is implemented as a balanced binary tree y h l

13 CS262 Lecture 9, Win07, Batzoglou Sparse DP for rectangle chaining Main idea: Sweep through x- coordinates To the right of b, anything chainable to a is chainable to b Therefore, if V(b) > V(a), rectangle a is “useless” for subsequent chaining In L, keep rectangles j sorted with increasing l j - coordinates  sorted with increasing V(j) score V(b) V(a)

14 CS262 Lecture 9, Win07, Batzoglou Sparse DP for rectangle chaining Go through rectangle x-coordinates, from lowest to highest: 1.When on the leftmost end of rectangle i: a.j: rectangle in L, with largest l j < h i b.V(i) = w(i) + V(j) 2.When on the rightmost end of i: a.k: rectangle in L, with largest l k  l i b.If V(i) > V(k): i.INSERT (l i, V(i), i) in L ii.REMOVE all (l j, V(j), j) with V(j)  V(i) & l j  l i i j k Is k ever removed?

15 CS262 Lecture 9, Win07, Batzoglou Example x y a: 5 c: 3 b: 6 d: 4 e: 2 2 5 6 9 10 11 12 14 15 16 1.When on the leftmost end of rectangle i: a.j: rectangle in L, with largest l j < h i b.V(i) = w(i) + V(j) 2.When on the rightmost end of i: a.k: rectangle in L, with largest l k  l i b.If V(i) > V(k): i.INSERT (l i, V(i), i) in L ii.REMOVE all (l j, V(j), j) with V(j)  V(i) & l j  l i abcde V 5 L lili V(i) i 5 5 a 8 11 8 c 12 9 11 b 15 12 d 13 16 13 3

16 CS262 Lecture 9, Win07, Batzoglou Time Analysis 1.Sorting the x-coords takes O(N log N) 2.Going through x-coords: N steps 3.Each of N steps requires O(log N) time: Searching L takes log N Inserting to L takes log N All deletions are consecutive, so log N per deletion Each element is deleted at most once: N log N for all deletions Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree

17 CS262 Lecture 9, Win07, Batzoglou Examples Human Genome Browser ABC

18 CS262 Lecture 9, Win07, Batzoglou Gene Recognition

19 CS262 Lecture 9, Win07, Batzoglou Gene structure exon1 exon2exon3 intron1intron2 transcription translation splicing exon = protein-coding intron = non-coding Codon: A triplet of nucleotides that is converted to one amino acid

20 CS262 Lecture 9, Win07, Batzoglou Where are the genes?

21 CS262 Lecture 9, Win07, Batzoglou

22 Needles in a Haystack

23 CS262 Lecture 9, Win07, Batzoglou Classes of Gene predictors  Ab initio Only look at the genomic DNA of target genome  De novo Target genome + aligned informant genome(s)  EST/cDNA-based & combined approaches Use aligned ESTs or cDNAs + any other kind of evidence Gene Finding EXON Human tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Macaque tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Mouse ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca Rat ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca Rabbit t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Dog t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta Cow t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cta Armadillo gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta Elephant gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Tenrec tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Opossum ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg Chicken ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg

24 CS262 Lecture 9, Win07, Batzoglou Signals for Gene Finding 1.Regular gene structure 2.Exon/intron lengths 3.Codon composition 4.Motifs at the boundaries of exons, introns, etc. Start codon, stop codon, splice sites 5.Patterns of conservation 6.Sequenced mRNAs 7.(PCR for verification)

25 CS262 Lecture 9, Win07, Batzoglou Next Exon: Frame 0 Next Exon: Frame 1

26 CS262 Lecture 9, Win07, Batzoglou Exon and Intron Lengths

27 CS262 Lecture 9, Win07, Batzoglou Nucleotide Composition Base composition in exons is characteristic due to the genetic code Amino AcidSLCDNA Codons IsoleucineIATT, ATC, ATA LeucineLCTT, CTC, CTA, CTG, TTA, TTG ValineVGTT, GTC, GTA, GTG PhenylalanineFTTT, TTC MethionineMATG CysteineCTGT, TGC AlanineAGCT, GCC, GCA, GCG GlycineGGGT, GGC, GGA, GGG ProlinePCCT, CCC, CCA, CCG ThreonineTACT, ACC, ACA, ACG SerineSTCT, TCC, TCA, TCG, AGT, AGC TyrosineYTAT, TAC TryptophanWTGG GlutamineQCAA, CAG AsparagineNAAT, AAC HistidineHCAT, CAC Glutamic acidEGAA, GAG Aspartic acidDGAT, GAC LysineKAAA, AAG ArginineRCGT, CGC, CGA, CGG, AGA, AGG Amino AcidSLCDNA Codons IsoleucineIATT, ATC, ATA LeucineLCTT, CTC, CTA, CTG, TTA, TTG ValineVGTT, GTC, GTA, GTG PhenylalanineFTTT, TTC MethionineMATG CysteineCTGT, TGC AlanineAGCT, GCC, GCA, GCG GlycineGGGT, GGC, GGA, GGG ProlinePCCT, CCC, CCA, CCG ThreonineTACT, ACC, ACA, ACG SerineSTCT, TCC, TCA, TCG, AGT, AGC TyrosineYTAT, TAC TryptophanWTGG GlutamineQCAA, CAG AsparagineNAAT, AAC HistidineHCAT, CAC Glutamic acidEGAA, GAG Aspartic acidDGAT, GAC LysineKAAA, AAG ArginineRCGT, CGC, CGA, CGG, AGA, AGG

28 CS262 Lecture 9, Win07, Batzoglou atg tga ggtgag caggtg cagatg cagttg caggcc ggtgag

29 CS262 Lecture 9, Win07, Batzoglou Splice Sites (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

30 CS262 Lecture 9, Win07, Batzoglou HMMs for Gene Recognition GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA exon intron intergene Intergene State First Exon State Intron State Intron State

31 CS262 Lecture 9, Win07, Batzoglou HMMs for Gene Recognition exon intron intergene Intergene State First Exon State Intron State Intron State GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA

32 CS262 Lecture 9, Win07, Batzoglou Duration HMMs for Gene Recognition TAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTGGGGGGGGGGGGGGGCCCCCCC Exon1Exon2Exon3 Duration d  i P INTRON (x i | x i-1 …x i-w ) P EXON_DUR (d)  i P EXON((i – j + 2)%3)) (x i | x i-1 …x i-w ) j+2 P 5’SS (x i-3 …x i+4 ) P STOP (x i-4 …x i+3 )

33 CS262 Lecture 9, Win07, Batzoglou Genscan Burge, 1997 First competitive HMM-based gene finder, huge accuracy jump Only gene finder at the time, to predict partial genes and genes in both strands Features –Duration HMM –Four different parameter sets Very low, low, med, high GC-content

34 CS262 Lecture 9, Win07, Batzoglou Using Comparative Information

35 CS262 Lecture 9, Win07, Batzoglou Using Comparative Information Hox cluster is an example where everything is conserved

36 CS262 Lecture 9, Win07, Batzoglou Patterns of Conservation 30% 1.3% 0.14% 58% 14% 10.2% GenesIntergenic Mutations Gaps Frameshifts Separation 2-fold 10-fold 75-fold 

37 CS262 Lecture 9, Win07, Batzoglou Comparison-based Gene Finders Rosetta, 2000 CEM, 2000 –First methods to apply comparative genomics (human-mouse) to improve gene prediction Twinscan, 2001 –First HMM for comparative gene prediction in two genomes SLAM, 2002 –Generalized pair-HMM for simultaneous alignment and gene prediction in two genomes NSCAN, 2006 –Best method to-date based on a phylo-HMM for multiple genome gene prediction

38 CS262 Lecture 9, Win07, Batzoglou Twinscan 1.Align the two sequences (eg. from human and mouse) 2.Mark each human base as gap ( - ), mismatch ( : ), match ( | ) New “alphabet”: 4 x 3 = 12 letters  = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| } 3.Run Viterbi using emissions e k (b) where b  { A-, A:, A|, …, T| } Emission distributions e k (b) estimated from real genes from human/mouse e I (x|) < e E (x|): matches favored in exons e I (x-) > e E (x-): gaps (and mismatches) favored in introns Example Human : ACGGCGACGUGCACGU Mouse : ACUGUGACGUGCACUU Alignment : ||:|:|||||||||:|

39 CS262 Lecture 9, Win07, Batzoglou SLAM – Generalized Pair HMM d e Exon GPHMM 1.Choose exon lengths (d,e). 2.Generate alignment of length d+e.

40 CS262 Lecture 9, Win07, Batzoglou NSCAN—Multiple Species Gene Prediction GENSCAN TWINSCAN N-SCAN TargetGGTGAGGTGACCAAGAACGTGTTGACAGTA Conservation|||:||:||:|||||:||||||||...... sequence TargetGGTGAGGTGACCAAGAACGTGTTGACAGTA Conservation|||:||:||:|||||:||||||||...... sequence TargetGGTGAGGTGACCAAGAACGTGTTGACAGTA Informant1GGTCAGC___CCAAGAACGTGTAG...... Informant2GATCAGC___CCAAGAACGTGTAG...... Informant3GGTGAGCTGACCAAGATCGTGTTGACACAA TargetGGTGAGGTGACCAAGAACGTGTTGACAGTA Informant1GGTCAGC___CCAAGAACGTGTAG...... Informant2GATCAGC___CCAAGAACGTGTAG...... Informant3GGTGAGCTGACCAAGATCGTGTTGACACAA... Target sequence: Informant sequences (vector): Joint prediction (use phylo-HMM):

41 CS262 Lecture 9, Win07, Batzoglou NSCAN—Multiple Species Gene Prediction X X C C Y Y Z Z H H M M R R X X C C Y Y Z Z H H M M R R

42 CS262 Lecture 9, Win07, Batzoglou Performance Comparison GENSCAN Generalized HMM Models human sequence TWINSCAN Generalized HMM Models human/mouse alignments N-SCAN Phylo-HMM Models multiple sequence evolution GENSCAN Generalized HMM Models human sequence TWINSCAN Generalized HMM Models human/mouse alignments N-SCAN Phylo-HMM Models multiple sequence evolution NSCAN human/mouse > Human/multiple informants

43 CS262 Lecture 9, Win07, Batzoglou 2-level architecture No Phylo-HMM that models alignments CONTRAST Human tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Macaque tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Mouse ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca Rat ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca Rabbit t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Dog t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta Cow t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cta Armadillo gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta Elephant gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Tenrec tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Opossum ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg Chicken ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg SVM X Y abab

44 CS262 Lecture 9, Win07, Batzoglou CONTRAST

45 CS262 Lecture 9, Win07, Batzoglou log P(y | x) ~ w T F(x, y) F(x, y) =  i f(y i-1, y i, i, x) f(y i-1, y i, i, x):  1{y i-1 = INTRON, y i = EXON_FRAME_1}  1{y i-1 = EXON_FRAME_1, x human,i-2,…, x human,i+3 = ACCGGT)  1{y i-1 = EXON_FRAME_1, x human,i-1,…, x dog,i+1 = ACC, AGC)  (1-c)1{a<SVM_DONOR(i)<b}  (optional)1{EXON_FRAME_1, EST_EVIDENCE} CONTRAST - Features

46 CS262 Lecture 9, Win07, Batzoglou Accuracy increases as we add informants Diminishing returns after ~5 informants CONTRAST – SVM accuracies SNSP

47 CS262 Lecture 9, Win07, Batzoglou CONTRAST - Decoding Viterbi Decoding: maximize P(y | x) Maximum Expected Boundary Accuracy Decoding: maximize  i,B 1{y i-1, y i is exon boundary B} Accuracy(y i-1, y i, B | x) Accuracy(y i-1, y i, B | x) = P(y i-1, y i is B | x) – (1 – P(y i-1, y i is B | x))

48 CS262 Lecture 9, Win07, Batzoglou CONTRAST - Training Maximum Conditional Likelihood Training: maximize L(w) = P w (y | x) Maximum Expected Boundary Accuracy Training: Expected BoundaryAccuracy (w) =  i Accuracy i Accuracy i =  B 1{(y i-1, y i is exon boundary B} P w (y i-1, y i is B | x) -  B’ ≠ B P(y i-1, y i is exon boundary B’ | x)

49 CS262 Lecture 9, Win07, Batzoglou Performance Comparison De Novo EST-assisted Human Macaque Mouse Rat Rabbit Dog Cow Armadillo Elephant Tenrec Opossum Chicken Human Macaque Mouse Rat Rabbit Dog Cow Armadillo Elephant Tenrec Opossum Chicken

50 CS262 Lecture 9, Win07, Batzoglou Performance Comparison


Download ppt "CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time."

Similar presentations


Ads by Google