CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments.

Presentation on theme: "CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments."— Presentation transcript:

1 CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

2 Saving cells in DP
1. Find local alignments
2. Chain: O(N log N) via L.I.S.
3. Restricted DP

3 Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)

4 The Problem: Find a Chain of Local Alignments
(x, y) → (x’, y’) requires x < x’ and y < y’
Each local alignment has a weight
FIND the chain with highest total weight

5 Sparse Dynamic Programming
Back to the LCS problem: given two sequences
  x = x_1, …, x_m
  y = y_1, …, y_n
find the longest common subsequence.
Quadratic solution with DP.
How about when “hits” x_i = y_j are sparse?
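For reference, the quadratic DP mentioned on this slide can be sketched as follows. This is a minimal textbook version, not code from the lecture; the function name `lcs` is an illustrative choice.

```python
def lcs(x, y):
    """Longest common subsequence of two sequences in O(m * n) time and space."""
    m, n = len(x), len(y)
    # dp[i][j] = length of the LCS of x[:i] and y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Trace back from the bottom-right cell to recover one optimal subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]
```

Every one of the m * n cells is filled even if only a few of them are hits, which is the cost that the sparse approach avoids.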

6 Sparse Dynamic Programming
[Figure: matrix of hits between the sequences 15 3 24 16 20 4 24 3 11 18 and 4 20 24 3 11 15 11 4 18 20]
Imagine a situation where the number of hits is much smaller than O(nm), maybe O(n) instead

7 Sparse Dynamic Programming – L.I.S.
Longest Increasing Subsequence
Given a sequence over an ordered alphabet
  x = x_1, …, x_m
find the longest subsequence
  s = s_1, …, s_k
  with s_1 < s_2 < … < s_k

8 Sparse Dynamic Programming – L.I.S.
Let input be w: w_1, …, w_n
INITIALIZATION:
  L: array of last LIS elements; L[0] = -inf, L[1] = w_1, L[2…n] = +inf
  B: array holding the indices of the LIS elements; B[0] = 0, B[1] = 1
  P: array of backpointers
  // L[j]: smallest last element w_i of any j-long increasing subsequence seen so far
ALGORITHM:
  for i = 2 to n {
    Find j such that L[j - 1] < w[i] ≤ L[j]
    L[j] ← w[i]
    B[j] ← i
    P[i] ← B[j - 1]
  }
That’s it!!! Running time?
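The pseudocode above can be sketched in Python as follows. This is an illustrative version under some assumptions: lists are 0-based, so the -inf and +inf sentinels of the slide are replaced by a list that grows as needed, and the names `tails`, `tails_idx` and `prev` (standing in for L, B and P) are not from the lecture. The "find j" step uses binary search from the standard `bisect` module.

```python
from bisect import bisect_left


def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w."""
    tails = []             # tails[j]: smallest last element of an increasing run of length j + 1
    tails_idx = []         # tails_idx[j]: index in w of tails[j] (B on the slide)
    prev = [-1] * len(w)   # prev[i]: index of the element before w[i] (P on the slide)

    for i, value in enumerate(w):
        # First j with tails[j] >= value, i.e. tails[j - 1] < value <= tails[j].
        j = bisect_left(tails, value)
        if j == len(tails):
            tails.append(value)
            tails_idx.append(i)
        else:
            tails[j] = value
            tails_idx[j] = i
        prev[i] = tails_idx[j - 1] if j > 0 else -1

    # Follow backpointers from the end of the longest run.
    result = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        result.append(w[i])
        i = prev[i]
    return result[::-1]
```

Each element costs one binary search, so the whole loop takes O(n log n) comparisons.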

9 Sparse LCS expressed as LIS
Create a sequence w. Every matching point (i, j) is inserted into w as follows:
  for each column j = 1…m, insert in w the points (i, j), in decreasing row i order
The 11 example points are inserted in the order given.
a = (y, x), b = (y’, x’) can be chained iff
  a is before b in w, and
  y < y’
[Figure: hit matrix with the 11 points numbered; x = 15 3 24 16 20 4 24 3 11 18 (columns), y = 4 20 24 3 11 15 11 4 18 20 (rows)]
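The construction of w can be sketched as below. The function name `build_hits` and the use of 1-based (row, column) pairs are illustrative assumptions; the insertion order (column by column, rows decreasing within a column) follows the slide.

```python
from collections import defaultdict


def build_hits(x, y):
    """List all matching points (i, j) with y[i] == x[j], 1-based, in the order used for w."""
    rows_of = defaultdict(list)   # symbol -> rows i of y where it occurs, increasing
    for i, symbol in enumerate(y, start=1):
        rows_of[symbol].append(i)

    w = []
    for j, symbol in enumerate(x, start=1):          # columns in increasing order
        for i in reversed(rows_of.get(symbol, [])):  # rows in decreasing order
            w.append((i, j))
    return w
```

Because two points in the same column appear in decreasing row order, a strictly increasing run of rows can never use both, so every increasing subsequence of w also has strictly increasing columns.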

10 Sparse LCS expressed as LIS
Create a sequence w
  w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w’s elements as ordered lexicographically, where (y, x) < (y’, x’) if y < y’
Claim: An increasing subsequence of w is a common subsequence of x and y
Why don’t we insert elements (i, j) in w in increasing row i order?
[Figure: hit matrix with the 11 points numbered; x = 15 3 24 16 20 4 24 3 11 18 (columns), y = 4 20 24 3 11 15 11 4 18 20 (rows)]

11 Sparse Dynamic Programming for LIS
Example:
  w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
  L = [L1] [L2] [L3] [L4] [L5] …
   1. (4,2)
   2. (3,3)
   3. (3,3) (10,5)
   4. (2,5) (10,5)
   5. (2,5) (8,6)
   6. (1,6) (8,6)
   7. (1,6) (3,7)
   8. (1,6) (3,7) (4,8)
   9. (1,6) (3,7) (4,8) (7,9)
  10. (1,6) (3,7) (4,8) (5,9)
  11. (1,6) (3,7) (4,8) (5,9) (9,10)
Longest common subsequence: s = 4, 24, 3, 11, 18
[Figure: hit matrix with the 11 points numbered; x = 15 3 24 16 20 4 24 3 11 18 (columns), y = 4 20 24 3 11 15 11 4 18 20 (rows)]
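The trace above can be reproduced with a short sketch that runs the LIS procedure on the row coordinate of each point. The function name `chain_hits` is an illustrative assumption; the input is the sequence w exactly as listed in the example.

```python
from bisect import bisect_left


def chain_hits(w):
    """Longest chain of (row, col) points of w with strictly increasing rows."""
    tails = []             # smallest last row of a chain of each length
    tails_idx = []         # index in w of the point that ends that chain
    prev = [-1] * len(w)   # backpointers

    for k, (row, _col) in enumerate(w):
        j = bisect_left(tails, row)
        if j == len(tails):
            tails.append(row)
            tails_idx.append(k)
        else:
            tails[j] = row
            tails_idx[j] = k
        prev[k] = tails_idx[j - 1] if j > 0 else -1

    chain = []
    k = tails_idx[-1] if tails_idx else -1
    while k != -1:
        chain.append(w[k])
        k = prev[k]
    return chain[::-1]


# The sequence w from the example, as (y, x) pairs.
W_EXAMPLE = [(4, 2), (3, 3), (10, 5), (2, 5), (8, 6), (1, 6),
             (3, 7), (4, 8), (7, 9), (5, 9), (9, 10)]
```

On `W_EXAMPLE` the returned chain is the last row of the trace, (1,6) (3,7) (4,8) (5,9) (9,10), whose rows 1, 3, 4, 5, 9 of y spell 4, 24, 3, 11, 18.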

12 Sparse DP for rectangle chaining
1, …, N: rectangles
(h_j, l_j): y-coordinates of rectangle j
w(j): weight of rectangle j
V(j): optimal score of chain ending in j
L: list of triplets (l_j, V(j), j)
  L is sorted by l_j: smallest (North) to largest (South) value
  L is implemented as a balanced binary tree
[Figure: a rectangle on the y-axis, with top edge h and bottom edge l]

13 Sparse DP for rectangle chaining
Main idea:
  Sweep through x-coordinates
  To the right of b, anything chainable to a is chainable to b
  Therefore, if V(b) > V(a), rectangle a is “useless” for subsequent chaining
  In L, keep rectangles j sorted with increasing l_j-coordinates ⇒ sorted with increasing V(j) score
[Figure: rectangles a and b with scores V(a) and V(b)]

14 Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of rectangle i:
   a. j: rectangle in L, with largest l_j < h_i
   b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
   a. k: rectangle in L, with largest l_k ≤ l_i
   b. If V(i) > V(k):
      i. INSERT (l_i, V(i), i) in L
      ii. REMOVE all (l_j, V(j), j) with V(j) ≤ V(i) & l_j ≥ l_i
[Figure: rectangles i, j, k]
Is k ever removed?
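The sweep can be sketched as follows. Several things here are assumptions made for illustration: rectangles are tuples `(x_left, x_right, h, l, weight)` with h < l, weights are positive, chaining requires strictly smaller x and y, and L is kept as three parallel sorted Python lists instead of a balanced binary tree. List insertion and deletion are O(N) each, so this sketch does not reach the O(N log N) bound; a balanced tree would be needed for that.

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """Return (best_score, chain), where chain lists rectangle indices in order."""
    events = []
    for idx, (x_left, x_right, _h, _l, _weight) in enumerate(rects):
        events.append((x_left, 0, idx))    # kind 0: left end, handled first at equal x
        events.append((x_right, 1, idx))   # kind 1: right end
    events.sort()

    ls, vs, ids = [], [], []   # L as parallel lists: l_j, V(j), j; both l and V increase
    V = [0] * len(rects)
    back = [-1] * len(rects)

    for _x, kind, i in events:
        _, _, h, l, weight = rects[i]
        if kind == 0:
            p = bisect_left(ls, h) - 1          # rectangle in L with largest l_j < h_i
            V[i] = weight + (vs[p] if p >= 0 else 0)
            back[i] = ids[p] if p >= 0 else -1
        else:
            p = bisect_right(ls, l) - 1         # rectangle in L with largest l_k <= l_i
            if p < 0 or V[i] > vs[p]:
                start = bisect_left(ls, l)      # first entry with l_j >= l_i
                end = start
                while end < len(ls) and vs[end] <= V[i]:
                    end += 1                    # dominated entries are consecutive
                del ls[start:end]
                del vs[start:end]
                del ids[start:end]
                ls.insert(start, l)
                vs.insert(start, V[i])
                ids.insert(start, i)

    if not rects:
        return 0, []
    best = max(range(len(rects)), key=lambda i: V[i])
    chain = []
    i = best
    while i != -1:
        chain.append(i)
        i = back[i]
    return V[best], chain[::-1]
```

For example, with the hypothetical input `[(1, 3, 1, 2, 5), (2, 5, 4, 6, 6), (4, 6, 3, 5, 3), (7, 9, 7, 8, 4)]`, the second rectangle is dropped from L when the third one ends, because the third has a smaller l and a larger score.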

15 Example
1. When on the leftmost end of rectangle i:
   a. j: rectangle in L, with largest l_j < h_i
   b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
   a. k: rectangle in L, with largest l_k ≤ l_i
   b. If V(i) > V(k):
      i. INSERT (l_i, V(i), i) in L
      ii. REMOVE all (l_j, V(j), j) with V(j) ≤ V(i) & l_j ≥ l_i
[Figure: worked example with rectangles of weight a: 5, c: 3, b: 6, d: 4, e: 2 on x and y axes (ticks 2 5 6 9 10 11 12 14 15 16), with the table of V values for a to e and the list L of triplets (l_i, V(i), i) filled in step by step]

16 Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
   Searching L takes log N
   Inserting to L takes log N
   All deletions are consecutive, so log N per deletion
   Each element is deleted at most once: < N log N for all deletions
Recall that INSERT, DELETE, SUCCESSOR take O(log N) time in a balanced binary search tree

17 Whole-genome Alignment Pipelines
Given N species and a phylogenetic tree:
1. Local alignment between all pairs (BLAST)
2. In the order of the tree:
   1. Synteny mapping: find long regions with lots of collinear alignments
   2. In each synteny region:
      1. Chaining
      2. Global alignment
Alternatively, all species are mapped to one reference (e.g., human). Then, in each unbroken synteny region between multiple species, perform chaining & progressive multiple alignment

18 Examples: Human Genome Browser
[Figure: panels A, B, C]

19 Whole-genome alignment: Rat-Mouse-Human

20 Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned

21 Gene Recognition
Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Serge Saxonov

22 The Central Dogma
DNA → (transcription) → RNA → (translation) → Protein
  DNA: CCTGAGCCAACTATTGATGAA
  RNA: CCUGAGCCAACUAUUGAUGAA
  Protein: PEPTIDE

23 Gene structure
[Figure: a gene with exon1, intron1, exon2, intron2, exon3, and the steps transcription, splicing, translation]
exon = protein-coding
intron = non-coding
Codon: a triplet of nucleotides that is converted to one amino acid

24 Finding Genes in Yeast
[Figure: 5’ to 3’ strand laid out as Intergenic, Coding, Intergenic; the Transcript covers the coding region from the start codon ATG to a stop codon TAG/TGA/TAA]
Mean coding length about 1500bp (500 codons)
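A minimal sketch of ORF scanning for an intron-free genome such as this one: look in each of the three forward reading frames for an ATG followed, in frame, by a stop codon. The function name `find_orfs`, the `min_codons` parameter and the restriction to the forward strand are illustrative assumptions, not part of the lecture.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end) intervals, 0-based and end-exclusive, of forward-strand ORFs.

    Each ORF runs from an ATG to the first in-frame stop codon (stop included)
    and has at least min_codons codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None                     # look for the next ATG in this frame
    return sorted(orfs)
```

With a mean coding length of about 500 codons, a long ORF is strong evidence of a gene, so a large `min_codons` threshold filters out most chance ORFs.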

25 Finding Genes in Yeast
[Figure: yeast ORF distribution]

26 Introns: The Bane of ORF Scanning
[Figure: 5’ to 3’ strand with Intergenic, Exon and Intron regions; labels: Start codon ATG, Splice sites, Stop codon TAG/TGA/TAA, Transcript]

27 Introns: The Bane of ORF Scanning
Drosophila: 3.4 introns per gene on average; mean intron length 475, mean exon length 397
Human: 8.8 introns per gene on average; mean intron length 4400, mean exon length 165
ORF scanning is defeated

28 Where are the genes?

30 Needles in a Haystack

31 Signals for Gene Finding
We need to use more information to help recognize genes
1. Regular gene structure
2. Exon/intron lengths
3. Nucleotide composition
4. Motifs at the boundaries of exons, introns, etc.: start codon, stop codon, splice sites
5. Patterns of conservation

32 Regular Gene Structure
Start, Stop of translation region:
  Protein-coding starts with ATG
  ends with TAA / TAG / TGA
Exon – Intron – Exon – Intron … – Exon
  g[GT/GC]gag – Intron – cAGt
Exon reading frame:
  NNN – NNN – NNN – NNN – NN…
  NN – NNN – NNN – NNN – NN…
  N – NNN – NNN – NNN – NNN…
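The three reading-frame cases can be made concrete with a little arithmetic on exon lengths. The sketch below is an illustration only: it assumes the list holds the coding lengths of consecutive exons, and it reports for each exon how many leading bases belong to a codon started in the previous exon (the "phase" convention used by GFF3, which may differ from the frame numbering on the slides).

```python
def exon_phases(coding_lengths):
    """For each exon, the number of leading bases that finish the previous exon's last codon."""
    phases = []
    consumed = 0   # coding bases seen before the current exon
    for length in coding_lengths:
        phases.append((3 - consumed % 3) % 3)
        consumed += length
    return phases
```

For coding lengths 4, 5, 3 the first exon ends one base into a codon, so the second exon starts with the 2 bases that complete it, and the third exon starts on a codon boundary.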

33 [Figure: exon boundary examples labeled “Next Exon: Frame 0” and “Next Exon: Frame 1”]

34 Exon/Intron Lengths

35 Nucleotide Composition
Base composition in exons is characteristic due to the genetic code

Amino Acid      SLC  DNA Codons
Isoleucine      I    ATT, ATC, ATA
Leucine         L    CTT, CTC, CTA, CTG, TTA, TTG
Valine          V    GTT, GTC, GTA, GTG
Phenylalanine   F    TTT, TTC
Methionine      M    ATG
Cysteine        C    TGT, TGC
Alanine         A    GCT, GCC, GCA, GCG
Glycine         G    GGT, GGC, GGA, GGG
Proline         P    CCT, CCC, CCA, CCG
Threonine       T    ACT, ACC, ACA, ACG
Serine          S    TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y    TAT, TAC
Tryptophan      W    TGG
Glutamine       Q    CAA, CAG
Asparagine      N    AAT, AAC
Histidine       H    CAT, CAC
Glutamic acid   E    GAA, GAG
Aspartic acid   D    GAT, GAC
Lysine          K    AAA, AAG
Arginine        R    CGT, CGC, CGA, CGG, AGA, AGG
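The table can be turned into a lookup and used to translate a coding sequence. This is a small sketch; the names `CODONS_BY_AMINO_ACID`, `CODON_TABLE` and `translate` are illustrative, and the three stop codons (TAA, TAG, TGA, marked `*`) are added to the 61 codons listed above.

```python
# Codons grouped by single-letter code, copied from the table; "*" marks stop codons.
CODONS_BY_AMINO_ACID = {
    "I": "ATT ATC ATA", "L": "CTT CTC CTA CTG TTA TTG", "V": "GTT GTC GTA GTG",
    "F": "TTT TTC", "M": "ATG", "C": "TGT TGC", "A": "GCT GCC GCA GCG",
    "G": "GGT GGC GGA GGG", "P": "CCT CCC CCA CCG", "T": "ACT ACC ACA ACG",
    "S": "TCT TCC TCA TCG AGT AGC", "Y": "TAT TAC", "W": "TGG", "Q": "CAA CAG",
    "N": "AAT AAC", "H": "CAT CAC", "E": "GAA GAG", "D": "GAT GAC",
    "K": "AAA AAG", "R": "CGT CGC CGA CGG AGA AGG", "*": "TAA TAG TGA",
}

# Invert to codon -> amino acid.
CODON_TABLE = {
    codon: amino_acid
    for amino_acid, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons.split()
}


def translate(dna):
    """Translate a DNA coding sequence codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

Translating the 21-base DNA string from the Central Dogma slide, `translate("CCTGAGCCAACTATTGATGAA")`, gives `PEPTIDE`. The uneven number of codons per amino acid (from 1 for M and W to 6 for L, S and R) is one reason coding regions have a characteristic base composition.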

36 Biological Signals
How does the cell recognize start/stop codons and splice sites?
  In part, from characteristic base composition
Donor site (start of intron) is recognized by a section of U1 snRNA
  U1 snRNA: GUCCAUUCA
  Donor site consensus: MAGGTRAGT
  M means “A or C”, R means “A or G”
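As a rough illustration, the consensus can be expanded into a regular expression and searched for in a DNA string, and the base pairing with U1 snRNA can be checked by complementing one instance of the consensus. The function names and the small `IUPAC` table (only the symbols needed here) are illustrative assumptions. Real donor sites often deviate from the consensus, so exact matching like this would miss many of them.

```python
import re

# Only the ambiguity codes used by the consensus on this slide.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "[AC]", "R": "[AG]"}


def find_consensus(dna, consensus="MAGGTRAGT"):
    """Return 0-based start positions of all (possibly overlapping) consensus matches."""
    pattern = "".join(IUPAC[symbol] for symbol in consensus)
    # A zero-width lookahead lets overlapping matches be reported.
    return [m.start() for m in re.finditer("(?=" + pattern + ")", dna.upper())]


def rna_complement(rna):
    """Base-by-base complement of an RNA string (no reversal)."""
    pairs = {"A": "U", "U": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in rna)
```

The complement of CAGGUAAGU, one instance of the consensus written as RNA, is GUCCAUUCA, the U1 snRNA section quoted on the slide.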

37 [Figure: example gene sequence annotated with the signals atg, tga, ggtgag, caggtg, cagatg, cagttg, caggcc, ggtgag]

38 Splice Sites: Donor site (5’ to 3’)
Nucleotide percentages by position (“…” marks omitted columns):
  Position: -8 … -2 0 1 2 … 17
  A: 26 … 60 9 0 0 54 … 21
  C: 26 … 15 5 0 1 2 … 27
  G: 25 … 12 78 100 0 41 … 27
  T: 23 … 13 8 0 99 3 … 25
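A frequency table like this one can be used directly as a weight matrix. The sketch below scores a 5-base window with a log-odds weight matrix built from the five central columns of the table. Several choices are assumptions made for illustration: the columns are simply indexed 0 to 4 because their position labels are not fully recoverable from the transcript, the background is uniform, and a pseudocount of 0.5 avoids taking the logarithm of zero.

```python
import math
from itertools import product

# Percentages from the five central columns of the table above.
DONOR_FREQ = {
    "A": [60, 9, 0, 0, 54],
    "C": [15, 5, 0, 1, 2],
    "G": [12, 78, 100, 0, 41],
    "T": [13, 8, 0, 99, 3],
}


def wmm_log_odds(site, freq=DONOR_FREQ, background=0.25, pseudocount=0.5):
    """Sum over positions of log2(P(base at position) / background)."""
    width = len(next(iter(freq.values())))
    if len(site) != width:
        raise ValueError("site must have exactly %d bases" % width)
    score = 0.0
    for pos, base in enumerate(site.upper()):
        p = (freq[base][pos] + pseudocount) / (100 + 4 * pseudocount)
        score += math.log2(p / background)
    return score


def best_scoring_site():
    """Highest-scoring 5-mer under the matrix, found by brute force."""
    return max(("".join(s) for s in product("ACGT", repeat=5)), key=wmm_log_odds)
```

Positions are scored independently, so the best window is just the most frequent base in each column (AGGTA here); that independence assumption is what the WAM and MDD models relax.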

39 Splice Sites (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

40 Splice Sites
WMM: weight matrix model = PSSM (Staden 1984)
WAM: weight array model = 1st order Markov (Zhang & Marr 1993)
MDD: maximal dependence decomposition (Burge & Karlin 1997)
  Decision-tree algorithm to take pairwise dependencies into account
  Starting with a training set of known splice sites:
    For each position i, calculate S_i = Σ_{j≠i} χ²(C_i, X_j)
    Choose i* such that S_i* is maximal and partition into two subsets, until
      no significant dependencies are left, or
      not enough sequences remain in the subset
  Train separate WMM models for each subset
[Figure: decision tree. All donor splice sites → G5 | not G5; G5 → G5 G-1 | G5 not G-1; G5 G-1 → G5 G-1 A2 | G5 G-1 not A2; G5 G-1 A2 → G5 G-1 A2 U6 | G5 G-1 A2 not U6]
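The statistic S_i can be sketched as below. This is an illustration under stated assumptions rather than the published procedure: C_i is taken to be the indicator "the site matches the consensus at position i", X_j is the base at position j, χ² is the plain Pearson statistic over the categories that actually occur, and the significance test and recursive partitioning are left out. The function names are illustrative.

```python
from collections import Counter


def chi_square(pairs):
    """Pearson chi-square statistic for a list of (row_label, column_label) observations."""
    n = len(pairs)
    row_totals = Counter(row for row, _ in pairs)
    col_totals = Counter(col for _, col in pairs)
    observed = Counter(pairs)
    stat = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            expected = row_total * col_total / n
            stat += (observed[(row, col)] - expected) ** 2 / expected
    return stat


def mdd_scores(sites, consensus):
    """S_i = sum over j != i of chi2(C_i, X_j) for aligned, equal-length sites."""
    length = len(consensus)
    scores = []
    for i in range(length):
        matches = [site[i] == consensus[i] for site in sites]   # C_i for each site
        s_i = sum(
            chi_square([(m, site[j]) for m, site in zip(matches, sites)])
            for j in range(length)
            if j != i
        )
        scores.append(s_i)
    return scores
```

The position to split on would then be `max(range(len(scores)), key=scores.__getitem__)`, and the training set would be partitioned by whether each site matches the consensus there.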

