1
“Multiple indexes and multiple alignments”
Presenting: Siddharth Jonathan
Scribing: Susan Tang
DFLW: Neda Nategh
Upcoming:
10/24: “Evolution of Multidomain Proteins” (Wissam Kazan), “Human Migrations” (Anjalee Sujanani)
10/26: “Comparison of Networks Across Species” (Chuan Sheng Foo), “Repetitive DNA Detection and Classification” (Vijay Krishnan)
10/19
2
CS374 Presentation - Searching Biological Sequence Databases
CS374 Algorithms in Biology
Searching Biological Sequence Databases
Siddharth Jonathan
3
Outline
– Background
– Problem
– Typhon Overview
– Typhon Components
– Results
4
Background
– Sequence Alignment
– Multiple Alignment Databases
– Probabilistic Profile
– Phylogenetic Tree
5
Sequence Alignment
Identifying regions of similarity in genomes, proteins, etc.
Types
– Global
– Local: seeded or non-seeded
Why is it important?
– Comparative analysis of genomes
– Producing phylogenetic trees
– Understanding newly sequenced genomes
6
Seeds – A Review
A seed P is an ordered list of w positions, i.e. P = {x_1, x_2, …, x_w}
w = weight of P = |P|
s = span of P = x_w – x_1 + 1
Ex: P = {0, 1, 4, 5}
w = 4
s = 5 – 0 + 1 = 6
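The weight and span definitions can be checked with a few lines of Python. This is only an illustrative sketch: the function names `seed_weight` and `seed_span` are invented here and do not come from the talk.

```python
def seed_weight(seed):
    """Weight w = |P|: the number of positions the seed samples."""
    return len(seed)


def seed_span(seed):
    """Span s = x_w - x_1 + 1: the window length the seed covers."""
    positions = sorted(seed)
    return positions[-1] - positions[0] + 1


# The example from the slide: P = {0, 1, 4, 5}
P = [0, 1, 4, 5]
print(seed_weight(P), seed_span(P))
```

A contiguous seed such as {0, 1, 2, 3} has equal weight and span; a spaced seed has a span larger than its weight.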
7
Indexing in Seeded Local Alignment Algorithms
Gene sequence S: …GATTACCAGATTACCAGATTA…
Seed A = {0, 1, 2, 3}
Index entries: GATT → (S, 0), ATTA → (S, 1)
The same idea holds for non-contiguous seeds as well!
The average number of seeds indexed per position is called the Budget
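A minimal sketch of this indexing step, assuming the index is a plain dictionary from the characters sampled by the seed to the list of positions where they occur. The name `build_index` and the dictionary layout are assumptions made for illustration; the real tools (BLAST, BLAT and so on) use their own data structures.

```python
from collections import defaultdict


def build_index(sequence, seed):
    """Index every position of `sequence` under the characters sampled by `seed`.

    `seed` is a list of offsets, e.g. [0, 1, 2, 3] (contiguous)
    or [0, 1, 4, 5] (spaced).
    """
    first = min(seed)
    span = max(seed) - first + 1
    index = defaultdict(list)
    for pos in range(len(sequence) - span + 1):
        # Read only the positions selected by the seed.
        key = "".join(sequence[pos + off - first] for off in seed)
        index[key].append(pos)
    return index


S = "GATTACCAGATTACCAGATTA"
index = build_index(S, [0, 1, 2, 3])
print(index["GATT"])  # positions in S where GATT starts
print(index["ATTA"])  # positions in S where ATTA starts
```

With a single seed indexed at every position the budget is 1; indexing several seeds per position raises the budget and the index size with it.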
8
Seeded Local Alignment Algorithms
– BLAST
– BLAT
– BLASTZ
– Exonerate
Usage of multiple seeds, spaced seeds
What do they have in common? Indexing!
9
Multiple alignment
[Figure: alignment of Species 1 and Species 2]
10
Phylogenetic Tree
11
Probabilistic Profile
Each cell corresponds to one position in the alignment…
We’ll learn what information it carries very shortly!
12
Regions
13
The Problem
Say we have a database of multiple alignments
So what’s the challenge?
Find local alignments for the query
Candidate seeds
14
The Problem Statement
Budget
Can we do better?
Make use of the information implicit in the multiple alignment to select which seeds to index at a given position
15
The Problem Statement - Typhon
Given
– Probabilistic Profile
– Candidate Seeds
– Budget
Indexing scheme that indexes only a subset of the candidate seeds at each position
16
Overall Architecture of Typhon
17
Step 1: Probabilistic Profile Construction
A 6-tuple for each position in the multiple alignment:
– P_present: existence probability
– P_A, P_C, P_T, P_G: conditional probability that the homologous position contains A, C, T, G, given that a homologous position exists. The nucleotide with the highest such value is called the consensus character
– P_id: probability that the corresponding query position has the consensus character
18
Calculation of Probabilistic Profile
[Figure: phylogenetic tree with leaves Human, Chimp, Rat, Pig, each labeled 1]
Alignment columns (Human, Chimp, Rat, Pig): A A A C / T _ T T / C C C C
P_present = 100%, P_A = 75%, P_C = 25%, P_G = 0%, P_T = 0%
Propagation of values up the tree to the root is a tricky problem!
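The percentages on this slide match a plain frequency count over the column A, A, A, C. The sketch below reproduces only that count and deliberately ignores the tree, so it is a simplification and not Typhon's actual procedure. The function name `naive_profile_column` and the treatment of `_` and `-` as gap characters are assumptions.

```python
from collections import Counter

GAPS = {"_", "-"}  # assumed gap characters


def naive_profile_column(column):
    """Tree-free estimate of (P_present, {P_A, P_C, P_G, P_T}, consensus)."""
    present = [c for c in column if c not in GAPS]
    p_present = len(present) / len(column)
    counts = Counter(present)
    # Conditional on the position existing, so divide by non-gap count.
    probs = {n: (counts[n] / len(present) if present else 0.0) for n in "ACGT"}
    consensus = max(probs, key=probs.get)
    return p_present, probs, consensus


print(naive_profile_column(["A", "A", "A", "C"]))
print(naive_profile_column(["T", "_", "T", "T"]))
```

The second call shows how a gap lowers P_present while the nucleotide probabilities stay conditional on the position being present.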
19
Calculating the Probabilistic Profile
P_present and P_N are calculated independently
P_present
– Weighted average of the children’s P_present values
– Weights proportional to the inverse of the branch length
P_N is calculated through Felsenstein’s algorithm with a Kimura matrix
P_id = max(P_N) (this is calculated at the root)
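The P_present rule can be sketched for a single internal node as follows. The function name `parent_p_present` and the `(p_present, branch_length)` pair layout are hypothetical; Felsenstein's algorithm for P_N is not shown.

```python
def parent_p_present(children):
    """Combine children's P_present values at one internal node.

    `children` is a list of (p_present, branch_length) pairs.
    Each child is weighted by 1 / branch_length, so closer children count more.
    """
    weights = [1.0 / length for _, length in children]
    total = sum(weights)
    return sum(w * p for w, (p, _) in zip(weights, children)) / total


# Two children at equal distance: a plain average.
print(parent_p_present([(1.0, 1.0), (0.0, 1.0)]))
# The child with a gap sits on a longer branch, so it counts less.
print(parent_p_present([(1.0, 1.0), (0.0, 2.0)]))
```

Applying this bottom-up from the leaves gives the value at the root.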
20
Overall Architecture of Typhon
21
Region Decomposition
ATTGGAACCCAGGCCA----AATT-GCGCC-----AA-TT------G----C-----ATGG-G-----ATGCCCAAAAAAT
ATTGGAACTCAGGCCA----AATT--CGCC-----AA-T-------G----C-----AT--G------ATGCCCATAAAAT
ATTGGAACCCAGGCCA----AATT-CG--C-----A-TT-------G----T-----A-GGG------ATGCCCAAAAAAT
ATTGGAACCCAGGCCA----A-TTGC-G-C-----AAT-T------G-----C----ATGGGG-----ATGCCCATAAAAT
Region classes, left to right: 1 2 3 2 1
Each region is characterized by a P_present and a P_id
How do we come up with these regions?
22
Hidden Markov Models (HMM)
Given an observation sequence, predict the sequence of hidden states
23
Region Decomposition – Simple Method
Come up with a set of region classes (states)
Construct an HMM
Looking at the observation sequence, try to determine the most likely parse
– Viterbi algorithm
Problem: the classes need to be determined at the beginning
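A minimal sketch of Viterbi decoding for the simple method. The two states ("conserved", "gappy"), the discretized observations ("hi", "lo") and every probability below are invented for illustration; the talk does not give Typhon's actual states or parameters.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for `obs` (log-space Viterbi)."""
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            # Best predecessor state for landing in s at time t.
            prev = max(states, key=lambda r: score[t - 1][r] + math.log(trans_p[r][s]))
            score[t][s] = (score[t - 1][prev] + math.log(trans_p[prev][s])
                           + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical model: observations are per-column P_present, binned to hi/lo.
STATES = ["conserved", "gappy"]
START = {"conserved": 0.5, "gappy": 0.5}
TRANS = {"conserved": {"conserved": 0.9, "gappy": 0.1},
         "gappy": {"conserved": 0.1, "gappy": 0.9}}
EMIT = {"conserved": {"hi": 0.9, "lo": 0.1},
        "gappy": {"hi": 0.1, "lo": 0.9}}

print(viterbi(["hi", "hi", "hi", "lo", "lo", "lo"], STATES, START, TRANS, EMIT))
```

Consecutive columns assigned to the same state form one region, and the state is its region class.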
24
Alternative
Split the profile into 2 classes at a time
Use 2 stage HMM
Stop when the bound on the number of region classes is reached
25
Region Decomposition with HMM
26
Overall Architecture of Typhon
27
Step 3: Seed Indexing
What are we trying to do?
[Figure: regions of classes 1, 2, 1, 3; candidate seeds A, B, C, D, E; a subset of the candidate seeds assigned to each region]
28
The Goal
Maximize the expected number of regions matched to a homologue
29
Seed Assignment
2 approaches:
– General Method
– Greedy Approximation
30
General Method - Terminology
Object[i][j]
– i: region class
– j: size of the candidate set
31
Calculation of the number of matching regions (done for each cell in the previous table)
Number of matching regions = P_present × P_hit × |C|
– P_present: probability that a region matches a homologue
– P_hit: conditional probability that the seeds match the region and its homologue, given that it exists
– |C|: number of regions
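Written as a formula, the per-cell quantity is a product of three factors. The symbols follow the slide; the indices i (region class) and j (number of candidate seeds) are notation assumed here, and the numbers in the worked line are hypothetical, not taken from the talk.

```latex
\[
  V_{i,j} \;=\; P^{(i)}_{\text{present}} \cdot P^{(i,j)}_{\text{hit}} \cdot |C_i|
\]
% Hypothetical numbers: 0.8 \cdot 0.5 \cdot 100 = 40 expected matched regions
```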
32
General Method - Explained
Rows are region classes 1 to 4; columns are the number of candidate seeds, 1 to 5.
Every row has the same form:
Region Class i: P_present * P^1_hit * |C| | P_present * P^2_hit * |C| | P_present * P^3_hit * |C| | P_present * P^4_hit * |C| | P_present * P^5_hit * |C|
33
Some Terminology
Weight
– Total length of all regions in a region class * number of seeds indexed at each position
– Sort of like the Budget for a region
Value
– Expected number of regions matched (previous calculation)
34
Solving the Seed Assignment Problem
Rows are region classes 1 to 4; columns are the number of candidate seeds, 1 to 5.
The cell for region class i and j candidate seeds holds P_present * P^j_hit * |C|.
35
Solving the Seed Assignment Problem
Each cell is (Weight, Value); columns are the number of candidate seeds.
Number of Candidate Seeds: 1 | 2 | 3 | 4 | 5
Region Class 1: 10,5 | 20,30 | 30,31 | 40,34 | 50,40
Region Class 2: 15,8 | 30,20 | 45,22 | 60,24 | 75,30
Region Class 3: 12,7 | 24,10 | 36,32 | 48,36 | 60,40
Region Class 4: 9,9 | 18,10 | 27,25 | 36,27 | 45,30
36
Solving the Seed Assignment Problem
Budget = 112
Each cell is (Weight, Value); columns are the number of candidate seeds.
Number of Candidate Seeds: 1 | 2 | 3 | 4 | 5
Region Class 1: 10,5 | 20,30 | 30,31 | 40,34 | 50,40
Region Class 2: 15,8 | 30,20 | 45,22 | 60,24 | 75,30
Region Class 3: 12,7 | 24,10 | 36,32 | 48,36 | 60,40
Region Class 4: 9,9 | 18,10 | 27,25 | 36,27 | 45,30
37
Looks Familiar?
Closely related to the Knapsack Problem, a well-studied problem in Computer Science
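One way to make the knapsack connection concrete is a multiple-choice knapsack dynamic program: pick at most one cell per region class so that total weight stays within the budget and total value is maximal. This is a textbook formulation offered as a sketch, not necessarily the general method used in Typhon. The values are read from the slide table, and the weights are rebuilt from the slide's definition (weight for one seed times the number of seeds). The function name `solve_seed_assignment` is hypothetical.

```python
UNIT_WEIGHTS = [10, 15, 12, 9]  # weight of one seed, per region class
VALUES = [
    [5, 30, 31, 34, 40],
    [8, 20, 22, 24, 30],
    [7, 10, 32, 36, 40],
    [9, 10, 25, 27, 30],
]
# TABLE[i][j-1] = (weight, value) of indexing j seeds in region class i+1
TABLE = [[(u * (j + 1), v) for j, v in enumerate(vals)]
         for u, vals in zip(UNIT_WEIGHTS, VALUES)]


def solve_seed_assignment(table, budget):
    """Return (best_value, picks); picks[i] = seeds for class i (0 = none)."""
    # best[b] = (value, picks) over the classes processed so far, weight <= b
    best = [(0, [])] * (budget + 1)
    for row in table:
        new_best = []
        for b in range(budget + 1):
            value, picks = best[b][0], best[b][1] + [0]
            for j, (w, v) in enumerate(row, start=1):
                if w <= b and best[b - w][0] + v > value:
                    value, picks = best[b - w][0] + v, best[b - w][1] + [j]
            new_best.append((value, picks))
        best = new_best
    return best[budget]


value, picks = solve_seed_assignment(TABLE, 112)
print("seeds per region class:", picks, "expected regions matched:", value)
```

The table has one entry per budget unit per region class, which is the time and space cost the approximate method tries to avoid.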
38
Approximate Solution
– Faster
– Space efficient
New terminology:
– Density of an object = Value/Weight
39
Approximate Solution – General Intuition
Select objects in order of decreasing density
Disallow more than one object per row
40
Approximate Method in Action
Candidate set:
– Object[1,1]: Density = V/W = 3
– Object[2,1]: Density = V/W = 2
– Object[3,1]: Density = V/W = 5
– Object[4,1]: Density = V/W = 4
– Object[3,2]: Density = V/W = 6
What are the new values of Weight, Value and Density?
– Value = additional number of regions matched
– Weight = amount of budget used by this one seed
And keep track of the Budget!
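The greedy idea can be sketched as follows: keep one candidate per region class (the next seed to add), score it by marginal value over marginal weight, and repeatedly take the densest candidate that still fits. The table is an assumed (weight, value) layout with weights rebuilt as weight-per-seed times seed count, and details such as tie-breaking and the stopping rule are guesses for illustration, not Typhon's documented behaviour.

```python
UNIT_WEIGHTS = [10, 15, 12, 9]  # weight of one seed, per region class
VALUES = [
    [5, 30, 31, 34, 40],
    [8, 20, 22, 24, 30],
    [7, 10, 32, 36, 40],
    [9, 10, 25, 27, 30],
]
TABLE = [[(u * (j + 1), v) for j, v in enumerate(vals)]
         for u, vals in zip(UNIT_WEIGHTS, VALUES)]


def greedy_seed_assignment(table, budget):
    """Return (counts, total_value, weight_used) from density-ordered greedy."""
    counts = [0] * len(table)  # seeds chosen so far, per region class
    remaining, total_value = budget, 0
    while True:
        best = None
        for i, row in enumerate(table):
            j = counts[i]
            if j >= len(row):
                continue  # this class already uses every candidate seed
            prev_w, prev_v = row[j - 1] if j > 0 else (0, 0)
            dw, dv = row[j][0] - prev_w, row[j][1] - prev_v
            if dw <= 0 or dw > remaining:
                continue  # the extra seed does not fit in the budget
            density = dv / dw  # additional regions matched per budget unit
            if best is None or density > best[0]:
                best = (density, i, dw, dv)
        if best is None:
            break
        _, i, dw, dv = best
        counts[i] += 1  # replaces Object[i, j] by Object[i, j + 1]
        remaining -= dw
        total_value += dv
    return counts, total_value, budget - remaining


print(greedy_seed_assignment(TABLE, 112))
```

Because each class contributes only its next object to the candidate set, the rule of at most one object per row holds by construction.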
41
Results
Considerations
– Sensitivity
– Speed
– Space
42
Sensitivity Results
Experimental setup
Detection of Hypothetical Homologous Alignments (HHA)
Typhon vs. Standard
43
Sensitivity Comparison
44
Effect of Multiple Alignment on Sensitivity
45
Running Time Comparison
Time spent building the index
– Typhon takes longer
Time spent scanning the index
– Typhon is 3-4 times slower at run time, which is reasonable
46
Scanning time
47
Conclusion
– Information implicit in multiple alignments helps search sensitivity
– Variable allocation of seeds by region class helps (Typhon)
– Space and time complexities of Typhon are comparable to STANDARD
– Most effective for queries far from each species in the alignment
48
Questions?
49
Acknowledgements
Serafim Batzoglou, George Asimenos, Jason Flannick