Index-based search of single sequences Omkar Mate CS 374 Stanford University.


Motivation: A newly discovered gene … Occurrence in other species? Mutation …? (may hold clues to the evolution of human brain capacity)

Sequence Alignment Existing Genome Database: ………………………… gattaccagattaccagattaccagattacca caggattacaggattacaggattacaggatta cattaggacattaggacattaggacattagga aaccattagaccattagaccattagaccatta ………………………… New Query: gacatta Easy??? Think again…

State of Biological Databases ~300 sequenced genomes

Alignment Problem The entire genomic database: $10^{11}$ Our new gene: $10^{4}$ Assume we try Smith-Waterman [running time = O(MN)]: time and space complexity $O(10^{15})$. Huge number! But we can do better …

Indexing-based Local Alignment BLAST - Basic Local Alignment Search Tool. Main idea: (1) Construct a dictionary of all the words in the query. (2) Initiate a local alignment for each word match (hit) between query and DB. Running time: O(MN) in the worst case; however, orders of magnitude faster than Smith-Waterman in practice!

BLAST Step 1: Construct dictionary of query words (query indexed by all words of size k = 4)
Query: AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC…
Index of query words: AAAA, AAAC, …, AACG, …, ACGT, …, CGTT, …, TTTT
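
The dictionary in Step 1 can be sketched as a map from each k-letter word to the positions where it starts in the query. This is only an illustrative sketch: the function name `build_query_index` is hypothetical, and real BLAST implementations use more compact lookup tables.

```python
from collections import defaultdict


def build_query_index(query: str, k: int = 4) -> dict:
    """Map every k-letter word of the query to its start positions.

    Hypothetical helper for illustration; not BLAST's actual data structure.
    """
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return dict(index)


index = build_query_index("AACGTTGATCAGCTAGACTGACTAGCATCAGC", k=4)
print(index["AACG"])  # start positions of the word AACG in the query
```

Only words that actually occur in the query get an entry, so the index holds at most one entry per query position.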

BLAST Step 2 – Generate all the relatives of a word (Relative: a word with alignment score greater than a threshold, T)
Query word: ATGC, T: 50
Candidate  Score
ATGC       60  (relative)
ATGG       55  (relative)
ACGC       52  (relative)
ATGA       48
ACCC       37
…          …
Update the index with the relatives: … ACGC … ATGC, ATGG … (Query: AATGCCGATAGCATCG …)
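
A minimal sketch of relative generation, assuming a uniform scoring scheme. The scores `MATCH = 15` and `MISMATCH = 7` are hypothetical values chosen so that ATGC scores 60 against itself. The slide's scores evidently come from a substitution matrix that is not given, so the exact set of relatives here differs from the table above.

```python
from itertools import product

MATCH, MISMATCH = 15, 7  # hypothetical per-position scores, not from the slides


def word_score(a: str, b: str) -> int:
    """Ungapped score of two equal-length words under the assumed scheme."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def relatives(word: str, threshold: int, alphabet: str = "ACGT") -> list:
    """All words of the same length scoring strictly above the threshold."""
    candidates = ("".join(p) for p in product(alphabet, repeat=len(word)))
    return sorted(c for c in candidates if word_score(word, c) > threshold)


rel = relatives("ATGC", threshold=50)
print(len(rel), rel[:5])
```

Enumerating all 4^k candidates is fine for small k; for longer words a branch-and-bound search over positions would avoid scoring hopeless candidates.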

BLAST Step 3: Searching Search through the database linearly, one word at a time. Initiate an alignment with all occurrences of that word in the query.
Database: ATCGCTATCGCTACGACTACGACTACGATCAGCATCTC …
Index: ATCG, TCGC, …
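
The linear scan can be sketched as follows: slide a k-letter window over the database and look each word up in the query index. The function name `find_hits` and the `(query_pos, db_pos)` output format are assumptions for illustration only.

```python
from collections import defaultdict


def find_hits(query: str, database: str, k: int = 4) -> list:
    """Return (query_pos, db_pos) pairs where a k-letter word matches exactly."""
    # Index the query words (Step 1).
    index = defaultdict(list)
    for q in range(len(query) - k + 1):
        index[query[q:q + k]].append(q)

    # Scan the database one word at a time and report every index hit.
    hits = []
    for d in range(len(database) - k + 1):
        for q in index.get(database[d:d + k], []):
            hits.append((q, d))
    return hits


print(find_hits("ATCGTCGC", "ATCGCTATCGCTACGACTACG", k=4))
```

Each reported pair is a candidate starting point for the extension step.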

BLAST Step 4 – Alignment Extension Once we find an alignment, extend to left and right with no gaps until alignment score falls below a certain threshold. ATGCCGATACGATCAGCTACGATCAG…
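
A sketch of ungapped extension in each direction from a word hit. The stopping rule used here is an assumption: extension stops once the running score has dropped `drop` below the best score seen so far (an "X-drop" style rule), which is one common way to realize "until the score falls below a threshold". The scores `match=1`, `mismatch=-1` and `drop=3` are hypothetical.

```python
def extend_ungapped(query, db, q, d, k, match=1, mismatch=-1, drop=3):
    """Extend a k-letter hit at query[q:q+k] / db[d:d+k] without gaps.

    Returns (query_start, db_start, length, score) of the best extension.
    """
    def extend(step, qi, di):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= di < len(db):
            score += match if query[qi] == db[di] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length   # remember the best prefix
            elif best - score >= drop:
                break                            # score fell too far: stop
            qi += step
            di += step
        return best, best_len

    right_score, right_len = extend(+1, q + k, d + k)
    left_score, left_len = extend(-1, q - 1, d - 1)
    total = k * match + left_score + right_score
    return q - left_len, d - left_len, left_len + k + right_len, total


print(extend_ungapped("GGATGCCGATTT", "CCATGCCGAAAA", q=2, d=2, k=4))
```

The extension is trimmed back to the best-scoring prefix on each side, so trailing mismatches are not included in the reported segment.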

Sensitivity-Speed Tradeoff Long words (k = 15) => fewer alignments => faster, but low chance of a match (lower sensitivity). Short words (k = 7) => more alignments => high chance of a match (higher sensitivity), but slower.

BLAT - The BLAST-Like Alignment Tool
Similarities: Rapid scans for relatively short matches (hits). Extend these hits into high-scoring pairs (HSP).
Differences:
BLAST                          | BLAT
Index of query sequence        | Index of database
Scan through database          | Scan through query sequence
Gaps not allowed in extension  | Gaps allowed in extension
Returns smaller alignments     | Returns larger alignments

BLAT Strategies - I: Single perfect matches. We do not allow any mismatch. Common intuition: fewer matches for longer words

Sensitivity and Specificity Single Perfect Nucleotide K-mer Matches as a Search Criterion

BLAT Strategies - II Allow one mismatch Intuition: Higher number of matches for same word length => Better sensitivity (Caution: Keep k higher, else no. of matches will be huge)

Sensitivity and Specificity Single Near-Perfect Nucleotide K-mer Matches as a Search Criterion (one mismatch allowed)

BLAT Strategies - III: Allow multiple perfect matches. Two parameters: N: no. of matches, K: word size. Practically: same sensitivity, higher speed

Sensitivity and Specificity Multiple Perfect Nucleotide K-mer Matches as a Search Criterion (2 and 3 perfect matches)

Seeded Alignment A dominant paradigm for fast comparisons Seed: A common pattern of positions used for efficient large scale comparison of genomic DNA …G A T T A C C A G A T T A C C A G A T T A … Seed: {0,1,2,3} => Comparison sequences:{G A T T}{A T T A}{T T A C} … Seed: {0,2,4,6} => Comparison sequences:{G T A C}{A T C A}{T A C G} …
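
The comparison sequences on this slide can be reproduced with a few lines of code. This sketch assumes the seed is given as a tuple of 0-based positions starting at 0; the name `spaced_words` is hypothetical.

```python
def spaced_words(sequence: str, seed: tuple) -> list:
    """Sample the letters at the seed's positions at every offset of the sequence."""
    span = max(seed) + 1  # assumes the seed's first position is 0
    return [
        "".join(sequence[offset + p] for p in seed)
        for offset in range(len(sequence) - span + 1)
    ]


seq = "GATTACCAGATTACCAGATTA"
print(spaced_words(seq, (0, 1, 2, 3))[:3])  # contiguous seed
print(spaced_words(seq, (0, 2, 4, 6))[:3])  # spaced seed
```

A contiguous seed such as {0,1,2,3} reduces to ordinary k-mers; a spaced seed samples the same number of letters over a longer window.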

Similarity Detection Seed: {0,2,4,5}
Sequence 1: A T C G A C T
Sequence 2: C T A G T C T
Offset = 0 => Mismatch
Offset = 1 => Match
We can have multiple seeds (patterns)!
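
The offset-by-offset check can be sketched as below, assuming the two sequences are already aligned position by position without gaps. The function name `seed_matches` is hypothetical.

```python
def seed_matches(seq1: str, seq2: str, seed: tuple) -> list:
    """Offsets at which seq1 and seq2 agree on every position of the seed."""
    span = max(seed) + 1  # assumes the seed's first position is 0
    n = min(len(seq1), len(seq2))
    return [
        offset
        for offset in range(n - span + 1)
        if all(seq1[offset + p] == seq2[offset + p] for p in seed)
    ]


# The slide's example: mismatch at offset 0, match at offset 1.
print(seed_matches("ATCGACT", "CTAGTCT", (0, 2, 4, 5)))
```

With several seeds, the same check is run once per seed and a similarity counts as detected if any seed matches at any offset.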

Seed Design A (hazy) problem definition. Input: a collection of ungapped genomic sequence similarities, plus parameters (length of seeds, resource limits, …) => Algorithm => Output: a set of seeds that will give “optimum performance” (What are the parameters? How do you define optimum performance?)

Tasks in Seed Designing (1) Define a measure of goodness for a seed. Easy! Sensitivity to interesting biosequence similarities. (2) Show how to evaluate goodness for a seed. Hard! No efficient algorithm.

Terminology related to a Seed A seed P = an ordered list of w positions, i.e. $P = \{x_1, x_2, \ldots, x_w\}$. w = weight of P = |P|. s = span of P = $x_w - x_1 + 1$. Ex: P = {0, 1, 4, 5}: w = 4, s = 5 - 0 + 1 = 6

Computational Cost [Diagram: seed weight w and no. of seeds n => f => computational cost]

Performance Measurement Optimum performance => Maximum sensitivity (i.e. detection probability) to the similarities S (Currently, Markov Models are used to measure these probabilities!)

Markov Models A k-th order Markov model M: given k bits, predict the (k+1)-th bit. Example: a 4th-order Markov model
X4 X3 X2 X1 | Pr[X5 = 1 | X4 X3 X2 X1]
0  0  0  0  | Pr[X5 = 1 | 0000]
0  0  0  1  | Pr[X5 = 1 | 0001]
…           | …
1  1  1  1  | Pr[X5 = 1 | 1111]
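
One simple way to fill in such a table is to count, in a set of training similarities written as 0/1 strings (1 = matching position), how often each k-bit context is followed by a 1. This is a sketch under assumptions: the function name `train_markov` and the add-one pseudocount are illustrative choices, not taken from the slides.

```python
from collections import Counter
from itertools import product


def train_markov(strings, k, pseudocount=1.0):
    """Estimate Pr[next bit = 1 | previous k bits] for every k-bit context."""
    ones, totals = Counter(), Counter()
    for s in strings:
        for i in range(k, len(s)):
            context = s[i - k:i]
            totals[context] += 1
            ones[context] += s[i] == "1"
    # The pseudocount keeps unseen contexts at a well-defined 0.5.
    return {
        "".join(bits): (ones["".join(bits)] + pseudocount)
        / (totals["".join(bits)] + 2 * pseudocount)
        for bits in product("01", repeat=k)
    }


model = train_markov(["1111011110111", "1101111011111"], k=2)
for context, prob in sorted(model.items()):
    print(context, round(prob, 3))
```

A k-th order model needs 2^k table rows, which is why the order is kept small in practice.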

(Exact) Problem Definition Inputs: number of seeds: n; weight of each seed: w; Markov model: M; similarities: S. Output: a set of seeds (ordered positions), $P = \{x_{11}, \ldots, x_{1w}\}, \{x_{21}, \ldots, x_{2w}\}, \ldots, \{x_{n1}, \ldots, x_{nw}\}$, that maximizes the detection probability for S

Computing Detection Probabilities Challenge: the probability of at least one match is hard to compute because the probabilities of matches at different offsets are not independent. Ex: Seed = {0,2,3,5}. A similarity with 2 matches, at offsets 0 and 2, covers positions {0,2,3,5} and {2,4,5,7}, which share two of four positions (2 and 5)
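
For very short similarities, the detection probability can be checked by brute force, which makes the dependence between offsets concrete. This sketch assumes the simplest model, in which each position matches independently with probability `p` (a 0th-order model); it enumerates all 2^l bit strings and is not meant for realistic lengths.

```python
from itertools import product


def detects(bits, seed):
    """True if the seed's positions are all 1 at some offset of the bit string."""
    span = max(seed) + 1  # assumes the seed's first position is 0
    return any(
        all(bits[offset + p] for p in seed)
        for offset in range(len(bits) - span + 1)
    )


def detection_probability_bruteforce(seed, length, p):
    """Sum the probabilities of all 0/1 strings of the given length that the seed detects."""
    total = 0.0
    for bits in product((0, 1), repeat=length):
        if detects(bits, seed):
            ones = sum(bits)
            total += p ** ones * (1 - p) ** (length - ones)
    return total


# Two offsets of the seed {0,1} overlap in a length-3 string:
# the result is 0.375, not 1 - (1 - 0.25)**2 = 0.4375.
print(detection_probability_bruteforce((0, 1), 3, 0.5))
```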

DFA to compute Detection Probability (Deterministic Finite Automaton) Construct a DFA that accepts the strings of 1's and 0's defined by the seed P. Ex: P = {0,2}, i.e. within a substring of length 3, we need a match in the 1st and 3rd positions. Then the DFA should accept the strings given by the regular expression (0+1)*1(0+1)1(0+1)*
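
The accepted language can be explored with an ordinary regular-expression engine. The sketch below builds the pattern for an arbitrary seed, writing the slide's (0+1) as the character class `[01]`. The helper name `seed_regex` is hypothetical, and Python's `re` module is used here only to test membership; it does not expose an explicit DFA.

```python
import re


def seed_regex(seed):
    """Regular expression for bit strings in which the seed matches at some offset."""
    span = max(seed) + 1  # assumes the seed's first position is 0
    core = "".join("1" if i in seed else "[01]" for i in range(span))
    return re.compile(f"[01]*{core}[01]*")


pattern = seed_regex((0, 2))
print(pattern.pattern)
print(bool(pattern.fullmatch("0101")))  # 1s at positions 1 and 3
print(bool(pattern.fullmatch("1100")))  # no two 1s exactly two apart
```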

Dynamic Programming Algorithm To compute the detection probabilities recursively! Complexity analysis: Size of DFA <= … Time to construct a DFA = … Time for each step = … No. of steps = l Total time complexity = … This is faster by a factor of $s^2/w$ than the best previous algorithm for detection probabilities

Remarks about the algorithm Can be extended to work with a set of seeds The DFA need not be minimal Time complexity can be further reduced

Structure in Seed Space Addressing the problem: When is one seed more sensitive than another? Factors: Parameters of the Markov model M Similarity length : l Smaller length: irregular behavior => We can generalize only for asymptotic cases

Asymptotic Result Let $E_l(P)$ = event that P detects S at some offset, and $E_l^c(P)$ = the complementary event. Then a seed P is asymptotically worse than a seed P', written P < P', if $\lim_{l \to \infty} \Pr[E_l^c(P)] / \Pr[E_l^c(P')] > 1$ (P' has more chances of detecting S than P does!)

Mandala: Fast, Practical Seed Design Seed selection: No efficient algorithm to find optimum w, s given M (except Brute Force) Applies local search method; global efficiency sacrificed Training a Markov model: Training set is adaptively selected to suit the intended application Samples training set using LSH-ALL-PAIRS algorithm

Experimental Results - I Avg. detection probabilities given by theoretical models for random seeds (w = 11). Solid line: $M_0$. Dashed line: $M_5$

Experimental Results - II Detection probabilities for best seeds found by Mandala (k = 5) Solid: noncoding DNA model Dashed: coding DNA model

Directions for further research Extend the model to evaluate seeds Extend similarity models to distinguish between different classes of substitution Construct models of multiple alignment: to compare 3 or more genomes at once

Timing - BLAT vs WU-TBLASTX Dataset: 1000 Mouse Reads and a RepeatMasked Human Chromosome 22

Sensitivity – BLAT vs WU-TBLASTX Dataset: 13 million Mouse Shotgun Reads and Human Chromosome 22

Sensitivity and Specificity Single Perfect Amino Acid K-mer Matches as a Search Criterion

Sensitivity and Specificity Single Near-Perfect Amino Acid K-mer Matches as a Search Criterion (one mismatch allowed)

Sensitivity and Specificity Multiple Perfect Amino Acid K-mer Matches as a Search Criterion (2 and 3 perfect matches)

Mathematical Formula To compute the probability that the DFA associated with a seed accepts a string randomly chosen from a Markov model M. Notation: S = similarity; l = length of S; k = order of M; delta = bit string of length k; q = a state; Phi(q) = set of states that transition to q on bit b; t = no. of bits read; k' = min{k, t}

Dynamic Programming Initialize the recurrence: $P(q_0, 0, 0) = 1$. After l steps, return the sum over all (k+1)-mer bit strings delta.1 of $P(q_a, l, \text{delta.1})$
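
A simplified sketch of the dynamic program, under two stated assumptions. First, positions match independently with probability `p` (a 0th-order model), so the context argument delta of the slide's recurrence is dropped. Second, the automaton state is simply the last s-1 bits read plus one absorbing accept state, which is a valid but non-minimal automaton for the seed. The function name is hypothetical.

```python
from collections import defaultdict


def detection_probability(seed, length, p):
    """Probability that the seed matches at some offset of a random bit string.

    Assumes each bit is 1 independently with probability p.
    """
    span = max(seed) + 1  # assumes the seed's first position is 0
    probs = {(): 1.0}     # start state: nothing read yet, probability 1
    accepted = 0.0        # mass already absorbed by the accept state

    for _ in range(length):
        nxt = defaultdict(float)
        for state, pr in probs.items():
            for bit, pb in ((1, p), (0, 1.0 - p)):
                window = state + (bit,)
                if len(window) == span and all(window[i] for i in seed):
                    accepted += pr * pb        # seed matched: accept for good
                else:
                    # Keep only the last span-1 bits as the new state.
                    key = window[-(span - 1):] if span > 1 else ()
                    nxt[key] += pr * pb
        probs = nxt
    return accepted


print(detection_probability((0, 2, 3, 5), 64, 0.7))
```

Each step costs time proportional to the number of live states, at most 2^(s-1) in this sketch. A k-th order model would extend each state with the last k bits read.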
