Index-based search of single sequences Omkar Mate CS 374 Stanford University.

Index-based search of single sequences Omkar Mate CS 374 Stanford University

Motivation A newly discovered gene … Occurrence in other species Mutation … (may hold clues to evolution of human brain capacity )

Sequence Alignment
Existing genome database:
…………………………
gattaccagattaccagattaccagattacca
caggattacaggattacaggattacaggatta
cattaggacattaggacattaggacattagga
aaccattagaccattagaccattagaccatta
…………………………
New query: gacatta
Easy??? Think again …

State of Biological Databases ~300 sequenced genomes

Alignment Problem
Our new gene (the query): M ≈ 10^4
The entire genomic database: N
Assume we try Smith-Waterman [running time = O(MN)].
Time and space complexity: O(10^15), which corresponds to N on the order of 10^11.
Huge number! But we can do better …

Indexing-based Local Alignment
BLAST: Basic Local Alignment Search Tool
Main idea:
Construct a dictionary of all the words in the query.
Initiate a local alignment for each word match between the query and the DB.
Running time: O(MN) in the worst case. However, orders of magnitude faster than Smith-Waterman in practice!
(Figure: word hits between the query and the DB.)

BLAST Step 1: Construct a dictionary of query words
(Query indexed by all words of size k = 4)
Query: AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC…
Index of query words: AAAA, AAAC, …, AACG, …, ACGT, …, CGTT, …, TTTT
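A minimal sketch of Step 1, assuming the "dictionary" is a plain hash map from each length-k word to the query positions where it starts. The function name `build_query_index` is illustrative and not taken from BLAST itself.

```python
from collections import defaultdict


def build_query_index(query, k=4):
    """Map every length-k word of the query to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return dict(index)


# The first 20 letters of the query shown on the slide.
index = build_query_index("AACGTTGATCAGCTAGACTG", k=4)
print(index["AACG"], index["ACGT"])
```

A query of length M yields M - k + 1 words, so the index is built in time linear in the query length.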

BLAST Step 2: Generate all the relatives of a word
(Relative: a word whose alignment score against the query word is greater than a threshold T)
Query word: ATGC, T: 50
Candidates (each scored against the query word): ATGC, ATGG, ACGC, ATGA, ACCC, …
Candidates that score above T are relatives! Update the index with them:
Query: AATGCCGATAGCATCG…
Index: …, ACGC, …, ATGC, ATGG, …
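The relative-generation step can be sketched by brute force: enumerate every word of the same length and keep those scoring above T. The scoring scheme here (match = +5, mismatch = -4) and the threshold are assumptions chosen for illustration. The slide's own scores did not survive, and real BLAST uses a substitution matrix.

```python
from itertools import product


def word_score(a, b, match=5, mismatch=-4):
    """Ungapped score of two equal-length words (hypothetical scoring scheme)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def relatives(word, threshold, alphabet="ACGT"):
    """All words of the same length whose score against `word` exceeds `threshold`."""
    candidates = ("".join(c) for c in product(alphabet, repeat=len(word)))
    return [c for c in candidates if word_score(word, c) > threshold]


# With these scores, ATGC scores 20 against itself, 11 with one mismatch,
# and 2 with two mismatches, so threshold 10 keeps words with at most one mismatch.
rel = relatives("ATGC", threshold=10)
print(len(rel), "ATGG" in rel, "ACCC" in rel)
```

Enumerating all 4^k words is fine for short nucleotide words. For protein words a pruned search would be preferable.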

BLAST Step 3: Searching
Search through the database linearly, one word at a time.
Initiate an alignment at every occurrence of that word in the query.
Database: ATCGCTATCGCTACGACTACGACTACGATCAGCATCTC…
Query index: ATCG, TCGC
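The linear scan can be sketched as follows. This is a simplified illustration, not BLAST's actual data structure: the index is rebuilt inline so the block is self-contained, and `find_hits` is a hypothetical name. Each hit is a pair (query position, database position) from which an alignment would be started.

```python
from collections import defaultdict


def build_query_index(query, k):
    """Map each length-k query word to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_hits(query, database, k=4):
    """Slide over the database one word at a time and look each word up in the query index."""
    index = build_query_index(query, k)
    hits = []
    for j in range(len(database) - k + 1):
        for i in index.get(database[j:j + k], ()):
            hits.append((i, j))
    return hits


# Query ATCGC contains the two index words shown on the slide (ATCG, TCGC).
print(find_hits("ATCGC", "ATCGCTATCGCTACGACTACGACTACGATCAGCATCTC", k=4))
```

Each database word costs one dictionary lookup, so the scan itself is linear in the database length; the extensions started from hits account for the rest of the running time.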

BLAST Step 4: Alignment Extension
Once we find a word hit, extend it to the left and to the right with no gaps until the alignment score falls below a certain threshold.
ATGCCGATACGATCAGCTACGATCAG…
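A sketch of ungapped extension. The stopping rule used here is an X-drop variant, an assumption rather than the slide's exact rule: extension stops once the running score falls more than `x_drop` below the best score seen, and the best-scoring extent is kept. The scores (match = +1, mismatch = -1) and the name `extend_ungapped` are illustrative.

```python
def extend_ungapped(query, db, qi, dj, k, match=1, mismatch=-1, x_drop=2):
    """Extend a length-k word hit at query[qi:], db[dj:] in both directions, no gaps.

    Returns (query_start, db_start, length, score) of the extended segment.
    """
    def walk(step, q, d):
        # Walk in one direction; remember the best gain and how far it reached.
        best = gain = 0
        best_len = n = 0
        while 0 <= q < len(query) and 0 <= d < len(db):
            gain += match if query[q] == db[d] else mismatch
            n += 1
            if gain > best:
                best, best_len = gain, n
            elif best - gain > x_drop:
                break
            q += step
            d += step
        return best, best_len

    seed_score = sum(
        match if query[qi + t] == db[dj + t] else mismatch for t in range(k)
    )
    right_gain, right_len = walk(+1, qi + k, dj + k)
    left_gain, left_len = walk(-1, qi - 1, dj - 1)
    return (qi - left_len, dj - left_len,
            left_len + k + right_len, seed_score + left_gain + right_gain)


# Word hit ACGT at query position 2 and database position 2.
print(extend_ungapped("TTACGTAC", "GGACGTACCC", qi=2, dj=2, k=4))
```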

Sensitivity-Speed Tradeoff
Long words (k = 15): fewer alignments => faster, but a low chance of a match.
Short words (k = 7): more alignments => a high chance of a match, but slower.

BLAT - The BLAST-Like Alignment Tool
Similarities:
Rapid scans for relatively short matches (hits)
Extend these hits into high-scoring pairs (HSP)
Differences:
BLAST: index of query sequence. BLAT: index of database.
BLAST: scan through database. BLAT: scan through query sequence.
BLAST: gaps not allowed in extension. BLAT: gaps allowed in extension.
BLAST: returns smaller alignments. BLAT: returns larger alignments.

BLAT Strategies - I Single perfect matches We do not allow any mismatch. Common intuition : fewer matches for longer words
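The intuition can be made quantitative with a back-of-the-envelope model. The assumptions are not stated on the slide: each base of a homologous region matches independently with probability M, the region has length H, and the index holds non-overlapping K-mers, so the region contains T = floor(H/K) of them. A specific K-mer then matches perfectly with probability M^K, and at least one of the T matches with probability 1 - (1 - M^K)^T.

```python
def p_single_perfect(M, K, H):
    """Probability that at least one non-overlapping K-mer in a region of
    length H matches perfectly, with per-base identity M (independence assumed)."""
    p1 = M ** K          # one specific K-mer matches perfectly
    T = H // K           # number of non-overlapping K-mers in the region
    return 1 - (1 - p1) ** T


# Longer words give fewer chance hits but also lower sensitivity.
for K in (8, 12, 16):
    print(K, round(p_single_perfect(0.9, K, 100), 3))
```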

Sensitivity and Specificity Single Perfect Nucleotide K-mer Matches as a Search Criterion

BLAT Strategies - II Allow one mismatch Intuition: Higher number of matches for same word length => Better sensitivity (Caution: Keep k higher, else no. of matches will be huge)
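Under the same illustrative model (independent per-base identity M, region length H, T = floor(H/K) non-overlapping K-mers, all assumptions), allowing one mismatch changes only the per-word probability: a K-mer hits if it matches perfectly or at exactly K - 1 positions.

```python
def p_single_near_perfect(M, K, H):
    """Probability that at least one non-overlapping K-mer in a region of
    length H matches with at most one mismatch (independence assumed)."""
    p1 = K * M ** (K - 1) * (1 - M) + M ** K   # exactly one mismatch, or none
    T = H // K
    return 1 - (1 - p1) ** T


# For the same K, near-perfect matching is more sensitive than perfect matching,
# which is why K can (and should) be kept larger.
print(round(p_single_near_perfect(0.9, 12, 100), 3))
```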

Sensitivity and Specificity Single Near-Perfect Nucleotide K-mer Matches as a Search Criterion (one mismatch allowed)

BLAT Strategies - III Allow multiple perfect matches Two parameters: N: no. of matches, K: word size Practically: Same sensitivity, higher speed
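With two parameters the same illustrative model gives a binomial tail: the probability that at least N of the T = floor(H/K) non-overlapping K-mers match perfectly. This sketch assumes independent bases with identity M and ignores any requirement that the hits lie close together, so it is an upper bound on what a clustered-hit criterion would detect.

```python
from math import comb


def p_multiple_perfect(M, K, H, N):
    """Probability that at least N of the non-overlapping K-mers in a region
    of length H match perfectly (independence assumed)."""
    p1 = M ** K
    T = H // K
    return sum(comb(T, n) * p1 ** n * (1 - p1) ** (T - n) for n in range(N, T + 1))


# Requiring two hits lets K shrink while keeping chance hits rare.
print(round(p_multiple_perfect(0.9, 8, 100, 2), 3))
```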

Sensitivity and Specificity Multiple Perfect Nucleotide K-mer Matches as a Search Criterion (2 and 3 perfect matches)

Seeded Alignment
A dominant paradigm for fast comparisons
Seed: a common pattern of positions used for efficient large-scale comparison of genomic DNA
… G A T T A C C A G A T T A C C A G A T T A …
Seed: {0,1,2,3} => Comparison sequences: {G A T T} {A T T A} {T T A C} …
Seed: {0,2,4,6} => Comparison sequences: {G T A C} {A T C A} {T A C G} …
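The two seed examples can be reproduced with a few lines. This sketch assumes a seed is given as a tuple of offsets starting at 0; `seed_words` is an illustrative name.

```python
def seed_words(sequence, seed):
    """Sample the sequence at the seed's positions, once for every offset."""
    span = max(seed) + 1
    return [
        "".join(sequence[offset + p] for p in seed)
        for offset in range(len(sequence) - span + 1)
    ]


seq = "GATTACCAGATTACCA"
print(seed_words(seq, (0, 1, 2, 3))[:3])   # contiguous seed
print(seed_words(seq, (0, 2, 4, 6))[:3])   # spaced seed
```

A contiguous seed {0,1,2,3} is just an ordinary 4-mer. The spaced seed reads every other base, so it has the same weight but a longer span.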

Similarity Detection
Seed: {0,2,4,5}
Sequence 1: A T C G A C T
Sequence 2: C T A G T C T
Offset = 0 => Mismatch (positions 0, 2, 4, 5 are compared)
Offset = 1 => Match (positions 1, 3, 5, 6 are compared)
We can have multiple seeds (patterns)!
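The offset check can be sketched directly. The two sequences are read off the flattened slide text as ATCGACT and CTAGTCT, which is an assumption about where the 14 letters split; `seed_match_offsets` is an illustrative name.

```python
def seed_match_offsets(s1, s2, seed):
    """Offsets at which the two sequences agree on every position of the seed."""
    span = max(seed) + 1
    n = min(len(s1), len(s2))
    return [
        offset
        for offset in range(n - span + 1)
        if all(s1[offset + p] == s2[offset + p] for p in seed)
    ]


print(seed_match_offsets("ATCGACT", "CTAGTCT", (0, 2, 4, 5)))
```

With several seeds, the same check would simply be run once per seed and the resulting offsets merged.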

Seed Design
A (hazy) problem definition
Input: a collection of ungapped genomic sequence similarities
Parameters: length of seeds, resource limits, …
Output of the algorithm: a set of seeds that will give "optimum performance"
(What are the parameters? How do you define optimum performance?)

Tasks in Seed Designing
Define a measure of goodness for a seed. Easy! Sensitivity to interesting biosequence similarities.
Show how to evaluate goodness for a seed. Hard! No efficient algorithm.

Terminology related to a Seed
A seed P = an ordered list of w positions, i.e. P = {x_1, x_2, …, x_w}
w = weight of P = |P|
s = span of P = x_w - x_1 + 1
Ex: P = {0, 1, 4, 5}: w = 4, s = 5 - 0 + 1 = 6
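The two definitions translate into one-liners. The function names are illustrative, and a seed is assumed to be a tuple of integer positions.

```python
def weight(seed):
    """Number of positions the seed inspects."""
    return len(seed)


def span(seed):
    """Length of the window the seed covers, from first to last position."""
    return max(seed) - min(seed) + 1


print(weight((0, 1, 4, 5)), span((0, 1, 4, 5)))
```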

Computational Cost
The computational cost is a function f of the seed weight w and the number of seeds n.

Performance Measurement Optimum performance => Maximum sensitivity (i.e. detection probability) to the similarities S (Currently, Markov Models are used to measure these probabilities!)

Markov Models
A k-th order Markov model M: given k bits, predict the (k+1)-th bit.
Example, a 4th-order Markov model: given X_1 X_2 X_3 X_4, it specifies Pr[X_5 = 1 | X_4 X_3 X_2 X_1], with one value per context:
Pr[X_5 = 1 | 0000], Pr[X_5 = 1 | 0001], …, Pr[X_5 = 1 | 1111]
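A k-th order model over bits is just a table with 2^k entries, one per context. The sketch below stores it as a dictionary and uses it to score a bit string. The uniform table and the value 0.7 are placeholders, not trained parameters, and the first k bits are assumed to be supplied as a starting context.

```python
from itertools import product


def make_uniform_model(k, p_one=0.7):
    """Hypothetical model: every length-k context predicts a 1 with probability p_one."""
    return {ctx: p_one for ctx in product((0, 1), repeat=k)}


def string_probability(model, k, start, bits):
    """Pr[bits | the preceding k bits are `start`] under a k-th order model."""
    ctx = tuple(start)
    prob = 1.0
    for b in bits:
        p1 = model[ctx]
        prob *= p1 if b == 1 else 1 - p1
        # Slide the context window forward by one bit.
        ctx = (ctx + (b,))[-k:] if k else ()
    return prob


model = make_uniform_model(4)
print(len(model), string_probability(model, 4, (0, 0, 0, 0), [1, 1, 0]))
```

A trained model would replace the uniform table with context-specific estimates, for instance counts of 1s following each context in a training set of similarities.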

(Exact) Problem Definition
Inputs: number of seeds n; weight of each seed w; Markov model M; similarities S.
Output: a set of n seeds (ordered positions), P = { {x_11, …, x_1w}, {x_21, …, x_2w}, …, {x_n1, …, x_nw} }, that maximizes the detection probability for S.

Computing Detection Probabilities Challenge: The probability of at least one match varies because the probabilities of matches at different offsets are not independent. Ex: Seed = {0,2,3,5} This similarity has 2 matches, at offsets 0 and 2, which share two of four positions in common
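The dependence between offsets can be seen by brute force. In this sketch a similarity is modeled as a bit string (1 = matching position, 0 = mismatch) whose bits are independent with probability p of being 1. That is a 0th-order simplification chosen for illustration, and the function names are hypothetical.

```python
from itertools import product


def hit_offsets(bits, seed):
    """Offsets at which every seed position falls on a 1."""
    span = max(seed) + 1
    return [o for o in range(len(bits) - span + 1)
            if all(bits[o + p] == 1 for p in seed)]


def detection_probability_bruteforce(seed, length, p):
    """Sum the probability of every bit string of the given length that the seed hits."""
    total = 0.0
    for bits in product((0, 1), repeat=length):
        if hit_offsets(bits, seed):
            ones = sum(bits)
            total += p ** ones * (1 - p) ** (length - ones)
    return total


def shared_positions(seed, o1, o2):
    """Similarity positions inspected by the seed at both offsets."""
    return sorted({o1 + x for x in seed} & {o2 + x for x in seed})


print(shared_positions((0, 2, 3, 5), 0, 2))
# Seed {0,1} on length 3: treating the two offsets as independent would give
# 1 - (1 - 0.25)**2 = 0.4375, but the exact value is lower.
print(detection_probability_bruteforce((0, 1), 3, 0.5))
```

The enumeration costs 2^l evaluations, which is only usable for very short similarities.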

DFA to compute Detection Probability (Deterministic Finite Automaton) Construct a DFA that accepts a string of 1’s and 0’s defined by the seed P. Ex: P = {0,2} i.e. for a substring of length 3, we need a match in 1 st and 3 rd position. Then the DFA should accept strings given by the regular expression: “(0+1)*1(0+1)1(0+1)*”

Dynamic Programming Algorithm
To compute the detection probabilities recursively!
Complexity analysis (the bound expressions are missing from this transcript):
Size of DFA <=
Time to construct a DFA =
Time for each step =
No. of steps = l
Total time complexity =
This is faster by a factor of s^2/w than the best previous algorithm for detection probabilities.
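A simplified sketch of the recursive computation, not the algorithm from the talk. The automaton state is the window of the most recent s - 1 bits plus one absorbing "detected" state, so the automaton is not minimal, and the similarity bits are assumed independent with probability p of a 1 (a 0th-order model). The seed is assumed to start at position 0.

```python
from collections import defaultdict


def detection_probability(seed, length, p):
    """Probability that the seed hits a random similarity of the given length."""
    span = max(seed) + 1
    states = {(): 1.0}   # probability mass per window of recent bits
    detected = 0.0       # mass already absorbed in the accepting state
    for _ in range(length):
        nxt = defaultdict(float)
        for window, mass in states.items():
            for bit, pb in ((1, p), (0, 1 - p)):
                w = window + (bit,)
                if len(w) == span and all(w[i] == 1 for i in seed):
                    detected += mass * pb          # seed just matched: absorb
                else:
                    key = w[-(span - 1):] if span > 1 else ()
                    nxt[key] += mass * pb          # keep only the last s-1 bits
        states = nxt
    return detected


print(detection_probability((0, 2), 4, 0.5))
print(detection_probability((0, 2, 3, 5), 64, 0.7))
```

Each of the l steps touches at most 2^(s-1) windows, so the cost is O(l * 2^s) rather than the 2^l of full enumeration. A smaller automaton would reduce that further.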

Remarks about the algorithm Can be extended to work with a set of seeds The DFA need not be minimal Time complexity can be further reduced

Structure in Seed Space Addressing the problem: When is one seed more sensitive than another? Factors: Parameters of the Markov model M Similarity length : l Smaller length: irregular behavior => We can generalize only for asymptotic cases

Asymptotic Result
Let E_l(P) = the event that P detects S at some offset, and E_l^c(P) = the complementary event.
Then a seed P is asymptotically worse than a seed P' (written P < P') if
lim_{l -> infinity} Pr[E_l^c(P)] / Pr[E_l^c(P')] > 1
(P' has a better chance of detecting S than P does!)

Mandala: Fast, Practical Seed Design Seed selection: No efficient algorithm to find optimum w, s given M (except Brute Force) Applies local search method; global efficiency sacrificed Training a Markov model: Training set is adaptively selected to suit the intended application Samples training set using LSH-ALL-PAIRS algorithm

Experimental Results - I Avg. detection probabilities given by theoretical models for random seeds (w = 11) Solid line: M 0 Dashed line: M 5

Experimental Results - II Detection probabilities for best seeds found by Mandala (k = 5) Solid: noncoding DNA model Dashed: coding DNA model

Directions for further research Extend the model to evaluate seeds Extend similarity models to distinguish between different classes of substitution Construct models of multiple alignment: to compare 3 or more genomes at once

Timing - BLAT vs WU-TBLASTX Dataset: 1000 Mouse Reads and a RepeatMasked Human Chromosome 22

Sensitivity – BLAT vs WU-TBLASTX Dataset: 13 million Mouse Shotgun Reads and Human Chromosome 22

Sensitivity and Specificity Single Perfect Amino Acid K-mer Matches as a Search Criterion

Sensitivity and Specificity Single Near-Perfect Amino Acid K-mer Matches as a Search Criterion (one mismatch allowed)

Sensitivity and Specificity Multiple Perfect Amino Acid K-mer Matches as a Search Criterion (2 and 3 perfect matches)

Mathematical Formula
To compute the probability that the DFA associated with a seed accepts a string randomly chosen from a Markov model M.
S = similarity
l = length of S
k = order of M
delta = bit string of length k
q = a state
Phi(q) = set of states that transition to q on bit b
t = no. of bits read
k' = min {k, t}

Dynamic Programming
Initialize the recurrence: P(q_0, 0, 0) = 1
After l steps, return the sum over all (k+1)-mer bit strings delta.1 of P(q_a, l, delta.1)