Centre for Integrative Bioinformatics VU, 02-11-2006. Alignments 3: BLAST. Sequence Analysis.

Presentation transcript:

[1] Alignments 3: BLAST. Sequence Analysis, Centre for Integrative Bioinformatics VU.

[2] Sequence searching: challenges. Exponential growth of databases.

[3] Sequence searching: definition. Task: the query is a short, new sequence (~1000 b); the database (search space) holds very many sequences; the goal is to find sequences related to the query. We want a fast tool that acts primarily as a filter, because most sequences will be unrelated to the query; the alignment can be fine-tuned later.

[4] Heuristic Alignment Motivation. Dynamic programming has performance O(mn), which is too slow for large databases with high query traffic (cf. MPsrch, a massively parallel DP implementation [Sturrock & Collins, MPsrch version 1.3 (1993)]). Heuristic methods do a fast approximation to dynamic programming: FASTA [Pearson & Lipman, 1988] and BLAST [Altschul et al., 1990].

[5] Heuristic Alignment Motivation. Consider the task of searching SWISS-PROT with a query sequence. Say our query sequence is 362 amino acids long. SWISS-PROT release 38 contains 29,085,265 amino acids. Finding local alignments via dynamic programming would entail O(10^10) matrix operations. Many servers handle thousands of such queries a day (NCBI > 50,000). Using the DP algorithm for this is clearly prohibitive. Note: each database search can be sped up by 'trivial parallelisation'.
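
The order-of-magnitude estimate on this slide can be checked with a few lines of arithmetic. This is only a sketch: it counts one matrix cell per query/database residue pair, and the variable names are illustrative.

```python
import math

# Figures quoted on the slide.
query_length = 362               # amino acids in the query
database_residues = 29_085_265   # amino acids in SWISS-PROT release 38
queries_per_day = 50_000         # lower bound quoted for NCBI

# One DP matrix cell per (query residue, database residue) pair.
cells_per_search = query_length * database_residues
order_of_magnitude = math.floor(math.log10(cells_per_search))
cells_per_day = cells_per_search * queries_per_day

print(f"cells per search: {cells_per_search:,} (about 10^{order_of_magnitude})")
print(f"cells per day at {queries_per_day:,} queries: {cells_per_day:.2e}")
```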

[6] Heuristic Alignment. Today BLAST is discussed to show you a few of the tricks people have come up with to make alignment and database searching fast, while not losing too much quality.

[7] What is BLAST? Basic Local Alignment Search Tool. Bad news: it is only a heuristic. Heuristic: a rule of thumb that often helps in solving a certain class of problems, but makes no guarantees (Perkins, DN (1981) The Mind's Best Work). Basic idea: high-scoring segments have a well-conserved (almost identical) part. Once the well-conserved parts are identified, extend these to the real alignment.

[8] What does 'well conserved' mean for BLAST? BLAST works with k-words (words of length k). k is a parameter that differs for DNA (>10) and proteins (2..4); the default k values are 11 and 3, respectively. Word w1 is T-similar to w2 if the sum of pair scores is at least T (e.g. T = 12). Similar 3-words: W1: R K P; W2: R R P; Score: 9 - 1 + 7 = 15.
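
A minimal sketch of the T-similarity test. The three pair scores are copied from the slide's RKP/RRP example; the `PAIR_SCORES` table and the function names are illustrative assumptions, and a real search would use a full substitution matrix.

```python
# Pair scores taken from the slide's example only (assumption: symmetric).
PAIR_SCORES = {("R", "R"): 9, ("K", "R"): -1, ("P", "P"): 7}

def pair_score(a, b):
    """Look up a pair score, trying both orders of the residue pair."""
    key = (a, b) if (a, b) in PAIR_SCORES else (b, a)
    return PAIR_SCORES[key]

def word_score(w1, w2):
    """Sum of pair scores over aligned positions of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))

def is_t_similar(w1, w2, threshold):
    """True if the summed pair score reaches the threshold T."""
    return word_score(w1, w2) >= threshold

print(word_score("RKP", "RRP"), is_t_similar("RKP", "RRP", threshold=12))
```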

[9] BLAST algorithm: 3 basic steps. 1) Preprocess the query: extract all the k-words. 2) Scan for T-similar matches in the database. 3) Extend them to alignments.

[10] BLAST, Step 1: Preprocess the query. Take the query (e.g. LVNRKPVVP). Chop it into overlapping k-words (k = 3 in this case): Word1: LVN, Word2: VNR, Word3: NRK, ... For each word find all similar words (scoring at least T). E.g. for RKP the following 3-words are similar: QKP, KKP, RQP, REP, RRP, RKP.
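
The chopping step can be sketched as follows; `overlapping_words` is an illustrative name, and the neighbourhood expansion (finding all T-similar words) is left out because it needs a full scoring matrix.

```python
def overlapping_words(sequence, k=3):
    """Return every overlapping k-word of the sequence, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

words = overlapping_words("LVNRKPVVP", k=3)
print(words)
```

A query of length n yields n - k + 1 words, so the 9-residue example gives 7 words.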

[11] Step 2: Scanning the Database with a DFA (Deterministic Finite-state Automaton). Searching the database for all occurrences of the query words can be a massive task. Approach: build a DFA that recognizes all query words, run the database sequences through the DFA, and remember the hits.

[12] DFA: finite state machine. An abstract machine with a constant amount of memory (states), used in computation and languages. It recognizes regular expressions, e.g. AC*T|GGC, or the shell pattern in: cp dmt*.pdf /home/john

[13] BLAST, Step 2: Find "exact" matches with scanning. Use all the T-similar k-words (QKP, KKP, RQP, REP, RRP, RKP, ...) to build the finite state machine. Scan for exact matches, moving along the database sequence: ...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...
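
The scan on this slide can be imitated with a plain set lookup over a sliding window. This is a stand-in for the finite state machine, not the machine itself; it reports the same hits but does more work per position. The function name is illustrative.

```python
def scan_for_words(subject, words, k=3):
    """Return (offset, word) for every exact occurrence of a listed k-word."""
    hits = []
    for offset in range(len(subject) - k + 1):
        word = subject[offset:offset + k]
        if word in words:
            hits.append((offset, word))
    return hits

neighbourhood = {"QKP", "KKP", "RQP", "REP", "RRP", "RKP"}
hits = scan_for_words("VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA", neighbourhood)
print(hits)
```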

[14] Scanning the Database: DFA. Example (next 2 slides): consider a DFA to recognize the query words QL, QM, ZL. All that a DFA does is read strings and output "accept" or "reject". Use the Mealy paradigm (accept on transitions) to save space and time. Moore paradigm example: the alphabet is (a, b); the states are q0, q1 and q2; the start state is q0 (denoted by the arrow coming from nowhere); the only accepting state is q2 (denoted by the double ring around the state); and the transitions are the arrows. The machine works as follows. Given an input string, we start at the start state and read in each character one at a time, jumping from state to state as directed by the transitions. When we run out of input, we check whether we are in an accept state. If we are, we accept; if not, we reject. Moore paradigm: accept/reject states. Mealy paradigm: accept/reject transitions.

[15] A DFA to recognize the query words QL, QM, ZL in a fast way (Mealy paradigm; accept on red transitions). Figure labels: start; Q; Z; L or M; Q; not (L or M or Q); Z; L; not (L or Z); not (Q or Z). This DFA was downloaded from an expert website, but what do you think (see next slide)?

[16] A DFA to recognize the query words QL, QM, ZL in a fast way (Mealy paradigm; accept on red transitions). Figure labels: start; Q; Z; L or M; Q; not (L or M or Q or Z); Z; L; not (L or Z or Q); not (Q or Z); Z; Q. Spot and justify the differences with the last slide.
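
A sketch of a Mealy-style recognizer for QL, QM and ZL, written as one reading of the transition labels above (the figure itself is not available, so the state names and layout are assumptions). Acceptance is reported on the transition that completes a word, and a Q read in state Z, or a Z read in state Q, moves straight to the other word-prefix state instead of falling back to start.

```python
def scan_with_mealy_dfa(text):
    """Return the positions at which QL, QM or ZL has just been completed."""
    state = "start"
    accepts = []
    for position, symbol in enumerate(text):
        if state == "Q" and symbol in ("L", "M"):
            accepts.append(position)   # accepting transition: QL or QM
            state = "start"
        elif state == "Z" and symbol == "L":
            accepts.append(position)   # accepting transition: ZL
            state = "start"
        elif symbol == "Q":
            state = "Q"                # a Q always starts a possible word
        elif symbol == "Z":
            state = "Z"                # a Z always starts a possible word
        else:
            state = "start"
    return accepts

print(scan_with_mealy_dfa("AQZLQQMB"))
```

In the input AQZLQQMB the ZL at positions 2-3 is found only because reading Z in state Q leads to state Z; a machine that returned to start there would miss it.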

[17] BLAST, Step 3: Extending "exact" matches. Having the list of matches (hits), we extend the alignment in both directions. Query: L V N R K P V V P; T-similar: R R P; Subject: G V C R R P L K C; Score: (per-residue scores not preserved). Extension continues until the sum of scores drops below some level X from the best known.

[18] Step 3: Extending Hits. Extend hits in both directions (without allowing gaps). Terminate the extension in one direction when the score falls a certain distance below the best score found for shorter extensions. Return segment pairs scoring at least S.
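
A minimal sketch of ungapped extension with a drop-off limit. The match/mismatch scores (+5/-4), the `x_drop` value and all function names are assumptions chosen for illustration; BLAST itself scores protein pairs with a substitution matrix.

```python
def score_pair(a, b, match=5, mismatch=-4):
    """Toy scoring scheme (assumption): fixed match and mismatch scores."""
    return match if a == b else mismatch

def extend_one_way(query, subject, q_start, s_start, step, x_drop):
    """Extend without gaps in one direction; return (best_gain, best_length)."""
    best = running = 0
    best_len = length = 0
    q, s = q_start, s_start
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += score_pair(query[q], subject[s])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break                      # fell too far below the best score
        q += step
        s += step
    return best, best_len

def extend_hit(query, subject, q_pos, s_pos, k=3, x_drop=8):
    """Extend a k-word hit both ways; return (score, query span)."""
    seed = sum(score_pair(query[q_pos + i], subject[s_pos + i]) for i in range(k))
    right, right_len = extend_one_way(query, subject, q_pos + k, s_pos + k, +1, x_drop)
    left, left_len = extend_one_way(query, subject, q_pos - 1, s_pos - 1, -1, x_drop)
    return seed + left + right, (q_pos - left_len, q_pos + k + right_len)

score, span = extend_hit("LVNRKPVVP", "GVCRRPLKC", q_pos=3, s_pos=3)
print(score, span)
```

A caller would keep the segment pair only if `score` reaches the cutoff S.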

[19] More Recent BLAST Extensions: the two-hit method, gapped BLAST, hashing the database, PSI-BLAST. All are aimed at increasing sensitivity while keeping run-times minimal. Altschul et al., Nucleic Acids Research 1997.

[20] The Two-Hit Method. The extension step typically accounts for 90% of BLAST's execution time. Key idea: do the extension only when there are two hits on the same diagonal within distance A of each other. To maintain sensitivity, lower the T parameter: more single hits are found, but only a small fraction have an associated 2nd hit.
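
The two-hit rule can be sketched by remembering the last hit seen on each diagonal (subject position minus query position). The default `window=40` stands in for the distance A and is an assumption here, as are the function name and the choice to ignore hits that overlap the previous hit on the same diagonal.

```python
def two_hit_seeds(hits, window=40, k=3):
    """Return the second hit of every qualifying pair on one diagonal.

    hits: iterable of (query_pos, subject_pos) word hits.
    """
    last_on_diagonal = {}
    seeds = []
    for q_pos, s_pos in sorted(hits, key=lambda hit: hit[1]):
        diagonal = s_pos - q_pos
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = s_pos - previous
            if distance < k:
                continue               # overlaps the previous hit; ignore it
            if distance <= window:
                seeds.append((q_pos, s_pos))
        last_on_diagonal[diagonal] = s_pos
    return seeds

print(two_hit_seeds([(0, 10), (5, 15), (30, 12), (2, 60)]))
```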

[21] The Two-Hit Method. Figure from: Altschul et al., Nucleic Acids Research 25, 1997.

[22] Gapped BLAST. Trigger a gapped alignment if the two-hit extension has a sufficiently high score. Find the length-11 segment with the highest score; use the central pair in this segment as the seed. Run the DP process both forward and backward from the seed. Prune cells when the local alignment score falls a certain distance below the best score yet.
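
The seed-selection step can be sketched as a sliding-window maximum over the per-position scores of the ungapped segment pair. The function name and the fallback for segments shorter than the window are assumptions.

```python
def choose_seed(pair_scores, window=11):
    """Index of the central pair of the best-scoring length-`window` segment."""
    if len(pair_scores) < window:
        return len(pair_scores) // 2   # assumption: fall back to the midpoint
    window_sum = sum(pair_scores[:window])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(pair_scores) - window + 1):
        # Slide the window by one: add the new pair, drop the old one.
        window_sum += pair_scores[start + window - 1] - pair_scores[start - 1]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, start
    return best_start + window // 2

example_scores = [0] * 3 + [5] * 11 + [-1] * 4
print(choose_seed(example_scores))
```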

[23] Gapped BLAST. Figure from: Altschul et al., Nucleic Acids Research 25, 1997.

[24] Combining the two-hit method and Gapped BLAST. Before: a relatively high T threshold for the 3-letter word (hashed) lists, followed by two-way hit extension (see earlier slides). Current BLAST: a lower T gives many more hits (more 3-letter words accepted as a match); relatively few hits (diagonal elements) will be on the same matrix diagonal within a given distance A; 2-way local dynamic programming (gapped BLAST) is performed only on these 'two-hits'. The new way is a bit faster on average and gives better (gapped) alignments and better alignment scores!

[25] Hashing: associative arrays. Indexing with the object itself. The hash function maps x from the set of possible objects (large) to a set of addresses (small, fits in memory). Objects should be "well spread" over the addresses.

[26] Hashing: examples. T9 Predictive Text in mobile phones. "hello" with multi-tap: 4, 4, 3, 3, 5, 5, 5, (pause) 5, 5, 5, 6, 6, 6. "hello" in T9: 4, 3, 5, 5, 6. Collisions: 4, 6: "in", "go".

[27] Hashing: examples (cont.). Another, easier hash function: let a = 1, b = 2, c = 3, etc. "hello" now gets hash address 8 + 5 + 12 + 12 + 15 = 52. "olleh" will get the same address (collision). Each word encountered gets a hash address immediately and can be indexed. How good is this hash function?
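
Both hash functions from the hashing examples fit in a few lines. The keypad layout is the standard phone layout; the function names are illustrative.

```python
KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_KEY = {letter: key for key, letters in KEYPAD.items() for letter in letters}

def t9_hash(word):
    """One key press per letter, as in T9 predictive text."""
    return "".join(LETTER_TO_KEY[c] for c in word.lower())

def letter_sum_hash(word):
    """a=1, b=2, ..., z=26, summed over the word."""
    return sum(ord(c) - ord("a") + 1 for c in word.lower())

print(t9_hash("hello"), t9_hash("in"), t9_hash("go"))
print(letter_sum_hash("hello"), letter_sum_hash("olleh"))
```

The letter-sum hash ignores letter order, so every anagram collides; that is one way to answer the slide's closing question.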

[28] BLAST, Step 2: Find "exact" matches with hashing. Preprocess the database: hash the database with k-words; for each k-word store in which sequences it appears. Example for k-word RKP, hashed DB:
QKP: HUgn , Gene14, IG0, ...
KKP: haemoglobin, Gene134, IG_30, ...
RQP: HSPHOSR1, GeneA22, ...
RKP: galactosyltransferase, IG_1, ...
REP: haemoglobin, Gene134, IG_30, ...
RRP: Z17368, Creatine kinase, ...

[29] BLAST, Step 2: Find "exact" matches with hashing. The database is preprocessed only once (independent of the query)! In constant time we can get the sequences containing a certain k-word (same hashed DB example for k-word RKP as on the preceding slide).
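
A sketch of the hashed database as a dictionary from k-word to the set of sequences containing it. The toy sequences and their names are invented for illustration only.

```python
from collections import defaultdict

def build_word_index(database, k=3):
    """Map every k-word to the names of the sequences in which it occurs."""
    index = defaultdict(set)
    for name, sequence in database.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(name)
    return index

# Hypothetical toy database (not real entries).
toy_database = {"seqA": "MRKPLV", "seqB": "GGRKPA", "seqC": "MKKPLL"}
word_index = build_word_index(toy_database)

# Built once; each lookup is then a single dictionary access.
print(sorted(word_index["RKP"]), sorted(word_index["KKP"]))
```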

[30] BLAST flavours.
blastp: protein query, protein db.
blastn: DNA query, DNA db.
blastx: DNA query, protein db; the query is translated in all reading frames. Used to find potential translation products of an unknown nucleotide sequence.
tblastn: protein query, DNA db; the database is dynamically translated in all reading frames.
tblastx: DNA query, DNA db; all translations of the query against all translations of the db (compare at protein level).

[31] PSI-BLAST: Position-Specific Iterated BLAST. A profile (called PSSM by BLAST: Position-Specific Scoring Matrix) is derived from the result of the first search (using a single query sequence). The database is searched with the profile (instead of a sequence) in subsequent rounds. Up to 3-10 iterations are recommended.

[32] PSI-BLAST steps in words.
1. Query sequences are first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996), i.e. regions with a biased composition likely to lead to spurious hits; these are excluded from alignment.
2. The program then initially operates on a single query sequence by performing a gapped BLAST search.
3. Then, the program takes the significant local alignments (hits) found, constructs a multiple alignment (master-slave alignment) and abstracts a position-specific scoring matrix (PSSM) from this alignment.
4. The database is rescanned in a subsequent round, now using the PSSM, to find more homologous sequences. Iteration continues until the user decides to stop or the search has converged.

[33] Profile. A profile is a generalized form of a sequence: probabilities instead of a letter at each position. (Figure: profile matrix with rows A, C, D, ..., W, Y.)

[34] Constructing a profile. Take significant BLAST hits. Make an alignment. Assign weights to sequences. Construct the profile. (Figure: profile matrix with rows A, C, D, ..., W, Y.)

[35] PSI-BLAST: Constructing the Profile Matrix. Figure from: Altschul et al., Nucleic Acids Research 25, 1997.

[36] Profile calculation example using frequency normalisation and log conversion.
(A) Aligned sequences: S1 GCTCC, S2 AATCG, S3 TACGC, S4 GTGTT, S5 GTAAA, S6 CGTCC. Count each nucleotide per position (columns 1-5) and overall: A = 6/30 = .20, C = 9/30 = .30, G = 7/30 = .23, T = 8/30 = .27. Normalise the per-position frequencies by dividing by the overall frequencies, then convert to logarithms to obtain the profile (PSSM). (The per-position tables are not preserved in this transcript.)
(B) Match GATCA to the PSSM: find the nucleotides at the corresponding positions and sum the corresponding log-odds matrix scores. Score = 3.23.
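
The calculation can be reproduced from the six sequences on the slide. This sketch assumes base-2 logarithms (the base is not legible in the transcript) and uses unrounded frequencies, so the total comes out near, not exactly at, the slide's 3.23. Function names are illustrative.

```python
import math

SEQUENCES = ["GCTCC", "AATCG", "TACGC", "GTGTT", "GTAAA", "CGTCC"]
ALPHABET = "ACGT"

def build_log_odds(sequences, base=2):
    """Per-column log-odds scores: log(column frequency / overall frequency)."""
    n = len(sequences)
    length = len(sequences[0])
    total = n * length
    overall = {a: sum(s.count(a) for s in sequences) / total for a in ALPHABET}
    pssm = []
    for j in range(length):
        column = [s[j] for s in sequences]
        scores = {}
        for a in ALPHABET:
            freq = column.count(a) / n
            # A zero count has no finite log-odds score without pseudo-counts.
            scores[a] = math.log(freq / overall[a], base) if freq > 0 else float("-inf")
        pssm.append(scores)
    return pssm, overall

def score_sequence(pssm, sequence):
    """Sum the log-odds scores of the residues at their positions."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))

pssm, overall = build_log_odds(SEQUENCES)
gatca_score = score_sequence(pssm, "GATCA")
print({a: round(f, 2) for a, f in overall.items()}, round(gatca_score, 2))
```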

[37] PSI-BLAST: Determining profile elements more reliably using pseudo-counts. The value for a given element of the profile matrix is given by $m_{ij} = \log\left(P_{ij} / P_i\right)$, where $P_i$ is the overall alignment frequency of amino acid $a_i$ (preceding slide) and the probability of seeing amino acid $a_i$ in column $j$ is estimated as $P_{ij} = \dfrac{\alpha f_{ij} + \beta g_{ij}}{\alpha + \beta}$, with $f_{ij}$ the observed frequency and $g_{ij}$ the pseudocount (e.g. database frequency); e.g. $\alpha$ = number of sequences in profile, $\beta = 1$.

[38] PSI-BLAST: Determining profile elements more reliably using pseudo-counts. Pseudo-counts mix the observed amino-acid frequencies with prior (e.g. database) frequencies. The drawback is that all frequencies are pulled towards the prior frequencies, which reduces differences. Pseudo-counts are useful when the multiple alignment contains only a few sequences, so that there is no statistical sample per column yet. With greater numbers of sequences in the MSA, the profile becomes less dependent on the prior frequencies.
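
A small sketch of the mixing step, assuming the weighted-average form (alpha * observed + beta * prior) / (alpha + beta) with alpha set to the number of sequences and beta = 1, as suggested by the slide's example values. The function name and the numbers are illustrative.

```python
def mixed_probability(observed_freq, prior_freq, alpha, beta=1.0):
    """Weighted average of the observed column frequency and the prior."""
    return (alpha * observed_freq + beta * prior_freq) / (alpha + beta)

prior = 0.20  # hypothetical background frequency of one residue

# Few sequences: an unseen residue still gets a non-zero probability.
unseen_few = mixed_probability(0.0, prior, alpha=6)
# Same observed frequency, more sequences: the prior matters less.
half_few = mixed_probability(0.5, prior, alpha=6)
half_many = mixed_probability(0.5, prior, alpha=60)

print(round(unseen_few, 4), round(half_few, 4), round(half_many, 4))
```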

[39] PSI-BLAST iteration graphic. Figure labels: query sequence Q with its low-complexity region masked (xxxxxxxxxxxxxxxxx); gapped BLAST search; database hits; PSSM (rows A, C, D, ..., Y); gapped BLAST search with the PSSM; database hits; iterate.

[40] Another PSI-BLAST iteration graphic. Figure labels: DB; T; hits; PSSM; Q; discarded sequences; run query sequence against database; run PSSM against database.

[41] Figure 6, panels (A), (B), (C), (D).

[42] PSI-BLAST entry page. Paste your query sequence. Switch this off for default run.

[44] BLAST results page, numbered annotations.
1 - This portion of each description links to the sequence record for a particular hit.
2 - Score or bit score is a value calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and hit sequence (also referred to as subject sequence).
3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score will occur in the database by chance. The smaller the E Value, the more significant the alignment. For example, the first alignment has a very low E value of e-117, meaning that a sequence with a similar score is very unlikely to occur simply by chance.
4 - These links provide the user with direct access from BLAST results to related entries in other databases. 'L' links to LocusLink records and 'S' links to structure records in NCBI's Molecular Modeling DataBase.

[45] 'X' residues denote low-complexity sequence fragments that are ignored.

[46] Alignment Bit Score. $B = (\lambda S - \ln K) / \ln 2$. S is the raw alignment score. The bit score ('bits') B has a standard set of units. The bit score B is calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. $\lambda$ and K are the statistical parameters of the scoring system (BLOSUM62 in BLAST). See Altschul and Gish, 1996, for a collection of values for $\lambda$ and K over a set of widely used scoring matrices. Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches based on different scoring schemes (a.a. exchange matrices).
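
A sketch of the bit-score conversion, assuming the standard form B = (lambda * S - ln K) / ln 2. The lambda and K values below are illustrative placeholders, not parameters taken from the slides; real values depend on the scoring matrix and gap costs.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S to bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Illustrative parameter values (assumption).
LAMBDA, K = 0.267, 0.041
bits = bit_score(100, LAMBDA, K)
print(round(bits, 2))
```

By construction 2^(-B) equals K * e^(-lambda * S), which is why bit scores from different scoring systems can be compared directly.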

[47] What is the statistical significance of an alignment? To get a null model: extract local alignments from random sequences. P-value: the probability of obtaining the result by pure chance. An alignment giving a lower P-value than a threshold value set by the user is considered a hit.

[48] Normalised sequence similarity. The p-value is defined as the probability of seeing at least one unrelated score S greater than or equal to a given score x in a database search over n sequences. This probability follows the Poisson distribution (Waterman and Vingron, 1994): $P(x, n) = 1 - e^{-n \cdot P(S \ge x)}$, where n is the number of sequences in the database. The p-value depends on x and on n (fixed).

[49] E-value. The concept of P-value applies to single comparisons. What about searching in a large database? Task: having a protein, we want to find similar ones in a large database (1 million sequences). We are interested in P-value < 0.01. Count the number of hits we'll get by chance alone.

[50] Normalised sequence similarity: statistical significance. The E-value is defined as the expected number of non-homologous sequences with a score greater than or equal to a score x in a database of n sequences: $E(x, n) = n \cdot P(S \ge x)$. For example, if the E-value is 0.01, then the expected number of random hits with score $S \ge x$ is 0.01, which means that such a hit is expected by chance only once in 100 independent searches over the database. If the E-value of a hit is 5, then five fortuitous hits with $S \ge x$ are expected within a single database search, which renders the hit not significant.
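
A sketch relating the per-comparison probability, the E-value and the database p-value, using the forms E = n * P(S >= x) and P = 1 - exp(-E). It also answers the counting task posed for a database of 1 million sequences. Function names are illustrative.

```python
import math

def e_value(n, tail_probability):
    """Expected number of chance hits among n database sequences."""
    return n * tail_probability

def database_p_value(n, tail_probability):
    """Probability of at least one chance hit among n sequences."""
    return 1.0 - math.exp(-n * tail_probability)

# 1 million sequences, per-comparison P-value threshold of 0.01.
chance_hits = e_value(1_000_000, 0.01)

# For small E the p-value and the E-value nearly coincide.
small_e = e_value(100, 0.0001)
small_p = database_p_value(100, 0.0001)

print(chance_hits, small_e, round(small_p, 5))
```

A per-comparison threshold of 0.01 is therefore far too permissive for a database of this size: about ten thousand hits would be expected by chance alone.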

[51] A model for database searching score probabilities. Scores resulting from searching with a query sequence against a database follow the Extreme Value Distribution (EVD) (Gumbel, 1955). Using the EVD, the raw alignment scores are converted to a statistical score (E value) that keeps track of the database amino acid composition and the scoring scheme (a.a. exchange matrix).

[52] Extreme Value Distribution. Figure: probability density function for the extreme value distribution resulting from parameter values $\mu = 0$ and $\lambda = 1$, where $\mu$ is the characteristic value and $\lambda$ is the decay constant. The corresponding upper-tail probability is $y = 1 - \exp(-e^{-\lambda(x-\mu)})$, which reduces to $y = 1 - \exp(-e^{-x})$ for these parameter values.

[53] Extreme Value Distribution (EVD). You know that an optimal alignment of two sequences is selected out of many suboptimal alignments, and that a database search is also about selecting the best alignment(s). This fits well with the EVD, which has a right tail that falls off more slowly than the left tail. Compared to using the normal distribution, when using the EVD an alignment has to score further away from the expected mean value to become a significant hit. (Figure: real data and EVD approximation.)

[54] Extreme Value Distribution. The probability of a score S being larger than a given value x can be calculated following the EVD as (E-value): $P(S \ge x) = 1 - \exp(-e^{-\lambda(x-\mu)})$, where $\mu = (\ln Kmn)/\lambda$, and K is a constant that can be estimated from the background amino acid distribution and scoring matrix (see Altschul and Gish, 1996, for a collection of values for $\lambda$ and K over a set of widely used scoring matrices).

[55] Extreme Value Distribution. Using the equation for $\mu$ (preceding slide), the probability for the raw alignment score S becomes $P(S \ge x) = 1 - \exp(-Kmn\,e^{-\lambda x})$. In practice, the probability $P(S \ge x)$ is estimated using the approximation $1 - \exp(-e^{-x}) \approx e^{-x}$, which is valid for large values of x. This leads to a simplification of the equation for $P(S \ge x)$: $P(S \ge x) \approx e^{-\lambda(x-\mu)} = Kmn\,e^{-\lambda x}$. The lower the probability (E value) for a given threshold value x, the more significant the score S.
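
A numerical sketch of the exact tail probability 1 - exp(-Kmn e^(-lambda x)) against its approximation Kmn e^(-lambda x). The lambda and K values are illustrative placeholders, and m and n are taken here as the query and database sizes from the earlier SWISS-PROT example; both choices are assumptions.

```python
import math

def evd_tail(x, lam, k, m, n):
    """Exact form: 1 - exp(-K m n e^(-lam x)), computed stably with expm1."""
    return -math.expm1(-k * m * n * math.exp(-lam * x))

def evd_tail_approx(x, lam, k, m, n):
    """Large-x approximation: K m n e^(-lam x)."""
    return k * m * n * math.exp(-lam * x)

LAMBDA, K = 0.267, 0.041         # illustrative values (assumption)
M, N = 362, 29_085_265           # sizes from the SWISS-PROT example

mu = math.log(K * M * N) / LAMBDA
high = (evd_tail(150, LAMBDA, K, M, N), evd_tail_approx(150, LAMBDA, K, M, N))
low = (evd_tail(60, LAMBDA, K, M, N), evd_tail_approx(60, LAMBDA, K, M, N))

print(f"mu = {mu:.1f}")
print(f"x = 150: exact {high[0]:.3e}, approx {high[1]:.3e}")
print(f"x = 60:  exact {low[0]:.3f}, approx {low[1]:.1f}")
```

For scores well above mu the two forms agree closely; near or below mu the approximation exceeds 1 and no longer behaves like a probability.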

[56] Normalised sequence similarity: statistical significance. Database searching is commonly performed using an E-value in between 0.1 and [...]. Low E-values decrease the number of false positives in a database search, but increase the number of false negatives, thereby lowering the sensitivity of the search.

[57] Words of Encouragement. "There are three kinds of lies: lies, damned lies, and statistics" – Benjamin Disraeli. "Statistics in the hands of an engineer are like a lamppost to a drunk – they're used more for support than illumination". "Then there is the man who drowned crossing a stream with an average depth of six inches." – W.I.E. Gates

[58] Database Search Algorithms: Sensitivity, Selectivity. Sensitivity: the ability to detect weak similarities between sequences (often due to long evolutionary separation). Increasing sensitivity reduces false negatives, i.e. those database sequences that are similar to the query but rejected. Sensitivity = TP / (TP + FN). Selectivity: the ability to screen out similarities due to chance. Increasing selectivity reduces false positives, i.e. those sequences recognized as similar when they are not. Selectivity = TP / (TP + FP). Courtesy of Gary Benson (ISSCB 2003).
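
The two ratios written as small functions; the counts in the example are invented for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of truly related sequences that the search reports."""
    return tp / (tp + fn)

def selectivity(tp, fp):
    """Fraction of reported sequences that are truly related."""
    return tp / (tp + fp)

# Hypothetical search: 80 true hits found, 20 missed, 40 spurious hits.
print(sensitivity(tp=80, fn=20), round(selectivity(tp=80, fp=40), 3))
```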

[59] Dot-plots: a simple way to visualise sequence similarity. Can be a bit messy, though... Filter: 6/10 residues have to match...
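
A sketch of the window filter: a dot is placed at (i, j) only when at least 6 of the 10 residue pairs in the windows starting at i and j are identical. Anchoring the dot at the window start (not its centre) and the function name are assumptions.

```python
def filtered_dotplot(seq_a, seq_b, window=10, min_matches=6):
    """Return (i, j) window start positions that pass the match filter."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + d] == seq_b[j + d] for d in range(window))
            if matches >= min_matches:
                dots.append((i, j))
    return dots

print(filtered_dotplot("AAAAAAAAAA", "AAAAAATTTT"))   # 6 of 10 match
print(filtered_dotplot("AAAAAAAAAA", "AAAAATTTTT"))   # only 5 of 10 match
```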

[60] Dot-plots, what about... Insertions/deletions: DNA and proteins. Duplications (e.g. tandem repeats): DNA and proteins. Inversions: DNA.

[61] Dot-plots, self-comparison: direct repeat, tandem repeat, inverted repeat.

[62] The amount of genetic information in organisms. Table with columns Name, Genome size (Mb), # genes (values not preserved). Organisms listed: Escherichia coli, Homo sapiens, Zea mays, Mycoplasma genitalium, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans.

[63] END