From Pairwise Alignment to Database Similarity Search.


Outline
– Summary: local and global alignments
– FASTA and BLAST algorithms
– Evaluating the significance of alignments

Pairwise Alignment Summary
Global:
– Best score for aligning the full-length sequences
– Dynamic programming
– Algorithm: Needleman-Wunsch
– Table cells are allowed any score
Local:
– Best score for aligning parts of the sequences
– Dynamic programming
– Algorithm: Smith-Waterman
– Table cells never score below zero
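The difference between the two table-filling rules can be sketched in a few lines. This is a minimal illustration, not a production aligner: it assumes a simple match/mismatch score and a linear gap penalty (the values and the function name `align_score` are hypothetical), and it returns only the score, not the alignment.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Score of the best global (Needleman-Wunsch) or local (Smith-Waterman) alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]

    if not local:
        # Global: leading gaps are charged, so the borders hold gap costs.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = table[i - 1][j] + gap
            left = table[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                # Local: a cell never scores below zero, so an alignment can restart anywhere.
                cell = max(cell, 0)
                best = max(best, cell)
            table[i][j] = cell

    # Global reads the bottom-right cell; local reads the best cell anywhere in the table.
    return best if local else table[-1][-1]
```

With these assumed scores, `align_score("A", "C")` is negative in global mode and zero in local mode, which is exactly the "never below zero" rule.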

Available tools for sequence alignment
– ALIGN: global (N-W) / local (S-W)
– BLAST2SEQ: local only, using word matches
– spidey: aligns mRNAs to genomic sequence
– est2genome: aligns ESTs to genomic sequence

Gap Scores
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
Biologically, indels occur in groups, and we want our gap score to reflect this.

Gap Scores
Standard solution: the affine gap model
– One-off cost for opening a gap
– Lower cost for extending the gap
– Requires changes to the algorithm

Affine Gap Penalty
$w_x = g + r(x - 1)$
where $w_x$ is the total gap penalty, $g$ the gap-open penalty, $r$ the gap-extend penalty, and $x$ the gap length.
The gap penalty is chosen so that:
– Gaps are not excluded
– Gaps are not over-included
– Typical values: g = -12; r = -4
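As a quick check of the formula, the penalty can be computed directly. The sketch below uses the typical values from the slide as defaults; the function name is hypothetical.

```python
def affine_gap_penalty(x, g=-12, r=-4):
    """Total penalty w_x = g + r * (x - 1) for a gap of length x >= 1."""
    if x < 1:
        raise ValueError("gap length must be at least 1")
    return g + r * (x - 1)


# A single 5-residue gap versus five separate 1-residue gaps.
one_long_gap = affine_gap_penalty(5)
five_short_gaps = 5 * affine_gap_penalty(1)
```

Here `one_long_gap` is -28 while `five_short_gaps` is -60, so grouped indels are penalized less than scattered ones, as the biology suggests they should be.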

Is this good enough ???

Drawbacks to DP Approaches
– Compute intensive
– Memory intensive

Complexity
Complexity is determined by the size of the table:
– Aligning a sequence of length m against one of length n requires calculating m × n cells
– Estimate: we calculate 10^8 cells per second
– Aligning two mRNA sequences of 8,000 bp requires 64,000,000 cells
– Aligning an mRNA and a 10^7 bp chromosome requires ~10^11 cells
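The arithmetic behind these estimates is easy to reproduce. The sketch below simply multiplies the sequence lengths and divides by the assumed rate of 10^8 cells per second from the slide; the function names are hypothetical.

```python
CELLS_PER_SECOND = 10**8  # throughput estimate taken from the slide


def dp_cells(m, n):
    """Number of table cells needed to align sequences of length m and n."""
    return m * n


def dp_seconds(m, n, rate=CELLS_PER_SECOND):
    """Estimated running time in seconds at the given cell rate."""
    return dp_cells(m, n) / rate


mrna_vs_mrna = dp_seconds(8_000, 8_000)         # two 8,000 bp mRNAs
mrna_vs_chromosome = dp_seconds(8_000, 10**7)   # mRNA against a 10^7 bp chromosome
```

Under this estimate the mRNA pair takes well under a second, while the chromosome comparison (8 × 10^10 cells, rounded to ~10^11 on the slide) takes about 800 seconds.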

Searching databases
Goal: find sequences in the database that are homologous to the query.

[Figure: a new sequence of unknown function (?) is compared against a sequence database; a similar match suggests a similar function.]

Searching databases
Goal: find sequences in the database that are homologous to the query.
Naïve solution: use an exact algorithm to compare each sequence in the database to the query.

Complexity for genomes
The human genome (HG) contains 3 × 10^9 base pairs
– Searching an mRNA against the HG requires ~10^13 cells
– Even efficient exact algorithms will be extremely slow when performed millions of times
– Running the computations in parallel is expensive

So what can we do?

Searching databases
Solutions:
1. Use a heuristic (approximate) algorithm to discard most irrelevant sequences.
2. Perform the exact algorithm on the small group of remaining sequences.

Heuristic strategy
– Homologous sequences are expected to contain common short segments (probably with substitutions, but without insertions or deletions)
– Preprocess the database (DB) into a new data structure that enables fast access
– Remove low-complexity regions that are not useful for meaningful alignments

Low Complexity Sequences
– AAAAAAAAAAA
– ATATATATATATA
– Transposable elements (LINEs, SINEs)

What's wrong with them? They produce artificially high-scoring alignments.
So what do we do? We apply low-complexity masking to the query sequence.
Original: TCGATCGTATATATACGGGGGGTA
Masked:   TCGATCGNNNNNNNNCNNNNNNTA

Low Complexity Sequences
Complexity is calculated as:
$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right)$
where N = 4 for DNA (4 bases), L is the length of the sequence, and n_i is the number of occurrences of each residue i in the sequence.
For the sequence GGGG: L! = 4×3×2×1 = 24; n_G = 4, n_C = 0, n_A = 0, n_T = 0; Π n_i! = 24×1×1×1 = 24; K = (1/4) log_4(24/24) = 0
For the sequence CTGA: L! = 4×3×2×1 = 24; n_G = 1, n_C = 1, n_A = 1, n_T = 1; Π n_i! = 1×1×1×1 = 1; K = (1/4) log_4(24/1) = 0.573
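The two worked examples can be reproduced with a short function. This is a direct transcription of the formula K = (1/L) log_N(L! / Π n_i!); the function name and the default alphabet size are assumptions for illustration, and real masking tools use more elaborate windowed measures.

```python
from collections import Counter
from math import factorial, log


def complexity(seq, alphabet_size=4):
    """Complexity K = (1/L) * log_N(L! / prod(n_i!)) of a sequence."""
    length = len(seq)
    if length == 0:
        raise ValueError("sequence must not be empty")
    denominator = 1
    for count in Counter(seq).values():
        denominator *= factorial(count)
    # The ratio is a multinomial coefficient, so integer division is exact.
    arrangements = factorial(length) // denominator
    return log(arrangements, alphabet_size) / length
```

A homopolymer such as `GGGG` has only one possible arrangement and therefore K = 0, while `CTGA` gives K ≈ 0.573.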

Heuristic (approximate solution) Methods: FASTA and BLAST
FASTA (Lipman & Pearson 1985)
– First fast sequence-searching algorithm for comparing a query sequence against a database
BLAST: Basic Local Alignment Search Tool (Altschul et al 1990)
– Improvement on FASTA: search speed, ease of use, statistical rigor
– Gapped BLAST (Altschul et al 1997)

FASTA and BLAST
Common idea: a good alignment contains subsequences of absolute identity.
– First, identify very short (almost) exact matches.
– Next, extend the best short hits from the first step to longer regions of similarity.
– Finally, optimize the best hits using the Smith-Waterman algorithm.

FastA (fast alignment)
Assumption: a good alignment probably matches some identical "words".
Example: aligning a query sequence to a database, matching words of size 4
Database record: ACTTGTAGATACAAAATGTG
Query sequence:  A-TTGTCG-TACAA-ATCTG

FastA
– Preprocess all the sequences in the database: find short words and organize them in dictionaries.
– Process the query sequence and prepare a dictionary for it.
Database: ATGGCTGCTCAAGT…. ATGGTGGCGGCT… …
Query: …

FastA locates regions of the query sequence and the search set sequence that have high densities of exact word matches. For DNA sequences the word length used is 6.
[Figure: seq1 plotted against seq2.]
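The word-lookup step can be sketched as follows. This is an illustrative simplification, not the FastA implementation: it indexes one record by its words, looks up every query word, and counts hits per diagonal (record position minus query position), so that a diagonal with many hits marks a dense region of exact word matches. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def index_words(seq, k):
    """Map every word of length k in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, record, k):
    """Count exact word matches on each diagonal (record position - query position)."""
    record_index = index_words(record, k)
    hits = Counter()
    for q_pos in range(len(query) - k + 1):
        for r_pos in record_index.get(query[q_pos:q_pos + k], ()):
            hits[r_pos - q_pos] += 1
    return hits


# Word size 4 on the earlier example, with the gap characters removed from the query.
hits = diagonal_hits("ATTGTCGTACAAATCTG", "ACTTGTAGATACAAAATGTG", k=4)
```

For this pair, the densest diagonal is the one offset by 2, which corresponds to the shared TACAAA segment; in a real search the record index would be built once for the whole database rather than per comparison.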

The 10 highest-scoring sequence regions are saved and re-scored using a scoring matrix.
[Figure: seq1 plotted against seq2.]

FastA determines whether any of the initial regions from different diagonals can be joined together to form an approximate alignment with gaps. Only non-overlapping regions may be joined.
[Figure: seq1 plotted against seq2.]

The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap.
[Figure: seq1 plotted against seq2.]

FastA final stage
– Apply an exact local alignment algorithm to the surviving records, computing the final alignment score.
– Calculate an alignment score (S).
– Evaluate the statistical significance.

Assessing Alignment Significance
Determine the probability of the alignment occurring at random.
[Figure: score distributions for random and related sequences; one panel labelled "Ideal", one labelled "No Good".]

FastA at EMBL

FastA
A set of programs for database searching. FastA is available at EMBL.

FASTA SEQUENCE FORMAT
This format contains a one-line header followed by lines of sequence data. Sequences in FASTA-formatted files are preceded by a line starting with a ">" symbol. The first word on this line is the name of the sequence. The remaining lines contain the sequence itself. Blank lines in a FASTA file are ignored.
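A reader for this format needs only the rules just listed. The sketch below is a minimal parser written for illustration (the function name is hypothetical); established libraries handle more edge cases and should be preferred in practice.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            # The first word of the header line is the sequence name.
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

Sequence lines are concatenated, so a record split across several lines comes back as one string.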

FastA at EMBL
Output, general information:
– Z score: deviation (in standard deviations) of the actual score from the mean of random scores, Z = (x - mean) / sd
– Opt: the number of optimized scores observed. Lower limit
– E( ): the number of sequences expected in the score range
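The Z score definition translates directly into code. The sketch below assumes the random scores are available as a list and uses the sample standard deviation; how FastA itself estimates the mean and standard deviation is not specified here, so treat this only as an illustration of Z = (x - mean) / sd.

```python
from statistics import mean, stdev


def z_score(score, random_scores):
    """Deviation of score from the mean of random_scores, in standard deviations."""
    return (score - mean(random_scores)) / stdev(random_scores)
```

For example, with random scores of 40, 50 and 60 (mean 50, standard deviation 10), an observed score of 80 lies three standard deviations above the mean.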

FastA at EMBL

BLAST: Basic Local Alignment Search Tool
Developed to be as sensitive as FastA but much faster. It also searches for short words.
– Protein: 3-letter words
– DNA: 11-letter words
– Words can be similar, not only identical

Word Search in BLAST
– Identity: CAT : CAT
– Similarity: CAT : CAT, CAY, HAT …
– But even CAT : XTX can be similar
(Y = C or T; H = A, C or T; X = A, T, C or G)
For each three-letter word there are many similar words (depending on the alphabet). Similar words are only the ones that reach a minimum cut-off score (T).
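The idea of a word neighborhood with a cut-off score T can be sketched by brute force. The scoring here is a toy match/mismatch scheme over a DNA alphabet chosen purely for illustration (an assumption, not BLAST's substitution matrices), and the function name is hypothetical.

```python
from itertools import product


def neighborhood(word, threshold, alphabet="ACGT", match=2, mismatch=-1):
    """All words of the same length whose score against word is at least threshold."""
    neighbors = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(match if c == w else mismatch for c, w in zip(candidate, word))
        if score >= threshold:
            neighbors.append("".join(candidate))
    return neighbors
```

With these toy scores, `neighborhood("CAT", 6)` contains only `CAT` itself, while lowering T to 3 admits the nine words that differ from `CAT` at a single position. Raising or lowering T trades speed against sensitivity.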

BLAST
– Find matching word pairs
– Extend word pairs as much as possible, i.e., as long as the total weight increases
– Result: High-scoring Segment Pairs (HSPs)
THEFIRSTLINIHAVEADREAMESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEWASNINETEEN
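Ungapped extension of a word hit can be sketched as follows. This is a simplified illustration under stated assumptions: scoring is a toy match/mismatch scheme, and extension in each direction stops once the running score falls more than `x_drop` below the best score seen (a drop-off rule introduced here as an assumption; the slide only says extension continues while the total weight increases). The function and parameter names are hypothetical.

```python
def extend_seed(a, b, i, j, k, match=1, mismatch=-1, x_drop=2):
    """Extend the word hit a[i:i+k] / b[j:j+k] in both directions without gaps.

    Returns (score, start_in_a, start_in_b, length) of the resulting segment pair.
    """
    def pair_score(x, y):
        return match if x == y else mismatch

    seed_score = sum(pair_score(a[i + t], b[j + t]) for t in range(k))

    # Extend to the right, remembering the best-scoring stopping point.
    best_right, run, right_len, t = 0, 0, 0, 0
    while i + k + t < len(a) and j + k + t < len(b):
        run += pair_score(a[i + k + t], b[j + k + t])
        t += 1
        if run > best_right:
            best_right, right_len = run, t
        elif best_right - run > x_drop:
            break

    # Extend to the left in the same way.
    best_left, run, left_len, t = 0, 0, 0, 1
    while i - t >= 0 and j - t >= 0:
        run += pair_score(a[i - t], b[j - t])
        if run > best_left:
            best_left, left_len = run, t
        elif best_left - run > x_drop:
            break
        t += 1

    length = left_len + k + right_len
    return seed_score + best_left + best_right, i - left_len, j - left_len, length
```

For `a = "GGACGTACGG"` and `b = "TTACGTACTT"` with a 3-letter seed at position 2 in both, the segment grows rightward over `TAC` and stops at the mismatches, giving the segment pair `ACGTAC` with score 6.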

BLAST
Try to connect HSPs by aligning the sequences in between them:
THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEW___ASNINETEEN
The Gapped BLAST algorithm allows several segments that are separated by short gaps to be connected into one alignment.

Score and E-value
The score is a measure of the similarity of the query to the sequence shown. The E-value is a measure of the reliability of the score: it is the number of alignments with a score at least as high as the given score S that are expected to occur by chance.

BLAST Score
Score (S): Σ(identities + mismatches) − Σ(gaps)
Bit score:
– Similar to the alignment score
– Normalized
– Higher means more significant

BLAST E-value
Expected number of alignments with score at least S:
$E = K m n e^{-\lambda S}$
– Increases linearly with the length of the query sequence
– Increases linearly with the length of the database
– Decreases exponentially with the score of the alignment
– K, λ: statistical parameters dependent upon the scoring system and background residue frequencies
– m = length of query; n = length of database; S = score
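These dependencies follow from the Karlin-Altschul form E = K m n e^(-λS), which can be evaluated directly. In the sketch below the function names are hypothetical and the values of K and λ are placeholders chosen for illustration only; real values depend on the scoring system. The bit score S' = (λS - ln K) / ln 2 is the normalized score, for which E = m n 2^(-S').

```python
from math import exp, log


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring at least `score`."""
    return K * m * n * exp(-lam * score)


def bit_score(score, K, lam):
    """Normalized score S' = (lam * S - ln K) / ln 2."""
    return (lam * score - log(K)) / log(2)


# Placeholder parameters, not taken from any real scoring system.
K, LAM = 0.1, 0.3
base = e_value(50, m=100, n=10**6, K=K, lam=LAM)
doubled_query = e_value(50, m=200, n=10**6, K=K, lam=LAM)
higher_score = e_value(60, m=100, n=10**6, K=K, lam=LAM)
```

Doubling the query length doubles the E-value, while raising the score from 50 to 60 shrinks it by a factor of e^3 with this λ. Because the bit score absorbs K and λ, it can be compared across searches that use different scoring systems.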

What is a Good E-value? Rules of thumb
E-values of less than […] show that sequences are almost always homologues. Greater E-values can represent homologues as well. Generally, the decision whether an E-value is biologically significant depends on the size of the database that is searched.

Significance of Gapped Alignments
– Gapped alignments use the same statistics
– λ and K cannot be easily estimated
– Empirical estimates for given gap scores are determined by looking at random alignments

BLAST
BLAST is a family of programs: BlastN, BlastP, BlastX, tBlastN, tBlastX
– BlastN: nt versus nt database
– BlastP: protein versus protein database
– BlastX: translated nt versus protein database
– tBlastN: protein versus translated nt database
– tBlastX: translated nt versus translated nt database
Query: DNA or protein. Database: DNA or protein.

BLAST at NCBI
Output:
– Graphical output of top results
– The alignments for top scores
– Scores for each alignment:
1. E-value
2. Bit score: a score normalized with respect to the scoring system. Can be used to compare different searches.