Doug Raiford Lesson 5



- Dynamic programming methods
  - Needleman-Wunsch (global alignment)
  - Smith-Waterman (local alignment)
- BLAST

Time complexity, best to worst:
- Fixed (constant): best
- Linear: next best
- Polynomial (n^2): not bad
- Exponential (3^n): very bad

- BLAST is fast (linear)
- But it is not as sensitive as the dynamic programming methods

[Figure: trade-off between speed and sensitivity]

- Similarity matrix
  - Especially important with amino acids
  - Some amino acids have similar chemical characteristics
- The similarity of each query word to all 8,000 possible 3-mers (20^3) is calculated
  - Usually ~50 are above a threshold
  - All of these ~50 are considered hits when searching
- Matrices
  - PAM (Point Accepted Mutation): built from observed substitution rates in closely related proteins
  - BLOSUM (BLOcks SUbstitution Matrix): built from observed substitution rates in evolutionarily divergent proteins
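The word-neighborhood idea on this slide can be sketched in a few lines. This is only an illustration: the chemical groupings and the scores 5 / 1 / -2 are hypothetical stand-ins for a real PAM or BLOSUM matrix, and `toy_score` and `neighborhood` are names made up for this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical chemical groupings, used only to build a toy similarity score.
# A real search would look scores up in a PAM or BLOSUM matrix instead.
GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "C", "G", "P"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}


def toy_score(a, b):
    """Toy similarity: identical > same chemical group > unrelated."""
    if a == b:
        return 5
    if GROUP_OF[a] == GROUP_OF[b]:
        return 1
    return -2


def neighborhood(word, threshold):
    """Return every word of the same length scoring >= threshold against `word`."""
    hits = []
    # For a 3-letter word this loop visits all 20^3 = 8,000 candidate 3-mers.
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits
```

Calling `neighborhood("KDE", 11)` returns the query word plus the few 3-mers that differ by one chemically similar substitution. With a real matrix and threshold the list is longer, which is where the ~50 figure comes from.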

- PSI-BLAST (Position-Specific Iterated BLAST)
  - Align using the default similarity matrix
  - At each query position, build a Position-Specific Scoring Matrix (PSSM) based on the observed search and alignment results
  - Repeat with the new matrix until the results no longer change
- Builds sensitivity by specifying the allowed similarity at each position
- Slower than BLAST, but still faster than full local alignment (Smith-Waterman)
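A minimal sketch of the PSSM-building step, under simplifying assumptions: the hits are already aligned without gaps, the background frequency is uniform (1/20), and a flat pseudocount is used. PSI-BLAST itself weights sequences and handles pseudocounts more carefully, so `build_pssm` and `score` are illustrative names, not its actual procedure.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one dict per column: residue -> log2(observed freq / background freq).

    Assumes equal-length, gap-free segments and a uniform background.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned[0])):
        # Start every residue at the pseudocount so unseen residues get a finite score.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores for a segment as long as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

In an iterative search, `build_pssm` would be rerun on each round's hits and the new matrix used for the next round, stopping once the hit list stops changing.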

- Alignment is central to bioinformatics
- Needed for
  - Phylogeny
  - Protein function
  - Protein structure
    - Structure → function
  - Drug discovery

- Some parts of proteins are very important for maintaining function
  - They must be similar from species to species
- Can we spot these regions through alignment?

atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctcgatacgtgccgcaggagatcaggactttcacct--tggatcatgcgaccgtacctac

- Conserved regions are often near active sites
  - Ligand binding sites (docking)
  - Protein-to-protein interfaces
  - Regions important for tertiary structure

Ligand: a small molecule that is the target of a protein, e.g. O2 is the ligand for hemoglobin
Substrate: a molecule upon which an enzyme acts

- What if we look at more proteins?
  - Would that increase our confidence?
- But how do we go about performing a multiple sequence alignment?

atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctccatacgtgccccaggagatctggactttcacc---tggatcatgcgaccgtacctac
t-atgg-t-cgtgccgcaggagatcaggactttca-gt--g-aatcatctgg-cgc--c-aa
t--tcgt-ac-tgccccaggagatctggactttcaaa---ca-atcatgcgcc-g-tc-tat
aattccgtacgtgccgcaggagatcaggactttcag-t--a-tatcatctgtc-ggc--tag

- Hyper-dimensional dynamic programming
  - Becomes exponential with respect to the number of sequences
  - O(n^L), with n = sequence length and L = number of sequences
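To get a feel for how fast this grows, the sketch below counts the cells in the dynamic programming lattice and the number of neighboring cells each one depends on. It assumes all sequences have the same length n; the function names are made up for this illustration.

```python
def dp_cells(n, num_seqs):
    """Cells in the alignment lattice for num_seqs sequences of length n."""
    return (n + 1) ** num_seqs


def predecessors_per_cell(num_seqs):
    """Each cell looks back along every non-empty subset of the axes."""
    return 2 ** num_seqs - 1


# Compare pairwise alignment with three-way and five-way alignment.
for num_seqs in (2, 3, 5):
    print(num_seqs, dp_cells(100, num_seqs), predecessors_per_cell(num_seqs))
```

For sequences of length 100, going from two to five sequences raises the cell count from about ten thousand to about ten billion, which is why exact multiple alignment is rarely attempted for more than a handful of sequences.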

- Determine all pairwise distances
  - Fast: number of l-mer matches
  - Slower: full global alignments
- Start with the closest pair and align them
- Then align the next closest sequence to those two
- And so on
- ClustalW: cluster alignment
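The ordering step described above can be sketched as follows, using the fast l-mer distance. This is a simplification: ClustalW builds a guide tree from the distances rather than a flat join order, and the distance formula, `kmer_distance`, and `join_order` are assumptions made for this illustration. The alignment itself is not performed here.

```python
from itertools import combinations


def kmers(seq, k):
    """Set of all length-k substrings (assumes len(seq) >= k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """1 - (shared k-mers / size of the smaller k-mer set); 0 means maximal overlap."""
    ka, kb = kmers(a, k), kmers(b, k)
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))


def join_order(seqs, k=3):
    """Order in which named sequences would enter a progressive alignment."""
    dist = {frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]], k)
            for pair in combinations(seqs, 2)}
    first = min(dist, key=dist.get)          # closest pair goes first
    order = sorted(first)
    remaining = [name for name in seqs if name not in first]
    while remaining:
        # Next comes the sequence closest to anything already placed.
        nxt = min(remaining,
                  key=lambda r: min(dist[frozenset((r, o))] for o in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

`join_order` takes a dict mapping names to sequences and returns the names in the order they would be added, so the closest pair always comes first.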

- Profile: a matrix of real values representing the probability of each amino acid at each position in a corresponding multiple sequence alignment
- Aligning a sequence to a profile uses a modification of the Smith-Waterman algorithm
  - The degree to which an amino acid is preferred at a position sets the degree of match between the profile and the sequence

Consensus  1 M.ERS.HLPEG.PFAAALSGARFAAQSSGN.ASVL..DWNVLP.E 38
| : : : || : ::::: : |: | ::|: : | :
OPSD_XENLA 1 MNG.GTE..EGPN.NFYVP.PMS...SN.NKTGVVRSP.P..PFD 33
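One way to picture the modified Smith-Waterman recurrence is to replace the substitution-matrix lookup with a lookup into the profile column. The sketch below assumes a profile given as one dict of scores per position, a linear gap penalty, and a default score for residues a column does not list; these choices, and the name `profile_local_score`, are illustrative only.

```python
def profile_local_score(profile, seq, gap=2.0, default=-1.0):
    """Best local alignment score of `seq` against a profile.

    `profile` is a list of dicts (one per position) mapping residue -> score.
    Residues missing from a column receive `default`.
    """
    rows, cols = len(profile), len(seq)
    H = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            # The match score depends on the profile position, not just the residue pair.
            match = H[i - 1][j - 1] + profile[i - 1].get(seq[j - 1], default)
            H[i][j] = max(0.0, match, H[i - 1][j] - gap, H[i][j - 1] - gap)
            best = max(best, H[i][j])
    return best
```

A sequence carrying the preferred residue at every profile position scores highest, and one carrying acceptable but less preferred residues scores lower.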

- Mistakes made early in a progressive approach are propagated throughout the process
  - Once sequences are aligned, they are not revisited
- Iterative methods were devised to revisit earlier alignments
- The newest version of ClustalW (version 2) includes iteration
- Other MSA apps: T-Coffee, PSalign, DIALIGN, MUSCLE

- In a sequence logo, the height of each letter represents how prevalent that letter is at that position
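In the usual sequence-logo convention, the total height of a column is its information content in bits, and each letter takes a share of that height in proportion to its frequency. The sketch below computes those heights for one alignment column of DNA. It omits the small-sample correction that logo tools normally apply, and `logo_heights` is a name made up for this example.

```python
import math
from collections import Counter


def logo_heights(column, alphabet_size=4):
    """Letter heights in bits for one alignment column (no small-sample correction)."""
    total = len(column)
    freqs = {symbol: count / total for symbol, count in Counter(column).items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    information = math.log2(alphabet_size) - entropy  # total stack height
    return {symbol: f * information for symbol, f in freqs.items()}
```

A perfectly conserved DNA column gives a single letter of height 2 bits, while a column with all four bases equally frequent has height zero.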

- Scores are affected by sequence lengths
- If we want scores that can be compared across different query lengths, we need to normalize them
- The term "bit" comes from the fact that probabilities are stored as log2 values (binary, bit)
  - This is done so that scores can be added across the length of the sequence instead of multiplied
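The add-instead-of-multiply point can be checked with a small sketch. The probabilities below are hypothetical values chosen for illustration, not entries from a real scoring matrix, and the function names are made up for this example.

```python
import math


def log_odds_bits(p_observed, p_random):
    """Log-odds score in bits for one aligned position."""
    return math.log2(p_observed / p_random)


def alignment_bits(pairs):
    """Total score in bits: the per-position log-odds values simply add."""
    return sum(log_odds_bits(p, q) for p, q in pairs)


# Hypothetical (observed, random) probabilities for three aligned positions.
pairs = [(0.10, 0.05), (0.02, 0.04), (0.20, 0.05)]
total_bits = alignment_bits(pairs)

# Multiplying the raw odds ratios and taking log2 gives the same number.
product_of_odds = math.prod(p / q for p, q in pairs)
print(total_bits, math.log2(product_of_odds))
```

Working in log space also avoids the numerical underflow that would come from multiplying many small probabilities along a long sequence.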