Computational Biology, Part 3: Sequence Alignment. Robert F. Murphy. Copyright © 1996, 1999-2009. All rights reserved.



Sequence Alignment
- Definition: a procedure for comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences
  - Pair-wise alignment: compare two sequences
  - Multiple sequence alignment: compare more than two sequences

Example sequence alignment
- Task: align “abcdef” with “abdgf”
- Write second sequence below the first
    abcdef
    abdgf
- Move sequences to give maximum match between them
- Show characters that match using vertical bar

Example sequence alignment
    abcdef
    ||
    abdgf
- Insert gap between b and d on lower sequence to allow d and f to align

Example sequence alignment
    abcdef
    || | |
    ab-dgf
- Note e and g don’t match
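The row of vertical bars in the gapped alignment can be derived mechanically from the two aligned rows. The following is a minimal Python sketch, assuming both rows have already been padded to equal length with `-` for gaps; the name `match_line` is illustrative and does not come from the slides.

```python
def match_line(top: str, bottom: str, gap: str = "-") -> str:
    """Return '|' under identical, non-gap columns and ' ' elsewhere."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "|" if a == b and a != gap else " "
        for a, b in zip(top, bottom)
    )


if __name__ == "__main__":
    top, bottom = "abcdef", "ab-dgf"
    print(top)
    print(match_line(top, bottom))
    print(bottom)
```

Counting the `|` characters in the result gives the number of identities in the alignment (four here), which is the quantity the "maximum match" step tries to increase.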

Matching Similarity vs. Identity
- Alignments can be based on finding only identical characters, or (more commonly) can be based on finding similar characters
- More on how to define similarity later

Global vs. Local Alignment
- We distinguish
  - Global alignment algorithms, which optimize overall alignment between two sequences
  - Local alignment algorithms, which seek only relatively conserved pieces of sequence
    - Alignment stops at the ends of regions of strong similarity
    - Favors finding conserved patterns in otherwise different pairs of sequences

Global vs. Local Alignment
- Global
    LGPSSKQTGKGS-SRIWDN
    |    |  |||   |  |
    LN-ITKSAGKGAIMRLGDA
- Local
    GKG
    |||
    GKG

Global vs. Local Alignment
- Global
    LGPSSKQTGKGS-SRIWDN
    |    |  |||   |  |
    LN-ITKSAGKGAIMRLGDA
- Local
    TGKG
     |||
    AGKG
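The local example picks out the conserved GKG segment. As a deliberately simplified stand-in for a local alignment algorithm (identities only, no gaps, no scoring matrix), the longest exactly shared segment of two sequences can be found as sketched below. This is not the method used to produce the slide, and the function name is an assumption made for illustration.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest ungapped run of identical characters shared by a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the shared run ending at a[i-2], b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # extend the run that ended one step up the diagonal
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


if __name__ == "__main__":
    # The two sequences from the slide, with gap characters removed
    print(longest_common_substring("LGPSSKQTGKGSSRIWDN", "LNITKSAGKGAIMRLGDA"))
```

Unlike the global alignment, this ignores everything outside the best-matching region, which is the behaviour the slide attributes to local methods.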

Why do sequence alignments?
- To find whether two (or more) genes or proteins are evolutionarily related to each other
- To find structurally or functionally similar regions within proteins

Origin of similar genes
- Similar genes arise by gene duplication
- Copy of a gene inserted next to the original
- Two copies mutate independently
- Each can take on separate functions
- All or part can be transferred from one part of genome to another

Methods for Pairwise Alignment
- Dot matrix analysis
- Dynamic Programming
- Word or k-tuple methods (FASTA and BLAST)

Sequence comparison with dot matrices
- Goal: Graphically display regions of similarity between two sequences (e.g., domains in common between two proteins of suspected similar function)

Sequence comparison with dot matrices
- Basic Method: For two sequences of lengths M and N, lay out an M by N grid (matrix) with one sequence across the top and one sequence down the left side. For each position in the grid, compare the sequence elements at the top (column) and to the left (row). If and only if they are the same, place a dot at that position.
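The basic method translates almost directly into code. Below is a minimal Python sketch, assuming plain character identity as the comparison; the function names and the `*`/`.` text rendering are illustrative choices, not part of the original demonstration.

```python
def dot_matrix(top: str, left: str) -> list:
    """Rows follow `left`, columns follow `top`; True marks a dot."""
    return [[r == c for c in top] for r in left]


def render(top: str, left: str) -> str:
    """Draw the grid as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + top]
    for residue, row in zip(left, dot_matrix(top, left)):
        lines.append(residue + " " + "".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # The two strings from the earlier alignment example
    print(render("abcdef", "abdgf"))
```

The grid has one row per character of the left sequence and one column per character of the top sequence, so its size grows with the product of the two lengths.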

Examples for protein sequences
- (Demonstration A6, Sequence 1 vs. 2)
- [Dot matrix figure; axis sequences as extracted: abcdaefghbijklcmnopd and abcdaefghbijklcmnopd]

Interpretation of dot matrices
- Regions of similarity appear as diagonal runs of dots
- Reverse diagonals (perpendicular to diagonal) indicate inversions
- Reverse diagonals crossing diagonals (Xs) indicate palindromes

Examples for protein sequences
- (Demonstration A6, Sequence 4 vs. 4)
- [Dot matrix figure; axis sequence: abcdeedcbafghijklmno, compared with itself]
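The reverse diagonal produced by the palindromic start of Sequence 4 can also be found programmatically. The sketch below is a hedged illustration in Python: it scans every reverse diagonal (cells with constant row + column) of a self-comparison and reports the longest unbroken run of dots. The function name is hypothetical.

```python
def longest_antidiagonal_run(seq: str) -> int:
    """Longest run of dots along any reverse diagonal of seq compared with itself."""
    n = len(seq)
    best = 0
    for total in range(2 * n - 1):  # every cell on this diagonal has i + j == total
        run = 0
        for i in range(max(0, total - n + 1), min(n, total + 1)):
            j = total - i
            if seq[i] == seq[j]:
                run += 1
                best = max(best, run)
            else:
                run = 0
    return best


if __name__ == "__main__":
    print(longest_antidiagonal_run("abcdeedcbafghijklmno"))
```

A self-comparison always yields isolated single-cell hits where a reverse diagonal crosses the main diagonal, so only runs longer than 1 point to an inverted or palindromic segment.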

Interpretation of dot matrices
- Can link or "join" separate diagonals to form alignment with "gaps"
  - Each a.a. or base can only be used once
  - Can't trace vertically or horizontally
  - Can't double back
  - A gap is introduced by each vertical or horizontal skip
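These joining rules can be expressed as simple checks on a path of dots. The Python sketch below assumes a path is given as a list of (row, column) positions in order; treating any step whose row and column advances differ as one gap is an interpretation of the rule about skips, and both function names are hypothetical.

```python
def is_valid_path(cells) -> bool:
    """True when row and column both strictly increase at every step, so no
    residue is reused, nothing is traced vertically or horizontally, and the
    path never doubles back."""
    return all(r2 > r1 and c2 > c1
               for (r1, c1), (r2, c2) in zip(cells, cells[1:]))


def count_gaps(cells) -> int:
    """Number of steps that skip residues in only one of the two sequences."""
    return sum((r2 - r1) != (c2 - c1)
               for (r1, c1), (r2, c2) in zip(cells, cells[1:]))


if __name__ == "__main__":
    # Dots for a, b, d and f when "abcdef" (rows) is compared with "abdgf" (columns)
    path = [(0, 0), (1, 1), (3, 2), (5, 4)]
    print(is_valid_path(path), count_gaps(path))
```

For the path shown, the jump from (1, 1) to (3, 2) skips c in one sequence only and counts as a gap, while the jump from (3, 2) to (5, 4) steps over the e/g mismatch diagonally and does not.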

Examples for protein sequences
- (Demonstration A6, Sequence 2 vs. 3)
- [Dot matrix figure; axis sequences: abcdaefghbijklcmnopd and abcdefghijklmnopqrst]

Uses for dot matrices
- Can use dot matrices to align two proteins or two nucleic acid sequences
- Can use to find amino acid repeats within a protein by comparing a protein sequence to itself
  - Repeats appear as a set of diagonal runs stacked vertically and/or horizontally

Examples for protein sequences
- (Demonstration A6, Sequence 5 vs. 5)
- [Dot matrix figure; axis sequence: abcdabcdabcdabcdabcd, compared with itself]
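The stacked diagonals in a self-comparison sit at offsets equal to multiples of the repeat length. A minimal Python sketch of that idea follows; it reports only offsets where the off-diagonal is completely unbroken, which is a stricter condition than a real repeat finder would use, and the function name is an assumption.

```python
def full_diagonal_offsets(seq: str) -> list:
    """Offsets k > 0 at which seq matches itself shifted by k at every
    overlapping position (an unbroken off-diagonal in the self dot matrix)."""
    n = len(seq)
    return [k for k in range(1, n)
            if all(seq[i] == seq[i + k] for i in range(n - k))]


if __name__ == "__main__":
    print(full_diagonal_offsets("abcd" * 5))
```

The smallest reported offset is the repeat unit length; for the five-copy `abcd` sequence of Demonstration A6 that is 4.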

Uses for dot matrices
- Can use to find self base-pairing of an RNA (e.g., tRNA) by comparing a sequence to itself complemented and reversed
- Excellent approach for finding sequence transpositions
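The self base-pairing comparison can be sketched as follows. This Python example assumes Watson-Crick pairs only (G-U wobble pairs are ignored) and an invented hairpin sequence; the function names are illustrative.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Complement every base and reverse the order."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna.upper()))


def pairable_positions(rna: str) -> list:
    """Pairs (i, p) of 0-based positions whose bases can pair.

    A dot at (i, j) between the sequence and its reverse complement means
    rna[i] is complementary to rna[n - 1 - j], so p = n - 1 - j.
    """
    rna = rna.upper()
    rc = reverse_complement(rna)
    n = len(rna)
    return [(i, n - 1 - j)
            for i, a in enumerate(rna)
            for j, b in enumerate(rc)
            if a == b]


if __name__ == "__main__":
    hairpin = "GGGCAAAAGCCC"  # hypothetical stem-loop: GGGC pairs with GCCC
    print(reverse_complement(hairpin))
    print((0, 11) in pairable_positions(hairpin))
```

Diagonal runs in this comparison correspond to stretches of consecutive base pairs, that is, candidate stems.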

Filtering to remove “noise”
- A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches (e.g., a single matching A)
- Solution: use a window and a threshold
  - compare character by character within a window (have to choose window size)
  - require certain fraction of matches within window in order to display it with a “dot”
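The window-and-threshold filter can be sketched in a few lines of Python. This version assumes identity matching and records a dot at the start position of each qualifying window; other tools may centre the window instead, so the placement here is an assumption, as is the function name.

```python
def windowed_dots(top: str, left: str, window: int, threshold: int) -> list:
    """Return (row, col) window start positions where at least `threshold`
    of the `window` compared positions are identical."""
    dots = []
    for i in range(len(left) - window + 1):
        for j in range(len(top) - window + 1):
            matches = sum(left[i + k] == top[j + k] for k in range(window))
            if matches >= threshold:
                dots.append((i, j))
    return dots


if __name__ == "__main__":
    # One substitution (c -> x) still leaves the diagonal intact at 2 of 3
    print(windowed_dots("abxdef", "abcdef", window=3, threshold=2))
    # Requiring 3 of 3 keeps only the fully identical window
    print(windowed_dots("abxdef", "abcdef", window=3, threshold=3))
```

Raising the threshold relative to the window removes more noise but also breaks diagonals at mismatches, which is the trade-off behind choosing the two parameters.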

Example spreadsheet with window
- (Demonstration A7)

How do we choose a window size?
- Window size changes with goal of analysis
  - size of average exon
  - size of average protein structural element
  - size of gene promoter
  - size of enzyme active site

How do we choose a threshold value?
- Threshold based on statistics
  - using shuffled actual sequence
    - find average (m) and s.d. (σ) of match scores of shuffled sequence
    - convert original (unshuffled) scores (x) to Z scores: Z = (x - m)/σ
    - use threshold Z of 3 to 6
  - using analysis of other sets of sequences
    - provides “objective” standard of significance
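The shuffling procedure can be sketched as below. This is a minimal Python illustration under stated assumptions: the scoring function (`identity_count`) is a hypothetical placeholder for whatever match score is being thresholded, the population standard deviation is used for σ, and the trial count and seed are arbitrary.

```python
import random
import statistics


def z_score(x: float, background: list) -> float:
    """Z = (x - m) / sigma, with m and sigma taken from the background scores."""
    m = statistics.mean(background)
    sigma = statistics.pstdev(background)  # assumption: population s.d.
    return (x - m) / sigma


def identity_count(a: str, b: str) -> int:
    """Hypothetical match score: identical positions in two equal-length windows."""
    return sum(x == y for x, y in zip(a, b))


def shuffled_scores(seq_a: str, seq_b: str, score, trials: int = 100, seed: int = 0) -> list:
    """Score seq_a against `trials` random shuffles of seq_b."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        residues = list(seq_b)
        rng.shuffle(residues)  # keeps composition, destroys order
        scores.append(score(seq_a, "".join(residues)))
    return scores


if __name__ == "__main__":
    print(z_score(10, [2, 4, 4, 4, 5, 5, 7, 9]))
```

Shuffling preserves the composition of the sequence, so the background scores reflect chance matches between sequences of the same makeup. If every shuffled score is identical, σ is zero and the Z score is undefined.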

Dot matrix analysis with Matlab bioinformatics toolbox
- Get phage cI and phage P22 c2 repressor sequences from Genbank (X00166 and V01153 respectively)
- Use window size of 11 and stringency of 7

Matlab code
    % Download both GenBank records to local files, then read them back
    getgenbank('X00166', 'TOFILE', 'HGENBANKX00166.GBK');
    getgenbank('V01153', 'TOFILE', 'HGENBANKV01153.GBK');
    seq1 = genbankread('HGENBANKX00166.GBK');
    seq2 = genbankread('HGENBANKV01153.GBK');
    % Window of 11 positions, stringency (required matches) of 7
    window=11; num=7;
    seqdotplot(seq1,seq2,window,num)
    xlabel('X00166');ylabel('V01153');
    title('Window 11 Num 7');

Dot matrix
- Note set of diagonals in lower right that do not line up due to insertion near 475 on cI

Dot matrix analysis with Dotmatcher
- Get the corresponding protein sequences of the phage cI and phage P22 c2 repressors (CAA24991 and CAA24470 respectively)
- Use Emboss Dotmatcher online: emboss.bioinformatics.nl, under ‘ALIGNMENT DOT PLOTS’
- Use window size of 10 and threshold of 23 BLOSUM62 units (default parameters)

Dot matrix analysis with Dotmatcher

Dot matrix
- Similarity in the carboxy-terminal domains of the proteins agrees with the similarity in the 3’ ends of the two DNA sequences.

Dot matrix analysis with Matlab bioinformatics toolbox
- Get human LDL receptor protein sequence from Genbank (P01130)
- Use window size of 1 and stringency of 1
- Use window size of 23 and stringency of 7

Matlab code
    % Fetch the protein record once, then compare the sequence with itself
    getgenpept('P01130', 'TOFILE', 'HGENBANKP01130.GBK');
    seq5 = genbankread('HGENBANKP01130.GBK');
    % Unfiltered plot: every single-residue match gets a dot
    window=1; num=1;
    seqdotplot(seq5,seq5,window,num)
    xlabel('P01130 Human LDL receptor');
    ylabel('P01130 Human LDL receptor');
    title('Window 1 Num 1');
    % Filtered plot: window of 23 positions, stringency of 7
    window=23; num=7;
    seqdotplot(seq5,seq5,window,num)
    xlabel('P01130 Human LDL receptor');
    ylabel('P01130 Human LDL receptor');
    title('Window 23 Num 7');

Dot matrix
- W=1 S=1
- Note set of stacked diagonals in upper left

Dot matrix
- W=23 S=7
- Note set of stacked diagonals in upper left