Sequence Alignment.

Slides:



Advertisements
Similar presentations
Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
Advertisements

Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
OUTLINE Scoring Matrices Probability of matching runs Quality of a database match.
Measuring the degree of similarity: PAM and blosum Matrix
DNA sequences alignment measurement
Lecture 8 Alignment of pairs of sequence Local and global alignment
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
Heuristic alignment algorithms and cost matrices
Introduction to Bioinformatics Algorithms Sequence Alignment.
We continue where we stopped last week: FASTA – BLAST
Sequence Analysis Tools
Sequence similarity.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
Introduction to Bioinformatics From Pairwise to Multiple Alignment.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Sequence Alignment.
Inferring function by homology The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology.
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Pairwise & Multiple sequence alignments
An Introduction to Bioinformatics
. Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology  Two genes are homologous if they share a common evolutionary.
Protein Sequence Alignment and Database Searching.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Introduction to Bioinformatics Algorithms Sequence Alignment.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Construction of Substitution Matrices
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
A Table-Driven, Full-Sensitivity Similarity Search Algorithm Gene Myers and Richard Durbin Presented by Wang, Jia-Nan and Huang, Yu- Feng.
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
Sequence Alignment Abhishek Niroula Department of Experimental Medical Science Lund University
Step 3: Tools Database Searching
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
DNA sequences alignment measurement Lecture 13. Introduction Measurement of “strength” alignment Nucleic acid and amino acid substitutions Measurement.
Last lecture summary.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
BIOINFORMATICS Ayesha M. Khan Spring Lec-6.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Pairwise Sequence Alignment and Database Searching
Introduction to sequence alignment Mike Hallett (David Walsh)
Sequence Alignment We assume a link between the linear information stored in DNA, RNA or amino-acid sequence and the protein function determined by its.
LSM3241: Bioinformatics and Biocomputing Lecture 4: Sequence analysis methods revisited Prof. Chen Yu Zong Tel:
Local alignment and BLAST
BNFO 602 Lecture 2 Usman Roshan.
BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment
Basic Local Alignment Search Tool
BLAST Slides adapted & edited from a set by
BLAST Slides adapted & edited from a set by
Presentation transcript:

Sequence Alignment

OUTLINE Sequence Alignment Scoring Alignments and Substitution Matrices Inserting Gaps Dynamic Programming Database Searches

Sequence Alignment Comparing sequences for Similarity Homology Prediction of function of genes and proteins Construction of phylogeny Finding motifs

Sequence Alignment - HOMOLOGY Orthologues : any gene pairwise relation where the ancestor node is a speciation event. Often have similar function Paralogues : any gene pairwise relation where the ancestor node is a duplication event. Paralogs tend to have different functions

Sequence Alignment - HOMOLOGY

Sequence Alignment - HOMOLOGY

Sequence Alignment - PHYLOGENY

Sequence Alignment – PROTEIN FUNCTIONS

Sequence Alignment – FINDING MOTIFS

Scoring Alignments and Substitution Matrices The quality of an alignment is measured by giving it a quantitative score The simplest way of quatifying similarity between two sequences is percentage identity. Simply measured by counting the number of identical bases or amino acids matched between the aligned sequences.

Scoring Alignments and Substitution Matrices The dot-plot gives a visual assesment of similarity based on identity. [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Scoring Alignments and Substitution Matrices Percentage identity is a relatively crude measure and does bot give a complete picture of the degree of similarity of two sequences. Scoring identical matches 1 and mismatches as 0 ignores the fact that the type of amino acids involved is highly significant.

Scoring Alignments and Substitution Matrices Genuine matches may not be identical: Seq1: T H I S I S A S E Q U E N C E Seq1: T H A T _ _ _ S E Q U E N C E Isoleucine – Alanine: both hydrophobic Serine – Threonine : both polar

Scoring Alignments and Substitution Matrices Scoring pairs of amino acids: with similar properties  higher scores With different properties  lower scores

Scoring Alignments and Substitution Matrices To assign scores for alignmens use SUBSTITUTION MATRICES [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Scoring Alignments and Substitution Matrices Different types of substitution matrices are being used based on: The number of mutations required for convertion of one amino acid to the other Similarities in physicochemical properties.

Scoring Alignments and Substitution Matrices PAM substitution matrices: Use closely related protein sequences to derive substitution frequencies Accepted Point Mutations per 100 residues 250 PAM  250 mutation on 100 residues

Scoring Alignments and Substitution Matrices BLOSUM substitution matrices: BLOcks of Amino Acid SUbstitution Matrix Use mutation data from highly conserved local regions BLOSUM 62  62% identity

Scoring Alignments and Substitution Matrices Which matrix to use ? Depends on the problem properties, Distantly related sequences : PAM 250 – BLOSUM 50 Closely related sequences: PAM 120, BLOSUM 80

Scoring Alignments and Substitution Matrices Which matrix to use ? Some special purpose matrices (SLIM and PHAT are designed for membrane proteins) The length of the sequende is important Short sequences  PAM 40 or BLOSUM 80 Long sequences  PAM 250 or BLOSUM 50

Scoring Alignments and Substitution Matrices BLOSUM – 62 and PAM 120 [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Inserting Gaps Gap insertion requires a scoring penalty (gap penalty). To achieve correct matches gaps are required Alignment programs use gap penalties to limit the introduction of gaps in the alignments

Inserting Gaps Insertions tend to be several residues long rather than just a single residue long Fewer insertions and deletions occur in sequences of structural importance Smaller penalty on lengthening an existing gap (gap extension penalty) than introducing a new gap Gap penaly is high  the number of gaps will be decreased Gap penalty is low  more and large gaps will be inserted.

Inserting Gaps Choosing gap penalties: Linear Affine Gap open penalty Gap extension penlty

Dynamic Programming Global and Local alignments Pairwise and Multiple alignments [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Dynamic Programming For a pair of sequences there is a large number of possible alignments. 2 sequences of length 1000 have appriximately 10600 different alignments.

Dynamic Programming Dynamic Programming: Problem can be divided into many smaller parts. Optimal alignment will not contain parts that are not themselves optimal. Start from sufficiently short sub-sequences. Alignement is additive:

Dynamic Programming Needleman and Wunsch were the first to propose this method. Find optimal global alignments. Align sequences: Seq1: x (x1x2x3…xm) Seq1: y (y1y2y3…yn)

Dynamic Programming s(a,b) = score of aligning a and b F(i,j) = optimal similarity of X(1:i) and Y(1:j) Recurrence relation: – F(i,0) = Σ s(X(k), gap), 0 <= k <= i – F(0,j) =Σ s(gap, B(k)), 0 <= k <= j – F(i,j) = max [ F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)] – Assume linear gap penalty

Dynamic Programming Sequences: Gaps: Scoring matrix (BLOSUM-62)  X = THISLINE, Y = ISALIGNED Gaps: Linear gap penalty (E=8) Scoring matrix (BLOSUM-62) 

Dynamic Programming Matrix S of optimal scores of sub-sequence alignments. [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Dynamic Programming [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Dynamic Programming S(I, T) = -1,

Dynamic Programming [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Dynamic Programming [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Dynamic Programming I -- H I -- gap gap -- H [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Dynamic Programming S(I, H) = -3, S(I, gap) = -8, S(gap, H) = -8 Recurrence relation: F(i,j) = max [ F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)] [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Dynamic Programming S(I, H) = -3, S(I, gap) = -8, S(gap, H) = -8 Recurrence relation: F(i,j) = max [ F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)] [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Dynamic Programming [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Dynamic Programming Linear gap penalty (E=4) [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Dynamic Programming Example:

Dynamic Programming Example (cont.):

Dynamic Programming Example (cont.):

Dynamic Programming Semi – global alignment: When we treat terminal gaps differently than internal gaps How to modify dynamic programming to be able to make semi – global alignment ?

Dynamic Programming Local alignment: If we compare a sequence to whole genome Find sub-strings whose optimal global alignment value is maximum

Dynamic Programming

Dynamic Programming What is the difference between global and local alignment ? Can we define the recuernce relation of local alignment similar to global alignment ?

Dynamic Programming Recurrence relation of GLOBAL ALIGNMENT: (Needleman & Wunsch) – F(i,0) = Σ s(X(k), gap), 0 <= k <= i – F(0,j) =Σ s(gap, B(k)), 0 <= k <= j – F(i,j) = max [ F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)]

Dynamic Programming Recurrence relation of LOCAL ALIGNMENT: (Smith-Waterman) – F(i,0) = 0 – F(0,j) = 0 – F(i,j) = max [ 0, F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)]

Dynamic Programming

Dynamic Programming

Database Searches FASTA and BLAST Use some heuristics Dynamic Programming Complexity Time O(n*m) Space O(n*m)

Database Searches FASTA Good local alignment should have some exact match subsequence. Find all k-tuples. (k=1-2 for proteins, 3-6 for DNA sequences) Protein k – tuples  nc, sp, … (k = 2) Nucleotide k – tuples  TAAA, CTCC,…(k = 4)

Database Searches FASTA If k = 3 for nucleotide sequences. There will be 64 possible k – tuples Assign a number e( ): e(A) = 0, e(C) = 1, e(G) = 2, e(T) = 3 Each 3 – tuples are represented as  xi xi+1xi+2 Assign a number to each 3 – tuple Ci = e(xi)42 + e(xi+1)41 + e(xi+2)40 For example: AAA  AAA 042 + 041 + 040 = 0 CAA 142 + 041 + 040 = 16

Database Searches FASTA Find each occurance of k – tuples in the sequences. Chaining  Look – Up Tables Consider TAAAACTCTAAC (if k = 3): 3 - tuples Position AAA (0) 2, 3 AAC (1) 4, 10 AAG (2) AAT (3) …

Database Searches FASTA Find hot – spots on same diagonal [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Database Searches FASTA Consider the highes scoring alignments [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

Database Searches FASTA Join the best ones. You can use DP

Database Searches BLAST Use short words to search the database sequence. Searches for k – mers that will score above a threshold (T) value when aligned with query k - mer (Remember FASTA looks for k – tuples which are identical). Use a scheme based on finite state automata (Remember FASTA use hashing and chaining fot rapid identification of k - tuples)

Database Searches BLAST From Query Sequence, create query words (for protein sequences word size is 3)

Database Searches BLAST Blast uses a list of high scoring words created from words similar to query words. Considers the words with a score bigger than a threshold value.

Database Searches BLAST Scan each database sequence for an exact match to the list of words. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S".

Database Searches BLAST Keep only the extended matches that have a score at least S. Determine statistical significance of each remaining match.

Database Searches BLAST http://blast.ncbi.nlm.nih.gov/Blast.cgi Database Searches BLAST

Database Searches BLAST

Database Searches BLAST

Database Searches BLAST

Database Searches BLAST

Database Searches BLAST [Loxodonta cyclotis] [Elephas maximus maximus] [Elephas maximus indicus]

Database Searches HISTORY 1970: NW 1980: SW 1985: FASTA 1989: BLAST

References M. Zvelebil, J. O. Baum, “Understanding Bioinformatics”, 2008, Garland Science Andreas D. Baxevanis, B.F. Francis Ouellette, “Bioinformatics: A practical guide to the analysis of genes and proteins”, 2001, Wiley. NCBI HelpDesk - Field Guide Course Materials