Pair-wise Sequence Alignment

Presentation transcript:

Pair-wise Sequence Alignment
- What happened to the sequences of similar genes?
  - random mutation
  - deletion, insertion
Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG
            EVI   ++ P++ ++DV+SY
Seq. 2: 451 EVI---EHKPYNHKADVFSYA
- Homology vs. similarity
- What is pair-wise sequence alignment?
- Why pair-wise alignment?

Some concepts
- Optimal alignment
- Global alignment
- Gaps
- Local alignment
- Gap penalty
- Substitution matrix

Dotplot
- What a dotplot shows
- What a dotplot does not show
- A simplified representation
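
As a rough illustration of the dotplot idea, the sketch below marks every cell where the residue of one sequence equals the residue of the other, so shared stretches show up as diagonal runs. The function name `dotplot` and the `*` / `.` rendering are assumptions made for this example, not part of the slides.

```python
def dotplot(a, b):
    """Return one text row per residue of `a`; '*' marks identity with `b`."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


if __name__ == "__main__":
    # Identical sequences give an unbroken main diagonal.
    for row in dotplot("ACG", "ACG"):
        print(row)
```

A plain identity dotplot like this shows where similar regions lie, but it does not by itself score them or choose one alignment.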

Sequence Alignment
- Dynamic programming: a method for some optimization problems
  - determine a scoring scheme
  - find the best solution based on that scoring scheme
- Total number of possible alignments of two sequences of length n: ~ $2^{2n} / \sqrt{2\pi n}$
- Needleman-Wunsch: global alignment

Questions
- How does it work?
- How to come up with a DP approach to an exponential problem?
- How to implement a DP approach?

Dynamic Programming Algorithm
- Break a problem into subproblems
- Solve each subproblem separately
\[
F(i,j) = \max \left\{ \begin{array}{l} F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) + g \\ F(i-1,j) + g \end{array} \right.
\]
- s(x_i, y_j): substitution score for aligning x_i with y_j
- g: gap penalty
- F(i,j): the maximum score for aligning the first i symbols of sequence 1 with the first j symbols of sequence 2

Example
- Steps: initialization, matrix filling (scoring), traceback
- Sequences: ACTCG and ACAGTAG
- Scores: match 1, mismatch 0, gap -1

[DP matrix for the example: rows i = 0..5 for ACTCG, columns j = 0..7 for ACAGTAG]
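
The three steps of the example (initialization, matrix filling, traceback) can be sketched as below, using the F(i,j) recurrence and the example scores (match 1, mismatch 0, gap -1). The function name `needleman_wunsch` and the tie-breaking order in the traceback are choices made for this sketch; when several optimal alignments exist, it returns only one of them.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, gap=-1):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    # Initialization: aligning a prefix against nothing costs one gap per symbol.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    # Matrix filling with the three-way max.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i][j - 1] + gap, F[i - 1][j] + gap)
    # Traceback from (n, m) to (0, 0); diagonal moves are tried first.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("ACTCG", "ACAGTAG")
    print(score)
    print(top)
    print(bottom)
```

For ACTCG and ACAGTAG the optimal score is 2: at most four residues can match, and the length difference forces at least two gap positions.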

Local Alignment: Smith-Waterman
- Biological significance
\[
F(i,j) = \max \left\{ \begin{array}{l} F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) + g \\ F(i-1,j) + g \\ 0 \end{array} \right.
\]
- O(n^2) time

Local Alignment
- Sequences: AACCTATAGCT and GCGATATA
- Best local alignment:
AACCTATAGCT
    ||||
GCGATATA
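
A minimal Smith-Waterman sketch for the two sequences above follows. The slides do not state the scores used for this example, so match 1, mismatch -1 and gap -1 are assumptions, as is the function name `smith_waterman`. The only changes from the global version are the extra 0 in the max, starting the traceback at the largest cell, and stopping when a 0 is reached.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i][j - 1] + gap, F[i - 1][j] + gap)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Traceback from the best cell until a zero is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(smith_waterman("AACCTATAGCT", "GCGATATA"))
```

Under these assumed scores the best local alignment is the shared TATA, with score 4.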

Issues in alignment
- Different ways to fill the table
- Multiple optimal alignments
- s(x_i, y_j): taken from a substitution matrix
- Gap penalty:
  - linear: w(k) = gk
  - affine: w(k) = h + gk for k >= 1, and w(0) = 0

Gap models
- New gap vs. gap extension
- A gap of length k vs. k gaps of length 1
- 1 insertion/deletion event vs. k events
- Gap penalty:
  - linear: w(k) = gk
  - affine: w(k) = h + gk for k >= 1, and w(0) = 0
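
To make "a gap of length k vs. k gaps of length 1" concrete, the sketch below evaluates both penalty functions. The values h = -10 and g = -1 are hypothetical, chosen only to make the difference visible; the function names are likewise assumptions.

```python
def linear_gap(k, g):
    """Linear penalty w(k) = g*k."""
    return g * k


def affine_gap(k, h, g):
    """Affine penalty w(k) = h + g*k for k >= 1, and 0 for k = 0."""
    return 0 if k == 0 else h + g * k


if __name__ == "__main__":
    h, g = -10, -1  # hypothetical gap-open and gap-extend values
    print(linear_gap(3, g), 3 * linear_gap(1, g))          # same cost either way
    print(affine_gap(3, h, g), 3 * affine_gap(1, h, g))    # one long gap is cheaper
```

With a linear penalty the two cases cost the same (-3 and -3). With the affine penalty, one gap of length 3 costs -13 while three separate gaps cost -33, which reflects one insertion/deletion event being more plausible than three.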

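Affine Gap Penalty
- What is wrong with the F(i,j) formula if an affine gap penalty (AGP) is used? (F(i,j) alone does not record whether the previous column ended in a gap, so it cannot tell a gap opening from a gap extension.)
- Three matrices, for aligning the first i symbols of x with the first j symbols of y:
  - M(i,j): best score when x_i is aligned with y_j
  - I_x(i,j): best score when x_i is aligned with a gap
  - I_y(i,j): best score when y_j is aligned with a gap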

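DP for global alignment with AGP
\[
M(i,j) = \max \left\{ \begin{array}{l} M(i-1,j-1) + s(x_i, y_j) \\ I_x(i-1,j-1) + s(x_i, y_j) \\ I_y(i-1,j-1) + s(x_i, y_j) \end{array} \right.
\]
\[
I_x(i,j) = \max \left\{ \begin{array}{l} M(i-1,j) + h + g \\ I_y(i-1,j) + h + g \\ I_x(i-1,j) + g \end{array} \right.
\]
\[
I_y(i,j) = \max \left\{ \begin{array}{l} M(i,j-1) + h + g \\ I_x(i,j-1) + h + g \\ I_y(i,j-1) + g \end{array} \right.
\]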

DP for global alignment using AGP
- Initialization:
  - M(0,0) = 0
  - I_x(i,0) = h + gi
  - I_y(0,j) = h + gj
  - all other cases: $-\infty$
- Traceback: start at the largest of M(m,n), I_x(m,n), I_y(m,n) and trace back to (0,0)
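
The three-matrix recurrences and the initialization above can be sketched as follows. This version computes only the optimal score, with no traceback; the function name `affine_global_score` and the default values h = -2, g = -1, match 1, mismatch 0 are assumptions for illustration.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, h=-2, g=-1, match=1, mismatch=0):
    """Optimal global alignment score under the affine gap penalty w(k) = h + g*k."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    # Initialization as on the slide; every other border cell stays at -infinity.
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = h + g * i
    for j in range(1, m + 1):
        Iy[0][j] = h + g * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # x_i aligned with y_j
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # x_i aligned with a gap: open a new gap (h + g) or extend one (g)
            Ix[i][j] = max(M[i - 1][j] + h + g, Iy[i - 1][j] + h + g, Ix[i - 1][j] + g)
            # y_j aligned with a gap
            Iy[i][j] = max(M[i][j - 1] + h + g, Ix[i][j - 1] + h + g, Iy[i][j - 1] + g)
    return max(M[n][m], Ix[n][m], Iy[n][m])


if __name__ == "__main__":
    print(affine_global_score("ACGT", "AT"))
```

For ACGT against AT with these assumed values, the best alignment pairs A with A and T with T and pays for a single gap of length 2: 2 + (h + 2g) = 2 - 4 = -2.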

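DP for local alignment with AGP
\[
M(i,j) = \max \left\{ \begin{array}{l} M(i-1,j-1) + s(x_i, y_j) \\ I_x(i-1,j-1) + s(x_i, y_j) \\ I_y(i-1,j-1) + s(x_i, y_j) \\ 0 \end{array} \right.
\]
\[
I_x(i,j) = \max \left\{ \begin{array}{ll} M(i-1,j) + h + g & \\ I_y(i-1,j) + h + g & \text{(ignored)} \\ I_x(i-1,j) + g & \end{array} \right.
\]
\[
I_y(i,j) = \max \left\{ \begin{array}{ll} M(i,j-1) + h + g & \\ I_x(i,j-1) + h + g & \text{(ignored)} \\ I_y(i,j-1) + g & \end{array} \right.
\]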

DP for Local Alignment with AGP
- Initialization:
  - M(0,0) = 0
  - I_x(i,0) = 0
  - I_y(0,j) = 0
  - all other cases: $-\infty$
- Traceback: start at the largest of M(i,j), I_x(i,j), I_y(i,j) and trace back until M(i,j) = 0

Database searching methods
- Need more efficient methods
- Dynamic programming: O(n^2 L), where L is the size of the database
- Why is DP slow?
- Ideas:
  - Regions that are similar are likely to share short identical subsequences
  - Quick search for the regions, then check carefully locally

FASTA related methods
- Heuristic approximation
1. Quick initial “guess”: common subsequences
  - Word, word size (2, 6), sensitivity vs. speed
  - Which words in the query also occur in the target?
  - Pre-computed table that stores the locations of words (“hashing”)
  - An example

FASTA related methods
2. Find the region with a high density of common words
  - Process diagonals, rescore, join regions using gaps
3. Local alignment (DP) in the region identified
  - Use the Smith-Waterman method in a band 32 aa wide around the best score
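
The word table and the diagonal step can be sketched as below: index every word of the query by position, scan the target, and count word hits per diagonal (query position minus target position). A diagonal with many hits marks a candidate region. The names `word_index` and `diagonal_hits` and the toy sequences are assumptions; this covers only the initial lookup, not FASTA's rescoring, region joining, or banded DP.

```python
from collections import Counter, defaultdict


def word_index(seq, k):
    """Map each length-k word of `seq` to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k=2):
    """Count identical-word hits per diagonal d = query_pos - target_pos."""
    index = word_index(query, k)
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            hits[i - j] += 1
    return hits


if __name__ == "__main__":
    hits = diagonal_hits("GATTACA", "TTGATTACA", k=2)
    print(hits.most_common(1))  # the diagonal with the most shared words
```

Here the query occurs in the target shifted by two positions, so all six of its 2-letter words land on diagonal -2, while a stray TT match produces a single hit on diagonal 2.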

Limitation of FASTA
- Speed vs. sensitivity
- Can miss biologically significant similarity
  - The initial step requires identical words
  - Some proteins do not share identical a.a.
  - Different codons encode the same protein

BLAST
- Previous 2 kinds of approaches
  - Theoretically sound
  - Search for common subsequences
1. Word list
  - Incorporate a similarity measurement for words (PAM120), e.g. ACDE
  - Scan for word occurrences: hash table, finite state machine
(Stephen F. Altschul et al., Nucleic Acids Research 1997, 25(17))
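
The word-list step can be illustrated as follows: instead of requiring identical words, keep every word whose similarity score against the query word reaches a threshold. The sketch uses the slide's example word ACDE, but with a toy scoring function (match 2, mismatch -1) over a four-letter alphabet rather than PAM120. The scores, the alphabet, the threshold of 5 and the name `neighborhood` are all assumptions for illustration.

```python
from itertools import product


def neighborhood(word, alphabet, score, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            result.append("".join(candidate))
    return result


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix such as PAM120."""
    return 2 if a == b else -1


if __name__ == "__main__":
    words = neighborhood("ACDE", "ACDE", toy_score, threshold=5)
    print(len(words), words[:5])
```

With these toy values the list holds the exact word plus every single-substitution variant. Exhaustive enumeration is used here only for brevity.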

BLAST
2. Extend words to HSPs (locally optimal pairs)
  - Find additional words within threshold
  - Merge within distance A
3. Select significant HSPs, use DP in a banded region

Mini Presentations
1. Previous BLAST
2. Major concepts in BLAST
3. Statistical issue
4. Gapped local alignment - Gapped
5. Position-specific scoring matrix (PSSM): overall idea, architecture, multiple-alignment construction
6. PSSM: target frequency estimation, application to BLAST
(Stephen F. Altschul et al., Nucleic Acids Research 1997, 25(17))

Multiple Sequence Alignment
- Motivation
- What is MSA?
- How do we extend knowledge of pair-wise alignment?
- An example: AGAC, AC, AG
  Pair-wise alignments:
  AGAC    AGAC    AC
  --AC    AG--    AG
  Some possibilities:
  AG--
  --AC
  AGAC
- Fix pair-wise alignment and then add?
- Evaluate all the possible alignments of N sequences?

Sum of pairs (SP) scoring methods
- Scoring MSA
- Given an alignment of N sequences, each of length L, in the L x N alignment: take the pair-wise sum for each column, then sum over all columns
- Example (c(match) = 1, c(mismatch) = -1, c(gap) = -2, c(gap, gap) = 0):
  AQPILLLV
  ALR-LL--
  AK-ILLL-
  CPPVLILV
  SP_4 = SP(I, -, I, V) = -2 + 1 - 1 - 2 - 2 - 1 = -7
  SP = SP_1 + SP_2 + … + SP_8
- SP tends to overweight a single mutation: SP(A,A,A,C) = 0, SP(A,A,A,A) = 6
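
The SP score of the example can be checked with a short sketch that applies the slide's costs (match 1, mismatch -1, gap -2, gap against gap 0) to every pair in every column. The function names are assumptions for this example.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols taken from an alignment column."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum of pair scores over all pairs of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(alignment):
    """SP score of a whole alignment, given as equal-length rows."""
    return sum(sp_column(column) for column in zip(*alignment))


if __name__ == "__main__":
    msa = ["AQPILLLV", "ALR-LL--", "AK-ILLL-", "CPPVLILV"]
    print(sp_column("I-IV"), sp_score(msa))
```

Column 4 gives -7 as on the slide, and the eight column scores (0, -6, -7, -7, 6, 0, -3, -7) sum to -24 for the whole alignment.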

DP of N dimensions using SP
- Extension of DP for N sequences
- Extend F(i,j) to N dimensions
- Time: on the order of $L^N (2^N - 1) N^2 \sim O((2L)^N N^2)$

STAR method
- DP provides the optimal solution but is costly
- Heuristic methods: STAR, CLUSTALW, …
- Progressive alignment
- STAR:
  - pair-wise alignment
  - build similarity matrix
  - find a “star” sequence
  - use the “star” to align the other sequences
  - once a gap, always a gap
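
The first three STAR steps (pair-wise alignment, similarity matrix, choosing the “star”) can be sketched as below: score every pair with global alignment and pick the sequence whose total similarity to all others is largest. The scores (match 1, mismatch -1, gap -2), the toy sequences and the function names are assumptions. Merging the pair-wise alignments onto the star is not shown.

```python
def nw_score(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment score only, keeping one DP row at a time."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i, a in enumerate(x, 1):
        cur = [i * gap]
        for j, b in enumerate(y, 1):
            s = match if a == b else mismatch
            cur.append(max(prev[j - 1] + s, cur[j - 1] + gap, prev[j] + gap))
        prev = cur
    return prev[-1]


def star_center(seqs):
    """Return (index of the star sequence, total pair-wise score per sequence)."""
    totals = [
        sum(nw_score(s, t) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]
    center = max(range(len(seqs)), key=totals.__getitem__)
    return center, totals


if __name__ == "__main__":
    print(star_center(["ACGT", "ACGA", "ACCA"]))
```

In this toy set ACGA differs from each of the other two sequences by a single substitution, so it has the highest total (4, against 2 for the others) and becomes the star.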

STAR method Example

CLUSTAL family
- What are the disadvantages of the STAR method?
- Build similarity tree: “clustering”
- Alignment starts at the most similar sequences
1. Pair-wise alignment --> distance matrix
  - Fast approximate approach or DP

CLUSTALW
2. Construct similarity tree, “the guide tree”
  - e.g. UPGMA (unweighted pair-group method using arithmetic averages)
3. Progressive alignment
  - Start with the most similar sequences
  - Align group with group using pair-wise alignment
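
A minimal UPGMA sketch for building a guide tree from a distance matrix follows: repeatedly merge the two closest clusters and set the distance from the new cluster to every other cluster to the size-weighted average of the two old distances. The function name `upgma`, the nested-tuple tree representation and the toy distances are assumptions. Branch lengths are not computed, and this is not presented as CLUSTALW's actual implementation.

```python
def upgma(labels, matrix):
    """Build a guide tree (nested tuples) from a symmetric distance matrix."""
    clusters = [(label, 1) for label in labels]  # (subtree, number of leaves)
    d = [row[:] for row in matrix]
    while len(clusters) > 1:
        n = len(clusters)
        # Find the closest pair of clusters (i < j).
        i, j = min(
            ((a, b) for a in range(n) for b in range(a + 1, n)),
            key=lambda p: d[p[0]][p[1]],
        )
        (ti, ni), (tj, nj) = clusters[i], clusters[j]
        keep = [k for k in range(n) if k not in (i, j)]
        # Size-weighted average distance from the merged cluster to the rest.
        new_row = [(ni * d[i][k] + nj * d[j][k]) / (ni + nj) for k in keep]
        d = [[d[a][b] for b in keep] for a in keep]
        for row, value in zip(d, new_row):
            row.append(value)
        d.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep] + [((ti, tj), ni + nj)]
    return clusters[0][0]


if __name__ == "__main__":
    labels = ["A", "B", "C", "D"]
    dist = [
        [0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 10],
        [10, 10, 10, 0],
    ]
    print(upgma(labels, dist))
```

With these toy distances A and B merge first, C joins them next, and D joins last. Progressive alignment would then follow that order, aligning the most similar sequences first.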