Sequence Alignment Xuhua Xia


Xuhua Xia Slide 2 Fundamental concepts
The purpose of sequence alignment:
– Identification of sequence homology and homologous sites
– Homology: similarity that results from inheritance from a common ancestor (identification and analysis of homologies is central to phylogenetic systematics).
– An alignment is a hypothesis of positional homology between bases or amino acids.
The fundamental assumption: shared ancestry that is computationally identifiable.
– The criterion: maximum alignment score given a scoring scheme
– The algorithm: dynamic programming

Xuhua Xia Slide 3 Normal and Thalassemia HBb
Are the two genes homologous?
What evolutionary change can you infer from the alignment?
What is the consequence of the evolutionary change?

         |----|----|----|----|----|----|----|----|----|----|----|--
Normal   AUGGUGCACCUGACUCCUGAGGAGAAGUCUGCCGUUACUGCCCUGUGGGGCAAGGUGAACGU
Thalass. AUGGUGCACCUGACUCCUGAGGAGAAGUCUGCCGUUACUGCCCUGUGGGGCAAGGUGAACGU
         **************************************************************

         |----|----|----|----|----|----|----|----|----|----|----|----
Normal   GGAUGAAGUUGGUGGU-GAGGCCCUGGGCAGGUUGGUAUCAAGGUUACAAGACAGG
Thalass. GGAUGAAGUUGGUGGUUGAGGCCCUGGGCAGGUUGGUAUCAAGGUUACAAGACAGG
         **************** ***************************************

Xuhua Xia Slide 4 Janecka, JE et al. Science 318:792

Xuhua Xia Slide 5 Testing phylogenetic hypotheses
[Figure: two alternative trees. Tip labels of the first tree: OutGroup, “Reptilian”, Mammal, Bird. Tip labels of the second tree: OutGroup, Mammal, Bird, “Reptilian”.]
“Using molecular sequence data to determine the phylogenetic relationships of the major groups of organisms has yielded some spectacular successes but has also thrown up some conundrums. One such is the relationship of birds to the rest of the tetrapods. Morphological data and most molecular studies have placed the birds closer to the crocodiles than to any other tetrapod group, but analysis of sequence data from 18S ribosomal RNA (rRNA) has persistently allied the birds more closely to the mammals. There have been several attempts to account for this niggling doubt, and Xia et al. (2003, Syst. Biol. 52:283) now show that the discrepancy arose because of methodological flaws in the analysis of 18S rRNA data, which caused, among other things, misalignment of sequences from the different taxa. When structure-based alignment is carried out, the resulting phylogeny matches those obtained by other means, with the birds allied to the crocodiles via a common reptilian ancestor.”
Science 301:279 (Editors’ Choice)

Xuhua Xia Slide 6 An alignment is a hypothesis
Compare “Favorite” with “Favourite”. After alignment, we have
Favo-rite
Favourite
Assumptions and inferences we have implicitly made:
– The two words share ancestry, with one having evolved from the other or both from a common ancestor.
– If “Favorite” is ancestral, then an insertion of “u” between sites 4 and 5 has happened during the evolution of the word.
– If “Favourite” is ancestral, then a deletion of “u” at site 5 has happened during the evolution of the word.
– Note the importance of knowing the ancestral states: assuming a wrong ancestral state will lead to a wrong inference of evolutionary events.
Or alternatively (diagram labels): Favor, Favour, Favorite, Favourite
Much of modern science is about formulating and testing hypotheses.

Xuhua Xia Slide 7 Type of Alignment
Dynamic programming to obtain optimal alignments
– Global alignment
  Representative: Needleman & Wunsch 1970
  Application: aligning homologous genes, e.g., between normal and mutant sequences of human hemoglobin β-chain
– Local alignment
  Representative: Smith & Waterman 1981
  Application: searching local similarities, e.g., homeobox genes
Heuristic alignment methods
– D. E. Knuth 1973: author of TeX
– D. J. Lipman & W. R. Pearson 1985: FASTA algorithm
– S. Altschul et al.: BLAST algorithm
All heuristic alignment methods currently in use are for local sequence alignment, i.e., searching for local similarities.
This lecture focuses on global alignment by dynamic programming.

Xuhua Xia Slide 8 What is an optimal alignment?
An alignment is the stacking of two or more sequences:
– Alignment1:
  Favorite-
  Favourite
– Alignment2:
  ---Favorite
  Favourite--
– Alignment3:
  Favorite
  Favourite
– Alignment4:
  Favo-rite
  Favourite
An optimal alignment
– One with the maximum number of matches and the minimum number of mismatches and gaps
– Operational definition: the one with the highest alignment score given a particular scoring scheme (e.g., match: 2, mismatch: -1, gap: -2)
Which of the 4 alignments above is the optimal alignment? We need a criterion.
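
The operational definition can be tried directly on the candidate alignments. The sketch below is illustrative only: the function name `alignment_score` and the dictionary layout are assumptions, not part of the lecture. It scores each column with the example scheme (match 2, mismatch -1, gap -2). Alignment3 is left out because its two rows differ in length, so it cannot be scored column by column without padding.

```python
def alignment_score(top, bottom, match=2, mismatch=-1, gap=-2):
    """Sum column scores for two aligned rows of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap        # a gap in either row
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

# Candidate alignments copied from the slide (Alignment3 omitted, see above).
candidates = {
    "Alignment1": ("Favorite-", "Favourite"),
    "Alignment2": ("---Favorite", "Favourite--"),
    "Alignment4": ("Favo-rite", "Favourite"),
}
scores = {name: alignment_score(*rows) for name, rows in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Under this scheme Alignment4 scores highest, which matches the intuition that a single gap opposite the "u" explains the difference between the two words.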

Xuhua Xia Slide 9 Importance of scoring schemes
Two alternative alignments:
Alignment 1:
  ACCCAGGGCTTA
  ACCCGGGCTTAG
Alignment 2:
  ACCCAGGGCTTA-
  ACCC-GGGCTTAG
Scoring scheme 1: Match: 2, mismatch: 0, gap: -5
Scoring scheme 2: Match: 2, mismatch: -1, gap: -1
Which of the two is the optimal alignment according to scoring scheme 1? Which according to scoring scheme 2?
Importance of biological input concerning scoring schemes
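
One way to answer the two questions is to count matched, mismatched and gapped columns once and then apply each scheme to the counts. This is a minimal sketch; the helper names `column_counts` and `score` are hypothetical.

```python
from collections import Counter

def column_counts(top, bottom):
    """Count match, mismatch and gap columns of an alignment."""
    counts = Counter()
    for x, y in zip(top, bottom):
        if "-" in (x, y):
            counts["gap"] += 1
        elif x == y:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts

def score(counts, scheme):
    """Weighted sum of the column counts under a scoring scheme."""
    return sum(counts[kind] * weight for kind, weight in scheme.items())

aln1 = column_counts("ACCCAGGGCTTA", "ACCCGGGCTTAG")
aln2 = column_counts("ACCCAGGGCTTA-", "ACCC-GGGCTTAG")
scheme1 = {"match": 2, "mismatch": 0, "gap": -5}
scheme2 = {"match": 2, "mismatch": -1, "gap": -1}

for name, scheme in (("scheme 1", scheme1), ("scheme 2", scheme2)):
    print(name, score(aln1, scheme), score(aln2, scheme))
```

Alignment 1 has 7 matches and 5 mismatches; Alignment 2 has 11 matches and 2 gaps. Scheme 1 therefore prefers Alignment 1 (14 versus 12), while scheme 2 prefers Alignment 2 (20 versus 9).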

Xuhua Xia Slide 10 Dynamic Programming
Constant gap penalty. Scoring scheme:
– Match (M): 2
– Mismatch (MM): -1
– Gap (G): -2
For each cell, compute three values and keep the largest:
– Upleft value + IF(Match, M, MM)
– Left value + G
– Up value + G
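
The cell rule above can be sketched as a small global-alignment routine in the style of Needleman & Wunsch. This is an illustrative implementation under the slide's constant gap penalty, not the lecturer's own code; the function name and the tie-breaking order in the traceback (diagonal, then up, then left) are assumptions.

```python
def needleman_wunsch(s, t, match=2, mismatch=-1, gap=-2):
    """Global alignment with a constant gap penalty.

    Returns (score, aligned_s, aligned_t).
    """
    n, m = len(s), len(t)
    # score[i][j] = best score for aligning s[:i] with t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # upleft value
                score[i - 1][j] + gap,            # up value
                score[i][j - 1] + gap,            # left value
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

result = needleman_wunsch("Favorite", "Favourite")
print(result)
```

For the word pair used earlier in the lecture, the routine recovers the single-gap alignment with a score of 14. When several alignments tie, the traceback order decides which one is reported.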

Xuhua Xia Slide 11 Scoring Schemes
We have used a very simple scoring scheme: Match (S_m): 2; mismatch (S_mm): -1; Gap (S_g): -3
Different scoring schemes differ in two ways:
– Match-mismatch matrices
– Treatment of gap penalties

Xuhua Xia Slide 12 Match-Mismatch Matrices
Nucleotide:
– Identity matrix
– International IUB matrix
– Transition bias matrix
Amino acid:
– BLOSUM (BLOcks SUbstitution Matrix, Henikoff and Henikoff 1992)
– PAM (Dayhoff 1978)
– StadenPep
– StructGapPep
– Gonnet
– JTT92
– JohnsonOverington
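
To make the difference between the first and third nucleotide matrices concrete, the sketch below builds both as Python dictionaries. The numeric values are placeholders chosen for illustration and are not taken from the lecture or from any published matrix. The only idea shown is that a transition bias matrix penalises transitions (purine to purine, pyrimidine to pyrimidine) less than transversions.

```python
BASES = "ACGT"
PURINES = {"A", "G"}  # C and T are the pyrimidines

def identity_matrix(match=1, mismatch=0):
    """Every mismatch gets the same score (illustrative values)."""
    return {(x, y): (match if x == y else mismatch)
            for x in BASES for y in BASES}

def transition_bias_matrix(match=2, transition=-1, transversion=-2):
    """Transitions are penalised less than transversions (illustrative values)."""
    matrix = {}
    for x in BASES:
        for y in BASES:
            if x == y:
                matrix[(x, y)] = match
            elif (x in PURINES) == (y in PURINES):
                matrix[(x, y)] = transition    # A<->G or C<->T
            else:
                matrix[(x, y)] = transversion  # purine <-> pyrimidine
    return matrix

ident = identity_matrix()
biased = transition_bias_matrix()
print(biased[("A", "G")], biased[("A", "C")])
```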

Xuhua Xia Slide 13 Gap penalty
Different scoring schemes differ in two ways; the second is the treatment of gap penalties.
Simple:
– Match (S_m): 2; mismatch (S_mm): -1; Gap (S_g): -3
– Alignment score = N_m * S_m + N_mm * S_mm + N_g * S_g, where N_m, N_mm and N_g are the numbers of matches, mismatches and gap positions
Affine function:
– Gap open: q, gap extension: r (q and r < 0)
– Gap penalty: q + N_r * r
– Alignment score = N_m * S_m + N_mm * S_mm + N_q * q + N_r * r, where N_q is the number of gap openings and N_r the number of gap extensions
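
The affine formula can be sketched as a scoring function for an existing alignment. One convention has to be assumed: here the first position of each gap is charged q and every further position of the same gap is charged r, so a gap of length L costs q + (L - 1) * r. The example values (match 1, mismatch -3, q = -5, r = -2) and the two test alignments are those of the nucleotide example at the end of this lecture, and this convention gives the scores reported there (-5 and 0). The function name `affine_score` is hypothetical.

```python
def affine_score(top, bottom, match=1, mismatch=-3, gap_open=-5, gap_ext=-2):
    """Score an alignment with an affine gap penalty.

    Assumed convention: the first position of a gap costs gap_open,
    each additional position in the same row costs gap_ext.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    previous_gap_row = None  # row holding the gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            total += gap_ext if row == previous_gap_row else gap_open
            previous_gap_row = row
        else:
            total += match if x == y else mismatch
            previous_gap_row = None
    return total

codon_gap = affine_score("ATGCCGGGA---CAA", "ATGCCCGGGATTCAA")
split_gap = affine_score("ATGCC-GGGA--CAA", "ATGCCCGGGATTCAA")
print(codon_gap, split_gap)
```

The second alignment scores higher because it removes both mismatches, even though it opens two gaps and splits codons.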

Xuhua Xia Slide 14 Alignment with secondary structure
Kjer, 1995; Notredame et al., 1997; Hickson et al., 2000; Xia 2000; Xia et al
Sequences:
Seq1: CACGACCAATCTCGTG
Seq2: CACGGCCAATCCGTG
Correct alignment:
Seq1: CACGACCAATCTCGTG
Seq2: CACGGCCAAT-CCGTG
Stem pairing (figure labels in source order):
Seq1: CACGA
      |||||
      GUGCU
Seq2: CACGA
      |||||
      GUGCU
Seq2: CACGG
      |||||
      GUGCC
Missing link
Conventional alignment:
Seq1: CACGACCAATCTCGTG
Seq2: CACGGCCAATC-CGTG
Figure labels: Deletion; Correlated substitution
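
The structural argument rests on the two ends of each sequence forming a stem. A quick check, sketched below, confirms that the first five bases of each sequence pair with its last five, which is why the A/G and T/C differences are read as a correlated substitution. The stem length of 5 is taken from the pairing diagram; the function names are hypothetical and only Watson-Crick pairs are considered.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string (Watson-Crick pairs only)."""
    return seq.translate(COMPLEMENT)[::-1]

def stem_is_paired(seq, stem_len=5):
    """True if the 5' end can pair with the 3' end over stem_len bases."""
    return seq[:stem_len] == reverse_complement(seq[-stem_len:])

seq1 = "CACGACCAATCTCGTG"
seq2 = "CACGGCCAATCCGTG"
print(stem_is_paired(seq1), stem_is_paired(seq2))
```

Both stems are fully paired, so the base at stem position 5 and its partner changed together. The structure-aware alignment keeps those partners in the same columns and places the gap in the loop instead.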

Xuhua Xia Slide 15 Type of alignment
– Align two sequences: pairwise alignment by dynamic programming
– Align one sequence against a profile (i.e., a set of aligned sequences): profile alignment
– Align two sequence profiles: profile-profile alignment

Xuhua Xia Slide 16 Multiple Alignment: Guide Tree
Guide tree tip labels: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10 Seq11 Seq12 Seq13 Seq14 Seq15 Seq16

Unaligned     Aligned
ATTCCAAG      ATT-CC-AAG
ATTTCCAAG     ATTTCC-AAG
ATTCCCAAG     ATTCCC-AAG
ATCGGAAG      AT--CGGAAG
ATCCGAAG      AT-CCG-AAG
ATCCAAAG      AT-CCA-AAG
AATTCCAAG     AATTCC-AAG
AATTTCCAAG    AATTTCCAAG
AATTCCCAAG    AATTCCCAAG
AAGTCCAAG     AAGTCC-AAG
AAGTCAAG      AAGTC--AAG

Xuhua Xia Slide 17 Aligned FOXL2 Sequences

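Xuhua Xia Slide 18 Align nucleotide sequences against aligned amino acid sequences
Nucleotide sequences:
S1 ATG CCG GGA CAA
S2 ATG CCC GGG ATT CAA
Translated amino acid sequences:
S1 MPGQ
S2 MPGIQ
Aligned amino acid sequences:
S1 MPG-Q
S2 MPGIQ
Alignment 1:
S1 ATG CCG GGA --- CAA
S2 ATG CCC GGG ATT CAA
   *** **  **      ***
Alignment 2:
S1 ATG CC- GGG A-- CAA
S2 ATG CCC GGG ATT CAA
   *** **  *** *   ***
Scoring: Match: 1, Mismatch: -3; Gap open: 5, Gap extension: 2 (both gap values are penalties)

             Match  Mismatch  GO  GE  Score
Alignment 1    10       2      1   2    -5
Alignment 2    12       0      2   1     0

The counts are read off the two alignments above; GE counts the gap positions after the first one in each gap.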
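
The mapping from an aligned amino acid sequence back to its codons can be sketched as follows. This is a minimal illustration, assuming the nucleotide sequence is in frame and encodes exactly the residues in the aligned protein; the translation itself is not verified. The function name `thread_codons` is hypothetical.

```python
def thread_codons(nuc, aligned_aa):
    """Place codons of nuc under the columns of an aligned protein.

    Each residue consumes one codon; each '-' in the protein becomes '---'.
    Assumes nuc is in frame and encodes the residues of aligned_aa.
    """
    if len(nuc) % 3 != 0:
        raise ValueError("nucleotide length must be a multiple of 3")
    codons = [nuc[i:i + 3] for i in range(0, len(nuc), 3)]
    residues = sum(1 for aa in aligned_aa if aa != "-")
    if residues != len(codons):
        raise ValueError("codon count does not match residue count")
    remaining = iter(codons)
    return " ".join("---" if aa == "-" else next(remaining)
                    for aa in aligned_aa)

s1 = thread_codons("ATGCCGGGACAA", "MPG-Q")
s2 = thread_codons("ATGCCCGGGATTCAA", "MPGIQ")
print(s1)
print(s2)
```

The output reproduces Alignment 1: gaps fall on codon boundaries, so the reading frame is preserved even though Alignment 2 scores higher at the nucleotide level.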