Sequence Alignment

Presentation transcript:

Sequence Alignment
Goal: line up two or more sequences
An alignment of two amino acid sequences:
      123456789….
Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP
      HKIYHLQSKVP  +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
Position within the alignment = columns

Sequence Alignment
The alignment is a hypothesis:
- The positions with identical nt/AA were present in the common ancestor
- Differences represent the nt/AA that have diverged since the common ancestor
Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP
      HKIYHLQSKVP  +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP

From extant sequences to evolution

Constructing and evaluating alignments
- How to identify regions of sequence similarity between two sequences?
- How to evaluate the degree of similarity?
- What is the biological significance of the alignment?

Dot Plots: visualization of an alignment
1) Take two English words: THISSEQUENCE and THATSEQUENCE
2) Place the two sequences on the vertical and horizontal axes of a graph
3) Put dots wherever there is a match
4) A diagonal line is a region of identity (local alignment)
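The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a plotting tool: the function name `dot_plot` and the use of `*` and `.` as the dot and blank characters are choices made here, not something the slides specify.

```python
def dot_plot(seq_x, seq_y):
    """Return one text row per letter of seq_y.

    A '*' marks a cell where the letter on the vertical axis (seq_y)
    equals the letter on the horizontal axis (seq_x); '.' marks a mismatch.
    """
    return ["".join("*" if a == b else "." for b in seq_x) for a in seq_y]


horizontal = "THISSEQUENCE"
vertical = "THATSEQUENCE"
rows = dot_plot(horizontal, vertical)

# Print the matrix with the two words along the axes.
print("  " + horizontal)
for letter, row in zip(vertical, rows):
    print(letter + " " + row)
```

Reading the printed matrix, the main diagonal is broken only at the third and fourth positions (IS versus AT); the unbroken run after that is the shared SEQUENCE. Dots off the diagonal are chance matches between repeated letters such as E and S.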

Alignments reveal insertions and deletions
A gap in Seq1 accounts for the insertion of ISA into Seq2
seq1 THIS---SEQUENCE
seq2 THISISASEQUENCE

Alignments reveal substitutions
THISSEQUENCE
THATSEQUENCE
- How many substitutions?
- Are all substitutions equal?
- If these were real AA sequences in two extant organisms, how can we determine whether they reflect evolutionary ancestry? Would two unrelated sequences share this level of identity?

Substitutions
Need a method for evaluating the likelihood of
- A to V (alanine to valine)
- R to F (arginine to phenylalanine)

Scoring schemes to assess similarity
- Percent identity = percentage of aligned positions that hold identical amino acids
- Percent similarity (biochemical equivalence)
- Substitution matrices
  - value assigned based on the probability of substitution
  - score the alignment
      123456789….
Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP
      HKIYHLQSKVP  +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
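As a small worked example of the first scheme, percent identity can be computed directly from two aligned strings. The sketch below assumes the denominator is the full alignment length including gap columns, which is the convention in BLAST reports; other tools divide by the shorter sequence length instead. The function name is illustrative.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: gap columns count toward the alignment length but never
    count as identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * identical / len(aligned_a)


seq1 = "HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP"
seq2 = "HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP"

print(percent_identity("THISSEQUENCE", "THATSEQUENCE"))  # 10 of 12 columns match
print(percent_identity(seq1, seq2))                      # 42 of 60 columns match
```

Percent similarity would be computed the same way, except that a column also counts when the two residues score positively in a substitution matrix (the `+` characters in the middle line of the alignment).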

Substitution matrices
How might one construct a scoring matrix?
- What types of sequence events should we consider?
  - DNA level? Transition vs. transversion
  - Amino acid level? Biochemical equivalence
- Using known proteins?
  - comparing protein homologs
  - post hoc determination of probabilities
- Should we use the same substitution matrix for two very closely related proteins vs. proteins that diverged long ago?
  - Probability of substitutions increases over time
  - Probability that multiple substitutions occurred in a single position

Substitution matrix
Each cell represents the likelihood of substitution for one possible pair of amino acids
Sum up the score = 52
THISSEQUENCE
THATSEQUENCE
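The sum can be reproduced column by column. The slide does not say which matrix it used; the sketch below assumes BLOSUM62 and hard-codes only the handful of entries this example needs. The letter U is not one of the 20 standard amino acids and has no BLOSUM62 entry, so the U/U column is scored as 0 here. That is an assumption, although with it the total comes out to the 52 shown on the slide.

```python
# Subset of BLOSUM62, keyed by the alphabetically sorted residue pair.
# Only the pairs that occur in this example are listed.
BLOSUM62_SUBSET = {
    ("T", "T"): 5, ("H", "H"): 8, ("S", "S"): 4, ("E", "E"): 5,
    ("Q", "Q"): 5, ("N", "N"): 6, ("C", "C"): 9,
    ("A", "I"): -1,  # I aligned with A
    ("S", "T"): 1,   # S aligned with T
}

# Assumption: letters that are not standard amino acids contribute 0.
UNSCORED = {"U"}


def pair_score(a, b):
    """Look up the score for one alignment column."""
    if a in UNSCORED or b in UNSCORED:
        return 0
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def alignment_score(seq_a, seq_b):
    """Sum the per-column scores of an ungapped alignment."""
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))


print(alignment_score("THISSEQUENCE", "THATSEQUENCE"))  # 52
```

Note that the two mismatched columns contribute differently: I/A costs a point while S/T earns one, which is the sense in which not all substitutions are equal.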

PAM and BLOSUM matrices for AA sequences
Most protein alignment matrices are empirically derived:
- PAM Scoring Matrices
  - compared full length of closely related proteins
  - measured the frequency of all possible substitution pairs
- BLOSUM Scoring Matrices
  - compared highly conserved regions of proteins (blocks)

How to score gaps?
THISISASEQUENCE vs THATSEQUENCE?
More than one possible alignment:
THISISASEQUENCE
TH----ATSEQUENCE
THA---TSEQUENCE
TH---ATSEQUENCE
TH-A-T-SEQUENCE
Scoring the alignment must take into account
1) Substitutions
2) Gaps
Gap penalties:
1) start a new gap (-4)
2) extend an existing gap (-1)
Score all, choose highest score
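To see why the two penalties favor one long gap over several short ones, the gap part of the score can be computed on its own. This sketch assumes the first position of each gap costs the opening penalty (-4) and every further position in the same gap costs the extension penalty (-1); some programs instead charge both penalties for the first position. The function name is illustrative.

```python
import re


def gap_penalty(aligned, gap_open=-4, gap_extend=-1):
    """Total gap penalty for one row of an alignment.

    Each run of '-' is one gap: gap_open for its first position,
    gap_extend for each additional position (an assumption; conventions vary).
    """
    total = 0
    for run in re.findall(r"-+", aligned):
        total += gap_open + gap_extend * (len(run) - 1)
    return total


for candidate in ("THA---TSEQUENCE", "TH---ATSEQUENCE", "TH-A-T-SEQUENCE"):
    print(candidate, gap_penalty(candidate))
```

The two candidates with a single three-position gap each pay -6, while the candidate with three separate one-position gaps pays -12. The full alignment score adds these penalties to the substitution score of the remaining columns.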

Alignments
- Finding regions of sequence identity or similarity
- Inserting gaps to reflect indels
- Scoring the possible alignments to find the optimal alignment
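Scoring every possible alignment by hand quickly becomes impractical, so programs search for the best one with dynamic programming. The slides do not name a method at this point; the sketch below uses the Needleman-Wunsch global alignment algorithm with deliberately simple scores (+1 match, -1 mismatch, -2 per gap position) rather than a substitution matrix and separate open/extend penalties. All parameter values are assumptions chosen for readability.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = global_align("THISISASEQUENCE", "THISSEQUENCE")
print(best)
print(row_a)
print(row_b)
```

With a linear gap penalty several alignments tie for the best score, so the gap placement printed here can differ from the single three-position gap shown on the earlier slide; separate open and extend penalties are what break that tie in favor of one contiguous gap.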

Common tools that produce alignments
- BLAST to identify similar sequences, given a query sequence
- ClustalW to align two or more sequences across their entire length

The BLAST algorithm
- Uses a PAM or BLOSUM matrix
- Divides the query sequence into short strings, called words
- Searches through the database to find subject sequences that contain similar words
- When it finds similar words, it extends and scores the alignment
- Output consists of all subject sequences that align to the query at or above a threshold score
- If no words are similar, then no alignment

BLAST Algorithm
Divide the entire length of the query into words (segments of X length)
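The word step can be sketched as a sliding window over the query. The slide leaves the word length as X; a size of 3 is used below purely as an illustration, and the words are taken as overlapping (each start position yields one word), which is an assumption about what "divide" means here.

```python
def words(query, size=3):
    """List every overlapping word of the given size with its start index."""
    return [(i, query[i:i + size]) for i in range(len(query) - size + 1)]


for start, word in words("HKIYHLQ"):
    print(start, word)
```

A query of length L yields L - X + 1 words, so the seven-residue example above produces five words of length 3.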

Extend hits one base at a time
S is the alignment score: if it falls below a threshold, the extension process ends
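A minimal sketch of the extension step follows. It makes several simplifying assumptions that the slide does not state: the seed word is an exact match, scoring is +1 per match and -1 per mismatch instead of a substitution matrix, only rightward extension is shown, and the threshold is modeled as a maximum allowed drop below the best score seen so far (`max_drop`, a hypothetical parameter name).

```python
def extend_right(query, subject, q_start, s_start, word_size,
                 match=1, mismatch=-1, max_drop=2):
    """Extend an exact word hit to the right, one residue at a time.

    Stops when the running score S falls more than max_drop below the
    best score seen, then reports the best-scoring extension.
    """
    score = best = word_size * match  # seed assumed to match exactly
    length = best_len = word_size
    while q_start + length < len(query) and s_start + length < len(subject):
        same = query[q_start + length] == subject[s_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > max_drop:
            break  # S has fallen too far: end the extension
    return (best,
            query[q_start:q_start + best_len],
            subject[s_start:s_start + best_len])


query = "HKIYHLQSKVPTFVRMLAPEG"
subject = "HKIYHLQSKVPAILRKIAPKG"
print(extend_right(query, subject, 0, 0, 3))
```

Starting from the word HKI, the extension runs through the identical stretch HKIYHLQSKVP, then stops after three consecutive mismatches pull the score down; the trailing mismatches are trimmed because only the best-scoring prefix is kept. A leftward extension would work the same way in the opposite direction.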

HSPs are Aligned Regions
High scoring segment pairs = the original word match plus the extension
- high scoring = score of the alignment above threshold
- segment = the region of the query sequence aligned to the subject
- pair = alignment between two sequences (query and subject)
BLAST often produces several short HSPs rather than a single aligned region

BLAST Results report local alignments
>gi| |ref|NP_ | Predicted CDS, phosphatidylinositol transfer protein [Caenorhabditis elegans]
 Score = 283 bits (723), Expect = 8e-75
 Identities = 144/270 (53%), Positives = 186/270 (68%), Gaps = 13/270 (4%)
Query: 48   KEYRVILPVSVDEYQVGQLYSVAEASKNXXXXXXXXXXXXXXPYEK----DGE--KGQYT 101
            K+ RV+LP+SV+EYQVGQL+SVAEASK               P++     +G+  KGQYT
Sbjct: 70   KKSRVVLPMSVEEYQVGQLWSVAEASKAETGGGEGVEVLKNEPFDNVPLLNGQFTKGQYT 129
Query: 102  HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP 160
            HKIYHLQSKVP  +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
Sbjct: 130  HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP 189
Query: 161  DLGTQENVHKLEPEAWKHVEAVYIDIADRSQVL-SKDYKAEEDPAKFKSIKTGRGPLGPN 219
            D GT EN H L+ +     E V I+IA+  + L S D   +  P+KF+S KTGRGPL  N
Sbjct: 190  DNGTTENAHGLKGDELAKREVVNINIANDHEYLNSGDLHPDSTPSKFQSTKTGRGPLSGN 249
Query: 220  WKQELVNQKDCPYMCAYKLVTVKFKWWGLQNKVENFIHKQERRLFTNFHRQLFCWLDKWV 279
            WK  +      P MCAYKLVTV FKW+G Q  VEN+ H Q  RLF+ FHR++FCW+DKW
Sbjct: 250  WKDSVQ-----PVMCAYKLVTVYFKWFGFQKIVENYAHTQYPRLFSKFHREVFCWIDKWH 304
Query: 280  DLTMDDIRRMEEETKRQLDEMRQKDPVKGM 309
             LTM DIR +E + +++L+E R+   V+GM
Sbjct: 305  GLTMVDIREIEAKAQKELEEQRKSGQVRGM 334
Query was the entire protein sequence (position 1 to 749)
Reported values: Score, E-value, Identities, Positives, Gaps

BLAST Statistics
- E-value plays a role similar to a P value (the two are nearly equal when small); smaller numbers are more significant
  - 1e-4 = 1 x 10^-4 = 0.0001
  - 1e-50 = 1 x 10^-50
- E-value is calculated from the alignment score (S): how many alignments of that score would likely occur by chance if you query a database of that size?
  - if GenBank contains 10 million sequences, there is a good probability that the sequence "MAGAV" will occur multiple times in sequences that are NOT evolutionarily related
- The E-value is the number of alignments with this score or better expected from chance alone
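The dependence on score and database size can be made concrete with the standard relation between a bit score S' and the E-value, E = m * n * 2^(-S'), where m is the query length and n is the database length in residues. The matching P value is P = 1 - exp(-E). The sketch below uses the bit score and query length from the report above, but the database length is a made-up figure for illustration, and real BLAST applies effective-length corrections that are omitted here.

```python
import math


def e_value(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e):
    """Probability of at least one chance hit, given the expected count e."""
    return -math.expm1(-e)  # 1 - exp(-e), computed accurately for tiny e


HYPOTHETICAL_DB_LEN = 10 ** 8  # assumption: total residues in the database

e = e_value(283, 749, HYPOTHETICAL_DB_LEN)
print(e)
print(p_value(1e-4))  # nearly identical to the E-value itself
print(p_value(5.0))   # for large E the two diverge: P can never exceed 1
```

Doubling the database size doubles E for the same score, which is why the same alignment looks less significant against a larger database, and each extra bit of score halves E.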

Interpretation of output
- very low E-values (e-100) represent sequences that are very close to being identical
- moderate E-values are related genes (homologs)
- a long list of gradually declining E-values indicates a large gene family
- you must examine the results when the E-value is in the … to e-5 range
  - examine the sequences: a few AA matches in a long sequence? many AA matches in a very short sequence?

Evaluating BLAST results
Alignment:
- colored bar alignments (region of alignment, score along length)
- sequence alignments (region of alignment, AA information)
Exploring potential function from significant BLAST hits
- use the accession link to go to the record page for each hit
  - published papers
  - full sequence information
  - annotation
- BLAST is linked to a protein domain tool