Sequence similarity.

Sequence similarity

Sequence Comparison
Much of bioinformatics involves sequences:
• DNA sequences
• RNA sequences
• Protein sequences
We can think of these sequences as strings of letters:
• DNA & RNA: alphabet of 4 letters
• Protein: alphabet of 20 letters

Sequence Comparison - Motivation
Nucleotide:
• Learn about evolutionary relationships
• Find genes, domains, signals …
Protein:
• Classify protein families (function, structure)
• Identify common domains (function, structure)

Calculation of an alignment score

How do we align two sequences?
    ATTGCAGTGATCG
    ATTGCGTCGATCG
Solution 1 (10 matches "|", 3 mismatches):
    ATTGCAGTGATCG
    |||||   |||||
    ATTGCGTCGATCG
Solution 2 (12 matches "|", 2 gaps "-"):
    ATTGCAGT-GATCG
    ||||| || |||||
    ATTGC-GTCGATCG

Which alignment is better?
We will use a scoring scheme:
                 Scheme 1   Scheme 2
    Match           +1         +1
    Mismatch        -1          0
    Indel (gap)     -2         -2
Solution 1 (10 matches, 3 mismatches):
• Scheme 1: 10 × 1 + 3 × (-1) = 7
• Scheme 2: 10 × 1 + 3 × 0 = 10
Solution 2 (12 matches, 2 gaps):
• Scheme 1: 12 × 1 + 2 × (-2) = 8
• Scheme 2: 12 × 1 + 2 × (-2) = 8
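The arithmetic on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming gaps are written as "-" and that every gap column costs the same; the function name `score_alignment` is made up for illustration.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned rows column by column (linear gap cost)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap        # indel column
        elif x == y:
            score += match      # identical letters
        else:
            score += mismatch   # different letters
    return score


solution_1 = ("ATTGCAGTGATCG", "ATTGCGTCGATCG")
solution_2 = ("ATTGCAGT-GATCG", "ATTGC-GTCGATCG")

# First scheme (mismatch -1), then second scheme (mismatch 0).
for mismatch in (-1, 0):
    print(mismatch,
          score_alignment(*solution_1, mismatch=mismatch),
          score_alignment(*solution_2, mismatch=mismatch))
```

Under the first scheme Solution 2 wins (8 against 7); under the second scheme Solution 1 wins (10 against 8). Which alignment is "better" depends on the scoring scheme.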

Scoring Alignments - intuition
Similar sequences evolved from a common ancestor.
Evolution changed the sequences from this ancestral sequence by mutations:
• Replacement: one letter replaced by another
• Deletion: deletion of a letter
• Insertion: insertion of a letter
Scoring of sequence similarity should examine how many operations took place.

Causes for sequence (dis)similarity
• Mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)
• Insertion: at a certain location one new nucleotide is inserted between two existing nucleotides (e.g.: AA → AGA)
• Deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)
• Indel: an insertion or a deletion

Gaps
• Positions at which a letter is paired with a null are called gaps.
• Gap scores are typically negative.
• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.

Gap Opening
The gap-opening penalty defines the cost of opening a gap in one of the sequences. If you raise the gap-opening penalty above the default, local alignments that contain gaps may be split into several shorter alignments.

Affine Gap Penalties
In nature, a series of indels often comes as a single event rather than as a series of single-nucleotide events.
This is more likely (one gap of length 2):
    ATA__GC
    ATATTGC
This is less likely (two gaps of length 1):
    ATAG_GC
    AT_GTGC
Normal scoring would give the same score for both alignments. An affine penalty separates them:
    Gap = Gapopen + Len * Gapextend
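To see how the affine formula separates the two alignments, here is a small sketch. The penalty values (open -3, extend -1) and the function name `affine_score` are illustrative assumptions, not values from the slides; gaps may be written as "_" or "-".

```python
def affine_score(top, bottom, match=1, mismatch=-1,
                 gap_open=-3, gap_extend=-1):
    """Score aligned rows with Gap = gap_open + length * gap_extend."""
    score = 0
    prev_gap_row = None  # row that held the gap in the previous column
    for x, y in zip(top, bottom):
        if x in "-_":
            gap_row = "top"
        elif y in "-_":
            gap_row = "bottom"
        else:
            gap_row = None
        if gap_row is None:
            score += match if x == y else mismatch
        else:
            if gap_row != prev_gap_row:
                score += gap_open   # a new gap starts here
            score += gap_extend     # every gap column pays the extension
        prev_gap_row = gap_row
    return score


one_long_gap = ("ATA__GC", "ATATTGC")
two_short_gaps = ("ATAG_GC", "AT_GTGC")

print(affine_score(*one_long_gap))    # 5 matches, one gap of length 2
print(affine_score(*two_short_gaps))  # 5 matches, two gaps of length 1
```

With these assumed penalties the single long gap scores 0 and the two short gaps score -3. Setting `gap_open=0` makes the penalty linear again, and both alignments then receive the same score.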

Effects of gap penalties:
• Increasing the penalties for gap opening and extension: the alignment will contain fewer gaps and more mismatches.
• Decreasing the penalties for gap opening and extension: the alignment will contain more gaps (of varied lengths) and fewer mismatches.
• Keeping the gap-opening penalty fixed and increasing the gap-extension penalty: very long gaps will not be tolerated; they will be replaced by additional gaps of medium length and by mismatches.

Sequence similarity

Global alignment
A global alignment between two sequences is an alignment in which all the characters in both sequences participate. Because sequences that match over their entire length are also easily identified by local alignment methods, global alignment is now somewhat deprecated as a technique.
[Figure: schematic comparison of global and local alignment]
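The slides do not name an algorithm for global alignment. The sketch below assumes the usual dynamic-programming approach (Needleman-Wunsch style) with a linear gap penalty, and returns only the best score rather than the alignment itself. The function name is hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of an end-to-end alignment of a and b (linear gaps)."""
    # prev[j] = best score for the current prefix of a against b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


seq_a = "ATTGCAGTGATCG"
seq_b = "ATTGCGTCGATCG"
print(global_alignment_score(seq_a, seq_b))              # mismatch -1
print(global_alignment_score(seq_a, seq_b, mismatch=0))  # mismatch 0
```

For the two example sequences used earlier in the deck, the optimum is 8 with a mismatch score of -1 and 10 with a mismatch score of 0, matching the hand-scored Solution 2 and Solution 1 respectively.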

Local alignment
Local alignment methods find related regions within sequences; the aligned regions can consist of a subset of the characters within each sequence. For example, positions 20-40 of sequence A might be aligned with positions 50-70 of sequence B. This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins can still be identified as related.
[Figure: schematic comparison of global and local alignment]
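A local alignment score can be sketched with the same kind of table as a global one, with two changes: a cell never drops below zero, and the answer is the best cell anywhere in the table. This assumes a Smith-Waterman style recurrence with a linear gap penalty, which the slides do not spell out; the function name and the test sequences are made up.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all pairs of substrings of a and b (linear gaps)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Resetting to 0 lets an alignment start anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The shared region GATCG is flanked by unrelated letters.
print(local_alignment_score("TTTTGATCGAAAA", "CCGATCGCC"))
```

Here the shared region GATCG yields a score of 5 even though the flanking letters do not match at all, which is the situation where a global alignment would be penalized.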

Global vs. Local
[Figure: global versus local alignment (credit: Jack Leunissen)]

Global vs. Local
Use global alignment if:
• You expect, based on some biological information, that your sequences will match over their entire length.
• Your sequences are of similar length.
Use local alignment if:
• You expect that only certain parts of the two sequences will match (as in the case of a conserved segment that can be found in many different proteins).
• Your sequences are very different in length.
• You want to search a sequence database (we will talk about this in detail later).

Emboss [best solution] vs. Lalign (Embnet) [several solutions]
If two proteins share more than one common region, for example one has a single copy of a particular domain while the other has two copies, a local alignment program that presents only the best-scoring alignment may "miss" one of the two copies.

Comparing nucleotides
• Every match got the same score.
• Every mismatch got the same score.
• Gaps: we chose the penalties, but the defaults are usually good.
However, this does not carry over to amino acids.

In the case of amino acids (aa)
• Not all matches are the same.
• Different mismatches get different scores.

Amino acid properties
• Serine (S) and Threonine (T) have similar physicochemical properties.
• Aspartic acid (D) and Glutamic acid (E) have similar properties.
=> Substitution of S/T or E/D occurs relatively often during evolution.
=> Substitution of S/T or E/D should result in scores that are only moderately lower than identities.

So how can we score matches and mismatches?
Each amino acid is characterized by a combination of features (size, charge, etc.). The relative importance of each feature may vary according to the amino acid's role in the 3-D structure and function of the protein.

Amino Acid Substitution Matrices
• The PAM and BLOSUM substitution matrices describe the likelihood that two residue types would mutate to each other.
• These matrices are based on biological sequence information: the substitutions observed in structural (BLOSUM) or evolutionary (PAM) alignments of well-studied protein families.
• These scoring systems have a probabilistic foundation.
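The probabilistic foundation is usually expressed as a log-odds score: how much more often a pair of residues is seen aligned than would be expected by chance. The sketch below assumes that general form, with base-2 logarithms and a scale factor of 2 (half-bit units). The frequencies are made-up numbers chosen to give round results, not values from any real matrix.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected)."""
    expected = freq_a * freq_b  # chance of the pair if residues were independent
    return round(scale * math.log2(pair_freq / expected))


# Hypothetical background frequency of 0.125 for both residues.
print(log_odds_score(0.0625, 0.125, 0.125))      # pair seen 4x more than chance
print(log_odds_score(0.015625, 0.125, 0.125))    # pair seen exactly as chance
print(log_odds_score(0.00390625, 0.125, 0.125))  # pair seen 4x less than chance
```

Pairs observed more often than chance get positive scores, pairs observed as often as chance score 0, and pairs observed less often get negative scores.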

PAM series - Percent Accepted Mutation, also expanded as Point Accepted Mutation (accepted by natural selection)
• All the PAM data come from alignments of closely related proteins (>85% amino acid identity) from 71 protein families (a total of 1,572 observed amino acid changes).
• PAM matrices are based on global sequence alignments, which include both highly conserved and highly mutable regions.
Some of the protein families are: Ig kappa chain, Kappa casein, Lactalbumin, Hemoglobin α, Myoglobin, Insulin, Histone H4, Ubiquitin.

Various degrees of conservation
• PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, on average one accepted change has occurred per 100 amino acids.
• Other PAM matrices are extrapolated from PAM1.
• For PAM250, on average 250 changes have occurred between two proteins per 100 amino acids. Because the same position can change more than once, this does not mean that every residue differs.
• All the PAM data come from closely related proteins (>85% amino acid identity).
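Extrapolation from PAM1 is conventionally done by repeated matrix multiplication: the mutation-probability matrix for n PAM units is the PAM1 matrix raised to the n-th power. The sketch below assumes that convention and uses a made-up two-letter alphabet with a 1% change rate, purely to keep the example small; a real PAM1 matrix is 20 × 20.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to a non-negative integer power by squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Toy "PAM1": each letter stays unchanged with probability 0.99 (assumed values).
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

In this toy model, after 250 steps the probability that a position still shows its original letter has fallen from 0.99 to just above 0.5, the chance level for a two-letter alphabet. This also illustrates why any error in the PAM1 estimates is magnified in PAM250.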

[Figure: PAM series, varying degrees of conservation]

Blocks Substitution Matrices - the BLOSUM family of matrices (Henikoff and Henikoff, 1992)
• Blocks are short conserved patterns, 3-60 aa long.
• Proteins can be divided into families by common blocks.
• Different BLOSUM matrices emerge by looking at sequences with different identity percentages.
• Example: for BLOSUM62, sequences within a block that are 62% identical or more are clustered and counted as a single sequence, so the matrix reflects substitutions between sequences that share less than 62% identity.
[Figure: blocks A, B, C, D along a set of protein sequences]
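One way to picture how an identity threshold acts on a block is to group the gapless segments whose pairwise identity reaches the threshold. The sketch below is an illustration only: the grouping rule (a segment joins any cluster containing a member at or above the threshold), the function names and the toy segments are all assumptions, not the published procedure.

```python
def percent_identity(s, t):
    """Percent of identical positions between two gapless segments."""
    if len(s) != len(t):
        raise ValueError("block segments must have equal length")
    return 100.0 * sum(x == y for x, y in zip(s, t)) / len(s)


def cluster_block(segments, threshold=62.0):
    """Group segments linked by identity >= threshold (single linkage)."""
    clusters = []
    for seg in segments:
        linked = [c for c in clusters
                  if any(percent_identity(seg, m) >= threshold for m in c)]
        merged = [seg]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters


# Toy block: the first two segments are 90% identical, the third is unrelated.
toy_block = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]
for cluster in cluster_block(toy_block):
    print(cluster)
```

With a threshold of 62, the first two toy segments fall into one cluster and the third stays alone. Lowering the threshold merges more segments, so more divergent comparisons dominate the counts.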

The Blocks Database Gapless alignment blocks

Blosum62 scoring matrix
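A few entries of the matrix show the pattern described for S/T and D/E. The values below are quoted from the standard BLOSUM62 table from memory and should be checked against a reference copy before being relied on. The lookup helper and the three-residue peptides are made up for illustration.

```python
# Small excerpt of BLOSUM62 (symmetric, so each pair is stored once).
BLOSUM62_EXCERPT = {
    ("S", "S"): 4, ("T", "T"): 5, ("D", "D"): 6, ("E", "E"): 5,
    ("W", "W"): 11,
    ("S", "T"): 1,   # similar residues: mildly positive
    ("D", "E"): 2,   # similar residues: mildly positive
    ("D", "W"): -4,  # dissimilar residues: negative
}


def blosum62(a, b):
    """Look up a pair in either order."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def score_ungapped(top, bottom):
    """Sum the matrix scores of an ungapped alignment."""
    return sum(blosum62(x, y) for x, y in zip(top, bottom))


print(score_ungapped("SDW", "SDW"))  # identities only
print(score_ungapped("SDW", "TEW"))  # two conservative substitutions
```

Two points from the earlier slides are visible here: not all matches score the same (W/W scores 11, S/S scores 4), and conservative substitutions such as S/T and D/E keep a positive score while an unrelated pair such as D/W is penalized.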

Summary:
• BLOSUM matrices are based on the replacement patterns found in the more highly conserved regions of the sequences, without gaps.
• PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions.

PAM versus BLOSUM
PAM:
• Based on an explicit evolutionary model
• Derived from small, closely related proteins with ~15% divergence
• Higher PAM numbers detect more remote sequence similarities
• Errors in PAM1 are scaled 250X in PAM250
BLOSUM:
• Based on empirical frequencies
• Uses a much larger, more diverse set of protein sequences (30-90% ID)
• Lower BLOSUM numbers detect more remote sequence similarities
• Errors in BLOSUM arise from errors in alignment

Guidelines
• Lower PAMs and higher BLOSUMs find short local alignments of highly similar sequences.
• Higher PAMs and lower BLOSUMs find longer, weaker local alignments.
• No single matrix answers all questions.

Guidelines
• BLOSUM is generally better than PAM for local alignments.
• The default matrix is often the identity matrix for DNA and BLOSUM62 for proteins.
• When using BLOSUM80 instead of BLOSUM45, local alignments tend to be shorter.
• Low PAMs have the same effect as high BLOSUMs.
• The BLOSUM number indicates percent identity, while the PAM number is proportional to the percent of accepted mutations.