Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Slides:



Advertisements
Similar presentations
1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Advertisements

Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
Computing a tree Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
Measuring the degree of similarity: PAM and blosum Matrix
Lecture 8 Alignment of pairs of sequence Local and global alignment
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
Sequence alignment & Substitution matrices By Thomas Nordahl & Morten Nielsen.
Lecture outline Database searches
Heuristic alignment algorithms and cost matrices
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Sequence similarity.
Sequence alignment & Substitution matrices By Thomas Nordahl & Morten Nielsen.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Pairwise Alignment Global & local alignment Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
Similar Sequence Similar Function Charles Yan Spring 2006.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
Sequence Alignment III CIS 667 February 10, 2004.
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
Sequence Alignments Revisited
Alignment IV BLOSUM Matrices. 2 BLOSUM matrices Blocks Substitution Matrix. Scores for each position are obtained frequencies of substitutions in blocks.
Multiple sequence alignment methods 1 Corné Hoogendoorn Denis Miretskiy.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Sequence comparison: Local alignment
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Chapter 5 Multiple Sequence Alignment.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Sequence comparison: Local alignment Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Developing Pairwise Sequence Alignment Algorithms
Bioiformatics I Fall Dynamic programming algorithm: pairwise comparisons.
Bioinformatics in Biosophy
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Alignment methods April 26, 2011 Return Quiz 1 today Return homework #4 today. Next homework due Tues, May 3 Learning objectives- Understand the Smith-Waterman.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Lecture 6. Pairwise Local Alignment and Database Search Csc 487/687 Computing for bioinformatics.
Construction of Substitution Matrices
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
The Blosum scoring matrices Morten Nielsen BioSys, DTU.
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
Blosum matrices What are they? Morten Nielsen BioSys, DTU
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Sequence comparison: Dynamic programming
Sequence comparison: Local alignment
Sequence comparison: Significance of similarity scores
Ab initio gene prediction
Pairwise Sequence Alignment
Sequence comparison: Local alignment
Pairwise Alignment Global & local alignment
Sequence comparison: Significance of similarity scores
Alignment IV BLOSUM Matrices
Presentation transcript:

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

BUT the best paths to X, Y, and Z are analogously the max of their three upstream possibilities, etc. Inductively QED. Consider the last step in the best alignment path to node  below. This path must come from one of the three nodes shown, where X, Y, and Z are the cumulative scores of the best alignments up to those nodes. We can reach node  by three possible paths: an A-B match, a gap in sequence A or a gap in sequence B: The best-scoring path to  is the maximum of: X + match Y + gap Z + gap FYI - informal inductive proof of best alignment path

Local alignment - review ACGT A C 2 -5 G -72 T AAG 0000 A020 G0004 C (no arrow means no preceding alignment) d = -5

Local alignment Two differences from global alignment: –If a score is negative, replace with 0. –Traceback from the highest score in the matrix and continue until you reach 0. Global alignment algorithm: Needleman- Wunsch. Local alignment algorithm: Smith- Waterman.

dot plot of two DNA sequences overlay of the global DP alignment path

Quantitatively represent the degree of conservation of typical amino acid residues over evolutionary time. All possible amino acid changes are represented (matrix of size at least 20 x 20). Most commonly used are several different BLOSUM matrices derived for different degrees of evolutionary divergence. DNA score matrices are simpler (and conceptually similar). Protein score matrices

regular 20 amino acids BLOSUM62 Score Matrix ambiguity codes and stop # BLOSUM Clustered Scoring Matrix in 1/2 Bit Units # Cluster Percentage: >= 62

Amino acid structures Hydrophobic Polar Charged phenylalanine F

BLOSUM62 Score Matrix Good scores – chemically similar Bad scores – chemically dissimilar

Amino acid structures HydrophobicPolarCharged

Find sets of sequences whose alignment is thought to be correct (this is partly bootstrapped by alignment). Measure how often various amino acid pairs occur in the alignments. Normalize this to the expected frequency of such pairs randomly in the same set of alignments. Derive a log-odds score for aligned vs. random. Deriving BLOSUM scores

Example of alignment block (the BLO part of BLOSUM) 31 positions (columns) 61 sequences (rows) Thousands of such blocks go into computing a single BLOSUM matrix. Represent full diversity of sequences. Results are summed over all columns of all blocks.

Pair frequency vs. expectation DEDNDDDEDNDD 6 D-D pairs 4 D-E pairs 4 D-N pairs 1 E-N pair Sample column from an alignment block: Actual aligned pair frequency: Randomly expected pair frequency: (a multiple alignment of N sequences is the equivalent of all the pairwise alignments, which number (N)(N-1)/2.) etc. this is called the sum of pairs (the SUM part of BLOSUM)

Log-odds score calculation (so adding scores == multiplying probabilities) For computational speed often rounded to nearest integer and (to reduce round-off error) they are often multiplied by 2 (or more) first, giving a “half-bit” score: (computers can add integers faster than floats)

BLOSUM62 matrix (half-bit scores) Frequency of C residue over all proteins: (you have to look this up) C-C Reverse calculation of aligned C-C pair frequency in BLOSUM data set: thus ( 9 half-bits = 4.5 bits )

Constructing Blocks Blocks are ungapped alignments of multiple sequences, usually 20 to 100 amino acids long. Cluster the members of each block according to their percent identity. Make pair counts and score matrix from a large collection of similarly clustered blocks. Each BLOSUM matrix is named for the percent identity cutoff in step 2 (e.g. BLOSUM70 for 70% identity).

Probabilistic Interpretation of Scores (ungapped) By converting scores back to probabilities, we can give a probabilistic interpretation to an alignment score. VHRDLKPENLLLASK ( ) this 15 amino acid alignment has a score of 75, meaning that it is ~ times more likely to be seen in a real alignment than in a random alignment(!!). FIAP FLSP this alignment has a score of 16 ( ) by BLOSUM 62, meaning an alignment with this score or more is 2 8 (256) times more likely than expected from a random alignment. (BLOSUM62)

Randomly Distributed Gaps (probability of a gap at each position in the sequence) [note - the slope of the line on a log-linear plot will vary according to the frequency of gaps, but it will always be linear] if then

log-linear plot Distribution of real alignment gap lengths in large set of structurally-aligned proteins Nowhere near linear - hence the use of affine gap penalties (there ideally would be several levels of decreasing affine penalties)

What you should know How a score matrix is derived What the scores mean probablistically Why gap penalties should be affine