S. Maarschalkerweerd & A. Tjhang: Probability Theory and Basic Alignment of String Sequences, Chapters 1.1-2.3

Presentation transcript:

Slide 1: Probability Theory and Basic Alignment of String Sequences (Chapters 1.1-2.3)

Slide 2: Overview
Probability Theory
- Maximum Likelihood
- Bayes' Theorem
Pairwise Alignment
- The Scoring Model
- Alignment Algorithms

Slide 3: Probability Theory

Slide 4: Probability Theory
What is a probabilistic model?
Simple example: what is the probability of a base sequence $x_1 x_2 \ldots x_n$?
If the residues occur independently of each other, with probabilities $p(x_1), p(x_2), \ldots, p(x_n)$, then
$$P(x_1 x_2 \ldots x_n) = \prod_{i=1}^{n} p(x_i)$$
If $p_C = 0.3$, $p_T = 0.2$ and the sequence is CTC: $P(\mathrm{CTC}) = 0.3 \times 0.2 \times 0.3 = 0.018$
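
The product rule on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming residues are independent as the slide states; the function name `sequence_probability` is chosen here for illustration and does not come from the slides.

```python
import math


def sequence_probability(sequence, residue_probs):
    """Probability of a sequence whose residues are drawn independently.

    residue_probs maps each residue to its probability p(x_i).
    """
    return math.prod(residue_probs[residue] for residue in sequence)


# Values from the slide: p_C = 0.3, p_T = 0.2
p_ctc = sequence_probability("CTC", {"C": 0.3, "T": 0.2})
print(p_ctc)  # approximately 0.018, up to floating-point rounding
```

Because the result is a product of many numbers below 1, practical implementations for long sequences usually sum log-probabilities instead to avoid underflow.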

Slide 5: Maximum Likelihood Estimation
Estimate the parameters of the model from large sets of examples (the training set).
- For example: P(T) and P(C) are estimated from their frequency in a database of residues.
Avoid overfitting.
- If the database is too small, the model also fits the noise in the training set.
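
As an illustration of estimating residue probabilities from observed frequencies, the sketch below counts residues in a small training set. The function name `estimate_frequencies` and the toy sequences are assumptions made for this example, not part of the slides.

```python
from collections import Counter


def estimate_frequencies(training_sequences):
    """Maximum likelihood estimate of residue probabilities:
    the observed count of each residue divided by the total count."""
    counts = Counter()
    for seq in training_sequences:
        counts.update(seq)  # counts every character of the sequence
    total = sum(counts.values())
    if total == 0:
        raise ValueError("training set contains no residues")
    return {residue: count / total for residue, count in counts.items()}


# Toy training set (hypothetical data, far too small for real use)
print(estimate_frequencies(["ACGT", "AACC"]))
```

With a training set this small, the estimates mostly reflect noise, which is the overfitting problem the slide warns about.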

Slide 6: Probability Theory
Conditional probability
- Joint probability: $P(X,Y) = P(X \mid Y)\,P(Y)$
- Marginal probability: $P(X) = \sum_Y P(X,Y) = \sum_Y P(X \mid Y)\,P(Y)$

Slide 7: Bayes' Theorem
$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$
- $P(X \mid Y)$ is the posterior probability.
Example:
- P(X) = probability that a tumor is visible on the x-ray
- P(C) = probability of breast cancer = 0.01
- P(X|C) = 0.9; P(X|¬C) = [value missing]
On the x-ray a tumor is seen. What is the probability that the woman has breast cancer?
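
The example can be worked through by expanding the denominator with the marginal probability, P(X) = P(X|C) P(C) + P(X|¬C) P(¬C). The sketch below does this in Python. The value of P(X|¬C) is not preserved in the transcript, so the 0.1 used here is a purely hypothetical placeholder, and the function name `posterior` is likewise an assumption.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(C | X) by Bayes' theorem.

    prior               = P(C)
    likelihood          = P(X | C)
    false_positive_rate = P(X | not C)
    """
    # Marginal probability P(X), summed over C and not-C
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# P(C) = 0.01 and P(X|C) = 0.9 are from the slide;
# 0.1 is a HYPOTHETICAL stand-in for the missing P(X|not C).
print(posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.1))
```

Whatever the true false-positive rate was, the structure of the calculation shows why a small prior keeps the posterior low even when the test is sensitive.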

Slide 8: Pairwise Alignment

Slide 9: Pairwise Alignment
Goal: determine whether two sequences are related (homologous).
Issues regarding pairwise alignment:
1. What sorts of alignment should be considered?
2. The scoring system used to rank alignments.
3. The algorithm used to find optimal (or good) scoring alignments.
4. The statistical methods used to evaluate the significance of an alignment score.

Slide 10: Example
You need a 'smart' scoring model to distinguish b from c.

Slide 11: The Scoring Model

Slide 12: The Scoring Model
Sequences that are related derive from a common ancestor.
- Mutation changes sequences over time:
  - Substitutions
  - Gaps (insertions or deletions)
- Natural selection ensures that some mutations are seen more often than others (survival of the fittest).

Slide 13: The Scoring Model
Total score of an alignment: the sum of
- a term for each aligned pair of residues
- a term for each gap
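
The additive scoring scheme can be sketched as a short function that walks over the columns of an alignment. This is an illustrative sketch under simplifying assumptions: gaps are written as "-", each gap position is charged the same fixed penalty, and the toy scores in `simple_score` (+1 match, -1 mismatch) are hypothetical stand-ins for a real substitution matrix.

```python
def score_alignment(aligned_x, aligned_y, substitution_score, gap_penalty):
    """Sum one term per aligned residue pair and one per gap position."""
    if len(aligned_x) != len(aligned_y):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(aligned_x, aligned_y):
        if a == "-" or b == "-":
            total -= gap_penalty              # gap term
        else:
            total += substitution_score(a, b)  # aligned-pair term
    return total


def simple_score(a, b):
    """Toy substitution score (hypothetical): +1 match, -1 mismatch."""
    return 1 if a == b else -1


# Three matches (+3) and one gap position (-2)
print(score_alignment("ACGT", "A-GT", simple_score, gap_penalty=2))  # 1
```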

Slide 14: Substitution Matrices
We need a matrix with the scores for every possible pair of residues (e.g. bases or amino acids).
We can compute these scores by:
$$s(a,b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$$
- $p_{ab}$ = probability that residues a and b have each been derived independently from some unknown original residue c
- $q_a$ = frequency of a
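
The log-odds score compares how often a pair is seen in related sequences with how often it would occur by chance. The sketch below evaluates it for made-up numbers; the probabilities are hypothetical and the function name `log_odds_score` is an assumption for illustration.

```python
import math


def log_odds_score(p_ab, q_a, q_b):
    """s(a, b) = log(p_ab / (q_a * q_b)), using the natural logarithm."""
    return math.log(p_ab / (q_a * q_b))


# Hypothetical values: the pair occurs 4x more often than chance predicts
print(log_odds_score(p_ab=0.04, q_a=0.1, q_b=0.1))   # positive score
# Hypothetical values: the pair occurs half as often as chance predicts
print(log_odds_score(p_ab=0.005, q_a=0.1, q_b=0.1))  # negative score
```

Pairs that are more likely under the common-ancestor model than under independence get a positive score, and pairs that are less likely get a negative one. Published matrices typically scale and round these values to integers.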

Slide 15: BLOSUM50

Slide 16: Gap Penalties
- Linear score: $\gamma(g) = -gd$
- Affine score: $\gamma(g) = -d - (g-1)e$
  - d = gap-open penalty
  - e = gap-extension penalty
  - g = gap length
Probability of a gap: $P(\mathrm{gap}) = f(g) \prod_{i \in \mathrm{gap}} q_{x_i}$
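
The two gap-cost functions differ only in how they charge for extending a gap, which a short sketch makes concrete. The function names and the numeric penalties below are illustrative assumptions, not values from the slides.

```python
def linear_gap(g, d):
    """Linear gap score: every gap position costs d."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score: d to open the gap, e for each further position."""
    return -d - (g - 1) * e


# A gap of length 3 under hypothetical penalties
print(linear_gap(3, d=8))        # -24
print(affine_gap(3, d=12, e=2))  # -16
```

With e smaller than d, the affine score penalizes one long gap less than several short ones of the same total length.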

Slide 17: Alignment Algorithms

Slide 18: Alignment Algorithms
- Needleman-Wunsch (global alignment)
- Smith-Waterman (local alignment)
- Repeated matches
- Overlap matches
- Hybrid match conditions

Slide 19: Dynamic Programming
- The number of possible alignments is enormous.
- Algorithm for finding the optimal alignment: use dynamic programming.
- Save sub-results for later reuse, so the same subproblem is never calculated twice.

Slide 20: Needleman-Wunsch Algorithm
Global alignment
- For sequences of size n and m, make an (n+1) x (m+1) matrix.
- Fill it in from top left to bottom right:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
- Keep a pointer to the cell that is used to derive F(i,j).
- Takes O(nm) time and memory.
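
A compact Python sketch of the fill and traceback steps is given below. It assumes a linear gap penalty d, a toy match/mismatch score in place of a real substitution matrix, and the usual boundary values F(i,0) = -id and F(0,j) = -jd, which the slide does not state explicitly. The function name and default parameters are hypothetical.

```python
def needleman_wunsch(x, y, d=2, match=1, mismatch=-1):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    # Boundary cells: a prefix aligned entirely against gaps (assumed)
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "left"
    # Fill from top left to bottom right, keeping a pointer per cell
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback from the bottom right cell to the top left cell
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

When several candidates tie, this sketch prefers the diagonal move; other tie-breaking rules give different but equally scoring alignments.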

Slide 21: Matrix

Slide 22: Matrix Traceback

Slide 23: Smith-Waterman Algorithm
Local alignment
Two differences with Needleman-Wunsch:
1. The recurrence has an extra option 0:
$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
2. A local alignment can end anywhere, so the traceback starts from the highest value in the matrix (not necessarily the bottom right cell).
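
The two changes relative to global alignment are small in code, as the sketch below suggests. It assumes a linear gap penalty and a toy match/mismatch score, and it ends the traceback at the first cell whose value came from the 0 option; the function name and default parameters are hypothetical.

```python
def smith_waterman(x, y, d=2, match=2, mismatch=-1):
    """Local alignment; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [["stop"] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (0, "stop"),                  # difference 1: start afresh
                (F[i - 1][j - 1] + s, "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Difference 2: traceback starts at the highest-scoring cell
    ax, ay = [], []
    i, j = best_pos
    while ptr[i][j] != "stop":
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("GGGACGTCCC", "TTACGTAA"))  # (8, 'ACGT', 'ACGT')
```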

Slide 24: Matrix

Slide 25: Smith-Waterman Algorithm
- The expected score s(a,b) for a random match must be negative.
- There must be some s(a,b) greater than 0, or no alignment is found.

Slide 26: Repeated Matches
- Many local alignments are possible if one or both sequences are long; Smith-Waterman finds only one of them.
- Find parts of one sequence in the other sequence.
- Not every alignment is useful: only matches that score above a threshold are kept.

Slide 27: Repeated Matches
$$F(i,0) = \max \begin{cases} F(i-1,0) \\ F(i-1,j) - T, \quad j = 1,\ldots,m \end{cases}$$
$$F(i,j) = \max \begin{cases} F(i,0) \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
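
The two recurrences can be sketched as a score-only routine, shown below without traceback. Several details are assumptions not stated on the slide: row 0 is initialized to zero, one extra cell F(n+1,0) is computed to collect the final total, the gap penalty is linear, and a toy match/mismatch score replaces a substitution matrix. The function name is hypothetical.

```python
def repeated_match_score(x, y, T, d=2, match=2, mismatch=-1):
    """Total score of all non-overlapping matches of y in x,
    each counted as (match score - T). y must be non-empty."""
    n, m = len(x), len(y)
    # Rows 0..n+1; row n+1 only holds the final F(n+1, 0)
    F = [[0] * (m + 1) for _ in range(n + 2)]
    for i in range(1, n + 2):
        # Column 0: either stay unmatched, or close a match that
        # ended in row i-1 and pay the threshold T
        F[i][0] = max(F[i - 1][0], max(F[i - 1][1:]) - T)
        if i == n + 1:
            break
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i][0],                # start a new match here
                F[i - 1][j - 1] + s,
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    return F[n + 1][0]


# Two copies of "ACG" in x, each scoring 6, each reduced by T = 3
print(repeated_match_score("ACGTTACG", "ACG", T=3))  # 6
```

Raising T above 6 in this example would make both matches unprofitable, and the total would drop to 0.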

Slide 28: Matrix (threshold T = 20)

Slide 29: Overlap Matches
- Find a match between the start of one sequence and the end of a sequence (which can be the same sequence).
- The alignment begins on the left-hand or top border of the matrix and ends on the right-hand or bottom border.

Slide 30: Overlap Matches
$$F(0,j) = 0 \quad \text{for } j = 1,\ldots,m$$
$$F(i,0) = 0 \quad \text{for } i = 1,\ldots,n$$
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
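
An overlap alignment uses the global recurrence but with free borders, which the score-only sketch below illustrates. It assumes a linear gap penalty and a toy match/mismatch score, and it takes the best value on the bottom row or right-hand column as the result; the function name and default parameters are hypothetical.

```python
def overlap_score(x, y, d=2, match=2, mismatch=-1):
    """Best overlap-alignment score between x and y."""
    n, m = len(x), len(y)
    # Top and left borders stay 0: overhanging ends are not penalized
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    # The alignment must end on the bottom or right-hand border
    bottom = max(F[n])
    right = max(row[m] for row in F)
    return max(bottom, right)


# The end of x ("ACGT") overlaps the start of y: 4 matches x 2
print(overlap_score("GGGACGT", "ACGTCCC"))  # 8
```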

Slide 31: Matrix

Slide 32: Hybrid Match Conditions
Different types of alignment can be created by
- adjusting the right-hand side of this formula: F(i,j) = max { …
- adjusting the traceback
Example:
- We want to align two sequences from the beginning of both sequences until a local alignment has been found.

Slide 33: Summary
- Probability theory is important for sequence analysis.
- Goal: determine whether two sequences are related.
- For that, we need to find an optimal alignment between those sequences using algorithms.
- A scoring model is required to rank different alignments.
- Different algorithms exist for different types of alignment; all of them use dynamic programming.