DNA sequences alignment measurement Lecture 13. Introduction Measurement of “strength” alignment Nucleic acid and amino acid substitutions Measurement.


DNA sequences alignment measurement Lecture 13

Introduction
– Measurement of alignment “strength”
– Nucleic acid and amino acid substitutions
– Measurement of alignment gaps

Measurement of aligned sequences
When aligning sequences (DNA/AA) it is assumed that:
– they have a common ancestor;
– the differences between the sequences are the result of mutations;
– important areas such as coding sequences (CDS) will be conserved, i.e. there is a bias “against” mutations in these areas;
– there is also a bias in the types of mutation: substitutions are more likely than insertions/deletions.
The dot plot gives a visual representation of aligned regions. But how do we measure the strength of these alignments?

Measurement of aligned sequences
One way is to count the mismatches, i.e. the “difference” between the sequences.
– Hamming distance: the number of mismatched positions between two strings of equal length. Example: agtc and cgta differ at the first and last positions, so the distance is 2. (Give another example.)
– Levenshtein distance: used when the sequences (strings) are not of equal length. It is the minimum number of edit operations (alter/insert/delete) required to turn one string into another. Example:
  ag-tcc
  cgctca
  What is the Levenshtein distance?
The latter technique has the advantage of allowing the inclusion of gaps.
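The two distances described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lecture; the function names are chosen here for the example, and the Levenshtein routine assumes every edit operation (alter, insert, delete) costs 1.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of alter/insert/delete operations turning a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # keep a match or alter x into y
            ))
        previous = current
    return previous[-1]


print(hamming_distance("agtc", "cgta"))         # 2
print(levenshtein_distance("agtcc", "cgctca"))  # the slide's exercise
```

The Levenshtein routine takes the ungapped strings (agtcc, cgctca); the gap in the slide's example is something the algorithm discovers rather than something it is given.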

Measurement of matching
But how biologically plausible are these approaches to measuring “differences” between sequences (strings)? DNA sequences differ from ordinary strings:
– the probabilities of substitution, insertion and deletion are not the same;
– certain types of mutation, such as inversions, translocations and duplications, complicate the assessment of similarity; e.g. how would you treat tandem repeats or inverted repeats?

Nucleic acid mutations
In sequence alignment we are trying to determine whether the similarities (and differences) have occurred due to:
– chance (random mutations);
– a common origin (degree of conservation).
One approach would be to count the percentage of matches, but there is also a need to include the bias associated with possible substitutions. However, similarity does not necessarily imply a common ancestor, or vice versa: Zvelebil and Baum (2008, p. 74) suggest this can occur through convergent or divergent evolution (a bat and a bird both have wings). So the findings of alignment tests need to be put in context.

Alignment scoring methods
In general, alignments are given a score at each aligned position; the alignment with the largest total score is optimal and is chosen, although suboptimal alignments may also need to be considered. The most basic score is obtained by measuring the percentage of similarity. Given that not all “changes” occur with equal chance, there is a need to develop:
– a nucleotide substitution matrix.
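The “percentage of similarity” mentioned above can be sketched as a percent-identity calculation over two already-aligned strings. This is an illustrative sketch only: the function name is made up for the example, "-" is assumed to mark a gap, and dividing by the full alignment length (gap columns included) is one convention among several.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns in which both sequences carry the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_a)
    if columns == 0:
        return 0.0
    # A column counts as a match only if the symbols agree and are not gaps.
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / columns


print(percent_identity("ag-tcc", "cgctca"))  # 50.0 (3 matching columns out of 6)
```

Because every mismatch counts the same here, this measure cannot express that some substitutions are more likely than others, which is the motivation for a substitution matrix.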

Nucleotide scoring matrix
It is known that certain mutations are more likely to occur than others: e.g. transitions (purine ↔ purine or pyrimidine ↔ pyrimidine, such as a ↔ g or c ↔ t) are more common than transversions (purine ↔ pyrimidine, such as a ↔ c). However, since the probability of such a difference is insignificant in relation to the chance of a mutation itself, the differences are mostly ignored. The following shows a typical scoring matrix for nucleotides.
[Figure: nucleotide scoring matrix. Adapted from Baxevanis p. 303]
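The distinction between the two kinds of substitution can be made concrete with a small classifier. This is a sketch for illustration; the function name is hypothetical, and it relies only on the standard grouping of a and g as purines and c and t as pyrimidines.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def substitution_type(x: str, y: str) -> str:
    """Classify a pair of nucleotides as a match, transition or transversion."""
    x, y = x.lower(), y.lower()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA nucleotide: {base!r}")
    if x == y:
        return "match"
    # Transition: both purines or both pyrimidines. Transversion: one of each.
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"


print(substitution_type("a", "g"))  # transition
print(substitution_type("c", "t"))  # transition
print(substitution_type("a", "c"))  # transversion
```

A scoring scheme that wanted to reflect the bias could penalise transversions more than transitions; a scheme that ignores it, as described above, gives every mismatch the same score.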

Nucleic acid scoring matrix
The values are based on how likely each type of substitution is to occur (the expected value); this includes a nucleotide being “substituted” with itself. The expected values are calculated as the ratio:
– number of “observed changes” / number of changes “due to chance”.
The counts are obtained by examining large numbers of DNA sequences.

Nucleic acid scoring matrix
Then calculate the score as 10 × log10(expected value). Taking logarithms ensures that the values for adjacent nucleotide positions can be added, rather than multiplied, when determining the alignment score.

Nucleic acid scoring matrix
– An expected value of 1 indicates that the substitution has the same chance of occurring as if it were occurring randomly.
– A value greater than 1 indicates a bias in favour of the substitution.
– A value less than 1 indicates a bias against the substitution.
A score of 5 corresponds to what expected value?
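The relationship between the expected value (a ratio) and the score can be checked numerically. The sketch below assumes the 10 × log10 scaling from the previous slide; the function names and the example ratios are invented for illustration and are not taken from any published matrix.

```python
import math


def score_from_ratio(ratio: float) -> float:
    """Convert an observed/chance ratio into an additive score."""
    return 10 * math.log10(ratio)


def ratio_from_score(score: float) -> float:
    """Invert the scaling: recover the ratio that a score stands for."""
    return 10 ** (score / 10)


# Hypothetical ratios for three aligned positions.
ratios = [2.0, 0.5, 4.0]
product_of_ratios = math.prod(ratios)
sum_of_scores = sum(score_from_ratio(r) for r in ratios)

# Adding the scores is equivalent to multiplying the ratios.
print(math.isclose(sum_of_scores, score_from_ratio(product_of_ratios)))  # True
print(score_from_ratio(1.0))  # 0.0, i.e. no bias either way
```

Ratios above 1 give positive scores and ratios below 1 give negative scores, so the sign of a matrix entry shows directly whether a substitution is favoured. Calling ratio_from_score with a score of 5 answers the question posed on the slide.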

Measuring protein similarity
Deriving a matrix for proteins is more complex because:
– there are 20 amino acids, so there is a much larger set of possible substitutions;
– the amino acids have properties that affect the structure, and so the function, of the protein.
Substitutions can therefore be conserved or semi-conserved. Observation shows that conserved substitutions (e.g. hydrophobic ↔ hydrophobic) are more common than semi-conserved ones (e.g. hydrophilic ↔ hydrophobic).

Dot plot matrix: imperfect match
Some alignments require gaps to increase the matching score; the gaps are used to represent insertion/deletion mutations. The diagram shows that most of the two sequences are aligned. Where the diagram shows gaps, these indicate areas of non-alignment or mismatch: gaps or substitutions.
[Figure: dot plot of an imperfect match. Adapted from: dotplot example]

Measurement of alignment gaps
Gaps represent insertions and deletions. Baxevanis (2005) suggests that no more than “one gap in 20 pairs is a good rule of thumb”. Gaps in alignments are penalised, i.e. given a negative scoring value. The penalty associated with using gaps depends on:
– opening the gap (introducing an insertion or deletion);
– extending the gap (as opposed to opening a new gap);
– the length of the gap (the number of deletions/insertions).
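One common way to combine the three factors listed above is an affine gap penalty, in which opening a gap costs more than extending it. The sketch below is an assumption-laden illustration: the function names, the default values (10 to open, 1 to extend) and the convention that the opening cost covers the first gap position are all choices made for this example.

```python
import re


def affine_gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Negative score for a single gap spanning `length` positions."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    # The first position pays the opening cost; each further position pays the extension cost.
    return -(gap_open + gap_extend * (length - 1))


def total_gap_penalty(aligned: str, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Sum the penalties of every run of '-' in one aligned sequence."""
    gap_runs = re.findall(r"-+", aligned)
    return sum(affine_gap_penalty(len(run), gap_open, gap_extend) for run in gap_runs)


print(affine_gap_penalty(1))          # -10.0
print(affine_gap_penalty(3))          # -12.0
print(total_gap_penalty("ac--gt-a"))  # -21.0 (a gap of 2 and a gap of 1)
```

With these values one gap of length 3 (-12) is penalised far less than three separate gaps of length 1 (-30), reflecting the idea that a single insertion or deletion event can span several positions.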

Gap penalties
There is no overall agreement on what values should be assigned to gap penalties (Zvelebil and Baum 2008). The purpose of inserting a gap is to increase the strength of the alignment. Choosing a high gap penalty will eliminate alignments with gaps, while if the penalty is too low, alignments with more and larger gaps will be chosen. The value should also depend on how closely “related” the aligned sequences must be:
– alignments requiring a very strict match would use a high gap penalty;
– alignments between distantly related species would use a low gap penalty.
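The effect of the gap penalty on which alignment is chosen can be seen by scoring two candidate alignments of the same pair of sequences under a strict and a lenient penalty. Everything in this sketch is hypothetical: the sequences, the match/mismatch scores (+1/-1), the two penalty values and the simple per-position (linear) gap cost were picked only to make the contrast visible.

```python
def alignment_score(aligned_a: str, aligned_b: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score an alignment column by column with a per-position gap cost."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


# Two candidate alignments of aacgtacg and acgtacgt.
candidates = {
    "ungapped": ("aacgtacg", "acgtacgt"),   # 1 match, 7 mismatches
    "gapped": ("aacgtacg-", "-acgtacgt"),   # 7 matches, 2 gap positions
}


def best_candidate(gap: int) -> str:
    """Name of the candidate with the highest score for a given gap cost."""
    return max(candidates, key=lambda name: alignment_score(*candidates[name], gap=gap))


print(best_candidate(gap=-8))  # ungapped: the strict penalty rules out the gaps
print(best_candidate(gap=-2))  # gapped: the lenient penalty lets the matches win
```

The same two sequences therefore yield different “best” alignments depending solely on the gap penalty, which is why the value has to be chosen with the expected relatedness of the sequences in mind.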

Potential exam questions
– What is the purpose of measuring the strength of an alignment? (3 marks)
– Explain two differences between analysing a string (sequence) and a DNA string. (4 marks)
– Describe how you would measure the similarity between two DNA sequences. (10 marks)
– Discuss the use of gap penalties in a sequence alignment score. (13 marks)

References
– Baxevanis, A.D. (2005) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, chapter 11. Wiley.
– Lesk, A. (2008) Introduction to Bioinformatics, 3rd edition. Oxford University Press.
– Zvelebil and Baum (2008) Understanding Bioinformatics.