Applied Bioinformatics Week 3. Theory I Similarity Dot plot.


Applied Bioinformatics Week 3

Theory I Similarity Dot plot

Introduction to Bioinformatics. LECTURE 3: SEQUENCE ALIGNMENT. On sequence alignment: Sequence alignment is the most important task in bioinformatics!

LECTURE 3: SEQUENCE ALIGNMENT 3.2 On sequence alignment. Sequence alignment is important for:
* prediction of function
* database searching
* gene finding
* sequence divergence
* sequence assembly

LECTURE 3: SEQUENCE ALIGNMENT 3.3 On sequence similarity. Homology: genes that derive from a common ancestor gene are called homologs. Orthologous genes are homologous genes in different organisms. Paralogous genes are homologous genes in one organism that derive from gene duplication. Gene duplication: one gene is duplicated into multiple copies, which are therefore free to evolve and assume new functions.

HOMOLOGOUS and PARALOGOUS

HOMOLOGOUS and PARALOGOUS versus ANALOGOUS

[Figure: labels include globin, plants, Ath-g, analogs]

LECTURE 3: SEQUENCE ALIGNMENT: sequence similarity. Causes for sequence (dis)similarity:
mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g. ATA → AGA)
insertion: at a certain location one new nucleotide is inserted between two existing nucleotides (e.g. AA → AGA)
deletion: at a certain location one existing nucleotide is deleted (e.g. ACTG → AC-G)
indel: an insertion or a deletion

Similarity: We can only measure current similarity. We can form hypotheses

Similarity Searching: DotPlot, Needleman-Wunsch, Smith-Waterman, FASTA, BLAST

Dot Plot: Write one sequence horizontally and the other vertically. At each intersection where the two nucleotides are equal, place a dot in the matrix.
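The construction can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `dot_plot` and `render` and the two short test sequences are made up for this example and are not part of the slides.

```python
def dot_plot(seq_x, seq_y):
    """Return a list of rows, True where the two characters are equal.

    seq_x is written horizontally (columns), seq_y vertically (rows).
    """
    return [[cx == cy for cx in seq_x] for cy in seq_y]


def render(matrix, seq_x, seq_y):
    """Draw the matrix as text: '*' marks a dot, '.' an empty cell."""
    lines = ["  " + " ".join(seq_x)]
    for cy, row in zip(seq_y, matrix):
        lines.append(cy + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


plot = dot_plot("ACGT", "AGGT")
print(render(plot, "ACGT", "AGGT"))
```

Similar regions show up as runs of dots along a diagonal; with only four letters in the alphabet, many isolated dots are chance matches.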

Dot Plot

Messy? Strong similarities can be visually enhanced. Select a window size and a minimum similarity score for that window (e.g. 10 and 8). Create a new matrix with a dot wherever the window score is >= 8, i.e. at least 8 of the 10 positions along the diagonal window match.
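A minimal sketch of the window filter, assuming the window score is simply the number of matching positions along the diagonal. The function name `windowed_dot_plot` is hypothetical; the two sequences are the ones from the practice exercise in this deck, and the window of 3 with a threshold of 3 is one possible choice.

```python
def windowed_dot_plot(seq_x, seq_y, window, threshold):
    """Keep a dot at (row i, column j) only if the diagonal window that
    starts there contains at least `threshold` matching positions."""
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    plot = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Count matches along the diagonal starting at (i, j).
            matches = sum(seq_x[j + k] == seq_y[i + k] for k in range(window))
            row.append(matches >= threshold)
        plot.append(row)
    return plot


seq_x = "ACGTGTGCGTTTGAAC"
seq_y = "GGGTGTTCGTTTAAAC"
plot = windowed_dot_plot(seq_x, seq_y, window=3, threshold=3)
for row in plot:
    print("".join("*" if hit else "." for hit in row))
```

Lowering the threshold below the window size tolerates mismatches inside a window, at the cost of letting more noise back in.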

Dot Plot

Dot Plot Interpretation

Creating a Dot Plot

End Theory I Mindmapping 10 min break

Practice I Dot plot

Dot Plot
ACGTGTGCGTTTGAAC
GGGTGTTCGTTTAAAC
Make a dot plot for the two sequences above. Use a window of 3 to refine the view. Can you use Excel? Get any two DNA sequences and try the tool below –

Similarity Searching: DotPlot, Needleman-Wunsch, Smith-Waterman, FASTA, BLAST

Definitions Optimal alignment - one that exhibits the most correspondences. It is the alignment with the highest score. May or may not be biologically meaningful. Global alignment - Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences. Local alignment - Smith-Waterman (1981) gives the highest scoring local match between two sequences.

How can we find an optimal alignment?
ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG-CATCGTC--T-ATCT
How many possible alignments? C(27,7) gap positions = ~888,000 possibilities. Dynamic programming: the Needleman & Wunsch algorithm.

Time Complexity: Consider two sequences, AAGT and AGTC. How many possible alignments do the 2 sequences have? $\binom{2n}{n} = \frac{(2n)!}{(n!)^2} = \Theta\left(\frac{2^{2n}}{\sqrt{\pi n}}\right) = \Omega(2^n)$; for $n = 4$ this gives 70.
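The two counts quoted in this part of the deck can be checked directly with the standard library. This is a small sketch; the helper name `central_binomial` is made up here, and the extra values of n are only there to show how fast the count grows.

```python
from math import comb


def central_binomial(n):
    """Number of alignments counted by the formula C(2n, n)."""
    return comb(2 * n, n)


# Ways to place 7 gap positions among 27 alignment columns.
print(comb(27, 7))          # 888030

# Two sequences of length 4, as in AAGT / AGTC.
print(central_binomial(4))  # 70

# The count explodes with sequence length, which rules out enumeration.
for n in (10, 20, 50):
    print(n, central_binomial(n))
```

This growth is the reason for switching to dynamic programming, which fills an (n+1) × (n+1) table instead of enumerating alignments.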

Scoring a sequence alignment. Match/mismatch score: +1/+0. Open/extension penalty: –2/–1.
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1). Mismatches: 2 × 0. Open: 2 × (–2). Extension: 5 × (–1). Score = +9
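The arithmetic on this slide can be reproduced with a short scoring routine. It is a sketch under one assumption that the slide's numbers imply: the first position of each gap run costs the open penalty and every further position in the same run costs the extension penalty. The function name `score_alignment` is hypothetical.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0,
                    gap_open=-2, gap_extend=-1):
    """Score two aligned rows of equal length, where '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_owner = None  # row holding the current gap run: "a", "b" or None
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            owner = "a" if a == "-" else "b"
            # Continuing a run in the same row extends it; otherwise open.
            score += gap_extend if owner == gap_owner else gap_open
            gap_owner = owner
        else:
            score += match if a == b else mismatch
            gap_owner = None
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 9
```

The two gap runs have lengths 4 and 3, so they contribute 2 opens and 3 + 2 = 5 extensions, matching the counts on the slide.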

Pairwise Global Alignment Computationally: –Given: a pair of sequences (strings of characters) –Output: an alignment that maximizes the similarity

Needleman-Wunsch Alg

Which alignment is better? For scoring use: match 1, mismatch 0, gap open -2, gap extension -1. How can substitution matrices be integrated?

Needleman & Wunsch: Place each sequence along one axis. Place score 0 in the upper-left corner. Fill in the 1st row and column with multiples of the gap penalty. Fill in the rest of the matrix with the maximum value of 3 possible moves:
–Vertical move: score + gap penalty
–Horizontal move: score + gap penalty
–Diagonal move: score + match/mismatch score
The optimal alignment score is in the lower-right corner. To reconstruct the optimal alignment, trace back where the maximum at each step came from, and stop when the origin is reached.
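The fill procedure above can be sketched as follows. This is a minimal sketch, not a reference implementation: it assumes a single linear gap penalty (no separate open and extension costs), and the function name `needleman_wunsch_matrix` and the default scores (+1, -1, -1) are choices made for this example.

```python
def needleman_wunsch_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill the global alignment score matrix.

    seq1 runs down the rows, seq2 across the columns.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: multiples of the gap penalty.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # diagonal move
                score[i - 1][j] + gap,      # vertical move
                score[i][j - 1] + gap,      # horizontal move
            )
    return score


matrix = needleman_wunsch_matrix("ATCG", "TCG")
for row in matrix:
    print(row)
print("optimal score:", matrix[-1][-1])
```

A substitution matrix could be plugged in by replacing the `match`/`mismatch` test with a lookup keyed on the two letters.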

Three steps in Needleman-Wunsch Algorithm: Initialization, Scoring, Trace back (Alignment). The two DNA sequences to be globally aligned are: ATCG (x=4, length of sequence 1) and TCG (y=3, length of sequence 2). Pooja Anshul Saxena, University of Mississippi

Scoring Scheme: Match Score = +1, Mismatch Score = -1, Gap penalty = -1. Substitution Matrix:
    A   C   G   T
A   1  -1  -1  -1
C  -1   1  -1  -1
G  -1  -1   1  -1
T  -1  -1  -1   1

Initialization Step: Create a matrix with X+1 rows and Y+1 columns. The 1st row and the 1st column of the score matrix are filled with multiples of the gap penalty.
        T   C   G
    0  -1  -2  -3
A  -1
T  -2
C  -3
G  -4

Scoring: The score of any cell C(i, j) is the maximum of:
scorediag = C(i-1, j-1) + S(i, j)
scoreup = C(i-1, j) + g
scoreleft = C(i, j-1) + g
where S(i, j) is the substitution score for letters i and j, and g is the gap penalty. For the first cell (A against T, with S(T, A) = -1) this gives scorediag = -1, scoreup = -2 and scoreleft = -2, so C(i, j) = max = -1.
        T   C   G
    0  -1  -2  -3
A  -1  -1
T  -2
C  -3
G  -4

Scoring …. Example: The calculation for the cell C(2, 2), counting the gap row and gap column as row 1 and column 1 (A against T):
scorediag = C(i-1, j-1) + S(i, j) = 0 + (-1) = -1
scoreup = C(i-1, j) + g = -1 + (-1) = -2
scoreleft = C(i, j-1) + g = -1 + (-1) = -2
The maximum is -1, so C(2, 2) = -1.

Scoring …. Final Scoring Matrix
        T   C   G
    0  -1  -2  -3
A  -1  -1  -2  -3
T  -2   0  -1  -2
C  -3  -1   1   0
G  -4  -2   0   2

Trace back: The trace back step determines the actual alignment(s) that result in the maximum score. There may be several alignments with the same maximal score. Trace back starts from the last cell, i.e. position X, Y in the matrix, and produces the alignment in reverse order.

Trace back …. There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left. Trace back takes the current cell and looks at the neighbor cells that could be direct predecessors: the neighbor to the left (gap in sequence #1), the diagonal neighbor (match/mismatch), and the neighbor above (gap in sequence #2), given that sequence #1 runs down the rows and sequence #2 across the columns. The trace back algorithm then chooses one of the possible predecessors as the next cell on the path.

Trace back …. The only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of
Seq 1: G
       |
Seq 2: G
        T   C   G
    0  -1  -2  -3
A  -1  -1  -2  -3
T  -2   0  -1  -2
C  -3  -1   1   0
G  -4  -2   0   2

Trace back …. Final Trace back. Best Alignment (score 2: three matches and one gap):
A T C G
  | | |
- T C G
        T   C   G
    0  -1  -2  -3
A  -1  -1  -2  -3
T  -2   0  -1  -2
C  -3  -1   1   0
G  -4  -2   0   2
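The whole worked example, fill plus trace back, can be reproduced with the sketch below. It assumes the example's linear gap penalty and scores; the function name `needleman_wunsch` is hypothetical, and when several predecessors are possible this version simply prefers the diagonal, then up, then left, which is one arbitrary but valid tie-breaking rule.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return (aligned seq1, aligned seq2, score) for a global alignment."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the last cell to the origin, building both rows
    # of the alignment in reverse order.
    top, bottom = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(seq1[i - 1])
                bottom.append(seq2[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            # Move up: a letter of seq1 is aligned to a gap in seq2.
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            # Move left: a letter of seq2 is aligned to a gap in seq1.
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[-1][-1]


aligned1, aligned2, best = needleman_wunsch("ATCG", "TCG")
print(aligned1)
print(aligned2)
print("score:", best)
```

Collecting every optimal alignment, rather than one, would require following all tied predecessors instead of the first one that matches.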

Similarity Searching: DotPlot, Needleman-Wunsch, Smith-Waterman, FASTA, BLAST

Local Alignment. Problem first formulated by Smith and Waterman (1981). Problem: find an optimal alignment between a substring of s and a substring of t. Algorithm: a variant of the basic algorithm for global alignment.

Motivation: Searching for unknown domains or motifs within proteins from different families:
–Proteins encoded by Homeobox genes (conserved in only 1 region, the Homeo domain, 60 amino acids long)
–Identifying active sites of enzymes
Comparing long stretches of anonymous DNA. Querying databases where the query word is much smaller than the sequences in the database. Analyzing repeated elements within a single sequence.

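Smith-Waterman Alg: Very similar to Needleman-Wunsch, but determines a local instead of a global alignment. Scores can rise and fall along a path, but are never allowed to drop below 0. A local alignment is traced back from the highest-scoring cell until a cell with score 0 is reached.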

Three steps in Smith-Waterman Algorithm: Initialization, Scoring, Trace back (Alignment). The two DNA sequences to be locally aligned are: ATCG (x=4, length of sequence 1) and TCG (y=3, length of sequence 2).

Scoring Scheme: Match Score = +1, Mismatch Score = -1, Gap penalty = -1. Substitution Matrix:
    A   C   G   T
A   1  -1  -1  -1
C  -1   1  -1  -1
G  -1  -1   1  -1
T  -1  -1  -1   1

Initialization Step: Create a matrix with X+1 rows and Y+1 columns. The 1st row and the 1st column of the score matrix are filled with 0s.
        T   C   G
    0   0   0   0
A   0
T   0
C   0
G   0

Scoring: The score of any cell C(i, j) is the maximum of:
scorediag = C(i-1, j-1) + S(i, j)
scoreup = C(i-1, j) + g
scoreleft = C(i, j-1) + g
and 0
(here S(i, j) is the substitution score for letters i and j, and g is the gap penalty).
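The local scoring rule differs from the global one only in the zero-filled borders and the extra 0 inside the maximum. A minimal sketch, assuming a linear gap penalty and the example scores (+1, -1, -1); the function name `smith_waterman_matrix` is made up for illustration.

```python
def smith_waterman_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill the local alignment score matrix.

    seq1 runs down the rows, seq2 across the columns. The first row and
    column stay at 0, and no cell is allowed to go below 0.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                0,                          # restart the alignment here
                score[i - 1][j - 1] + sub,  # diagonal move
                score[i - 1][j] + gap,      # vertical move
                score[i][j - 1] + gap,      # horizontal move
            )
    return score


matrix = smith_waterman_matrix("ATCG", "TCG")
for row in matrix:
    print(row)
print("best local score:", max(max(row) for row in matrix))
```

The best score is read from the largest cell anywhere in the matrix, which need not be the lower-right corner.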

Scoring …. Example: The calculation for the cell C(2, 2), counting the zero row and zero column as row 1 and column 1 (A against T):
scorediag = C(i-1, j-1) + S(i, j) = 0 + (-1) = -1
scoreup = C(i-1, j) + g = 0 + (-1) = -1
scoreleft = C(i, j-1) + g = 0 + (-1) = -1
All three are negative, so the cell is set to 0.
        T   C   G
    0   0   0   0
A   0   0
T   0
C   0
G   0

Scoring …. Final Scoring Matrix. Note: It is not mandatory that the last cell has the maximum alignment score!
        T   C   G
    0   0   0   0
A   0   0   0   0
T   0   1   0   0
C   0   0   2   1
G   0   0   1   3

Trace back: The trace back step determines the actual alignment(s) that result in the maximum score. There may be several alignments with the same maximal score. Trace back starts from the cell with the maximum value in the matrix and produces the alignment in reverse order.

Trace back …. There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left. Trace back takes the current cell and looks at the neighbor cells that could be direct predecessors: the neighbor to the left (gap in sequence #1), the diagonal neighbor (match/mismatch), and the neighbor above (gap in sequence #2), given that sequence #1 runs down the rows and sequence #2 across the columns. The trace back algorithm then chooses one of the possible predecessors as the next cell on the path. This continues until a cell with value 0 is reached.

Trace back …. The only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of
Seq 1: G
       |
Seq 2: G
        T   C   G
    0   0   0   0
A   0   0   0   0
T   0   1   0   0
C   0   0   2   1
G   0   0   1   3

Trace back …. Final Trace back. Best Alignment:
T C G
| | |
T C G
        T   C   G
    0   0   0   0
A   0   0   0   0
T   0   1   0   0
C   0   0   2   1
G   0   0   1   3
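The local example can be reproduced end to end with the sketch below. It assumes the same linear gap penalty and scores as the worked example; the function name `smith_waterman` is hypothetical, ties for the maximum cell are resolved by taking the first one found, and tied predecessors are resolved diagonal first.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return (aligned part of seq1, aligned part of seq2, score)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the maximum cell until a cell with value 0.
    top, bottom = [], []
    i, j = best_pos
    while score[i][j] > 0:
        sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(seq1[i - 1])
            bottom.append(seq2[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), best


local1, local2, local_score = smith_waterman("ATCG", "TCG")
print(local1)
print(local2)
print("score:", local_score)
```

Unlike the global alignment of the same pair, the unmatched leading A of ATCG is simply left out instead of being paid for with a gap.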

LECTURE 3: GLOBAL ALIGNMENT

LECTURE 3: GLOBAL ALIGNMENT

Significance of Sequence Alignment: Consider randomly generated sequences. What distribution do you think the best local alignment score of two sequences of the same length should follow?
1. Uniform distribution
2. Normal distribution
3. Binomial distribution (n Bernoulli trials)
4. Poisson distribution ($n \to \infty$, $np = \lambda$)
5. Others

Extreme Value Distribution: $Y_{ev} = \exp(-x - e^{-x})$
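The shape of this density is easy to inspect numerically. The sketch below evaluates the formula as given on the slide (the standard form, with no location or scale parameter); the function name `extreme_value_density` is made up for this example.

```python
import math


def extreme_value_density(x):
    """Standard extreme value (Gumbel) density: exp(-x - exp(-x))."""
    return math.exp(-x - math.exp(-x))


# The density peaks at x = 0 and is skewed: the right tail decays
# roughly like exp(-x), while the left tail collapses far faster.
for x in (-3, -1, 0, 1, 3, 5):
    print(f"{x:>3}  {extreme_value_density(x):.6f}")
```

The heavy right tail is what separates it from a normal distribution: high scores are more likely by chance than a normal model would suggest, which matters when judging whether an alignment score is significant.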

Extreme Value Distribution vs. Normal Distribution

“Twilight Zone”: Some proteins with less than 15% similarity have the same 3-D structure, while some proteins with 20% similarity have different structures. In the twilight zone, neither homology nor non-homology can be taken for granted from sequence similarity alone.

End of Theoretical Part 2 Mindmapping 10 min break

Needleman-Wunsch
ACGTGTGCGTTTGAAC
GGGTGTAGTCGTTTAAAC
Apply the Needleman-Wunsch algorithm to these two sequences. Score the alignments.

Alignments Explanation for alignment algorithms – Alignment of 2 sequences – Get any two amino acid sequences and try –