Recap: 3 different types of comparisons
1. Whole genome comparison
2. Gene search
3. Motif discovery (shared pattern discovery)



Agenda
More about Shared Pattern Discovery
Edit Distance
– Recap
– What you need to know for the next quiz
Alignment
– More details
– More examples

Shared Pattern Discovery
I have 10 rats that all have green eyes.
I have 10 rats that all have blue eyes.
What exactly do the 10 green-eyed rats have in common that gives them green eyes?

Shared Pattern Discovery
Multiple alignment can be used to measure the strength of a genomic pattern found in a set of sequences:
– First, completely align the 10 green-eyed rats
– Then, align green-eyed rats with blue-eyed rats
– Finally, compare the statistical difference
Initially, this is how genes were pinpointed.

Shared Pattern Discovery
[Figure: two panels, "Multiple alignment of 10 green-eyed rats" and "Alignment of blue-eyed rat and green-eyed rat", annotated with similarity values 99.2%, 99.4%, 99.1%, 94.5%, 99.3%, 95.2%, 99.2%, 94.7%; the assignment of values to panels is not recoverable.]
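The slides do not say how "% similar" is computed. One simple possibility, assumed here purely for illustration, is percent identity: the share of alignment columns in which both rows carry the same symbol. The function name `percent_identity` and the use of `_` as the gap character are choices made for this sketch.

```c
#include <string.h>

/* Percent identity of two aligned rows of equal length.
   A column counts as identical when both rows hold the same non-gap symbol.
   Returns -1.0 if the rows are empty or differ in length.
   Assumption: '_' marks a gap, as in the deck's traceback slides. */
double percent_identity(const char *row1, const char *row2) {
    size_t len = strlen(row1);
    if (len == 0 || len != strlen(row2)) return -1.0;
    size_t same = 0;
    for (size_t k = 0; k < len; k++)
        if (row1[k] == row2[k] && row1[k] != '_') same++;
    return 100.0 * (double)same / (double)len;
}
```

Comparing such values within the green-eyed group and across the two groups is one way to carry out the "compare the statistical difference" step.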

Recap: Exact string matching
It's important to know why exact matching doesn't work.
– Target: CGTACGAC
– Pattern: CGTACGTACGTACGTTCA
Problem: the target can NOT be found in the pattern even though there is a near-match.
Sequences either match or don't match. There is no 'in-between'.
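A minimal sketch of the exact-matching check, using the standard C function `strstr`. The wrapper name `find_exact` is made up for this example, and it follows the slide's naming, where the short "target" is searched for inside the longer "pattern".

```c
#include <string.h>

/* Returns the 0-based offset of the first exact occurrence of target
   inside pattern, or -1 if target does not occur at all. */
int find_exact(const char *pattern, const char *target) {
    const char *hit = strstr(pattern, target);
    return hit ? (int)(hit - pattern) : -1;
}
```

For the slide's strings the call returns -1: the near-match is invisible to an all-or-nothing comparison.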

Recap: Edit Dist. for Local Search
Question: How many edits are needed to exactly match the target with part of the pattern?
– Target: CGTACGAC
– Pattern: CGTACGTACGTACGTTCA
Answer: 1 deletion
Example of local search (gene finding)

Recap: Edit Dist. for Global Comp.
Question: How many edits are needed to exactly match the ENTIRE target with the WHOLE pattern?
– Target: CGTACGAC
– Pattern: CGTACGTACGTACGTTCA
Answer: 10 deletions
Example of global comparison (whole genome comparison)
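The two recap questions differ only in how the dynamic programming table is initialized and where the answer is read. The sketch below assumes unit-cost insertions and deletions only (no substitutions). The names `global_distance` and `local_distance` and the `MAX_LEN` cap are choices made for this example.

```c
#include <string.h>

#define MAX_LEN 64  /* hypothetical cap on input length for this sketch */

static int min2(int a, int b) { return a < b ? a : b; }

/* Shared table fill. With local != 0 the target may start and end anywhere
   in the pattern at no cost; otherwise both strings are matched in full. */
static int indel_distance(const char *target, const char *pattern, int local) {
    int n = (int)strlen(target), m = (int)strlen(pattern);
    if (n > MAX_LEN || m > MAX_LEN) return -1;
    int d[MAX_LEN + 1][MAX_LEN + 1];
    for (int x = 0; x <= n; x++) d[x][0] = x;
    for (int y = 0; y <= m; y++) d[0][y] = local ? 0 : y;  /* free start for local */
    for (int x = 1; x <= n; x++)
        for (int y = 1; y <= m; y++)
            if (target[x - 1] == pattern[y - 1])
                d[x][y] = d[x - 1][y - 1];
            else
                d[x][y] = min2(d[x][y - 1] + 1, d[x - 1][y] + 1);
    if (!local) return d[n][m];
    int best = d[n][0];                                    /* free end for local */
    for (int y = 1; y <= m; y++) best = min2(best, d[n][y]);
    return best;
}

int global_distance(const char *target, const char *pattern) {
    return indel_distance(target, pattern, 0);
}

int local_distance(const char *target, const char *pattern) {
    return indel_distance(target, pattern, 1);
}
```

With the slide's target and pattern, `local_distance` gives 1 and `global_distance` gives 10, matching the two answers above.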

Quiz coming up! You need to be able to compute the optimal edit distance. You need to fill in the table.

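Edit Distance – Dynamic Programming
[Table: dynamic programming table with column labels ACGTCGCAT and row labels A C G T G T G C; cell values not recoverable.]
Callouts on the table:
– Optimal edit distance for TG and TCG
– Optimal edit distance for TG and TCGA
– Optimal edit distance for TGA and TCG
– Final answer: optimal edit distance for TGA and TCGA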

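Edit Distance
/* Edit distance between seq1 (length n) and seq2 (length m),
   counting insertions and deletions only. */
int edit_distance(const char *seq1, int n, const char *seq2, int m) {
    int matrix[n+1][m+1];
    for (int x = 0; x <= n; x++) matrix[x][0] = x;   /* delete the whole prefix of seq1 */
    for (int y = 1; y <= m; y++) matrix[0][y] = y;   /* insert the whole prefix of seq2 */
    for (int x = 1; x <= n; x++)
        for (int y = 1; y <= m; y++)
            if (seq1[x-1] == seq2[y-1])               /* C strings are 0-indexed */
                matrix[x][y] = matrix[x-1][y-1];
            else {                                    /* cheaper of insertion and deletion */
                int ins = matrix[x][y-1] + 1, del = matrix[x-1][y] + 1;
                matrix[x][y] = ins < del ? ins : del;
                /* a third option, matrix[x-1][y-1] + 1, would also allow substitutions */
            }
    return matrix[n][m];
}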

Why Edit Distance Stinks for Genetic Data
DNA evolves in strange ways.
…TAGATCCCAGATCAGTATTCAAGTTATAC…. (this is a gene in the rat genome)
…GATCTCCCAGATAGAAGCAGTATTCAGTCA… (this is the same gene in the fruit bat)
… CCTATCAGCAGGATCAAGTATGTCATACTAC… (this is a totally unrelated region of the AIDS virus)
The edit distance between rat and virus is smaller than the edit distance between rat and fruit bat.

Alignment
We need a more robust way to measure similarity.
Alignment meets several requirements:
1. It rewards matches
2. It penalizes mismatches
3. It supports different strategies for penalizing gaps
4. It helps visualize similarity

Alignment
Two examples
Seq1: GCTAGTATGCCGATACTGA
Seq2: GCTAGATGCAGATACTTGA
Seq3: GCTAGTATGCCGATACGA
Seq4: GATAGACGCAGATGCTTGT
What's more similar?
– Seq1 & Seq2, or
– Seq3 & Seq4

Alignment
Three steps in the dynamic programming algorithm for alignment:
1. Initialization
2. Matrix fill (scoring)
3. Traceback (alignment)

Initialization

Matrix Fill
For each position, M_{i,j} is defined to be the maximum score at position i,j:
M_{i,j} = MAX[ M_{i-1,j-1} + S_{i,j} (match/mismatch),
               M_{i,j-1} + w (gap in sequence #1),
               M_{i-1,j} + w (gap in sequence #2) ]

Matrix Fill
M_{i,j} = MAX[ M_{i-1,j-1} + S_{i,j} (match/mismatch),
               M_{i,j-1} + w (gap in sequence #1),
               M_{i-1,j} + w (gap in sequence #2) ]
S_{i,j} = 1 if symbols match, otherwise S_{i,j} = 0
w = 0 (no gap penalty)

Matrix Fill
The score at position 1,1 can be calculated.
The first residue in both sequences is a G. Thus, S_{1,1} = 1.
Thus, M_{1,1} = MAX[M_{0,0} + 1, M_{1,0} + 0, M_{0,1} + 0] = MAX[1, 0, 0] = 1.
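The initialization and matrix-fill steps can be sketched as follows, using the slide's simple scheme (S = 1 for a match, 0 for a mismatch, w = 0). The function name `nw_score` and the `MAX_LEN` cap are assumptions of this sketch, and the test pair is the one used on the tracing-back slides (GAATTCAGTTA and GGATCGA).

```c
#include <string.h>

#define MAX_LEN 64  /* hypothetical cap on sequence length for this sketch */

static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Fills M with S = 1 (match) / 0 (mismatch) and gap weight w = 0.
   Returns the global alignment score M[n][m], or -1 if an input is too long. */
int nw_score(const char *seq1, const char *seq2) {
    int n = (int)strlen(seq1), m = (int)strlen(seq2);
    if (n > MAX_LEN || m > MAX_LEN) return -1;
    int M[MAX_LEN + 1][MAX_LEN + 1];
    const int w = 0;
    for (int i = 0; i <= n; i++) M[i][0] = i * w;  /* initialization */
    for (int j = 0; j <= m; j++) M[0][j] = j * w;
    for (int i = 1; i <= n; i++)                   /* matrix fill */
        for (int j = 1; j <= m; j++) {
            int s = (seq1[i - 1] == seq2[j - 1]) ? 1 : 0;
            M[i][j] = max3(M[i - 1][j - 1] + s,    /* match/mismatch */
                           M[i][j - 1] + w,        /* gap in sequence #1 */
                           M[i - 1][j] + w);       /* gap in sequence #2 */
        }
    return M[n][m];
}
```

With this scoring scheme the final score is simply the largest number of symbols that can be matched in order, which is 6 for the example pair.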

Matrix Fill

Tracing Back
(Seq #1) A
         |
(Seq #2) A


Tracing back the alignment
(Seq #1) TA
          |
(Seq #2)  A

Tracing Back
(Seq #1) TTA
           |
(Seq #2)   A

Tracing Back
(Seq #1) GAATTCAGTTA
         | | || |  |
(Seq #2) GGA_TC_G__A
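A traceback can be sketched by walking from the bottom-right cell back to the origin and, at each cell, following any predecessor that explains its score. This sketch assumes match = 1, mismatch = 0 and gap = 0. The function name `nw_align`, the tie-breaking order (diagonal first) and the `MAX_LEN` cap are choices made here. Because several alignments can share the optimal score, the rows it produces may differ from the ones on the slide while scoring the same.

```c
#include <string.h>

#define MAX_LEN 64  /* hypothetical cap on sequence length for this sketch */

/* Aligns a and b globally (match = 1, mismatch = 0, gap = 0) and writes the
   gapped rows to out_a and out_b, which must each hold at least
   strlen(a) + strlen(b) + 1 chars. '_' marks a gap.
   Returns the alignment score, or -1 if an input is too long. */
int nw_align(const char *a, const char *b, char *out_a, char *out_b) {
    int n = (int)strlen(a), m = (int)strlen(b);
    if (n > MAX_LEN || m > MAX_LEN) return -1;
    int M[MAX_LEN + 1][MAX_LEN + 1];
    for (int i = 0; i <= n; i++) M[i][0] = 0;
    for (int j = 0; j <= m; j++) M[0][j] = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int best = M[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 1 : 0);
            if (M[i - 1][j] > best) best = M[i - 1][j];
            if (M[i][j - 1] > best) best = M[i][j - 1];
            M[i][j] = best;
        }
    /* Traceback: build both rows in reverse, then flip them. */
    char ra[2 * MAX_LEN + 1], rb[2 * MAX_LEN + 1];
    int i = n, j = m, k = 0;
    while (i > 0 || j > 0) {
        int s = (i > 0 && j > 0 && a[i - 1] == b[j - 1]) ? 1 : 0;
        if (i > 0 && j > 0 && M[i][j] == M[i - 1][j - 1] + s) {
            ra[k] = a[--i]; rb[k] = b[--j];   /* aligned pair */
        } else if (i > 0 && M[i][j] == M[i - 1][j]) {
            ra[k] = a[--i]; rb[k] = '_';      /* gap in b */
        } else {
            ra[k] = '_';    rb[k] = b[--j];   /* gap in a */
        }
        k++;
    }
    for (int t = 0; t < k; t++) {
        out_a[t] = ra[k - 1 - t];
        out_b[t] = rb[k - 1 - t];
    }
    out_a[k] = out_b[k] = '\0';
    return M[n][m];
}
```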

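Robust Scoring
M_{i,j} = MAX[ M_{i-1,j-1} + S_{i,j} (match/mismatch),
               M_{i,j-1} + w1 (gap in sequence #1),
               M_{i-1,j} + w2 (gap in sequence #2) ]
Gap weights: w1 = -0.5, w2 = -0.7
[Table: scoring matrix S_{i,j} over A, C, G, T; only one entry (1.2, in the T row) survives.]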

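Alignment Scoring
Seq1: GTACTACGAC
Seq2: GAACGTAGAC
Alignment score = 8.4
[Figure: scoring matrix S_{i,j}, gap weights w1 = -0.5 and w2 = -0.7, and the per-column scores of the alignment; values not recoverable.]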

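Alignment Scoring
Seq1: GTACTACGAC
Seq2: GAACGTAGAC
Can you find a better alignment?
[Figure: scoring matrix S_{i,j}, gap weights w1 = -0.5 and w2 = -0.7, and the per-column scores of the alignment; values not recoverable.]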

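Alignment Scoring
Seq1: GTACTACGAC
Seq2: GAACGTAGAC
Alignment score = 7.8
[Figure: scoring matrix S_{i,j}, gap weights w1 = -0.5 and w2 = -0.7, and the per-column scores of the alignment; values not recoverable.]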

Alignment Scoring
Summary:
– We have a way of rewarding different types of matches and mismatches
– We have a separate way of penalizing gaps
– We could choose not to penalize gaps, if we knew that gaps didn't affect biological similarity
– We could even reward some types of mismatches, if we knew they still reflected biological similarity
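A small sketch of scoring one given alignment with a substitution table plus separate gap weights. The gap weights w1 = -0.5 and w2 = -0.7 come from the slides. The matrix values in `EXAMPLE_S` (1.0 for a match, -0.3 for a mismatch) are hypothetical, since the slide's own table did not survive, and the names `score_alignment` and `base_index` are invented for this example.

```c
#include <stddef.h>

/* HYPOTHETICAL substitution scores, rows and columns in the order A, C, G, T.
   These are illustration values, not the ones from the slides. */
static const double EXAMPLE_S[4][4] = {
    { 1.0, -0.3, -0.3, -0.3},
    {-0.3,  1.0, -0.3, -0.3},
    {-0.3, -0.3,  1.0, -0.3},
    {-0.3, -0.3, -0.3,  1.0},
};

static int base_index(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

/* Sums the score of every column of a gapped alignment. '_' marks a gap.
   w1 is charged for a gap in row1, w2 for a gap in row2. */
double score_alignment(const char *row1, const char *row2,
                       const double S[4][4], double w1, double w2) {
    double total = 0.0;
    for (size_t k = 0; row1[k] != '\0' && row2[k] != '\0'; k++) {
        if (row1[k] == '_') {
            total += w1;
        } else if (row2[k] == '_') {
            total += w2;
        } else {
            int i = base_index(row1[k]), j = base_index(row2[k]);
            if (i >= 0 && j >= 0) total += S[i][j];  /* unknown symbols add nothing */
        }
    }
    return total;
}
```

For example, the rows `AC_T` and `A_GT` score 1.0 - 0.7 - 0.5 + 1.0 = 0.8 under these assumed values. Rewarding a particular mismatch would only require changing the corresponding off-diagonal entry.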

Alignment scoring
Process:
1. Experts (chemists or biologists) look at sequence segments that are known to be biologically similar and compare them to sequence segments that are biologically dissimilar.
2. Use direct observation and statistics to develop a scoring scheme.
3. Given the scoring scheme, develop an algorithm to compute the maximum scoring alignment.

Alignment – Algorithmic Point of View
Align the symbols of two strings.
– Maximize the number of symbols that match.
– Minimize the number of symbols that do NOT match.
Gaps can be inserted to improve alignments.
A scoring system is used to measure the quality of an alignment.
[Figure: a scoring matrix over T, G, C, A and a separate gap penalty; only the values 4 and -3 survive.]
In practice:
– Scoring matrices and gap penalties are based on biological knowledge and statistical analysis

Local Alignment and Global Alignment
In Global Alignment the two strings must be entirely aligned (every aligned pair of symbols is scored).
In Local Alignment segments from each string are aligned and the rest of each string can be ignored.
Global alignment is used to compare the similarity of entire organisms.
Local alignment is used to search for genes.
[Figure: the strings AGAGTACTCAGTATCTGAT and ACATACTACAGTATCCA, shown twice.]

Alignment Scoring Revisited
Given a scoring system, the alignment score is the sum of the scores for each aligned pair of symbols plus the gap penalties.
Local Alignment
AGAGTACTCAGTATCTGAT
ACATACTACTGTATCCA
[Table: scoring matrix over A, C, G, T; values only partly legible and not recoverable.]
Total Score = 15

Alignment - Computer Science Perspective
Given two input strings and a scoring system, find the highest scoring local alignment among all possible alignments.
Fact: The number of possible alignments grows exponentially with the length of the input strings.
Solving this problem efficiently was an open problem until Smith and Waterman (1981) designed an efficient dynamic programming algorithm.
The algorithm takes O(nm) time, where n and m are the lengths of the two input strings.

Interesting History
The Smith Waterman algorithm for computing local alignment is considered one of the most important algorithms in computational biology.
However, the algorithm is merely a generalization of the edit distance algorithm, which was already published and well-known in computer science.
Converting the edit distance algorithm to solve the alignment problem is “trivial.”
Smith and Waterman are considered almost legendary for this accomplishment.
It is a perfect example of “being in the right place at the right time.”

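Smith Waterman Algorithm
D[i][j] = MAX(0, D[i-1][j-1] + S(i,j), D[i-1][j] + w, D[i][j-1] + w)
[Figure: dynamic programming table indexed by i and j, with zeros in the first row and column, shown next to the scoring matrix S(i,j) over T, G, C, A and the gap penalty w; the numeric values are not recoverable.]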
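The recurrence can be sketched directly in C. The scoring values below (`MATCH`, `MISMATCH`, `GAP`) are hypothetical stand-ins, because the slide's scoring matrix is not legible. The function name `sw_score` and the `MAX_LEN` cap are also assumptions of this example. The function returns only the best local score; recovering the aligned segments would need a traceback from the best cell back to the first 0.

```c
#include <string.h>

#define MAX_LEN 64  /* hypothetical cap on sequence length for this sketch */

/* HYPOTHETICAL scoring values, chosen for illustration only. */
enum { MATCH = 2, MISMATCH = -3, GAP = -4 };

/* Returns the highest local alignment score between a and b,
   or -1 if an input is longer than MAX_LEN. Runs in O(nm) time. */
int sw_score(const char *a, const char *b) {
    int n = (int)strlen(a), m = (int)strlen(b);
    if (n > MAX_LEN || m > MAX_LEN) return -1;
    int D[MAX_LEN + 1][MAX_LEN + 1];
    int best = 0;
    for (int i = 0; i <= n; i++) D[i][0] = 0;
    for (int j = 0; j <= m; j++) D[0][j] = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int s = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
            int v = 0;                                  /* restart the alignment here */
            if (D[i - 1][j - 1] + s > v) v = D[i - 1][j - 1] + s;
            if (D[i - 1][j] + GAP > v)   v = D[i - 1][j] + GAP;
            if (D[i][j - 1] + GAP > v)   v = D[i][j - 1] + GAP;
            D[i][j] = v;
            if (v > best) best = v;                     /* answer may sit in any cell */
        }
    return best;
}
```

The two differences from the global fill are the extra 0 in the maximum, which lets an alignment start anywhere, and reading the answer from the best cell rather than from the bottom-right corner.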

Smith Waterman Algorithm
[Figure: step-by-step fill of the dynamic programming table for column sequence ACGCTA and row sequence AGCTCA, with a scoring matrix over A, C, G, T. The first row and column are all 0; the legible frames appear to show row A beginning 0, 6, 1 and row G beginning 0, 1, 2, 5. The remaining values are not recoverable.]

Smith Waterman Algorithm
[Figure: the completed dynamic programming table for ACGCTA (columns) and AGCTCA (rows) with the resulting alignment; values not recoverable.]