Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences. Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen), Dep. de Llenguatges i Sistemes Informàtics, CEPBA-IBM Research Institute, Universitat Politècnica de Catalunya

Contents
1. (Exact) String matching of one pattern
2. (Exact) String matching of many patterns
3. Extended string matching and regular expressions
4. Approximate string matching (Dynamic programming)
5. Pairwise and multiple alignment
6. Suffix trees

Master Course Second lecture: First part: Extended string matching

1. Classes of characters in the text: some characters in the text represent sets of symbols.
2. Classes of characters in the pattern: some classes of characters are represented by one symbol.
For instance, the IUPAC code for the DNA alphabet is: R = {G,A}, Y = {T,C}, K = {G,T}, M = {A,C}, S = {G,C}, W = {A,T}, B = {G,T,C}, D = {G,A,T}, H = {A,C,T}, V = {G,C,A}, N = {A,G,C,T} (any).
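
The class table above can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the names `IUPAC`, `symbols`, `compatible` and `naive_search` are hypothetical, and treating two symbols as compatible when their base sets intersect is an assumption that covers classes in the text as well as classes in the pattern.

```python
# IUPAC nucleotide codes, copied from the list on the slide.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT",
}


def symbols(code):
    """Return the set of bases that one IUPAC symbol stands for."""
    return set(IUPAC[code])


def compatible(text_symbol, pattern_symbol):
    """Assumption: two symbols match when their base sets share a base."""
    return bool(symbols(text_symbol) & symbols(pattern_symbol))


def naive_search(pattern, text):
    """Report every start position where all aligned symbols are compatible."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(compatible(text[i + j], pattern[j]) for j in range(m))
    ]


if __name__ == "__main__":
    print(naive_search("ATGTA", "GTARTRNAAGGA"))
```

The brute-force scan checks every window; the algorithms on the following slides aim to skip windows instead.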

Classes in the text. [Figure: most efficient algorithms (Navarro & Raffinot) as a function of the alphabet size |Σ| and the pattern length: Horspool, BNDM, BOM.]

Classes in the text: Horspool example. Given the pattern ATGTA, the shift table is: A 4, C 5, G 2, T 1, R ?, …, N ?

Classes in the text: Horspool example. Suppose the pattern is ATGTA. A class symbol takes the smallest shift among its members, so R = {G,A} gets min(2,4) = 2, and the shift table would be: A 4, C 5, G 2, T 1, R 2, …, N ?

Classes in the text: Horspool example. Likewise N = {A,G,C,T} gets the smallest shift of all four bases. Given the pattern ATGTA and the shift table A 4, C 5, G 2, T 1, R 2, …, N 1, the pattern is searched in the text G T A R T R N A A G G A …
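
The shift table of this example can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the pattern contains plain bases only, the text may contain class symbols, and `CLASSES`, `shift_table` and `horspool_classes` are hypothetical names (only the classes A, C, G, T, R and N from the example are listed).

```python
# Subset of the IUPAC classes, enough for the example text.
CLASSES = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "GA", "N": "AGCT"}


def shift_table(pattern, classes=CLASSES):
    """Horspool shifts for plain bases, extended to class symbols."""
    m = len(pattern)
    base = {c: m for c in "ACGT"}
    # The last pattern character is skipped, as in plain Horspool.
    for i, c in enumerate(pattern[:-1]):
        base[c] = m - 1 - i
    # A class symbol may stand for any member, so only the smallest
    # member shift is safe.
    return {sym: min(base[c] for c in members) for sym, members in classes.items()}


def horspool_classes(pattern, text, classes=CLASSES):
    """Return the start positions where the pattern fits the class text."""
    m, n = len(pattern), len(text)
    shift = shift_table(pattern, classes)
    hits, pos = [], 0
    while pos <= n - m:
        # A text symbol matches when the pattern base belongs to its class.
        if all(pattern[j] in classes[text[pos + j]] for j in range(m)):
            hits.append(pos)
        pos += shift[text[pos + m - 1]]
    return hits


if __name__ == "__main__":
    print(shift_table("ATGTA"))
    print(horspool_classes("ATGTA", "GTARTRNAAGGA"))
```

Because class symbols force the smallest shift of their members, a text rich in classes such as N makes the window advance by small steps.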

Classes in the text. [Figure: most efficient algorithms (Navarro & Raffinot) as a function of the alphabet size |Σ| and the pattern length: Horspool, BNDM, BOM.] BNDM: Backward Nondeterministic Dawg Matching. BOM: Backward Oracle Matching.

Exact matching of one pattern (text on-line). [Figure: most efficient algorithms (Navarro & Raffinot) as a function of the alphabet size |Σ| and the pattern length: Horspool, BNDM, BOM.] BNDM: Backward Nondeterministic Dawg Matching. BOM: Backward Oracle Matching.

Classes in the text: BOM. How is the next position of the window determined? How is the comparison done? [Figure: text, pattern and automaton.] Automaton: Factor Oracle. It checks whether the suffix is a factor of the pattern. But first let us analyse how the comparison is done…

Classes in the text: BOM example. The automaton of the reversed pattern is built. Suppose the pattern is ATGTATG and the search runs over the text G T A R T R N A A T G… How is the comparison done? [Figure: factor oracle of the reversed pattern.] No improvement is possible!

Exact matching of many patterns. [Figure: most efficient algorithms as a function of the alphabet size |Σ| and the minimum pattern length: Wu-Manber and SBOM for 5 words; Wu-Manber, SBOM and Ad AC for 10, 100 and 1000 words.]

Classes in the text: Set Horspool. Search for the patterns ATGTATG, TATG, ATAAT, ATGTG [Figure: trie built from the patterns] in the text ARTGNCTATGTGACA… No improvement is possible!

Classes in the text. [Figure: most efficient algorithms for many patterns as a function of the alphabet size |Σ| and the minimum pattern length: Wu-Manber and SBOM for 5 words; Wu-Manber, SBOM and Ad AC for 10, 100 and 1000 words.]

Classes in the pattern. [Figure: most efficient algorithms (Navarro & Raffinot) as a function of the alphabet size |Σ| and the pattern length: Horspool, BNDM, BOM.]

Master Course Second lecture: Second part: Regular expressions matching

Regular expressions. A regular expression ℛ is a string over Σ ∪ { ε, |, ·, *, (, ) } defined recursively as follows:
- ε is a regular expression
- a character of Σ is a regular expression
- ( ℛ ) is a regular expression
- ℛ1 · ℛ2 is a regular expression
- ℛ* is a regular expression
- ℛ1 | ℛ2 is a regular expression

Regular language. The language represented by a regular expression is the set of words that can be built from the regular expression. The problem of searching for a regular expression in a text is that of finding all the factors of the text that belong to the corresponding regular language.
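
The search problem stated above can be illustrated with a brute-force Python sketch. It assumes Python's `re` syntax, where concatenation is written by juxtaposition rather than with the · operator; the function name `regex_factors` and the example expression are hypothetical, and only non-empty factors are reported.

```python
import re


def regex_factors(regex, text):
    """List every non-empty factor text[i:j] that belongs to the language."""
    compiled = re.compile(regex)
    n = len(text)
    return [
        (i, j, text[i:j])
        for i in range(n)
        for j in range(i + 1, n + 1)
        # fullmatch with pos/endpos tests exactly the factor text[i:j].
        if compiled.fullmatch(text, i, j)
    ]


if __name__ == "__main__":
    # A, then any run of C or G, then T.
    print(regex_factors("A(C|G)*T", "ACGTAT"))
```

Testing every factor takes a quadratic number of match attempts, so this only states the problem; it is not an efficient matching algorithm.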

Master Course Second lecture: Third part: Approximate string matching

Approximate string matching. For instance, given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC, search for the pattern ACTGA allowing one error… but what is the meaning of “one error”?

Edit distance. We accept three types of errors:
1. Mismatch: ACCGTGAT → ACCGAGAT
2. Insertion: ACCGTGAT → ACCGATGAT
3. Deletion: ACCGTGAT → ACCGGAT
Insertions and deletions are jointly called indels. The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one: d(ACT,ACT)=0, d(ACT,AC)=1, d(ACT,C)=2, d(ACT,)=3, d(AC,ATC)=1, d(ACTTG,ATCTG)=2.
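
The example values above can be checked directly from the definition. The following is a minimal sketch, assuming unit cost for every substitution, insertion and deletion; the function name `edit_distance` is hypothetical, and the memoised recursion is used here only because it mirrors the definition.

```python
from functools import lru_cache


def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions from a to b."""

    @lru_cache(maxsize=None)
    def d(i, j):
        # d(i, j) is the distance between the prefixes a[:i] and b[:j].
        if i == 0:
            return j  # insert the j remaining characters
        if j == 0:
            return i  # delete the i remaining characters
        return min(
            d(i - 1, j - 1) + (a[i - 1] != b[j - 1]),  # match or mismatch
            d(i - 1, j) + 1,                           # deletion
            d(i, j - 1) + 1,                           # insertion
        )

    return d(len(a), len(b))


if __name__ == "__main__":
    for x, y in [("ACT", "ACT"), ("ACT", "AC"), ("ACT", "C"),
                 ("ACT", ""), ("AC", "ATC"), ("ACTTG", "ATCTG")]:
        print(x, y, edit_distance(x, y))
```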

Edit distance and alignment of strings. Given d(ACT,ACT)=0, d(ACT,AC)=1 and d(ACTTG,ATCTG)=2, which is the best alignment in each case?
ACT and ACT:
ACT
ACT
ACT and AC:
ACT
AC-
ACTTG and ATCTG:
ACTTG
ATCTG
or
ACT-TG
A-TCTG
The edit distance is related to the best alignment of the strings.

Edit distance and alignment of strings. But what is the distance between the strings ACGCTATGCTATACG and ACGGTAGTGACGC, and what is the best alignment between them? This problem was first discussed in 1966, and the algorithm was proposed in 1968, 1970, … using the technique called “dynamic programming”.

Edit distance and alignment of strings. [Table: the columns are labelled with the text C T A C T A C T A C G T and the rows with the pattern A C T G A.] The cell in row C (prefix AC) and column T (prefix CTACT) contains the distance between AC and CTACT.

Edit distance and alignment of strings. The table (text C T A C T A C T A C G T … across the columns, pattern A C T G A down the rows) is filled starting from its borders. The top-left cell is 0. The first row compares the empty string with each prefix of the text, so it holds 1 for C (alignment - over C), 2 for CT (alignment - - over CT), and so on up to 6 for CTACTA. The first column compares each prefix of the pattern with the empty string, so it holds 1 for A, 2 for AC, 3 for ACT (alignment ACT over - - -), …

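Edit distance and alignment of strings. Let BA(x,y) be the best alignment of x and y. BA(AC,CTAC) is the best of three candidates: BA(AC,CTA) followed by the column (-, C), BA(A,CTA) followed by the column (C, C), and BA(A,CTAC) followed by the column (C, -). Since C matches C, the diagonal candidate adds no cost:

$$d(\mathrm{AC},\mathrm{CTAC}) = \min\{\, d(\mathrm{AC},\mathrm{CTA}) + 1,\; d(\mathrm{A},\mathrm{CTA}),\; d(\mathrm{A},\mathrm{CTAC}) + 1 \,\}$$

Edit distance and alignment of strings. For the next cell the last characters C and T differ, so the diagonal candidate also adds 1:

$$d(\mathrm{AC},\mathrm{CTACT}) = \min\{\, d(\mathrm{A},\mathrm{CTACT}) + 1,\; d(\mathrm{A},\mathrm{CTAC}) + 1,\; d(\mathrm{AC},\mathrm{CTAC}) + 1 \,\}$$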
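
The table filling and the recovery of a best alignment can be sketched as follows. This is an illustrative sketch, not the course's own program: `edit_table` and `best_alignment` are hypothetical names, rows follow the first string and columns the second, and ties are broken in favour of the diagonal move, which is an arbitrary choice.

```python
def edit_table(p, t):
    """D[i][j] is the edit distance between the prefixes p[:i] and t[:j]."""
    m, n = len(p), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i  # first column: prefix of p against the empty string
    for j in range(1, n + 1):
        D[0][j] = j  # first row: empty string against prefix of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (p[i - 1] != t[j - 1]),  # diagonal
                D[i - 1][j] + 1,                           # gap in t
                D[i][j - 1] + 1,                           # gap in p
            )
    return D


def best_alignment(p, t):
    """Walk back from the last cell and rebuild one best alignment."""
    D = edit_table(p, t)
    i, j = len(p), len(t)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (p[i - 1] != t[j - 1]):
            top.append(p[i - 1])
            bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            top.append(p[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(edit_table("AC", "CTACT")[2])
    print(best_alignment("ACT", "AC"))
```

Other tie-breaking rules return other alignments with the same distance, such as the gapped alignment of ACTTG and ATCTG.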

Edit distance and alignment of strings Connect to and use the global method.

Edit distance and alignment of strings. How can this algorithm be applied to approximate search, that is, to K-approximate string searching?

K-approximate string searching. [Table: the columns are labelled with the text C T A C T A C T A C G T A C T G G T G A A … and the rows with the pattern A C T G A.] This cell gives the distance between (ACTGA, CT…GTA)… but we are only interested in the last characters.
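
One common way to restrict the comparison to the last characters is to set the whole first row of the table to 0, so that an occurrence may start at any text position. The sketch below assumes that variant; the name `approx_search` and the convention of reporting exclusive end positions are hypothetical choices.

```python
def approx_search(pattern, text, k):
    """Return (end, distance) for every text position where the pattern
    matches a factor ending there with at most k errors."""
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix
    hits = []
    for j, c in enumerate(text, start=1):
        new = [0] * (m + 1)   # top cell stays 0: a match may start anywhere
        for i in range(1, m + 1):
            new[i] = min(
                col[i - 1] + (pattern[i - 1] != c),  # match or mismatch
                col[i] + 1,                          # extra text character
                new[i - 1] + 1,                      # missing pattern character
            )
        if new[m] <= k:
            hits.append((j, new[m]))  # j is the exclusive end position
        col = new
    return hits


if __name__ == "__main__":
    seq = "CTACTACTACGTCTATACTGATCGTAGCTACTACATGC"
    print(approx_search("ACTGA", seq, 1))
```

Only one column is kept at a time, so the memory used is proportional to the pattern length rather than to the text length.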

Master Course Second lecture: Fourth part: Pairwise and multiple alignment

Bioinformatics Pairwise and multiple alignment

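Pairwise alignment. Edit distance: match=0, mismatch=1, indel=1.

$$d(\mathrm{AC},\mathrm{CTACT}) = \min\{\, d(\mathrm{A},\mathrm{CTACT}) + 1,\; d(\mathrm{A},\mathrm{CTAC}) + 1,\; d(\mathrm{AC},\mathrm{CTAC}) + 1 \,\}$$

Similarity: match=1, mismatch=-1, indel=-2. The diagonal term adds +1 on a match and -1 on a mismatch; here C and T differ, so it adds -1:

$$s(\mathrm{AC},\mathrm{CTACT}) = \max\{\, s(\mathrm{A},\mathrm{CTACT}) - 2,\; s(\mathrm{A},\mathrm{CTAC}) - 1,\; s(\mathrm{AC},\mathrm{CTAC}) - 2 \,\}$$

Pairwise alignment. Similarity: match=1, mismatch=-1, indel=-2. [Table: text C T A C T A C T A C G T … across the columns and A C T down the rows; the first column holds A -2, C -4, T -6.]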
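
The similarity table can be sketched in the same way as the edit-distance table, replacing the minimum by a maximum. The function name `similarity_table` is hypothetical; the default scores are the ones given on the slide (match=1, mismatch=-1, indel=-2).

```python
def similarity_table(s1, s2, match=1, mismatch=-1, indel=-2):
    """S[i][j] is the best global alignment score of s1[:i] and s2[:j]."""
    m, n = len(s1), len(s2)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * indel  # prefix of s1 against the empty string
    for j in range(1, n + 1):
        S[0][j] = j * indel  # empty string against prefix of s2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = match if s1[i - 1] == s2[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + diagonal,
                S[i - 1][j] + indel,
                S[i][j - 1] + indel,
            )
    return S


if __name__ == "__main__":
    table = similarity_table("ACT", "CTACTACTACGT")
    print([row[0] for row in table])  # first column of the table
    print(similarity_table("ACTTG", "ATCTG")[-1][-1])
```

With these scores an indel costs twice as much as a mismatch, so the ungapped alignment of ACTTG and ATCTG scores higher than the gapped one.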

Pairwise alignment Connect to Links to TEACHING EMBER LePA

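Pairwise to multiple alignment. What happens with three strings S1, S2 and S3? Let n be their length; then the cost becomes $O(n^3) \cdot O(2^3) \cdot O(3^2)$: $n^3$ cells, about $2^3$ neighbouring cells to inspect for each cell, and $3^2$ pairs to score in each column. And with k strings? $O(n^k \, 2^k \, k^2)$.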

Multiple alignment. Multiple alignment programs use different heuristics:
- Clustal (progressive alignment)
- TCoffee (progressive alignment + databases)
- HMM (Hidden Markov Models)

Multiple alignment Connect to and follow the links TEACHING EMBER.