
1 Contents
First week: algorithms for exact string matching.
– One pattern: the algorithm depends on |p| and |Σ|.
– k patterns: the algorithm depends on k, |p| and |Σ|.
Second week: alignment of sequences.
– Edit distance between two strings: dynamic programming
– Alignment of sequences: 2 sequences; 3 or more sequences
Third week: dealing with long sequences.

2 Distance between words
What is the distance between the words:
– table, maple
– able, table
– announce, pronounce
– ACCTG, ACTT
… and between
– ACGG, ACTGTGG
– AATCTACTAGCGTACTACTC, ACTACTACGTACTACG

3 Edit distance
We accept three types of errors:
1. Mismatch: ACCGTGAT → ACCGAGAT
2. Insertion: ACCGTGAT → ACCGATGAT
3. Deletion: ACCGTGAT → ACCGGAT
Insertions and deletions together are called indels.
The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one.
d(ACT,ACT)=   d(ACT,AC)=   d(ACT,C)=   d(ACT,)=   d(AC,ATC)=   d(ACTTG,ATCTG)=

4 Edit distance
We accept three types of errors:
1. Mismatch: ACCGTGAT → ACCGAGAT
2. Insertion: ACCGTGAT → ACCGATGAT
3. Deletion: ACCGTGAT → ACCGGAT
Insertions and deletions together are called indels.
The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one.
d(ACT,ACT)=0   d(ACT,AC)=1   d(ACT,C)=2   d(ACT,)=3   d(AC,ATC)=1   d(ACTTG,ATCTG)=2

5 Edit distance and alignments
The alignment that gives the distance can be represented as two rows, with * marking the columns where both characters agree:

  ACCGTGAT
  ACCG-GAT
  **** ***

  ACCG-TGAT
  ACCGATGAT
  **** ****

  ACCGTGAT
  ACCGAGAT
  **** ***

  ACCGTGTTATGTGTATG--TGA--AT
  ACCG-GAT--GTGT-TGTTTGAGTAT
  **** * *  **** **  ***  **

The score of the alignment is the sum of the scores of its columns:
– 0 if both chars are the same
– 1 otherwise
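The column-scoring rule on this slide (0 if both chars are the same, 1 otherwise) can be sketched in a few lines of Python. The function name `alignment_score` is an illustrative choice, not something from the slides, and a gap is simply treated as a character that never equals a letter.

```python
def alignment_score(row_a: str, row_b: str) -> int:
    """Sum of column scores: 0 when both characters agree, 1 otherwise."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(0 if x == y else 1 for x, y in zip(row_a, row_b))


# The three single-error examples from the slide.
print(alignment_score("ACCGTGAT", "ACCG-GAT"))    # one deletion
print(alignment_score("ACCG-TGAT", "ACCGATGAT"))  # one insertion
print(alignment_score("ACCGTGAT", "ACCGAGAT"))    # one mismatch
```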

6 Edit distance and alignments
But there are many alignments between two sequences. Given ACCG and ACT:

  ACCG-
  -AC-T
    *

  ACCG
  AC-T
  **

  ACCG
  ACT-
  **

  ACCG---
  ----ACT

Then the edit distance is the score of the best alignment, so we can find the distance by generating all alignments and picking the one with the smallest score.

7 Edit distance and Pairwise alignment
Given two DNA sequences A (a_1 a_2 ... a_n) and B (b_1 b_2 ... b_m) from the alphabet {a,c,t,g}, we say that A* and B* from {a,c,t,g,-} are aligned iff
i) A* and B* become A and B if gaps (-) are removed.
ii) |A*| = |B*|
iii) For all i, it is not possible that a*_i = b*_i = - (no column consists of two gaps).
Write all alignments between AA and AC...
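As a rough sketch of the "generate all alignments" idea, the following Python enumerates every pair (A*, B*) that satisfies conditions i) to iii) and takes the smallest score. The names `all_alignments`, `column_score` and `brute_force_distance` are made up for illustration. The number of alignments grows very quickly with the sequence lengths, so this is only usable on tiny inputs such as the AA / AC exercise.

```python
from typing import Iterator, Tuple


def all_alignments(a: str, b: str) -> Iterator[Tuple[str, str]]:
    """Yield every aligned pair (A*, B*); no column ever holds two gaps."""
    if not a and not b:
        yield "", ""
        return
    if a and b:  # first column pairs a[0] with b[0]
        for ra, rb in all_alignments(a[1:], b[1:]):
            yield a[0] + ra, b[0] + rb
    if a:        # first column pairs a[0] with a gap
        for ra, rb in all_alignments(a[1:], b):
            yield a[0] + ra, "-" + rb
    if b:        # first column pairs a gap with b[0]
        for ra, rb in all_alignments(a, b[1:]):
            yield "-" + ra, b[0] + rb


def column_score(row_a: str, row_b: str) -> int:
    """0 per matching column, 1 per mismatch or gap column."""
    return sum(x != y for x, y in zip(row_a, row_b))


def brute_force_distance(a: str, b: str) -> int:
    """Edit distance as the score of the best of all alignments."""
    return min(column_score(ra, rb) for ra, rb in all_alignments(a, b))


for row_a, row_b in all_alignments("AA", "AC"):
    print(row_a, row_b, column_score(row_a, row_b))
```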

8 Edit distance and Pairwise alignment To blackboard

9 Edit distance and alignment of strings
[Table: columns labelled C T A C T A C T A C G T, rows labelled A C T G A, all cells empty.]

10 Edit distance and alignment of strings
[Same empty table.]

11 Edit distance and alignment of strings
[Same table, with one cell marked.] The cell contains the distance between AC and CTACT.

12 Edit distance and alignment of strings
[Same table, with a cell marked "?".]

13 Edit distance and alignment of strings
[Same table. First row: 0; next cell marked "?".]

14 Edit distance and alignment of strings
[Same table. First row: 0 1; next cell marked "?".] Alignment shown:
  -
  C

15 Edit distance and alignment of strings
[Same table. First row: 0 1 2; next cell marked "?".] Alignment shown:
  --
  CT

16 Edit distance and alignment of strings
[Same table. First row: 0 1 2 3 4 5 6 7 8 …] Alignment shown:
  ------
  CTACTA

17 Edit distance and alignment of strings
[Same table. First row: 0 1 2 3 4 5 6 7 8 …; first column: A ?, C ?, T ?]

18 Edit distance and alignment of strings
[Same table. First row: 0 1 2 3 4 5 6 7 8 …; first column: A 1, C 2, T 3, G …] Alignment shown:
  ACT
  ---

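19 Edit distance and alignment of strings
[Table: columns labelled C T A C T A C T A C G T, rows labelled A C T G A. First row: 0 1 2 3 4 5 6 7 8 …; first column: A 1, C 2, T 3.]
The best alignment BA(AC,CTAC) is the best of three candidates, each extending a shorter best alignment by one column:
– BA(AC,CTA) followed by the column -/C
– BA(A,CTA) followed by the column C/C
– BA(A,CTAC) followed by the column C/-
Hence
d(AC,CTAC) = min { d(AC,CTA) + 1, d(A,CTA) + 0, d(A,CTAC) + 1 }
where the middle term adds 0 because the last characters match (it would add 1 for a mismatch).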
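The table-filling procedure built up over the previous slides can be sketched as follows: the first row and first column hold 0, 1, 2, …, and every other cell is the minimum over its three neighbours. The function name `edit_distance` and the variable names are illustrative choices, not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Fill the (len(a)+1) x (len(b)+1) table of prefix distances."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # prefix of a against the empty string: i gaps
    for j in range(1, m + 1):
        d[0][j] = j  # empty string against prefix of b: j gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i][j - 1] + 1,        # last column is -/b[j]
                d[i - 1][j - 1] + sub,  # last column is a[i]/b[j]
                d[i - 1][j] + 1,        # last column is a[i]/-
            )
    return d[n][m]


print(edit_distance("AC", "CTAC"))
print(edit_distance("ACTGA", "CTACTACTACGT"))
```

The cost is proportional to the number of cells, len(a) × len(b), in both time and space.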

20 Bioinformatics Pairwise alignment

21 Best alignment
How can an alignment be scored?
[Figure: an example alignment of two long DNA sequences, with * marking the matching columns.]
– Match: favorable
– Mismatch: unfavorable
– Gap: worst case
Then we assign a score for each case, for example 1, -1, -2.

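22 Pairwise alignment
Edit distance: match = 0, mismatch = 1, indel = 1
d(AC,CTAC) = minimum { d(A,CTAC) + 1, d(A,CTA) + 0, d(AC,CTA) + 1 }
Similarity: match = 1, mismatch = -1, indel = -2
s(AC,CTAC) = maximum { s(A,CTAC) - 2, s(A,CTA) + 1, s(AC,CTA) - 2 }
In both recurrences the middle term corresponds to the column C/C, a match; for a mismatch it would be + 1 in the distance and - 1 in the similarity.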
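The similarity recurrence differs from the distance one only in the scores and in taking a maximum instead of a minimum. A minimal sketch, assuming the example scores from this slide as defaults (the function and parameter names are illustrative):

```python
def global_similarity(a: str, b: str, match: int = 1,
                      mismatch: int = -1, indel: int = -2) -> int:
    """Best global alignment score, keeping only two rows of the table."""
    m = len(b)
    prev = [j * indel for j in range(m + 1)]  # empty prefix of a vs prefixes of b
    for i in range(1, len(a) + 1):
        cur = [i * indel] + [0] * m
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,  # a[i] aligned with b[j]
                         prev[j] + indel,    # a[i] aligned with a gap
                         cur[j - 1] + indel) # b[j] aligned with a gap
        prev = cur
    return prev[m]


print(global_similarity("AC", "CTAC"))
```

Keeping two rows is enough when only the score is wanted; recovering the alignment itself needs the whole table.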

23 Pairwise alignment Connect to alggen tool

24 Best alignment
[Figure: dynamic programming table for the sequences accaccacaccacaacgagcata … acctgagcgatat and acc..t]
Given the maximum score, how can the best alignment be found?
Quadratic cost in space and time.
Sequences of up to 10,000 bp in length.
Download alggen tool
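One common way to recover the alignment from the filled table is a traceback: start at the bottom-right cell and repeatedly step to whichever neighbour produced the current value. The sketch below assumes the similarity scores 1, -1, -2 used earlier; `best_alignment` is a hypothetical name, and when several alignments are optimal it returns just one of them.

```python
from typing import Tuple


def best_alignment(a: str, b: str, match: int = 1, mismatch: int = -1,
                   indel: int = -2) -> Tuple[int, str, str]:
    """Return (score, A*, B*) for one optimal global alignment."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * indel
    for j in range(1, m + 1):
        s[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + sub,
                          s[i - 1][j] + indel,
                          s[i][j - 1] + indel)

    # Traceback from the bottom-right corner to the top-left corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + indel:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return s[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = best_alignment("AC", "CTAC")
print(score)
print(top)
print(bottom)
```

The full table is kept in memory here, which is where the quadratic space cost mentioned on the slide comes from.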

25 Some preconceived ideas
We have developed the theory according to the following principles:
1) Both sequences have a similar length (global alignment).
2) The gap model is linear: k consecutive gaps are penalized with k·(-2).

26 Semiglobal pairwise alignment
Assume that we have two sequences S1 and S2 of different lengths. It is meaningless to introduce gaps until both sequences have similar length.
The most probable alignment should be
[Figure: S1 and S2 aligned, with the initial gaps and the final gaps labelled.]
How can these alignments be found?

27 Semiglobal pairwise alignment
[Table: columns labelled C T A C T A C T A C G T, rows labelled A C T.]
Note the initial gaps and the final gaps.

28 Semiglobal pairwise alignment
[Table: columns labelled C T A C T A C T A C G T, rows labelled A C T. First row: 0 0 0 0 0 0 0]
Given a cell in the first row, the cell contains the score of the best alignment of a prefix, for example CTA, with the empty sequence.

29 Semiglobal pairwise alignment
[Table: columns labelled C T A C T A C T A C G T, rows labelled A C T. First row: 0 0 0 0 0 0 0 …]
The contribution of the initial gaps is disregarded, then
[Same table. First row: 0 0 0 0 0 0 0 …; first column: A 1, C 2, T 3]
but what happens with the final gaps?

30 Semiglobal pairwise alignment
[Table: columns labelled C T A C T A C T A C G T, rows labelled A C T. First row: 0 0 0 0 0 0 0 …; first column: A 1, C 2, T 3]
… by checking the last row for the best score.
How does the algorithm search for the best alignment?
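A minimal sketch of the semiglobal variant, using the distance scores that the table on these slides suggests (first column 1, 2, 3): the first row is all zeros so initial gaps are free, and the answer is the best value in the last row so final gaps are free too. With distances, "best" means smallest. The name `semiglobal_distance` and its argument names are illustrative assumptions.

```python
def semiglobal_distance(short: str, long: str) -> int:
    """Smallest distance between `short` and any substring of `long`."""
    m = len(long)
    prev = [0] * (m + 1)          # first row: initial gaps cost nothing
    for i in range(1, len(short) + 1):
        cur = [i] + [0] * m       # first column: 1, 2, 3, ...
        for j in range(1, m + 1):
            sub = 0 if short[i - 1] == long[j - 1] else 1
            cur[j] = min(prev[j - 1] + sub,  # match or mismatch
                         prev[j] + 1,        # gap in `long`
                         cur[j - 1] + 1)     # gap in `short`
        prev = cur
    return min(prev)              # last row: final gaps cost nothing


print(semiglobal_distance("ACT", "CTACTACTACGT"))
print(semiglobal_distance("AGT", "CTACTACTACGT"))
```

To recover where the short sequence lands, one would remember the column of the minimum in the last row and trace back from there until the first row is reached.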

31 Affine-gap model score
Given the following alignments that have the same score …

  agtaccccgtag   agtaccccgtag   agtaccccgtag
  agt-cc--gta-   agt-c-c-gta-   agt-c--cgta-

  agtaccccgtag   agtaccccgtag   agtaccccgtag
  agt--cc-gta-   agt--c-cgta-   agt---ccgta-

Which is the most reliable case from a biological point of view?

32 Affine-gap model score
Then, how can we distinguish between consecutive gaps and separated gaps?

  agtaccccgtag   agtaccccgtag
  agt--c-cgta-   agt---ccgta-

By penalizing the opening of a gap more heavily than its extension, for instance with -10 and -0.5. Then the penalty of k consecutive gaps becomes OG + (k-1) EG, which is an affine-gap function.
How is the best alignment found?
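To see how the affine penalty OG + (k-1) EG separates the two alignments on this slide, the sketch below adds up the penalty of every run of consecutive gaps in a row. It uses the example values OG = -10 and EG = -0.5; the function names are made up for illustration.

```python
import re


def gap_penalty(k: int, opening: float = -10.0, extension: float = -0.5) -> float:
    """Penalty of k consecutive gaps: OG + (k - 1) * EG, or 0 when k is 0."""
    return opening + (k - 1) * extension if k > 0 else 0.0


def total_gap_penalty(row: str, opening: float = -10.0,
                      extension: float = -0.5) -> float:
    """Sum the affine penalty over every maximal run of '-' in an aligned row."""
    return sum(gap_penalty(len(run), opening, extension)
               for run in re.findall(r"-+", row))


print(total_gap_penalty("agt--c-cgta-"))  # runs of length 2, 1 and 1
print(total_gap_penalty("agt---ccgta-"))  # runs of length 3 and 1
```

Under a linear model both rows would pay the same for their four gaps; with the affine model the row with fewer, longer runs is penalized less.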

33 Affine-gap model score
[Table: columns labelled C T A C T A C T A C G T, rows labelled A C T G A, with arrows between cells.]
Short arrows refer to the introduction of an opening gap.
Long arrows refer to the introduction of an extension gap.
But from which cell do the long arrows originate?
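One well-known way to handle affine gaps, which the slides do not name, is Gotoh's formulation: instead of looking back along a whole row or column for the start of a gap, it keeps three tables, one per kind of last column. The sketch below follows that idea; the table names `M`, `X`, `Y`, the function name, and the match/mismatch scores 1 and -1 are assumptions for illustration, while -10 and -0.5 are the example gap values from the previous slide.

```python
NEG_INF = float("-inf")


def affine_global_score(a: str, b: str, match: float = 1.0, mismatch: float = -1.0,
                        opening: float = -10.0, extension: float = -0.5) -> float:
    """Best global score when k consecutive gaps cost opening + (k-1)*extension."""
    n, m = len(a), len(b)
    # M: last column pairs two characters; X: a[i] over a gap; Y: a gap over b[j].
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = opening + (i - 1) * extension
    for j in range(1, m + 1):
        Y[0][j] = opening + (j - 1) * extension
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            X[i][j] = max(M[i - 1][j] + opening,    # open a new gap
                          Y[i - 1][j] + opening,
                          X[i - 1][j] + extension)  # extend the current gap
            Y[i][j] = max(M[i][j - 1] + opening,
                          X[i][j - 1] + opening,
                          Y[i][j - 1] + extension)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_global_score("ACGT", "AT"))
```

In this formulation an extension step always comes from the adjacent cell of the same gap table, so the cost stays quadratic.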

34 Local alignment
Given two sequences, we can consider the alignments of all their substrings…
…how can the best of them be found?
Two questions arise:
- how can the alignments be compared?
- how can the best one be selected?
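A standard answer to both questions, not named on the slide, is the Smith-Waterman recurrence: alignments are compared by similarity score, every cell is allowed to restart from 0 (an empty alignment), and the best local alignment ends at the cell with the largest value anywhere in the table. A minimal sketch, assuming the scores 1, -1, -2 from the earlier slides and a hypothetical function name:

```python
def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, indel: int = -2) -> int:
    """Best similarity score over all pairs of substrings of a and b."""
    m = len(b)
    best = 0
    prev = [0] * (m + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,                   # start a fresh alignment here
                         prev[j - 1] + sub,
                         prev[j] + indel,
                         cur[j - 1] + indel)
            best = max(best, cur[j])
        prev = cur
    return best


print(local_alignment_score("TTTACGTTT", "GGACGGG"))
```

A similarity score is needed here: with a pure distance, the empty alignment (distance 0) would always win.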

35 Bioinformatics Multiple alignment

36 Pairwise to multiple alignment
What happens with three strings S1, S2, S3? Let n be their length; then the cost becomes
O(n^3) · "O(2^3)" · "O(3^2)"
And with k strings?
O(n^k · 2^k · k^2)

37 Multiple alignment
Multiple alignment programs use different heuristics:
– Clustal (progressive alignment): http://www.ebi.ac.uk/clustalw
– TCoffee (progressive alignment + databases): http://igs-server.cnrs-mrs.fr/Tcoffee_cgi/index.cgi
– HMM (Hidden Markov Models)

38 Multiple alignment Connect to alggen tool

39 Advanced Data Structure: Bioinformatics
First week: Algorithms for exact string matching.
Second week: Alignment of sequences.
Third week: Dealing with long sequences.

