
Search algorithms: the definition of the problem (text, pattern) depends on what we know at the outset


1 Search algorithms
Search algorithms: the definition of the problem (text, pattern) depends on what we know at the outset.
Exact search:
1 pattern ---> the algorithm depends on the length and |Σ|
k patterns ---> the algorithm depends on the number k, the length and |Σ|
Only the text ---> structure the text (suffix tree)
Only the pattern(s) ---> structure the pattern(s)
Approximate search: depends on the pattern length l
Myers algorithm
l > w (word length) ---> dynamic programming
Extensions:
Regular expressions
Probabilistic search

2 2.2 Pairwise alignment
Given two DNA sequences A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m over the alphabet {a,c,t,g}, we say that A* and B* over {a,c,t,g,-} are aligned iff
i) A* and B* become A and B when the gaps (-) are removed;
ii) |A*| = |B*|;
iii) there is no position i at which a*_i = b*_i = -.
Which is the best alignment? How many alignments of two sequences exist?
MALIG (an example)
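The three conditions can be checked mechanically. The following is a minimal sketch; the function name `is_alignment` and the use of Python strings for sequences are assumptions made for illustration, not part of the slides.

```python
GAP = "-"

def is_alignment(a, b, a_star, b_star):
    """Return True if (a_star, b_star) satisfies conditions i)-iii) for a and b."""
    # i) removing the gaps must give back the original sequences
    gaps_removed = a_star.replace(GAP, "") == a and b_star.replace(GAP, "") == b
    # ii) both gapped sequences must have the same length
    same_length = len(a_star) == len(b_star)
    # iii) no column may contain a gap in both rows
    no_double_gap = all(not (x == GAP and y == GAP) for x, y in zip(a_star, b_star))
    return gaps_removed and same_length and no_double_gap

print(is_alignment("act", "at", "act", "a-t"))    # valid alignment
print(is_alignment("act", "at", "act-", "a-t-"))  # last column is gap against gap
```

The second call fails condition iii): a column of two gaps carries no information and would make the number of alignments infinite.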

3 2.2 Number of alignments
Given two DNA sequences A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m there are:
\[
\begin{aligned}
\#(a_1 \dots a_n,\ b_1 \dots b_m) ={}& \#(a_1 \dots a_{n-1},\ b_1 \dots b_m) && \text{those that end with } (a_n, -) \\
{}+{}& \#(a_1 \dots a_n,\ b_1 \dots b_{m-1}) && \text{those that end with } (-, b_m) \\
{}+{}& \#(a_1 \dots a_{n-1},\ b_1 \dots b_{m-1}) && \text{those that end with } (a_n, b_m)
\end{aligned}
\]
       b_1          b_2   b_3
a_1    #(a_1,b_1)
a_2
a_3

4 2.2 Number of alignments
Given two DNA sequences A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m there are:
\[
\begin{aligned}
\#(a_1 \dots a_n,\ b_1 \dots b_m) ={}& \#(a_1 \dots a_{n-1},\ b_1 \dots b_m) && \text{those that end with } (a_n, -) \\
{}+{}& \#(a_1 \dots a_n,\ b_1 \dots b_{m-1}) && \text{those that end with } (-, b_m) \\
{}+{}& \#(a_1 \dots a_{n-1},\ b_1 \dots b_{m-1}) && \text{those that end with } (a_n, b_m)
\end{aligned}
\]
            b_1   b_2   b_3
       1     1     1     1
a_1    1
a_2    1
a_3    1

5 2.2 Number of alignments
Given two DNA sequences A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m there are:
\[
\begin{aligned}
\#(a_1 \dots a_n,\ b_1 \dots b_m) ={}& \#(a_1 \dots a_{n-1},\ b_1 \dots b_m) && \text{those that end with } (a_n, -) \\
{}+{}& \#(a_1 \dots a_n,\ b_1 \dots b_{m-1}) && \text{those that end with } (-, b_m) \\
{}+{}& \#(a_1 \dots a_{n-1},\ b_1 \dots b_{m-1}) && \text{those that end with } (a_n, b_m)
\end{aligned}
\]
            b_1   b_2   b_3
       1     1     1     1
a_1    1     3     ?
a_2    1
a_3    1

6 2.2 Number of alignments
Given two DNA sequences A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m there are:
\[
\begin{aligned}
\#(a_1 \dots a_n,\ b_1 \dots b_m) ={}& \#(a_1 \dots a_{n-1},\ b_1 \dots b_m) && \text{those that end with } (a_n, -) \\
{}+{}& \#(a_1 \dots a_n,\ b_1 \dots b_{m-1}) && \text{those that end with } (-, b_m) \\
{}+{}& \#(a_1 \dots a_{n-1},\ b_1 \dots b_{m-1}) && \text{those that end with } (a_n, b_m)
\end{aligned}
\]
            b_1   b_2   b_3
       1     1     1     1
a_1    1     3     5     7
a_2    1     5     ?
a_3    1     7

7 2.2 Number of alignments
Given two DNA sequences A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m then:
\[
\begin{aligned}
\#(a_1 \dots a_n,\ b_1 \dots b_m) ={}& \#(a_1 \dots a_{n-1},\ b_1 \dots b_m) && \text{those that end with } (a_n, -) \\
{}+{}& \#(a_1 \dots a_n,\ b_1 \dots b_{m-1}) && \text{those that end with } (-, b_m) \\
{}+{}& \#(a_1 \dots a_{n-1},\ b_1 \dots b_{m-1}) && \text{those that end with } (a_n, b_m)
\end{aligned}
\]
            b_1   b_2   b_3
       1     1     1     1
a_1    1     3     5     7
a_2    1     5    13    25
a_3    1     7    25    63
But what is the asymptotic value?
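The recurrence and the table above can be reproduced with a few lines of code. This is a minimal sketch; the function name `count_alignments` is an assumption, and the base case (one alignment when either sequence is empty) is read off the border of 1s in the table.

```python
def count_alignments(n, m):
    """Number of alignments of a sequence of length n with one of length m."""
    # Border cells: aligning against an empty sequence can be done in one way.
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = (table[i - 1][j]        # last column is (a_i, -)
                           + table[i][j - 1]      # last column is (-, b_j)
                           + table[i - 1][j - 1]) # last column is (a_i, b_j)
    return table[n][m]

for n in range(1, 4):
    print([count_alignments(n, m) for m in range(1, 4)])
```

The three printed rows match the inner cells of the table: 3 5 7, 5 13 25 and 7 25 63.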

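8 2.2 Asymptotic value
\[
\#(a_1 a_2 \dots a_n,\ b_1 b_2 \dots b_n) > \sum_{k=0}^{n} \binom{n}{k}\binom{n}{k} = \binom{2n}{n}
\]
As \binom{2n}{n} = (2n)!/(n!)^2 and n! \sim n^n e^{-n} (Stirling approximation, leading factors only), then \binom{2n}{n} \approx 2^{2n}, so the number of alignments grows at least roughly as 2^{2n}. (The full Stirling formula gives \binom{2n}{n} \sim 4^n / \sqrt{\pi n}, which is the same exponential growth.)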

9 2.2 Best alignment
How can an alignment be scored?
catcactactgacgactatcgtagcgcggctatacatctacgccaa- ctac-t-gtgtagatcgccgg
c- tgactgc--acgactatcgt- attgcggctacacactacgcacaactactgtatgtcgc-cgg----
* * *** * * ** * ******* * * **** **** ******* * **** ** * ***
How can the best alignment be found?
Match: favorable
Mismatch: unfavorable
Gap: worst case
We then assign a score to each case, for example 1, -1 and -2 respectively.
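Scoring a given alignment under this scheme is a column-by-column sum. The sketch below assumes the example values 1, -1 and -2 for match, mismatch and gap; the function name `score_alignment` is hypothetical.

```python
def score_alignment(a_star, b_star, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of an alignment given as two gapped strings."""
    if len(a_star) != len(b_star):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a_star, b_star):
        if x == "-" or y == "-":
            total += gap        # gap: worst case
        elif x == y:
            total += match      # match: favorable
        else:
            total += mismatch   # mismatch: unfavorable
    return total

print(score_alignment("act", "a-t"))    # two matches and one gap
print(score_alignment("acgt", "aggt"))  # three matches and one mismatch
```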

10 2.2 Edit distance and alignment of strings
The best alignment of two strings is related to the edit distance, first discussed in 1966. The most efficient algorithm was proposed in 1968 and in 1970, using the technique called "dynamic programming".

11 2.2 Best alignment
[Figure: alignment matrix; columns C T A C T A C T A C G T, rows A C T G A]

12 2.2 Best alignment
[Figure: alignment matrix; columns C T A C T A C T A C G T, rows A C T G A]

13 2.2 Best alignment
[Figure: alignment matrix; columns C T A C T A C T A C G T, rows A C T G A, with one cell highlighted]
The cell contains the score of the best alignment of AC and CTACT.

14 2.2 Best alignment
         C   T   A   C   T   A   C   T   A   C   G   T
     0  -2  -4  -6  -8   …
A   -2
C   -4
T   -6
G
A
BA(AC,CTAC) is the best of:
BA(AC,CTA) followed by the column (-, C)
BA(A,CTA) followed by the column (C, C)
BA(A,CTAC) followed by the column (C, -)
\[
s(AC, CTAC) = \max \{\, s(AC, CTA) - 2,\ \ s(A, CTA) + 1,\ \ s(A, CTAC) - 2 \,\}
\]
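The cell-by-cell rule can be written as a short table-filling routine. This is a sketch under the example scores 1, -1 and -2; the function name `global_score_matrix` is an assumption, and rows are indexed by prefixes of the first sequence and columns by prefixes of the second.

```python
def global_score_matrix(a, b, match=1, mismatch=-1, gap=-2):
    """Fill s[i][j] = score of the best alignment of a[:i] with b[:j]."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap   # a[:i] against the empty sequence: i gaps
    for j in range(1, m + 1):
        s[0][j] = j * gap   # empty sequence against b[:j]: j gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            s[i][j] = max(diagonal,           # column (a_i, b_j)
                          s[i - 1][j] + gap,  # column (a_i, -)
                          s[i][j - 1] + gap)  # column (-, b_j)
    return s

matrix = global_score_matrix("AC", "CTAC")
print(matrix[0])     # first row: 0, -2, -4, -6, -8
print(matrix[2][4])  # s(AC, CTAC)
```

For AC against CTAC the maximum comes from the diagonal term s(A,CTA)+1, where s(A,CTA) = -3.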

15 Best alignment
[Figure: score matrix for the sequences accaccacaccacaacgagcata … acctgagcgatat and acc..t]
Given the maximum score, how can the best alignment be found?
Quadratic cost in space and time.
Sequences of up to 10,000 bp in length.
Download the alggen tool.
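One common way to recover an alignment from the maximum score is to walk back from the bottom-right cell, at each step choosing a neighbor that explains the current value. The sketch below assumes the example scores 1, -1 and -2; the function name `global_align` and the tie-breaking order (diagonal, then up, then left) are choices made here for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, a_star, b_star) for one best global alignment of a and b."""
    n, m = len(a), len(b)

    def sub(x, y):
        return match if x == y else mismatch

    # Fill the score matrix: quadratic in time and space.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                          s[i - 1][j] + gap,
                          s[i][j - 1] + gap)

    # Traceback: rebuild the columns from last to first.
    a_star, b_star = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            a_star.append(a[i - 1]); b_star.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and s[i][j] == s[i - 1][j] + gap:
            a_star.append(a[i - 1]); b_star.append("-")
            i -= 1
        else:
            a_star.append("-"); b_star.append(b[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(a_star)), "".join(reversed(b_star))

print(global_align("AC", "CTAC"))
```

When several alignments reach the same score, a different tie-breaking order returns a different but equally good alignment.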

16 2.2 Some slides revisited
We have developed the theory according to the following principles:
1) Both sequences have a similar length (global alignment).
2) The gap model is linear: k consecutive gaps incur a penalty of k(-2).

17 2.2 Semiglobal pairwise alignment
Assume that we have two sequences of different lengths, S1 and S2.
It is meaningless to introduce gaps until both sequences have a similar length …
The most probable alignment should be
[Figure: alignment of S1 and S2 with the regions labelled "Initial gaps" and "Final gaps"]
How can these alignments be found?

18 2.2 Semiglobal pairwise alignment
[Figure: alignment matrix; columns C T A C T A C T A C G T, rows A C T; labels "Initial gaps", "Note that", "Final gaps"]

19 2.2 Semiglobal pairwise alignment
[Matrix: columns C T A C T A C T A C G T, rows A C T; first row 0 0 0 0 0 0 0]
Given a cell: the cell contains the score of the best alignment of CTA with the empty sequence.

20 2.2 Semiglobal pairwise alignment
[Matrix: columns C T A C T A C T A C G T, rows A C T; first row 0 0 0 0 0 0 0 …]
The contribution of the initial gaps is disregarded, then
[Matrix: columns C T A C T A C T A C G T; first row 0 0 0 0 0 0 0 …; rows A, C, T with the values 1, 2, 3]
but what happens with the final gaps?

21 2.2 Semiglobal pairwise alignment
[Matrix: columns C T A C T A C T A C G T; first row 0 0 0 0 0 0 0 …; rows A, C, T with the values 1, 2, 3]
How does the algorithm search for the best alignment?
… by checking the last row for the best score.
Practice with the alggen tool.
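The two changes with respect to the global case (a first row of zeros, and reading the result from the last row) can be sketched as follows. The function name `semiglobal_score`, the argument names and the example scores 1, -1 and -2 are assumptions; the short sequence indexes the rows, as in the slides.

```python
def semiglobal_score(short, long_seq, match=1, mismatch=-1, gap=-2):
    """Best score of `short` against any stretch of `long_seq`.

    Returns (score, end) where `end` is the column of the last row holding
    the best score, i.e. where the aligned stretch of long_seq ends.
    """
    n, m = len(short), len(long_seq)
    # First row stays at 0: initial gaps are not penalized.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap   # the short sequence itself must be fully aligned
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (match if short[i - 1] == long_seq[j - 1] else mismatch)
            s[i][j] = max(diagonal, s[i - 1][j] + gap, s[i][j - 1] + gap)
    # Final gaps are not penalized: take the best cell of the last row.
    end = max(range(m + 1), key=lambda j: s[n][j])
    return s[n][end], end

print(semiglobal_score("ACT", "CTACTACTACGT"))
```

For ACT against CTACTACTACGT the best last-row value is 3, first reached at column 5, which corresponds to the occurrence of ACT at positions 3 to 5 of the long sequence.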

22 2.2 Affine-gap model score
Given the following alignments that have the same score …
a g t a c c c c g t a g
a g t - c c - - g t a -

a g t a c c c c g t a g
a g t - c - c - g t a -

a g t a c c c c g t a g
a g t - c - - c g t a -

a g t a c c c c g t a g
a g t - - c c - g t a -

a g t a c c c c g t a g
a g t - - c - c g t a -

a g t a c c c c g t a g
a g t - - - c c g t a -
Which case is the most plausible from a biological point of view?

23 2.2 Affine-gap model score
Then, how can we distinguish between consecutive gaps and separated gaps?
a g t a c c c c g t a g
a g t - - c - c g t a -

a g t a c c c c g t a g
a g t - - - c c g t a -
By penalizing the opening of a gap (OG) more heavily than its extension (EG), for instance -10 and -0.5. The penalty for k consecutive gaps then becomes OG + (k-1) EG, which is an affine function of k.
How is the best alignment found?
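The effect of the affine penalty on the two alignments above can be checked directly. The sketch below uses OG = -10 and EG = -0.5 from the slide; the match and mismatch scores of 1 and -1 are carried over from the earlier example and are an assumption here, as is the function name `affine_alignment_score`.

```python
def affine_alignment_score(a_star, b_star, match=1, mismatch=-1,
                           gap_open=-10, gap_extend=-0.5):
    """Score an alignment where a run of k gaps costs gap_open + (k-1)*gap_extend."""
    total = 0.0
    previous = None  # row that held a gap in the previous column: "a", "b" or None
    for x, y in zip(a_star, b_star):
        if x == "-" or y == "-":
            current = "a" if x == "-" else "b"
            # Continuing a run in the same row is an extension, otherwise an opening.
            total += gap_extend if current == previous else gap_open
            previous = current
        else:
            total += match if x == y else mismatch
            previous = None
    return total

separated = affine_alignment_score("agtaccccgtag", "agt--c-cgta-")
consecutive = affine_alignment_score("agtaccccgtag", "agt---ccgta-")
print(separated, consecutive)
```

Both alignments have eight matches and four gap columns, so they tie under the linear model, but the one with fewer gap openings scores higher under the affine model.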

24 2.2 Affine-gap model score
[Figure: alignment matrix with arrows; columns C T A C T A C T A C G T, rows A C T G A]
Smallest arrows: refer to the introduction of an opening gap.
Largest arrows: refer to the introduction of an extension gap.
But from which cell do the largest arrows originate?

25 2.2 Affine-gap model score
[Figure: alignment matrix with arrows; columns C T A C T A C T A C G T, rows A C T G A]
In both cases we know which cell contributes the minimum penalty score.
Access to ClustalW: http://www.ebi.ac.uk/clustalw
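One standard way to keep track of where an extension comes from is to store three values per cell, one for each kind of last column. The slides do not name a specific method, so the three-matrix formulation below (commonly attributed to Gotoh) is a swapped-in technique, and the function name `affine_global_score` and the match and mismatch scores of 1 and -1 are assumptions.

```python
NEG_INF = float("-inf")

def affine_global_score(a, b, match=1, mismatch=-1, gap_open=-10, gap_extend=-0.5):
    """Best global alignment score when k consecutive gaps cost OG + (k-1)*EG."""
    n, m = len(a), len(b)
    # M: last column is (a_i, b_j); X: last column is (a_i, -); Y: last column is (-, b_j)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # A gap column either extends a run in the same row or opens a new one.
            X[i][j] = max(X[i - 1][j] + gap_extend,
                          M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(Y[i][j - 1] + gap_extend,
                          M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])

print(affine_global_score("aaccctt", "aatt"))
```

In the example the best alignment is aa---tt under aaccctt: four matches and one run of three gaps, 4 + (-10 - 0.5 - 0.5).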

26 2.2 Local alignment Given two sequences, we can consider the alignments of all their substrings… …how can the best of them be found? Two questions arise: - how can the alignments be compared? - how can the best one be selected?

27 2.2 Local alignment
Given a path, imagine the graph of the scores along it: can the best subalignments be detected?
[Figure: score matrix for the sequences accaccacaccacaacgagcata … acctgagcgatat and acc..t]
… It suffices to compare the value of each cell with zero!
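Comparing each cell with zero amounts to restarting the alignment whenever the running score would turn negative; this is the idea behind the Smith-Waterman local alignment algorithm, which the slides do not name. A minimal sketch, assuming the example scores 1, -1 and -2 and the hypothetical function name `local_align`:

```python
def local_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, a_part, b_part) for one best local alignment of a and b."""
    n, m = len(a), len(b)

    def sub(x, y):
        return match if x == y else mismatch

    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The extra 0 lets a new alignment start at any cell.
            s[i][j] = max(0,
                          s[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                          s[i - 1][j] + gap,
                          s[i][j - 1] + gap)
            if s[i][j] > best:
                best, best_i, best_j = s[i][j], i, j

    # Trace back from the best cell until a cell with score 0 is reached.
    a_part, b_part = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and s[i][j] > 0:
        if s[i][j] == s[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            a_part.append(a[i - 1]); b_part.append(b[j - 1])
            i, j = i - 1, j - 1
        elif s[i][j] == s[i - 1][j] + gap:
            a_part.append(a[i - 1]); b_part.append("-")
            i -= 1
        else:
            a_part.append("-"); b_part.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(a_part)), "".join(reversed(b_part))

print(local_align("ttacgtt", "ccacgcc"))
```

In the example the only stretch the two sequences share is acg, so the best local alignment has score 3 and ignores the flanking mismatches.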

