Bioinformatics PhD. Course Summary (approximate)


1 Bioinformatics PhD. Course Summary (approximate)
1. Biological introduction
2. Comparison of short sequences (<10,000 bps)
3. Comparison of large sequences (up to 250,000,000)
4. Sequence assembly
5. Efficient data search structures and algorithms
6. Proteins...

2 2. Comparison of short sequences (<10,000 bps)
Summary (more or less)
2.1 Dot matrix
2.2 Pairwise alignment
2.3 Hash algorithms
2.4 Multiple alignment

3 2.2 Pairwise alignment
Given two DNA sequences $A = a_1 a_2 \dots a_n$ and $B = b_1 b_2 \dots b_m$ over the alphabet {a,c,t,g}, we say that $A^*$ and $B^*$ over {a,c,t,g,-} are aligned iff
i) $A^*$ and $B^*$ become $A$ and $B$ if the gaps (-) are removed.
ii) $|A^*| = |B^*|$
iii) For all $i$, it is not possible that $a^*_i = b^*_i = -$.
Which is the best alignment? How many alignments of two sequences exist?
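The three conditions of the definition can be checked mechanically. The following is a minimal sketch, not part of the slides; the function name `is_alignment` and its argument order are assumptions made for illustration.

```python
def is_alignment(a_star: str, b_star: str, a: str, b: str) -> bool:
    """Check conditions i) to iii) of the definition (illustrative helper)."""
    # i) removing the gaps must give back the original sequences
    if a_star.replace("-", "") != a or b_star.replace("-", "") != b:
        return False
    # ii) both gapped sequences must have the same length
    if len(a_star) != len(b_star):
        return False
    # iii) no position may hold a gap in both sequences
    return all(not (x == "-" and y == "-") for x, y in zip(a_star, b_star))


if __name__ == "__main__":
    print(is_alignment("--ac", "ctac", "ac", "ctac"))  # valid alignment
    print(is_alignment("a-c", "c-t", "ac", "ct"))      # gap facing gap
```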

4 2.2 Number of alignments
Given two DNA sequences $A = a_1 a_2 \dots a_n$ and $B = b_1 b_2 \dots b_m$ there are:
$$\begin{aligned}
\#(a_1 \dots a_n,\ b_1 \dots b_m) ={}& \#(a_1 \dots a_{n-1},\ b_1 \dots b_m) && \text{those that end with } (a_n, -) \\
{}+{}& \#(a_1 \dots a_n,\ b_1 \dots b_{m-1}) && \text{those that end with } (-, b_m) \\
{}+{}& \#(a_1 \dots a_{n-1},\ b_1 \dots b_{m-1}) && \text{those that end with } (a_n, b_m)
\end{aligned}$$

              b1          b2   b3

    a1        #(a1,b1)
    a2
    a3

5 2.2 Number of alignments
Same recurrence. The first row and the first column of the table are filled with 1:

              b1   b2   b3
          1    1    1    1
    a1    1
    a2    1
    a3    1

6 2.2 Number of alignments
Same recurrence. First inner cell:

              b1   b2   b3
          1    1    1    1
    a1    1    3    ?
    a2    1
    a3    1

7 2.2 Number of alignments
Same recurrence. Second row and second column:

              b1   b2   b3
          1    1    1    1
    a1    1    3    5    7
    a2    1    5    ?
    a3    1    7

8 2.2 Number of alignments
Same recurrence. Complete table:

              b1   b2   b3
          1    1    1    1
    a1    1    3    5    7
    a2    1    5   13   25
    a3    1    7   25   63

But what is the asymptotic value?
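The table on these slides can be reproduced directly from the recurrence. The sketch below is an illustration only; the name `count_alignments` is an assumption, and the table is indexed by prefix lengths, with row 0 and column 0 standing for the empty prefix.

```python
def count_alignments(n: int, m: int) -> int:
    """Number of alignments of a sequence of length n with one of length m."""
    # table[i][j] = number of alignments of the first i symbols of A
    # with the first j symbols of B; row 0 and column 0 are all 1.
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = (
                table[i - 1][j]        # alignments ending with (a_i, -)
                + table[i][j - 1]      # alignments ending with (-, b_j)
                + table[i - 1][j - 1]  # alignments ending with (a_i, b_j)
            )
    return table[n][m]


if __name__ == "__main__":
    # Prints the 4 x 4 table from the slides, one row per line.
    for i in range(4):
        print([count_alignments(i, j) for j in range(4)])
```

Filling the whole table costs a number of additions proportional to n times m, even though the counts themselves grow very quickly.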

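9 2.2 Asymptotic value
The number of alignments satisfies
$$\#(a_1 a_2 \dots a_n,\ b_1 b_2 \dots b_m) > \sum_{k=0}^{\min(n,m)} \binom{m}{k}\binom{n}{k}.$$
As
$$\sum_{k=0}^{\min(n,m)} \binom{m}{k}\binom{n}{k} = \binom{n+m}{n}$$
and $n! \sim n^n e^{-n}$ (Stirling approximation, leading terms only), then for $m = n$
$$\#(a_1 a_2 \dots a_n,\ b_1 b_2 \dots b_n) > \binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{(2n)^{2n} e^{-2n}}{n^{2n} e^{-2n}} = 2^{2n},$$
so the number of alignments grows exponentially, at least on the order of $2^{2n}$.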
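A quick numerical check of the bound, as a sketch rather than a proof: the helper name `binomial_sum` is an assumption, and `math.comb` requires Python 3.8 or later. The full Stirling formula gives $\binom{2n}{n} \sim 2^{2n}/\sqrt{\pi n}$, so the printed ratio is expected to shrink slowly while the growth stays exponential.

```python
from math import comb


def binomial_sum(n: int, m: int) -> int:
    """Sum over k of C(m, k) * C(n, k), the lower bound on the slide."""
    return sum(comb(m, k) * comb(n, k) for k in range(min(n, m) + 1))


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20):
        lower = binomial_sum(n, n)
        # The sum collapses to a single binomial coefficient.
        assert lower == comb(2 * n, n)
        # Compare the lower bound with 2^(2n).
        print(n, lower, 4 ** n, lower / 4 ** n)
```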

10 2.2 Best alignment
How can an alignment be scored?
catcactactgacgactatcgtagcgcggctatacatctacgccaa- ctac-t-gtgtagatcgccgg
c- tgactgc--acgactatcgt- attgcggctacacactacgcacaactactgtatgtcgc-cgg----
* * *** * * ** * ******* * * **** **** ******* * **** ** * ***
Gap: worst case
Mismatch: unfavorable
Match: favorable
Then we assign a score to each case, for example match = 1, mismatch = -1, gap = -2.
How can the best alignment be found?
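With the example scores from this slide (match 1, mismatch -1, gap -2), scoring a given alignment is a column-by-column sum. The following is a minimal sketch; the function name `score_alignment` and the constant names are assumptions for illustration.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # example scores from the slide


def score_alignment(a_star: str, b_star: str) -> int:
    """Sum the column scores of two gapped sequences of equal length."""
    if len(a_star) != len(b_star):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a_star, b_star):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            total += GAP        # gap: worst case
        elif x == y:
            total += MATCH      # match: favorable
        else:
            total += MISMATCH   # mismatch: unfavorable
    return total


if __name__ == "__main__":
    # Two gap columns and two match columns: -2 - 2 + 1 + 1
    print(score_alignment("--AC", "CTAC"))  # -> -2
```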

11 2.2 Edit distance and alignment of strings
The best alignment of two strings is related to the edit distance, first discussed in 1966. The most efficient algorithm was proposed in 1968 and in 1970, using the technique called "dynamic programming".

12 2.2 Best alignment

               C   T   A   C   T   A   C   T   A   C   G   T

       A
       C
       T
       G
       A

13 2.2 Best alignment

               C   T   A   C   T   A   C   T   A   C   G   T

       A
       C
       T
       G
       A

14 2.2 Best alignment

               C   T   A   C   T   A   C   T   A   C   G   T

       A
       C                       [ ]
       T
       G
       A

The marked cell contains the score of the best alignment of AC and CTACT.

15 2.2 Best alignment

               C   T   A   C   T   A   C   T   A   C   G   T
           0   ?
       A
       C
       T
       G
       A

16 2.2 Best alignment

               C   T   A   C   T   A   C   T   A   C   G   T
           0  -2   ?
       A
       C
       T
       G
       A

Alignment:
    -
    C

17 2.2 Best alignment

               C   T   A   C   T   A   C   T   A   C   G   T
           0  -2  -4   ?
       A
       C
       T
       G
       A

Alignment:
    --
    CT

18 2.2 Best alignment

               C   T   A   C   T   A   C   T   A   C   G   T
           0  -2  -4  -6  -8   …
       A
       C
       T
       G
       A

Alignment:
    ------
    CTACTA

19 2.2 Best alignment

               C   T   A   C   T   A   C   T   A   C   G   T
           0  -2  -4  -6  -8   …
       A   ?
       C   ?
       T   ?
       G
       A

20 2.2 Best alignment

               C   T   A   C   T   A   C   T   A   C   G   T
           0  -2  -4  -6  -8   …
       A  -2
       C  -4
       T  -6
       G   …
       A

Alignment:
    ACT
    ---

21 2.2 Best alignment

               C   T   A   C   T   A   C   T   A   C   G   T
           0  -2  -4  -6  -8   …
       A  -2
       C  -4
       T  -6
       G
       A

BA(AC,CTAC) = best of:
    BA(AC,CTA) followed by the column (-, C)
    BA(A,CTA) followed by the column (C, C)
    BA(A,CTAC) followed by the column (C, -)

$$s(AC, CTAC) = \max \begin{cases} s(AC, CTA) - 2 \\ s(A, CTA) + 1 \\ s(A, CTAC) - 2 \end{cases}$$
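The recurrence on this slide, applied to every cell, fills the whole score table. Below is a minimal sketch assuming the example scores 1, -1, -2 from slide 10; the function name `score_matrix` and its keyword parameters are illustrative choices, not taken from the slides.

```python
def score_matrix(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Fill the table s[i][j] = best score of aligning a[:i] with b[:j]."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: a prefix aligned against gaps only.
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(
                s[i][j - 1] + gap,       # last column is (-, b_j)
                s[i - 1][j - 1] + pair,  # last column is (a_i, b_j)
                s[i - 1][j] + gap,       # last column is (a_i, -)
            )
    return s


if __name__ == "__main__":
    s = score_matrix("ACTGA", "CTACTACTACGT")
    print(s[0][:5])  # first row of the slides: 0, -2, -4, -6, -8
    print(s[2][4])   # s(AC, CTAC)
```

Rows correspond to prefixes of ACTGA and columns to prefixes of CTACTACTACGT, matching the layout of the grid on the slides.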

22 Best alignment
accaccacaccacaacgagcata … acctgagcgatat
acc..t
acc..t
Given the maximum score, how can the best alignment be found?
Quadratic cost in space and time.
Sequences up to 10,000 bps in length.
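One common answer to the question on this slide is a traceback: start at the bottom-right cell and repeatedly step to a neighbouring cell that explains the current score. The sketch below assumes the example scores 1, -1, -2; the name `global_align` and the tie-breaking order (diagonal first) are illustrative assumptions, and other equally good alignments may exist.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (best score, gapped a, gapped b) for a global alignment."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + pair, s[i - 1][j] + gap, s[i][j - 1] + gap)

    # Traceback from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if s[i][j] == s[i - 1][j - 1] + pair:   # column (a_i, b_j)
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and s[i][j] == s[i - 1][j] + gap:  # column (a_i, -)
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:                                       # column (-, b_j)
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, a_star, b_star = global_align("ACTGA", "CTACTACTACGT")
    print(score)
    print(a_star)
    print(b_star)
```

The table holds (n+1) times (m+1) cells, which is the quadratic cost in space and time mentioned on the slide; the traceback itself takes at most n + m steps.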

23 2.2 Best alignment Connect to http://alggen.lsi.upc.es/docencia/ember/lepa/Tfc1.htm and use the global method.

