Chap. 4 FRAGMENT ASSEMBLY OF DNA Introduction to Computational Molecular Biology Chapter 4.


1 Chap. 4 FRAGMENT ASSEMBLY OF DNA Introduction to Computational Molecular Biology Chapter 4

2 4.1 Biological Background The ideal case
The target is known to be approximately 10 bases long. The consensus sequence TTACCGTGC has 9 bases, which is close enough.
The four sequences: ACCGT, CGTGC, TTAC, TACCGT
Fragment assembly:
    --ACCGT--
    ----CGTGC
    TTAC-----
    -TACCGT--
    ---------
    TTACCGTGC

3 4.1 Biological Background Substitution
There was a substitution error in the second position of the last fragment, where A was replaced by G. The consensus is still correct because of majority voting.
The four sequences: ACCGT, CGTGC, TTAC, TGCCGT
Fragment assembly:
    --ACCGT--
    ----CGTGC
    TTAC-----
    -TGCCGT--
    ---------
    TTACCGTGC

4 4.1 Biological Background Insertion
There was an insertion error in the second position of the second fragment: base A appeared where there should be none. The consensus is still correct.
The four sequences: ACCGT, CAGTGC, TTAC, TACCGT
Fragment assembly:
    --ACC-GT--
    ----CAGTGC
    TTAC------
    -TACC-GT--
    ----------
    TTACC-GTGC

5 4.1 Biological Background Deletion
There was a deletion of the third (or fourth) base in the last fragment. The consensus is still correct.
The four sequences: ACCGT, CGTGC, TTAC, TACGT
Fragment assembly:
    --ACCGT--
    ----CGTGC
    TTAC-----
    -TAC-GT--
    ---------
    TTACCGTGC

6 4.1 Biological Background Chimera
The last fragment in this input set is a chimera: its two halves match regions that are not adjacent in the target (TTA at the left end and TGC at the right end).
The five sequences: ACCGT, CGTGC, TTAC, TACCGT, TTATGC
Fragment assembly:
    --ACCGT--
    ----CGTGC
    TTAC-----
    -TACCGT--
    ---------
    TTACCGTGC

    TTA---TGC

7 4.1 Biological Background Unknown Orientation
Fragments can come from either DNA strand, and we generally do not know to which strand a particular fragment belongs. We do know, however, that whatever the strand, the sequence is read from 5' to 3'. Because the strands are complementary and run in opposite orientations, saying that a fragment is a substring of one strand is equivalent to saying that its reverse complement is a substring of the other.

8 4.1 Biological Background Fragment assembly with unknown orientation
Initially we do not know the orientation of the fragments.
Input: CACGT, ACGT, ACTACG, GTACT, ACTGA, CTGA
Answer (the arrow gives the orientation chosen for each fragment; ← means the reverse complement is used):
    → CACGT--------
    → -ACGT--------
    ← --CGTAGT-----
    ← -----AGTAC---
    → --------ACTGA
    → ---------CTGA
      CACGTAGTACTGA
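
The orientation step can be illustrated with a minimal Python sketch. The function name `reverse_complement` is chosen here for illustration and is not from the slides; the sketch assumes fragments contain only the letters A, C, G, and T.

```python
# Map each base to its complement (assumes an alphabet of A, C, G, T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(fragment: str) -> str:
    """Return the fragment as it would be read, 5' to 3', on the other strand."""
    return fragment.translate(COMPLEMENT)[::-1]


# The two fragments marked with a left arrow in the answer above.
print(reverse_complement("ACTACG"))  # CGTAGT
print(reverse_complement("GTACT"))   # AGTAC
```

Applying the function twice returns the original fragment, which is why an assembler only has to try two orientations per fragment.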

9 4.1 Biological Background Fragment assembly with unknown orientation Repeated regions
Repeated regions, or repeats, are sequences that appear two or more times in the target molecule. If the level of similarity between two copies of a repeat is high enough, the differences can be mistaken for base call errors.
[Figure: target sequence with two blocks, X1 and X2, which are approximately the same sequence.]

10 4.1 Biological Background Fragment assembly with unknown orientation The kinds of problems (Repeats)
Target sequence leading to ambiguous assembly because of repeats of the form XXX:
    A X B X C X D
    A X C X B X D

11 4.1 Biological Background Fragment assembly with unknown orientation The kinds of problems (Repeats)
Target sequence leading to ambiguous assembly because of repeats of the form XYXY:
    A X B Y C X D Y E
    A X D Y C X B Y E

12 4.1 Biological Background Fragment assembly with unknown orientation The kinds of problems (Repeats)
Inverted repeats, which are repeated regions in opposite strands, can also occur and are potentially more dangerous.
[Figure: target sequence with an inverted repeat; the two copies of X are related by a 180° rotation.]

13 4.2 Models Shortest Common Superstring
Problem: Shortest Common Superstring (SCS)
Input: A collection F of strings.
Output: A shortest possible string S such that for every f ∈ F, S is a superstring of f.
Example: F = {ACT, CTA, AGT}. S = ACTAGT is the SCS of F; for instance, CTA is a substring of S.
    ACT---
    -CTA--
    ---AGT
    ------
    ACTAGT
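
For a collection as small as the one in the example, the SCS can be found by exhaustive search. The following Python sketch is an illustration only, not a method given in the slides: it tries every ordering of the fragments and merges neighbours on their longest overlap, which takes factorial time and is unusable on real data. The names `overlap` and `shortest_common_superstring` are hypothetical.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_common_superstring(fragments):
    """Brute-force SCS: try every fragment order (factorial time)."""
    unique = sorted(set(fragments))
    # A fragment contained in another one never changes the answer.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    if not frags:
        return ""
    best = None
    for order in permutations(frags):
        s = order[0]
        for f in order[1:]:
            s += f[overlap(s, f):]  # append only the non-overlapping tail
        if best is None or len(s) < len(best):
            best = s
    return best


print(shortest_common_superstring(["ACT", "CTA", "AGT"]))  # ACTAGT
```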

14 4.2 Models Shortest Common Superstring Problem
[Figure: target sequence with a long repeat X that contains many fragments.]

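15 4.2 Models Reconstruction
The Reconstruction model deals with errors and unknown orientation. It is based on the substring edit distance
    $d_s(a, b) = \min_{s \in S(b)} d(a, s)$
where S(b) is the set of all substrings of b and d is the classical edit distance. The measure is asymmetric: in general $d_s(a, b) \neq d_s(b, a)$.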

16 4.2 Models Reconstruction Example
Optimal alignment for substring edit distance, which does not charge for end spaces (deletions) in the first string.
    a = -----GC-GATAG----
    b = CAGTCGCTGATCGTACG
    $d_s(a, b) = 2$
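
The substring edit distance can be computed with the usual edit-distance dynamic program by letting an alignment start at any position of the second string for free and end at any position of it. The Python sketch below is one way to do this under unit costs for substitutions, insertions, and deletions; the function name is hypothetical and the slides do not prescribe an implementation.

```python
def substring_edit_distance(a: str, b: str) -> int:
    """Smallest edit distance between a and any substring of b (unit costs)."""
    # Row 0 is all zeros: skipping a prefix of b costs nothing.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        # Column 0: the first i characters of a matched against nothing.
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            substitution = prev[j - 1] + (a[i - 1] != b[j - 1])
            curr[j] = min(substitution, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    # Taking the minimum over the last row makes the unused suffix of b free.
    return min(prev)


a = "GCGATAG"
b = "CAGTCGCTGATCGTACG"
print(substring_edit_distance(a, b))  # 2
```

Swapping the arguments gives a much larger value, because the 17 bases of b cannot fit inside any substring of the 7-base string a without at least 10 edits. This illustrates the asymmetry of the measure.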

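17 4.2 Models Reconstruction
Given an error tolerance ε, f is an approximate substring of S when $d_s(f, S) \le \epsilon |f|$, that is, ε errors are permitted for each base in f.
Input: A collection F of strings and an error tolerance ε between 0 and 1.
Output: A shortest possible string S such that for every f ∈ F, $\min(d_s(f, S), d_s(\bar{f}, S)) \le \epsilon |f|$, where $\bar{f}$ is the reverse complement of f.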

18 4.2 Models Multicontig
Layouts for the fragments TAATG, TGTAA, and GTAC.
3-contig:
    --TAATG
    TGTAA--

    GTAC
2-contig:
    TAATG---
    ---TGTAA

    GTAC
1-contig:
    TGTAA-----
    --TAATG---
    ------GTAC

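19 4.3 Algorithms Overlap multigraph
Fragments: a = TACGA, b = ACCC, c = CTAAAG, d = GACA
Edges (weight = overlap length): a → b (1), a → d (2), b → c (1), c → d (1), d → b (1). For example, the edge from c to d represents the overlap between fragments c and d.
PATH1 = dbc
    GACA--------
    ---ACCC-----
    ------CTAAAG
PATH2 = abcd
    a = TACGA-----------
    b = ----ACCC--------
    c = -------CTAAAG---
    d = ------------GACA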
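
The edges of the overlap multigraph for these four fragments can be enumerated directly. This Python sketch assumes exact suffix-prefix matches and records every overlap length for each ordered pair, since a multigraph may hold several edges between the same two fragments. The helper names `overlaps` and `overlap_multigraph` are hypothetical.

```python
def overlaps(a: str, b: str):
    """All lengths k such that the last k bases of a equal the first k of b."""
    return [k for k in range(1, min(len(a), len(b))) if a[-k:] == b[:k]]


def overlap_multigraph(fragments: dict):
    """Map each ordered pair of fragment names to its list of edge weights."""
    edges = {}
    for x, fx in fragments.items():
        for y, fy in fragments.items():
            if x != y:
                weights = overlaps(fx, fy)
                if weights:
                    edges[(x, y)] = weights
    return edges


fragments = {"a": "TACGA", "b": "ACCC", "c": "CTAAAG", "d": "GACA"}
for (x, y), weights in overlap_multigraph(fragments).items():
    print(x, "->", y, weights)
```

The path abcd uses three edges of weight 1, so the superstring it spells has 19 - 3 = 16 bases, which matches the width of the PATH2 layout.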

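20 4.3 Algorithms The greedy
Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph. For a collection in which no fragment is a substring of another, the superstring spelled by a path has length equal to the total length of the fragments minus the weight of the path, so the heaviest path gives the shortest superstring.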

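21 4.3 Algorithms The greedy Example
    S = AGTATTGGCAATCGATGCAAACCTTTTGGCAATCACT
    w = AGTATTGGCAATC
    z = AATCGATG
    u = ATGCAAACCT
    x = CCTTTTGG
    y = TTGGCAATCACT
The solution generated by the greedy algorithm (w → y with overlap 9, z → u → x with overlaps 3 and 3, and the two chains joined with overlap 0) has length 36, one base shorter than S. However, its weakest link is zero.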
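
The greedy behaviour on this example can be reproduced with a short Python sketch. It is a simplified string-merging version written for illustration, not necessarily the exact procedure in the book: it repeatedly merges the pair of strings with the largest exact overlap and breaks ties by input order. The function names are hypothetical.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Repeatedly merge the two strings with the largest overlap."""
    strings = list(fragments)
    while len(strings) > 1:
        best = (-1, 0, 1)  # (overlap, index of left string, index of right)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:  # ties keep the earliest pair found
                        best = (k, i, j)
        k, i, j = best
        merged = strings[i] + strings[j][k:]
        strings = [s for n, s in enumerate(strings) if n not in (i, j)]
        strings.append(merged)
    return strings[0]


w, z, u = "AGTATTGGCAATC", "AATCGATG", "ATGCAAACCT"
x, y = "CCTTTTGG", "TTGGCAATCACT"
result = greedy_superstring([w, z, u, x, y])
print(len(result))  # 36
```

The first merge joins w and y on their 9-base overlap, which collapses the two copies of the repeat TTGGCAATC in S and leaves the final two chains with nothing in common.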

22 4.3 Algorithms Acyclic Hamiltonian path
The path w → z → u → x → y has overlaps 4, 3, 3, 4. This solution has length 37. Its weakest link is 3.

23 4.4 Heuristics Alignment and consensus
Suppose we have a path f → g → h, with f = CATAGTC, g = TAACTAT, h = AGACTATCC.
    CATAGTC-----
    --TA-ACTAT--
    ---AGACTATCC
    ------------
    CATAGACTATCC
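
The consensus row of such a layout can be obtained by a column-wise majority vote. The Python sketch below assumes that each row is padded with `-` to the full width, that dashes outside a fragment's span are padding and do not vote, and that dashes inside the span are gaps and do vote. The function name `consensus` is hypothetical, and ties are resolved arbitrarily (by first occurrence).

```python
from collections import Counter


def consensus(layout):
    """Majority vote per column over the fragments that cover that column."""
    width = max(len(row) for row in layout)
    votes = [Counter() for _ in range(width)]
    for row in layout:
        covered = [i for i, ch in enumerate(row) if ch != "-"]
        if not covered:
            continue
        # Only positions between the first and last base of the row vote.
        for i in range(covered[0], covered[-1] + 1):
            votes[i][row[i]] += 1
    return "".join(v.most_common(1)[0][0] if v else "-" for v in votes)


layout = [
    "CATAGTC-----",
    "--TA-ACTAT--",
    "---AGACTATCC",
]
print(consensus(layout))  # CATAGACTATCC
```

The same vote recovers TTACCGTGC from the substitution layout of slide 3, where the erroneous G is outvoted by two A's.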

24 4.4 Heuristics Alignment and consensus
Two layouts for the same sequences, compared using sum-of-pairs scoring.
    ACT-GG
    ACTTGG
    AC-TGG
    ACT-GG
    AC-TGG
    ACTTGG
Columns 3 and 4 of these rows (highlighted on the slide): T-, TT, -T, T-, -T, TT
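
Sum-of-pairs scoring adds up a pairwise score over every pair of rows in every column. The Python sketch below uses assumed values that do not come from the slides: +1 for a match, -1 for a mismatch, -1 for a base against a space, and 0 for two spaces. The second layout in the usage example is also hypothetical; it places the space in the same column for every copy of ACTGG.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols; the default values are assumptions."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(rows, **scores):
    """Sum pair_score over all pairs of rows in every column."""
    return sum(
        pair_score(x, y, **scores)
        for column in zip(*rows)
        for x, y in combinations(column, 2)
    )


mixed = ["ACT-GG", "ACTTGG", "AC-TGG", "ACT-GG", "AC-TGG", "ACTTGG"]
aligned = ["ACT-GG", "ACTTGG", "ACT-GG", "ACT-GG", "ACT-GG", "ACTTGG"]  # hypothetical
print(sum_of_pairs(mixed), sum_of_pairs(aligned))  # 56 68
```

Under these assumed scores, the layout with consistently placed spaces scores higher, because inconsistent placement pairs a base with a space in two columns instead of one.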

