Chap. 4 FRAGMENT ASSEMBLY OF DNA Introduction to Computational Molecular Biology Chapter 4.


4.1 Biological Background: The ideal case
Input: four fragments, ACCGT, CGTGC, TTAC and TACCGT, sampled from a target that is known to be approximately 10 bases long.
Fragment assembly (layout, with the consensus in the last row):
    --ACCGT--
    ----CGTGC
    TTAC-----
    -TACCGT--
    TTACCGTGC   (consensus)
The consensus sequence TTACCGTGC has 9 bases, which is close to the approximate length of 10.

4.1 Biological Background: Substitution
Input: ACCGT, CGTGC, TTAC, TGCCGT.
There was a substitution error in the second position of the last fragment, where A was replaced by G. The consensus is still correct because of majority voting.
    --ACCGT--
    ----CGTGC
    TTAC-----
    -TGCCGT--
    TTACCGTGC   (consensus)
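The majority vote can be sketched in a few lines of Python. This is only an illustration, not the procedure used by any particular assembler: the function name `consensus` and the use of a blank as padding (so that padding is distinguishable from a real gap `-`) are assumptions made for the example.

```python
from collections import Counter

PAD = " "  # assumed marker: the fragment does not cover this column


def consensus(layout):
    """Column-wise majority vote over the rows of a layout."""
    result = []
    for column in zip(*layout):
        votes = Counter(ch for ch in column if ch != PAD)
        if not votes:
            continue  # no fragment covers this column
        winner, _ = votes.most_common(1)[0]
        if winner != "-":  # a winning gap contributes no base
            result.append(winner)
    return "".join(result)


# Layout of the substitution example (G instead of A in the last fragment).
layout = [
    "  ACCGT  ",
    "    CGTGC",
    "TTAC     ",
    " TGCCGT  ",
]
print(consensus(layout))
```

In the third column the erroneous G is outvoted two to one by A, so the vote still yields TTACCGTGC.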

4.1 Biological Background: Insertion
Input: ACCGT, CAGTGC, TTAC, TACCGT.
There was an insertion error in the second position of the second fragment: base A appeared where there should be none. The consensus is still correct, because the gap wins the vote in that column.
    --ACC-GT--
    ----CAGTGC
    TTAC------
    -TACC-GT--
    TTACC-GTGC   (consensus)

4.1 Biological Background: Deletion
Input: ACCGT, CGTGC, TTAC, TACGT.
There was a deletion of the third (or fourth) base in the last fragment. The consensus is still correct.
    --ACCGT--
    ----CGTGC
    TTAC-----
    -TAC-GT--
    TTACCGTGC   (consensus)

4.1 Biological Background: Chimera
Input: ACCGT, CGTGC, TTAC, TACCGT, TTATGC.
The last fragment in this input set is a chimera: it joins TTA from the start of the target with TGC from the end, so it does not match any contiguous stretch of the consensus.
    --ACCGT--
    ----CGTGC
    TTAC-----
    -TACCGT--
    TTACCGTGC   (consensus)
    TTA---TGC   (chimera)

4.1 Biological Background: Unknown orientation
Fragments can come from either DNA strand, and we generally do not know to which strand a particular fragment belongs. We do know, however, that whatever the strand, the sequence is read from 5' to 3'. Because the two strands are complementary and have opposite orientations, using a fragment (a substring of one strand) is equivalent to using its reverse complement (a substring of the other strand).
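As a small illustration of this equivalence, the reverse complement of a fragment can be computed as below. The function name `reverse_complement` is chosen for the example, and the sketch assumes fragments contain only the bases A, C, G and T.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(fragment):
    """Complement every base, then reverse, to read the other strand 5' to 3'."""
    return fragment.translate(COMPLEMENT)[::-1]


print(reverse_complement("ACTACG"))  # CGTAGT
```

Applying the function twice returns the original fragment, which is why either form can stand for the same piece of the molecule.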

4.1 Biological Background: Fragment assembly with unknown orientation
Initially we do not know the orientation of the fragments.
Input: CACGT, ACGT, ACTACG, GTACT, ACTGA, CTGA.
Answer (→ marks a fragment used as given, ← a fragment used as its reverse complement):
    → CACGT--------
    → -ACGT--------
    ← --CGTAGT-----
    ← -----AGTAC---
    → --------ACTGA
    → ---------CTGA
      CACGTAGTACTGA   (consensus)

4.1 Biological Background: Repeated regions
Repeated regions, or repeats, are sequences that appear two or more times in the target molecule. If the level of similarity between two copies of a repeat is high enough, the differences can be mistaken for base call errors.
(Figure: a target molecule with two blocks marked X1 and X2, which are approximately the same sequence.)

4.1 Biological Background: The kinds of problems (repeats of the form XXX)
A target sequence with three copies of a repeat X leads to ambiguous assembly: the regions B and C between the copies can be exchanged.
    A X B X C X D
    A X C X B X D
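The ambiguity can be checked on a toy example. The sketch below makes several assumptions that are not in the slides: fragments are modelled as error-free substrings of one fixed length k, and the block contents (`A`, `B`, `C`, `D`, `X`) are made-up strings. Under those assumptions, as long as no fragment covers a whole copy of X plus a base on each side, both arrangements produce exactly the same set of fragments.

```python
def kmers(sequence, k):
    """Set of all length-k substrings, standing in for error-free fragments."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# Hypothetical block contents; only X contains the base G.
A, B, C, D = "AC", "CA", "TC", "AT"
X = "GGTTGG"

target_1 = A + X + B + X + C + X + D
target_2 = A + X + C + X + B + X + D

short = len(X) + 1  # cannot span a copy of X with both flanks
long_ = len(X) + 2  # can span a copy of X with both flanks

print(kmers(target_1, short) == kmers(target_2, short))  # same fragment set
print(kmers(target_1, long_) == kmers(target_2, long_))  # fragment sets differ
```

Only fragments long enough to bridge a full repeat copy carry the information needed to tell the two targets apart.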

4.1 Biological Background: The kinds of problems (repeats of the form XYXY)
A target sequence with interleaved repeats X and Y also leads to ambiguous assembly: the regions B and D can be exchanged.
    A X B Y C X D Y E
    A X D Y C X B Y E

4.1 Biological Background: The kinds of problems (inverted repeats)
Inverted repeats, which are repeated regions in opposite strands, can also occur and are potentially more dangerous.
(Figure: target sequence with an inverted repeat; the two copies of X lie on opposite strands, so one is the other rotated by 180°.)

4.2 Models: Shortest Common Superstring
Problem: Shortest Common Superstring (SCS)
Input: A collection F of strings.
Output: A shortest possible string S such that for every f ∈ F, S is a superstring of f.
Example: F = {ACT, CTA, AGT}. S = ACTAGT is the SCS of F; every fragment, for instance CTA, is a substring of S.
    ACT---
    -CTA--
    ---AGT
    ACTAGT
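For a collection as small as the one in the example, the SCS can be found by brute force. The sketch below is only an illustration under stated assumptions: it tries every ordering of the fragments and merges neighbours with their longest suffix-prefix overlap, which takes factorial time and is therefore usable only on tiny inputs. The function names are chosen for the example.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a, b):
    """Append b to a, reusing the overlap; skip b if a already contains it."""
    if b in a:
        return a
    return a + b[overlap(a, b):]


def shortest_common_superstring(fragments):
    """Try every ordering and keep the shortest merged string."""
    best = None
    for order in permutations(fragments):
        candidate = order[0]
        for fragment in order[1:]:
            candidate = merge(candidate, fragment)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best


print(shortest_common_superstring(["ACT", "CTA", "AGT"]))  # ACTAGT
```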

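4.2 Models: Shortest Common Superstring (limitation)
(Figure: target sequence with a long repeat X that contains many fragments.)
In such a target, fragments that lie entirely inside different copies of X can be merged into a single copy by a shortest superstring, so the SCS may be shorter than the true target.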

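4.2 Models: Reconstruction
The Reconstruction model deals with errors and unknown orientation. It is based on the substring edit distance:
    d_s(a, b) = min { d(a, s) : s ∈ S(b) }
where S(b) is the set of all substrings of b and d is the classical edit distance. The substring edit distance is asymmetric: in general d_s(a, b) ≠ d_s(b, a).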

4.2 Models: Reconstruction (example)
Optimal alignment for the substring edit distance, which does not charge for end deletions in the first string:
    a = -----GC-GATAG----
    b = CAGTCGCTGATCGTACG
The aligned region contains one space and one mismatch, so d_s(a, b) = 2.
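The substring edit distance can be computed with the usual edit-distance dynamic program, changed so that starting and ending anywhere in b is free. The following is a minimal sketch assuming unit costs for substitutions, insertions and deletions; the function name is chosen for the example.

```python
def substring_edit_distance(a, b):
    """Minimum edit distance between a and any substring of b (unit costs)."""
    # Row 0 is all zeros: the match may start at any position of b for free.
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [i] + [0] * len(b)  # aligning a[:i] to nothing costs i
        for j in range(1, len(b) + 1):
            substitution = previous[j - 1] + (a[i - 1] != b[j - 1])
            current[j] = min(previous[j] + 1, current[j - 1] + 1, substitution)
        previous = current
    # Taking the minimum over the last row lets the match end anywhere in b.
    return min(previous)


a = "GCGATAG"
b = "CAGTCGCTGATCGTACG"
print(substring_edit_distance(a, b))  # 2
```

Swapping the arguments gives a much larger value, because every base of the longer string must then be matched against the shorter one; this is the asymmetry of d_s.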

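4.2 Models: Reconstruction (problem statement)
An error tolerance ε gives the number of errors allowed for each base in f: f is an approximate substring of S at error level ε when d_s(f, S) ≤ ε|f|.
Problem: Reconstruction
Input: A collection F of strings and an error tolerance ε between 0 and 1.
Output: A shortest possible string S such that for every f ∈ F, either f or its reverse complement is an approximate substring of S at error level ε.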

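4.2 Models: Multicontig
A t-contig is a layout in which every link between adjacent fragments is an overlap of at least t bases. For the fragments TGTAA, TAATG and GTAC:
3-contigs (two contigs):
    TGTAA--
    --TAATG

    GTAC
2-contigs (two contigs):
    TAATG---
    ---TGTAA

    GTAC
1-contig (one contig):
    TGTAA-----
    --TAATG---
    ------GTAC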

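4.3 Algorithms: Overlap multigraph
Fragments: a = TACGA, b = ACCC, c = CTAAAG, d = GACA. Each directed edge of the graph represents an overlap between a suffix of one fragment and a prefix of another, for example the overlap between fragments c and d. A path in the graph corresponds to a layout:
PATH1 = dbc
    d = GACA--------
    b = ---ACCC-----
    c = ------CTAAAG
PATH2 = abcd
    a = TACGA-----------
    b = ----ACCC--------
    c = -------CTAAAG---
    d = ------------GACA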
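The edges of such a graph can be enumerated directly from the fragments. The sketch below is a simplification: it lists only overlaps of positive length (edges of weight zero are left out), it ignores errors and reverse complements, and the function names are chosen for the example.

```python
from itertools import permutations


def overlaps(a, b):
    """Every k >= 1 for which the last k bases of a equal the first k of b."""
    return [k for k in range(1, min(len(a), len(b))) if a[-k:] == b[:k]]


def overlap_edges(fragments):
    """Sorted list of (source, target, weight), one entry per overlap."""
    edges = []
    for (u, seq_u), (v, seq_v) in permutations(fragments.items(), 2):
        for k in overlaps(seq_u, seq_v):
            edges.append((u, v, k))
    return sorted(edges)


fragments = {"a": "TACGA", "b": "ACCC", "c": "CTAAAG", "d": "GACA"}
for u, v, k in overlap_edges(fragments):
    print(f"{u} -> {v}  weight {k}")
```

For these four fragments the positive-weight edges are a→b, b→c, c→d and d→b with weight 1, and a→d with weight 2.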

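4.3 Algorithms: The Greedy algorithm
Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph, where the weight of an edge is the length of the overlap it represents. The Greedy algorithm builds such a path by repeatedly selecting the heaviest remaining edge that still leaves the selected edges a set of paths.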

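4.3 Algorithms: The Greedy algorithm (example)
Target: S = AGTATTGGCAATCGATGCAAACCTTTTGGCAATCACT (37 bases)
Fragments:
    w = AGTATTGGCAATC
    z = AATCGATG
    u = ATGCAAACCT
    x = CCTTTTGG
    y = TTGGCAATCACT
The Greedy algorithm first joins w and y, whose overlap TTGGCAATC (9 bases) is the largest. The solution it generates has length 36, shorter than the target. However, its weakest link is zero: two of its fragments are joined with no overlap at all.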
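The behaviour of Greedy on these fragments can be reproduced with a simple merging variant of the algorithm, sketched below. It repeatedly merges the two strings with the largest overlap instead of selecting edges in a graph, and it ignores errors and orientation; these are simplifications made for the example, and the function names are hypothetical.

```python
def overlap(a, b):
    """Longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Repeatedly merge the pair of strings with the largest overlap."""
    strings = list(fragments)
    while len(strings) > 1:
        best = (-1, 0, 1)  # (overlap, index of left string, index of right)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j and overlap(a, b) > best[0]:
                    best = (overlap(a, b), i, j)
        k, i, j = best
        merged = strings[i] + strings[j][k:]
        strings = [s for n, s in enumerate(strings) if n not in (i, j)]
        strings.append(merged)
    return strings[0]


fragments = ["AGTATTGGCAATC", "AATCGATG", "ATGCAAACCT",
             "CCTTTTGG", "TTGGCAATCACT"]
result = greedy_superstring(fragments)
print(len(result), result)
```

The first merge joins w and y through their 9-base overlap; after that, the merged string shares no overlap with the chain built from z, u and x, so the last merge is a plain concatenation and the result has 36 bases.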

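4.3 Algorithms: Acyclic Hamiltonian path
The Hamiltonian path w -> z -> u -> x -> y uses the overlaps AATC, ATG, CCT and TTGG. This solution has length 37, one base longer than the Greedy solution, but its weakest link is 3.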
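The length and weakest link of this solution can be checked by assembling the fragments along a fixed path. The sketch assumes that the path visits the fragments in the order w, z, u, x, y, which is inferred from their overlaps rather than stated on the slide; the function names are chosen for the example.

```python
def overlap(a, b):
    """Longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble_path(path):
    """Merge fragments in the given order; return the string and link sizes."""
    result = path[0]
    links = []
    for previous, following in zip(path, path[1:]):
        k = overlap(previous, following)
        links.append(k)
        result += following[k:]
    return result, links


w, z, u = "AGTATTGGCAATC", "AATCGATG", "ATGCAAACCT"
x, y = "CCTTTTGG", "TTGGCAATCACT"

superstring, links = assemble_path([w, z, u, x, y])
print(superstring, len(superstring), "weakest link:", min(links))
```

The links have sizes 4, 3, 3 and 4, and the assembled string equals the 37-base sequence S given in the Greedy example.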

4.4 Heuristics: Alignment and consensus
Suppose we have a path f -> g -> h, with f = CATAGTC, g = TAACTAT and h = AGACTATCC.
    f = CATAGTC-----
    g = --TA-ACTAT--
    h = ---AGACTATCC
        CATAGACTATCC   (consensus)

4.4 Heuristics: Alignment and consensus
Two layouts for the same sequences, compared using sum-of-pairs scoring. The rows of the layout are:
    ACT-GG
    ACTTGG
    AC-TGG
    ACT-GG
    AC-TGG
    ACTTGG
The two middle columns, where the rows disagree, read T-, TT, -T, T-, -T, TT from top to bottom.
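A sum-of-pairs comparison of layouts can be sketched as follows. Two things here are assumptions made for illustration and do not come from the slides: the scoring scheme (a cost of 1 for every pair of differing symbols in a column, with a gap counted as a symbol) and the second layout `layout_b`, a hypothetical alternative in which all short rows place their gap in the same column.

```python
from itertools import combinations


def sum_of_pairs_cost(layout):
    """Count, over all columns, the pairs of rows whose symbols differ."""
    cost = 0
    for column in zip(*layout):
        cost += sum(1 for p, q in combinations(column, 2) if p != q)
    return cost


# Layout with gaps placed inconsistently (rows as in the slide).
layout_a = ["ACT-GG", "ACTTGG", "AC-TGG", "ACT-GG", "AC-TGG", "ACTTGG"]
# Hypothetical layout with every gap in the same column.
layout_b = ["ACT-GG", "ACTTGG", "ACT-GG", "ACT-GG", "ACT-GG", "ACTTGG"]

print(sum_of_pairs_cost(layout_a))  # 16
print(sum_of_pairs_cost(layout_b))  # 8
```

Under this scoring, lining up the gaps halves the cost, which illustrates why the choice of layout matters even when the underlying sequences are the same.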