1 Error-Tolerant Algorithms in Bioinformatics Wen-Lian Hsu Institute of Information Science Academia Sinica.

Presentation transcript:

1 Error-Tolerant Algorithms in Bioinformatics Wen-Lian Hsu Institute of Information Science Academia Sinica

2/55 Discrete Algorithms
‧ Discrete mathematics lies at the foundation of modern computer science
‧ Most algorithms we have learned in computer science are discrete
‧ Discrete algorithms emphasize “worst case analysis”
‧ Many sequence manipulation algorithms in bioinformatics are discrete

Error-Tolerant Algorithms
‧ Many recognition problems in nature need algorithms that remove noise automatically to recover the correct information:
–Optical character recognition (OCR)
–Human face recognition
–Voice recognition
–Style checker

Design of Algorithms
‧ Optimization problems
–one can define “approximation algorithms”
‧ Decision problems (isomorphism, recognition, etc.)
–one can consider the “least number of changes” needed to yield a “yes” answer, but this often makes the problem much harder
–even if such a solution can be found, it might not make any practical sense
–there is no easy way to measure the “deviation”

5/55 A New Paradigm: Error-Tolerant Algorithms
‧ Real life data always contain some errors (say 5%)
‧ The challenge: automatically separate the 95% “correct” information from the 5% “incorrect” information
‧ Robustness (difficult to define)
‧ Similar in nature to voice recognition and character recognition algorithms

6/55 Natural Problems (1)
‧ Natural problems: problems arising from nature, which are guaranteed to have feasible solutions if the data are collected accurately.
–But because of noise in the sampled data, such solutions are hard to come by.
‧ To tackle these problems one should focus on real data rather than worst case analysis.

7/55 Natural Problems (2)
‧ Techniques that take advantage of the natural constraints of these problems do not necessarily work for general data (especially the worst case), but could perform very well on such well-structured problems.
Constraints → Structures → Knowledge

An Error-Tolerant Algorithm for the Consecutive Ones Property Wen-Lian Hsu Academia Sinica

Human Genome Project
‧ DNA sequencing (could be over 10 million bp): sequences over the 4 letters A, G, C, T
‧ Topics of the human genome project:
–Cutting and reassembling DNA sequences
–Sequence comparison
–Gene finding
–Transcription mechanism of DNA sequences
–Prediction of the structure of proteins
–Phylogenetic trees

Cutting and Reassembling a DNA Sequence
‧ Cut a DNA sequence into small pieces in different ways and reassemble them
‧ The “small” pieces (called clones) are still too large to be sequenced completely
‧ Biologically, “probes” are used to mark the clones
–each probe could mark several clones
–each clone could contain several probes

Probe-Clone (0,1)-Matrix
‧ Each probe can be regarded as a column; each clone can be regarded as a row of probes
‧ If each probe hits the DNA sequence only once (unique probe) and there is no error in the probe-clone matrix, then one can use the consecutive ones test to order the clones
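To make the consecutive ones test concrete, here is a minimal Python sketch. It is not the algorithm from the slides: it only checks a given column order and, for tiny matrices, searches all column permutations by brute force. The function names `rows_are_consecutive` and `find_c1p_order` are hypothetical, and the matrix is assumed to be a list of rows (clones) over columns (probes).

```python
from itertools import permutations


def rows_are_consecutive(matrix, col_order):
    """Return True if, with columns arranged as col_order,
    the 1s of every row form a single contiguous block."""
    for row in matrix:
        # Positions (in the new arrangement) where this row has a 1.
        ones = [pos for pos, col in enumerate(col_order) if row[col] == 1]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True


def find_c1p_order(matrix):
    """Brute-force search for a column order with consecutive ones.
    Exponential time: an illustration only, not a linear-time test."""
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if rows_are_consecutive(matrix, order):
            return list(order)
    return None  # no column order works: the matrix lacks the C1P


# Two clones sharing one probe: moving the shared probe to the middle works.
print(find_c1p_order([[1, 0, 1],
                      [0, 1, 1]]))
```

The linear-time tests cited in the slides avoid this exhaustive search; the sketch is only meant to pin down what the property asks for.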

Consecutive Ones Property (C1P)
‧ Booth & Lueker [1976]: linear time, on-line
–made use of a data structure called PQ-trees
‧ Hsu [1992]: decomposition, off-line
–did not use PQ-trees
‧ However, these algorithms do not work on data that contain errors

13/55 C1P Testing with Good Row Ordering

14/55 Exact Algorithm for Consecutive Ones Testing
1. Construct G’’, a spanning tree of G’ (the strictly overlapping graph). Each connected component corresponds to a prime submatrix. (matrix decomposition)
2. Decide the topological ordering of the prime matrices.
3. For each prime matrix, determine the ordering of columns, using the set partition strategy, according to the preorder traversal of the corresponding connected component of G’’. (good ordering)
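Step 1 can be sketched as follows. The slides do not define "strictly overlapping", so this sketch assumes the usual meaning: two rows strictly overlap when they share at least one column and neither row's set of 1s contains the other's. The names `strictly_overlap` and `overlap_components` are hypothetical, and the pairwise comparison is quadratic, chosen for clarity rather than efficiency.

```python
def strictly_overlap(a, b):
    """a and b are sets of column indices where a row has a 1.
    Assumed definition: they intersect and neither contains the other."""
    return bool(a & b) and not a <= b and not b <= a


def overlap_components(matrix):
    """Group row indices into connected components of the
    strictly overlapping graph, using a depth-first traversal."""
    rows = [{j for j, v in enumerate(row) if v} for row in matrix]
    seen = [False] * len(rows)
    components = []
    for start in range(len(rows)):
        if seen[start]:
            continue
        seen[start] = True
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in range(len(rows)):
                if not seen[v] and strictly_overlap(rows[u], rows[v]):
                    seen[v] = True
                    stack.append(v)
        components.append(sorted(comp))
    return components


matrix = [
    [1, 1, 0, 0, 0, 0],  # row 0
    [0, 1, 1, 0, 0, 0],  # row 1 strictly overlaps row 0
    [1, 1, 1, 1, 0, 0],  # row 2 contains rows 0 and 1
    [0, 0, 0, 0, 1, 1],  # row 3 is disjoint from the rest
]
print(overlap_components(matrix))
```

Under the assumption above, each returned group would play the role of one prime submatrix; the edges followed by the traversal form a spanning forest of the graph.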

Problems in Lab Data
‧ False positives: a “1” that should actually be a “0”
‧ False negatives: a “0” that should actually be a “1”
‧ The probes are not necessarily unique
–there are a lot of repeating subsequences in a DNA sequence
‧ Chimeric clones: two clones stick together at the end
‧ In STOC, Karp [1993] posed this as a problem that needs a major breakthrough in computational biology
‧ How to deal with it? Neighborhood consensus
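The first two error types can be illustrated with a small simulation sketch. The slides do not describe how their test data were generated, so the model here is an assumption: every entry is flipped independently with probability `rate`. The function name `inject_errors` and the `seed` parameter are hypothetical; non-unique probes and chimeric clones are not modelled.

```python
import random


def inject_errors(matrix, rate, seed=0):
    """Flip each entry independently with probability `rate`.
    Returns the noisy copy plus the positions of the flips:
    a 0 turned into 1 is a false positive, a 1 turned into 0
    is a false negative."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    noisy = [row[:] for row in matrix]
    false_pos, false_neg = [], []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if rng.random() < rate:
                noisy[i][j] = 1 - value
                if value == 1:
                    false_neg.append((i, j))
                else:
                    false_pos.append((i, j))
    return noisy, false_pos, false_neg


clean = [[1, 1, 0, 0],
         [0, 1, 1, 0],
         [0, 0, 1, 1]]
noisy, fp, fn = inject_errors(clean, rate=0.05)
print(noisy, fp, fn)
```

Keeping the flip positions alongside the noisy matrix makes it possible to score later how many errors a correction procedure recovers.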

False Positives and False Negatives false positive false negative

17/55 Non-unique Probe

18/55 Non-unique Probe

19/55 Remote False Positives

20/55 Chimeric Clone

21 An Error-Tolerant Algorithm for the C1P Test
The idea is derived from the off-line C1P test based on good row ordering

22/55 Strategy of the Fault-Tolerant Algorithm for Consecutive Ones Testing
1. Detect and correct the four types of errors to construct G’’.
2. Decide the topological ordering of the prime matrices.
3. Use a heuristic set partition strategy to determine the ordering of columns. There will be bad rows and lost columns, which indicate that the corresponding clones and probes are bad and that additional lab work is needed.

23/55 A Matrix Satisfying the C1P

24/55 A Matrix Mixed with All Four Types of Errors

25/55 Monotone Property in a Consecutive Ones Matrix

26/55 Processing row u (I) - Errorless case
[Figure labels: u, E(u), A(u), B(u), C(u), D(u), STA(u), STA’(u), LL, RR]

27/55 Processing row u (II) - Errorless case
‧ At the end, row u is shrunk to 2 columns, representing the left and right parts
‧ At the end of the algorithm, we can rewind the rows to restore all the shrunk rows

28/55 False Negatives of Row u

29/55 False Negatives of C(u)

30/55 A General Error-Tolerant Algorithm for Constructing G’’ (I)
1. Determine, for each probe, whether it is unique, and remove the remote false positives.
2. Determine, for each clone, whether it is chimeric, and remove the remote false positives.
3. Detect certain false negatives using a global technique.
4. Partition STA’(u) (STA(u) – E(u)), C(u) and D(u) based on the containment relationship, and partition A(u) and B(u) from STA’(u).

31/55 A General Error-Tolerant Algorithm for Constructing G’’ (II)
5. For each row u, detect the local false negatives and false positives.
6. Make u adjacent to every row in A(u) and B(u).
7. Delete row u, construct a special row [u] such that CL([u]) = {v1, v2}, and proceed to the next regular row.

32/55 Neighborhood Clustering
[Figure labels: u, E(u), A(u), B(u), C(u), D(u), STA(u), STA’(u), LL, RR; uncertain entries marked “?”]

Non-Unique Probes

34/55 Chimeric Clones

35/55 Remote False Positive

36/55 False Negative (Global Method)
Rows “close” to the above rows

37/55 Avoid False Negatives of Row u
Where would the false negatives go: to the left or to the right?

38/55 Avoid False Negatives of C(u)

39/55 Monotone Property in a Consecutive Ones Matrix

40/55 Local False Positives and False Negatives
[Figure labels: false positive, false negative]

41/55 A Heuristic for Local False Positives and Negatives
Fill-in: try the columns one by one to see which has the minimum fill-ins
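One possible reading of this heuristic is sketched below. The slide gives no details, so the following is an assumption rather than the method from the talk: a fill-in is taken to be a 0 lying between the first and last 1 of a row under the current column arrangement, and each candidate column is tried as the next column to append. The names `fill_ins`, `total_fill_ins` and `pick_next_column` are hypothetical.

```python
def fill_ins(row, col_order):
    """Count the 0s between the first and last 1 of `row`
    when only the columns in col_order are laid out in that order."""
    positions = [pos for pos, col in enumerate(col_order) if row[col] == 1]
    if not positions:
        return 0
    return (positions[-1] - positions[0] + 1) - len(positions)


def total_fill_ins(matrix, col_order):
    """Sum of fill-ins over all rows for a given column arrangement."""
    return sum(fill_ins(row, col_order) for row in matrix)


def pick_next_column(matrix, placed, candidates):
    """Try each candidate column as the next one appended to `placed`
    and return (column, cost) with the fewest fill-ins.
    Ties go to the earliest candidate."""
    best_col, best_cost = None, None
    for col in candidates:
        cost = total_fill_ins(matrix, placed + [col])
        if best_cost is None or cost < best_cost:
            best_col, best_cost = col, cost
    return best_col, best_cost


matrix = [[1, 1, 0, 0],
          [0, 1, 1, 0],
          [0, 0, 1, 1]]
# Columns 0 and 2 are already placed; appending column 1 would split row 0.
print(pick_next_column(matrix, [0, 2], [1, 3]))
```

A greedy choice like this can be misled by noisy entries, which is presumably why the slides treat it as a heuristic and flag bad rows for further lab work.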

42/55 Ordering Probes
[Figure labels: false negatives, false positives]

43/55 Bad Row for Partition

44/55 Islands of Probes
[Figure labels: Island 1, Island 2, bad row]

45/55 Order of Islands
[Figure labels: Island 1, Island 2, Island 3]

Jump Column of Result Matrix

47/55 Simulation Results (I): 100x100 (total 50 matrices)

48/55 Simulation Results (II): 200x200 (total 50 matrices)

49/55 Simulation Results (III): 400x400 (total 50 matrices)

50/55 A 50x50 matrix with error rate 5%
[Figure: the matrix, with entries marked 1, N, P and F; the layout is not recoverable from the transcript]

51/55 A 50x50 matrix with error rate 10%
[Figure: the matrix, with entries marked 1, N, P and F; the layout is not recoverable from the transcript]