Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays Nazif Cihan Tas CMSC 838 Presentation.


1 Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays
Nazif Cihan Tas, ctas@cs
CMSC 838 Presentation

2 CMSC 838T – Presentation: Motivation
- DNA microarray techniques are used intensively for identification of biological agents
  - Gene expression studies
  - Diagnostic purposes
    - Identification of microorganisms in samples
  - Item extraction
- Complex problem
  - Find the necessary probes and the temperature
  - Probe sets should reliably detect and differentiate target sequences
  - Large databases
  - NEW: homologous genes (how to find specific probes)

3 Talk Overview
- Overview of talk
  - Motivation
  - Problem Statement
  - Algorithm
  - Mathematical Aspects
  - Experimentation
  - Discussion

4 Problem Statement
- Specificity vs. sensitivity
  - Specificity: number of non-target matches is minimized
  - Sensitivity: number of selected target sequences is maximized
- Original problem:

5 Problem Statement
- Positive probes
  - Database set S0
  - Target set S1
  - For each sequence in S1, find at least one probe
  - For sequences in S0 - S1, try to avoid hybridization (but it is tolerated if it happens)
  - High specificity: number of non-target matches is minimized
  - High sensitivity: number of covered target sequences is maximized
[Figure: diagram labeled S0 and S1]
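
The two objectives on this slide can be counted directly once we know which sequences each probe hybridizes with. The sketch below is only an illustration: the `hits` dictionary layout and the function name `probe_set_stats` are assumptions, not part of the presented method.

```python
def probe_set_stats(hits, targets):
    """Return (covered targets, non-target matches) for a probe set.

    hits: {probe_id: set of sequence ids the probe hybridizes with}
          (hypothetical data layout, chosen for this example)
    targets: the target set S1
    """
    matched = set().union(*hits.values())
    covered = matched & targets        # sensitivity: sequences of S1 that are hit
    off_target = matched - targets     # specificity: sequences of S0 - S1 that are hit
    return len(covered), len(off_target)


s1 = {"seqA", "seqB", "seqC"}
hits = {"p1": {"seqA", "seqB"}, "p2": {"seqB", "seqX"}}
print(probe_set_stats(hits, s1))  # (2, 1)
```

A good positive probe set pushes the first number up toward |S1| and the second number down toward zero.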

6 Problem Statement
- Negative probes
  - Determine as few probes as possible that together hybridize with all sequences in S0 - S1 but with none in S1
  - High specificity: no sequence in S1 may hybridize
  - High sensitivity: the maximum number of sequences in S0 - S1 should be covered
[Figure: diagram labeled S1 and S0]

7 Problem cont.
- Extended problem
- Specificity vs. sensitivity
  - Specificity: no sequence in S1 may cross-hybridize with any negative probe
  - Sensitivity: number of sequences covered in B must be maximized

8 Probe Design Constraints
- Sequence related
  - Length of probes
  - Deviation of melting temperature of probe-target hybrids must be low (for physical reasons)
  - No self-complementary regions longer than four nucleotides (not descriptive enough)
  - Difference between melting temperatures of target and non-target sequences must be larger than a predefined threshold (if too close, too hard to distinguish)
    - Ensuring a minimum number of mismatches is enough (homologous sequences)
- System related
  - Execution time
  - Usability

9 Algorithm
- Overview
  - Probe Generation
  - Hybridization Prediction
  - Probe Selection

10 Algorithm: Probe Generation
- Subproblem:
  - Generate probe candidates for the sequences
  - Keep the set as small as possible without losing any optimal candidate (exclude infeasible ones)
- Suffix tree
  - Why?
    - Allows fast recognition of repetitive subsequences
    - Identifies non-unique probes (i.e. with more than one target)
    - Efficient for memory and for T computation (reduces time)
  - How?
    - Tree is constructed from the sequences
    - Traversed (Watson-Crick complement)
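
A probe binds its target site through Watson-Crick pairing, so a candidate probe is the reverse complement of a target subsequence. A minimal sketch of that step, assuming plain DNA strings over A, C, G, T (the helper name `reverse_complement` is chosen here and is not from the slides):

```python
# Translation table for Watson-Crick base pairing: A<->T, C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


print(reverse_complement("TACTACA"))  # TGTAGTA
```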

11 Suffix Tree
- Input: TACTACA
  - TACTACA
  - ACTACA
  - CTACA
  - TACA
  - ACA
  - CA
  - A
- $ denotes end of string
- Constructed in linear time
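
To make the example concrete, the sketch below lists the suffixes of TACTACA and stores them in a naive suffix trie built from nested dictionaries. This is a quadratic-time stand-in for illustration only, not the linear-time suffix tree construction the slide refers to; all function names are assumptions.

```python
def suffixes(text):
    """All suffixes of text, longest first."""
    return [text[i:] for i in range(len(text))]


def build_suffix_trie(text):
    """Naive O(n^2) suffix trie as nested dicts; '$' marks end of string."""
    root = {}
    for suffix in suffixes(text + "$"):
        node = root
        for ch in suffix:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """Number of suffixes that pass through this node."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def occurrences(trie, pattern):
    """How often pattern occurs in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return 0
        node = node[ch]
    return count_leaves(node)


trie = build_suffix_trie("TACTACA")
print(suffixes("TACTACA"))
print(occurrences(trie, "TAC"))  # 2, so "TAC" is a repeated subsequence
```

A subsequence with more than one occurrence is exactly the kind of repetitive, non-unique probe site the tree is meant to expose quickly.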

12 Probe Generation
- Further improvements
  - Filters applied for cut-off
    - Probe length (predefined)
    - G-C content (for temperature)
    - Self-complementarity
      - Probes should not contain complements as subsequences
  - Finally, remove highly conserved (non-specific) regions
  - Insert into hashtables according to their lengths
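
The three filters can be sketched as simple predicates. The thresholds below (length 20, G-C content between 40% and 60%) are placeholder assumptions, not values from the presentation; the self-complementarity check uses the "longer than four nucleotides" rule from the design constraints, read here as "no 5-mer whose reverse complement also occurs in the probe".

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def gc_content(probe):
    """Fraction of G and C bases in the probe."""
    return sum(base in "GC" for base in probe) / len(probe)


def has_self_complement(probe, min_len=5):
    """True if some min_len-mer and its reverse complement both occur in the probe."""
    kmers = {probe[i:i + min_len] for i in range(len(probe) - min_len + 1)}
    return any(reverse_complement(k) in kmers for k in kmers)


def passes_filters(probe, length=20, gc_range=(0.4, 0.6)):
    """Apply the length, G-C content and self-complementarity filters.

    length and gc_range are hypothetical defaults for this example.
    """
    return (len(probe) == length
            and gc_range[0] <= gc_content(probe) <= gc_range[1]
            and not has_self_complement(probe))


print(passes_filters("AACCAACCAACCAACCAACC"))  # True
print(passes_filters("GGGGGAAAAATTTTTCCCCC"))  # False: GGGGG pairs with CCCCC
```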

13 Algorithm

14 Algorithm: Hybridization Prediction
- Subproblem:
  - Search for the right probe
  - Search is expensive, so intelligent hashing is used
- Design
  - A frame is moved over target and non-target sequences with several lengths
    - Previous algorithm (Kaderali 2002): uses the suffix tree
  - At each step, hash values are calculated. On a hit, predict the melting temperature and store it in the hybridization matrix
  - If there are too many hits for a probe, it is not unique and is removed
  - Why intelligent?
    - Hash time is linear
    - Allows inexact matching because of hashing (no analysis given)
- Parallelization
  - Several threads search for probe targets
    - Tree and hashtables are fixed
  - One thread writes to the final matrix
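
The frame-and-hash idea can be sketched with ordinary Python dictionaries: probe sites are grouped into one table per length, and a window of each length is slid over every sequence. This sketch does exact matching only and rehashes each window from scratch; the inexact matching and melting-temperature prediction mentioned on the slide are not modeled, and the names `index_probes` and `scan` are assumptions.

```python
from collections import defaultdict


def index_probes(probe_sites):
    """Group probe target sites by length: {length: {site: probe_id}}."""
    tables = defaultdict(dict)
    for probe_id, site in probe_sites.items():
        tables[len(site)][site] = probe_id
    return tables


def scan(sequences, tables):
    """Slide a frame of every indexed length over every sequence and record hits."""
    hits = defaultdict(set)  # probe_id -> ids of sequences containing its site
    for seq_id, seq in sequences.items():
        for length, table in tables.items():
            for start in range(len(seq) - length + 1):
                probe_id = table.get(seq[start:start + length])
                if probe_id is not None:
                    hits[probe_id].add(seq_id)
    return hits


sequences = {"s1": "TACTACA", "s2": "GGTACAG"}
tables = index_probes({"p1": "TACA", "p2": "CTA"})
hits = scan(sequences, tables)
print(sorted(hits["p1"]), sorted(hits["p2"]))  # ['s1', 's2'] ['s1']
```

A probe whose hit set grows too large would then be dropped as non-unique.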

15 Hybridization Prediction
- Empirical simulation:
  - One million random probe-target pairings generated
  - Four mismatches, or one insertion or deletion plus one strong central mismatch, chosen
  - T < 20 °C for 93%
  - Complexity is O(|S0| |S1|)
    - Number of possible probe candidates is linear in |S1|
    - Each position of database S0 must be checked

16 Algorithm

17 Algorithm: Hybridization Prediction
- Complexity
- Inexact equality
  - Only the inner three bands of the DP matrix are computed
  - O(l), where l is the length
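
Restricting the dynamic program to the three inner diagonals means each row touches at most three cells, which gives the O(l) cost. The sketch below shows the banding with plain edit distance as a stand-in scoring scheme; the scoring actually used for hybridization prediction is not given in the slides, so treat the cost model here as an assumption.

```python
def banded_edit_distance(a, b, band=1):
    """Edit distance using only DP cells with |i - j| <= band.

    band=1 corresponds to the three inner diagonals. Returns infinity
    when the length difference alone already exceeds the band.
    """
    inf = float("inf")
    if abs(len(a) - len(b)) > band:
        return inf
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = {j: j for j in range(min(len(b), band) + 1)}
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - band), min(len(b), i + band) + 1):
            if j == 0:
                cur[j] = i
                continue
            best = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])  # match or mismatch
            best = min(best, prev.get(j, inf) + 1)                # skip a base of a
            best = min(best, cur.get(j - 1, inf) + 1)             # skip a base of b
            cur[j] = best
        prev = cur
    return prev[len(b)]


print(banded_edit_distance("ACGT", "AGGT"))  # 1 (one mismatch)
print(banded_edit_distance("ACGT", "ACT"))   # 1 (one deletion)
```

Cells outside the band are treated as unreachable, so alignments needing more than `band` net insertions or deletions are not found.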

18 Algorithm: Probe Selection
- Subproblem:
  - Use the hybridization matrix to finalize the probe selection
    - We have positive probes and negative probes to proceed with
- Algorithm analysis:
  - For each probe candidate
    - g: number of matches in S1
    - b: number of matches in S0 - S1
    - t: highest melting point in S1
  - Probes whose g or b value is too large are removed
  - Sort according to g, b and t
  - Apply depth-first search
- Advantages
  - Performs well (no comparison given, though)
  - Guarantees to choose all specific probes if any were found
- Disadvantages
  - Cannot guarantee optimal selection in terms of coverage
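
The g, b and t values follow directly from one row of the hybridization matrix. The sketch below computes them, applies the cut-offs and sorts the survivors. The matrix layout, the threshold parameters and the sort direction (more target matches first, then fewer non-target matches, then higher temperature) are assumptions made for the example; the depth-first search step is not shown.

```python
def rank_candidates(matrix, s1, max_g, max_b):
    """Compute (g, b, t) per probe, drop probes over the limits, and sort.

    matrix: {probe_id: {seq_id: melting_temperature}} (hypothetical layout)
    s1: set of target sequence ids
    """
    ranked = []
    for probe, row in matrix.items():
        g = sum(1 for seq_id in row if seq_id in s1)   # matches in S1
        b = len(row) - g                               # matches in S0 - S1
        t = max((temp for seq_id, temp in row.items() if seq_id in s1),
                default=float("-inf"))                 # highest melting point in S1
        if g <= max_g and b <= max_b:
            ranked.append((probe, g, b, t))
    ranked.sort(key=lambda r: (-r[1], r[2], -r[3]))
    return ranked


matrix = {
    "p1": {"a": 60.0, "b": 58.0},
    "p2": {"a": 61.0, "x": 40.0},
    "p3": {"x": 50.0, "y": 51.0, "z": 52.0},
}
print(rank_candidates(matrix, {"a", "b"}, max_g=2, max_b=1))
# [('p1', 2, 0, 60.0), ('p2', 1, 1, 61.0)]
```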

19 Negative Probe Selection
- Let S2 = S0 - S1 and let B be a subset of S2. The probes that detect S1 also detect some elements of B.
- Algorithm for negative probes
  - Apply probe generation and preselection for B
  - Conduct hybridization on B ∪ S1
  - Remove the probes that hybridize with S1
  - Sort the remaining probes according to their hit number
  - Successively select the probes that cover the most target sequences
- Guarantees optimal solution for coverage and probe number usage
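
The last three steps read like a greedy covering loop, sketched below under that interpretation. The `hits` layout and the function name are assumptions. Note that greedy covering is in general a heuristic for the minimum number of sets, so this sketch illustrates the selection order rather than demonstrating the optimality stated on the slide.

```python
def select_negative_probes(hits, s1, b_set):
    """Greedily pick probes that cover B without touching S1.

    hits: {probe_id: set of sequence ids the probe hybridizes with}
    Returns (chosen probe ids in selection order, sequences of B left uncovered).
    """
    # Remove every probe that hybridizes with any sequence in S1.
    usable = {p: h & b_set for p, h in hits.items() if not (h & s1)}
    uncovered = set(b_set)
    chosen = []
    while uncovered:
        best = max(usable, key=lambda p: len(usable[p] & uncovered), default=None)
        if best is None or not (usable[best] & uncovered):
            break  # no remaining probe covers anything new
        chosen.append(best)
        uncovered -= usable[best]
    return chosen, uncovered


hits = {
    "n1": {"b1", "b2", "b3"},
    "n2": {"b3", "b4"},
    "n3": {"b4", "t1"},   # hybridizes with a target, so it is discarded
    "n4": {"b2"},
}
print(select_negative_probes(hits, {"t1"}, {"b1", "b2", "b3", "b4"}))
# (['n1', 'n2'], set())
```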

20 Algorithm: Probe Selection

21 Mathematical Aspects

22 Mathematical Aspects

23 Mathematical Aspects

24 Experimentation
- Parallelized on SMP platform
  - Classic worker-producer
  - Intel Dual Pentium III 933 MHz, 1 GB memory
- Test data
  - ssu rRNA of ARB project
  - 20,282 ssu rRNA sequences
  - 1,401 < length < 4,179
  - 97% of them are shorter than 2,000 bases
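
The worker-producer layout, with several search threads and a single thread writing the result matrix, can be sketched with Python's standard `queue` and `threading` modules. This only illustrates the structure: the substring test stands in for the real hybridization search, all names are assumptions, and Python threads would not give the CPU speed-up of the original SMP implementation.

```python
import queue
import threading


def parallel_scan(sequences, probe_sites, n_workers=2):
    """Several workers search sequences; one writer thread fills the matrix."""
    tasks = queue.Queue()
    results = queue.Queue()
    matrix = {}          # written by the writer thread only
    done = object()      # sentinel marking the end of a stream

    def worker():
        while True:
            item = tasks.get()
            if item is done:
                results.put(done)
                return
            seq_id, seq = item
            for probe_id, site in probe_sites.items():  # read-only shared table
                if site in seq:
                    results.put((probe_id, seq_id))

    def writer():
        finished = 0
        while finished < n_workers:
            item = results.get()
            if item is done:
                finished += 1
            else:
                probe_id, seq_id = item
                matrix.setdefault(probe_id, set()).add(seq_id)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    threads.append(threading.Thread(target=writer))
    for t in threads:
        t.start()
    for item in sequences.items():
        tasks.put(item)
    for _ in range(n_workers):
        tasks.put(done)
    for t in threads:
        t.join()
    return matrix


print(parallel_scan({"s1": "TACTACA", "s2": "GGTACAG"}, {"p1": "TACA", "p2": "CTA"}))
```

Keeping the probe tables read-only and funneling all writes through one thread avoids locking on the matrix, which matches the "tree and hashtables are fixed" remark in the parallelization notes.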

25 Discussion
- High performance
  - Execution time is linear in the size of the database and decreases if longer probes are used
- Low memory consumption
  - Depends on the size of the sequence selection, not on the database size
- Automatic design of group probes and negative probes
- High-quality probe design

26 Discussion
- Comparison with previous work
  - vs. ARB
    - Not suited for large-scale probe design
  - vs. LCF
    - Does not consider highly conserved data
    - Memory consumption is high
    - Works well with short probes only
  - vs. others
    - Mostly cannot deal with insertions and deletions
    - Execution is slow

