Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays
Nazif Cihan Tas, CMSC 838 Presentation


CMSC 838T – Presentation
Motivation
- DNA microarray techniques are used intensively to identify biological agents
  - Gene expression studies
  - Diagnostic purposes
    - Identification of microorganisms in samples
  - Item extraction
- Complex problem
  - Find the necessary probes and the temperature
  - Probe sets should reliably detect and differentiate target sequences
  - Large databases
  - NEW!! Homologous genes (how to find specific probes)

Talk Overview
- Overview of talk
  - Motivation
  - Problem Statement
  - Algorithm
  - Mathematical Aspects
  - Experimentation
  - Discussion

Problem Statement
- Specificity vs. sensitivity
  - Specificity: the number of non-target matches is minimized
  - Sensitivity: the number of selected target sequences is maximized
- Original problem:

Problem Statement
- Positive probes
  - Database set S0
  - Target set S1
  - For each sequence in S1, find at least one probe
  - For S0 - S1, try to avoid hybridization (but it is tolerated if it happens)
  - High specificity: the number of non-target matches is minimized
  - High sensitivity: the number of covered target sequences is maximized

Problem Statement
- Negative probes
  - Determine as few probes as possible that together hybridize with all sequences in S0 - S1 but with NONE in S1
  - High specificity: no sequence in S1 may hybridize
  - High sensitivity: the maximum number of sequences in S0 - S1 is covered

Problem cont.
- Extended problem
- Specificity vs. sensitivity
  - Specificity: no sequence in S1 may cross-hybridize with any negative probe
  - Sensitivity: the number of sequences covered in B must be maximized

Probe Design Constraints
- Sequence related
  - Length of probes
  - Deviation of the melting temperatures of probe-target hybrids must be low (for physical reasons)
  - No self-complementary regions longer than four nucleotides (not descriptive enough)
  - The difference between the melting temperatures of target and non-target sequences must be larger than a predefined threshold (too close, too hard to identify)
    - Ensuring a minimum number of mismatches is enough (homologous sequences)
- System related
  - Execution time
  - Usability

Algorithm
- Overview
  - Probe Generation
  - Hybridization Prediction
  - Probe Selection

Algorithm: Probe Generation
- Subproblem:
  - Generate probe candidates for the sequences
  - Keep the set as small as possible without losing any optimal candidate (exclude infeasible ones)
- Suffix tree
  - Why?
    - Allows fast recognition of repetitive subsequences
    - Identifies non-unique probes (i.e. those with more than one target)
    - Efficient in memory and for T computation (reduces time)
  - How?
    - The tree is constructed from the sequences
    - It is then traversed (Watson-Crick complement)

Suffix Tree
- Input: TACTACA
  - TACTACA
  - ACTACA
  - CTACA
  - TACA
  - ACA
  - CA
  - A
- $ denotes the end of the string
- Constructed in linear time
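The suffix list on this slide can be reproduced with a few lines of Python. This is only a rough sketch: it enumerates suffixes naively in quadratic time and is not a suffix tree, so it does not show the linear-time construction mentioned above. The function names suffixes and repeated_substrings are made up for this illustration.

```python
def suffixes(text, terminator="$"):
    """Return every non-empty suffix of text, each ending with the terminator."""
    s = text + terminator
    # One suffix per start position, longest first; the lone terminator is skipped.
    return [s[i:] for i in range(len(text))]


def repeated_substrings(text, length):
    """Return the substrings of the given length that occur more than once."""
    counts = {}
    for i in range(len(text) - length + 1):
        piece = text[i:i + length]
        counts[piece] = counts.get(piece, 0) + 1
    return sorted(p for p, c in counts.items() if c > 1)


slide_suffixes = suffixes("TACTACA")
repeats = repeated_substrings("TACTACA", 3)
```

In a real suffix tree the repeated substring TAC would show up as a shared path from the root, which is what makes repetitive (non-unique) probe candidates quick to recognize.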

Probe Generation
- Further improvements
  - Filters applied for cut-off
    - Probe length (predefined)
    - G-C content (for temperature)
    - Self-complementarity
      - Probes should not contain their own complements as subsequences
  - Finally, remove highly conserved (non-specific) regions
  - Insert the probes into hash tables according to their lengths
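The three filters might look roughly like the sketch below. The probe length of 20 and the G-C window of 0.4 to 0.6 are assumptions chosen for illustration, not values from the presentation; the five-base window follows the "no self-complementary regions longer than four nucleotides" constraint stated earlier. All function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Watson-Crick complement of seq, read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_self_complement(seq, min_len=5):
    """True if some window of min_len bases also occurs in the reverse complement."""
    rc = reverse_complement(seq)
    return any(seq[i:i + min_len] in rc for i in range(len(seq) - min_len + 1))


def passes_filters(seq, length=20, gc_min=0.4, gc_max=0.6):
    """Apply the length, G-C content and self-complementarity cut-offs (assumed values)."""
    return (len(seq) == length
            and gc_min <= gc_content(seq) <= gc_max
            and not has_self_complement(seq))


ok = passes_filters("AACCAACCAACCAACCAACC")        # balanced G-C, no self-complement
rejected = passes_filters("ACGTACGTACGTACGTACGT")  # ACGT is its own reverse complement
```

A window of seq that also occurs in the reverse complement means the probe contains both a region and that region's complement, so it could fold back on itself.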

Algorithm

Algorithm: Hybridization Prediction
- Subproblem:
  - Search for the right probe
  - Search is expensive, so intelligent hashing is used
- Design
  - A frame of several lengths is moved over the target and non-target sequences
    - Previous algorithm (Kaderali 2002): uses the suffix tree
  - At each step, hash values are calculated. On a hit, the melting temperature is predicted and stored in the hybridization matrix.
  - If there are too many hits for a probe, it is not unique and is removed
  - Why intelligent?
    - Hash time is linear
    - Allows inexact matching because of hashing (no analysis)
- Parallelization
  - Several threads search for probe targets
    - The tree and hash tables are fixed
  - One thread writes to the final matrix
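The frame-and-hash search can be sketched as follows. This is a simplified illustration under several assumptions: matching is exact only (the inexact matching and melting-temperature prediction described on the slide are left out), probes are written as the stretch of sequence they bind to rather than as its complement, and the "hybridization matrix" is a plain dictionary. The names build_probe_tables, scan and max_hits are hypothetical.

```python
from collections import defaultdict


def build_probe_tables(probes):
    """Group probes into one hash set per probe length."""
    tables = defaultdict(set)
    for probe in probes:
        tables[len(probe)].add(probe)
    return tables


def scan(sequences, tables, max_hits):
    """Slide a frame of each probe length over every sequence and record hits."""
    hits = defaultdict(set)  # probe -> names of the sequences it matches
    for name, seq in sequences.items():
        for length, table in tables.items():
            for start in range(len(seq) - length + 1):
                frame = seq[start:start + length]
                if frame in table:  # hash lookup
                    hits[frame].add(name)
    # A probe with too many hits is not unique, so it is dropped.
    return {p: names for p, names in hits.items() if len(names) <= max_hits}


sequences = {"s1": "ACGTTGCA", "s2": "TTGCAAGG", "s3": "GGGTTGCC"}
tables = build_probe_tables(["ACGT", "TTGC", "CAAGG"])
matrix = scan(sequences, tables, max_hits=2)
```

Here TTGC occurs in all three sequences and is discarded, while ACGT and CAAGG each stay linked to a single sequence. Because the tables are only read during the scan, several worker threads could share them, which matches the parallelization note on the slide.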

Hybridization Prediction
- Empirical simulation:
  - One million random probe-target pairings were generated
  - Four mismatches, or one insertion or deletion plus one strong central mismatch, were chosen
  - T < 20 °C for 93% of the pairings
  - Complexity is O(|S0| |S1|)
    - The number of possible probe candidates is linear in |S1|
    - Each position of the database S0 must be checked

Algorithm

Algorithm: Hybridization Prediction
- Complexity
- Inexact equality
  - Only the inner three bands of the DP matrix are computed
  - O(l), where l is the length
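Restricting the dynamic-programming matrix to its three inner diagonals can be illustrated with a banded edit distance. This is a sketch under the assumption that a unit-cost edit distance stands in for the scoring actually used; the function name banded_edit_distance and the band parameter are hypothetical. With band=1 only cells with |i - j| <= 1 are filled, so the work grows linearly with the sequence length.

```python
def banded_edit_distance(a, b, band=1):
    """Edit distance using only cells with |i - j| <= band; None if out of band."""
    inf = float("inf")
    if abs(len(a) - len(b)) > band:
        return None  # the final cell lies outside the band
    # Row 0: turning the empty prefix of a into b[:j] costs j insertions.
    prev = {j: j for j in range(min(band, len(b)) + 1)}
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - band), min(len(b), i + band) + 1):
            if j == 0:
                cur[j] = i  # i deletions
                continue
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev.get(j - 1, inf) + cost,  # match or mismatch
                prev.get(j, inf) + 1,         # deletion from a
                cur.get(j - 1, inf) + 1,      # insertion into a
            )
        prev = cur
    return prev[len(b)]


one_mismatch = banded_edit_distance("ACGTACGT", "ACGAACGT")
one_deletion = banded_edit_distance("ACGTACGT", "ACGACGT")
```

The result is an upper bound on the unrestricted edit distance and equals it whenever the best alignment never leaves the band, which is the case for a single insertion or deletion.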

Algorithm: Probe Selection
- Subproblem:
  - Use the hybridization matrix to finalize the probe selection
    - We have positive probes and negative probes to proceed with
- Algorithm analysis:
  - For each probe candidate
    - g: number of matches in S1
    - b: number of matches in S0 - S1
    - t: highest melting point in S1
  - Probes for which the g or b value is too large are removed
  - Sort according to g, b and t
  - Apply depth-first search
- Advantages
  - Performs well (no comparison given, though)
  - Guarantees to choose all specific probes if any were found
- Disadvantages
  - Can NOT guarantee optimal selection in terms of coverage
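The bookkeeping of g, b and t could be sketched as below. Several things here are assumptions: the hybridization matrix is modelled as a dictionary from probe to predicted melting temperatures per sequence, the temperatures are invented example values, the thresholds g_max and b_max are hypothetical, and the sort direction (fewest non-target hits first) is one guess because the slide does not state it. The depth-first search step is not shown.

```python
def summarize(hits, targets):
    """Compute (probe, g, b, t) from hits: probe -> {sequence name: melting temperature}."""
    rows = []
    for probe, by_seq in hits.items():
        in_target = [tm for name, tm in by_seq.items() if name in targets]
        g = len(in_target)                       # matches in S1
        b = len(by_seq) - g                      # matches in S0 - S1
        t = max(in_target) if in_target else None  # highest melting point in S1
        rows.append((probe, g, b, t))
    return rows


def preselect(rows, g_max, b_max):
    """Drop probes whose g or b is too large, then sort the rest."""
    kept = [r for r in rows if r[3] is not None and r[1] <= g_max and r[2] <= b_max]
    # Assumed order: fewest non-target hits, then most target hits, then highest t.
    return sorted(kept, key=lambda r: (r[2], -r[1], -r[3]))


hits = {
    "p1": {"a": 62.0, "x": 40.0},
    "p2": {"a": 60.0, "b": 61.5},
    "p3": {"x": 55.0, "y": 50.0, "z": 52.0},
}
rows = summarize(hits, targets={"a", "b"})
ranked = [r[0] for r in preselect(rows, g_max=5, b_max=1)]
```

In this toy input p3 never hits a target and has three non-target matches, so it is removed, and p2 ranks ahead of p1 because it has no non-target hits.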

Negative Probe Selection
- Let S2 = S0 - S1 and let B be a subset of S2. The probes that detect S1 also detect some elements of B.
- Algorithm for negative probes
  - Apply probe generation and preselection to B
  - Conduct hybridization on B U S1
  - Remove the probes which hybridize with S1
  - Sort the remaining probes according to their hit number
  - Successively select the probes which cover the most target sequences
- Guarantees an optimal solution for coverage and probe number usage
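One way to read the selection steps is as a greedy cover, sketched below. This is an interpretation rather than the presented implementation: it recounts the still-uncovered sequences in every round, takes the hybridization results as a ready-made dictionary, and breaks ties alphabetically. The names select_negative_probes, hits, s1 and b_set are hypothetical.

```python
def select_negative_probes(hits, s1, b_set):
    """Pick probes covering b_set; hits maps probe -> set of sequences it hybridizes with."""
    # Remove every probe that hybridizes with a sequence in S1.
    candidates = {p: names & b_set for p, names in hits.items() if not (names & s1)}
    uncovered = set(b_set)
    chosen = []
    while uncovered and candidates:
        # Probe covering the most still-uncovered sequences (ties: alphabetical order).
        best = max(sorted(candidates), key=lambda p: len(candidates[p] & uncovered))
        if not candidates[best] & uncovered:
            break  # nothing left can be covered
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen, uncovered


hits = {
    "n1": {"b1", "b2", "b3"},
    "n2": {"b3", "b4"},
    "n3": {"b4", "t1"},  # also hits S1, so it is discarded
    "n4": {"b2"},
}
chosen, uncovered = select_negative_probes(hits, s1={"t1"}, b_set={"b1", "b2", "b3", "b4"})
```

For this input two probes, n1 and n2, cover all four sequences in B, and n3 is never considered because it hybridizes with a sequence in S1.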

Algorithm: Probe Selection

Mathematical Aspects

Experimentation
- Parallelized on an SMP platform
  - Classic worker-producer
  - Intel Dual Pentium III 933 MHz, 1 GB memory
- Test data
  - ssu rRNA of the ARB project
  - ssu rRNA sequences
  - < lengths <
  - 97% of them are shorter than bases

Discussion
- High performance
  - Execution time is linear in the size of the database and decreases if longer probes are used
- Low memory consumption
  - Depends on the size of the sequence selection, NOT on the database size
- Automatic design of group probes and negative probes
- High-quality probe design

Discussion
- Comparison with previous work
  - vs. ARB
    - Not suited for large-scale probe design
  - vs. LCF
    - Does not consider highly conserved data
    - Memory consumption is high
    - Works well with short probes only
  - vs. others
    - Mostly cannot deal with insertions and deletions
    - Execution is slow