An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design Won-Hyong Chung and Seong-Bae Park Dept. of Computer Engineering.


An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design Won-Hyong Chung and Seong-Bae Park Dept. of Computer Engineering Kyungpook National University, South Korea

Motivation
Issues in designing oligonucleotides
– To minimize cross-hybridization
– To minimize computing time
Seeding (or indexing) has been widely used to address these issues by pre-screening unreliable sequence regions before cross-hybridization is calculated.
Although many seeding methods have been proposed, no measure has yet been proposed for evaluating how adequate and efficient a seed is for oligonucleotide design.

Difference between alignment and oligonucleotide design
Alignment
– To find all possible alignments with sufficiently high scores.
– Sensitivity is important, while specificity is usually guaranteed by the seed's own specificity.
Oligonucleotide design
– To find optimal oligonucleotides that differentiate target sequences from the others.
– Specificity must be considered as well as sensitivity when checking cross-hybridization.

Objectives
We propose novel measures for evaluating seeds based on discriminability and efficiency.
We examine five seeding methods in oligonucleotide design.
– continuous, spaced, transition-constrained, BLAT, and vector seeds
We provide a software package, SeedChooser, which enables users to obtain adequate seeds for their own experimental conditions.

What is a seed?
Seeding process
– Filtering step: short fixed-length words common to both the query and target sequences are selected.
– Extension step: the selected words are extended to the size of the oligonucleotide and checked for cross-hybridization.
Seed = the filtering template of the fixed-length words
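The two steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a continuous seed of length k, treats the query as a single oligo, and uses ungapped sequence identity as a stand-in for a real cross-hybridization check. The function names `build_index`, `filter_hits`, and `extend_hit` are hypothetical.

```python
from collections import defaultdict


def build_index(target, k):
    """Map every k-length word of the target to its start positions."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    return index


def filter_hits(oligo, target, k):
    """Filtering step: (oligo_pos, target_pos) pairs sharing a k-length word."""
    index = build_index(target, k)
    hits = []
    for q in range(len(oligo) - k + 1):
        for t in index.get(oligo[q:q + k], []):
            hits.append((q, t))
    return hits


def extend_hit(oligo, target, q, t):
    """Extension step: align the whole oligo around the hit without gaps.

    Returns (target_start, identity), or None if the oligo-length window
    would fall outside the target. Identity is only an assumed proxy for
    cross-hybridization potential.
    """
    start = t - q
    if start < 0 or start + len(oligo) > len(target):
        return None
    window = target[start:start + len(oligo)]
    matches = sum(a == b for a, b in zip(oligo, window))
    return start, matches / len(oligo)
```

Only the windows that survive the filtering step reach the more expensive extension step, which is where the time saving comes from.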

Seeding methods (1/2)
Continuous seed: a seed that finds exact matches of length k
– BLAST employs an 11-bp seed.
Spaced seed: allows don't-care positions, labeled '0', in the seed
– PatternHunter uses an 18-bp seed containing 11 match positions.
Transition-constrained seed: allows transition (A↔G, C↔T) positions in the seed
– YASS uses such a seed, consisting of 18 bp with 10 match positions and 2 transitions.
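A small sketch of how these three templates decide whether two equal-length words form a hit. The template alphabet is an assumption for illustration: '1' is a required match, '0' is a don't-care position (as on the slide), and 'T' is a hypothetical symbol for "match or transition". Actual tools use their own notations.

```python
# Purine-purine and pyrimidine-pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

# Spaced seed commonly cited for PatternHunter: span 18, 11 match positions.
PATTERNHUNTER_SEED = "111010010100110111"


def seed_matches(template, a, b):
    """Return True if words a and b form a hit under the seed template."""
    if not (len(template) == len(a) == len(b)):
        raise ValueError("template and words must have the same length")
    for symbol, x, y in zip(template, a, b):
        if symbol == "1" and x != y:
            return False  # required match position differs
        if symbol == "T" and x != y and (x, y) not in TRANSITIONS:
            return False  # only a match or a transition is tolerated
        # symbol == "0": don't care, any pair of letters is accepted
    return True
```

A continuous seed is the special case where the template consists only of '1' symbols.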

Seeding methods (2/2)
BLAT seed: a continuous seed allowing one or two mismatches at any position of the seed.
Vector seed: a generalized seed combining the ideas of the BLAT seed and the spaced seed.
BLAT and vector seeds allow some mismatches at any position.
– They greatly increase sensitivity but require much more computing time than the previous seeds.
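The mismatch-tolerant idea can be sketched with a single function. This is a simplification under stated assumptions: a BLAT-style seed is modelled as an all-'1' template with a mismatch budget, and the vector seed is reduced to a spaced template ('1' = scored, '0' = don't care) with the same kind of budget. The published vector seed model, with position weights and a score threshold, is more general than this. The name `mismatch_seed_hit` is hypothetical.

```python
def mismatch_seed_hit(template, a, b, max_mismatches):
    """Hit if at most max_mismatches of the '1' positions differ."""
    if not (len(template) == len(a) == len(b)):
        raise ValueError("template and words must have the same length")
    mismatches = sum(
        1 for symbol, x, y in zip(template, a, b)
        if symbol == "1" and x != y
    )
    return mismatches <= max_mismatches


# BLAT-style: continuous 6-bp seed tolerating one mismatch anywhere.
blat_hit = mismatch_seed_hit("111111", "ACGTAC", "ACGTTC", 1)

# Simplified vector-style: spaced template plus a one-mismatch budget.
vector_hit = mismatch_seed_hit("110111", "ACGTAC", "ACTTTC", 1)
```

Because a tolerant seed accepts many near-identical words, an index built for it has to cover a neighbourhood of words per position, which is one way to see the extra computing time noted above.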

The issues of seeds for oligo design
An ideal seed should filter out, as fast as possible, all regions that have no possibility of being chosen as an oligo.
– A seed should find as many oligos as possible.
– A seed should avoid finding non-oligo regions.
– A seed should minimize the cost of indexing to find oligos.

Discriminability
The discriminability is a balance between precision and recall to minimize both false positives and false negatives.
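The slide's formula did not survive in this transcript, so the following is only a sketch of one common way to balance precision and recall: a weighted F-measure with a parameter alpha. The exact form used by the authors may differ. The function names are hypothetical, and "flagged" and "relevant" stand for the regions a seed reports and the regions that truly matter.

```python
def precision_recall(flagged, relevant):
    """Precision and recall of the flagged regions against the relevant ones."""
    flagged, relevant = set(flagged), set(relevant)
    true_positives = len(flagged & relevant)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def weighted_f(precision, recall, alpha=1.0):
    """Assumed balance: F = (1 + alpha^2) * P * R / (alpha^2 * P + R).

    With alpha = 1 this is the harmonic mean of precision and recall.
    """
    denominator = alpha ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + alpha ** 2) * precision * recall / denominator
```

Under this assumed form, a larger alpha weights recall (fewer false negatives) more heavily, and a smaller alpha weights precision (fewer false positives).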

Efficiency
The efficiency is the proportion of useful regions filtered by a seed.
– the duplication ratio of generated indices
– the average number of indices in each oligo
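The two quantities listed above can be computed directly once a seed template is fixed. This sketch is one plausible reading, not the authors' definition: the duplication ratio is taken as the share of generated index keys that repeat an earlier key, and the way the two quantities are combined through β and γ is not shown because the formula is missing from the transcript. Function names are hypothetical; the template uses '1' for indexed positions and '0' for don't-care positions.

```python
def spaced_keys(sequence, template):
    """Index keys of a sequence: letters at the '1' positions of each window."""
    span = len(template)
    care = [i for i, symbol in enumerate(template) if symbol == "1"]
    return [
        "".join(sequence[start + i] for i in care)
        for start in range(len(sequence) - span + 1)
    ]


def index_statistics(oligos, template):
    """Return (duplication_ratio, average_indices_per_oligo)."""
    all_keys = []
    for oligo in oligos:
        all_keys.extend(spaced_keys(oligo, template))
    total = len(all_keys)
    duplication_ratio = 1 - len(set(all_keys)) / total if total else 0.0
    average_per_oligo = total / len(oligos) if oligos else 0.0
    return duplication_ratio, average_per_oligo
```

For example, the oligos "ACGTAC" and "ACGTTT" under the template "101" generate eight keys, six of them distinct, so the duplication ratio is 0.25 and each oligo contributes four keys on average.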

Efficient discriminability The efficient discriminative seed is the seed that has the maximum efficient discriminability value for the given

Experiments
Empirically chosen seeds were evaluated by three measures: discriminability, efficiency, and efficient discriminability.
We tested the seeds for designing 50-mer oligos.
– The parameters are set to 1 for evaluation.
Simulated data set
– A set of random sequences generated by OligoGenerator in SeedChooser.
Biological data set
– Ecologically important genes involved in the nitrogen and carbon cycles.
– nirS: nitrite reductase gene set
– pmoA: methane monooxygenase gene set

Discriminability of the five seeding methods

Efficiency of the five seeding methods

Efficient Discriminability of the five seeding methods

Evaluation results of pmoA data set

Evaluation results of nirS data set

SeedChooser: Seed Evaluation and Recommendation Tools
SeedChooser: recommends the best seeds according to the evaluation parameters. It uses a genetic algorithm to find the best seeds.
SeedEvaluator: evaluates a set of seeds according to the parameters.
OligoGenerator: generates a set of oligos for the desired experimental conditions.
SeedChooser homepage
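How a search over seed templates might look is sketched below. This is not SeedChooser's algorithm, whose details are not given in the slides: it is a simplified, mutation-only evolutionary search (no crossover) over spaced templates of fixed span and weight, with the fitness function supplied by the caller. All names (`random_seed`, `mutate`, `evolve`) and default parameters are hypothetical.

```python
import random


def random_seed(span, weight, rng):
    """Random template with `weight` ones; both end positions are '1'."""
    inner = ["1"] * (weight - 2) + ["0"] * (span - weight)
    rng.shuffle(inner)
    return "1" + "".join(inner) + "1"


def mutate(seed, rng):
    """Swap one inner '1' with one inner '0', preserving span and weight."""
    inner = list(seed[1:-1])
    ones = [i for i, c in enumerate(inner) if c == "1"]
    zeros = [i for i, c in enumerate(inner) if c == "0"]
    if ones and zeros:
        i, j = rng.choice(ones), rng.choice(zeros)
        inner[i], inner[j] = "0", "1"
    return "1" + "".join(inner) + "1"


def evolve(fitness, span, weight, generations=30, pop_size=20, rng_seed=0):
    """Keep the fitter half each generation and refill with mutated copies."""
    rng = random.Random(rng_seed)
    population = [random_seed(span, weight, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness)
```

In practice the fitness would be an evaluation measure computed on the user's own sequences, for instance a seed's discriminability or efficiency on a sample of oligos.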

CONCLUSION
We proposed novel measures for evaluating seeds in oligo design based on discriminability and efficiency.
The spaced seed was generally preferred to the other seeding methods.
Our study can be applied to oligo design programs to improve their performance by suggesting experiment-specific seeds.
We expect that our study will also be helpful for other genomic tasks.

Supplementary materials

T1, T2, T3: the target sequences. P1 and P2 are the matched oligos for an oligo P0. S1, S2 and S3 are the seed indices for S0 by a seed.
[Figure: sequences T0 to T3 with oligos P0 to P2 and seed indices S0 to S3]

Relations of precision, recall and discriminability

Discriminability according to values of α

Efficiency according to values of β and γ

Efficient Discriminability for 70mer Oligos