1 An Efficient Index Structure for String Databases Tamer Kahveci Ambuj K. Singh Department of Computer Science University of California Santa Barbara.

Presentation transcript:

1 An Efficient Index Structure for String Databases. Tamer Kahveci, Ambuj K. Singh. Department of Computer Science, University of California, Santa Barbara.

2 Whole/Substring Matching Problem. Find substrings in a database that are similar to a given query string, quickly, using a small index structure (1-2% of the database size). [Figure: a query string matched against a database string]

3 String Similarity Motivation. Applications: genetic sequence databases (NCBI); text databases, spell checkers, web search; video databases (e.g. VIRAGE, MEDIA360). Database size is too large. Most of the techniques available are in-memory. The space requirement of current indexes is too large. [Plot: base pairs (millions) by year]

4 Outline: Motivation & background; Our contribution; Frequency vector, frequency distance & wavelet transform; Multi-resolution index structure; k-NN & range queries; Experimental results; Conclusion.

5 Notation. q: query string. m, n: lengths of strings. r: range query radius. $\epsilon = r/|q|$: error rate.

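6 String Similarity: an example. Alignment of ACTTAGC (top row) with AATGATAG (bottom row). Top row: A C T - - T A G C. Bottom row: A A T G A T A G -. Edit operations: R (replace) in column 2, I (insert) in columns 4 and 5, D (delete) in column 9.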

7 Background. Edit operations: insert, delete, replace. The edit distance (ED) between $s_1$ and $s_2$ is the minimum number of edit operations needed to transform $s_1$ into $s_2$. Computing the edit distance is costly: dynamic programming takes O(mn) time and space, where m and n are the lengths of $s_1$ and $s_2$ [NW70, SW81].
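The O(mn) dynamic program mentioned on this slide can be sketched as follows. This is a textbook formulation with unit costs for all three operations, not code from the presentation; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insert/delete/replace operations turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dist[i][j] holds the edit distance between s1[:i] and s2[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete the first i characters of s1
    for j in range(n + 1):
        dist[0][j] = j  # insert the first j characters of s2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete from s1
                dist[i][j - 1] + 1,         # insert into s1
                dist[i - 1][j - 1] + cost,  # match or replace
            )
    return dist[m][n]
```

For the two strings used in the slides, `edit_distance("ACTTAGC", "AATGATAG")` evaluates to 4, matching the alignment on slide 6. The table has (m+1)(n+1) cells, which is where the O(mn) time and space come from.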

8 Related Work. Lossless search, online: [Mye86] (Myers) reduces the space requirement to O(rn), where r is the query radius; [WM92] (Wu, Manber) binary masks, O(rn); [BYN99] (Baeza-Yates, Navarro) NFA. Lossless search, offline (index based): [Mye94] (Myers) condensed r-neighborhood; [BYN97] (Baeza-Yates, Navarro) dictionary. Lossy search: [AG90] (Altschul, Gish) BLAST; FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPUTER, MUMmer; [GWWV00] (Giladi, Walker, Wang, Volkmuth) SST-Tree.

9 Outline: Motivation & background; Our contribution; Frequency vector, frequency distance & wavelet transform; Multi-resolution index structure; k-NN & range queries; Experimental results; Conclusion.

10 Frequency Vector. Let s be a string over the alphabet $\Sigma = \{\alpha_1, \ldots, \alpha_\sigma\}$. Let $n_i$ be the number of occurrences of the character $\alpha_i$ in s, for $1 \le i \le \sigma$. Then the frequency vector of s is $f(s) = [n_1, \ldots, n_\sigma]$. Example: s = AATGATAG, $f(s) = [n_A, n_C, n_G, n_T] = [4, 0, 2, 2]$.

11 Effect of Edit Operations on Frequency Vector. Delete: decreases an entry by 1. Insert: increases an entry by 1. Replace: insert + delete. Example: s = AATGATAG => f(s) = [4, 0, 2, 2]; (del. G) s = AAT.ATAG => f(s) = [4, 0, 1, 2]; (ins. C) s = AACTATAG => f(s) = [4, 1, 1, 2]; (A → C) s = ACCTATAG => f(s) = [3, 2, 1, 2].

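12 An Approximation to ED: Frequency Distance ($FD_1$). s = AATGATAG => f(s) = [4, 0, 2, 2]; q = ACTTAGC => f(q) = [2, 2, 1, 2]. pos = (4-2) + (2-1) = 3 is the total amount by which entries of f(s) exceed those of f(q); neg = (2-0) = 2 is the total amount by which they fall short. $FD_1(f(s_1), f(s_2)) = \max\{pos, neg\}$, so $FD_1(f(s), f(q)) = 3$, while ED(q, s) = 4. In general, $FD_1(f(s_1), f(s_2)) \le ED(s_1, s_2)$. [Figure: f(q), f(s), and the distance $FD_1(f(q), f(s))$ between them]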
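The frequency vector and the pos/neg computation on this slide can be written out as a short sketch. The function names `frequency_vector` and `fd1` and the fixed alphabet order A, C, G, T are assumptions made here for illustration.

```python
ALPHABET = "ACGT"  # assumed character order, matching [n_A, n_C, n_G, n_T]


def frequency_vector(s: str, alphabet: str = ALPHABET) -> list[int]:
    """Count the occurrences of each alphabet character in s."""
    return [s.count(ch) for ch in alphabet]


def fd1(u: list[int], v: list[int]) -> int:
    """Frequency distance FD_1 = max(pos, neg) between two frequency vectors."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)  # surplus of u over v
    neg = sum(b - a for a, b in zip(u, v) if b > a)  # deficit of u against v
    return max(pos, neg)
```

With the slide's strings, `fd1(frequency_vector("AATGATAG"), frequency_vector("ACTTAGC"))` gives 3. Each edit operation changes pos by at most 1 and neg by at most 1, which is why this value cannot exceed the edit distance.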

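13 An Illustration of Frequency Distance & Edit Distance. [Figure: two sets of strings (set 1 and set 2), their frequency vectors $v_1$ and $v_2$, the frequency distance between the vectors, and the edit distance between the strings]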

14 Using Local Information: Wavelet Decomposition of Strings. s = AATGATAC => f(s) = [4, 1, 1, 2]. Split s = AATG + ATAC = $s_1 + s_2$, with $f(s_1)$ = [2, 0, 1, 1] and $f(s_2)$ = [2, 1, 0, 1]. Then $\psi_1(s) = f(s_1) + f(s_2)$ = [4, 1, 1, 2] and $\psi_2(s) = f(s_1) - f(s_2)$ = [0, -1, 1, 0].
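The two-halves example on this slide can be reproduced with a few lines. This is a minimal sketch assuming an even-length string and the alphabet order A, C, G, T; the name `first_two_coefficients` is hypothetical.

```python
def first_two_coefficients(s: str, alphabet: str = "ACGT"):
    """Return (psi_1, psi_2): sum and difference of the two halves' frequency vectors."""
    half = len(s) // 2
    f1 = [s[:half].count(ch) for ch in alphabet]  # frequency vector of the left half
    f2 = [s[half:].count(ch) for ch in alphabet]  # frequency vector of the right half
    psi1 = [a + b for a, b in zip(f1, f2)]
    psi2 = [a - b for a, b in zip(f1, f2)]
    return psi1, psi2
```

The first coefficient equals the plain frequency vector of the whole string, while the second records how the characters are distributed between the two halves, which is the local information the slide title refers to.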

15 Wavelet Decomposition of a String: General Idea. $A_{i,j} = f(s(j 2^i : (j+1) 2^i - 1))$ and $B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1}$. $\psi(s)$ consists of the first wavelet coefficient and the second wavelet coefficient.
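The recurrences for $A_{i,j}$ and $B_{i,j}$ can be evaluated level by level. The sketch below assumes the string length is a power of two and uses the fact that frequency vectors add under concatenation, so $A_{i,j} = A_{i-1,2j} + A_{i-1,2j+1}$; the function name `wavelet_levels` and the nested-list layout are choices made here, not part of the presentation.

```python
def wavelet_levels(s: str, alphabet: str = "ACGT"):
    """Return (A, B) with A[i][j] and B[i][j] as defined on the slide.

    Assumes len(s) is a power of two. B[0] is empty because B is only
    defined from level 1 upward.
    """
    n = len(s)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    # Level 0: A[0][j] is the frequency vector of the single character s[j].
    A = [[[1 if ch == c else 0 for c in alphabet] for ch in s]]
    B = [[]]
    while len(A[-1]) > 1:
        prev = A[-1]
        pairs = [(prev[2 * j], prev[2 * j + 1]) for j in range(len(prev) // 2)]
        A.append([[x + y for x, y in zip(left, right)] for left, right in pairs])
        B.append([[x - y for x, y in zip(left, right)] for left, right in pairs])
    return A, B
```

For s = AATGATAC, the top level gives `A[3][0] == [4, 1, 1, 2]` and `B[3][0] == [0, -1, 1, 0]`, which agrees with the two coefficients computed on the previous slide.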

16 Wavelet Decomposition & ED. Define $FD(s_1, s_2) = \max\{FD_1, FD_2\}$.

17 Outline: Motivation & background; Our contribution; Frequency vector, frequency distance & wavelet transform; Multi-resolution index structure; k-NN and range queries; Experimental results; Conclusion.

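18 MRS-Index Structure Creation. [Figure: a window of length $w = 2^a$ over string $s_1$ is transformed]

19 MRS-Index Structure Creation. [Figure: string $s_1$]

20 MRS-Index Structure Creation. [Figure: string $s_1$]

21 MRS-Index Structure Creation. [Figure: the window slides c times over $s_1$; c = box capacity]

22 MRS-Index Structure Creation. [Figure: string $s_1$]

23 MRS-Index Structure Creation. [Figure: row $T_{a,1}$ of boxes for string $s_1$ with window size $w = 2^a$]

24 Using Different Resolutions. [Figure: row $T_{a,1}$ for $s_1$ with window size $w = 2^a$, and row $T_{a+1,1}$ with window size $w = 2^{a+1}$]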

25 MRS-Index Structure
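The construction shown in the preceding figures (slide a window of length w over each string, transform each window, and group every c consecutive window positions into one minimum bounding rectangle) might be sketched as below. This is a simplified illustration under stated assumptions: it stores only frequency vectors rather than both wavelet coefficients, and the names `build_row` and `build_index` and the `(start, lo, hi)` box layout are hypothetical.

```python
def frequency_vector(s: str, alphabet: str = "ACGT") -> list[int]:
    return [s.count(ch) for ch in alphabet]


def build_row(s: str, w: int, c: int, alphabet: str = "ACGT"):
    """One index row for string s at window size w.

    Every c consecutive window positions share one MBR, returned as
    (first window start, lower corner, upper corner).
    """
    # One point per window position; recomputed from scratch for clarity.
    points = [frequency_vector(s[k:k + w], alphabet) for k in range(len(s) - w + 1)]
    boxes = []
    for start in range(0, len(points), c):
        group = points[start:start + c]
        lo = [min(col) for col in zip(*group)]  # per-dimension minimum
        hi = [max(col) for col in zip(*group)]  # per-dimension maximum
        boxes.append((start, lo, hi))
    return boxes


def build_index(strings: list[str], window_sizes: list[int], c: int):
    """Map (window size, string number) to that string's row of boxes."""
    return {
        (w, idx): build_row(s, w, c)
        for w in window_sizes
        for idx, s in enumerate(strings)
        if len(s) >= w
    }
```

For example, `build_row("AATGATAC", 4, 2)` yields three boxes, the second of which covers the windows TGAT and GATA. A larger c stores fewer boxes per string at the cost of looser boxes; a production version would also update each window's vector incrementally instead of recounting.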

26 MRS-index properties. Relative MBR volume (precision) decreases when c increases or when w decreases. MBRs are highly clustered. [Plot: box volume vs. box capacity]

27 Outline: Motivation & background; Our contribution; Frequency vector, frequency distance & wavelet transform; Multi-resolution index structure; k-NN & range queries; Experimental results; Conclusion.

28 Range Queries [KS01]. [Figure: index rows for four window sizes w over the strings $s_1, s_2, \ldots, s_d$; the remaining labels did not survive extraction]
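Because the frequency distance never exceeds the edit distance, it can serve as a lossless filter for range queries: any candidate whose frequency distance to the query already exceeds r can be discarded without running the dynamic program. The sketch below shows only this filter-and-refine idea on whole strings with a linear scan; it does not reproduce the index traversal of [KS01], and all function names are assumptions.

```python
def frequency_vector(s: str, alphabet: str = "ACGT") -> list[int]:
    return [s.count(ch) for ch in alphabet]


def fd1(u: list[int], v: list[int]) -> int:
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def range_query(database: list[str], q: str, r: int) -> list[str]:
    """Return every database string within edit distance r of q."""
    fq = frequency_vector(q)
    hits = []
    for s in database:
        if fd1(frequency_vector(s), fq) > r:
            continue  # filter: FD_1 <= ED, so ED must also exceed r
        if edit_distance(s, q) <= r:  # refine with the exact distance
            hits.append(s)
    return hits
```

With the database `["AATGATAG", "ACTTAGC", "GGGGGGGG"]` and query `"ACTTAGC"`, a radius of 4 returns the first two strings, and the third is pruned by the filter alone because its frequency distance is 7.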

29 k-Nearest Neighbor Query [KSF+96, SK98] k = 3

30 k-Nearest Neighbor Query. k = 3; r = edit distance to the 3rd closest substring.

31 k-Nearest Neighbor Query k = 3 r

32 k-Nearest Neighbor Query k = 3
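The k-NN slides suggest a two-phase scheme: pick k candidates, take the largest edit distance among them as a radius r, then run a range query with r and keep the k best. A minimal sketch of that idea follows, assuming a linear scan over whole strings in place of the index and $FD_1$ as the lower bound; it is an illustration in the spirit of the cited multi-step approach rather than the authors' algorithm, and the names are hypothetical.

```python
import heapq


def frequency_vector(s: str, alphabet: str = "ACGT") -> list[int]:
    return [s.count(ch) for ch in alphabet]


def fd1(u: list[int], v: list[int]) -> int:
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def knn_query(database: list[str], q: str, k: int) -> list[str]:
    """Return the k database strings closest to q in edit distance."""
    if not database or k <= 0:
        return []
    fq = frequency_vector(q)
    # Phase 1: the k candidates closest in frequency distance fix a radius r.
    by_fd = sorted(database, key=lambda s: fd1(frequency_vector(s), fq))
    r = max(edit_distance(s, q) for s in by_fd[:k])
    # Phase 2: every true neighbor has FD_1 <= ED <= r, so it survives this filter.
    survivors = [s for s in database if fd1(frequency_vector(s), fq) <= r]
    return heapq.nsmallest(k, survivors, key=lambda s: edit_distance(s, q))
```

The result is exact because at least k strings lie within edit distance r, so the true k nearest neighbors do too, and the lower bound keeps all of them among the survivors.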

33 Outline: Motivation & background; Our contribution; Experimental results; Conclusion.

34 Experimental Settings. w = {128, 256, 512, 1024}. Human chromosomes from ( ): chr02, chr18, chr21, chr22. Plotted results are from the chr18 dataset. Queries are selected randomly from the data set for 512 ≤ |q| ≤ [value missing]. An NFA-based technique [BYN99] is implemented for comparison.

35 Experimental Results 1: Effect of Box Capacity (10-NN)

36 Experimental Results 2: Effect of Window Size (10-NN)

37 Experimental Results 3: k-NN queries

38 Experimental Results 4: Range Queries

39 Outline: Motivation & background; Our contribution; Experimental results; Discussion & conclusion.

40 Discussion. In-memory (index size is 1-2% of the database size). Lossless search. 3 to 45 times faster than the NFA technique for k-NN queries. 2 to 12 times faster than the NFA technique for range queries. Can be used to speed up any previously defined technique.

41 Future Work. Extend to weighted edit distance and affine gaps. Extend to local similarity (substring/substring) search. Compare the quality of answers and speed to BLAST (lossy search). Use as a preprocessing step to BLAST. Apply the MRS index structure to larger alphabet sizes (e.g. protein sequences).

42 Related Work. Lossless search, online: [Mye86] (Myers) reduces the space requirement to O(rn), where r is the query radius; [WM92] (Wu, Manber) binary masks, O(rn); [BYN99] (Baeza-Yates, Navarro) NFA. Lossless search, offline (index based): [Mye94] (Myers) condensed r-neighborhood; [BYN97] (Baeza-Yates, Navarro) dictionary. Lossy search: [AG90] (Altschul, Gish) BLAST; FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPUTER, MUMmer; [GWWV00] (Giladi, Walker, Wang, Volkmuth) SST-Tree.

43 Related Work (Similar problems). [BYP92] (Baeza-Yates, Perleberg) only replace is allowed. [Gus97] (Gusfield) exact matching, suffix trees. [JKS00] (Jagadish, Koudas, Srivastava) exact matching with wild-cards for multidimensional strings, elided trees and R-tree.

44 THANK YOU

45 Frequency Distance to an MBR. [Figure: the distance FD(f(q), f(s)) between f(q) and f(s), and the distance FD(f(q), B) between f(q) and an MBR B]
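One simple way to bound the frequency distance from a query vector to every point inside a box is to clamp the query to the box in each dimension and apply the pos/neg rule to what remains. The sketch below shows that bound; it is an assumption made here for illustration and may be looser than the computation the authors use. The name `fd1_to_box` and the `(lo, hi)` corner representation are hypothetical.

```python
def fd1_to_box(fq: list[int], lo: list[int], hi: list[int]) -> int:
    """Lower bound on FD_1 between fq and any vector v with lo <= v <= hi."""
    # Amount by which fq lies below the box: these entries must increase.
    below = sum(l - x for x, l in zip(fq, lo) if x < l)
    # Amount by which fq lies above the box: these entries must decrease.
    above = sum(x - h for x, h in zip(fq, hi) if x > h)
    return max(below, above)
```

Clamping minimizes every per-dimension difference at once, so no point in the box can have a smaller pos or a smaller neg than the clamped point, and the returned value never overestimates. A box whose bound exceeds the query radius can therefore be skipped without inspecting the strings it covers.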