Output Sensitive Algorithm for Finding Similar Objects. Jul/2/2007, Combinatorial Algorithms Day. Takeaki Uno, National Institute of Informatics


Output Sensitive Algorithm for Finding Similar Objects
Jul/2/2007, Combinatorial Algorithms Day
Takeaki Uno, National Institute of Informatics, The Graduate University for Advanced Studies (Sokendai)

Motivation: Analyzing Huge Data
Recent information technology has given us many huge databases
- Web, genome, POS, log, …
"Construction" and "keyword search" can be done efficiently.
The next step is analysis: capture features of the data
- statistics, such as size, #rows, density, attributes, distribution, …
Can we get more? Look at (simple) local structures, but keep them simple and basic.
[Figure: a genome database (strings such as ATGCGCCGTA, TAGCGGGTGG, TTCGCGTTAG, …) and a table of results for Experiments 1 to 4]

Our Focus
Find all pairs of similar objects (or structures), or pairs in a binary relation instead of similarity.
This is probably a very basic and fundamental problem, so there should be many applications:
- finding globally similar structures,
- constructing neighbor graphs,
- detecting locally dense structures (groups of related objects).
In this talk, we look at strings.

Existing Studies
There are many studies on similarity search (homology search): given a database, construct a data structure that lets us quickly find the objects similar to a given query object.
- strings with Hamming distance, edit distance
- points in the plane (k-d trees), Euclidean space
- sets
- constructing neighbor graphs (for smaller dimensions)
- genome sequence comparison (heuristics)
Both exact and approximate approaches exist.
All-pairs comparison is not popular.

Our Problem
Problem: Given a database composed of $n$ strings of the same fixed length $l$, and a threshold $d$, find all pairs of strings whose Hamming distance is at most $d$.
Input (example): ATGCCGCG, GCGTGTAC, GCCTCTAT, TGCGTTTC, TGTAATGA, ...
Output (example):
・ ATGCCGCG, AAGCCGCC
・ GCCTCTAT, GCTTCTAA
・ TGTAATGA, GGTAATGG
...

Trivial Bound of the Complexity
If all the strings are exactly the same, we have to output all the pairs, which takes $\Theta(n^2)$ time.
So the simple all-pairs comparison in $O(l n^2)$ time is optimal if $l$ is a fixed constant.
Is there no improvement?
In practice, we would run the analysis only when the output is small; otherwise the analysis makes no sense.
So we consider the complexity in terms of the output size.
We propose an $O(2^l (n + lM))$ time algorithm, where $M$ is the number of output pairs.
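For reference, the simple all-pairs comparison mentioned above can be sketched as follows. This is only an illustrative baseline; the function names `hamming` and `similar_pairs_naive` are assumptions introduced here, not names from the talk.

```python
from itertools import combinations

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def similar_pairs_naive(strings, d):
    """All index pairs (i, j), i < j, with Hamming distance at most d.

    Compares every pair, so it always costs O(l * n^2) time,
    no matter how few pairs are actually output.
    """
    return [(i, j) for i, j in combinations(range(len(strings)), 2)
            if hamming(strings[i], strings[j]) <= d]

if __name__ == "__main__":
    strings = ["GAB", "ABC", "ABD", "ACC", "EFG", "FFG", "AFG"]
    print(similar_pairs_naive(strings, 1))
```

The baseline is useful mainly as a correctness check for the faster, output-sensitive variants.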

Basic Idea: Fixed Position Subproblem
Consider the following subproblem: for given $l-d$ positions of letters, find all pairs of strings with Hamming distance at most $d$ such that "the letters on the $l-d$ positions are the same".
Ex) the 2nd, 4th, and 5th positions of strings of length 5.
We can solve it by a "radix sort" on the letters at those positions, in $O(l n)$ time.
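A minimal sketch of the fixed-position subproblem, assuming a hash-based grouping in place of the radix sort named on the slide (the grouping plays the same role: strings that agree on the chosen positions end up together). The name `solve_fixed_positions` and the 0-based `positions` tuple are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def solve_fixed_positions(strings, positions):
    """Pairs (i, j), i < j, whose letters agree on every index in `positions`.

    If `positions` holds l - d indices, every reported pair differs in at
    most d places, because only the d remaining positions can differ.
    """
    groups = defaultdict(list)
    for i, s in enumerate(strings):
        # Key: the letters of s at the fixed positions (0-based).
        groups[tuple(s[p] for p in positions)].append(i)
    pairs = []
    for members in groups.values():
        pairs.extend(combinations(members, 2))
    return pairs

if __name__ == "__main__":
    strings = ["GAB", "ABC", "ABD", "ACC", "EFG", "FFG", "AFG"]
    # l = 3, d = 1: fix the 2nd and 3rd letters (indices 1 and 2).
    print(solve_fixed_positions(strings, (1, 2)))
```

The grouping step is linear in the input size; the pair enumeration is proportional to the number of pairs reported for this choice of positions.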

Examine All Cases
Solve the subproblem for all combinations of the positions.
If the distance of two strings $S_1$ and $S_2$ is at most $d$, the letters on some $l-d$ positions (say $P$) are the same.
So $S_1$ and $S_2$ are found in at least one combination (in the subproblem of combination $P$).
The number of combinations is $\binom{l}{d}$. When $l=5$ and $d=2$, it is 10.
The computation is "radix sorts $+\alpha$": $O(\binom{l}{d} \, l n)$ time for sorting.
Applying branch-and-bound to the radix sorts reduces this to $O(\binom{l}{d} \, n)$ time.
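The enumeration over all combinations can be sketched as below. This is an assumption-laden illustration: it regroups the strings from scratch for every combination (so it does not include the branch-and-bound sharing of sorting work), and it deliberately reports duplicates so that their number can be observed. The name `all_subproblems` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def all_subproblems(strings, d):
    """Run the fixed-position subproblem for every choice of l - d positions.

    Returns a list of pairs (i, j) that may contain the same pair several
    times: a pair at distance e <= d is reported once for every way of
    choosing l - d of its l - e agreeing positions.
    """
    l = len(strings[0])
    found = []
    for positions in combinations(range(l), l - d):
        groups = defaultdict(list)
        for i, s in enumerate(strings):
            groups[tuple(s[p] for p in positions)].append(i)
        for members in groups.values():
            found.extend(combinations(members, 2))
    return found

if __name__ == "__main__":
    print(all_subproblems(["GAB", "ABC", "ABD", "ACC", "EFG", "FFG", "AFG"], 1))
    # Two identical strings are reported once per combination of positions.
    print(len(all_subproblems(["AAA", "AAA"], 1)))
```

Every pair within distance $d$ appears at least once in the result, which is the covering property the slide relies on.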

Exercise
・ Find all pairs of strings with Hamming distance at most 1 among: GAB, ABC, ABD, ACC, EFG, FFG, AFG
[Figure: the same seven strings shown in several sorted orders]

Duplication: How long is "$+\alpha$"?
If two strings $S_1$ and $S_2$ are exactly the same, the pair is found in all subproblems, that is, $\binom{l}{d}$ times.
If we allow the duplications, "$+\alpha$" needs $O(M \binom{l}{d})$ time.
To avoid the duplications, use "canonical positions".

Avoid Duplications by Canonical Positions
For two strings $S_1$ and $S_2$, their canonical positions are the first $l-d$ positions at which they have the same letters.
We output the pair $S_1$ and $S_2$ only in the subproblem of their canonical positions.
Computing the canonical positions takes $O(d)$ time, so "$+\alpha$" needs $O(K d \binom{l}{d})$ time.
This avoids duplications without keeping the solutions in memory.
$O(\binom{l}{d} (n + dM)) = O(2^l (n + lM))$ time in total ($O(n + M)$ if $l$ is a fixed constant).
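A sketch of the duplicate-free algorithm under the same assumptions as before (hash grouping instead of radix sort, hypothetical function names). Note one simplification: `canonical_positions` here scans the whole pair of strings, which costs $O(l)$ per pair rather than the $O(d)$ stated on the slide.

```python
from collections import defaultdict
from itertools import combinations

def canonical_positions(s, t, keep):
    """The first `keep` indices at which s and t agree, or None if fewer exist."""
    same = [p for p, (a, b) in enumerate(zip(s, t)) if a == b]
    return tuple(same[:keep]) if len(same) >= keep else None

def similar_pairs(strings, d):
    """Each pair (i, j), i < j, with Hamming distance at most d, exactly once."""
    l = len(strings[0])
    keep = l - d
    result = []
    for positions in combinations(range(l), keep):
        groups = defaultdict(list)
        for i, s in enumerate(strings):
            groups[tuple(s[p] for p in positions)].append(i)
        for members in groups.values():
            for i, j in combinations(members, 2):
                # Report the pair only in the subproblem that matches its
                # canonical positions, so no result set has to be stored.
                if canonical_positions(strings[i], strings[j], keep) == positions:
                    result.append((i, j))
    return result

if __name__ == "__main__":
    print(sorted(similar_pairs(["GAB", "ABC", "ABD", "ACC", "EFG", "FFG", "AFG"], 1)))
    print(similar_pairs(["AAA", "AAA"], 1))
```

Because the test is local to each pair, the subproblems stay independent of each other and no global table of already-reported pairs is needed.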

In Practice
Is $\binom{l}{d}$ small in practice? In some cases, yes (e.g., genome sequences).
If we want to find strings with at most 10% error:
$\binom{20}{2} = 190$, $\binom{30}{3} = 4060$, $\binom{60}{6} = $ …
This may be too large when $l$ is somewhat large.
To deal with somewhat large $l$, we use a variant of this algorithm.

Partition to Blocks
Consider a partition of each string into $k$ blocks.
For given $k-d$ positions of blocks, find all pairs of strings with distance at most $d$ such that "the blocks on the positions are the same".
The radix sorts are done in $O(\binom{k}{d} \, n)$ time.
Ex) the 2nd, 4th, and 5th positions of blocks of strings of length 5.
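The block variant can be sketched as follows. The assumptions are: contiguous blocks of nearly equal size, hash grouping in place of radix sort, an explicit Hamming-distance check on each candidate pair (needed because agreeing on $k-d$ blocks does not bound the distance), and duplicate avoidance via the first $k-d$ agreeing blocks. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def block_bounds(l, k):
    """Split range(l) into k contiguous, nearly equal blocks as (start, end)."""
    return [(b * l // k, (b + 1) * l // k) for b in range(k)]

def similar_pairs_blocks(strings, d, k):
    """Pairs (i, j), i < j, with Hamming distance <= d, each reported once.

    Assumes d < k <= l. A pair within distance d has mismatches in at most
    d blocks, so at least k - d blocks are identical and the pair is a
    candidate in some combination of blocks.
    """
    bounds = block_bounds(len(strings[0]), k)
    keep = k - d
    result = []
    for chosen in combinations(range(k), keep):
        groups = defaultdict(list)
        for i, s in enumerate(strings):
            groups[tuple(s[bounds[b][0]:bounds[b][1]] for b in chosen)].append(i)
        for members in groups.values():
            for i, j in combinations(members, 2):
                s, t = strings[i], strings[j]
                if hamming(s, t) > d:
                    continue  # same blocks, but too many mismatches elsewhere
                equal = [b for b, (lo, hi) in enumerate(bounds) if s[lo:hi] == t[lo:hi]]
                if tuple(equal[:keep]) == chosen:  # canonical blocks only
                    result.append((i, j))
    return result

if __name__ == "__main__":
    data = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT", "GCTTCTAA", "TGTAATGA", "GGTAATGG"]
    print(sorted(similar_pairs_blocks(data, d=2, k=4)))
```

With $k$ blocks there are only $\binom{k}{d}$ groupings instead of $\binom{l}{d}$, at the price of the extra distance check on candidates.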

Small "$+\alpha$" is Expected
The Hamming distance of two strings may be larger than $d$ even if their $k-d$ blocks are the same.
In the worst case, "$+\alpha$" is not linear in the number of outputs.
However, if the number of letters in the $k-d$ blocks is large enough, few strings have the same blocks.
So "$+\alpha$" is not large in practice, and the running time is almost $O(\binom{k}{d} \, n)$.

Experiments: $l = 20$ and $d = 0, 1, 2, 3$
Data: prefixes of the human Y chromosome
Machine: notebook PC with Pentium M 1.1 GHz, 256 MB RAM

Comparison of Long Strings
Slice one of the long strings into substrings with overlaps.
Partition the other long string into substrings without overlaps.
Compare all pairs of substrings.
(1) Draw a matrix: the intensity of a cell is given by the number of pairs inside it.
(2) Draw a point if 3 pairs lie in an area of length $\alpha$ and width $\beta$.
Rationale: if two substrings of length $\alpha$ have an error of slightly less than $k$%, they have at least some short similar substrings.
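The slicing, partitioning, and intensity-matrix steps above can be sketched as follows. This is a rough illustration under assumptions: the function names, the `step` parameter, and the square `cell` size are all introduced here, and the similar pairs are assumed to arrive as (start in first string, start in second string) tuples from any pair-finding routine.

```python
from collections import Counter

def slices_with_overlap(s, length, step=1):
    """(start, substring) for windows of `length` letters, one every `step` letters."""
    return [(i, s[i:i + length]) for i in range(0, len(s) - length + 1, step)]

def partition(s, length):
    """(start, substring) for consecutive non-overlapping pieces of `length` letters.

    A trailing piece shorter than `length` is dropped.
    """
    return slices_with_overlap(s, length, step=length)

def intensity_matrix(pairs, cell):
    """Count similar pairs per matrix cell.

    `pairs` holds (start_a, start_b) positions of similar substrings;
    each cell covers `cell` consecutive start positions on both axes.
    """
    counts = Counter()
    for a, b in pairs:
        counts[(a // cell, b // cell)] += 1
    return counts

if __name__ == "__main__":
    print(slices_with_overlap("ABCDE", 3))
    print(partition("ABCDEF", 3))
    print(intensity_matrix([(0, 0), (1, 2), (5, 5)], cell=3))
```

Cells with high counts would be drawn brighter; thresholding the counts (for example, at 3 pairs) gives the point-drawing rule.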

Comparison of Chromosomes
Human 21st and chimpanzee 22nd chromosomes.
Take strings of 30 letters from both, with overlaps.
Intensity is given by the number of pairs: white means possibly similar, black means never similar.
Grid lines reveal "repetitions of similar structures".
Running time: 20 min. by PC.
[Figure: dot matrix, human 21st chr. against chimpanzee 22nd chr.]

Homology Search on Chromosomes
Human X and mouse X chromosomes (150M strings for each).
Take strings of 30 letters beginning at every position.
・ For human X, without overlaps
・ $d=2$, $k=7$
・ A dot is drawn if 3 points lie in an area of width 300 and length …
Running time: … hour by PC.
[Figure: dot matrix, human X chr. against mouse X chr.]

Extensions ???
Can we solve the problem for other objects (sets, sequences, graphs, …)?
For graphs, maybe yes, but the practical performance is not certain.
For sets, Hamming distance is not preferable. For large sets, many differences should be allowed.
For continuous objects, such as points in Euclidean space, we can hardly bound the complexity in the same way. (In the discrete version, the neighbors are finite, and are in fact classified into a constant number of groups.)

Conclusion
Output sensitive algorithm for finding pairs of similar strings (in terms of Hamming distance)
Multiple classification by the positions required to be the same
Using blocks to reduce the practical computation
Application to genome sequence comparison
Future works
Extension to other objects (sets, sequences, graphs)
Extension to continuous objects (points in Euclidean space)
Efficient spin-out heuristics for practice
Genome analysis system