
Presentation transcript:

10/12/2014 LCS

• Given two strings S1 of length m and S2 of length n over the same alphabet, the Longest Common Substring problem is to find the longest substring of S1 that is also a substring of S2.
• A generalization is the k-common substring problem. Given the set of strings S = {S1, S2, ..., SK}, where |Si| = ni and Σ ni = N, find for each 2 ≤ k ≤ K the longest strings which occur as substrings of at least k strings.
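The k-common substring definition above can be illustrated with a deliberately naive sketch. It enumerates every substring of every input, so it is only suitable for short strings; the function name `k_common_substrings` and the tie-breaking rule are assumptions made for this illustration and do not come from the slides.

```python
from collections import defaultdict


def k_common_substrings(strings):
    """For each k in 2..K, return one longest substring found in at least k strings.

    Naive illustration: enumerates all substrings, so it is far slower than
    a generalized suffix tree.
    """
    owners = defaultdict(set)  # substring -> indices of the strings containing it
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                owners[s[i:j]].add(idx)

    result = {}
    for k in range(2, len(strings) + 1):
        candidates = [sub for sub, who in owners.items() if len(who) >= k]
        # Longest first; ties are broken alphabetically (an arbitrary choice)
        # so that the output is deterministic.
        result[k] = min(candidates, key=lambda sub: (-len(sub), sub)) if candidates else ""
    return result


if __name__ == "__main__":
    print(k_common_substrings(["ABABC", "BABCA", "ABCBA"]))
```

Returning a single string per k keeps the sketch short; a fuller version would return every string of maximal length.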


[Figure: dynamic-programming table comparing the strings ABAB and BABA, with rows and columns indexed by i and j. Caption: Longest Common Substring]
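The table in the figure can be rebuilt with a short sketch, assuming the two strings are ABAB and BABA as the figure labels suggest. Cell (i, j) holds the length of the longest common suffix of the first i characters of one string and the first j characters of the other; the function name `lcs_table` is an assumption for this illustration.

```python
def lcs_table(s1, s2):
    """Build the full DP table and return it with one longest common substring."""
    m, n = len(s1), len(s2)
    # table[i][j] = length of the longest common suffix of s1[:i] and s2[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    best_len, best_end = 0, 0  # best_end is an end index into s1
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                if table[i][j] > best_len:
                    best_len, best_end = table[i][j], i
    return table, s1[best_end - best_len:best_end]


if __name__ == "__main__":
    table, substring = lcs_table("ABAB", "BABA")
    for row in table[1:]:
        print(row[1:])
    print(substring)
```

When several substrings share the maximal length, this sketch reports the one whose end is reached first while scanning the first string.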

[Figure: generalized suffix tree whose edge labels use the characters a, b, c and the terminators $ and #]

Common substrings: 'a', 'b', 'c', 'ab', 'bc'
Longest common substrings: 'ab', 'bc'
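The input strings for this slide are not recoverable from the transcript. As a hedged illustration, the hypothetical pair "abc" and "bcab" happens to produce the same lists; both the strings and the helper names below are assumptions, not the presenter's data.

```python
def substrings(s):
    """Return the set of all non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def common_substrings(s1, s2):
    """Return (all common substrings, the longest ones), both sorted."""
    common = substrings(s1) & substrings(s2)
    longest_len = max(map(len, common), default=0)
    longest = sorted(sub for sub in common if len(sub) == longest_len)
    # Sort by length first, then alphabetically, for readable output.
    return sorted(common, key=lambda sub: (len(sub), sub)), longest


if __name__ == "__main__":
    # Hypothetical inputs chosen for illustration only.
    everything, longest = common_substrings("abc", "bcab")
    print(everything)
    print(longest)
```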

LCS compares two strings and finds the longest run of characters that occurs in both of them. We can then declare two documents near duplicates if the ratio of the common substring length to the document length exceeds some threshold.

Consider the example below:

Selling a beautiful house in California.
Buying a beautiful chip in California.

The longest common substring is "ing a beautiful " (16 characters long), with " in California." coming in second at 15 characters. The first string is 40 characters long, so you could assess how similar the strings are by taking the ratio 16/40 = 0.4.

The best part of this application is that the user can set the threshold level interactively.

Target audience of this application:
* Ideal for universities that do not have access to Turnitin.
* Students who do not have access to Turnitin.
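A minimal sketch of the near-duplicate check described above, assuming the ratio is taken against the length of the first document as in the example. The function names and the idea of passing the threshold as a parameter are assumptions for this illustration.

```python
def longest_common_substring(s1, s2):
    """Return one longest common substring of s1 and s2 (two-row DP)."""
    prev = [0] * (len(s2) + 1)
    best_len, best_end = 0, 0  # best_end is an end index into s1
    for i, ch in enumerate(s1, start=1):
        cur = [0] * (len(s2) + 1)
        for j, other in enumerate(s2, start=1):
            if ch == other:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return s1[best_end - best_len:best_end]


def similarity(doc1, doc2):
    """Ratio of the longest common substring length to the length of doc1."""
    if not doc1:
        return 0.0
    return len(longest_common_substring(doc1, doc2)) / len(doc1)


def is_near_duplicate(doc1, doc2, threshold):
    """True if the similarity ratio exceeds the user-chosen threshold."""
    return similarity(doc1, doc2) > threshold


if __name__ == "__main__":
    a = "Selling a beautiful house in California."
    b = "Buying a beautiful chip in California."
    print(repr(longest_common_substring(a, b)), similarity(a, b))
```

Dividing by the first document's length makes the measure asymmetric; dividing by the shorter or longer document instead is an equally plausible design choice.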

Medical record linkage is becoming increasingly important as clinical data is distributed across independent sources. Under exact matching, two corresponding fields within a record are said to agree only if all characters match; otherwise the fields are considered mismatches. An LCS-based score tolerates small spelling differences instead.

LCS score for the names 'TAMMY SHACKELFORD' and 'TAMMIE SHACKLEFORD': the total length of the common substrings is 5 (SHACK) + 4 (TAMM) + 4 (FORD) = 13. The length of the shorter name string (ignoring white space) is 16, therefore the LCS score is 13 ÷ 16 = 0.8125.
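The score above can be reproduced with a sketch that repeatedly finds the longest common substring, removes it from both names, and stops when the remaining matches become too short. The minimum match length of 3 is an assumption chosen because it reproduces the total of 13 on this slide; the function names are also assumptions.

```python
def _longest_match(a, b):
    """Return (start_in_a, start_in_b, length) of one longest common substring."""
    prev = [0] * (len(b) + 1)
    best = (0, 0, 0)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def lcs_score(name1, name2, min_len=3):
    """Sum of repeatedly removed common substrings over the shorter name length.

    min_len=3 is an assumed cut-off, not a value stated in the slides.
    """
    a = "".join(name1.split())  # ignore white space
    b = "".join(name2.split())
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    total = 0
    while True:
        start_a, start_b, length = _longest_match(a, b)
        if length == 0 or length < min_len:
            break
        total += length
        # Remove the matched run from both names before searching again.
        a = a[:start_a] + a[start_a + length:]
        b = b[:start_b] + b[start_b + length:]
    return total / shorter


if __name__ == "__main__":
    print(lcs_score("TAMMY SHACKELFORD", "TAMMIE SHACKLEFORD"))
```

Joining the pieces after each removal can in principle create matches that span the cut; this sketch accepts that simplification.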

Approach               Worst-case time complexity
Brute force            O(n^3)
Dynamic programming    O(mn)
Suffix array           O(n log n)
Suffix tree            O(n)
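To make the suffix-array row concrete, here is a sketch of the idea: sort the suffixes of the two strings joined by a separator, then take the longest common prefix over adjacent suffixes that come from different strings. The suffix array is built by plain sorting, which is much slower than the O(n log n) bound in the table, and the separator character and function name are assumptions for this illustration.

```python
def lcs_via_suffix_array(s1, s2, sep="\x00"):
    """Longest common substring via a naively built suffix array.

    sep is an assumed single-character separator that must not occur
    in either input.
    """
    if sep in s1 or sep in s2:
        raise ValueError("separator must not occur in either input string")
    text = s1 + sep + s2
    boundary = len(s1)  # suffixes starting before this index belong to s1
    # Naive construction: sort suffix start positions by the suffix text.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])

    best_start, best_len = 0, 0
    for x, y in zip(suffix_array, suffix_array[1:]):
        if (x < boundary) == (y < boundary):
            continue  # both suffixes come from the same input string
        # Length of the common prefix of the two adjacent suffixes.
        k = 0
        while x + k < len(text) and y + k < len(text) and text[x + k] == text[y + k]:
            k += 1
        if k > best_len:
            best_start, best_len = x, k
    return text[best_start:best_start + best_len]


if __name__ == "__main__":
    print(lcs_via_suffix_array("xabcdy", "zabcdw"))
```

Because the separator occurs only once, a common prefix between a suffix of the first string and a suffix of the second can never run across the boundary.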

Approach               Time complexity (basic operations)    Execution time (ms)
Brute force            n^3                                   129
Dynamic programming    m*n                                   68
Suffix tree            n                                     29

Approach               Time complexity (basic operations)    Execution time (ms)
Brute force            n^3                                   273
Dynamic programming    m*n                                   120
Suffix tree            n                                     60

Results were measured on a system with an Intel Core i… processor (… GHz) and 4 GB of RAM.

• In dynamic programming, the following changes can be made to the existing algorithm to reduce the memory usage of an implementation:
  • Keep only the last and current row of the dynamic-programming table, using O(min(m, n)) memory instead of O(nm).
  • Store only the non-zero values in the rows, using hash tables instead of arrays. This is useful for large alphabets.
• The existing Ukkonen suffix-tree implementation of the longest common substring problem can be replaced with McCreight's or Weiner's construction to see whether it yields a marginal improvement in time and space.
• A hybrid algorithm using rolling hashes and suffix arrays can be compared against the existing performance results to see whether there is any significant change in time and space complexity.
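The two memory-saving changes to the dynamic-programming approach listed above can be sketched as follows. Both functions return only the length of the longest common substring; their names are assumptions for this illustration.

```python
from collections import defaultdict


def lcs_length_two_rows(s1, s2):
    """Keep only the previous and current row: O(min(m, n)) extra memory."""
    if len(s2) > len(s1):
        s1, s2 = s2, s1  # make each row as short as the shorter string
    prev = [0] * (len(s2) + 1)
    best = 0
    for ch in s1:
        cur = [0] * (len(s2) + 1)
        for j, other in enumerate(s2, start=1):
            if ch == other:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def lcs_length_sparse(s1, s2):
    """Store only the non-zero cells of each row in a hash table (dict)."""
    positions = defaultdict(list)  # character -> indices where it occurs in s2
    for j, ch in enumerate(s2):
        positions[ch].append(j)
    prev = {}  # column index -> match length ending there in the previous row
    best = 0
    for ch in s1:
        cur = {}
        for j in positions.get(ch, ()):
            cur[j] = prev.get(j - 1, 0) + 1
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(lcs_length_two_rows("ABAB", "BABA"), lcs_length_sparse("ABAB", "BABA"))
```

The sparse variant only visits positions where the characters actually match, which is why it pays off when the alphabet is large and matches are rare.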

• On-line construction of suffix trees
• Longest common substring problem
• Generalized suffix tree
• Real World Performance of Approximate String Comparators for use in Patient Matching
