Published by Meagan Neve. Modified over 2 years ago.

1
Ayat A. Dawood, CIS, Nile University
Joint work with: Mohamed AbouelHoda

2
Suffix array
The enhanced suffix array
Our accomplishment:
  Minimal Perfect Hashing Function
  The exact pattern matching problem
  Improving the bucket table representation

3
The suffix array suftab is an array of integers in the range 0 to n, specifying the lexicographic order of the n+1 suffixes of S$.
e.g., S = acaaacatat, so S$ = acaaacatat$ and n = 10 (here $ sorts after every other character):

i    suftab[i]  S(suftab[i])
0    2          aaacatat$
1    3          aacatat$
2    0          acaaacatat$
3    4          acatat$
4    6          atat$
5    8          at$
6    1          caaacatat$
7    5          catat$
8    7          tat$
9    9          t$
10   10         $
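
The table above can be reproduced with a few lines of Python. This is a minimal sketch, not the construction used by the authors: it sorts the suffixes directly, which is far slower than dedicated suffix array algorithms, and the function name `suffix_array` is our own. It assumes, as the table does, that `$` sorts after every other character.

```python
def suffix_array(s: str) -> list[int]:
    """Return suftab for s + '$', with '$' ordered after all other characters."""
    text = s + "$"

    def sort_key(start: int) -> list[int]:
        # Map '$' above every valid code point so that it sorts last.
        return [0x110000 if c == "$" else ord(c) for c in text[start:]]

    # Naive construction: sort the suffix start positions by suffix content.
    return sorted(range(len(text)), key=sort_key)


suftab = suffix_array("acaaacatat")
print(suftab)  # [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
```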

5
The enhanced suffix array is basically the suffix array enhanced with a set of additional tables. Using those tables, the best performance and time complexity are achieved.
lcptab[i] stores the length of the longest common prefix of the suffixes at suftab[i] and suftab[i-1]; lcptab[0] = 0.

i    suftab[i]  lcptab[i]  S(suftab[i])
0    2          0          aaacatat$
1    3          2          aacatat$
2    0          1          acaaacatat$
3    4          3          acatat$
4    6          1          atat$
5    8          2          at$
6    1          0          caaacatat$
7    5          2          catat$
8    7          0          tat$
9    9          1          t$
10   10         0          $
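
As an illustration of the definition of lcptab, the following sketch computes it by comparing each suffix with its predecessor character by character. The function name `lcp_table` is hypothetical, and this quadratic approach is chosen only for clarity; linear-time constructions exist but are not shown here.

```python
def lcp_table(text: str, suftab: list[int]) -> list[int]:
    """lcptab[i] = length of the longest common prefix of the suffixes
    starting at suftab[i-1] and suftab[i]; lcptab[0] is defined as 0."""

    def lcp(a: int, b: int) -> int:
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k

    return [0] + [lcp(suftab[i - 1], suftab[i]) for i in range(1, len(suftab))]


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]  # values from the slide
print(lcp_table(text, suftab))  # [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
```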

6
ℓ-interval (lcp-interval): an interval of suffixes in the suffix array sharing the same prefix of length ℓ, written ℓ-[i..j].
Example: 1-[0..5] (the suffixes at indices 0 to 5 share the prefix a).

7
ℓ-interval: interval of suffixes sharing the same prefix.
Intervals shown so far: 1-[0..5] and, nested inside it, 2-[0..1]. Edge label shown: a.

8
ℓ-interval: interval of suffixes sharing the same prefix.
All lcp-intervals of the example, arranged as a tree (each edge is labeled with the next character):

0-[0..10]
  a: 1-[0..5]
    a: 2-[0..1]
    c: 3-[2..3]
    t: 2-[4..5]
  c: 2-[6..7]
  t: 1-[8..9]
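
The intervals listed on this slide can be derived from lcptab alone. The sketch below uses a single left-to-right scan with a stack of open intervals, in the spirit of the bottom-up traversal described for enhanced suffix arrays; the function name and the `(ell, i, j)` tuple layout are our own choices, not taken from the slides.

```python
def lcp_intervals(lcptab: list[int]) -> list[tuple[int, int, int]]:
    """Return every lcp-interval as (ell, i, j), meaning ell-[i..j]."""
    n = len(lcptab)
    result = []
    stack = [(0, 0)]  # (lcp value, left boundary) of intervals still open
    for i in range(1, n):
        left = i - 1
        # A smaller lcp value closes every open interval with a larger value.
        while lcptab[i] < stack[-1][0]:
            ell, left = stack.pop()
            result.append((ell, left, i - 1))
        # A larger lcp value opens a new interval starting at `left`.
        if lcptab[i] > stack[-1][0]:
            stack.append((lcptab[i], left))
    # Intervals still open at the end extend to the last index.
    while stack:
        ell, left = stack.pop()
        result.append((ell, left, n - 1))
    return result


lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]  # values from the slide
for ell, i, j in sorted(lcp_intervals(lcptab), key=lambda t: (t[1], -t[2])):
    print(f"{ell}-[{i}..{j}]")
```

Sorting by left boundary and then by decreasing right boundary prints each parent before its children.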

9
Improvement (fine tuning):
Alphabet-independent exact pattern matching.
Improving the bucket table representation.
Improving access to the lcp-table.
Improvements are achieved using minimal perfect hashing techniques.

10
A minimal perfect hash function (MPHF) stores n static keys from a universe U in O(n) space with O(1) access time [Botelho et al.].
By comparison, a lookup table requires O(|U|) space to achieve constant access time.
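
To make the idea concrete, here is a toy minimal perfect hash built with a simplified hash-and-displace scheme. It is a sketch only and is not the construction of Botelho et al.: the names `h`, `build_mphf`, and `lookup` are hypothetical, the keys must be distinct, and the seed search is a brute-force loop that is practical only for small key sets.

```python
import hashlib


def h(key: str, seed: int, n: int) -> int:
    """Deterministic seeded hash of key into range(n)."""
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % n


def build_mphf(keys: list[str]) -> list[int]:
    """Return one seed per bucket so that lookup() maps the n distinct
    keys one-to-one onto range(n)."""
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for key in keys:
        buckets[h(key, 0, n)].append(key)
    seeds = [0] * n
    used = [False] * n
    # Place the largest buckets first; they are the hardest to fit.
    for b in sorted(range(n), key=lambda b: -len(buckets[b])):
        if not buckets[b]:
            continue
        seed = 1
        while True:
            slots = [h(key, seed, n) for key in buckets[b]]
            if len(set(slots)) == len(slots) and not any(used[s] for s in slots):
                break
            seed += 1
        seeds[b] = seed
        for s in slots:
            used[s] = True
    return seeds


def lookup(key: str, seeds: list[int]) -> int:
    """Evaluate the hash: two hash computations and one array access."""
    n = len(seeds)
    return h(key, seeds[h(key, 0, n)], n)


keys = ["aa", "ac", "at", "ca", "ta"]
seeds = build_mphf(keys)
print(sorted(lookup(k, seeds) for k in keys))  # [0, 1, 2, 3, 4]
```

The function is defined only on its key set: a string that was not among the keys still maps to some slot, so an application has to verify the result against the stored data.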

11
Exact pattern matching on the lcp-interval tree, e.g., pattern = aca:
From the root 0-[0..10], the character a leads to 1-[0..5]; from there, the character c leads to 3-[2..3]. Both suffixes in 3-[2..3] share the prefix aca, so the pattern occurs at positions suftab[2] = 0 and suftab[3] = 4.

15
Exact matching of a pattern of length m in a text of length n:
The naive method takes O(nm) time.
Using the enhanced suffix array, it can be achieved in O(|Σ|m) time [AbouelHoda et al.].
Other modifications to the enhanced suffix array allow it to be done in O(m log |Σ|) time [Kim et al.], [Fischer et al.].

16
Our work: using a minimal perfect hashing technique, exact pattern matching can be achieved in O(m) time, removing the alphabet factor.
Figure: the lcp-interval tree of the example (0-[0..10]; 1-[0..5], 2-[6..7], 1-[8..9]; 2-[0..1], 3-[2..3], 2-[4..5]; edge labels a, c, t) next to an MPHF table.
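
One way to picture the alphabet-independent search is a table keyed by (parent interval, next character) that returns the child interval in constant time. The sketch below is an assumption about how such a table could be used, not the authors' implementation: a Python dict stands in for the MPHF table, and the names `build_child_table` and `find` are hypothetical.

```python
def build_child_table(text, suftab, lcptab):
    """Map (i, j, c) -> (child_i, child_j, child_depth) for every edge of
    the lcp-interval tree, including edges to single suffixes."""
    table = {}

    def depth(i, j):
        # A single suffix matches up to its full length; an interval
        # matches up to its lcp value.
        return len(text) - suftab[i] if i == j else min(lcptab[i + 1 : j + 1])

    def visit(i, j):
        if i == j:
            return
        ell = depth(i, j)
        # Child intervals start at i and wherever lcptab equals ell.
        starts = [i] + [k for k in range(i + 1, j + 1) if lcptab[k] == ell]
        for a, nxt in zip(starts, starts[1:] + [j + 1]):
            b = nxt - 1
            table[(i, j, text[suftab[a] + ell])] = (a, b, depth(a, b))
            visit(a, b)

    visit(0, len(suftab) - 1)
    return table


def find(pattern, text, suftab, table):
    """Return the sorted start positions of pattern in text."""
    i, j, matched = 0, len(suftab) - 1, 0
    while matched < len(pattern):
        child = table.get((i, j, pattern[matched]))  # constant-time step
        if child is None:
            return []
        i, j, d = child
        end = min(d, len(pattern))
        start = suftab[i]
        # Check the remaining characters on this edge against the text.
        if text[start + matched : start + end] != pattern[matched:end]:
            return []
        matched = end
    return sorted(suftab[i : j + 1])


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
table = build_child_table(text, suftab, lcptab)
print(find("aca", text, suftab, table))  # [0, 4]
```

Each pattern character is compared once and each descent costs one table access, so the search takes O(m) time plus the time to report the occurrences. With a real MPHF in place of the dict, a key that is not in the table still hashes to some slot, so the stored entry has to be checked against the text, as the comparison step here already does.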

19
Bucket table: a table used to jump directly to a certain lcp-interval in the suffix array.
Example for prefixes of length 2 over the alphabet {a, c, g, t}. Each entry is the first suffix array index whose suffix starts with that prefix; "-" marks a prefix that does not occur:

prefix  index
aa      0
ac      2
at      4
ag      -
ca      6
ct      -
cc      -
cg      -
ta      8
tc      -
tg      -
tt      -
ga      -
gt      -
gc      -
gg      -
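
The bucket table of the example can be rebuilt as follows. This is a minimal sketch with a hypothetical function name; it allocates one entry for every possible prefix of length d, which is exactly the lookup-table layout whose size grows as the alphabet size raised to the power d.

```python
from itertools import product


def bucket_table(text: str, suftab: list[int], d: int, alphabet: str) -> dict:
    """Map every length-d string over the alphabet to the first suffix array
    index whose suffix starts with it, or None if it does not occur."""
    table = {"".join(p): None for p in product(alphabet, repeat=d)}
    for index, start in enumerate(suftab):
        prefix = text[start : start + d]
        # Suffixes are visited in sorted order, so the first hit is the
        # left boundary of the bucket. Prefixes containing '$' are skipped.
        if prefix in table and table[prefix] is None:
            table[prefix] = index
    return table


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
table = bucket_table(text, suftab, d=2, alphabet="acgt")
print({k: v for k, v in table.items() if v is not None})
# {'aa': 0, 'ac': 2, 'at': 4, 'ca': 6, 'ta': 8}
```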

21
Problem: the space consumption of the lookup table is prohibitive for a large prefix length d and a large alphabet Σ, since the table has |Σ|^d entries.
Solution: use minimal perfect hashing techniques to store the lookup table.
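
As a rough illustration of the growth, the arithmetic below assumes d = 12 (the value used in the reported results) and a text of length n; the bound on the number of keys is our own remark, not a figure from the slides.

```latex
\[
|\Sigma|^{d} = 4^{12} = 16{,}777{,}216 \quad \text{for } \Sigma = \{A, C, G, T\},
\qquad
|\Sigma|^{d} = 5^{12} = 244{,}140{,}625 \quad \text{for } \Sigma = \{A, C, G, T, N\}.
\]
\[
\#\{\text{distinct length-}d\text{ prefixes occurring in the text}\} \;\le\; n - d + 1 .
\]
```

A lookup table needs one entry per possible prefix, whereas an MPHF needs space proportional only to the number of prefixes that actually occur.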

22
Results: for the bacterial E. coli genome (size = 5400 bp) and for d = 12.
Table columns: alphabet size, no. of keys, lookup table size in bits, MPHF size in bits, reduction compared to lookup table.
(A, T, C, G): 46% reduction
(A, T, C, G, N*): 93% reduction
*N stands for an undefined nucleotide or a dummy character.

23
Exact pattern matching problem.
Improving the bucket table representation.
Improving access to the lcp-table.

25
To reduce space, each lcp-table entry is stored in 1 byte. If a common prefix is longer than 255, its length is stored in an extra table. That extra table is accessed either sequentially or using binary search.
Our enhancement: use an MPHF to store the extra table so that it can be accessed in constant time.
Figure labels: lcp-table, extra lcp-table.
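
The two access strategies can be contrasted in a short sketch. It rests on assumptions of ours: the byte value 255 is reserved as the overflow marker (so values of 255 and above go to the extra table), a Python dict stands in for the MPHF, and all function names are hypothetical.

```python
from bisect import bisect_left


def compress_lcp(lcptab: list[int]):
    """Split lcptab into a 1-byte-per-entry table and an overflow table."""
    small = bytearray(min(v, 255) for v in lcptab)  # 255 marks "look elsewhere"
    extra = [(i, v) for i, v in enumerate(lcptab) if v >= 255]  # sorted by index
    return small, extra


def get_lcp_bsearch(i: int, small: bytearray, extra: list) -> int:
    """Overflow access by binary search: O(log k) for k overflow entries."""
    if small[i] < 255:
        return small[i]
    pos = bisect_left(extra, (i, 0))
    return extra[pos][1]


def get_lcp_hashed(i: int, small: bytearray, extra_map: dict) -> int:
    """Overflow access through a hash map (stand-in for an MPHF): O(1)."""
    return small[i] if small[i] < 255 else extra_map[i]


lcptab = [0, 3, 300, 255, 254, 1000]  # made-up values for illustration
small, extra = compress_lcp(lcptab)
extra_map = dict(extra)
print([get_lcp_bsearch(i, small, extra) for i in range(len(lcptab))])
print([get_lcp_hashed(i, small, extra_map) for i in range(len(lcptab))])
```

Both calls recover the original values; they differ only in how the overflow entries are located.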
