Ayat A.Dawood, CIS, Nile University. Joint work with: Mohamed AbouelHoda

1 Ayat A.Dawood, CIS, Nile University. Joint work with: Mohamed AbouelHoda

2 - Suffix array
  - The enhanced suffix array
  - Our accomplishment: Minimal Perfect Hashing Function
  - The exact pattern matching problem
  - Improving the bucket table representation

3 Suffix array: an array of integers in the range 0 to n, specifying the lexicographic order of the n+1 suffixes of S$. The sentinel $ sorts after every other character.
e.g., S = acaaacatat (n = 10), S$ = acaaacatat$

i    suftab[i]  S(suftab[i])
0    2          aaacatat$
1    3          aacatat$
2    0          acaaacatat$
3    4          acatat$
4    6          atat$
5    8          at$
6    1          caaacatat$
7    5          catat$
8    7          tat$
9    9          t$
10   10         $
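
The table on this slide can be reproduced with a few lines of Python. This is only a naive sketch for illustration (sorting all suffixes directly, which is far slower than the linear-time constructions used in practice); the function name `suffix_array` is hypothetical, and the only convention taken from the slide is that `$` sorts after every other character.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions lexicographically.

    Assumes text ends with a unique '$' that must sort AFTER all other
    characters, as in the slide's table.
    """
    def sort_key(start):
        # Map '$' to a code point larger than any real character.
        return [0x110000 if ch == "$" else ord(ch) for ch in text[start:]]

    return sorted(range(len(text)), key=sort_key)


suftab = suffix_array("acaaacatat$")
print(suftab)  # [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
```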

5 The enhanced suffix array: the suffix array enhanced with a set of additional tables. Using those tables, the best performance and time complexity are achieved.
lcptab[i] stores the length of the longest common prefix of the suffixes at suftab[i] and suftab[i-1], with lcptab[0] = 0.

i    suftab[i]  lcptab[i]  S(suftab[i])
0    2          0          aaacatat$
1    3          2          aacatat$
2    0          1          acaaacatat$
3    4          3          acatat$
4    6          1          atat$
5    8          2          at$
6    1          0          caaacatat$
7    5          2          catat$
8    7          0          tat$
9    9          1          t$
10   10         0          $
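
As a rough illustration of the lcptab definition, the sketch below compares each suffix with its predecessor in the suffix array, character by character. It is a direct, quadratic-time reading of the definition, not the linear-time method normally used; the names `lcp_table`, `S` and `SUFTAB` are hypothetical, and the suffix array is copied from the slide.

```python
S = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]  # suffix array from the slide


def lcp_table(text, suftab):
    """lcptab[i] = length of the longest common prefix of the suffixes
    starting at suftab[i-1] and suftab[i]; lcptab[0] is defined as 0."""
    def common_prefix(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k

    return [0] + [common_prefix(suftab[i - 1], suftab[i])
                  for i in range(1, len(suftab))]


lcptab = lcp_table(S, SUFTAB)
print(lcptab)  # [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
```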

6 ℓ-interval (lcp-interval): an interval [i..j] of the suffix array whose suffixes share the same prefix of length ℓ, written ℓ-[i..j].
Interval shown so far: 1-[0..5] (the suffixes at indices 0 to 5 share the prefix a).

7 ℓ-interval (lcp-interval): an interval [i..j] of the suffix array whose suffixes share the same prefix of length ℓ, written ℓ-[i..j].
Intervals shown so far: 1-[0..5] (prefix a) and, nested inside it, 2-[0..1] (prefix aa).

8 ℓ-interval (lcp-interval): an interval [i..j] of the suffix array whose suffixes share the same prefix of length ℓ, written ℓ-[i..j].
The complete lcp-interval tree for S (each edge is labelled with the first character leading to the child):

0-[0..10]
  a: 1-[0..5]
    a: 2-[0..1]
    c: 3-[2..3]
    t: 2-[4..5]
  c: 2-[6..7]
  t: 1-[8..9]
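
The intervals on this slide can be enumerated from the lcp-table alone. The following is a minimal sketch of a stack-based bottom-up scan; the function name `lcp_intervals` is hypothetical and singleton intervals are not reported. Whether the authors compute the intervals this way is not stated in the slides.

```python
LCP = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]  # lcp-table from the slides


def lcp_intervals(lcp):
    """Return all non-singleton lcp-intervals as (l, i, j) triples."""
    result = []
    stack = [(0, 0)]  # (lcp value, left boundary); starts with the root
    for i in range(1, len(lcp)):
        left = i - 1
        # Close every open interval whose lcp value exceeds lcp[i].
        while lcp[i] < stack[-1][0]:
            value, left = stack.pop()
            result.append((value, left, i - 1))
        # Open a new interval if lcp[i] is deeper than the current top.
        if lcp[i] > stack[-1][0]:
            stack.append((lcp[i], left))
    # Whatever is still open extends to the end of the array.
    while stack:
        value, left = stack.pop()
        result.append((value, left, len(lcp) - 1))
    return result


for value, i, j in sorted(lcp_intervals(LCP), key=lambda t: (t[1], -t[2])):
    print(f"{value}-[{i}..{j}]")
```

Sorting by left boundary and then by decreasing right boundary prints each parent before its children, which mirrors the tree layout of the slide.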

9 Improvement (fine tuning):
  - Alphabet-independent exact pattern matching.
  - Improving the bucket table representation.
  - Improving access to the lcp-table.
Improvements are achieved using minimal perfect hashing techniques.

10 Minimal perfect hashing function (MPHF): stores n static keys from a universe U in O(n) space with O(1) access time [Botelho et al.]. By contrast, a plain lookup table requires O(|U|) space to achieve constant access time.
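
To make the idea concrete, here is a small hash-and-displace style sketch that maps n fixed keys onto the slots 0..n-1 without collisions. It is an illustration only and is not the construction of Botelho et al. cited on the slide; the names `build_mphf`, `mphf` and `_h` are hypothetical, and the per-bucket seeds are kept as plain Python integers rather than in a compressed form.

```python
import hashlib


def _h(key, seed, m):
    """Seeded hash of a string key into range(m)."""
    digest = hashlib.blake2b(key.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") % m


def build_mphf(keys):
    """Return one seed per bucket so that every key gets a distinct slot.

    Assumes the keys are distinct strings.
    """
    n = len(keys)
    buckets = [[] for _ in range(max(1, n // 2))]
    for key in keys:
        buckets[_h(key, 0, len(buckets))].append(key)

    seeds = [0] * len(buckets)
    used = [False] * n
    # Place the largest buckets first, while most slots are still free.
    for b in sorted(range(len(buckets)), key=lambda b: len(buckets[b]), reverse=True):
        if not buckets[b]:
            continue
        seed = 1
        while True:
            slots = [_h(key, seed, n) for key in buckets[b]]
            if len(set(slots)) == len(slots) and not any(used[s] for s in slots):
                break
            seed += 1
        seeds[b] = seed
        for s in slots:
            used[s] = True
    return seeds


def mphf(key, seeds, n):
    """Evaluate the hash: two hash computations, no key comparison."""
    return _h(key, seeds[_h(key, 0, len(seeds))], n)


KEYS = ["aa", "ac", "at", "ca", "ta"]
SEEDS = build_mphf(KEYS)
SLOTS = {key: mphf(key, SEEDS, len(KEYS)) for key in KEYS}
```

A key that was not in the original set still hashes to some slot, so an application that can be queried with unknown keys has to verify the entry it finds.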

11 Exact pattern matching on the lcp-interval tree, e.g., pattern = aca.
Tree: 0-[0..10] with children 1-[0..5] (a), 2-[6..7] (c), 1-[8..9] (t); 1-[0..5] with children 2-[0..1] (a), 3-[2..3] (c), 2-[4..5] (t).
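
The search for the pattern aca can be sketched as a top-down walk over the lcp-intervals. The code below is an assumption-laden illustration: the names `child_intervals` and `find_interval` are hypothetical, and the child intervals are found by scanning the lcp values of the current interval, which is simpler but slower than the child table of a real enhanced suffix array.

```python
S = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
LCP = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def child_intervals(lcp, i, j):
    """Split the non-singleton interval [i..j] at its minimum lcp value."""
    l = min(lcp[i + 1:j + 1])
    cuts = [i] + [k for k in range(i + 1, j + 1) if lcp[k] == l] + [j + 1]
    return l, [(a, b - 1) for a, b in zip(cuts, cuts[1:])]


def find_interval(text, suftab, lcp, pattern):
    """Return the suffix-array interval of suffixes starting with pattern,
    or None if the pattern does not occur."""
    i, j, matched, m = 0, len(suftab) - 1, 0, len(pattern)
    while True:
        start = suftab[i]
        if i == j:  # singleton: compare the rest of the pattern directly
            ok = text[start + matched:start + m] == pattern[matched:]
            return (i, j) if ok else None
        l, children = child_intervals(lcp, i, j)
        upto = min(l, m)
        if text[start + matched:start + upto] != pattern[matched:upto]:
            return None
        matched = upto
        if matched == m:
            return (i, j)
        # Try each child: this loop is where the alphabet factor comes from.
        for a, b in children:
            if text[suftab[a] + matched] == pattern[matched]:
                i, j = a, b
                break
        else:
            return None


hit = find_interval(S, SUFTAB, LCP, "aca")
print(hit)                                 # (2, 3)
print(sorted(SUFTAB[hit[0]:hit[1] + 1]))   # [0, 4]
```

For aca the walk goes from 0-[0..10] to 1-[0..5] and ends in 3-[2..3], so the pattern occurs at text positions 0 and 4.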

15 Time to match a pattern of length m against a text of length n:
  - Using the naive method: O(nm).
  - Using enhanced suffix arrays: O(|Σ| m) [AbouElHoda et al.].
  - Other modifications of the enhanced suffix array allow O(m log |Σ|) [Kim et al.], [Fischer et al.].

16 Our work: using a minimal perfect hashing technique, exact pattern matching can be achieved in O(m), removing the alphabet factor.
Figure: the lcp-interval tree (0-[0..10]; 1-[0..5], 2-[6..7], 1-[8..9]; 2-[0..1], 3-[2..3], 2-[4..5]) shown next to an MPHF table.
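
One way to picture how a hash table removes the alphabet factor is to store every (parent interval, next character) pair as a key that points straight at the child interval. The sketch below uses a Python dict as a stand-in for the MPHF table; the key layout and the names `build_child_table` and `find_interval_hashed` are assumptions for illustration, since the slides do not spell out what the MPHF table stores.

```python
S = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
LCP = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def build_child_table(text, suftab, lcp):
    """Map (i, j, char) -> (a, b, depth) for every child [a..b] of [i..j]."""
    table = {}
    todo = [(0, len(suftab) - 1)]
    while todo:
        i, j = todo.pop()
        if i == j:
            continue  # singletons have no children
        l = min(lcp[i + 1:j + 1])
        cuts = [i] + [k for k in range(i + 1, j + 1) if lcp[k] == l] + [j + 1]
        for a, nxt in zip(cuts, cuts[1:]):
            b = nxt - 1
            # Shared prefix length of the child (whole suffix for singletons).
            depth = len(text) - suftab[a] if a == b else min(lcp[a + 1:b + 1])
            table[(i, j, text[suftab[a] + l])] = (a, b, depth)
            todo.append((a, b))
    return table


def find_interval_hashed(text, suftab, table, pattern):
    """One table lookup per visited interval instead of a scan over children."""
    i, j, matched, m = 0, len(suftab) - 1, 0, len(pattern)
    while matched < m:
        entry = table.get((i, j, pattern[matched]))
        if entry is None:
            return None
        i, j, depth = entry
        upto = min(depth, m)
        start = suftab[i]
        if text[start + matched:start + upto] != pattern[matched:upto]:
            return None
        matched = upto
    return (i, j)


CHILDREN = build_child_table(S, SUFTAB, LCP)
print(find_interval_hashed(S, SUFTAB, CHILDREN, "aca"))  # (2, 3)
```

Each pattern character is compared once and each step costs one lookup, so the search takes O(m) expected time with the dict; a true MPHF would need the stored entry to be verified, because it returns some slot even for keys that were never inserted.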

19 Bucket table: a table used to jump directly to a certain lcp-interval in the suffix array.
Example for prefixes of length d = 2. Each entry gives the first suffix array index of the bucket; "-" marks a prefix that does not occur in S.

aa 0    ac 2    at 4    ag -
ca 6    ct -    cc -    cg -
ta 8    tc -    tg -    tt -
ga -    gt -    gc -    gg -
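
The non-empty entries of this bucket table can be derived directly from the suffix array. The sketch below is a minimal illustration with a hypothetical function name `bucket_table`; it keeps only the prefixes that actually occur and skips prefixes that run into the sentinel, which is an assumption made to match the entries shown on the slide.

```python
S = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]


def bucket_table(text, suftab, d):
    """Map each occurring length-d prefix to its first suffix array index."""
    table = {}
    for index, start in enumerate(suftab):
        prefix = text[start:start + d]
        # Skip suffixes that are too short or that reach the sentinel.
        if len(prefix) == d and "$" not in prefix:
            table.setdefault(prefix, index)
    return table


BUCKETS = bucket_table(S, SUFTAB, 2)
print(BUCKETS)  # {'aa': 0, 'ac': 2, 'at': 4, 'ca': 6, 'ta': 8}
```

Only 5 of the 16 possible two-letter prefixes occur in this example, which is the gap between the full lookup table and the set of keys that actually needs to be stored.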

21 Problem: the lookup table has |Σ|^d entries, so its space consumption is prohibitive for large d.
Solution: use minimal perfect hashing techniques to store the lookup table.
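
A quick calculation shows how fast the full table grows. The helper name `lookup_table_slots` is hypothetical; the alphabet sizes 4 and 5 and the prefix length d = 12 are the parameters used in the results of this talk.

```python
def lookup_table_slots(alphabet_size, d):
    """Number of entries in a direct lookup table over all length-d prefixes."""
    return alphabet_size ** d


print(lookup_table_slots(4, 2))    # 16
print(lookup_table_slots(4, 12))   # 16777216
print(lookup_table_slots(5, 12))   # 244140625
```

The table size depends only on |Σ| and d, not on how many prefixes actually occur in the text, whereas a minimal perfect hash needs space proportional to the number of keys present.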

22 Results for the bacterial E. coli genome (size = 5400 bp) and for d = 12:

Alphabet size      No. of keys   Lookup table size in bits   MPHF size in bits   Reduction compared to lookup table
4 (A,T,C,G)        3474814       1677216                     7231956.638         46% reduction
5 (A,T,C,G,N*)     8451811       244140625                   17590331.64         93% reduction

*N stands for an undefined nucleotide or dummy character.

23 - Exact pattern matching problem.
   - Improving the bucket table representation.
   - Improving access to the lcp-table.

25 To reduce space, each lcp-table entry is stored in 1 byte. If a common prefix is longer than 255, its length is stored in a separate extra table, which is accessed sequentially or by binary search.
Our enhancement: use an MPHF to store the extra table so that it can be accessed in constant time.
Example: lcp-table entries 0 2 3 2 0; extra lcp-table entries 257 279 300 260.
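
A minimal sketch of this two-level layout follows. It makes two assumptions that are not in the slides: the byte value 255 is used as the marker for "look in the extra table", so every value of 255 or more goes there, and a Python dict keyed by position stands in for the MPHF. The names `compress_lcp` and `get_lcp` are hypothetical.

```python
def compress_lcp(lcp):
    """Split an lcp-table into a one-byte-per-entry array plus an extra table."""
    small = bytearray(min(value, 255) for value in lcp)   # 255 = overflow marker
    extra = {i: value for i, value in enumerate(lcp) if value >= 255}
    return small, extra


def get_lcp(small, extra, i):
    """Read entry i: one byte access, plus one hashed lookup on overflow."""
    value = small[i]
    return value if value < 255 else extra[i]


LCP_VALUES = [0, 2, 3, 2, 0, 257, 279, 300, 260]  # values shown on the slide
SMALL, EXTRA = compress_lcp(LCP_VALUES)
print(list(SMALL))           # [0, 2, 3, 2, 0, 255, 255, 255, 255]
print(get_lcp(SMALL, EXTRA, 6))  # 279
```

Keying the extra table by position is what makes a hashed lookup possible; the sequential or binary search described on the slide would instead keep the overflow entries in a sorted list.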

