
# Ayat A.Dawood, CIS, Nile University. Joint work with: Mohamed AbouelHoda


Outline:

- Suffix array
- The enhanced suffix array
- Our accomplishment: minimal perfect hashing function
- The exact pattern matching problem
- Improving the bucket table representation

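Suffix array: an array of integers in the range 0 to n, specifying the lexicographic order of the (n+1) suffixes of S\$, where n = |S|. E.g., S\$ = acaaacatat\$ (the sentinel \$ sorts after every other character):

| i | suftab[i] | S(suftab[i]) |
|---|---|---|
| 0 | 2 | aaacatat\$ |
| 1 | 3 | aacatat\$ |
| 2 | 0 | acaaacatat\$ |
| 3 | 4 | acatat\$ |
| 4 | 6 | atat\$ |
| 5 | 8 | at\$ |
| 6 | 1 | caaacatat\$ |
| 7 | 5 | catat\$ |
| 8 | 7 | tat\$ |
| 9 | 9 | t\$ |
| 10 | 10 | \$ |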
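The table can be reproduced with a few lines of Python. This is a minimal sketch, not the construction used in the talk: it sorts the suffixes directly (quadratic or worse in the worst case), and the function name `build_suffix_array` is hypothetical. It assumes, as the table suggests, that the sentinel `$` sorts after every other character.

```python
def build_suffix_array(s, sentinel="$"):
    """Return suftab for s + sentinel by sorting suffix start positions."""
    text = s + sentinel

    def sort_key(start):
        # Map the sentinel to (1, "") so that it sorts after every
        # ordinary character, which is mapped to (0, character).
        return [(1, "") if c == sentinel else (0, c) for c in text[start:]]

    return sorted(range(len(text)), key=sort_key)


suftab = build_suffix_array("acaaacatat")
print(suftab)  # [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
```

Because the sentinel occurs only once, no suffix is a prefix of another, so the order is unambiguous.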


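The enhanced suffix array: basically, it is the suffix array enhanced with a set of tables. Using those tables, the best performance and complexity are achieved. lcptab[i] stores the length of the longest common prefix of the suffixes at suftab[i] and suftab[i-1], with lcptab[0] = 0.

| i | suftab[i] | lcptab[i] | S(suftab[i]) |
|---|---|---|---|
| 0 | 2 | 0 | aaacatat\$ |
| 1 | 3 | 2 | aacatat\$ |
| 2 | 0 | 1 | acaaacatat\$ |
| 3 | 4 | 3 | acatat\$ |
| 4 | 6 | 1 | atat\$ |
| 5 | 8 | 2 | at\$ |
| 6 | 1 | 0 | caaacatat\$ |
| 7 | 5 | 2 | catat\$ |
| 8 | 7 | 0 | tat\$ |
| 9 | 9 | 1 | t\$ |
| 10 | 10 | 0 | \$ |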
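The lcp-table definition translates directly into code. The sketch below compares neighbouring suffixes character by character, which is simple but not linear time; the function name `build_lcptab` is hypothetical, and `suftab` is copied from the example.

```python
def build_lcptab(text, suftab):
    """lcptab[i] = length of the longest common prefix of the suffixes
    starting at suftab[i-1] and suftab[i]; lcptab[0] is defined as 0."""

    def lcp(a, b):
        length = 0
        while (a + length < len(text) and b + length < len(text)
               and text[a + length] == text[b + length]):
            length += 1
        return length

    return [0] + [lcp(suftab[i - 1], suftab[i]) for i in range(1, len(suftab))]


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
lcptab = build_lcptab(text, suftab)
print(lcptab)  # [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
```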

L-interval (lcp-interval): an interval [i..j] of the suffix array whose suffixes all share the same prefix of length L, written L-[i..j]. For the example, the L-intervals form a tree whose edges are labelled with the next character:

- 0-[0..10]
  - a: 1-[0..5]
    - a: 2-[0..1]
    - c: 3-[2..3]
    - t: 2-[4..5]
  - c: 2-[6..7]
  - t: 1-[8..9]
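The intervals listed for the example can be enumerated from the lcp-table alone with a single left-to-right scan and a stack. The following is a sketch of that standard bottom-up scan; the function name `lcp_intervals` and the tuple layout `(L, i, j)` are assumptions made for illustration.

```python
def lcp_intervals(lcptab):
    """Return all lcp-intervals as tuples (L, i, j), meaning L-[i..j]."""
    n = len(lcptab)
    stack = [(0, 0)]            # open intervals as (lcp value, left boundary)
    found = []
    for i in range(1, n + 1):
        # A value of -1 past the end closes every interval still open.
        current = lcptab[i] if i < n else -1
        left = i - 1
        while stack and current < stack[-1][0]:
            value, left = stack.pop()
            found.append((value, left, i - 1))
        if i < n and current > stack[-1][0]:
            stack.append((current, left))
    return found


lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
for value, i, j in lcp_intervals(lcptab):
    print(f"{value}-[{i}..{j}]")
```

For the example lcp-table this yields the seven intervals of the tree, with the root 0-[0..10] reported last.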

Improvement (fine tuning):

- Alphabet-independent exact pattern matching.
- Improving the bucket table representation.
- Improving access to the lcp-table.

Improvements are achieved using minimal perfect hashing techniques.

Minimal perfect hashing function (MPHF): stores n static keys from a universe U in O(n) space with O(1) access time [Botelho et al.]. By contrast, a lookup table requires O(|U|) space to achieve constant access time.
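To make the idea concrete, here is a small sketch of a minimal perfect hash built with a simple hash-and-displace scheme. It is not the construction of Botelho et al. and is far less space-efficient. The function names, the seeded FNV-1a hash, and the sample keys are all assumptions chosen for illustration. Keys are assumed to be distinct.

```python
def fnv1a(key, seed):
    """Seeded 32-bit FNV-1a hash, deterministic across runs."""
    h = 0x811C9DC5 ^ seed
    for byte in key.encode("utf-8"):
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h


def build_mphf(keys, max_seed=1 << 16):
    """Return one seed per first-level bucket such that mphf_lookup
    maps the n keys one-to-one onto the slots 0..n-1."""
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for key in keys:
        buckets[fnv1a(key, 0) % n].append(key)
    seeds = [0] * n
    taken = [False] * n
    # Place the largest buckets first; they are the hardest to fit.
    for b in sorted(range(n), key=lambda b: len(buckets[b]), reverse=True):
        if not buckets[b]:
            continue
        for seed in range(1, max_seed):
            slots = [fnv1a(key, seed) % n for key in buckets[b]]
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                break
        else:
            raise RuntimeError("no displacement seed found")
        seeds[b] = seed
        for s in slots:
            taken[s] = True
    return seeds


def mphf_lookup(seeds, key):
    """Two hash evaluations and one array access: constant time."""
    n = len(seeds)
    return fnv1a(key, seeds[fnv1a(key, 0) % n]) % n


keys = ["aa", "ac", "at", "ca", "ta"]
seeds = build_mphf(keys)
print(sorted(mphf_lookup(seeds, k) for k in keys))  # [0, 1, 2, 3, 4]
```

An MPHF by itself maps a key that is not in the static set to some arbitrary slot, so a lookup normally verifies the result against the stored data.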

Exact pattern matching example: pattern = aca. The search descends the lcp-interval tree from 0-[0..10] along a to 1-[0..5], then along c to 3-[2..3]. The suffixes in that interval, suftab[2] = 0 and suftab[3] = 4, are the occurrences of aca in S.

Using the normal method, exact matching of a pattern of length m takes O(nm) time. Using the enhanced suffix array, it can be achieved in O(|Σ|m) [AbouElHoda et al.]. Other modifications to the enhanced suffix array allow it to be done in O(m log |Σ|) [Kim et al.], [Fischer et al.].

Our work: using a minimal perfect hashing technique, exact pattern matching can be achieved in O(m), removing the alphabet factor.

(Figure: the lcp-interval tree 0-[0..10]; 1-[0..5], 2-[6..7], 1-[8..9]; 2-[0..1], 3-[2..3], 2-[4..5], with edges labelled a, c, t, shown next to an MPHF table.)
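One way to picture the O(m) search is a table keyed by (parent interval, next character) that returns the child interval in a single lookup. The sketch below uses a Python `dict` as a stand-in for the MPHF-backed table. The names `build_child_table` and `find`, and the layout of the table, are assumptions for illustration rather than the authors' data structure.

```python
TEXT = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def interval_lcp(i, j):
    """Length of the prefix shared by all suffixes in [i..j]."""
    if i == j:
        return len(TEXT) - SUFTAB[i]      # singleton: the whole suffix
    return min(LCPTAB[i + 1:j + 1])


def build_child_table():
    """Map (i, j, c) to (child_i, child_j, child_lcp) for every
    lcp-interval [i..j] and every next character c."""
    table = {}

    def split(i, j):
        depth = interval_lcp(i, j)
        k = i
        while k <= j:
            c = TEXT[SUFTAB[k] + depth]
            end = k
            while end < j and TEXT[SUFTAB[end + 1] + depth] == c:
                end += 1
            table[(i, j, c)] = (k, end, interval_lcp(k, end))
            if k < end:
                split(k, end)
            k = end + 1

    split(0, len(SUFTAB) - 1)
    return table


def find(pattern, table):
    """Return the sorted start positions of pattern in TEXT."""
    i, j, matched = 0, len(SUFTAB) - 1, 0
    while matched < len(pattern):
        child = table.get((i, j, pattern[matched]))   # one lookup per edge
        if child is None:
            return []
        i, j, depth = child
        end = min(depth, len(pattern))
        start = SUFTAB[i]
        # Check the rest of the edge label against the text itself.
        if TEXT[start + matched:start + end] != pattern[matched:end]:
            return []
        matched = end
    return sorted(SUFTAB[i:j + 1])


table = build_child_table()
print(find("aca", table))  # [0, 4]
```

Each pattern character is examined a constant number of times, and each step down the tree costs one table lookup, so the search time does not depend on the alphabet size.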

Bucket table: a table used to jump directly to a certain lcp-interval in the suffix array. For the example, with prefixes of length 2 over the alphabet {a, c, g, t}:

| Prefix | First index in the suffix array |
|---|---|
| aa | 0 |
| ac | 2 |
| at | 4 |
| ca | 6 |
| ta | 8 |

The remaining entries (ag, ct, cc, cg, tc, tg, tt, ga, gt, gc, gg) are empty.
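The bucket table for the example can be built by one pass over the suffix array. This is a minimal sketch: the function name `build_bucket_table`, the use of `None` for empty entries, and the parameter name `d` for the prefix length are assumptions for illustration.

```python
from itertools import product


def build_bucket_table(text, suftab, d, alphabet):
    """Map every length-d string over alphabet to the first suffix-array
    index whose suffix starts with it, or None if there is none."""
    table = {"".join(p): None for p in product(alphabet, repeat=d)}
    for index, start in enumerate(suftab):
        prefix = text[start:start + d]
        if prefix in table and table[prefix] is None:
            table[prefix] = index
    return table


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
bucket = build_bucket_table(text, suftab, 2, "acgt")
print({k: v for k, v in bucket.items() if v is not None})
# {'aa': 0, 'ac': 2, 'at': 4, 'ca': 6, 'ta': 8}
```

Only 5 of the 16 entries are occupied here, which is the waste the full lookup table pays for.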

Problem: the space consumption of the lookup table is prohibitive for large d and |Σ|, since it has |Σ|^d entries. Solution: use minimal perfect hashing techniques to store the lookup table.
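The gap between the full table and the keys that actually occur is easy to measure. The sketch below counts both for a given text; the function name `bucket_table_sizes` is hypothetical, and the small text is the running example rather than genome data.

```python
def bucket_table_sizes(text, d, alphabet_size):
    """Return (entries in the full lookup table, distinct length-d
    substrings that actually occur in text)."""
    full_entries = alphabet_size ** d
    occupied = len({text[i:i + d] for i in range(len(text) - d + 1)})
    return full_entries, occupied


print(bucket_table_sizes("acaaacatat", 2, 4))  # (16, 5)
print(4 ** 12, 5 ** 12)                        # 16777216 244140625
```

A minimal perfect hash only needs space proportional to the occupied count, whereas the full table grows with the alphabet size raised to the power d.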

Results: for the bacterial E. coli genome (size = 5400 bp) and for d = 12:

| Alphabet size | No. of keys | Lookup table size in bits | MPHF size in bits | Reduction compared to lookup table |
|---|---|---|---|---|
| 4 (A, T, C, G) | 3474814 | 1677216 | 7231956.638 | 46% reduction |
| 5 (A, T, C, G, N*) | 8451811 | 244140625 | 17590331.64 | 93% reduction |

*N stands for an undefined nucleotide or dummy character.

- Exact pattern matching problem.
- Improving the bucket table representation.
- Improving access to the lcp-table.


Improving access to the lcp-table: to reduce space, each lcp-table entry is stored in 1 byte. If a common prefix is longer than 255, its length is stored in an extra table, which is accessed either sequentially or using binary search. Our enhancement: use an MPHF to store the extra table so that it can be accessed in constant time.

(Figure: an lcp-table with entries such as 0, 2, 3, 2, 0 and an extra lcp-table with entries 257, 279, 300, 260.)
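The two-table layout can be sketched as follows. A `dict` keyed by position stands in for the MPHF-backed extra table. The marker value 255, the function names, and the arrangement of the sample values are assumptions made for illustration; only the individual numbers are taken from the slide.

```python
OVERFLOW = 255  # assumed marker: "look in the extra table"


def compress_lcp(lcptab):
    """Split lcptab into a one-byte-per-entry table plus an extra table
    that holds every value too large for the byte table."""
    small = bytearray()
    extra = {}                       # position -> true lcp value
    for position, value in enumerate(lcptab):
        if value >= OVERFLOW:
            small.append(OVERFLOW)
            extra[position] = value
        else:
            small.append(value)
    return small, extra


def lcp_at(small, extra, position):
    """Constant-time access: one byte read, plus one hashed lookup
    only when the byte is the overflow marker."""
    value = small[position]
    return extra[position] if value == OVERFLOW else value


lcptab = [0, 2, 257, 3, 279, 2, 300, 0, 260]   # illustrative arrangement
small, extra = compress_lcp(lcptab)
print([lcp_at(small, extra, i) for i in range(len(lcptab))] == lcptab)  # True
```

With a sequential or binary search the cost of reading an overflowing entry grows with the size of the extra table; a hashed lookup keeps it constant.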
