Published by Meagan Neve. Modified over 3 years ago.

1
Ayat A. Dawood, CIS, Nile University. Joint work with: Mohamed AbouelHoda

2
- Suffix array
- The enhanced suffix array
- Our accomplishment: minimal perfect hashing function
  - The exact pattern matching problem
  - Improving the bucket table representation

3
Suffix array (suftab): an array of integers in the range 0 to n, specifying the lexicographic order of the n+1 suffixes of S$, where n = |S|. The sentinel $ sorts after every other character.
e.g., S = acaaacatat (S$ = acaaacatat$)

i    suftab[i]   S(suftab[i])
0    2           aaacatat$
1    3           aacatat$
2    0           acaaacatat$
3    4           acatat$
4    6           atat$
5    8           at$
6    1           caaacatat$
7    5           catat$
8    7           tat$
9    9           t$
10   10          $
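The table above can be reproduced with a few lines of Python. This is only a minimal sketch: it sorts the suffixes naively (far slower than the linear-time constructions used in practice), and the function name `suffix_array` is an illustrative choice, not taken from the slides. The sentinel is mapped to `~` so that it sorts after every letter, as in the table.

```python
def suffix_array(text):
    """Return suftab for text, which must end with the sentinel '$'.

    Naive construction by sorting all suffixes; fine for slide-sized input.
    '$' is replaced by '~' in the sort key so that it sorts after every letter.
    """
    assert text.endswith("$")
    return sorted(range(len(text)), key=lambda i: text[i:].replace("$", "~"))


suftab = suffix_array("acaaacatat$")
print(suftab)  # [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
```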

5
The enhanced suffix array is basically the suffix array enhanced with a set of additional tables. Using those tables, better performance and time complexity are achieved than with the suffix array alone.
lcptab[i] stores the length of the longest common prefix of the suffixes at suftab[i] and suftab[i-1], with lcptab[0] = 0.

i    suftab[i]   lcptab[i]   S(suftab[i])
0    2           0           aaacatat$
1    3           2           aacatat$
2    0           1           acaaacatat$
3    4           3           acatat$
4    6           1           atat$
5    8           2           at$
6    1           0           caaacatat$
7    5           2           catat$
8    7           0           tat$
9    9           1           t$
10   10          0           $
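As a rough illustration of the definition, the lcp-table can be computed by comparing each suffix with its predecessor in the suffix array. The sketch below takes the quadratic route for clarity (linear-time methods exist but are not shown); the function name `lcp_table` is an assumption made for this example.

```python
def lcp_table(text, suftab):
    """lcptab[i] = length of the longest common prefix of the suffixes
    starting at suftab[i-1] and suftab[i]; lcptab[0] is 0 by convention."""
    lcptab = [0] * len(suftab)
    for i in range(1, len(suftab)):
        a, b = suftab[i - 1], suftab[i]
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        lcptab[i] = k
    return lcptab


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]  # from the suffix array slide
print(lcp_table(text, suftab))  # [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
```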

6
lcp-interval (l-interval): an interval [i..j] of the suffix array whose suffixes all share the same prefix of length l, written l-[i..j]. (Suffix array and lcp-table as on slide 5.)
Example: 1-[0..5], the suffixes that start with a.

7
lcp-interval (l-interval): an interval of suffixes sharing the same prefix. (Suffix array and lcp-table as on slide 5.)
Example: 1-[0..5] has the child interval 2-[0..1], reached by the character a.

8
lcp-interval (l-interval): an interval of suffixes sharing the same prefix. (Suffix array and lcp-table as on slide 5.)
The lcp-interval tree, with each child labelled by the character that leads to it:

0-[0..10]
  a: 1-[0..5]
    a: 2-[0..1]
    c: 3-[2..3]
    t: 2-[4..5]
  c: 2-[6..7]
  t: 1-[8..9]
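The intervals of the tree can be recovered from the lcp-table alone. The following sketch uses a single left-to-right pass with a stack of open intervals, which is one common way to enumerate lcp-intervals; the slides do not say how the intervals were computed, so treat the function `lcp_intervals` as an illustrative assumption.

```python
def lcp_intervals(lcptab):
    """Return every lcp-interval as (lcp, left, right), found in one stack pass."""
    n = len(lcptab)
    found = []
    stack = [(0, 0)]  # open intervals as (lcp-value, left boundary)
    for i in range(1, n + 1):
        current = lcptab[i] if i < n else -1  # -1 closes all open intervals at the end
        left = i - 1
        # Every open interval with a larger lcp-value ends at position i - 1.
        while stack and current < stack[-1][0]:
            lcp, left = stack.pop()
            found.append((lcp, left, i - 1))
        # A larger lcp-value opens a new interval starting at `left`.
        if not stack or current > stack[-1][0]:
            stack.append((current, left))
    return found


lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
for lcp, left, right in sorted(lcp_intervals(lcptab), key=lambda t: (t[1], -t[2])):
    print(f"{lcp}-[{left}..{right}]")
```

For the lcp-table of acaaacatat$ this yields the seven intervals shown in the tree.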

9
Improvement (fine tuning):
- Alphabet-independent exact pattern matching.
- Improving the bucket table representation.
- Improving access to the lcp-table.
Improvements are achieved using minimal perfect hashing techniques.

10
Minimal perfect hash function (MPHF): stores n static keys from a universe U in O(n) space with O(1) access time [Botelho et al.]. By contrast, a direct lookup table requires O(|U|) space to achieve constant access time.
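To make the idea concrete, here is a small sketch of a minimal perfect hash function built with a simple hash-and-displace scheme. It is not the algorithm of Botelho et al.; it is a simpler technique swapped in for illustration. The names `build_mphf` and `lookup` are hypothetical, and the per-bucket seeds are kept in a plain Python list rather than the few bits per key that practical schemes achieve.

```python
import hashlib


def _hash(key, seed, size):
    """Seeded hash of a string key into range(size), using BLAKE2b."""
    digest = hashlib.blake2b(
        key.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % size


def build_mphf(keys):
    """Return per-bucket seeds so that lookup() maps the keys one-to-one onto range(n)."""
    n = len(keys)
    assert n > 0 and len(set(keys)) == n, "keys must be non-empty and distinct"
    buckets = [[] for _ in range(n)]
    for key in keys:
        buckets[_hash(key, 0, n)].append(key)  # first-level hash uses seed 0
    seeds = [0] * n
    taken = [False] * n
    # Place the largest buckets first, while many slots are still free.
    for bucket in sorted(buckets, key=len, reverse=True):
        if not bucket:
            break
        seed = 1
        while True:  # try seeds until the whole bucket lands on free, distinct slots
            slots = [_hash(key, seed, n) for key in bucket]
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                break
            seed += 1
        for s in slots:
            taken[s] = True
        seeds[_hash(bucket[0], 0, n)] = seed
    return seeds


def lookup(seeds, key):
    """Slot of `key` in range(n); only meaningful for keys used in build_mphf."""
    n = len(seeds)
    return _hash(key, seeds[_hash(key, 0, n)], n)


keys = ["aa", "ac", "at", "ca", "ta"]
seeds = build_mphf(keys)
print(sorted(lookup(seeds, k) for k in keys))  # [0, 1, 2, 3, 4]
```

Because the function does not store the keys themselves, a key outside the original set still maps to some slot, so a caller has to verify the result against the stored data.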

11
Exact pattern matching on the lcp-interval tree (tree as on slide 8, suffix array and lcp-table as on slide 5), e.g., pattern = aca:
Starting at the root 0-[0..10], follow a to 1-[0..5], then c to 3-[2..3]. All three characters of the pattern are now matched, so aca occurs at positions suftab[2] = 0 and suftab[3] = 4.

15
Exact matching of a pattern of length m in a text of length n:
- Using the normal method: O(nm).
- Using the enhanced suffix array: O(|Σ|m) [AbouElHoda et al.].
- Other modifications to the enhanced suffix array allow it to be done in O(m log |Σ|) [Kim et al.], [Fischer et al.].

16
Our work: using minimal perfect hashing, exact pattern matching can be achieved in O(m) time, removing the alphabet factor.
[Figure: the lcp-interval tree (as on slide 8) alongside an MPHF table]
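The sketch below shows why a hash keyed on (interval, character) removes the alphabet factor: each step of the traversal becomes a single table lookup instead of a scan over up to |Σ| children. A Python dict stands in for the MPHF table here, and the names `build_child_table` and `find` are hypothetical; the slides do not show the authors' actual data layout.

```python
def build_child_table(text, suftab, lcptab):
    """Map (left, right, char) -> (child_left, child_right, child_depth)."""
    table = {}

    def depth_of(left, right):
        if left == right:                       # singleton: the whole suffix is shared
            return len(text) - suftab[left]
        return min(lcptab[left + 1:right + 1])  # lcp-value of the interval

    def split(left, right):
        if left == right:
            return
        depth = depth_of(left, right)
        i = left
        while i <= right:  # group suffixes by the character that follows the shared prefix
            char = text[suftab[i] + depth]
            j = i
            while j < right and text[suftab[j + 1] + depth] == char:
                j += 1
            table[(left, right, char)] = (i, j, depth_of(i, j))
            split(i, j)
            i = j + 1

    split(0, len(suftab) - 1)
    return table


def find(pattern, text, suftab, table):
    """Return the sorted start positions of pattern, using one lookup per tree level."""
    left, right, depth = 0, len(suftab) - 1, 0  # root interval, assumed lcp-value 0
    matched = 0
    while True:
        end = min(depth, len(pattern))
        start = suftab[left]
        # Check the pattern against the prefix shared by the current interval.
        if text[start + matched:start + end] != pattern[matched:end]:
            return []
        matched = end
        if matched == len(pattern):
            return sorted(suftab[left:right + 1])
        child = table.get((left, right, pattern[matched]))  # constant-time child lookup
        if child is None:
            return []
        left, right, depth = child


TEXT = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
TABLE = build_child_table(TEXT, SUFTAB, LCPTAB)
print(find("aca", TEXT, SUFTAB, TABLE))  # [0, 4]
```

Replacing `table.get` with a scan over the children of the current interval would give the O(|Σ|m) variant from the previous slide.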

19
Bucket table: a table used to jump directly to a certain lcp-interval in the suffix array. (Suffix array and lcp-table as on slide 5.)

Bucket table for prefixes of length 2:

prefix            first index i
aa                0
ac                2
at                4
ag                (empty)
ca                6
ct, cc, cg        (empty)
ta                8
tc, tg, tt        (empty)
ga, gt, gc, gg    (empty)
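A bucket table like the one above can be filled in one pass over the suffix array. This is a minimal sketch assuming a direct table with one slot per possible d-mer; the function name `bucket_table` and the use of `None` for empty slots are choices made for this example.

```python
from itertools import product


def bucket_table(text, suftab, d, alphabet="acgt"):
    """Map every d-mer over `alphabet` to the first suffix-array index
    whose suffix starts with it, or None if the d-mer does not occur."""
    table = {"".join(p): None for p in product(alphabet, repeat=d)}
    for i, start in enumerate(suftab):
        prefix = text[start:start + d]
        # Suffixes shorter than d or containing '$' are not keys and are skipped.
        if prefix in table and table[prefix] is None:
            table[prefix] = i
    return table


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
table = bucket_table(text, suftab, d=2)
print({k: v for k, v in table.items() if v is not None})
# {'aa': 0, 'ac': 2, 'at': 4, 'ca': 6, 'ta': 8}
```

Only 5 of the 16 slots are occupied, which is the sparsity that the following slides exploit.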

21
Problem: the lookup table has |Σ|^d entries, so its space consumption is prohibitive for large d and |Σ|.
Solution: use minimal perfect hashing techniques to store the lookup table.
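The growth is easy to check numerically. The short sketch below counts the slots of a direct lookup table and, for comparison, the d-mers that actually occur in a text, which are the only keys a minimal perfect hash has to cover. Both helper names are illustrative.

```python
def lookup_table_entries(alphabet_size, d):
    """Number of slots a direct lookup table needs: one per possible d-mer."""
    return alphabet_size ** d


def distinct_dmers(text, d):
    """The d-mers that actually occur in text."""
    return {text[i:i + d] for i in range(len(text) - d + 1)}


text = "acaaacatat"
print(lookup_table_entries(4, 2), len(distinct_dmers(text, 2)))   # 16 5
print(lookup_table_entries(4, 12), lookup_table_entries(5, 12))   # 16777216 244140625
```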

22
Results: for the bacterial E. coli genome (size = 5400 bp) and d = 12

Alphabet size      No. of keys   Lookup table size (bits)   MPHF size (bits)   Reduction compared to lookup table
4 (A,T,C,G)        3474814       16777216                   7231956.638        46%
5 (A,T,C,G,N*)     8451811       244140625                  17590331.64        93%

*N stands for an undefined nucleotide or dummy character.

23
- Exact pattern matching problem.
- Improving the bucket table representation.
- Improving access to the lcp-table.

25
To reduce space, each lcp-table entry is stored in 1 byte. If a common prefix is longer than 255, its length is stored in a separate extra table, which is accessed either sequentially or by binary search.
Our enhancement: use an MPHF to store the extra table so that it can be accessed in constant time.
[Figure: lcp-table holding the small values (0, 2, 3, 2, 0) and extra lcp-table holding the large values (257, 279, 300, 260)]
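The two-table layout can be sketched as follows. Two assumptions are made that the slide does not spell out: the byte value 255 is reserved as the overflow marker (so values of 255 and above go to the extra table), and a Python dict keyed by the index i stands in for the MPHF. The sample lcp values reuse the numbers from the figure in an arbitrary order.

```python
OVERFLOW = 255  # assumed marker: this byte value means "look in the extra table"


def compress_lcp(lcptab):
    """Split lcptab into a 1-byte-per-entry table and an extra table for large values."""
    small = bytearray()
    extra = {}  # index -> full lcp value; stand-in for an MPHF-backed table
    for i, value in enumerate(lcptab):
        if value >= OVERFLOW:
            small.append(OVERFLOW)
            extra[i] = value
        else:
            small.append(value)
    return small, extra


def lcp_at(small, extra, i):
    """Constant-time access to lcptab[i] in the compressed layout."""
    value = small[i]
    return extra[i] if value == OVERFLOW else value


lcptab = [0, 2, 257, 3, 279, 2, 300, 0, 260]  # hypothetical arrangement
small, extra = compress_lcp(lcptab)
print(list(small))                                            # [0, 2, 255, 3, 255, 2, 255, 0, 255]
print([lcp_at(small, extra, i) for i in range(len(lcptab))])  # the original values
```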
