Presentation on theme: "Indexing DNA Sequences Using q-Grams"— Presentation transcript:
1 Indexing DNA Sequences Using q-Grams. Adriano Galati & Bram Raats
2 Indexing DNA Sequences Using q-Grams. A method for indexing DNA sequences efficiently, based on q-grams, to facilitate similarity search in a DNA database and to sidestep a linear scan of the entire database. Proposed: a hash table and c-trees, both based on the q-grams. These data structures allow quick detection of sequences.
3 Introduction. Two sequences share a certain number of q-grams if the edit distance (ed) between them is within a certain threshold. Since there are 4 letters, there are 4^q possible q-gram combinations. A two-level index is used to prune data sequences.
4 Introduction (2): Two-level index. A two-level index is used to prune data sequences. First level: clusters of similar q-grams in the DNA are generated, and a typical hash table is built on the segments with respect to the qClusters. Second level: the segments are transformed into c-signatures based on their q-grams, and a new index called the c-signature tree (c-tree) is proposed to organize the c-signatures of all segments of a DNA sequence for search efficiency.
5 Edit distance. To process approximate matching, one common and simple approximation metric is the edit distance. Definition: the edit distance between two sequences is the minimum number of edit operations (i.e. insertions, deletions and substitutions) of single characters needed to transform the first sequence into the second.
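The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not code from the presentation; the function name `edit_distance` is chosen here for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to transform a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"))  # one deletion
```

The table is filled row by row, so the cost is proportional to the product of the two lengths. That quadratic cost per comparison is why scanning the whole database with this metric is expensive.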
6 Preliminaries. Intuition: two sequences have a large number of q-grams in common when the edit distance (ed) between them is within a certain number. Given a sequence S, its q-grams are obtained by sliding a window of length q over the characters of S. This gives |S| - q + 1 q-grams for a sequence S.
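The sliding-window construction can be illustrated as follows. The helper name `qgrams` is an assumption made for this sketch.

```python
def qgrams(s: str, q: int) -> list[str]:
    """All overlapping substrings of length q, in order of occurrence."""
    # A window of length q fits at positions 0 .. len(s) - q,
    # which gives len(s) - q + 1 q-grams (none if s is shorter than q).
    return [s[i:i + q] for i in range(len(s) - q + 1)]


grams = qgrams("ACGGTACT", 2)
print(grams)       # ['AC', 'CG', 'GG', 'GT', 'TA', 'AC', 'CT']
print(len(grams))  # 8 - 2 + 1 = 7
```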
7 Question (Bogdan). 1. I have noticed that the segments of the database text that are considered in this method are disjoint (see page 4, Introduction). I understand that for each segment all the consecutive, non-disjoint, q-grams are taken into consideration when computing the q-cluster and the c-signature of the segment. However, I am a bit puzzled that at the border between two adjacent segments nothing is done, which means that (q-1) q-grams are disregarded at each border. Since each segment contains w-q+1 q-grams, it means that overall a ratio of approximately (q-1)/(w-q+1) of all q-grams are disregarded (if we ignore the difference of 1 between the nr. of segments and the nr. of borders between adjacent segments). For common values of q=3 and w=30, this means about 7% of the q-grams. Do you see a solution for overcoming this problem?
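The arithmetic in the question can be checked with a small sketch. It assumes a text whose length n is an exact multiple of the segment length w; the function names and the example length n = 300 are chosen here for illustration only.

```python
def border_loss(n: int, w: int, q: int) -> tuple[int, int]:
    """Return (kept, lost) q-gram counts when a text of length n is cut
    into disjoint segments of length w. Assumes n is a multiple of w."""
    segments = n // w
    kept = segments * (w - q + 1)    # q-grams lying entirely inside a segment
    lost = (segments - 1) * (q - 1)  # q-grams straddling a border
    return kept, lost


def approx_ratio(w: int, q: int) -> float:
    """The approximation used in the question: (q-1)/(w-q+1)."""
    return (q - 1) / (w - q + 1)


kept, lost = border_loss(300, 30, 3)
print(kept, lost)                    # 280 18
print(round(approx_ratio(30, 3), 3)) # 0.071, i.e. about 7%
```

For this example kept + lost equals 300 - 3 + 1 = 298, the number of q-grams in the unsegmented text, so every q-gram is accounted for.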
8 Answer (Bogdan). The method is an effort to improve efficiency by discarding (filtering out) the regions with low sequence similarity. Approximate sequence matching is preferred to exact matching in genomic databases because of evolutionary mutations in the genomic sequences and the presence of noisy data in a real sequence database.
9 q-gram Signature. There are 4^q kinds of q-grams. Each possible q-gram is denoted r_i. The q-gram signature is a bitmap with 4^q bits, where the i-th bit corresponds to the presence or absence of r_i. For a sequence S, the i-th bit is set to '1' if r_i occurs at least once in S, else '0'.
11 Example c-signature. P = "ACGGTACT". The q-gram signature is ( ) with 4^2 = 16 dimensions when q = 2.
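The signature for this example can be computed with a short sketch. The bit order is an assumption: the q-grams are enumerated lexicographically over the alphabet A, C, G, T (AA, AC, AG, ..., TT), which may differ from the order used on the original slide. The function name `qgram_signature` is also chosen for this example.

```python
from itertools import product


def qgram_signature(s: str, q: int, alphabet: str = "ACGT") -> list[int]:
    """Bitmap with len(alphabet)**q bits; bit i is 1 if the i-th q-gram
    (in lexicographic order, an assumed convention) occurs in s."""
    all_qgrams = ["".join(p) for p in product(alphabet, repeat=q)]
    present = {s[i:i + q] for i in range(len(s) - q + 1)}
    return [1 if g in present else 0 for g in all_qgrams]


sig = qgram_signature("ACGGTACT", 2)
print(len(sig))                # 16
print("".join(map(str, sig)))  # 0100001100111000
```

Six bits are set, one for each distinct 2-gram of P (AC, CG, CT, GG, GT, TA). The q-gram AC occurs twice but still sets only one bit.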
12 Hash table. Any DNA segment s can be encoded into a λ-bit bitmap by a coding function. The hash table has size 2^λ with respect to the qClusters.
13 Question (Jacob). I can't get my hands on the c-Trees (mentioned first on page 9). Could you please explain how such a tree is built up, because I can't figure it out.
14 c-Trees. A group of rooted dynamic trees built for indexing c-signatures. The height l is set by the user. Given the trees, each path from the root to a leaf in T_i corresponds to a c-signature string. At an internal node there are children.
15 Example c-Trees. Consider the five DNA segments: If we get trees
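17 Query Processing. The hash table (HT) and the c-trees (c-T) are built on the DNA segments. The query sequence Q is also partitioned into sliding query patterns. Two-level filtering: the first-level filter (FLF) is the hash-table-based similarity search, and the second-level filter (SLF) is the c-trees-based similarity search.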
18 Hash Table Based Similarity Search. The query pattern q_i is encoded to a hash key h_i (λ bits). The neighbors (ngbr) of h_i are enumerated. The neighbors are the λ-bit encodings of the segments that are within a given edit distance of q_i. Once a neighbor is enumerated, the segments in its bucket are retrieved and stored as candidates.
19 c-Trees Based Similarity Search. Candidates are further verified by the c-trees. The c-signature of the query q is divided into c-signature strings. The algorithm retrieves each segment s that satisfies the range constraint. During query processing, the required quantities are computed for each leaf in the tree T_1.
20 Space and Time Complexity. Space complexity of the HT: one term for the table head and one for the buckets of the table, which together give the total space complexity of the hash structure. The slide also gives the time complexity for a query and the space complexity of each tree.
21 Question (Bogdan). I have trouble understanding the graphic in Fig. 2(a). My intuition would tell me that the more common q-grams exist in the 2 sequences, the higher the probability of finding a high score alignment between them. However, the figure seems to show the opposite: as the nr. of q-grams increases, the probability decreases. I've obviously got something mixed up here, but I can't figure out what it is. Could you please explain?
23 Answer (Bogdan). Sensitivity can be measured by the probability that a high score alignment is found by the algorithm. The graph starts with a probability of almost 1 when we have only 1 common q-gram; if we increase the number of q-grams, the probability (sensitivity) of matching the alignment will surely decrease.