
1 HmSearch: An Efficient Hamming Distance Query Processing Algorithm
Xiaoyang Zhang 1, Jianbin Qin 1, Wei Wang 1, Yifang Sun 1, Jiaheng Lu 2
1 University of New South Wales, Australia
2 Renmin University of China, China

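2 Motivation (School of Computer Science and Engineering)
Identify near-duplicate webpages: simhash turns each page into a code, and similar pages get similar codes, e.g. 0012345679ABCDEF and 1012345679ABCDEF.
Similar chemical data: each record maps into a binary code, and similar records get similar codes, e.g. 012345679ABCDEF0 and 012345679ABCDEF1.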

3 More Applications
–Iris recognition
–Image retrieval
–C2LSH

4 Outline
Problem Definition
Framework
HmSearch
–Partitioning Scheme
–Signature Generation
–Enhanced Filtering
–Hierarchical Filtering and Verification
–Dimension Rearrangement
Conclusion
Experiment

5 Hamming Distance Query
Hamming distance: the number of positions at which the corresponding symbols differ between two equal-length vectors.
Example: q = ABCD, v = ACCD, so hd(q, v) = 1.
Hamming distance query: given a database V of vectors, a query vector Q (all vectors have the same dimensionality N), and a Hamming distance threshold k, find all v_i in V such that hd(v_i, Q) <= k.
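The definition above can be written down directly. The following is a minimal sketch of a brute-force baseline, not the HmSearch algorithm itself; the function names `hamming_distance` and `linear_scan_query` are chosen here for illustration.

```python
from typing import List, Sequence


def hamming_distance(a: Sequence, b: Sequence) -> int:
    """Count the positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return sum(x != y for x, y in zip(a, b))


def linear_scan_query(database: List[Sequence], query: Sequence, k: int) -> List[Sequence]:
    """Return every vector v in the database with hd(v, query) <= k."""
    return [v for v in database if hamming_distance(v, query) <= k]


# The slide's example: q = ABCD and v = ACCD differ in one position.
print(hamming_distance("ABCD", "ACCD"))
print(linear_scan_query(["ACCD", "AXYD", "ABCD"], "ABCD", 1))
```

A linear scan compares the query against every vector, which is the cost the partitioning and filtering steps in the later slides aim to avoid.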

7 Basic Idea
General framework:
1. We can answer k=1 queries efficiently (shown later).
2. So we transform a problem with larger k into several small k=1 problems by partitioning.
3. We filter by looking at each partition.
4. We verify the remaining candidates at the end.
Example with two halves: if hd(q, v) <= 1, then hd(q_left, v_left) = 0 or hd(q_right, v_right) = 0.
v = 1111, q = 1211: the right halves are the same, so v stays a candidate.
v = 1111, q = 1221: neither half is the same, so for k=1 v can be filtered by looking at each part.
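A minimal sketch of the two-half check for k=1, using the two examples on this slide. The helper names are illustrative only, and the split point (the middle of the vector) is an assumption.

```python
def split_halves(vec):
    """Split a vector into a left and a right half (assumed split: the middle)."""
    mid = len(vec) // 2
    return vec[:mid], vec[mid:]


def passes_half_filter(q, v):
    """For k=1: hd(q, v) <= 1 implies that at least one half is identical.

    Returning False therefore proves hd(q, v) >= 2; returning True only
    means v is a candidate that still needs verification.
    """
    q_left, q_right = split_halves(q)
    v_left, v_right = split_halves(v)
    return q_left == v_left or q_right == v_right


print(passes_half_filter("1211", "1111"))  # right halves match: candidate
print(passes_half_filter("1221", "1111"))  # no half matches: filtered
```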

8 Framework
Indexing: Data → Partitioning → Generating Signatures → Index
Query: Query → Partitioning → Candidates0 → Filtering → Candidates1 → Verification → Results
Techniques:
–General Partitioning Scheme
–1-variants and 1-deletion variants
–Enhanced Filtering
–Hierarchical Filtering and Verification
–Dimension Rearrangement

10 Partitioning
Lower bound for the partition strategy: given q and v such that hd(q, v) <= k, if the N dimensions are divided into κ parts, there must be at least m partitions such that hd(q_part, v_part) <= k'. [The formulas for m and k' are missing from this transcript.]
In our algorithm, we choose κ = [formula missing from this transcript], which gives:
–When k is even, m = 1
–When k is odd, m = 2
–When k = 0 or 1, m = 1 and hd(q_part, v_part) = 0
–When k >= 2, hd(q_part, v_part) <= 1
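The counting argument behind this lower bound can be sketched as follows. Every partition with more than k' mismatches consumes at least k'+1 of the k allowed mismatches, so at most floor(k / (k'+1)) partitions can exceed k'. The slide's own formulas did not survive extraction, so the partition count `(k + 3) // 2` used below is an assumption; it was picked because it reproduces the even and odd cases listed on the slide. Both function names are hypothetical.

```python
def guaranteed_partitions(k: int, kappa: int, k_prime: int) -> int:
    """Minimum number of partitions with hd(q_part, v_part) <= k_prime
    whenever hd(q, v) <= k and the dimensions are split into kappa parts."""
    # A "bad" partition has at least k_prime + 1 mismatches.
    max_bad = k // (k_prime + 1)
    return max(0, kappa - max_bad)


def assumed_partition_count(k: int) -> int:
    """ASSUMPTION: kappa = floor((k + 3) / 2); not taken from the transcript."""
    return (k + 3) // 2


for k in range(2, 8):
    kappa = assumed_partition_count(k)
    print(k, kappa, guaranteed_partitions(k, kappa, 1))
```

Under this assumption the loop gives m = 1 for even k and m = 2 for odd k with a per-partition threshold of 1, and for k = 0 or 1 a threshold of 0 still guarantees one exactly matching partition.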

12 Signature Generation
1-variants: substitute each dimension with each domain value, one dimension at a time (plus the vector itself).
v=[1, 2, 3] and Σ (domain) =[1, 2, 3]
1-val(v)=[1, 2, 3], [2, 2, 3], [3, 2, 3], [1, 1, 3], [1, 3, 3], [1, 2, 1], [1, 2, 2]
We index all 1-val(v), and when q comes in, we search q in the index.
OR
1-deletion-variants: substitute each dimension with ‘#’, one dimension at a time.
v=[1, 2, 3]
1-del-val(v)=[#, 2, 3], [1, #, 3], [1, 2, #]
We index all 1-del-val(v), and when q comes in, we generate 1-del-val(q) and search all of 1-del-val(q) in the index.
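Both signature schemes can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names, the use of tuples as signatures, and the dictionary-based index are assumptions made here.

```python
from collections import defaultdict


def one_variants(v, domain):
    """All vectors within Hamming distance 1 of v (including v itself)."""
    v = tuple(v)
    variants = {v}
    for i in range(len(v)):
        for value in domain:
            variants.add(v[:i] + (value,) + v[i + 1:])
    return variants


def one_deletion_variants(v):
    """Replace each dimension with '#', one dimension at a time."""
    v = tuple(v)
    return {v[:i] + ("#",) + v[i + 1:] for i in range(len(v))}


def build_deletion_index(vectors):
    """Map each 1-deletion variant to the ids of the vectors that produce it."""
    index = defaultdict(set)
    for vid, v in enumerate(vectors):
        for sig in one_deletion_variants(v):
            index[sig].add(vid)
    return index


def lookup_within_one(index, q):
    """Ids of indexed vectors v with hd(q, v) <= 1."""
    hits = set()
    for sig in one_deletion_variants(q):
        hits |= index.get(sig, set())
    return hits


print(sorted(one_variants([1, 2, 3], [1, 2, 3])))
index = build_deletion_index([(1, 2, 3), (1, 1, 3), (3, 3, 3)])
print(lookup_within_one(index, (1, 2, 3)))
```

The trade-off visible here is that 1-variants grow with the domain size but need a single lookup per query, while 1-deletion variants stay at one signature per dimension but need one lookup per dimension at query time.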

14 Enhanced Filter (Even)
Based on the formula before: when k (k>=1) is even, m = 1.
Example (Dim 1 2 3 4 5 6): v = 123456, q = 121423.
If k = 2, based on the formula before, m = 1 and one partition has hd(v_part, q_part) = 1, so this v becomes a false positive.
However, we find that if k (k>=1) is even, v is qualified only in two situations:
1) m = 1, where hd(v_part, q_part) = 0
2) m = 2, where hd(v_part, q_part) <= 1
Using the enhanced filter, neither situation applies, so v is filtered.

15 Enhanced Filter (Odd)
Based on the formula before: when k (k>=1) is odd, m = 2.
Example (Dim 1 2 3 4 5 6): v = 123456, q = 111423.
If k = 3, based on the formula before, m = 2 and two partitions have hd(v_part, q_part) = 1, so this v becomes a false positive.
However, we find that if k (k>=1) is odd, v is qualified only in two situations:
1) m = 2, where hd(v_part, q_part) <= 1 and at least one of them = 0
2) m = 3, where hd(v_part, q_part) <= 1
Using the enhanced filter, neither situation applies, so v is filtered.
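The even and odd rules can be compared against the basic rule in a short sketch. Several things here are assumptions: the digit strings from the two examples are read as v = 123456 with q = 121423 (even case) and q = 111423 (odd case), the partitions are taken to be equal-length (two parts for k=2, three parts for k=3), and all function names are hypothetical.

```python
def partition_distances(q, v, kappa):
    """Hamming distance of each of kappa equal-length partitions.
    ASSUMPTION: len(q) is divisible by kappa."""
    size = len(q) // kappa
    return [
        sum(a != b for a, b in zip(q[i * size:(i + 1) * size],
                                   v[i * size:(i + 1) * size]))
        for i in range(kappa)
    ]


def basic_filter(dists, k):
    """Basic rule: m partitions with distance <= 1 (m = 1 if k even, else 2)."""
    m = 1 if k % 2 == 0 else 2
    return sum(d <= 1 for d in dists) >= m


def enhanced_filter(dists, k):
    """Enhanced rule from the two 'Enhanced Filter' slides."""
    exact = sum(d == 0 for d in dists)   # partitions that match exactly
    near = sum(d <= 1 for d in dists)    # partitions with at most one mismatch
    if k % 2 == 0:
        return exact >= 1 or near >= 2
    return (near >= 2 and exact >= 1) or near >= 3


even = partition_distances("121423", "123456", 2)
odd = partition_distances("111423", "123456", 3)
print(even, basic_filter(even, 2), enhanced_filter(even, 2))
print(odd, basic_filter(odd, 3), enhanced_filter(odd, 3))
```

In both examples the basic rule keeps v as a candidate although hd(q, v) exceeds k, while the enhanced rule rejects it without computing the full distance.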

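17 Hierarchical Filtering and Verification
We can use binary operations to do hierarchical filtering and verification.
|Σ| = 8, so each value takes 3 bits; each significant bit is stored as one slice with one bit per dimension.
v=[5, 0, 3, 6]: 1st significant bit 1001, 2nd 0011, 3rd 1010
q=[5, 2, 2, 5]: 1st significant bit 1001, 2nd 0110, 3rd 1001
diff (XOR of the slices): 3rd 0011, 2nd 0101, 1st 0000
OR of the diffs: 0111, so hd(v, q) = 3
With k=1: the 3rd-bit diff already contains two 1s, so hd(v, q) >= 2 and v is filtered.
Moreover, even if k=4, a dimension-by-dimension scan needs 4 comparisons to calculate hd(v, q) = 3.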

18 Hierarchical Filtering and Verification
v=[5, 0, 3, 6]: 1st significant bit 1001, 2nd 0011, 3rd 1010
q=[5, 2, 3, 5]: 1st significant bit 1001, 2nd 0110, 3rd 1011
3rd bit: diff (XOR) = 0001, cumdiff (OR) = 0001, number of 1s in cumdiff = 1 (<=1, continue)
2nd bit: diff (XOR) = 0101, cumdiff (OR) = 0101, number of 1s in cumdiff = 2 (>1, filtered)
1st bit: diff (XOR) = 0000
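The XOR/OR procedure on these two slides can be sketched as below. Storing each bit slice as a Python integer, processing the least significant slice first, and the function names are assumptions made for this illustration.

```python
def bit_slices(vec, bits):
    """Return one integer per bit position; slice b packs bit b of every
    dimension, with the first dimension in the highest position.
    slices[0] is the least significant bit slice."""
    slices = []
    for b in range(bits):
        word = 0
        for value in vec:
            word = (word << 1) | ((value >> b) & 1)
        slices.append(word)
    return slices


def hierarchical_check(v_slices, q_slices, k):
    """Accumulate XOR diffs with OR, one slice at a time.

    Returns None as soon as more than k dimensions are known to differ
    (filtered); otherwise returns the exact Hamming distance."""
    cumdiff = 0
    for vs, qs in zip(v_slices, q_slices):
        cumdiff |= vs ^ qs
        if bin(cumdiff).count("1") > k:
            return None
    return bin(cumdiff).count("1")


v = bit_slices([5, 0, 3, 6], 3)
print(hierarchical_check(v, bit_slices([5, 2, 3, 5], 3), 1))  # filtered at the 2nd slice
print(hierarchical_check(v, bit_slices([5, 2, 2, 5], 3), 4))  # full distance
```

A dimension differs if any of its bits differ, so the number of 1s in the running OR is a lower bound on the Hamming distance after every slice and equals it after the last one.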

20 Impact of Data Skewness
Given k=2, then m = 1 and k’=1.
Original dimension order (Dim 1 2 3 4 5 6), with v3 = 202000 and v4 = 300000: all vectors are qualified.
Rearranged dimension order (Dim 1 2 5 4 3 6), with v3 = 200020 and v4 = 300000: only v1 is qualified.
[Tables of q and v1 to v4 split into Partition1 and Partition2; the rows for q, v1 and v2 did not survive extraction.]
We propose to reset the dimension order and partition length to improve performance.

21 Greedy Dimension Rearrangement
MaxFreq is the maximum frequency of any value in each dimension.
Our goal: minimize the global MaxFreq.
Original dimension order (Dim 1 2 3 4 5 6), with v3 = 202000 and v4 = 300000. MaxFreq for Dim: 133344
Rearranged dimension order (Dim 5 1 2 6 3 4), with v3 = 020020 and v4 = 030000. MaxFreq for partition: 441211. The goal is achieved.
[Tables of v1 to v4 split into Partition1 and Partition2; the rows for v1 and v2 did not survive extraction.]
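The per-dimension MaxFreq statistic defined on this slide is simple to compute. The sketch below covers only that statistic, not the greedy rearrangement itself, whose steps are not spelled out in the transcript. The sample vectors are hypothetical and are not the slide's table.

```python
from collections import Counter


def max_freq_per_dimension(vectors):
    """For each dimension, the frequency of its most common value."""
    n = len(vectors[0])
    return [max(Counter(v[d] for v in vectors).values()) for d in range(n)]


# Hypothetical data: dimension 5 holds the same value in every vector,
# so it cannot tell any two vectors apart.
sample = ["110000", "110102", "202000", "300000"]
print(max_freq_per_dimension(sample))
```

A dimension with a high MaxFreq is highly skewed: many vectors share its most common value, so a partition built only from such dimensions filters poorly.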

23 Conclusion
1. General partition scheme
2. 1-variants and 1-deletion-variants
3. Techniques that help boost the performance
–Enhanced Filtering
–Hierarchical Filtering and Verification
–Dimension Rearrangement

25 Experiment Settings
Environment
–Intel Xeon X3330 2.664GHz CPU, 4GB RAM
–Debian 5.0.6
–AMD Opteron™ 8378 2.4GHz CPU, 96GB RAM (for PubChem)
–Ubuntu/Linaro 4.6.4-1ubuntu5
–All compiled with GCC 4.1.2 with -O3
Dataset

26 Experiment Settings
Terms
–EF, Enhanced Filtering
–HB, Hierarchical Binary Filter
–RD, Rearranging Dimensions
Our algorithms
1. HSD, HSV: our proposed algorithms, the former using 1-deletion-variants as signatures and the latter using 1-variants as signatures
2. HSD-nEB, HSV-nEB: variations that remove EF and HB
3. HSD-nB, HSV-nB: variations that remove HB
4. HSD-nR, HSV-nR: variations that remove RD
Baseline algorithm
1. Scancount (Li et al., ICDE08)
State-of-the-art algorithms
1. Google (Manku et al., WWW07)
2. HEngine (Liu et al., ICDE11)

27 Query Time: HSV has the best performance.

28 Candidate Size: HSV has the smallest candidate size.

29 Effect of EF and HB: EF and HB help improve the performance.

30 Effect of RD: RD boosts the performance for the PubChem data.

31 Index Size: HSV and HSD have a larger index size.

32 Thank you

