1 Cache-Conscious Performance Optimization for Similarity Search. Maha Alabduljalil, Xun Tang, Tao Yang. Department of Computer Science, University of California at Santa Barbara. 36th ACM International Conference on Information Retrieval (SIGIR'13).

2 All Pairs Similarity Search (APSS)
Definition: finding all pairs of objects whose similarity is above a given threshold:
Sim(d_i, d_j) = cos(d_i, d_j) ≥ τ
Application examples: collaborative filtering, spam and near-duplicate detection, image search, query suggestions.
Motivation: APSS is still time-consuming for large datasets.
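The definition above can be made concrete with a brute-force sketch (function names and the dict-based sparse-vector representation are illustrative assumptions, not from the paper):

```python
import math

def cosine(d1, d2):
    # d1, d2: sparse vectors as {term: weight} dicts.
    # Normalization is done here so the example is self-contained.
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2)

def apss_bruteforce(docs, tau):
    """Return all pairs (i, j) with cos(d_i, d_j) >= tau."""
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(docs[i], docs[j]) >= tau:
                pairs.append((i, j))
    return pairs
```

The quadratic pair loop is exactly why APSS is expensive for large datasets, which motivates the filtering and partitioning work surveyed next.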

3 Previous Work
Approaches to speed up APSS:
Exact APSS:
– Dynamic computation filtering [Bayardo et al. WWW'07]
– Inverted indexing [Arasu et al. VLDB'06]
– Parallelization with MapReduce [Lin SIGIR'09]
– Partition-based similarity comparison [Alabduljalil et al. WSDM'13]
Approximate APSS via LSH: trades off precision and recall, and adds redundant computation.
Approaches that utilize the memory hierarchy:
– General query processing [Manegold VLDB'02]
– Other computing problems.

4 Baseline: Partition-based Similarity Search (PSS) [WSDM'13]
Partitioning with dissimilarity detection; similarity comparison with parallel tasks.

5 PSS Task
Memory areas: S = vectors owned by the task, B = buffer for other vectors, C = temporary results.
Task steps:
Read the assigned partition into area S.
Repeat:
– Read some vectors v_i from other partitions into B.
– Compare v_i with S.
– Output similar vector pairs.
Until all other potentially similar vectors have been compared.
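The task steps above can be sketched as follows (a minimal illustration assuming L2-normalized {term: weight} dict vectors; the function name and data layout are assumptions, not the paper's implementation):

```python
def pss_task(S, other_partitions, tau):
    # S: list of (doc_id, vector) owned by this task (area S).
    # other_partitions: iterable of vector blocks read into the buffer (area B).
    results = []
    for block in other_partitions:       # repeat: read vectors into B
        for j_id, vj in block:
            for i_id, vi in S:           # compare v_j against all of S
                # partial dot product accumulates in a temporary (area C)
                score = sum(w * vj.get(t, 0.0) for t, w in vi.items())
                if score >= tau:
                    results.append((i_id, j_id, score))
    return results
```

Note that every vector in B is compared against all of S; when S is larger than the cache, each comparison re-fetches S from memory, which is the problem the next slides address.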

6 Focus and Contribution
Contribution:
Analyze memory-hierarchy behavior of PSS tasks.
New data layout/traversal techniques for speedup:
① Splitting data blocks to fit the cache.
② Coalescing: read a block of vectors from other partitions and process them together.
Algorithms:
Baseline: PSS [WSDM'13].
Cache-conscious designs: PSS1 & PSS2.

7 Problem 1: PSS area S is too big to fit in cache.
[Figure: memory layout with the inverted index of vectors in S, other vectors in B, and the accumulator for S in C; S is too long to fit in cache.]

8 PSS1: Cache-conscious data splitting.
[Figure: after splitting, S is divided into splits S1, S2, …, Sq, each with its own accumulator in C, compared against B. What split size should be used?]

9 PSS1 Task
Read S and divide it into many splits.
Read other vectors into B.
For each split S_x: Compare(S_x, B).
Output similarity scores.

Compare(S_x, B):
for d_i in S_x:
  for d_j in B:
    for each shared feature t: sim(d_i, d_j) += w_{i,t} * w_{j,t}
    if sim(d_i, d_j) + maxw(d_i) * sum(d_j) < τ then skip d_j
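A runnable sketch of the PSS1 task above (helper names, the dict-based vectors, and the assumption of nonnegative weights are mine; `maxw(d_i)` is the largest weight of d_i and `sum(d_j)` the sum of d_j's not-yet-accumulated weights, so the test is a safe upper-bound prune):

```python
def compare_split(Sx, B, tau):
    """Compare one split Sx against buffer B with the upper-bound prune:
    sim + maxw(d_i) * remaining_sum(d_j) < tau  =>  the pair cannot reach tau."""
    results = []
    for i_id, di in Sx:
        maxw_i = max(di.values())
        for j_id, dj in B:
            sim = 0.0
            remaining = sum(dj.values())   # weights of d_j not yet accumulated
            pruned = False
            for t, wj in dj.items():
                remaining -= wj
                sim += di.get(t, 0.0) * wj
                if sim + maxw_i * remaining < tau:
                    pruned = True          # even the best case cannot reach tau
                    break
            if not pruned and sim >= tau:
                results.append((i_id, j_id, sim))
    return results

def pss1_task(S, B, split_size, tau):
    # Divide S into splits small enough to stay cache-resident while
    # scanning B (the cache-conscious part of PSS1).
    results = []
    for start in range(0, len(S), split_size):
        results.extend(compare_split(S[start:start + split_size], B, tau))
    return results
```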

10 Modeling Memory/Cache Access of PSS1
Data accesses during Compare(S_x, B) touch area S_i (weights w_{i,t}), area B (weights w_{j,t}), and area C (the accumulators sim(d_i, d_j) and the pruning test sim(d_i, d_j) + maxw(d_i) * sum(d_j) < τ).
Total number of data accesses: D_0 = D_0(S_i) + D_0(B) + D_0(C).

11 Cache misses and data access time
Memory and cache access counts:
D_0: total memory data accesses.
D_1: accesses missed at L1; D_2: missed at L2; D_3: missed at L3.
Memory and cache access time:
δ_i: access time at cache level i; δ_mem: access time of memory.
Total data access time = (D_0 − D_1)δ_1 + (D_1 − D_2)δ_2 + (D_2 − D_3)δ_3 + D_3 δ_mem.
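The cost model above is straightforward to evaluate; a small sketch (the function name and the example counts/latencies are illustrative assumptions, chosen within the cycle ranges quoted on the following slides):

```python
def data_access_time(D, delta, delta_mem):
    # D = [D0, D1, D2, D3]: total accesses, and misses at L1, L2, L3.
    # delta = [d1, d2, d3]: hit latencies (cycles) at L1, L2, L3.
    # delta_mem: main-memory latency (cycles).
    D0, D1, D2, D3 = D
    return ((D0 - D1) * delta[0]      # accesses served by L1
            + (D1 - D2) * delta[1]    # served by L2
            + (D2 - D3) * delta[2]    # served by L3
            + D3 * delta_mem)         # served by memory

# Example: 1000 accesses, 100 L1 misses, 10 L2 misses, 1 L3 miss,
# with assumed latencies of 2, 8, 35 cycles and 200 cycles for memory.
cycles = data_access_time([1000, 100, 10, 1], [2, 8, 35], 200)
```

The model makes the payoff of cache-conscious layout explicit: shifting misses from D_3 (hundreds of cycles each) to D_1 hits (a couple of cycles) dominates total access time.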

12–15 Total data access time
Total data access time = (D_0 − D_1)δ_1 + (D_1 − D_2)δ_2 + (D_2 − D_3)δ_3 + D_3 δ_mem
Typical latencies: data found in L1 ≈ 2 cycles; in L2 ≈ 6–10 cycles; in L3 ≈ 30–40 cycles; in memory ≈ 100–300 cycles.

16 Actual vs. Predicted
Avg. task time ≈ #features × (lookup + multiply + add) + memory access time.

17 Recall: split size s
[Figure: S divided into splits S1, …, Sq of size s, each with its accumulator in C, compared against B.]

18 Ratio of Data Access to Computation
Avg. task time ≈ #features × (lookup + add + multiply) + memory access time.
[Figure: ratio of data access time to computation time as a function of split size s.]

19 PSS2: Vector coalescing
Issues: PSS1 focuses on splitting S to fit the cache; it does not exploit cache reuse to improve temporal locality in memory areas B and C.
Solution: coalesce multiple vectors in B.

20 PSS2: Example of improved locality
[Figure: split S_i compared against a coalesced group of vectors from B; the striped areas stay resident in cache.]

21 Evaluation
Implementation: Hadoop MapReduce.
Objectives: effectiveness of PSS1 and PSS2 over PSS; benefits of modeling.
Datasets: Twitter, ClueWeb, Enron emails, Yahoo! Music, Google News.
Preprocessing: stopword removal + df-cut; static partitioning for dissimilarity detection.

22 Improvement ratio of PSS1, PSS2 over PSS
[Figure: speedup of PSS1 and PSS2 over PSS across datasets; up to 2.7x.]

23 Recall: coalescing size b
[Figure: a coalesced group of b vectors from B compared against split S_i; average number of shared features = 2.]

24 Average number of shared features

25 Overall performance

26 ClueWeb

27 Impact of split size s in PSS1
[Figure: results on ClueWeb, Twitter, and Emails datasets.]

28 Recall: split size s & coalescing size b
[Figure: split S_i of size s compared against a coalesced group of b vectors from B.]

29 Effect of s & b on PSS2 performance (Twitter)
[Figure: the fastest configuration is highlighted.]

30 Conclusions
Splitting hosted partitions to fit in cache reduces slow memory data accesses (PSS1).
Coalescing vectors with size-controlled inverted indexing improves the temporal locality of visited data (PSS2).
Cost modeling of memory-hierarchy access provides guidance for optimizing parameter settings.
Experiments show the cache-conscious designs can be up to 2.74x as fast as the cache-oblivious baseline.

