Efficient Similarity Search with Cache-Conscious Data Traversal


1 Efficient Similarity Search with Cache-Conscious Data Traversal
Xun Tang. Committee: Tao Yang (Chair), Divy Agrawal, Xifeng Yan. March 16, 2015

2 Roadmap
Similarity search background. Partition-based method background. Three main components of the thesis: partition-based symmetric comparison and load balancing [SIGIR'14a]; fast runtime execution considering the memory hierarchy [SIGIR'13 + Journal]; optimized search result ranking with cache-conscious traversal [SIGIR'14b]. Conclusion. Feel free to ask questions; due to the time limit, more time is spent on material not covered in the proposal talk.

3 Similarity Search for Big Data
Finding pairs of data objects with a similarity score above a threshold. Example applications: document clustering, near-duplicate detection, spam detection, query suggestion, advertisement fraud detection, collaborative filtering and recommendation. Data processing is very slow for large datasets; how can it be made fast and scalable? Entity resolution is the problem of determining which records in a database refer to the same entities (also known as deduplication or record linkage).

4 Applications: Duplicate Detection & Clustering
SpotSigs (systematic selection of shingles): SIGIR'09, "SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections", Theobald et al., Stanford. Example document vector: d2 = (1, 2, 2, 0, 0, 0, 4, 3, 2, 7).

5 All-Pairs Similarity Search (APSS)
Dataset: n normalized vectors. Cosine-based similarity: given n normalized vectors, compute all pairs (di, dj) such that sim(di, dj) = di · dj ≥ τ, where τ is the similarity threshold. Quadratic complexity O(n²). Approximate methods do not guarantee full recall. Top-k search is used when the threshold is unknown. (A brute-force sketch follows.)
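To make the APSS definition concrete, here is a minimal brute-force sketch in Python (illustrative only, not the thesis implementation); it assumes the rows are already L2-normalized, so cosine similarity reduces to a dot product, and it exhibits the O(n²) pair enumeration that the rest of the talk works to avoid.

import numpy as np

def brute_force_apss(vectors, threshold):
    """Return all pairs (i, j), i < j, whose cosine similarity >= threshold.
    Assumes each row of `vectors` is already L2-normalized, so the cosine
    similarity is just the dot product. Quadratic in the number of vectors."""
    n = len(vectors)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            score = float(np.dot(vectors[i], vectors[j]))
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs

# Example: four vectors, threshold 0.9.
docs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.9, 0.1]])
docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
print(brute_force_apss(docs, 0.9))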

6 Big Data Challenges for Similarity Search
Sequential time (hours). APSS is a challenging problem. Running it sequentially on a single core, for various sizes of three different datasets on both AMD and Intel processors (values marked * are estimated by sampling): 4M tweets fit in memory but take days to process on a single core. Approximated processing with a df-limit [Lin SIGIR'09] (remove features whose document frequency exceeds an upper limit) reduces the time, but processing remains slow (around 800 hours) when the data size is large, since the complexity of APSS grows quadratically with data size.

7 Communication overhead
Inverted indexing and parallel score accumulation for APSS [Lin SIGIR'09; Baraglia et al. ICDM'10]. The figure shows an inverted index representation of the data vectors: each feature's posting list (f1, f3, f5) stores the weights of the vectors containing that feature. In the MapReduce formulation, each mapper computes partial scores and distributes them to reducers for score merging; for example, the partial results of vectors d2 and d4 must be added together to obtain sim(d2, d4). This score merging introduces significant communication overhead. (A sketch of the idea follows.)
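As a hedged sketch of the score-accumulation idea on this slide (function and variable names are illustrative, not from the paper's code): each feature's posting list yields partial products, and partial scores for the same pair must later be merged, which is exactly the traffic that causes the communication overhead.

from collections import defaultdict
from itertools import combinations

def build_inverted_index(vectors):
    """vectors: dict doc_id -> {feature: weight}. Returns feature -> postings."""
    index = defaultdict(list)
    for doc_id, features in vectors.items():
        for feature, weight in features.items():
            index[feature].append((doc_id, weight))
    return index

def accumulate_scores(index):
    """'Map': emit partial products per feature; 'reduce': merge them per pair."""
    scores = defaultdict(float)
    for postings in index.values():
        for (di, wi), (dj, wj) in combinations(postings, 2):
            pair = (min(di, dj), max(di, dj))
            scores[pair] += wi * wj      # partial result; merging is the shuffle cost
    return dict(scores)

vectors = {"d2": {"f1": 0.5, "f3": 0.6, "f5": 0.6},
           "d4": {"f1": 0.7, "f3": 0.7}}
print(accumulate_scores(build_inverted_index(vectors)))   # roughly {('d2', 'd4'): 0.77}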

8 Parallel Solutions for Exact APSS
Parallel score accumulation [Lin SIGIR'09; Baraglia et al. ICDM'10] versus Partition-based Similarity Search (PSS) [Alabduljalil et al. WSDM'13], our previous work from 2013, which is over 25x faster. PSS has two phases. The first phase divides the dataset into a set of partitions; during this process, the dissimilarity among partitions is identified. The second phase assigns a partition to each task, and each task compares its partition with the other potentially similar partitions. In the left part of the figure, partition P1 is dissimilar to P3 and Pn and potentially similar to P2 and P4; Task 1 is assigned P1 and compares P1 with P2 and P4.

9 Parallel time comparison: PSS vs. Parallel Score Accumulation
Inverted indexing with partial result parallelism (red line) versus PSS on the Twitter dataset: PSS is 25x faster. We adopt the PSS approach because it has simpler parallelism management and no shuffling between mappers and reducers.

10 PSS: Partition-based similarity search
Key techniques: Partition-based symmetric comparison and load balancing [SIGIR'14a]. The challenge comes from the skewed distribution of partition sizes and the irregular dissimilarity relationships in large datasets. Analysis of competitiveness to the optimum. Scalable to large datasets on hundreds of cores. Fast runtime execution considering the memory hierarchy [SIGIR'13 + Journal], which further improves PSS.

11 Symmetry of Comparison
Partition-level comparison is symmetric. Example: should Pi compare with Pj, or Pj with Pi? The choice of comparison direction affects the communication and computation load of the corresponding tasks, and thus load balance. The focus here is balancing load not among processors but among comparison tasks; the goal is to balance the load among tasks in order to further accelerate partition-based search. Why is this a load-balancing problem? When an edge points to Pi, the task that holds Pi takes on the communication load of transferring Pj and the computation load of comparing the two. Partition sizes are unchanged, but the task cost increases.

12 Similarity Graph → Comparison Graph
Draw the partition-level similarity relationships as a graph; for example, the upper-left edge indicates that P1 and P2 are similar. Which task should be responsible for comparing P1 and P2? If P1 is responsible, the edge points toward P1. The result is recorded in a comparison graph: the algorithm takes the similarity graph (left) as input and generates the comparison graph (right) as output, and the output is good if its load is balanced. The load assignment process is thus a transition from the similarity graph to the comparison graph.

13 Load Balance Measurement & Examples
Load balance metric: graph cost = max(task cost). A task's cost is the sum of its self-comparison cost (computation and I/O) and the cost of comparisons with the partitions whose edges point to it. For a directed comparison graph, this quantifies how balanced the workload is; the goal is to lower the graph cost, defined as the maximum task cost in the graph. Example: in the left figure the graph cost is 86.7, the maximum task cost (the paper gives the details of how a task cost is calculated). Can we do better? Flipping the directions of all edges produces another directed graph whose cost is 67.1. (A toy cost computation follows.)
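A toy computation of the graph-cost metric, under stated assumptions (the cost constants are illustrative placeholders, not the paper's calibrated model): each directed edge src -> dst charges the task owning dst the I/O of fetching src plus the comparison of the two partitions, and the graph cost is the maximum task cost.

def task_costs(sizes, edges, io_factor=0.1):
    """sizes: partition -> size. edges: (src, dst) means the task owning `dst`
    fetches `src` and compares with it. Cost constants are illustrative."""
    costs = {p: s * s for p, s in sizes.items()}        # self comparison
    for src, dst in edges:
        costs[dst] += io_factor * sizes[src]            # I/O to fetch src
        costs[dst] += sizes[src] * sizes[dst]           # comparison work
    return costs

def graph_cost(sizes, edges):
    return max(task_costs(sizes, edges).values())

sizes = {"P1": 3, "P2": 5, "P3": 4}
print(graph_cost(sizes, [("P2", "P1"), ("P3", "P1")]))  # P1 absorbs both edges
print(graph_cost(sizes, [("P1", "P2"), ("P3", "P2")]))  # P2 absorbs both edges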

14 Challenges of Optimal Load Balance
Skewed distribution of node connectivity and node sizes. First, the similarity relationship among partitions is highly irregular: as shown in the figure, some partitions such as P2 and P3 are highly connected while P1 is not. Second, partition sizes are heavily skewed; the table shows empirical data on the skew over three datasets. For Twitter, for example, the largest partition is almost 6x larger than the average partition. As a result, it is hard to achieve optimal balance among tasks. In general, the scheduling problem is NP-complete even when task costs are given; in our case the task costs are not known until the comparison directions are determined, which makes the problem even more challenging.

15 Two-Stage Load Balance
Stage 1: initial assignment of edge directions. Key idea: tasks with small partitions or low connectivity should absorb more load. Example: assume P1 is smaller than P2 and P3. Point the edges from P2 to P1 and from P3 to P1, so P1 absorbs the load of the similarity comparisons between P1 and P2 and between P1 and P3 (absorption steps). The algorithm optimizes a sequence of absorption steps that balances the load.

16 Stage 2: Assignment refinement
Key idea: gradually shift the load of heavy tasks to their lightest neighbors, reversing an edge direction only if it is beneficial. Stage 1 may leave some tasks carrying an excessive amount of load, especially in a dense graph; that is why the algorithm has a second stage. Starting from the result of Stage 1, a refinement step finds the highest-cost task (81.6) and picks one of its incoming neighbors (P1 or P5) to reverse the edge; the greedy choice is the neighbor with the lowest cost. The edge direction is reversed only if beneficial: here the flip is kept and the graph cost drops from 81.6 to 67.1. The refinement then continues with the highest-cost task of the updated graph, then the second highest, and so forth. (A sketch of both stages follows.)
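Here is a minimal, self-contained sketch of the two-stage structure (a simplified reconstruction of the algorithm as described on these slides, with an illustrative cost model; it is not the published implementation): Stage 1 orients each similarity edge toward the smaller partition, and Stage 2 repeatedly tries to reverse an incoming edge of the currently heaviest task, keeping a flip only when the graph cost drops.

def task_costs(sizes, edges):
    """Illustrative task cost: self comparison plus work and I/O for every
    partition whose edge points to this task."""
    costs = {p: s * s for p, s in sizes.items()}
    for src, dst in edges:
        costs[dst] += sizes[src] * sizes[dst] + 0.1 * sizes[src]
    return costs

def stage1_orient(sizes, similar_pairs):
    """Stage 1: point each edge toward the smaller partition so it absorbs the load."""
    return [(b, a) if sizes[a] <= sizes[b] else (a, b) for a, b in similar_pairs]

def stage2_refine(sizes, edges, max_rounds=100):
    """Stage 2: flip an incoming edge of the heaviest task if that lowers the
    graph cost (the maximum task cost); stop when no flip helps."""
    edges = list(edges)
    for _ in range(max_rounds):
        costs = task_costs(sizes, edges)
        heavy = max(costs, key=costs.get)
        best = max(costs.values())
        improved = False
        for e in [x for x in edges if x[1] == heavy]:
            trial = [x for x in edges if x != e] + [(e[1], e[0])]
            if max(task_costs(sizes, trial).values()) < best:
                edges, improved = trial, True
                break
        if not improved:
            break
    return edges

sizes = {"P1": 2, "P2": 3, "P3": 3, "P4": 3}
edges = stage1_orient(sizes, [("P1", "P2"), ("P1", "P3"), ("P1", "P4")])
refined = stage2_refine(sizes, edges)
print(refined, max(task_costs(sizes, refined).values()))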

17 Competitive to Optimal Task Load Balancing
Is this two-stage algorithm competitive with the optimum? Here the optimum means optimal task load balancing, i.e., the minimum achievable maximum task cost. Result: the two-stage solution ≤ (2 + δ) × optimum, where δ is the ratio of I/O and communication cost over computation cost; in our tested cases δ ≈ 10%. The analysis establishes this competitive ratio between our algorithm and the optimal solution.

18 Competitive to Optimum Runtime Scheduler
Can the solution of the task assignment algorithm also be competitive with the one produced by optimal runtime scheduling? Let PTopt = the minimum parallel time on q cores. A greedy scheduler (e.g., Hadoop MapReduce) executes the tasks produced by the two-stage algorithm: with many comparison tasks and a cluster of multi-core machines, once a task is ready it is assigned to the first available core. The resulting schedule length is PTq. Result: even though the algorithm is competitive with the optimum in terms of maximum task cost, and without knowing the available resources or the runtime situation, PTq remains within a small constant factor of PTopt (approximately 3, with terms involving q and δ); we return to this bound shortly.

19 Scalability: Parallel Time and Speedup
Speedup is defined as the sequential time of these tasks divided by the parallel time. For the two larger datasets (the top two lines), the speedup is about 84x for ClueWeb and 78x for Twitter when 100 cores are used, and declines slightly when more cores are used (efficiency 76% and 72% at 300 cores). The efficiency decline is caused by the increase of I/O overhead among machines in larger clusters. The YMusic dataset is not large enough to use more cores while amortizing the overhead.

20 Comparison with Circular Load Assignment
Comparison against the circular load assignment of [Alabduljalil et al. WSDM'13]. Parallel time reduction: Stage 1 contributes up to 39%, Stage 2 up to 11%. To evaluate the effectiveness of two-stage load balancing, we first compare task cost using two load-balance indicators, max/avg and standard deviation/avg: the larger these ratios are, the more severe the load imbalance, so we want them small. Compared with circular mapping, the two-stage assignment lowers both ratios by 24% to 42% over the three datasets; for Twitter, for example, it improves the max/avg factor from 2.14 to 1.45. The right figure shows the actual runtime impact: our algorithm reduces the parallel time by 41% on the Twitter dataset.

21 PSS: Partition-based similarity search
Key techniques: Partition-based symmetric comparison and load balancing [SIGIR'14a]. Fast runtime execution considering the memory hierarchy [SIGIR'13 + Journal]: splitting hosted partitions to fit into cache reduces slow memory data access (PSS1); coalescing vectors with size-controlled inverted indexing improves the temporal locality of visited data (PSS2); cost modeling of memory-hierarchy access serves as guidance for optimizing parameter settings.

22 Memory-hierarchy aware execution in PSS Task
S = vectors of the partition owned by the task; B = vectors of other partitions to compare; C = temporary storage (score accumulators). Task steps: read the assigned partition into area S; then repeat: read some vectors vi from other partitions into B, compare vi with S, and output similar vector pairs, until all other potentially similar vectors have been compared.

23 Problem: PSS area S is too big to fit in cache
Area S holds the inverted index of the task's vectors, C the accumulator for S, and B the other vectors being compared. S is too long to fit in cache. Relevant locality notions when comparing vectors vi in B against S: spatial locality (nearby memory locations are referenced in the near future), temporal locality (the same memory location is referenced again in the near future, i.e., temporal proximity between adjacent references to the same location), and branch locality (the outcome of conditional branching is restricted to a small set of possibilities). The skewed distribution of features makes S large.

24 PSS1: Cache-conscious data splitting
After splitting, S is divided into splits S1, S2, ..., Sq, each with its own accumulator in C, and compared against B. What split size should be used? The goal is to improve locality of references: splitting hosted partitions to fit into cache reduces slow memory data access (PSS1). Since there are multiple layers of cache, there are multiple split-size choices, leading to different performance.

25 PSS1 Task
PSS1 task: read S and divide it into many splits; read other vectors into B; for each split Sx, Compare(Sx, B); output similarity scores. To choose the split size we model the behavior of PSS1, which requires identifying the core computation (the most time-consuming part). Given a shared feature t of a vector pair, the core computation appends the partial score and applies dynamic filtering:

Compare(Sx, B):
  for di in Sx:
    for dj in B:
      sim(di, dj) += wi,t * wj,t
      if sim(di, dj) + maxwdi * sumdj < τ then filter out (di, dj)

Having identified the core computation (the two lines inside the inner loop), we next model its memory accesses. (A runnable rendering follows.)
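Below is a hedged, runnable rendering of the core loop above in Python (variable names and the exact form of the filter bound are reconstructed from the slide, not taken from the released code): partial scores for pairs from a split Sx and block B are accumulated feature by feature, and a pair is dropped as soon as an optimistic upper bound on its remaining contribution cannot reach the threshold τ.

def compare_split(Sx, B, tau):
    """Sx, B: lists of (doc_id, {feature: weight}) for normalized sparse vectors.
    Returns pairs whose dot product (cosine similarity) reaches tau, pruning a
    pair early when an optimistic bound shows it can no longer reach tau."""
    results = []
    for di, wi in Sx:
        maxw_di = max(wi.values())                    # largest weight in di
        for dj, wj in B:
            shared = sorted(set(wi) & set(wj))
            score = 0.0
            remaining = sum(wj.values())              # upper bound on dj's leftover mass
            pruned = False
            for t in shared:
                score += wi[t] * wj[t]
                remaining -= wj[t]
                # Bound: pair every remaining dj weight with di's maximum weight.
                if score + maxw_di * remaining < tau:
                    pruned = True
                    break
            if not pruned and score >= tau:
                results.append((di, dj, score))
    return results

Sx = [("a", {"f1": 0.8, "f2": 0.6})]
B = [("b", {"f1": 0.9, "f3": 0.4}), ("c", {"f4": 1.0})]
print(compare_split(Sx, B, 0.5))    # roughly [('a', 'b', 0.72)]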

26 Modeling Memory/Cache Access of PSS1
The core computation requires information from areas Si, B, and C:

  sim(di, dj) = sim(di, dj) + wi,t * wj,t
  if sim(di, dj) + maxwdi * sumdj < τ then filter out (di, dj)

Hence the total number of memory data accesses is D0 = D0(Si) + D0(B) + D0(C), and the accesses from each memory area are analyzed separately.

27 Cache misses and data access time
Memory and cache access counts: D0 = total memory data accesses; D1 = accesses missed at L1; D2 = accesses missed at L2; D3 = accesses missed at L3. Memory and cache access time: total data access time = (D0-D1)δ1 + (D1-D2)δ2 + (D2-D3)δ3 + D3δmem, where δi is the access time at cache level i and δmem is the access time of main memory. The details of predicting each cache miss count Di are in the paper; this cost model of memory-hierarchy access guides optimal parameter setting. Example cache sizes: Intel Xeon X3430 @ 2.40 GHz has 32 KB L1, 256 KB L2, and 8192 KB L3; the Intel Core machine used has a 6144 KB (6 MB) last-level cache.
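The access-time formula can be written down directly; the per-level latencies below are the rough cycle counts quoted on the next few slides, and the main-memory latency is an assumed placeholder for illustration.

def data_access_time(D0, D1, D2, D3,
                     delta1=2, delta2=8, delta3=35, delta_mem=200):
    """Total data access time = (D0-D1)*d1 + (D1-D2)*d2 + (D2-D3)*d3 + D3*d_mem,
    where Di is the number of accesses missed at cache level i. The default
    latencies are rough cycle counts; delta_mem=200 is an illustrative guess."""
    return ((D0 - D1) * delta1 +
            (D1 - D2) * delta2 +
            (D2 - D3) * delta3 +
            D3 * delta_mem)

# Example: 1M accesses with 10% L1, 2% L2, and 0.5% L3 miss ratios.
print(data_access_time(1_000_000, 100_000, 20_000, 5_000))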

28 Total data access time: data found in L1
Total data access time = (D0-D1)δ1 + (D1-D2)δ2 + (D2-D3)δ3 + D3δmem. An L1 hit (δ1) costs about 2 cycles.

29 Total data access time: data found in L2
Total data access time = (D0-D1)δ1 + (D1-D2)δ2 + (D2-D3)δ3 + D3δmem. An L2 hit (δ2) costs about 6-10 cycles.

30 Total data access time: data found in L3
Total data access time = (D0-D1)δ1 + (D1-D2)δ2 + (D2-D3)δ3 + D3δmem. An L3 hit (δ3) costs about 30-40 cycles.

31 Total data access time: data found in memory
Total data access time = (D0-D1)δ1 + (D1-D2)δ2 + (D2-D3)δ3 + D3δmem. A main-memory access (δmem) is hence 10x-50x slower than using data already in cache.

32 Time comparison: PSS vs. PSS1
Consider the case where a PSS1 split fits in the L2 cache. The L1 cache miss ratio in practice is above 10%, and a memory access is roughly two orders of magnitude slower than an L1 access, so the ideal speedup ratio of PSS1 over PSS is about 10x.

33 Actual vs. Predicted
Avg. task time ≈ #features × (lookup + multiply + add) + memory/cache access time. Having formulated the average task time this way, we check how well the model fits reality. The measured curve has a valley: a task can be multiple times slower if the split size is not chosen optimally, showing how the choice of split size impacts performance. Cost modeling of memory-hierarchy access thus serves as guidance to optimize the parameter setting.

34 PSS2: Vector coalescing
Issues: PSS1 focuses on splitting S to fit into cache, but does not consider cache reuse to improve temporal locality in memory areas B and C. Solution: coalesce multiple vectors in B.

35 PSS2: Example for improved locality
Striped areas are the data kept in cache. Multiple vectors from area B are brought in together and stored in an inverted index format. This improves temporal locality in memory areas B and C and amortizes the inverted-index lookup cost. (A sketch follows.)
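A hedged sketch of the coalescing idea (simplified, with illustrative names): a small group of b vectors from B is converted into a shared inverted index so that each posting list of the split Si is looked up once per group rather than once per vector, which is where the improved temporal locality and the amortized lookup cost come from.

from collections import defaultdict

def coalesced_compare(Si_index, B_vectors, b, tau):
    """Si_index: feature -> [(s_doc, weight)], the inverted index of split Si.
    B_vectors: list of (doc_id, {feature: weight}). Vectors from B are handled
    in coalesced groups of b so Si's postings are reused while cache-hot."""
    results = []
    for start in range(0, len(B_vectors), b):
        group = B_vectors[start:start + b]
        group_index = defaultdict(list)               # feature -> [(b_doc, weight)]
        for doc_id, features in group:
            for f, w in features.items():
                group_index[f].append((doc_id, w))
        acc = defaultdict(float)                      # (s_doc, b_doc) -> partial score
        for f, group_postings in group_index.items():
            for s_doc, ws in Si_index.get(f, ()):     # one posting walk serves b vectors
                for b_doc, wb in group_postings:
                    acc[(s_doc, b_doc)] += ws * wb
        results.extend((s, d, sc) for (s, d), sc in acc.items() if sc >= tau)
    return results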

36 Effect of s and b on PSS2 performance (Twitter)
With the Twitter dataset (short messages of up to 140 characters), this figure shows PSS2 performance for different values of s and b, where darker areas mean slower performance and lighter areas mean faster processing. The marked rectangle shows the best performance, achieved for s between 5K and 10K vectors and b within 32.

37 Improvement Ratio of PSS1,PSS2 over PSS
The x-axis shows the datasets (Twitter, ClueWeb, YMusic, Gnews) and the y-axis the improvement ratio over PSS. In general, PSS1 gives an improvement of 1.2x to 1.6x and PSS2 gives 2.1x to 2.7x (up to 2.74x).

38 Incorporate LSH with PSS
LSH functions and signature generation: MinHash for Jaccard similarity, random projection for cosine similarity. LSH hashes the records using several hash functions so that similar records have a much higher probability of colliding in the same bucket than dissimilar records. Figure: k hash values from each data object are concatenated into a single signature for high precision, and matches are combined over l such hashing rounds, each using independent hash functions, for good recall. An LSH family satisfies the property that the similarity between two documents equals the probability that their hashes are equal. MinHash is the min-wise independent permutations method used with shingling: for each random ordering of the features in a document vector, the feature with the lowest order is picked as the minimum hash. Random projection uses a series of random hyperplanes as hash functions to encode document vectors as fixed-size bit vectors. References: A. Broder, "On the resemblance and containment of documents", SEQUENCES '97, IEEE Computer Society; M. S. Charikar, "Similarity estimation techniques from rounding algorithms", STOC '02, ACM. (A signature-generation sketch follows.)
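For the random-projection case, a small sketch of signature generation (a standard construction in the spirit of Charikar's scheme; the dimensions and bit count are illustrative): each of k random hyperplanes contributes one sign bit, and the fraction of agreeing bits between two signatures estimates the angular similarity of the vectors.

import numpy as np

def random_projection_signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(int(b) for b in (hyperplanes @ vec >= 0))

rng = np.random.default_rng(0)
dim, k = 8, 16                                  # feature dimension, bits per signature
hyperplanes = rng.standard_normal((k, dim))

d1 = rng.standard_normal(dim)
d2 = d1 + 0.05 * rng.standard_normal(dim)       # a near-duplicate of d1
sig1 = random_projection_signature(d1, hyperplanes)
sig2 = random_projection_signature(d2, hyperplanes)
print(sum(a == b for a, b in zip(sig1, sig2)) / k)   # close to 1.0 for similar vectors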

39 LSH Pipeline
LSH sub-steps: projection generation, signature generation, and bucket generation. Benefits: well suited to parallelization, and makes larger datasets more accessible. After the LSH step, our efficient partition-based algorithm is applied in parallel within all buckets generated over all LSH rounds. The LSH step happens before the original input data is copied and processed in parallel by PSS tasks; LSH computation itself is also parallelized over the distributed servers as MapReduce jobs consisting of the projection generation, signature generation, and bucket generation sub-steps. (A bucket-generation sketch follows.)
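A hedged, sequential sketch of the bucket-generation step (the real pipeline runs as MapReduce jobs; parameters are illustrative): each of l rounds draws independent hyperplanes, the k-bit signature is the bucket key, candidate pairs are the union over rounds of pairs sharing a bucket, and each bucket would then be verified exactly (by PSS in the thesis) to keep precision at 100%.

import numpy as np
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(vectors, k=8, l=4, seed=0):
    """vectors: array of shape (n, dim) with L2-normalized rows. Returns the
    union over l independent rounds of all pairs colliding in a k-bit bucket."""
    rng = np.random.default_rng(seed)
    n, dim = vectors.shape
    candidates = set()
    for _ in range(l):                                  # independent hashing rounds
        planes = rng.standard_normal((k, dim))
        signatures = (vectors @ planes.T >= 0)          # (n, k) bit signatures
        buckets = defaultdict(list)
        for i, sig in enumerate(signatures):
            buckets[sig.tobytes()].append(i)            # k-bit key -> doc ids
        for ids in buckets.values():
            candidates.update(combinations(ids, 2))     # OR-combination across rounds
    # Each bucket's candidates would next be verified exactly (e.g., by PSS).
    return candidates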

40 Effectiveness of Our Method
100% precision, compared with 67% in [Ture et al. SIGIR'11], with a guaranteed recall ratio for a given similarity threshold: our combined algorithm design achieves 100% precision with a guaranteed recall ratio. Ivory applies a sliding-window mechanism over sorted signatures (comparing Hamming distances) in the hash table generated by one set of hash functions, and repeats this step for hundreds of rounds; due to the errors introduced by bit signatures and the sliding-window algorithm, the recall upper bound of their method is 0.76 with a 1,000-bit signature. When full precision is desired, candidates within each LSH bucket can be post-processed by an additional pairwise step that computes exact similarities to filter out false positives. (k bits per round, l rounds.)

41 Efficiency – 20M Tweets >95% recall for 0.95 cosine similarity
50 cores. Tradeoff in choosing k: if k is too high, partitions become too small; if too low, there is not enough speedup from hashing.

42 Method Comparison – 20M Tweets
The table compares our adopted method with two other approaches: running only LSH (Pure LSH) and running only PSS (Pure PSS). Pure LSH is not good enough because it generates a very high number of false positives; this is due to the relatively small number of signature bits (k) and the fact that LSH rounds are combined with an OR relation, so the union of their results is used. Pure LSH with a relatively high number of signature bits could reach >95% recall with more rounds (l), but precision is hard to push above 94%, and more rounds mean longer processing time; even at similar precision and recall, the combined method is faster than pure LSH. Pure PSS, on the other hand, guarantees 100% precision and recall but takes 8.8x as much time as our adopted method, which applies LSH before PSS. This comparison shows the efficiency of conducting LSH before PSS to speed up processing, and the necessity of running PSS after the LSH step as validation to ensure 100% precision. In short: LSH improves efficiency (speed) with a recall bound; PSS guarantees precision.

43 Efficiency – 40M ClueWeb 95% recall for 0.95 cosine similarity
300 cores. LSH+PSS is better than pure LSH: precision is increased to 100% with faster speed. LSH+PSS is better than pure PSS: 71x speedup. A larger dataset calls for a higher k because of the larger feature size: due to the higher feature count per record and the longer posting lists, the balance is achieved with a higher number of signature bits k.

44 PSS with Incremental Updates
Update static partitions with new documents; PSS is easy to extend to the incremental setting. For example, a web search engine constantly crawls the web for updated content, Twitter users keep creating new tweets, and music-site users update ratings or add new ratings to songs they listen to. How can partition-based similarity search handle incremental content updates without recomputing everything? Whenever new documents and/or new versions of old documents are generated, they are appended to the end of a new partition. Once the new partition has grown to a threshold size, or a threshold amount of time has passed, a MapReduce job compares the new partition with all the original partitions. After the comparison, the new partition is added to the original set as a stand-alone partition with potential similarity relationships to all the other partitions. Compared with a naive solution that triggers an all-partition-pairs comparison once a threshold is reached, this is 50x faster. (A structural sketch follows.)
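A structural sketch of this incremental policy (class and threshold names are illustrative; the real system launches a MapReduce job where the sketch calls compare_fn): new documents are buffered into a pending partition, and when a size or age threshold is hit, only that partition is compared against the existing ones before being kept as a stand-alone partition.

import time

class IncrementalPSS:
    """Buffer new documents into a pending partition; compare only that
    partition against the existing partitions when a threshold is reached."""

    def __init__(self, partitions, compare_fn, max_size=10_000, max_age_s=3600):
        self.partitions = list(partitions)     # existing static partitions
        self.compare_fn = compare_fn           # compare_fn(new_part, old_part) -> similar pairs
        self.pending = []
        self.pending_since = time.time()
        self.max_size, self.max_age_s = max_size, max_age_s

    def add_document(self, doc):
        self.pending.append(doc)
        if (len(self.pending) >= self.max_size or
                time.time() - self.pending_since >= self.max_age_s):
            return self.flush()
        return []

    def flush(self):
        """Compare the new partition with all original partitions, then keep it
        as a stand-alone partition (its internal pairs can be handled the same way)."""
        new_part, results = self.pending, []
        for old_part in self.partitions:       # no all-partition-pairs recomputation
            results.extend(self.compare_fn(new_part, old_part))
        self.partitions.append(new_part)
        self.pending, self.pending_since = [], time.time()
        return results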

45 Result Ranking After Similarity-based Retrieval or Other Metrics
Once a set of distinct result pages has been selected, for example after conducting all-pairs similarity search on all the webpages that match the query, a ranking of the result pages needs to be computed before the search result page can be presented to users. To organize the search result page in a way that maximizes the total reward, instead of relying on human judges, many companies adopt a system that leverages implicit user feedback to build a machine-learned model. Tree-based learning ensembles are trained offline and applied online to serve hundreds of millions of live queries each day.

46 Motivation Machine-learnt ranking models are popular
Ranking ensembles such as gradient boosted regression trees (GBRT) use a large number of trees to improve accuracy; winning teams at the Yahoo! Learning-to-Rank Challenge used ensembles with 2K to 20K trees, or even 300K trees with bagging methods. Computing large ensembles is time consuming: access to irregular document attributes impairs CPU cache reuse; unorchestrated slow memory access incurs significant cost (memory access latency is about 200x slower than the L1 cache); and dynamic tree branching impairs instruction branch prediction. Learning ensembles based on multiple trees are effective for web search and other complex data applications, and it is not unusual for algorithm designers to use thousands of trees to reach better accuracy, with the number of trees growing even larger once bagging is integrated. Parallelization alone does not increase query processing throughput.

47 Document-ordered Traversal (DOT)
Data traversal in existing solutions: a tradeoff between ranking accuracy and performance can be made by exiting early, based on document-ordered traversal (DOT) or scorer-ordered traversal (SOT). The left figure shows the data access sequence of DOT, marked on the edges between documents and tree-based scorers; these edges represent the data interaction during ranking score calculation. DOT first accesses a document and the first tree (marked as step 1), then visits the same document and the second tree; all m trees are traversed before the next document is accessed. As m becomes large, the capacity constraint of the CPU caches (L1, L2, or even L3) does not allow all m trees to be kept in cache before the next document is accessed. (The DOT and SOT loop orders are sketched below.)
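For concreteness, the two existing traversal orders differ only in loop nesting; the sketch below uses a simplified tree representation (nested dicts with float leaves) that is illustrative rather than the benchmarks' actual data layout.

def score_tree(tree, doc):
    """Walk one regression tree for one document; leaves are plain floats."""
    node = tree
    while isinstance(node, dict):
        node = node["left"] if doc[node["feature"]] <= node["threshold"] else node["right"]
    return node

def rank_dot(docs, trees):
    """Document-ordered traversal: all m trees for a document, then the next document."""
    return [sum(score_tree(t, d) for t in trees) for d in docs]

def rank_sot(docs, trees):
    """Scorer-ordered traversal: all n documents for a tree, then the next tree."""
    scores = [0.0] * len(docs)
    for t in trees:
        for i, d in enumerate(docs):
            scores[i] += score_tree(t, d)
    return scores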

48 Our Proposal: 2D Block Traversal
We propose a cache-conscious 2D blocking method that optimizes data traversal for better temporal cache locality. The edges in the left portion of this figure represent the interactions among blocks of documents and blocks of trees, with the access sequence marked on the edges; for each block-level edge, the right portion of the figure shows the data interaction inside the blocks. The key property is temporal proximity of references to documents before they are swapped out of the cache.
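A hedged sketch of the 2D blocking order (self-contained, with the same simplified tree representation as the previous sketch; block sizes are illustrative): a block of documents is scored against a block of trees before moving on, so the document block stays cache-resident across the whole tree block.

def score_tree(tree, doc):
    """Walk one regression tree for one document; leaves are plain floats."""
    node = tree
    while isinstance(node, dict):
        node = node["left"] if doc[node["feature"]] <= node["threshold"] else node["right"]
    return node

def rank_2d_block(docs, trees, d_block=1000, t_block=100):
    """2D blocking: score d_block documents against t_block trees at a time,
    improving temporal locality over both DOT and SOT."""
    scores = [0.0] * len(docs)
    for ds in range(0, len(docs), d_block):            # document blocks
        for ts in range(0, len(trees), t_block):       # tree blocks
            for t in trees[ts:ts + t_block]:
                for i in range(ds, min(ds + d_block, len(docs))):
                    scores[i] += score_tree(t, docs[i])
    return scores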

49 Why Better? Total slow memory accesses in score calculation
Simplified cache performance analysis: total slow memory accesses in score calculation for DOT, SOT, and 2D Block. 2D Block can be up to s times faster, but s is capped by the cache size. VPred [Asadi et al. TKDE'13] converts control dependence into data dependence, using loop unrolling with vectorization to reduce instruction branch misprediction and mask slow memory access latency. 2D Block fully exploits cache capacity for better temporal locality. Block-VPred is a combined solution that applies 2D blocking on top of VPred, converting control dependence into data dependence to reduce instruction branch misprediction.

50 Scoring Time per Document per Tree in Nanoseconds
Benchmarks: Yahoo! Learning-to-Rank, MSLR-30K, and MQ2007. Metrics: scoring time, plus cache miss ratios and branch misprediction ratios reported by the Linux perf tool. Query latency = scoring time × n × m for n documents ranked with an m-tree model. In some cases, the benefit of converting control dependence into data dependence does not outweigh the overhead introduced.

51 Query Latency in Seconds
For example, for Row 3 (the Yahoo! 150-leaf, 8,051-tree benchmark), 2D blocking is 361% faster than DOT and 50% faster than VPred; in this case Block-VPred is 62% faster than VPred, and each query takes 1.23 seconds to score with Block-VPred. For a smaller tree in Row 5 (MSLR-30K), Block-VPred is 17% slower than regular 2D blocking, because the benefit of converting control dependence into data dependence does not outweigh the overhead introduced. Highlights (the fastest algorithm is marked in gray): MQ2007, DOT 204 vs. 2D Block 28.3, a 621% gap; Yahoo! 50 leaves, SOT vs. 2D Block 36.4, 213%; Yahoo! 150 leaves, VPred 123 vs. 2D Block 81.9, 50%. Overall, 2D blocking is up to 620% faster than DOT, up to 213% faster than SOT, and up to 50% faster than VPred; Block-VPred is up to 100% faster than VPred and faster than 2D blocking in some cases.

52 Time & Cache Perf. as Ensemble Size Varies
n is fixed at 2,000. At m = 32K, 2D Block takes 82.7 and Block-VPred 76.6, while DOT is far slower: DOT scoring time and L3 cache misses both rise dramatically when the ensemble size increases and the trees no longer fit in cache. SOT has a visibly higher miss ratio because it needs to bring most of the documents back from memory into the L3 cache every time it evaluates them against a scorer. The performance of Block-VPred and 2D blocking is close and barely affected by the change of ensemble size. Time and cache performance are highly correlated; the change of ensemble size affects DOT the most. 2D blocking is up to 287% faster than DOT.

53 Efficient similarity search with cache-conscious data traversal
Summary. Title: Efficient similarity search with cache-conscious data traversal. Contributions: load balancing for partition-based symmetric comparison [SIGIR'14a]: we propose and implement a two-stage load-balancing algorithm for efficiently executing partition-based similarity search in parallel, with a proof of its competitiveness to the optimal solution.

54 Summary Fast runtime similarity search considering the memory hierarchy, with analysis [SIGIR'13 + Journal]. Focus on speeding up inter-partition comparison with a cache-conscious data layout and traversal design. Predict the optimal data-split size by identifying the data access pattern, modeling the cost function, and estimating the task execution time. Further accelerate similarity search by orders of magnitude via locality sensitive hashing. The optimized code can be up to 2.74x as fast as the original cache-oblivious design.

55 Summary Optimized search result ranking with cache-conscious traversal [SIGIR’14b]. Focus on fast ranking score computation without accuracy loss in multi-tree ensemble models. Investigate data traversal methods for fast score calculation with large multi-tree ensemble models. Propose and implement a cache-conscious design by exploiting memory hierarchy capacity for better temporal locality.

56 Q & A Thank you!

