Optimizing Parallel Algorithms for All Pairs Similarity Search

Slides:



Advertisements
Similar presentations
Google News Personalization: Scalable Online Collaborative Filtering
Advertisements

Arnd Christian König Venkatesh Ganti Rares Vernica Microsoft Research Entity Categorization Over Large Document Collections.
Weiren Yu 1, Jiajin Le 2, Xuemin Lin 1, Wenjie Zhang 1 On the Efficiency of Estimating Penetrating Rank on Large Graphs 1 University of New South Wales.
CrowdER - Crowdsourcing Entity Resolution
Twig 2 Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents Songting Chen, Hua-Gang Li *, Junichi Tatemura Wang-Pin Hsiung,
Fast Algorithms For Hierarchical Range Histogram Constructions
Clustering and Load Balancing Optimization for Redundant Content Removal Shanzhong Zhu (Ask.com) Alexandra Potapova, Maha Alabduljalil (Univ. of California.
Bahman Bahmani  Fundamental Tradeoffs  Drug Interaction Example [Adapted from Ullman’s slides, 2012]  Technique I: Grouping 
Comparing Twitter Summarization Algorithms for Multiple Post Summaries David Inouye and Jugal K. Kalita SocialCom May 10 Hyewon Lim.
Intelligent Information Retrieval 1 Vector Space Model for IR: Implementation Notes CSC 575 Intelligent Information Retrieval These notes are based, in.
Low-Cost Data Deduplication for Virtual Machine Backup in Cloud Storage Wei Zhang, Tao Yang, Gautham Narayanasamy University of California at Santa Barbara.
Constraint Processing Techniques for Improving Join Computation: A Proof of Concept Anagh Lal & Berthe Y. Choueiry Constraint Systems Laboratory Department.
Redundant Bit Vectors for the Audio Fingerprinting Server John Platt Jonathan Goldstein Chris Burges.
ACL, June Pairwise Document Similarity in Large Collections with MapReduce Tamer Elsayed, Jimmy Lin, and Douglas W. Oard University of Maryland,
Improving performance of Multiple Sequence Alignment in Multi-client Environments Aaron Zollman CMSC 838 Presentation.
Ivory : Ivory : Pairwise Document Similarity in Large Collection with MapReduce Tamer Elsayed, Jimmy Lin, and Doug Oard Laboratory for Computational Linguistics.
1 An Empirical Study on Large-Scale Content-Based Image Retrieval Group Meeting Presented by Wyman
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
Query Planning for Searching Inter- Dependent Deep-Web Databases Fan Wang 1, Gagan Agrawal 1, Ruoming Jin 2 1 Department of Computer.
Cache-Conscious Runtime Optimization for Ranking Ensembles Xun Tang, Xin Jin, Tao Yang Department of Computer Science University of California at Santa.
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
This module was created with support form NSF under grant # DUE Module developed by Martin Burtscher Module B1 and B2: Parallelization.
Efficient Exact Similarity Searches using Multiple Token Orderings Jongik Kim 1 and Hongrae Lee 2 1 Chonbuk National University, South Korea 2 Google Inc.
Load Balancing for Partition-based Similarity Search Date : 2014/09/01 Author : Xun Tang, Maha Alabduljalil, Xin Jin, Tao Yang Source : SIGIR’14 Advisor.
Load Balancing for Partition-based Similarity Search Xun Tang, Maha Alabduljalil, Xin Jin, Tao Yang Department of Computer Science University of California.
Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining Wei Jiang and Gagan Agrawal.
Parallel and Distributed IR. 2 Papers on Parallel and Distributed IR Introduction Paper A: Inverted file partitioning schemes in Multiple Disk Systems.
UNIVERSITAT POLITÈCNICA DE CATALUNYA Departament d’Arquitectura de Computadors Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning.
1 Efficient Search Ranking in Social Network ACM CIKM2007 Monique V. Vieira, Bruno M. Fonseca, Rodrigo Damazio, Paulo B. Golgher, Davi de Castro Reis,
Pairwise Document Similarity in Large Collections with MapReduce Tamer Elsayed, Jimmy Lin, and Douglas W. Oard Association for Computational Linguistics,
Budget-based Control for Interactive Services with Partial Execution 1 Yuxiong He, Zihao Ye, Qiang Fu, Sameh Elnikety Microsoft Research.
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
Evaluating FERMI features for Data Mining Applications Masters Thesis Presentation Sinduja Muralidharan Advised by: Dr. Gagan Agrawal.
Triangular Mesh Decimation
Load Balancing for Partition-based Similarity Search Xun Tang, Maha Alabduljalil, Xin Jin, Tao Yang Department of Computer Science University of California.
Cache-Conscious Performance Optimization for Similarity Search Maha Alabduljalil, Xun Tang, Tao Yang Department of Computer Science University of California.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Leonardo Guerreiro Azevedo Geraldo Zimbrão Jano Moreira de Souza Approximate Query Processing in Spatial Databases Using Raster Signatures Federal University.
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.
Lecture 4 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South Carolina Department of.
Parallelization of likelihood functions for data analysis Alfio Lazzaro CERN openlab Forum on Concurrent Programming Models and Frameworks.
Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.
VIP: Finding Important People in Images Clint Solomon Mathialagan Andrew C. Gallagher Dhruv Batra CVPR
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
ISchool, Cloud Computing Class Talk, Oct 6 th Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective Tamer Elsayed,
DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.
Data Structures and Algorithms in Parallel Computing Lecture 7.
1 Adaptive Parallelism for Web Search Myeongjae Jeon Rice University In collaboration with Yuxiong He (MSR), Sameh Elnikety (MSR), Alan L. Cox (Rice),
Searching Topics Sequential Search Binary Search.
Concurrency and Performance Based on slides by Henri Casanova.
Privacy Preserving Outlier Detection using Locality Sensitive Hashing
Date : 2016/08/09 Advisor : Jia-ling Koh Speaker : Yi-Yui Lee
Efficient Multi-User Indexing for Secure Keyword Search
Edinburgh Napier University
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Mapping Techniques Dr. Xiao Qin Auburn University.
Information Retrieval and Web Search
Efficient Similarity Search with Cache-Conscious Data Traversal
Nithin Michael, Yao Wang, G. Edward Suh and Ao Tang Cornell University
Parallel Programming in C with MPI and OpenMP
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Database Management Systems (CS 564)
Cheng-Ming Huang, Wen-Hung Liao Department of Computer Science
Counter propagation network (CPN) (§ 5.3)
Lecture 9: Entity Resolution
Department of Computer Science University of California, Santa Barbara
Predictive Performance
Replication-based Fault-tolerance for Large-scale Graph Processing
Parallel Programming in C with MPI and OpenMP
Department of Computer Science University of California, Santa Barbara
An Efficient Partition Based Method for Exact Set Similarity Joins
Presentation transcript:

Optimizing Parallel Algorithms for All Pairs Similarity Search Maha Alabduljalil, Xun Tang, Tao Yang Department of Computer Science University of California at Santa Barbara

All Pairs Similarity Search (APSS) Finding all pairs of objects with similarity score above a threshold. Example Applications: Collaborative filtering/recommendation. Coalition detection for advisement frauds. Query suggestions. Spam and near duplicate detection. Slow data processing for large datasets Approximate: doesn’t guarantee full recall Top-k is used when threshold unknown. Entity resolution is the problem of determining which records in a database refer to the same entities (aka. Deduplication, recordlinkage)

Problem Definition Cosine-based similarity: Given n normalized vectors, compute all pairs of vectors such that Quadratic complexity O(n2) Approximate: doesn’t guarantee full recall Top-k is used when threshold unknown. Entity resolution is the problem of determining which records in a database refer to the same entities (aka. Deduplication, recordlinkage) >

Previous Work to Speedup Similarity Search Approximated clustering LSH mapping [Gionis et al. VLDB99]: Tradeoff in precision, recall, & redundant comparison. Proposed techniques can be used with LSH. Exact similarity search: Dynamic computation filtering [Bayado et al. WWW07] Inverted indexing to search vectors that share features[Arasu et al. VLDB06] Parallel algorithms with MapReduce [Lin SIGIR09, Baraglia et al. ICDM10] Exploit parallelism in partial similarity score accumulation. Approximate: doesn’t guarantee full recall Top-k is used when threshold unknown. Entity resolution is the problem of determining which records in a database refer to the same entities (aka. Deduplication, recordlinkage)

Parallelism in Partial Similarity Score Accumulation Inverted index W4,5 f5 W2,5 W3,1 W4,1 W2,1 f1 W4,3 W2,3 f3 W8,3 W7,3 Partial result Map Reduce + = sim(d2,d4) Communication overhead

Focus and contribution: Partition-based APSS Focus on exact APSS with cosine similarity. New techniques to speedup APSS: Static partitioning to identify dissimilar vectors in advance.  Early removal of I/O, communication, computation. Partition-based symmetric comparison and load balancing.  Removes unnecessary I/O and communication before computation. No reduce tasks to avoid partial result parallelism.  removal of excessive communication. An order of magnitude of performance improvement. Approximate: doesn’t guarantee full recall Top-k is used when threshold unknown. Entity resolution is the problem of determining which records in a database refer to the same entities (aka. Deduplication, recordlinkage)

Partition-based Similarity Search (PSS)

Function of each PSS task Read assigned partition and build inverted index in memory area S. Repeat Read vectors from other partitions Compare S with these vectors Write similar vector pairs Until other potentially similar vectors are compared. Approximate: doesn’t guarantee full recall Top-k is used when threshold unknown. Entity resolution is the problem of determining which records in a database refer to the same entities (aka. Deduplication, recordlinkage) Compare s Coarse-grain task parallelism

I. Static Partitioning with Dissimilarity Detection Goal: Place dissimilar vectors into different partitions. Low-cost partitioning: near linear complexity in identifying pair-wise dissimilarity with no false positive. Partitions Dissimilarity edges Dataset

How To Detect Dissimilar Vectors? For normalized cosine similarity, Holder inequality: di is dissimilar to dj if: How to detect dissimilarity without pair-wise comparison?

Illustration of O(nlogn) static partitioning with dissimilarity detection (a) Sort (b) Group (c) Decompose dissimilar dissimilar ||d6||1 ≤ τ / maxw(d7)

Final Output of Static Partitioning Decompose G1 G2 G3 Gn Dissimilarity edges

II. Partition-oriented symmetric comparison Issue: Sim(di, dj)=Sim(dj, di). Redundant computations. Use vector IDs to detect symmetry ie. use one pair. Eliminate computation, but not I/O. Pi Pj ? Strategy Partition-level symmetric detection Eliminate entire partition I/O and communication P2 Q: Should Pi compare with Pj or Pj compares with Pi? Impact communication/workload balance.

II. Partition-level Circular Load Balancing Goal: Select partition-level comparison direction. Balance I/O traffic and workload among tasks.

III. Run PSS tasks in parallel Simple task parallelism: Do not exploit partial result parallelism. No reduce tasks  no map-reduce communication. Hybrid indexing for faster memory data access: Inverted indexing for assigned partition Forward indexing for other vectors W1,2 x W7,2 W3,2 x W7,2

Implementation and Evaluation Implemented in Java with Hadoop MapReduce. Evaluation objectives: Scalability of PSS with hybrid indexing. Compare against inverted indexing, PSS with forward indexing Effectiveness of static partitioning and partition-based circular balancing. Datasets: Twitter (20M). Clueweb (50M subset). Enron emails (448K). Data cleaning and stopword removal applied first Static partitioning takes ~3% of total time.

Speedup of PSS with hybrid indexing as # of cores increases

Parallel time comparison of 3 algorithms Inverted indexing PSS/forward indexing 20x faster 10x faster PSS/ hybrid indexing Twitter

Ratio of PSS/hybrid indexing time over forward indexing ps : average posting length in the assigned partition s: number of vectors in the assigned partitions Analysis predicts the trend of cost ratio and impact of faster data access with hybrid indexing. For Twitter: 20% vs. actual: ~10%

Execution time reduction by static partitioning and partition-level circular balancing Ratio: 1 – Time(optimized)/Time(baseline) Non-binary weights Static partitioning yields 30-75% reduction Circular balancing adds 3%-26% more

Execution time reduction with binary weights Ratio: 1 – Time(optimize)/Time(baseline) Binary weights Static partitioning yields 50-75% reduction Circular balancing adds 4%-11% more

Read I/O Volume Reduction by Static Partitioning Static partitioning yields 20%-80% volume reduction

Read I/O volume of statement-level vs Read I/O volume of statement-level vs. partition-level symmetric comparison Twitter Partition-level symmetric comparison has 50% less read I/O volume

Conclusions A scalable two-step approach: partition-based all-pairs similarity search fast static partitioning to place dissimilar vectors into different groups. circular partition-based comparison assignment that exploits computation symmetry & balances workloads. Simple parallel task execution with hybrid indexing. Up to 10x-20x faster in the tested datasets. By I/O and communication reduction, & faster memory data access http://www.cs.ucsb.edu/projects/psc Approximate: doesn’t guarantee full recall Top-k is used when threshold unknown. Entity resolution is the problem of determining which records in a database refer to the same entities (aka. Deduplication, recordlinkage)