CBLOCK: An Automatic Blocking Mechanism for Large-Scale Deduplication Tasks Ashwin Machanavajjhala Duke University with Anish Das Sarma, Ankur Jain, Philip.

Presentation transcript:

CBLOCK: An Automatic Blocking Mechanism for Large-Scale Deduplication Tasks
Ashwin Machanavajjhala, Duke University
with Anish Das Sarma, Ankur Jain, Philip Bohannon
CIKM 2012, "CBLOCK"

What is Deduplication?
The problem of identifying and linking or grouping different manifestations of the same real-world object.
Examples of manifestations and objects:
– Different ways of referring to the same person in text (names, addresses, Facebook accounts).
– Web pages with differing descriptions of the same business.
– Different photos of the same object.
– …

Deduplication: Motivating Examples
– Linking census records
– Public health
– Web search
– Comparison shopping
– Counter-terrorism
– Spam detection
– Machine reading
– …

Big-Data & Deduplication

Blocking: Motivation
Naïve pairwise deduplication needs |R|^2 pairwise comparisons:
– 100 business listings each from 10,000 different cities across the world
– 1 trillion comparisons
– 11.6 days (if each comparison takes 1 μs)
Mentions from different cities are unlikely to be matches:
– Blocking criterion: city
– 100 million comparisons
– 100 seconds (if each comparison takes 1 μs)
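The arithmetic behind these figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the variable names are illustrative and not from the talk.

```python
# Cost of naive pairwise comparison vs. blocking by city.
# The figures come from the slide; the variable names are illustrative.
listings_per_city = 100
cities = 10_000
seconds_per_comparison = 1e-6  # 1 microsecond per comparison, as on the slide

records = listings_per_city * cities              # |R| = 1,000,000
naive_comparisons = records ** 2                  # |R|^2 = 10^12
naive_days = naive_comparisons * seconds_per_comparison / 86_400

# With city as the blocking key, only records from the same city are compared.
blocked_comparisons = cities * listings_per_city ** 2   # 10^8
blocked_seconds = blocked_comparisons * seconds_per_comparison

print(f"naive:   {naive_comparisons:.0e} comparisons, {naive_days:.1f} days")
print(f"blocked: {blocked_comparisons:.0e} comparisons, {blocked_seconds:.0f} seconds")
```

Blocking by city cuts the work by a factor equal to the number of cities, because each of the 10,000 blocks only needs 100^2 comparisons.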

Blocking: Motivation
Mentions from different cities are unlikely to be matches.
– May miss potential matches

Blocking: Motivation
[Figure labels: "Set of all pairs of records", "Matching pairs of records", "Pairs of records satisfying blocking criterion"]

Focus of this talk
– Need to scale deduplication to very large datasets.
– Need to perform deduplication across a large number of domains.
Our contribution: CBLOCK, an automatic blocking strategy for scaling deduplication tasks.

Next …
Blocking Problem Statement
CBLOCK
– Hierarchical Blocking Trees: Structure, Construction
– Rollup
– Drill-down
Experiments

Blocking Problem Definition
Input: set of records R
Output: set of blocks/canopies
Optimization criteria:
– Coverage: most duplicates fall within some block.
– Efficiency: blocks are small. When blocks are evaluated in parallel, the "largest block" should be small.

Blocking Problem Definition
Coverage estimator:
– Use a training set T+ of matching pairs of objects
– Maximize: [formula not preserved in the transcript]
Efficiency estimator:
– The size of each block is bounded by S
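As a rough illustration of the two estimators, the sketch below scores a blocking function by the fraction of training pairs it keeps together and by its largest block. It uses the recall definition given later in the talk (fraction of matching pairs in T+ that land in the same block). The function name `evaluate_blocking`, the toy records, and the choice of `<=` for the size bound are assumptions.

```python
from collections import defaultdict

def evaluate_blocking(records, block_key, training_pairs, max_size):
    """Return (recall on T+, largest block size, whether the size bound holds)."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[block_key(rec)].append(rid)
    # Coverage: training pairs whose two records receive the same key.
    covered = sum(1 for a, b in training_pairs
                  if block_key(records[a]) == block_key(records[b]))
    recall = covered / len(training_pairs) if training_pairs else 0.0
    # Efficiency: the largest block must respect the bound S (assumed inclusive).
    largest = max((len(ids) for ids in blocks.values()), default=0)
    return recall, largest, largest <= max_size

# Toy data, invented for illustration.
records = {
    1: {"name": "Joe's Pizza", "city": "Durham"},
    2: {"name": "Joes Pizza", "city": "Durham"},
    3: {"name": "Blue Cafe", "city": "Raleigh"},
    4: {"name": "The Blue Cafe", "city": "Raleigh"},
    5: {"name": "Blue Cafe", "city": "Cary"},
}
t_plus = [(1, 2), (3, 4), (3, 5)]   # pairs labeled as matches

recall, largest, within_bound = evaluate_blocking(
    records, lambda r: r["city"], t_plus, max_size=2)
print(recall, largest, within_bound)
```

Here blocking by city keeps two of the three training pairs together, which is the kind of trade-off between coverage and block size that the problem definition describes.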

Blocking Problem Definition
Input: set of records R
Output: set of blocks/canopies
Desiderata: need to efficiently compute which block a record belongs to.
Hash-based blocking: each block corresponds to objects that are hashed to the same key h_i.
– Amenable to implementation on Map-Reduce
– x is hashed to C_i if hash(x) = h_i
Each hash function results in a disjoint blocking: [formula not preserved in the transcript]

Hash-based Blocking
Examples of hash keys:
– Last name
– First three characters of first name
– City + State + Zip
Using one blocking key (or a conjunction of keys) may be insufficient:
– Many objects may be hashed to a small number of hash keys.
– 2,376,206 Americans shared the surname Smith in the 2000 US Census.
– NULL values may create large blocks.
Solution: construct blocking functions by combining simple functions.
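The hash keys listed above, and the skew they can produce, might look like this in Python. It is a minimal sketch: the field names (`first`, `last`, `city`, `state`, `zip`) and the sample records are invented for illustration.

```python
from collections import Counter

def last_name(rec):
    return rec.get("last")

def first3_of_first_name(rec):
    first = rec.get("first")
    return first[:3].lower() if first else None

def city_state_zip(rec):
    parts = (rec.get("city"), rec.get("state"), rec.get("zip"))
    # A missing component makes the whole key NULL (None).
    return None if any(p is None for p in parts) else "|".join(parts)

def block_sizes(records, key_fn):
    """Count how many records hash to each key."""
    return Counter(key_fn(r) for r in records)

# Toy records, invented for illustration.
people = [
    {"first": "John", "last": "Smith", "city": "Durham", "state": "NC", "zip": "27708"},
    {"first": "Jon", "last": "Smith", "city": "Durham", "state": "NC", "zip": "27708"},
    {"first": "Ann", "last": "Smith"},
    {"first": "Maria", "last": "Jones"},
    {"first": None, "last": "Lee"},
]

sizes_by_last = block_sizes(people, last_name)
sizes_by_address = block_sizes(people, city_state_zip)
print(sizes_by_last.most_common(1))   # a common surname dominates
print(sizes_by_address[None])         # records with a NULL key pile up in one block
```

Both failure modes from the slide show up even on five records: a frequent key value and a NULL key each collect most of the data into a single block.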

Next …
Blocking Problem Statement
CBLOCK
– Hierarchical Blocking Trees: Structure, Construction
– Rollup
– Drill-down
Experiments

CBLOCK Components
[Diagram; recoverable labels:]
– Space of hash functions: "first 3 chars of name", "last 4 digits of phone"
– Coverage estimator
– Efficiency constraints: disjointness, size constraints, cost objective
– Training phase: blocking function
– Execution phase: input data, block-generator, blocks
– Algorithms: disjoint blocking, rollup algorithm, drill-down algorithm, non-disjoint algorithm

Hierarchical Blocking Trees
[Figure: example blocking tree. Node labels: title, release-year, director. Edge labels: NULL, <A*, [A*,B*), [T*,U*)]

Hierarchical Blocking Tree
– Tree of hash functions.
– Each hash function is a root-to-leaf path.
– Permits efficient implementation.
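One way to picture a tree of hash functions is the sketch below: each internal node applies a simple key function, and the sequence of values along the path a record follows identifies its block. The `Node` class, `block_id`, and the toy tree (loosely modeled on the title / release-year / director labels of the example figure) are assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

@dataclass
class Node:
    name: str = ""
    key_fn: Optional[Callable[[dict], object]] = None   # None marks a leaf
    children: Dict[object, "Node"] = field(default_factory=dict)

def block_id(node, record, path=()):
    """Follow hash values from the root; the path reached identifies the block."""
    while node.key_fn is not None:
        value = node.key_fn(record)
        path += ((node.name, value),)
        child = node.children.get(value)
        if child is None:        # this value is not split further: block ends here
            break
        node = child
    return path

def title_initial(rec):
    title = rec.get("title")
    return title[0].upper() if title else None

def release_year(rec):
    return rec.get("year")

def director_initial(rec):
    director = rec.get("director")
    return director[0].upper() if director else None

# Hypothetical tree: titles starting with "T" are split further by release year,
# and records with a NULL title are split by director instead.
tree = Node("title", title_initial, {
    "T": Node("release-year", release_year),
    None: Node("director", director_initial),
})

print(block_id(tree, {"title": "Titanic", "year": 1997}))
print(block_id(tree, {"title": "Alien", "year": 1979}))
print(block_id(tree, {"title": None, "director": "Kubrick"}))
```

Because a record follows exactly one path, the blocks are disjoint, and computing a block costs only one key evaluation per level, which is what makes the structure cheap to apply in a Map-Reduce setting.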

Blocking Tree Construction
Hardness: constructing an optimal blocking tree is NP-hard.
Greedy heuristic: successively pick a hash function for each partition having size > S.
The hash function at each node is picked based on:
– Number of positive examples that get split
– Sizes of the remaining canopies
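A greedy construction along these lines could be sketched as follows. The slide names the two criteria but not how they are combined, so the scoring here (fewest split training pairs first, then smallest largest child) is an assumption, as are the function names and the rule that each hash function is used at most once per path. For brevity the sketch returns the resulting blocks rather than the tree itself.

```python
from collections import defaultdict

def split(records, ids, fn):
    """Partition record ids by the value of hash function fn."""
    parts = defaultdict(list)
    for i in ids:
        parts[fn(records[i])].append(i)
    return parts

def build_blocks(records, ids, pairs, fns, S):
    """Greedily split any partition larger than S; return a list of blocks."""
    if len(ids) <= S or not fns:
        return [ids]          # small enough, or no hash function left to try
    members = set(ids)
    local_pairs = [(a, b) for a, b in pairs if a in members and b in members]
    candidates = {}
    for name, fn in fns.items():
        parts = split(records, ids, fn)
        if len(parts) < 2:
            continue          # this function does not split the partition
        broken = sum(1 for a, b in local_pairs
                     if fn(records[a]) != fn(records[b]))
        largest = max(len(p) for p in parts.values())
        candidates[name] = ((broken, largest), parts)   # assumed scoring
    if not candidates:
        return [ids]
    best = min(candidates, key=lambda n: candidates[n][0])
    rest = {n: f for n, f in fns.items() if n != best}
    blocks = []
    for part_ids in candidates[best][1].values():
        blocks.extend(build_blocks(records, part_ids, pairs, rest, S))
    return blocks

# Toy data, invented for illustration; records 5 and 6 are duplicates
# where one copy is missing its year.
movies = {
    1: {"title": "Titanic", "year": 1997},
    2: {"title": "Titanic (1997)", "year": 1997},
    3: {"title": "Taxi Driver", "year": 1976},
    4: {"title": "Toy Story", "year": 1995},
    5: {"title": "Alien", "year": 1979},
    6: {"title": "Alien", "year": None},
}
hash_fns = {
    "title_initial": lambda r: r["title"][0],
    "year": lambda r: r["year"],
}
result = build_blocks(movies, list(movies), [(1, 2), (5, 6)], hash_fns, S=2)
print(result)
```

In this toy run the root splits on the title initial (splitting on year would separate the pair 5, 6), and only the oversized "T" partition is split again by year. Note that a block can stay larger than S if the hash functions run out.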

Extensions
Every block has size < S, but certain blocks may be very small, resulting in low recall.
– Rollup of blocks: merge small blocks to improve recall.
A space of (manually generated) hash functions is assumed as an input to CBLOCK.
– Drill-down: automatically construct a set of simple hash functions.
Allowing non-disjoint blocking can increase recall.
– Use multiple hierarchical blocking trees.

Rollup Problem
Input: blocks C_1, …, C_m (each of size < S), and positive examples T+
Output: canopies D_1, …, D_m such that
– the D_i's are disjoint
– each D_i is a union of some of the C_j's
– |D_i| < S
– recall is maximized subject to the above
Results:
– The problem is NP-complete
– Greedy algorithm based on Dantzig's 2-approximation for the knapsack problem

Rollup Algorithm
In each step, find a pair of blocks D_1 and D_2 which maximize [expression not preserved in the transcript], where benefit(D_1, D_2) = the number of new matching pairs in the training set that will be in the same block after merging D_1 and D_2.
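The expression being maximized did not survive in the transcript, so the sketch below substitutes an assumed one: benefit divided by the size of the merged block, in the spirit of a knapsack-style value-per-weight greedy. The `rollup` function and the toy blocks are illustrative, not the authors' algorithm.

```python
from itertools import combinations

def rollup(blocks, pairs, S):
    """Greedily merge blocks while a merge gains training pairs and stays under S."""
    blocks = [set(b) for b in blocks]

    def benefit(d1, d2):
        # Training pairs currently split across d1 and d2.
        return sum(1 for a, b in pairs
                   if (a in d1 and b in d2) or (a in d2 and b in d1))

    while True:
        best, best_score = None, 0.0
        for i, j in combinations(range(len(blocks)), 2):
            size = len(blocks[i]) + len(blocks[j])
            if size >= S:
                continue      # the merged block must keep size < S
            score = benefit(blocks[i], blocks[j]) / size   # assumed ratio
            if score > best_score:
                best, best_score = (i, j), score
        if best is None:
            return blocks     # no remaining merge brings new pairs together
        i, j = best
        blocks[i] |= blocks[j]
        del blocks[j]         # safe because i < j

# Toy input, invented for illustration.
merged = rollup([[1, 2], [3], [4], [5], [6, 7, 8]],
                pairs=[(1, 2), (3, 4), (5, 6)], S=4)
print(sorted(sorted(b) for b in merged))   # [[1, 2], [3, 4], [5], [6, 7, 8]]
```

Blocks 3 and 4 are merged because that recovers a training pair, while 5 is not merged with {6, 7, 8} even though a pair spans them, since the result would reach size S.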

Drill-down Problem: Summary
Determining a partitioning of an ordered domain such that:
– each partition gives canopy size < S
– recall is maximized
Our result: a polynomial-time optimal algorithm based on dynamic programming.
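The talk does not spell out the dynamic program, so the following is one plausible formulation under stated assumptions: the ordered domain is given as a list of distinct values in sorted order, `counts[k]` is the number of records with the k-th value, each training pair is a pair of value positions, and a partition is a set of contiguous ranges. The name `drill_down` and the toy input are hypothetical.

```python
def drill_down(counts, pairs, S):
    """counts[k]: records with the k-th smallest value; pairs: (k1, k2) positions from T+.
    Returns (covered pairs, list of inclusive (start, end) position ranges)."""
    n = len(counts)
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)

    def covered(i, j):
        # Training pairs with both values in positions i..j.
        return sum(1 for a, b in pairs if i <= min(a, b) and max(a, b) <= j)

    NEG = float("-inf")
    best = [0] + [NEG] * n     # best[j]: optimum for the first j values
    cut = [0] * (n + 1)        # cut[j]: start position of the last range
    for j in range(1, n + 1):
        for i in range(1, j + 1):          # last range covers positions i-1 .. j-1
            if prefix[j] - prefix[i - 1] >= S or best[i - 1] == NEG:
                continue                   # range too large, or prefix infeasible
            cand = best[i - 1] + covered(i - 1, j - 1)
            if cand > best[j]:
                best[j], cut[j] = cand, i - 1
    if best[n] == NEG:
        raise ValueError("a single value already has S or more records")
    ranges, j = [], n
    while j > 0:
        ranges.append((cut[j], j - 1))
        j = cut[j]
    return best[n], ranges[::-1]

# Toy input, invented for illustration.
solution = drill_down(counts=[2, 1, 1, 3, 1],
                      pairs=[(0, 1), (1, 2), (3, 4)], S=5)
print(solution)   # (3, [(0, 2), (3, 4)])
```

The recurrence considers every possible start of the last range, so it runs in time quadratic in the number of distinct values, times the cost of counting covered pairs. That is consistent with the polynomial-time claim, though the authors' actual algorithm may differ.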

Next …
Blocking Problem Statement
CBLOCK
– Hierarchical Blocking Trees: Structure, Construction
– Rollup
– Drill-down
Experiments

Experiments
Datasets:
– Sample of Y! Movies dataset (140K entities)
– Sample of Y! Local dataset (40K entities)
Metrics:
– Recall: fraction of matching pairs in T+ that are in the same block
– Efficiency: computation cost

Experiments
Algorithms:
– Random (R)
– Single-hash (SH)
– Chain (C): conjunctions of hash functions [Michelson & Knoblock, AAAI '06], [Bilenko et al., ICDM '06]
– Chain Tree (CT): the same hash function is used at all levels of the tree
– Hierarchical Blocking Tree (HBT)

Highlights
– Significantly outperforms all other approaches with respect to recall.
– Recall close to 1 using multiple rounds of HBT on the movies data.
Next: a sample of results.

Recall vs Max Canopy Size (Disjoint), Movies Dataset

Recall vs Max Canopy Size (Non-disjoint), Movies Dataset

Summary of Recall on Restaurants

Time (μs), max size = 10K

Summary
– Presented CBLOCK, a system for automatic blocking of large datasets.
– A novel hierarchical blocking tree structure for specifying disjoint blocking functions.
– Extensions: rollup, drill-down, and non-disjoint blocking.
– Experiments show performance improvement over the state of the art.

Thank you!