Record Linkage: A 10-Year Retrospective
Chen Li and Sharad Mehrotra, UC Irvine

Efficient Record Linkage in Large Data Sets
Liang Jin, Chen Li, Sharad Mehrotra
University of California, Irvine
DASFAA, Kyoto, Japan, March 2003

How was the paper written?
- Two faculty working on different areas
- Plus a 1st-year PhD student

Chen’s Story: 2001 …

Data Integration Problems?
Talking to medical doctors…

Example

Table R (columns: Name, SSN, Addr)
- Jack Lemmon, Maple St
- Harrison Ford, Culver Blvd
- Tom Hanks, Main St
- …

Table S (columns: Name, SSN, Addr)
- Ton Hanks, Main Street
- Kevin Spacey, Frost Blvd
- Jack Lemon, Maple Street
- …

Q: Find records from different datasets that could be the same entity

Sharad’s research

Liang’s story
1st-year PhD student at UC Irvine

Challenges
- How to define good similarity functions?
- How to do matching efficiently?

Nested-loop?
- Not desirable for large data sets
- 5 hours for 30K strings!
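To make the cost concrete, here is a minimal sketch of a nested-loop join under edit distance. The function names and the threshold k=2 are illustrative assumptions, not the authors' code; the sample names are taken from the example tables. Every pair is compared, so a self-join over 30K strings needs roughly 30,000² / 2 ≈ 450 million edit-distance computations.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nested_loop_join(r, s, k):
    """Return all (x, y) pairs with edit_distance(x, y) <= k, comparing every pair."""
    return [(x, y) for x in r for y in s if edit_distance(x, y) <= k]


# Names from the example tables; k=2 is an assumed threshold.
R = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
S = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
matches = nested_loop_join(R, S, k=2)
print(matches)
```

The quadratic number of comparisons, each itself quadratic in string length, is what makes this approach impractical at scale.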

Our 2-step approach
- Step 1: map strings (in a metric space) to objects in a Euclidean space
- Step 2: do a similarity join in the Euclidean space

Advantages
- Applicable to many metric similarity functions
  - E.g., edit distance
- Open to existing algorithms
  - Mapping techniques
  - Join techniques

Step 1
Map strings into a high-dimensional Euclidean space
Metric space → Euclidean space
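One way to picture Step 1 is a FastMap-style pivot projection, sketched below. This is an assumption for illustration: the slide does not say which mapping is used, and the function name `fastmap`, the pivot heuristic, and d=3 are choices made here. Each axis is defined by two far-apart pivot strings, and every string is projected onto the line between them using only pairwise distances.

```python
import math
import random


def edit_distance(a, b):
    """Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def fastmap(strings, dist, d, seed=0):
    """Map each string to a d-dimensional point (FastMap-style sketch)."""
    rng = random.Random(seed)
    n = len(strings)
    coords = [[0.0] * d for _ in range(n)]

    def residual(i, j, axis):
        # Distance left over after removing the axes computed so far.
        sq = dist(strings[i], strings[j]) ** 2
        for h in range(axis):
            sq -= (coords[i][h] - coords[j][h]) ** 2
        return math.sqrt(abs(sq))  # abs() guards against negative residuals

    for axis in range(d):
        # Pivot heuristic: start anywhere, then jump to the farthest object twice.
        a = rng.randrange(n)
        b = max(range(n), key=lambda j: residual(a, j, axis))
        a = max(range(n), key=lambda j: residual(b, j, axis))
        d_ab = residual(a, b, axis)
        if d_ab == 0:
            break  # nothing left to separate; remaining axes stay 0.0
        for i in range(n):
            d_ai = residual(a, i, axis)
            d_bi = residual(b, i, axis)
            # Cosine-law projection of string i onto the line through the pivots.
            coords[i][axis] = (d_ai ** 2 + d_ab ** 2 - d_bi ** 2) / (2 * d_ab)
    return coords


names = ["Jack Lemmon", "Jack Lemon", "Tom Hanks", "Ton Hanks",
         "Harrison Ford", "Kevin Spacey", "Jack Lemmon"]
points = fastmap(names, edit_distance, d=3)
```

Because the mapping uses only the distance function, any metric could in principle replace `edit_distance` here; the mapping is approximate, so distances in the new space need not equal the original ones.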

Can it preserve distances?
- Use data set 1 (54K names) as an example
- k=2, d=20
  - Use k’=5.2 to differentiate similar and dissimilar pairs.
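The role of the second threshold can be sketched as a filter-then-verify step, reading k as the edit-distance threshold and k’ as the looser threshold applied in the Euclidean space (an interpretation of the slide). The 2-D coordinates below are made up for illustration and are not the output of any real mapping; the brute-force pair scan stands in for whatever Euclidean join algorithm is actually used.

```python
import math
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def filter_then_verify(strings, points, k, k_prime):
    """Keep pairs within k_prime in the mapped space, then check them with edit distance."""
    candidates = [(i, j) for i, j in combinations(range(len(strings)), 2)
                  if math.dist(points[i], points[j]) <= k_prime]
    verified = [(i, j) for i, j in candidates
                if edit_distance(strings[i], strings[j]) <= k]
    return candidates, verified


strings = ["Tom Hanks", "Ton Hanks", "Kevin Spacey", "Harrison Ford"]
# Hypothetical mapped coordinates, chosen by hand for this example.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (20.0, 20.0)]
candidates, verified = filter_then_verify(strings, points, k=2, k_prime=5.2)
print(candidates, verified)
```

In this toy data the Euclidean filter lets through two false candidates involving "Kevin Spacey", and the edit-distance check removes them. A pair that the mapping pushes farther apart than k’ would be lost, which is why the choice of k’ matters.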

Multi-attribute linkage
- Example: title + name + year
- Different attributes have different similarity functions and thresholds
- Consider merge rules in disjunctive format:
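As an illustration of what a disjunctive merge rule could look like, the sketch below encodes a rule as an OR of AND-clauses, where each clause pairs an attribute with its own distance function and threshold. The specific clauses, thresholds, and records are hypothetical and are not taken from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def year_gap(a, b):
    """Distance between two years: absolute difference."""
    return abs(a - b)


# Hypothetical rule: outer list = disjunction, inner lists = conjunctions of
# (attribute, distance function, threshold).
RULE = [
    [("title", edit_distance, 4), ("name", edit_distance, 3)],
    [("title", edit_distance, 1), ("year", year_gap, 1)],
]


def records_match(rec1, rec2, rule):
    """True if at least one conjunction has every attribute within its threshold."""
    return any(
        all(dist(rec1[attr], rec2[attr]) <= limit for attr, dist, limit in conj)
        for conj in rule
    )


a = {"title": "Record Linkage", "name": "Chen Li", "year": 2003}
b = {"title": "Record Linkge", "name": "C. Li", "year": 2003}
c = {"title": "Record Linkage", "name": "Sharad Mehrotra", "year": 2004}
e = {"title": "Data Cleansing", "name": "Chen Li", "year": 2003}

print(records_match(a, b, RULE), records_match(a, c, RULE), records_match(a, e, RULE))
```

Here a and b satisfy the first clause, a and c fail it on the name but satisfy the second clause, and a and e fail both because the titles are too far apart.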

Secret of the paper …

Work since then …
- Chen: efficiency
- Sharad: quality

Chen’s Work on Efficiency
- Gram-based algorithms
  - Indexing
  - Selection algorithms
  - Join algorithms
  - Variable-length grams
  - Selectivity estimation
- Trie-based algorithms
  - Instant search
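The idea behind gram-based algorithms can be illustrated with the well-known count filter: if two strings are within edit distance k, they must share at least max(|s|, |t|) - q + 1 - k·q of their q-grams, because each edit operation can destroy at most q grams. The sketch below uses unpadded 2-grams and is a generic illustration, not the API of any particular package.

```python
from collections import Counter


def qgrams(s, q=2):
    """Bag (multiset) of the overlapping substrings of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_count_filter(s, t, k, q=2):
    """Necessary condition for edit_distance(s, t) <= k; False means a sure non-match."""
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())  # bag intersection size
    needed = max(len(s), len(t)) - q + 1 - k * q
    return shared >= needed


print(passes_count_filter("Tom Hanks", "Ton Hanks", k=1))
print(passes_count_filter("Tom Hanks", "Kevin Spacey", k=1))
```

Pairs that pass still need an exact edit-distance check, since the filter can admit false positives. An inverted index from grams to string ids lets the shared-gram counts be computed without comparing every pair.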

The Flamingo Package

Follow-up work in the community
Significant amount of work on approximate string queries
- Selection
- Join

Make an impact?

UCI People Search

Psearch (2008): 2 stories

Fuzzy search

Location-based search

Research commercialization

Lesson learned: Hands-on experience is important!