Information Integration Entity Resolution – 21.7 Presented By: Deepti Bhardwaj Roll No: 223_103.

Presentation transcript:

Contents
21.7 Entity Resolution
- Deciding Whether Records Represent a Common Entity
- Merging Similar Records
- Useful Properties of Similarity and Merge Functions
- The R-Swoosh Algorithm for ICAR Records
- Other Approaches to Entity Resolution

Introduction
Entity resolution is the problem of determining whether two records or tuples represent the same person, organization, place, or other entity.

Deciding Whether Records Represent a Common Entity
Two records are taken to represent the same individual if they have similar values in each of their corresponding fields. Requiring the values of corresponding fields to be identical is too strict a test, for the following reasons:
1. Misspellings
2. Variant names
3. Misunderstanding of names

Continue: Deciding Whether Records Represent a Common Entity
4. Evolution of values
5. Abbreviations
Thus, when deciding whether two records represent the same entity, we need to look carefully at the kinds of discrepancies that occur and use a test that measures the similarity of the records.

Deciding Whether Records Represent a Common Entity - Edit Distance
The first approach to measuring the similarity of records is edit distance. Values that are strings can be compared by counting the number of insertions and deletions of characters it takes to turn one string into the other. Two values are then treated as similar, and the records as candidates for the same entity, if their edit distance is below a given threshold.
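The insertion/deletion count described on this slide can be sketched as follows. This is a minimal illustration, not code from the presentation: the function names `edit_distance` and `similar_strings` are hypothetical, and the distance is computed through the longest common subsequence (LCS), since every character outside the LCS must be either deleted from one string or inserted from the other.

```python
def edit_distance(a: str, b: str) -> int:
    """Insert/delete-only edit distance between strings a and b."""
    # prev[j] is the LCS length of the already-processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    # Characters not in the LCS are deleted from a or inserted from b.
    return len(a) + len(b) - 2 * lcs


def similar_strings(a: str, b: str, threshold: int) -> bool:
    """Treat two strings as similar when their distance is below the threshold."""
    return edit_distance(a, b) < threshold


print(edit_distance("Smythe", "Smith"))
print(similar_strings("Smythe", "Smith", threshold=4))
```

The threshold is a tuning choice: a small value misses real matches, a large one merges distinct entities.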

Deciding Whether Records Represent a Common Entity - Normalization
The second approach is to normalize records by replacing certain substrings with others. For instance, we can use a table of abbreviations and replace each abbreviation by what it normally stands for. Once the records are normalized, we can use edit distance to measure the difference between the normalized values in each field.
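A table-driven normalization step might look like the sketch below. The abbreviation table and the `normalize` function are hypothetical examples, and lowercasing plus punctuation stripping are extra assumptions that the slide does not mention.

```python
import re

# Hypothetical abbreviation table; a real one would be domain-specific.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def normalize(value: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    words = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


print(normalize("123 Oak St."))
print(normalize("123 Oak Street"))
```

Both inputs normalize to the same string, so a subsequent edit-distance comparison sees no difference between them. A fixed table can also expand wrongly (for example "St" meaning "Saint"), which is one reason normalization is usually followed by a similarity test rather than an equality test.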

Merging Similar Records
Merging means replacing two records that are similar enough by a single record that contains the information of both. There are many possible merge rules:
1. Set each field in which the records disagree to the empty string.
2. (i) Merge by taking the union of the values in each field, and (ii) declare two records similar if at least two of the three fields have a nonempty intersection.

Continue: Merging Similar Records
     Name    Address         Phone
1.   Susan   123 Oak St
2.   Susan   456 Maple St
3.   Susan   456 Maple St
After merging:
          Name    Address                        Phone
(1-2-3)   Susan   {123 Oak St., 456 Maple St}    { , }
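Rule 2 (union merge with the two-of-three similarity test) can be sketched as below. Records are modeled as dictionaries of value sets, which is an assumption made for illustration. The phone numbers are hypothetical placeholders, because the transcript does not preserve the originals, and `make_record`, `similar`, and `merge` are illustrative names.

```python
FIELDS = ("name", "address", "phone")


def make_record(name, address, phone):
    """Build a record whose fields are sets of values."""
    return {
        "name": frozenset([name]),
        "address": frozenset([address]),
        "phone": frozenset([phone]),
    }


def similar(r, s):
    """Similar if at least two of the three fields share a value."""
    return sum(1 for f in FIELDS if r[f] & s[f]) >= 2


def merge(r, s):
    """Merge by taking the union of the values in each field."""
    return {f: r[f] | s[f] for f in FIELDS}


# Hypothetical phone numbers, chosen only to make the example concrete.
r1 = make_record("Susan", "123 Oak St", "555-0101")
r2 = make_record("Susan", "456 Maple St", "555-0101")
r3 = make_record("Susan", "456 Maple St", "555-0202")

print(similar(r1, r2), similar(r2, r3), similar(r1, r3))
merged = merge(merge(r1, r2), r3)
print(sorted(merged["address"]))
```

Under these assumed values, records 1 and 3 are not directly similar, yet each is similar to record 2, so all three end up in one merged record.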

Useful Properties of Similarity and Merge Functions
The following properties say that the merge operation forms a semilattice:
1. Idempotence: the merge of a record with itself should be that same record.
2. Commutativity: if we merge two records, the order in which we list them should not matter.
3. Associativity: the order in which we group records for a merger should not matter.

Continue: Useful Properties of Similarity and Merge Functions
There are some other properties that we expect the similarity relationship to have:
- Idempotence for similarity: a record is always similar to itself.
- Commutativity of similarity: in deciding whether two records are similar, it does not matter in which order we list them.
- Representability: if r is similar to some other record s, but s is instead merged with some other record t, then r remains similar to the merger of s and t and can be merged with that record.
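These properties (Idempotence, Commutativity, Associativity, Representability, hence "ICAR") can be spot-checked on sample data. The sketch below does this by brute force for the union merge and the two-of-three similarity test. The helper `check_icar` and the sample phone numbers are hypothetical, and passing on a sample is evidence, not a proof.

```python
from itertools import product

FIELDS = ("name", "address", "phone")


def rec(name, address, phone):
    return {"name": frozenset([name]), "address": frozenset([address]),
            "phone": frozenset([phone])}


def similar(r, s):
    return sum(1 for f in FIELDS if r[f] & s[f]) >= 2


def merge(r, s):
    return {f: r[f] | s[f] for f in FIELDS}


def check_icar(records):
    """Test each ICAR property on every combination of the sample records."""
    ok = dict.fromkeys(
        ["idempotence", "commutativity", "associativity", "representability"], True)
    for r in records:
        if merge(r, r) != r or not similar(r, r):
            ok["idempotence"] = False
    for r, s in product(records, repeat=2):
        if merge(r, s) != merge(s, r) or similar(r, s) != similar(s, r):
            ok["commutativity"] = False
    for r, s, t in product(records, repeat=3):
        if merge(r, merge(s, t)) != merge(merge(r, s), t):
            ok["associativity"] = False
        # r similar to s, and s merged with a similar t: r must stay similar.
        if similar(r, s) and similar(s, t) and not similar(r, merge(s, t)):
            ok["representability"] = False
    return ok


sample = [rec("Susan", "123 Oak St", "555-0101"),
          rec("Susan", "456 Maple St", "555-0101"),
          rec("Susan", "456 Maple St", "555-0202")]
print(check_icar(sample))
```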

R-Swoosh Algorithm for ICAR Records
Input: a set of records I, a similarity function, and a merge function.
Output: a set of merged records O.
Method:
    O := emptyset;
    WHILE I is not empty DO BEGIN
        let r be any record in I;
        find, if possible, some record s in O that is similar to r;
        IF no such record s exists THEN
            move r from I to O
        ELSE BEGIN
            delete r from I;
            delete s from O;
            add the merger of r and s to I;
        END;
    END;
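A direct Python rendering of the method above is sketched here. The function `r_swoosh` takes the similarity and merge functions as parameters. The record layout, the sample data, and the phone numbers are assumptions for illustration only.

```python
FIELDS = ("name", "address", "phone")


def rec(name, address, phone):
    return {"name": frozenset([name]), "address": frozenset([address]),
            "phone": frozenset([phone])}


def similar(r, s):
    return sum(1 for f in FIELDS if r[f] & s[f]) >= 2


def merge(r, s):
    return {f: r[f] | s[f] for f in FIELDS}


def r_swoosh(records, similar, merge):
    """Repeatedly move records from I to O, merging on any similarity hit."""
    pending = list(records)  # the set I
    output = []              # the set O
    while pending:
        r = pending.pop()    # remove any record r from I
        idx = next((i for i, s in enumerate(output) if similar(r, s)), None)
        if idx is None:
            output.append(r)             # no match in O: move r to O
        else:
            s = output.pop(idx)          # delete s from O
            pending.append(merge(r, s))  # put the merger back into I
    return output


data = [rec("Susan", "123 Oak St", "555-0101"),
        rec("Susan", "456 Maple St", "555-0101"),
        rec("Susan", "456 Maple St", "555-0202"),
        rec("Bob", "9 Elm Rd", "555-0303")]
result = r_swoosh(data, similar, merge)
print(len(result))
```

Because a merged record goes back into I instead of staying in O, it is compared again against everything already in O, which is how chains of merges are picked up.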

Other Approaches to Entity Resolution
The other approaches to entity resolution are:
- Non-ICAR datasets
- Clustering
- Partitioning

Other Approaches to Entity Resolution - Non-ICAR Datasets
Non-ICAR datasets: we can define a dominance relation r <= s, meaning that record s contains all the information contained in record r. When r <= s holds, we can eliminate record r from further consideration.
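One way to make the dominance relation concrete, assuming records are dictionaries of value sets, is to say r <= s when every field of r is a subset of the same field of s. The sketch below uses that assumed definition. `dominated_by` and `drop_dominated` are hypothetical names, and exact duplicates are resolved by keeping the earliest copy.

```python
def dominated_by(r, s):
    """True when s contains every value that r has in every field (r <= s)."""
    return all(values <= s.get(field, frozenset()) for field, values in r.items())


def drop_dominated(records):
    """Remove each record that is dominated by another record."""
    kept = []
    for i, r in enumerate(records):
        dominated = any(
            j != i
            and dominated_by(r, s)
            # Strict dominance, or an exact duplicate that appears earlier.
            and (not dominated_by(s, r) or j < i)
            for j, s in enumerate(records)
        )
        if not dominated:
            kept.append(r)
    return kept


a = {"name": frozenset(["Susan"]), "address": frozenset(["123 Oak St"])}
b = {"name": frozenset(["Susan"]),
     "address": frozenset(["123 Oak St", "456 Maple St"])}
print(len(drop_dominated([a, b, a])))
```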

Other Approaches to Entity Resolution - Clustering
Clustering: sometimes we group the records into clusters such that members of a cluster are in some sense similar to each other, while members of different clusters are not similar.

Other Approaches to Entity Resolution - Partitioning
Partitioning: we can group the records, perhaps several times, into groups that are likely to contain similar records, and look only within each group for pairs of similar records.
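The partitioning idea can be sketched as follows: records are bucketed under one or more keys, and only pairs that share a bucket are passed on to the more expensive similarity test. The key functions used here (first letter of the name, full phone number) and the sample records are hypothetical choices for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_funcs):
    """Return index pairs (i, j), i < j, that share a bucket under some key."""
    pairs = set()
    for key in key_funcs:  # one partitioning pass per key function
        buckets = defaultdict(list)
        for idx, record in enumerate(records):
            buckets[key(record)].append(idx)
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs


people = [
    {"name": "Susan", "phone": "555-0101"},
    {"name": "Suzan", "phone": "555-0101"},
    {"name": "Bob", "phone": "555-0303"},
    {"name": "Sue", "phone": "555-0404"},
]
keys = [lambda r: r["name"][0].lower(), lambda r: r["phone"]]
print(sorted(candidate_pairs(people, keys)))
```

With four records there are six possible pairs, and the two passes here leave three to compare. Using several keys lowers the chance that a true match is missed because the two records landed in different groups under one key.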

Thank You