Information Integration Entity Resolution – 21.7 Presented By: Deepti Bhardwaj Roll No: 223_103.


1 Information Integration Entity Resolution – 21.7 Presented By: Deepti Bhardwaj Roll No: 223_103

2 Contents 21.7 Entity Resolution  21.7.1 Deciding Whether Records Represent a Common Entity  21.7.2 Merging Similar Records  21.7.3 Useful Properties of Similarity and Merge Functions  21.7.4 The R-Swoosh Algorithm for ICAR Records  21.7.5 Other Approaches to Entity Resolution

3 Introduction Determining whether two records or tuples do or do not represent the same person, organization, place or other entity is called ENTITY RESOLUTION.

4 Deciding Whether Records Represent a Common Entity Two records represent the same individual if they have similar values for each of their corresponding fields. Requiring the values of corresponding fields to be identical is too strict, for the following reasons: 1. Misspellings 2. Variant Names 3. Misunderstanding of Names

5 Continue: Deciding Whether Records Represent a Common Entity 4. Evolution of Values 5. Abbreviations Thus, when deciding whether two records represent the same entity, we need to look carefully at the kinds of discrepancies that occur and use a test that measures the similarity of records.

6 Deciding Whether Records Represent a Common Entity - Edit Distance The first approach to measuring the similarity of records is edit distance. String values can be compared by counting the number of character insertions and deletions needed to turn one string into the other. Two records are taken to represent the same entity if the edit distance between them is below a given threshold.
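
The insertion/deletion count described on this slide can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: the function names `edit_distance` and `similar_strings` and the default threshold of 3 are assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Number of single-character insertions and deletions needed to turn a into b.

    Substitutions are not a primitive here: replacing one character costs 2
    (one deletion plus one insertion), matching the slide's definition.
    """
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1])  # characters match, no cost
            else:
                # delete ca from a, or insert cb into a
                curr.append(1 + min(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similar_strings(a: str, b: str, threshold: int = 3) -> bool:
    """Hypothetical test: values are similar if the distance is below the threshold."""
    return edit_distance(a, b) < threshold


print(edit_distance("Susan", "Suzan"))  # delete 's', insert 'z'
```

The right threshold depends on the field: a value that works for names may be too loose for phone numbers.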

7 Deciding Whether Records Represent a Common Entity - Normalization We can normalize records by replacing certain substrings with others. For instance, we can use a table of abbreviations and replace each abbreviation with what it normally stands for. Once the records are normalized, we can use edit distance to measure the difference between the normalized values in the fields.
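
As a rough sketch of table-driven normalization, the snippet below expands abbreviations token by token. The `ABBREVIATIONS` table and the `normalize` function are hypothetical, and lowercasing and dropping punctuation are extra assumptions beyond what the slide states.

```python
import re

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def normalize(value: str) -> str:
    """Lowercase the value, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


print(normalize("123 Oak St."))     # 123 oak street
print(normalize("123 Oak Street"))  # 123 oak street
```

After normalization the two address variants are identical, so any remaining distance between field values reflects real differences such as misspellings.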

8 Merging Similar Records Merging means replacing two records that are similar enough with a single record that contains the information of both. There are many possible merge rules, for example: 1. Set any field on which the records disagree to the empty string. 2. (i) Merge by taking the union of the values in each field, and (ii) declare two records similar if at least two of the three fields have a nonempty intersection.

9 Continue: Merging Similar Records

     Name    Address          Phone
1.   Susan   123 Oak St.      818-555-1234
2.   Susan   456 Maple St.    818-555-1234
3.   Susan   456 Maple St.    213-555-5678

After Merging

         Name    Address                         Phone
(1-2-3)  Susan   {123 Oak St., 456 Maple St.}    {818-555-1234, 213-555-5678}
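
The union-merge rule and the "two of three fields" similarity test can be sketched with set-valued fields, using the three Susan records from the slide. The helper names `make_record`, `similar`, and `merge` and the dictionary representation are assumptions made for illustration.

```python
FIELDS = ("name", "address", "phone")


def make_record(name: str, address: str, phone: str) -> dict:
    """Store each field as a set so that merged records can hold several values."""
    return {
        "name": frozenset([name]),
        "address": frozenset([address]),
        "phone": frozenset([phone]),
    }


def similar(r: dict, s: dict) -> bool:
    """Similar if at least two of the three fields share a value."""
    return sum(1 for f in FIELDS if r[f] & s[f]) >= 2


def merge(r: dict, s: dict) -> dict:
    """Merge by taking the union of the values in each field."""
    return {f: r[f] | s[f] for f in FIELDS}


r1 = make_record("Susan", "123 Oak St.", "818-555-1234")
r2 = make_record("Susan", "456 Maple St.", "818-555-1234")
r3 = make_record("Susan", "456 Maple St.", "213-555-5678")

merged = merge(merge(r1, r2), r3)
print(sorted(merged["address"]))  # ['123 Oak St.', '456 Maple St.']
```

Records 1 and 3 agree only on the name, so they are not similar on their own; they end up together because each is similar to record 2.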

10 Useful Properties of Similarity and Merge Functions The following properties say that the merge operation forms a semilattice: 1. Idempotence: The merge of a record with itself should be that same record. 2. Commutativity: If we merge two records, the order in which we list them should not matter. 3. Associativity: The order in which we group records for a merger should not matter.

11 Continue: Useful Properties of Similarity and Merge Functions There are some other properties that we expect the similarity relationship to have: Idempotence for similarity: A record is always similar to itself. Commutativity of similarity: In deciding whether two records are similar, it does not matter in which order we list them. Representability: If r is similar to some other record s, but s is instead merged with some other record t, then r remains similar to the merger of s and t and can be merged with that record.
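
One way to make the ICAR properties (Idempotence, Commutativity, Associativity, Representability) concrete is to check them by brute force on a small sample. The sketch below assumes the union merge and the "two of three fields intersect" similarity test; `rec`, `merge`, `similar`, and `satisfies_icar` are hypothetical names, and passing on a sample is evidence, not a proof.

```python
from itertools import product

FIELDS = ("name", "address", "phone")


def rec(name: str, address: str, phone: str) -> dict:
    return {"name": frozenset([name]), "address": frozenset([address]),
            "phone": frozenset([phone])}


def merge(r: dict, s: dict) -> dict:
    return {f: r[f] | s[f] for f in FIELDS}


def similar(r: dict, s: dict) -> bool:
    return sum(1 for f in FIELDS if r[f] & s[f]) >= 2


def satisfies_icar(records: list) -> bool:
    """Exhaustively test the ICAR properties on the given sample records."""
    for r in records:
        # Idempotence of merge and of similarity
        if merge(r, r) != r or not similar(r, r):
            return False
    for r, s in product(records, repeat=2):
        # Commutativity of merge and of similarity
        if merge(r, s) != merge(s, r) or similar(r, s) != similar(s, r):
            return False
    for r, s, t in product(records, repeat=3):
        # Associativity of merge
        if merge(merge(r, s), t) != merge(r, merge(s, t)):
            return False
        # Representability: r similar to s implies r similar to merge(s, t)
        if similar(r, s) and not similar(r, merge(s, t)):
            return False
    return True


sample = [
    rec("Susan", "123 Oak St.", "818-555-1234"),
    rec("Susan", "456 Maple St.", "818-555-1234"),
    rec("Susan", "456 Maple St.", "213-555-5678"),
]
print(satisfies_icar(sample))  # True
```

Representability holds here because a union merge never removes values, so any field intersection that made r similar to s survives in the merger of s and t.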

12 R-Swoosh Algorithm for ICAR Records
Input: A set of records I, a similarity function, and a merge function.
Output: A set of merged records O.
Method:
  O := emptyset;
  WHILE I is not empty DO BEGIN
    Let r be any record in I;
    Find, if possible, some record s in O that is similar to r;
    IF no record s exists THEN
      move r from I to O
    ELSE BEGIN
      delete r from I;
      delete s from O;
      add the merger of r and s to I;
    END;
  END;
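
The pseudocode above translates fairly directly into Python. The following is a sketch under stated assumptions: records are dictionaries of set-valued fields, and `similar` and `merge` are the illustrative union-based functions (hypothetical names), passed in as parameters so that other ICAR-compliant functions could be substituted.

```python
FIELDS = ("name", "address", "phone")


def rec(name: str, address: str, phone: str) -> dict:
    return {"name": frozenset([name]), "address": frozenset([address]),
            "phone": frozenset([phone])}


def similar(r: dict, s: dict) -> bool:
    # Assumed rule: at least two of the three fields share a value.
    return sum(1 for f in FIELDS if r[f] & s[f]) >= 2


def merge(r: dict, s: dict) -> dict:
    # Assumed rule: union of the values in each field.
    return {f: r[f] | s[f] for f in FIELDS}


def r_swoosh(records, similar, merge) -> list:
    pending = list(records)  # the set I
    output = []              # the set O
    while pending:
        r = pending.pop()    # "let r be any record in I", removed immediately
        match = next((s for s in output if similar(r, s)), None)
        if match is None:
            output.append(r)                 # move r from I to O
        else:
            output.remove(match)             # delete s from O
            pending.append(merge(r, match))  # add the merger back to I
    return output


records = [
    rec("Susan", "123 Oak St.", "818-555-1234"),
    rec("Susan", "456 Maple St.", "818-555-1234"),
    rec("Susan", "456 Maple St.", "213-555-5678"),
    rec("Robert", "9 Elm Rd.", "310-555-0000"),
]
result = r_swoosh(records, similar, merge)
print(len(result))  # 2: one merged Susan record and the Robert record
```

Each merge reduces the total number of records in I and O by one, and each non-merge moves a record out of I for good, so the loop terminates.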

13 Other Approaches to Entity Resolution The other approaches to entity resolution are: – Non-ICAR Datasets – Clustering – Partitioning

14 Other Approaches to Entity Resolution - Non-ICAR Datasets Non-ICAR Datasets: We can define a dominance relation r <= s, meaning that record s contains all the information contained in record r. When r <= s holds, we can eliminate record r from further consideration.
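
As an illustration of a dominance relation, the sketch below assumes records with set-valued fields and defines r <= s as "every field of r is a subset of the same field of s". Both that definition and the names `dominated_by` and `drop_dominated` are assumptions; the slide does not fix a particular representation.

```python
FIELDS = ("name", "address", "phone")


def dominated_by(r: dict, s: dict) -> bool:
    """r <= s: every value in each field of r also appears in that field of s."""
    return all(r[f] <= s[f] for f in FIELDS)


def drop_dominated(records: list) -> list:
    """Keep only records that no other kept record dominates."""
    kept = []
    for r in records:
        if any(dominated_by(r, k) for k in kept):
            continue  # r adds no information
        # r may dominate records kept earlier; remove those before adding r
        kept = [k for k in kept if not dominated_by(k, r)]
        kept.append(r)
    return kept


small = {"name": {"Susan"}, "address": {"123 Oak St."}, "phone": {"818-555-1234"}}
large = {"name": {"Susan"}, "address": {"123 Oak St.", "456 Maple St."},
         "phone": {"818-555-1234"}}
other = {"name": {"Robert"}, "address": {"9 Elm Rd."}, "phone": {"310-555-0000"}}

print(len(drop_dominated([small, large, other])))  # 2
```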

15 Other Approaches to Entity Resolution - Clustering Clustering: Sometimes we group the records into clusters such that the members of a cluster are in some sense similar to each other and members of different clusters are not similar.

16 Other Approaches to Entity Resolution - Partitioning Partitioning: We can group the records, perhaps several times, into groups that are likely to contain similar records and look only within each group for pairs of similar records.
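
The partitioning idea can be sketched as follows: group the records once per grouping key and compare only pairs that share a group. The function `candidate_pairs` and the choice of phone and address as keys are assumptions for this example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records: list, key_funcs: list) -> set:
    """Return index pairs (i, j), i < j, that share a group under at least one key."""
    pairs = set()
    for key in key_funcs:  # one partitioning pass per key
        groups = defaultdict(list)
        for idx, record in enumerate(records):
            groups[key(record)].append(idx)
        for ids in groups.values():
            pairs.update(combinations(ids, 2))  # only look within each group
    return pairs


records = [
    {"name": "Susan", "address": "123 Oak St.", "phone": "818-555-1234"},
    {"name": "Susan", "address": "456 Maple St.", "phone": "818-555-1234"},
    {"name": "Susan", "address": "456 Maple St.", "phone": "213-555-5678"},
    {"name": "Robert", "address": "9 Elm Rd.", "phone": "310-555-0000"},
]

by_phone = lambda r: r["phone"]
by_address = lambda r: r["address"]

print(sorted(candidate_pairs(records, [by_phone, by_address])))  # [(0, 1), (1, 2)]
```

Only 2 of the 6 possible pairs are compared here. Partitioning several times with different keys lowers the chance that a truly matching pair never lands in the same group.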

17 Thank You

