Duplicate Detection. Exercise 1. Use Extended Key to do Entity Identification[1]


1 Duplicate Detection

2 Exercise 1. Use Extended Key to do Entity Identification[1]

3 Tables R and S are shown below.

Table R
Name             City      ZIP  PersonNr
Eva Aadde        INGARÖ    …    …
Eva Aalto        Norsborg  …    …
Eva Abrahamsson  INGARÖ    …    …

Table S
Name             HomeAddress      Telephone
Eva Aadde        Myskviksvägen 8  …
Eva Abrahamsson  Myrvägen 2       …
Eva Abrahamsson  Pilgatan 9       …
Eva Abrahamsson  Nyängsvägen 39A  …

4 Suppose the extended key is {Name, City, HomeAddress} and the following ILFDs (instance-level functional dependencies) hold:
– (E.HomeAddress = "Myskviksvägen 8") -> (E.City = "INGARÖ")
– (E.HomeAddress = "Myrvägen 2") -> (E.City = "INGARÖ")
– (E.HomeAddress = "Pilgatan 9") -> (E.City = "STOCKHOLM")
– (E.HomeAddress = "Nyängsvägen 39A") -> (E.City = "TULLINGE")
Construct the integrated table.

[1] Ee-Peng Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson, "Entity Identification in Database Integration," Proceedings of the Ninth International Conference on Data Engineering, pp. …, April 19-23, 1993.

5 Answer to Exercise 1: Integrated Table

Name             City       ZIP   PersonNr  HomeAddress      Telephone
Eva Aadde        INGARÖ     …     …         Myskviksvägen 8  …
Eva Abrahamsson  INGARÖ     …     …         Myrvägen 2       …
Eva Abrahamsson  STOCKHOLM  NULL  NULL      Pilgatan 9       …
Eva Abrahamsson  TULLINGE   NULL  NULL      Nyängsvägen 39A  …
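The integration step can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure from [1]: the function name `integrate` is hypothetical, and the ZIP, PersonNr and Telephone values are made-up placeholders because the actual values are not available in this transcript. Each tuple of S gets its City from the ILFDs and is then matched to R on the shared key attributes Name and City. Tuples of R without a partner in S (Eva Aalto) are not emitted, which mirrors the answer table; a full outer join would keep them.

```python
# ILFDs from the exercise, written as a lookup: HomeAddress -> City.
ILFD_CITY = {
    "Myskviksvägen 8": "INGARÖ",
    "Myrvägen 2": "INGARÖ",
    "Pilgatan 9": "STOCKHOLM",
    "Nyängsvägen 39A": "TULLINGE",
}

# Placeholder values (hypothetical): zip-*, pnr-* and tel-* stand in for
# the real ZIP, PersonNr and Telephone entries.
R = [
    {"Name": "Eva Aadde", "City": "INGARÖ", "ZIP": "zip-1", "PersonNr": "pnr-1"},
    {"Name": "Eva Aalto", "City": "Norsborg", "ZIP": "zip-2", "PersonNr": "pnr-2"},
    {"Name": "Eva Abrahamsson", "City": "INGARÖ", "ZIP": "zip-3", "PersonNr": "pnr-3"},
]
S = [
    {"Name": "Eva Aadde", "HomeAddress": "Myskviksvägen 8", "Telephone": "tel-1"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Myrvägen 2", "Telephone": "tel-2"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Pilgatan 9", "Telephone": "tel-3"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Nyängsvägen 39A", "Telephone": "tel-4"},
]


def integrate(r_rows, s_rows, ilfd_city):
    """Match S tuples to R tuples on (Name, City), deriving City via the ILFDs."""
    r_index = {(r["Name"], r["City"]): r for r in r_rows}
    result = []
    for s in s_rows:
        city = ilfd_city.get(s["HomeAddress"])   # City derived from the ILFD
        r = r_index.get((s["Name"], city))       # None when no R tuple matches
        result.append({
            "Name": s["Name"],
            "City": city,
            "ZIP": r["ZIP"] if r else None,           # None plays the role of NULL
            "PersonNr": r["PersonNr"] if r else None,
            "HomeAddress": s["HomeAddress"],
            "Telephone": s["Telephone"],
        })
    return result


integrated = integrate(R, S, ILFD_CITY)
for row in integrated:
    print(row)
```

The two Eva Abrahamsson tuples whose derived City is STOCKHOLM or TULLINGE find no partner in R, so their ZIP and PersonNr stay empty.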

6 Exercise 2. Use Priority Queue to do Duplicate Detection[2]

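7 Given the conditions below, use the Priority Queue algorithm to find the duplicate clusters.

1. Table R, already sorted by an application-specific key, contains the tuples T1, T2, T3, T4, T5, T6, T7.

2. Similarities between tuples (the matrix is symmetric; only the lower triangle is shown):

Tuple  T1   T2   T3   T4   T5   T6   T7
T1
T2     0.6
T3     0.1  0.2
T4     0.3  0.4  0.9
T5     0.5  0.4  0.4  …
T6     …    …    0.6  …    0.4
T7     …    …    0.5  …    0.8  0.4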

8 3. Method for computing the matching score: given a cluster, the matching score of a tuple is the average of the tuple's similarities with all of the cluster's representatives.
4. Condition for declaring a new cluster: matching score < 0.5
5. Condition for declaring a representative: 0.5 < matching score < 0.8
6. Size of the priority queue: …

[2] A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. ACM-SIGMOD Workshop on Research Issues on Knowledge Discovery and Data Mining, 1997.

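9 Answer

Record 1: Queue {1}

Record 2: 2:1 = 0.6, which is > 0.5 and < 0.8, so record 2 joins the cluster as a representative. Queue {1, 2}

Record 3: 3:1 = 0.1, 3:2 = 0.2; matching score = (0.1 + 0.2) / 2 = 0.15 < 0.5, so a new cluster is declared. Queue {3} {1, 2}

Record 4: 4:1 = 0.3, 4:2 = 0.4; matching score = (0.3 + 0.4) / 2 = 0.35 < 0.5. 4:3 = 0.9, which is > 0.5 and > 0.8, so record 4 joins the cluster of record 3 but is not a representative. Queue {3, 4} {1, 2}

Record 5: 5:1 = 0.5, 5:2 = 0.4; matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5. 5:3 = 0.4; matching score = 0.4 < 0.5, so a new cluster is declared. Queue {5} {3, 4} {1, 2}

Record 6: 6:3 = 0.6; matching score = 0.6, which is > 0.5 and < 0.8, so record 6 joins as a representative. 6:5 = 0.4 < 0.5. Queue {3, 4, 6} {5} {1, 2}

Record 7: 7:3 = 0.5, 7:6 = 0.4; matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5. 7:5 = 0.8 > 0.5, so record 7 joins the cluster of record 5. Queue {5, 7} {3, 4, 6} {1, 2}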
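The walk-through above can be replayed with a short Python sketch. It is a simplified illustration of the rules on these slides, not the full algorithm of [2], which also relies on a union-find structure. Several things here are assumptions: the function name `detect_duplicates` is hypothetical, the similarity values are only the pairs that appear in the worked answer, the queue is assumed to hold 2 clusters (the answer never compares records 6 and 7 against {1, 2}), and a score of exactly 0.5 is treated as a match because the slides leave that case open.

```python
# Pairwise similarities taken from the worked answer: (later, earlier) -> value.
SIM = {
    (2, 1): 0.6,
    (3, 1): 0.1, (3, 2): 0.2,
    (4, 1): 0.3, (4, 2): 0.4, (4, 3): 0.9,
    (5, 1): 0.5, (5, 2): 0.4, (5, 3): 0.4,
    (6, 3): 0.6, (6, 5): 0.4,
    (7, 3): 0.5, (7, 5): 0.8, (7, 6): 0.4,
}


def sim(a, b):
    """Symmetric lookup of the similarity between records a and b."""
    return SIM[(max(a, b), min(a, b))]


def detect_duplicates(records, queue_size=2, new_cluster_t=0.5, rep_t=0.8):
    """Cluster records with a bounded queue of clusters (assumed size 2)."""
    queue = []    # most recently touched cluster first
    evicted = []  # clusters pushed out of the queue; they are final
    for rec in records:
        best_idx, best_score = None, None
        for idx, cluster in enumerate(queue):
            reps = cluster["reps"]
            # Matching score: average similarity to the representatives.
            score = sum(sim(rec, r) for r in reps) / len(reps)
            if best_score is None or score > best_score:
                best_idx, best_score = idx, score
        if best_idx is None or best_score < new_cluster_t:
            # No cluster matches: declare a new one at the head of the queue.
            queue.insert(0, {"members": [rec], "reps": [rec]})
            if len(queue) > queue_size:
                evicted.append(queue.pop())
        else:
            cluster = queue.pop(best_idx)
            cluster["members"].append(rec)
            if best_score < rep_t:
                # Similar, but different enough to serve as a representative.
                cluster["reps"].append(rec)
            queue.insert(0, cluster)
    return [c["members"] for c in queue + evicted]


clusters = detect_duplicates(range(1, 8))
print(clusters)
```

Under these assumptions the sketch ends with the same three clusters as the answer: {5, 7}, {3, 4, 6} and {1, 2}.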

