# Duplicate Detection. Exercise 1. Use Extended Key to do Entity Identification[1]


Duplicate Detection

Exercise 1. Use Extended Key to do Entity Identification[1]

Tables R and S are shown below.

Table R

| Name | City | ZIP | PersonNr |
| --- | --- | --- | --- |
| Eva Aadde | INGARÖ | 13469 | 840126-1223 |
| Eva Aalto | Norsborg | 14564 | 851201-1225 |
| Eva Abrahamsson | INGARÖ | 13463 | 861227-1227 |

Table S

| Name | HomeAddress | Telephone |
| --- | --- | --- |
| Eva Aadde | Myskviksvägen 8 | 08-571 480 27 |
| Eva Abrahamsson | Myrvägen 2 | 08-570 290 91 |
| Eva Abrahamsson | Pilgatan 9 | 08-642 61 79 |
| Eva Abrahamsson | Nyängsvägen 39A | 08-530 356 44 |

Suppose the extended key is {Name, City, HomeAddress} and the following ILFDs (instance-level functional dependencies) hold:

- (E.HomeAddress = "Myskviksvägen 8") -> (E.City = "INGARÖ")
- (E.HomeAddress = "Myrvägen 2") -> (E.City = "INGARÖ")
- (E.HomeAddress = "Pilgatan 9") -> (E.City = "STOCKHOLM")
- (E.HomeAddress = "Nyängsvägen 39A") -> (E.City = "TULLINGE")

Construct the integrated table.

[1] Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson, Entity Identification in Database Integration, Proceedings of the Ninth International Conference on Data Engineering, p. 294-301, April 19-23, 1993

Answer

Integrated table

| Name | City | ZIP | PersonNr | HomeAddress | Telephone |
| --- | --- | --- | --- | --- | --- |
| Eva Aadde | INGARÖ | 13469 | 840126-1223 | Myskviksvägen 8 | 08-571 480 27 |
| Eva Abrahamsson | INGARÖ | 13463 | 861227-1227 | Myrvägen 2 | 08-570 290 91 |
| Eva Abrahamsson | STOCKHOLM | NULL | NULL | Pilgatan 9 | 08-642 61 79 |
| Eva Abrahamsson | TULLINGE | NULL | NULL | Nyängsvägen 39A | 08-530 356 44 |
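The integration step can be sketched in Python. This is a minimal illustration, not the procedure from [1]: it assumes the ILFDs are stored as a plain HomeAddress-to-City lookup, that two tuples refer to the same entity when they agree on the extended-key attributes known on both sides (Name and City), and that the output keeps one row per Table S tuple. The names `R`, `S`, `ILFD_CITY` and `integrate` are hypothetical.

```python
# Table R and Table S from the exercise, one dict per tuple.
R = [
    {"Name": "Eva Aadde", "City": "INGARÖ", "ZIP": "13469", "PersonNr": "840126-1223"},
    {"Name": "Eva Aalto", "City": "Norsborg", "ZIP": "14564", "PersonNr": "851201-1225"},
    {"Name": "Eva Abrahamsson", "City": "INGARÖ", "ZIP": "13463", "PersonNr": "861227-1227"},
]
S = [
    {"Name": "Eva Aadde", "HomeAddress": "Myskviksvägen 8", "Telephone": "08-571 480 27"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Myrvägen 2", "Telephone": "08-570 290 91"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Pilgatan 9", "Telephone": "08-642 61 79"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Nyängsvägen 39A", "Telephone": "08-530 356 44"},
]

# ILFDs from the exercise, encoded (as an assumption) as HomeAddress -> City.
ILFD_CITY = {
    "Myskviksvägen 8": "INGARÖ",
    "Myrvägen 2": "INGARÖ",
    "Pilgatan 9": "STOCKHOLM",
    "Nyängsvägen 39A": "TULLINGE",
}


def integrate(r_rows, s_rows, ilfd_city):
    """Extend each S tuple with the City derived from the ILFDs, then
    match it against R on (Name, City). Unmatched S tuples keep None
    (NULL) for the attributes that only R provides."""
    r_index = {(r["Name"], r["City"]): r for r in r_rows}
    integrated = []
    for s in s_rows:
        city = ilfd_city.get(s["HomeAddress"])  # None if no ILFD applies
        match = r_index.get((s["Name"], city), {})
        integrated.append({
            "Name": s["Name"],
            "City": city,
            "ZIP": match.get("ZIP"),
            "PersonNr": match.get("PersonNr"),
            "HomeAddress": s["HomeAddress"],
            "Telephone": s["Telephone"],
        })
    return integrated


for row in integrate(R, S, ILFD_CITY):
    print(row)
```

Because the sketch iterates over Table S only, R tuples without a partner (Eva Aalto here) do not appear in its output; a full outer join would add them with NULL HomeAddress and Telephone.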

Exercise 2. Use Priority Queue to do Duplicate Detection[2]

1. Table R, which is already sorted according to an application-specific key, contains the tuples T1, T2, T3, T4, T5, T6, T7.
2. Similarities between tuples (columns T1 to T7; each row lists its non-empty entries in column order):
   - T1: 0.6 0.1 0.3 0.5 0.1 0.2
   - T2: 0.6 0.2 0.4 0.2
   - T3: 0.1 0.2 0.9 0.4 0.6 0.5
   - T4: 0.3 0.4 0.9 0.4 0.6
   - T5: 0.5 0.4 0.8
   - T6: 0.1 0.4 0.6 0.4
   - T7: 0.2 0.5 0.6 0.8 0.4

Given the conditions below, use the Priority Queue algorithm to find the duplicate clusters in R.

3. Method for computing the matching score: given a cluster, the matching score of a tuple is the average of the tuple's similarities with all of the cluster's representatives.
4. Condition for declaring a new cluster: matching score < 0.5
5. Condition for declaring a representative: 0.5 < matching score < 0.8
6. Size of the priority queue: 2

[2] A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. ACM-SIGMOD Workshop Research Issues on Knowledge Discovery and Data Mining, 1997
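The matching score defined above can be written as a formula. The symbols are assumptions introduced here for illustration: $t$ is the tuple being placed, $C$ a cluster in the queue, $\mathrm{Rep}(C)$ the set of representatives of $C$, and $\mathrm{sim}$ the pairwise similarity from the table.

```latex
\mathrm{score}(t, C) \;=\; \frac{1}{\lvert \mathrm{Rep}(C) \rvert}
    \sum_{r \in \mathrm{Rep}(C)} \mathrm{sim}(t, r)
```

For example, with $\mathrm{Rep}(C) = \{T1, T2\}$, $\mathrm{sim}(T3, T1) = 0.1$ and $\mathrm{sim}(T3, T2) = 0.2$, the score of T3 is $(0.1 + 0.2)/2 = 0.15$.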

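Answer

- Record 1: starts the first cluster. Queue {1}
- Record 2: 2:1 = 0.6, which is > 0.5 and < 0.8, so record 2 joins the cluster as a representative. Queue {1, 2}
- Record 3: 3:1 = 0.1, 3:2 = 0.2; matching score = (0.1 + 0.2) / 2 = 0.15 < 0.5, so record 3 starts a new cluster. Queue {3} {1, 2}
- Record 4: 4:1 = 0.3, 4:2 = 0.4; matching score = (0.3 + 0.4) / 2 = 0.35 < 0.5. 4:3 = 0.9, which is > 0.5 and > 0.8, so record 4 joins cluster {3} but is not a representative. Queue {3, 4} {1, 2}
- Record 5: 5:1 = 0.5, 5:2 = 0.4; matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5. 5:3 = 0.4; matching score = 0.4 < 0.5, so record 5 starts a new cluster. Queue {5} {3, 4} {1, 2}
- Record 6: 6:3 = 0.6; matching score = 0.6, which is > 0.5 and < 0.8. 6:5 = 0.4 < 0.5, so record 6 joins cluster {3, 4} as a representative. Queue {3, 4, 6} {5} {1, 2}
- Record 7: 7:3 = 0.5, 7:6 = 0.4; matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5. 7:5 = 0.8 > 0.5, so record 7 joins cluster {5}. Queue {5, 7} {3, 4, 6} {1, 2}

The resulting duplicate clusters are {1, 2}, {3, 4, 6} and {5, 7}.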
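The walkthrough above can be replayed with a short Python sketch. It is a simplified illustration under stated assumptions, not the full algorithm of [2]: the similarity values are the pairs quoted in the answer, the queue is scanned front to back and the first cluster whose score reaches 0.5 wins, a score of exactly 0.5 or 0.8 joins the cluster without creating a representative, and clusters pushed past the queue size are kept in the result but no longer compared. The names `SIM`, `sim` and `detect_clusters` are hypothetical.

```python
# Pairwise similarities quoted in the answer, keyed by (smaller, larger) record.
SIM = {
    (1, 2): 0.6, (1, 3): 0.1, (2, 3): 0.2, (1, 4): 0.3, (2, 4): 0.4,
    (3, 4): 0.9, (1, 5): 0.5, (2, 5): 0.4, (3, 5): 0.4, (3, 6): 0.6,
    (5, 6): 0.4, (3, 7): 0.5, (6, 7): 0.4, (5, 7): 0.8,
}


def sim(a, b):
    """Symmetric similarity lookup."""
    return SIM[(min(a, b), max(a, b))]


def detect_clusters(records, queue_size=2, low=0.5, high=0.8):
    queue = []     # clusters still eligible for comparison, most recent first
    clusters = []  # every cluster ever created, in creation order
    for rec in records:
        target = None
        score = None
        for cluster in queue:
            reps = cluster["reps"]
            # Matching score: average similarity to the cluster's representatives.
            score = sum(sim(rec, r) for r in reps) / len(reps)
            if score >= low:
                target = cluster
                break
        if target is None:
            # No cluster in the queue matches: start a new one,
            # with the record as its first representative.
            target = {"members": [rec], "reps": [rec]}
            clusters.append(target)
        else:
            target["members"].append(rec)
            if low < score < high:
                target["reps"].append(rec)
            queue = [c for c in queue if c is not target]
        queue.insert(0, target)  # the touched cluster gets highest priority
        del queue[queue_size:]   # clusters beyond the queue size drop out
    return [c["members"] for c in clusters]


print(detect_clusters([1, 2, 3, 4, 5, 6, 7]))
```

With the exercise's settings (queue size 2, thresholds 0.5 and 0.8) the sketch yields the clusters {1, 2}, {3, 4, 6} and {5, 7}. The scan order differs slightly from the walkthrough, which for record 4 checks cluster {1, 2} before cluster {3}, but the outcome is the same here because each record matches at most one cluster.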
