Rule-Based Method for Entity Resolution IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING JANUARY 2015.


1 Rule-Based Method for Entity Resolution IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING JANUARY 2015

2 INTRODUCTION In many applications, a real-world entity may appear in multiple data sources, so the same entity may have quite different descriptions. For example, there are several ways to represent a person's name or a mailing address. Thus, it is necessary to identify the records that refer to the same real-world entity, a task called Entity Resolution (ER). ER is one of the most important problems in data cleaning and arises in many applications, such as information integration and information retrieval. Because of its importance, it has attracted much attention in the literature.

3 Traditional ER approaches
• Similarity comparison among records.
• Can't identify records correctly in some cases.

4 Observation: both the existence and the nonexistence of some attribute-value pairs are useful for identifying records.
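The observation can be illustrated with a small sketch. The slides' own syntax and semantics for ER-rules are not preserved in this transcript, so the `Clause` and `ERRule` classes, the attribute names, and the sample values below are all hypothetical; the sketch only assumes that a rule is a conjunction of clauses, each requiring an attribute-value pair to be present or absent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    """One attribute-value test (hypothetical representation)."""
    attribute: str
    value: str
    negated: bool = False  # True means the pair must NOT appear in the record

    def holds(self, record: dict) -> bool:
        present = self.value in record.get(self.attribute, set())
        # A positive clause needs the pair present; a negated one needs it absent.
        return present != self.negated


@dataclass(frozen=True)
class ERRule:
    """A conjunction of clauses that points to one entity (assumed form)."""
    clauses: tuple
    entity: str

    def matches(self, record: dict) -> bool:
        return all(clause.holds(record) for clause in self.clauses)


# Illustrative rule: name is "wei wang" AND "lin" is not a coauthor => entity e1.
rule = ERRule(
    clauses=(
        Clause("name", "wei wang"),
        Clause("coauthor", "lin", negated=True),
    ),
    entity="e1",
)

r1 = {"name": {"wei wang"}, "coauthor": {"zhang"}}
r2 = {"name": {"wei wang"}, "coauthor": {"lin"}}

print(rule.matches(r1))  # True: the name is present and "lin" is absent
print(rule.matches(r2))  # False: the negated clause is violated
```

Both records carry the same name, so a name comparison alone could not separate them; the absent coauthor is what tells them apart here.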

5 Contribution

6 syntax

7 semantics

8 Properties of ER-Rule Set

9 Algorithm
• Rule Discovery (DiscR): obtains rules from a training data set.
• Rule-based entity resolution (R-ER): determines which entity each record in a new data set refers to.
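A minimal sketch of the second phase, under stated assumptions. It takes a rule set as given (as if DiscR had already produced it) and assigns a new record to an entity. The tuple layout, the `satisfies` and `resolve` helpers, the weights, and the choice to sum the weights of matching rules per entity are all assumptions made for illustration; the transcript does not preserve the weight definition or the actual R-ER procedure.

```python
from collections import defaultdict

# A rule is (positive pairs, negative pairs, entity, weight).
# This layout and the weights are placeholders, not taken from the slides.


def satisfies(record, rule):
    """True if every positive pair is present and no negative pair is."""
    positives, negatives, _entity, _weight = rule
    return positives <= record and not (negatives & record)


def resolve(record, rules):
    """Return the entity with the largest total weight of matching rules.

    Summing weights is an assumed scoring scheme for this sketch.
    Returns None when no rule matches.
    """
    scores = defaultdict(float)
    for rule in rules:
        if satisfies(record, rule):
            scores[rule[2]] += rule[3]
    if not scores:
        return None
    return max(scores, key=scores.get)


rules = [
    (frozenset({("name", "wei wang")}),
     frozenset({("coauthor", "lin")}), "e1", 0.6),
    (frozenset({("name", "wei wang"), ("coauthor", "lin")}),
     frozenset(), "e2", 0.9),
]

new_record = {
    ("name", "wei wang"),
    ("coauthor", "lin"),
    ("title", "data cleaning"),
}

print(resolve(new_record, rules))
```

Here only the second rule fires, because the first rule's negative pair appears in the record. A record that matches no rule is left unassigned rather than forced onto an entity.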

10 Rule Discovery: several definitions before the algorithm

11 Rule Discovery

12 Rule requirements

13

14

15 Gen-PR

16

17 Gen-SingleNR First step:

18 Second step:

19 Rule-based entity resolution. We define the weight of each ER-rule r as:

20

21

22

23

24 Rule update
• Invalid rules
• Useless rules

25 Evaluation
• The effectiveness of our rule learning algorithm (DiscR) and our rule-based ER approach
• The impact of training data size on ER accuracy and the number of generated rules
• The impact of rule length threshold on ER accuracy
• The scalability of DiscR and R-ER with the size of data

26 Algorithms compared with: GHOST and CFR

27

28 Summary
• DiscR and R-ER can achieve high accuracy using a small amount of training data.
• Updating rules indeed helps identify records.
• The number of generated rules scales well with the training data size on both data sets.
• Rules with length larger than 2 are seldom needed to identify records.
• Both DiscR and R-ER scale well with the size of data.

29 Thank you!

