Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Cleaning and Transformation – Instance conflict resolution Helena Galhardas DEI IST (based on the slides: “A Survey of Data Quality Issues in Cooperative.

Similar presentations


Presentation on theme: "Data Cleaning and Transformation – Instance conflict resolution Helena Galhardas DEI IST (based on the slides: “A Survey of Data Quality Issues in Cooperative."— Presentation transcript:

1 Data Cleaning and Transformation – Instance conflict resolution Helena Galhardas DEI IST (based on the slides: “A Survey of Data Quality Issues in Cooperative Information Systems”, Carlo Batini, Tiziana Catarci, Monica Scannapieco, 23rd International Conference on Conceptual Modelling (ER 2004))

2 Instance conflicts resolution & merging Occur due to poor data quality  Errors exist when collecting or entering data  Sources are not updated Three types:  representation conflicts, e.g. dollar vs. euro  key equivalence conflicts, i.e. same real world objects with different identifiers  attribute value conflicts, i.e. instances corresponding to same real world objects and sharing an equivalent key, differ on other attributes

3 Example EmpIdNameSurnameSalary eugi it EmpIdNameSurnameSalary eugi Key conflicts Attribute conflicts

4 Techniques At design time or query time, even if actual conflicts occur at query time Key conflicts require object identification techniques There are several techniques for dealing with attribute conflicts

5 General approach for solving attribute- level conflicts Declaratively specify how to deal with conflicts:  Set of conflict resolution functions E.g., count(), min(), max(), random(), etc  Set of strategies to deal with conflicts Correspond to diff. Tolerance degrees  Query model that takes into account posible conflicts directly Use SQL directly or extensions to it.

6 Conflict resolution and result merging: list of proposals SQL-based conflict resolution (Naumann et al 2002) Aurora (Ozsu et al 1999) Fan et al 2001 FraSQL-based conflict resolution (Sattler et al 2003) FusionPlex (Motro et al 2004) OOra (Lim et al 1998) DaQuinCIS (Scannapieco 2004) (Wang et al 2000) ConQuer (Fuxman 2005) (Dayal 1985)

7 Conflict resolution and result merging: list of proposals SQL-based conflict resolution (Naumann et al 2002) Aurora (Ozsu et al 1999) Fan et al 2001 FraSQL-based conflict resolution (Sattler et al 2003) FusionPlex (Motro et al 2004) OOra (Lim et al 1998) DaQuinCIS (Scannapieco 2004) (Wang et al 2000) ConQuer (Fuxman 2005) (Dayal 1985)

8 Declarative (SQL-based) data merging with conflicts resolution [Naumann 2002] Problem: Merging data in an integrated database or in a query against multiple sources Solution:  proposal of a set of resolution functions (e.g. MAX, MIN, RANDOM; GROUP, AVG, LONGEST)  declarative merging of relational data sources by common queries through: Grouping queries Join queries (divide et impera) nested joining (one at a time, hierarchical)

9 Declarative (SQL-based) data merging with conflicts resolution [Naumann 2002] Problem: Merging data in an integrated database or in a query against multiple sources Solution:  proposal of a set of resolution functions (e.g. MAX, MIN, RANDOM; GROUP, AVG, LONGEST)  declarative merging of relational data sources by common queries through: Grouping queries Join queries (divide et impera) nested joining (one at a time, hierarchical)

10 Grouping queries SELECT isbn, LONGEST(title), MIN(price) FROM books GROUP BY isbn If multiple sources involved with potential duplicates: SELECT books.isbn,LONGEST(books.title), MIN(books.price) FROM ( SELECT * FROM books1 UNION SELECT * FROM books2) AS books GROUP BY books.isbn Advantages: good performance Inconvenient: user-defined aggregate functions not supported by most RDBMS

11 Join queries (1) Dismantle a query result that merges multiple sources into different parts, resolve conflicts within each part, and union them to a final result. Partitions the union of the two sources into :  The intersection of the two: Query 1 SELECT books1.isbn, LONGEST(books1.title, books2.title), MIN(books1.price, books2.price) FROM books1, books2 WHERE books1.isbn = books2.isbn

12 Join queries (2) Query 2a SELECT isbn, title, price FROM books1 WHERE books1.isbn NOT IN (SELECT books1.isbn, FROM books1, books2 WHERE books1.isbn = books2.isbn) ) Advantages: Use of built-in scalar functions available by RDBMS Query 2b SELECT isbn, title, price FROM books1 WHERE books1.isbn NOT IN (SELECT books2.isbn, FROM books2 ) Inconvs: Length and complexity of queries  The part of books 1 that does not intersect with books2 (Query 2a or 2b) and vice-versa

13 Advantages and inconvenients Pros: the proposal considers the usage of SQL, thus exploiting capabilities of existing database systems, rather than proposing proprietary solutions Cons: limited applicability of conflict resolution functions that often need to be domain-specific

14 References “Data Quality: Concepts, Methodologies and Techniques”, C. Batini and M. Scannapieco, Springer-Verlag, “Declarative data merging with conflict resolution”, F. Naumann and M. Haeussler, IQ2002.


Download ppt "Data Cleaning and Transformation – Instance conflict resolution Helena Galhardas DEI IST (based on the slides: “A Survey of Data Quality Issues in Cooperative."

Similar presentations


Ads by Google