Measuring Referential Integrity in Distributed Databases. Dhara Shah.


1 Measuring Referential Integrity in Distributed Databases. Dhara Shah

2 Introduction Distributed database: multiple databases residing at different locations that communicate over the Internet. Referential integrity can be violated when similar content arrives from different sources. Goal: identify referential integrity problems in order to detect and avoid inconsistency or incompleteness. A promising alternative for detecting and fixing data quality issues in scientific databases.

3 Assumptions Every database has the same tables, but with different content. Rows may have null values for the primary key. Metadata has been integrated beforehand. Content may be inconsistent due to both local and global issues. Updates are broadcast independently and asynchronously.

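4 Column Metrics
Metrics are measured on a scale of [0, 1], with 1 being optimal.
$\mathrm{lrcom}(T_i.K) = |T_i \bowtie_K T_j| \,/\, |T_i|$
$\mathrm{grcom}(T_i.K) = |T_i \bowtie_K T_j| \,/\, |T_i|$
$\mathrm{lrcon}(T_i.F) = |T_i \bowtie_{K,F} T_j| \,/\, |T_i|$
$\mathrm{grcon}(T_i.K, T_i.F) = |T_i \bowtie_{K,F} T_j| \,/\, |T_i|$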
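The column metrics can be illustrated with a small Python sketch. It assumes the numerator counts the rows of Ti that join with the referenced table Tj on K (completeness) or on both K and F (consistency), that K is unique in Tj, and that rows with a null key never join. The function names, the in-memory data layout, and the choice to score an empty table as 1.0 are all assumptions made for illustration; they are not taken from the slides.

```python
from typing import Dict, Hashable, Iterable, Optional, Sequence, Tuple


def completeness(fk_values: Sequence[Optional[Hashable]],
                 referenced_keys: Iterable[Hashable]) -> float:
    """Fraction of referencing rows whose key K matches a key in the referenced table."""
    if not fk_values:
        return 1.0  # assumption: an empty table has no referential errors
    keys = set(referenced_keys)
    # A null (None) key never joins, so it counts as an error.
    matched = sum(1 for k in fk_values if k is not None and k in keys)
    return matched / len(fk_values)


def consistency(rows: Sequence[Tuple[Optional[Hashable], Hashable]],
                referenced: Dict[Hashable, Hashable]) -> float:
    """Fraction of referencing rows (k, f) that agree with the referenced table on both K and F."""
    if not rows:
        return 1.0  # assumption, as above
    matched = sum(
        1 for k, f in rows
        if k is not None and k in referenced and referenced[k] == f
    )
    return matched / len(rows)


# Hypothetical data: the referenced table maps key K to the value of column F.
referenced_table = {1: "a", 2: "b", 3: "c"}
referencing_rows = [(1, "a"), (2, "x"), (None, "c"), (7, "d")]

com = completeness([k for k, _ in referencing_rows], referenced_table.keys())
con = consistency(referencing_rows, referenced_table)
print(com, con)  # 0.5 0.25
```

Two of the four keys match, so completeness is 0.5; only the first row also agrees on F, so consistency is 0.25.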

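5 Table Metrics
$\mathrm{gcur}(T_i) = |D_1.T_i \cap D_2.T_i \cap \cdots \cap D_n.T_i| \,/\, |D_1.T_i \cup D_2.T_i \cup \cdots \cup D_n.T_i|$
$\mathrm{grcom}(T_i) = \sum_{j=1}^{k} |T_i|\,\mathrm{grcom}(T_i.K_j) \,/\, (k\,|T_i|)$
$\mathrm{grcon}(T_i) = \sum_{j=1}^{f} |T_i|\,\mathrm{grcon}(T_i.F_j) \,/\, (f\,|T_i|)$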
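A minimal Python sketch of the table metrics, assuming each database's copy of a table is available as a set of row tuples. gcur is the size of the intersection of all copies divided by the size of their union. In the grcom and grcon table formulas the factor |Ti| appears in both numerator and denominator, so each reduces to the plain average of the column metrics. The function names and sample rows are hypothetical.

```python
from typing import Hashable, Iterable, List, Sequence, Set


def gcur(copies: Sequence[Iterable[Hashable]]) -> float:
    """Rows present in every copy of the table, divided by rows present in any copy."""
    sets: List[Set[Hashable]] = [set(c) for c in copies]
    if not sets:
        raise ValueError("at least one copy of the table is required")
    union = set().union(*sets)
    if not union:
        return 1.0  # assumption: copies that are all empty count as fully current
    common = set.intersection(*sets)
    return len(common) / len(union)


def table_metric(column_scores: Sequence[float]) -> float:
    """Table-level grcom or grcon: the average of the column-level scores."""
    if not column_scores:
        raise ValueError("at least one column score is required")
    return sum(column_scores) / len(column_scores)


# Hypothetical copies of the same table held at three sites.
d1 = {(1, "a"), (2, "b"), (3, "c")}
d2 = {(1, "a"), (2, "b")}
d3 = {(1, "a"), (2, "x")}

print(gcur([d1, d2, d3]))        # 0.25: one row is shared, four distinct rows exist
print(table_metric([1.0, 0.5]))  # 0.75
```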

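6 Database Metrics
$\mathrm{lrcom}(D_i) = \sum_{j=1}^{m} |T_j|\,\mathrm{lrcom}(T_j) \,/\, \sum_{j} |T_j|$
$\mathrm{lrcon}(D_i) = \sum_{j=1}^{m} |T_j|\,\mathrm{lrcon}(T_j) \,/\, \sum_{j} |T_j|$
$\mathrm{grcom}(D) = \sum_{j=1}^{m} |T_j|\,\mathrm{grcom}(T_j) \,/\, \sum_{j} |T_j|$
$\mathrm{grcon}(D) = \sum_{j=1}^{m} |T_j|\,\mathrm{grcon}(T_j) \,/\, \sum_{j} |T_j|$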
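All four database metrics share one shape: an average of the table-level metric weighted by table size, so large tables influence the result more than small ones. The sketch below shows that computation in Python; the function name, the input layout, and the value returned for an empty database are assumptions for illustration.

```python
from typing import Sequence, Tuple


def database_metric(tables: Sequence[Tuple[int, float]]) -> float:
    """Size-weighted average of table metrics; `tables` holds (row_count, metric) pairs."""
    total_rows = sum(n for n, _ in tables)
    if total_rows == 0:
        return 1.0  # assumption: a database with no rows has no referential errors
    return sum(n * score for n, score in tables) / total_rows


# Hypothetical database: a 100-row table scoring 0.9 and a 300-row table scoring 0.5.
score = database_metric([(100, 0.9), (300, 0.5)])
print(round(score, 3))  # 0.6
```

The unweighted mean of 0.9 and 0.5 would be 0.7; weighting by row count pulls the result toward the larger table's score.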

7 Query Optimization
Local metrics in a single database:
- Aggregations grouping by FK before joins for tables with several FKs.
- Creating a secondary index on each FK.
Global metrics in a distributed database:
- Transfer n-1 copies to a central site.
- Compute metrics at one site and then incrementally update.
- Compute metrics for each pair of tables linked by a FK.
- Transfer the smallest table when a join is required for two tables at different sites.
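The two local optimizations can be sketched with Python's built-in sqlite3 module: the referencing table is first aggregated by its foreign key, so the join touches one row per distinct key value instead of one row per referencing row, and a secondary index is created on the foreign key. The schema, table names, and sample rows are hypothetical, and this is only one way such a query might be written, not the authors' implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (city_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, city_id INTEGER);
-- Secondary index on the foreign key column.
CREATE INDEX sale_city_idx ON sale(city_id);
INSERT INTO city VALUES (1, 'A'), (2, 'B');
-- city_id 9 does not exist in city, and one sale has a NULL foreign key.
INSERT INTO sale VALUES (1, 1), (2, 1), (3, 2), (4, 9), (5, NULL);
""")

# Group by the foreign key first, then join the (smaller) grouped result.
# The NULL group and the group for key 9 find no match, so they drop out of the join.
QUERY = """
SELECT CAST(COALESCE(SUM(g.cnt), 0) AS REAL) / (SELECT COUNT(*) FROM sale)
FROM (SELECT city_id, COUNT(*) AS cnt FROM sale GROUP BY city_id) AS g
JOIN city AS c ON c.city_id = g.city_id
"""

lrcom = conn.execute(QUERY).fetchone()[0]
print(lrcom)  # 0.6: three of the five sales reference an existing city
```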

8 Applications
Applications with scientific databases:
- Central database: needs a fast connection and must be available at all times.
- Local database: flexible and faster, but may have more referential errors.
Program:
- Uses a logical data model (LDM) to calculate metrics.
- Has a graphical user interface with a list that explains why errors happened.

9 Conclusion
Related work:
- MOCHA: middleware system to integrate distributed data sources.
Metrics that measure absolute and relative error with respect to referential integrity. They measure completeness and consistency. They raise new issues such as distributed query optimization.

10 Citation
Authors: Carlos Ordonez, Javier Garcia-Garcia, Zhibo Chen
Title: Measuring Referential Integrity in Distributed Databases
Name of Journal: CIMS '07 Proceedings of the ACM first workshop on CyberInfrastructure: information management in eScience
Publication Date: November 2007
Page Range: 61-66
