Published by Wyatt Fowles. Modified over 2 years ago.

1
Extending Relational Database Functionality with Data Inconsistency Resolution Support
Ilya Pevzner (pevzner@cs.nyu.edu), Arthur Goldberg (artg@cs.nyu.edu)
Department of Computer Science, Courant Institute, New York University

2
VLDB-2003 Ph.D. Workshop. Ilya Pevzner, Arthur Goldberg
Inconsistency
• Databases often contain information about real-world objects
• When the data is collected (or measured) and entered in the database, errors are introduced
• When the same object is measured more than once, inconsistent data values may result

3
Object Identification
• Identification of records describing the same real-world object
• If key values are inconsistent, object identification is not trivial and its results are uncertain
• Also known as approximate matching, duplicate detection, and record linkage
• An area with multiple successful techniques; topic of a KDD-2003 workshop

4
Inconsistency Resolution Problem
• Given what is known about the world, find the “best” estimates for values of the inconsistent attributes
• Possible sources of the knowledge about the world:
  a) The system designer or expert
  b) The end user
  c) The data
• Inconsistency resolution is also called merging
• Existing research is almost exclusively on a) and b)
  – No systematic techniques
• Our work concentrates on c)

5
Matching/Merging Example
  – Match using ID (trivial)
  – Merge using standardization

6
Merging Uncertainty
• Sometimes it is possible, but non-trivial, to tell which attribute value is best
• In other cases, the answer is uncertain

7
Research goals
• Develop merging methodologies that rely on the analysis of the data
• Extend relational databases with
  – An integrated model for representing matching and merging uncertainties
  – Integrated support for various matching and merging methodologies

8
Uncertainty in Relational Databases
• Semantics of nulls
  – E.g. J. Biskup. A foundation of Codd’s relational maybe-operations. ACM TODS, 8(4), December 1983.
• Fuzzy databases
  – E.g. K. V. S. V. N. Raju and Arun K. Majumdar. Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM TODS, 13(2), June 1988.
• Probabilistic relations
  – E.g. E. Zimanyi and A. Pirotte. Imperfect Information in Relational Databases. In Uncertainty Management in Information Systems, A. Motro and P. Smets, Eds., Kluwer Publ., 1997.

9
Probabilistic relations overview
• Probabilistic relations model uncertainty with truth probabilities added to classic relations
  – E.g. tuple X is in the relation with probability P[X]
• Each probabilistic relation is associated with a set of classic relations representing “possible worlds” where the collection of outcomes for each probabilistic choice is fixed
  – E.g. the probabilistic relation with the probabilistic choice in the above example will have two possible worlds: one with tuple X and one without
• Relational operations are defined through the associated classic relations
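The possible-worlds reading in the overview above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes that tuples are independent of each other (the slides do not say so), and the names `possible_worlds` and `probability` and the sample probabilities are hypothetical.

```python
from itertools import product


def possible_worlds(prob_relation):
    """Enumerate the possible worlds of a probabilistic relation.

    prob_relation maps each tuple to the probability that it is in the relation.
    Assumption (not from the slides): tuples are independent of each other.
    Returns a list of (world, probability), where world is a frozenset of tuples.
    """
    tuples = list(prob_relation)
    worlds = []
    for choices in product([True, False], repeat=len(tuples)):
        # Fix one outcome per probabilistic choice: each tuple is in or out.
        world = frozenset(t for t, keep in zip(tuples, choices) if keep)
        p = 1.0
        for t, keep in zip(tuples, choices):
            p *= prob_relation[t] if keep else 1.0 - prob_relation[t]
        worlds.append((world, p))
    return worlds


def probability(worlds, holds):
    """Probability that a condition on a classic relation holds: sum over worlds."""
    return sum(p for world, p in worlds if holds(world))


# Hypothetical one-tuple relation: tuple X with P[X] = 0.7, giving two worlds.
single = possible_worlds({("X",): 0.7})

# Hypothetical two-tuple relation, giving four possible worlds.
double = possible_worlds({("X",): 0.7, ("Y",): 0.5})
p_nonempty = probability(double, lambda world: len(world) > 0)
```

Evaluating a condition world by world, as `probability` does, mirrors the idea that relational operations are defined through the associated classic relations. The number of worlds doubles with every uncertain tuple, so explicit enumeration is useful only for small examples.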

10
Zimanyi’s Type-1 Probabilistic Relation
• Definition
  – A type-1 probabilistic relation is a relation R with a supplementary attribute w(R, t) added to each tuple t, indicating the probability that t belongs to R

11
Zimanyi’s Type-1 Probabilistic Relation Example
• Probabilistic relation
• Possible worlds (assuming unique(ID1) and unique(ID2)):

12
Probabilistic matching
• Example: matching by name
• The way w(R, t) is computed depends on the matching methodology
  – An example of such a methodology is ChoiceMaker™
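To make the role of w(R, t) concrete, the sketch below pairs records from two sources and attaches a weight to each pair. It does not use ChoiceMaker: as a stand-in, the weight is the string-similarity ratio from Python's standard `difflib` module, which is a similarity score and not a calibrated probability. The relations, field layout, and function names are hypothetical.

```python
from difflib import SequenceMatcher


def name_match_weight(a, b):
    """Stand-in for w(R, t): similarity of two names in [0, 1].

    Assumption: a real matcher would return a match probability;
    this ratio is only a rough proxy for it.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_by_name(r1, r2):
    """Pair every record of r1 with every record of r2 and attach a weight.

    Records are (id, name) tuples. The result has the shape of a type-1
    probabilistic relation: each paired tuple carries its own weight.
    """
    return [(t1, t2, name_match_weight(t1[1], t2[1])) for t1 in r1 for t2 in r2]


# Hypothetical source relations.
R1 = [(1, "Jim Smith")]
R2 = [(10, "Jimmy Smith"), (11, "Mary Jones")]
pairs = match_by_name(R1, R2)
```

Swapping `name_match_weight` for a different scoring function changes the weights but not the shape of the result, which is the point the slide makes about w(R, t) depending on the matching methodology.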

13
Zimanyi’s Type-2 Probabilistic Relation
• Definition
  – Generalized relation in which attribute values can be probabilistic sets

14
Zimanyi’s Type-2 Probabilistic Relation Example
• Probabilistic relation
• Possible worlds
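The example tables for this slide did not survive extraction, so the following is a hypothetical illustration rather than a reconstruction. It represents one type-2 tuple whose attribute values are probabilistic sets (value-to-probability maps) and expands it into possible worlds. It assumes that the probabilities of each attribute sum to 1 and that attributes vary independently; neither assumption comes from the slides.

```python
from itertools import product


def tuple_worlds(prob_tuple):
    """Expand one type-2 tuple into its possible classic tuples.

    prob_tuple maps attribute name -> {value: probability}.
    Assumptions: per-attribute probabilities sum to 1, attributes are independent.
    Returns a list of (classic_tuple_as_dict, probability).
    """
    attrs = list(prob_tuple)
    alternatives = [list(prob_tuple[a].items()) for a in attrs]
    worlds = []
    for combo in product(*alternatives):
        # combo holds one (value, probability) choice per attribute.
        values = {a: v for a, (v, _) in zip(attrs, combo)}
        p = 1.0
        for _, w in combo:
            p *= w
        worlds.append((values, p))
    return worlds


# Hypothetical tuple: the SSN is certain, the NAME is a probabilistic set.
person = {"SSN": {"001": 1.0}, "NAME": {"Jim": 0.6, "James": 0.4}}
person_worlds = tuple_worlds(person)
```

Here the single type-2 tuple yields two classic tuples, one per candidate name, each weighted by the probability of that name.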

15
Probabilistic Merging Example
• Data sources
• Query:
  – List all people with their correct name and social security number
• Execution plan:
  – Join using SSN (UID)
  – Merge names

16
Probabilistic Merging Example: Result
[Figure label: MERGE]

17
Merging Methodologies
• Ad-hoc techniques
  – Standardization
    • E.g. convert both Jim and Jimmy to James
  – Pre-defined rules
    • E.g. use gender to pick Andrea and not Andrew
• Machine learning
  – Supervised (e.g. MaxEnt)
    • Use experts to manually merge some data, use it to train and validate
  – Unsupervised (e.g. dependency-based)
    • E.g. mine data for dependencies, use dependencies to pick the best estimates
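The two ad-hoc techniques listed above can be sketched as follows. The lookup tables and function names are hypothetical and exist only for illustration; a real system would need far larger, curated tables. The machine-learning approaches are not sketched here.

```python
# Hypothetical nickname table used for standardization.
NICKNAMES = {"jim": "James", "jimmy": "James"}


def standardize(name):
    """Map a known nickname to its standard form; leave other names unchanged."""
    cleaned = name.strip()
    return NICKNAMES.get(cleaned.lower(), cleaned)


def merge_by_standardization(names):
    """Return the common standardized value, or None if the inputs still disagree."""
    canonical = {standardize(n) for n in names}
    return canonical.pop() if len(canonical) == 1 else None


def merge_by_gender_rule(names, gender, name_gender):
    """Pre-defined rule: keep the candidate whose known gender matches the record.

    name_gender is a hypothetical lookup from first name to "F" or "M".
    Returns None when the rule does not single out exactly one candidate.
    """
    candidates = [n for n in names if name_gender.get(n) == gender]
    return candidates[0] if len(candidates) == 1 else None


merged_name = merge_by_standardization(["Jim", "Jimmy"])
picked_name = merge_by_gender_rule(
    ["Andrea", "Andrew"], "F", {"Andrea": "F", "Andrew": "M"}
)
```

Both functions return None when they cannot decide, which leaves room for another methodology, or for an explicit representation of the remaining uncertainty, to take over.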

18
SQL Extensions
• The MATCH predicate
  – Uses a specified matching methodology to determine if specified tuples describe the same object
• The MERGE function
  – Uses a specified merging methodology to provide estimates for values of specified attributes
• The PROB function
  – Provides access to probabilities in type-1 and type-2 probabilistic relations

19
The MATCH Predicate
• Can be used in the WHERE clause of a SELECT statement
• Takes the name of the matcher module and the tuples to be tested
• Returns true if the tuples match with probability exceeding the matcher threshold; otherwise, returns false
• SELECT statements with MATCH produce type-1 probabilistic relations

20
MATCH Example
• Data source relations
• Query
    SELECT S1.NAME, S1.SSN, S2.PHONE
    FROM S1, S2
    WHERE MATCH('NAME_MATCHER', S1.NAME, S2.NAME)
• Result
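The behaviour of the MATCH query above can be imitated outside the database with a short Python sketch. Everything specific here is an assumption: the source tables were lost in extraction, so the rows are invented; the matcher registry, the `difflib` similarity score, and the 0.8 threshold are hypothetical stand-ins for a real matcher module.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Hypothetical scoring function; a real matcher would return a probability."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical registry: matcher name -> (scoring function, matcher threshold).
MATCHERS = {"NAME_MATCHER": (name_similarity, 0.8)}


def select_with_match(matcher_name, s1, s2):
    """Emulate SELECT ... FROM S1, S2 WHERE MATCH(matcher, S1.NAME, S2.NAME)."""
    score, threshold = MATCHERS[matcher_name]
    result = []
    for r1 in s1:
        for r2 in s2:
            w = score(r1["NAME"], r2["NAME"])
            if w > threshold:  # MATCH is true only above the threshold
                # Keep w so the result is a type-1 probabilistic relation.
                result.append(
                    {"NAME": r1["NAME"], "SSN": r1["SSN"], "PHONE": r2["PHONE"], "w": w}
                )
    return result


# Invented sample data.
S1 = [{"NAME": "Jim Smith", "SSN": "001"}, {"NAME": "Mary Jones", "SSN": "002"}]
S2 = [
    {"NAME": "Jimmy Smith", "PHONE": "555-0100"},
    {"NAME": "Mary Jones", "PHONE": "555-0101"},
]
matched = select_with_match("NAME_MATCHER", S1, S2)
```

The cross product is filtered by the threshold, and each surviving row keeps its weight, so downstream operators can still reason about how certain the match was.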

21
The MERGE function
• May appear in the SELECT list
• Accepts two parameters
  – Merger name
  – Merge list
• Returns a table of the form (v, w_f) where v is a value and w_f is the corresponding probability
• SELECT statements with MERGE produce type-2 probabilistic relations

22
MERGE Example
• Data sources
• Query
    SELECT S1.SSN, MERGE('NAME_MERGER', (S1.NAME, S2.NAME)) AS NAME
    FROM S1, S2
    WHERE S1.SSN = S2.SSN
• Result
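A rough Python emulation of the MERGE query above is given below. The original data and result tables are missing, so the rows are invented. The merger is a deliberately naive stand-in that weights each candidate value by its relative frequency in the merge list; it is not the authors' methodology. The sketch also assumes SSN is unique within S2.

```python
from collections import Counter


def frequency_merger(values):
    """Naive stand-in merger: return (v, w_f) pairs weighted by relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, c / total) for v, c in counts.items()]


# Hypothetical registry: merger name -> merging function.
MERGERS = {"NAME_MERGER": frequency_merger}


def select_with_merge(merger_name, s1, s2):
    """Emulate SELECT S1.SSN, MERGE(merger, (S1.NAME, S2.NAME)) AS NAME
    FROM S1, S2 WHERE S1.SSN = S2.SSN."""
    merge = MERGERS[merger_name]
    by_ssn = {r["SSN"]: r for r in s2}  # assumes SSN is unique in S2
    return [
        {"SSN": r1["SSN"], "NAME": merge([r1["NAME"], by_ssn[r1["SSN"]]["NAME"]])}
        for r1 in s1
        if r1["SSN"] in by_ssn
    ]


# Invented sample data.
S1 = [{"SSN": "001", "NAME": "Jim"}, {"SSN": "002", "NAME": "Mary"}]
S2 = [{"SSN": "001", "NAME": "James"}, {"SSN": "002", "NAME": "Mary"}]
merged = select_with_merge("NAME_MERGER", S1, S2)
```

Each NAME cell in the result is a small table of (v, w_f) pairs, which is what makes the output a type-2 probabilistic relation: agreeing sources collapse to one certain value, while disagreeing sources keep every candidate with its weight.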

23
Query Processing Diagram

24
Interfaces

25
Validating with real-world data
• MEDLINE data set
  – Affiliation fields:
    • E-mail, Organization, Address
  – Statistics:
    • 2,391,822 affiliations
    • 523,140 matched by e-mail address
    • 182,892 with US addresses
    • 32,505 non-identical duplicates
• Looking for other interesting data sets
  – Errors
  – Dependencies
  – Duplicates
  – More distinct items
  – More fields

26
Future plans
• Consider several data sets
• Develop several merging methodologies
• Evaluate using real data, looking at
  – Performance
  – Merge quality
  – Usability

27
Questions?
