Presentation is loading. Please wait.

Presentation is loading. Please wait.

Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS Martin Theobald Jennifer Widom Stanford University.

Similar presentations


Presentation on theme: "Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS Martin Theobald Jennifer Widom Stanford University."— Presentation transcript:

1 Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS Martin Theobald Jennifer Widom Stanford University

2 2 Outline Motivation and Applications Working Model for ULDBs Trio-One System Architecture Efficient Confidence Computations

3 3 Motivation Many applications involve data that is uncertain (approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate,...) Many of the same applications need to track the lineage of their data Neither uncertainty nor lineage are supported by conventional Database Management Systems (DBMSs) Coincidence or Fate?

4 4 Sample Applications Deduplication / Entity Resolution Uncertainty: Match and merge Lineage: Source records Information extraction Uncertainty: Extracted labels and values Lineage: Original context Information integration Uncertainty: Inconsistent information Lineage: Original sources

5 5 Sample Applications Scientific experiments Uncertainty: Captured (and derived) data Lineage: Layers of views Sensor data Uncertainty: Sensor values, aggregations, missing readings Lineage: Original readings, views

6 6 Claim The connection between uncertainty and lineage goes deeper than just a shared need by several applications Coincidence or Fate?

7 7 Lineage and Uncertainty Lineage... Enables simple and consistent representation of uncertain data Allows for intuitive and complete working models Correlates uncertainty in query results with uncertainty in the input data Can make confidence computations over uncertain data more efficient Applications use lineage to reduce or resolve uncertainty

8 8 Goal A new kind of DBMS in which: 1.Data 2.Uncertainty 3.Lineage are all first-class interrelated concepts Trio }

9 9 The Trio Trio 1.Data Model Simplest extension to relational model that’s sufficiently expressive 2.Query Language Simple extension to SQL with well-defined semantics and intuitive behavior 3.System A complete open-source DBMS that people want to use

10 10 The Present 1.Data Model Uncertainty-Lineage Databases (ULDBs) 2.Query Language TriQL 3.System Trio-One: Layered prototype deployable on top of any standard relational DBMS

11 11 Running Example: Crime-Solving Saw(witness,car) // may be uncertain Drives(person,car) // may be uncertain Suspects(person) = π person (Saw ⋈ car Drives)

12 12 Data Model: Uncertainty An uncertain database represents a set of possible instances of data values Amy saw either a Honda or a Toyota Jimmy drives a Toyota, a Mazda, or both Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3 Hank is a suspect with confidence 0.7

13 13 Our Model for Uncertainty 1. Alternatives (mutually exclusive) 2. ‘?’ (Maybe) Annotations 3. Confidences

14 14 Our Model for Uncertainty 1. Alternatives: uncertainty about data value(s) 2. ‘?’ (Maybe) Annotations 3. Confidences Saw (witness,car) (Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda) witnesscar Amy{ Honda, Toyota, Mazda } = Three possible instances

15 15 Six possible instances Our Model for Uncertainty 1. Alternatives 2. ‘?’ (Maybe): uncertainty about presence 3. Confidences Saw (witness,car) (Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda) (Betty, Acura) ?

16 16 Our Model for Uncertainty 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences: weighted uncertainty Saw (witness,car) (Amy, Honda): 0.5 ∥ (Amy,Toyota): 0.3 ∥ (Amy, Mazda): 0.2 (Betty, Acura): 0.6 ? Six possible instances, each with a probability

17 17 Models for Uncertainty Our model (so far) is not especially new We spent some time exploring the space of models for uncertainty Tension between understandability and expressiveness – Our model is understandable – But it is not complete, or even closed under common operations

18 18 Closure and Completeness Completeness Can represent any finite set of possible instances Closure Can represent any result of relational operators Note: Completeness  Closure

19 19 Models for Uncertainty A-Tuples – Uncertainty on attribute values (Amy, {Toyota, Mazda}) Not closed X-Tuples – Uncertainty on entire tuples { (Amy, Toyota), (Amy, Mazda) } Still not closed C-Tables – Tuples + Boolean constraints { (Amy, Toyota, X=1), (Amy, Mazda, X<>1), (Cathy, Honda, X<>1) } Closed and complete! Understandable?

20 20 Our Model is Not Closed Saw (witness,car) (Cathy, Honda) ∥ (Cathy, Mazda) Drives (person,car) (Jimmy, Toyota) ∥ (Jimmy, Mazda) (Billy, Honda) ∥ (Frank, Honda) (Hank, Honda) Suspects Jimmy Billy ∥ Frank Hank Suspects = π person (Saw ⋈ car Drives) ? ? ? Does not correctly capture possible instances in the result CANNOT

21 21 to the Rescue Lineage (provenance): “where data came from” Internal lineage – pointers to base data External lineage – files, URLs, web portals, etc. In Trio: A Boolean function λ from alternatives to other alternatives (or external sources) Lineage

22 22 Example with Lineage IDSaw (witness,car) 11 (Cathy, Honda) ∥ (Cathy, Mazda) IDDrives (person,car) 21 (Jimmy, Toyota) ∥ (Jimmy, Mazda) 22 (Billy, Honda) ∥ (Frank, Honda) 23(Hank, Honda) IDSuspects 31Jimmy 32 Billy ∥ Frank 33Hank ? ? ? Suspects = π person (Saw ⋈ car Drives) λ (31) = (11,2),(21,2) λ (32,1) = (11,1),(22,1); λ (32,2) = (11,1),(22,2) λ (33) = (11,1), 23 Correctly captures possible instances in the result

23 23 Trio Data Model 1. Alternatives (mutually exclusive) 2. ‘?’ (Maybe) Annotations 3. Confidences 4. Lineage ULDBs are closed and complete Uncertainty-Lineage Databases (ULDBs)

24 24 ULDBs: Lineage Conjunctive lineage sufficient for most operations Disjunctive lineage for duplicate-elimination Negative lineage for difference

25 25 ULDBs: Minimality A ULDB relation R represents a set of possible instances  Does every tuple in R appear in some possible instance? (no extraneous tuples)  Does every maybe-tuple in R not appear in some possible instance? (no extraneous ‘?’s) Also Data-minimality Lineage-minimality

26 26 Data Minimality Examples Extraneous ‘?’... 10 (Billy, Honda) ∥ (Frank, Honda)... 40 Billy ∥ Frank... ? λ (40,1)=(10,1); λ (40,2)=(10,2) extraneous... 20(Billy, Honda)... ? 30(Frank, Honda)... ?

27 27 Data Minimality Examples Extraneous tuple (Diane, Mazda) ∥ (Diane, Acura) Diane extraneous (Diane, Mazda)(Diane, Acura) ? ??

28 28 ULDBs: Membership Questions  Does a given tuple t appear in some (all) possible instance(s) of R ?  Is a given table T one of (all of) the possible instances of R ? Polynomial algorithms based on data-minimization NP-Hard

29 29 Non-Theorists... Wake Back Up!

30 30 Querying ULDBs Simple extension to SQL Formal semantics, intuitive meaning Query uncertainty, confidences, and lineage TriQL

31 31 Initial TriQL Example IDSaw (witness,car) 11 (Cathy, Honda) ∥ (Cathy, Mazda) IDDrives (person,car) 21 (Jimmy, Toyota) ∥ (Jimmy, Mazda) 22 (Billy, Honda) ∥ (Frank, Honda) 23(Hank, Honda) SELECT Drives.person INTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car IDSuspects 31Jimmy 32 Billy ∥ Frank 33Hank ? ? ? λ (31) = (11,2),(21,2) λ (32,1)=(11,1),(22,1); λ (32,2)=(11,1),(22,2) λ (33) = (11,1), 23

32 32 Formal Semantics Query Q on ULDB D D D D 1, D 2, …, D n possible instances Q on each instance representation of instances Q(D 1 ), Q(D 2 ), …, Q(D n ) D’ implementation of Q operational semantics D + Result

33 33 Operational Semantics Over conventional relational database: For each tuple in cross-product of X 1, X 2,..., X n 1.Evaluate the predicate 2.If true, project attr-list to create result tuple 3.If INTO clause, insert into table SELECT attr-list [ INTO table ] FROM X 1, X 2,..., X n WHERE predicate

34 34 Operational Semantics Over ULDB: For each tuple in cross-product of X 1, X 2,..., X n 1.Create “super tuple” T from all combinations of alternatives 2.Evaluate predicate on each alternative in T ; keep only the true ones 3.Project attr-list on each alternative to create result tuple 4.Details: ‘?’, lineage, confidences SELECT attr-list [ INTO table ] FROM X 1, X 2,..., X n WHERE predicate

35 35 Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car Saw (witness,car) (Cathy, Honda) ∥ (Cathy, Mazda) Drives (person,car) (Jim, Mazda) ∥ (Bill, Mazda) (Hank, Honda)

36 36 Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car Saw (witness,car) (Cathy, Honda) ∥ (Cathy, Mazda) Drives (person,car) (Jim, Mazda) ∥ (Bill, Mazda) (Hank, Honda) (Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)

37 37 Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car Saw (witness,car) (Cathy, Honda) ∥ (Cathy, Mazda) Drives (person,car) (Jim, Mazda) ∥ (Bill, Mazda) (Hank, Honda) (Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)

38 38 Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car Saw (witness,car) (Cathy, Honda) ∥ (Cathy, Mazda) Drives (person,car) (Jim, Mazda) ∥ (Bill, Mazda) (Hank, Honda) (Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)

39 39 Operational Semantics: Example (Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda) SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car Saw (witness,car) (Cathy, Honda) ∥ (Cathy, Mazda) Drives (person,car) (Jim, Mazda) ∥ (Bill, Mazda) (Hank, Honda) (Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

40 40 Operational Semantics: Example (Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda) SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car Saw (witness,car) (Cathy, Honda) ∥ (Cathy, Mazda) Drives (person,car) (Jim, Mazda) ∥ (Bill, Mazda) (Hank, Honda) (Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

41 41 Operational Semantics: Example (Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda) SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car Saw (witness,car) (Cathy, Honda) ∥ (Cathy, Mazda) Drives (person,car) (Jim, Mazda) ∥ (Bill, Mazda) (Hank, Honda) (Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

42 42 Operational Semantics: Example SELECT Drives.person INTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car Saw (witness,car) (Cathy, Honda) ∥ (Cathy, Mazda) Drives (person,car) (Jim, Mazda) ∥ (Bill, Mazda) (Hank, Honda) Suspects Jim ∥ Bill Hank ? ? λ ( ) =...

43 43 Confidences Confidences supplied with base data Trio computes confidences on query results Default probabilistic interpretation Can choose to plug in different arithmetic Saw (witness,car) (Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4 Drives (person,car) (Jim, Mazda): 0.3 ∥ (Bill, Mazda): 0.6 (Hank, Honda) ): 1.0 Suspects Jim: 0.12 ∥ Bill: 0.24 Hank: 0.6 ? ? ? 0.30.4 0.6 Probabilistic Min

44 44 TriQL: Querying Confidences Built-in function: Conf() SELECT Drives.person INTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car AND Conf(Saw) > 0.5 AND Conf(Drives) > 0.8 SELECT Drives.person INTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car AND Conf(*) > 0.4

45 45 TriQL: Querying Lineage Built-in join predicate: Lineage() SELECT Saw.witness INTO AccusesHank FROM Suspects, Saw WHERE Lineage(Suspects,Saw) AND Suspects.person = ‘Hank’ Also works for transitive lineage: SELECT witness FROM AccusesHank WHERE Lineage(Saw,AccusesHank)

46 46 Additional Query Constructs “Horizontal subqueries” Refer to tuple alternatives as a relation Unmerged (allow horizontal duplicates) Flatten, GroupAlts NoLineage, NoConf, NoMaybe Query-computed confidences SQL-like data modification statements

47 47 Final Example Query PrimeSuspect (crime#, accuser, suspect) (1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank) (2, Cathy, Frank) ∥ (2, Betty, Freddy) personscore Amy10 Betty15 Cathy5 Suspects Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166 Frank: 0.25 ∥ Freddy: 0.75 Credibility List suspects with conf values based on accuser credibility

48 48 Final Example Query PrimeSuspect (crime#, accuser, suspect) (1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank) (2, Cathy, Frank) ∥ (2, Betty, Freddy) personscore Amy10 Betty15 Cathy5 Suspects Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166 Frank: 0.25 ∥ Freddy: 0.75 Credibility SELECT suspect, score/[sum(score)] as conf FROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

49 49 Final Example Query PrimeSuspect (crime#, accuser, suspect) (1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank) (2, Cathy, Frank) ∥ (2, Betty, Freddy) personscore Amy10 Betty15 Cathy5 Suspects Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166 Frank: 0.25 ∥ Freddy: 0.75 Credibility SELECT suspect, score/[sum(score)] as conf FROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

50 50 Final Example Query PrimeSuspect (crime#, accuser, suspect) (1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank) (2, Cathy, Frank) ∥ (2, Betty, Freddy) personscore Amy10 Betty15 Cathy5 Suspects Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166 Frank: 0.25 ∥ Freddy: 0.75 Credibility SELECT suspect, score/[sum(score)] as conf FROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P) “Scaled as conf” “Uniform as conf” … “Scaled as conf” “Uniform as conf” …

51 51 The Trio System Version 1 Entirely on top of a conventional DBMS Surprisingly easy and complete, reasonably efficient

52 52 The Trio System: Version 1 Lin:R aidFrtableaidTo R aidxidC Relational DBMS create trio table T(A,B) select C into R... Trio Metadata Trio API SQL commands Result cursors Traverse lineage T aidxidAB

53 53 Relation Encoding aidxidconfpersoncar 2112JimMazda 2212BillMazda 2321HankHonda Suspects Jim ∥ Bill Hank aidxidconfperson 311 3 Jim 321 3 Bill 332 2 Hank Saw (witness,car) (Cathy, Honda) ∥ (Cathy, Mazda) Drives (person,car) (Jim, Mazda) ∥ (Bill, Mazda) (Hank, Honda) aidxidconfwitnesscar 1112CathyHonda 1212CathyMazda ? ? aidFrtableaidTo 31Saw12 31Drives21 32Saw12 32Drives22 33Saw11 33Drives23 Saw Drives Suspects Lin:Suspects

54 54 Relation Encoding with Confidences aidxidconfpersoncar 211 0.3 JimMazda 221 0.6 BillMazda 232 1.0 HankHonda aidxidconfperson 311 NULL Jim 321 NULL Bill 332 NULL Hank Saw (witness,car) (Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4 Drives (person,car) (Jim, Mazda): 0.3 ∥ (Bill, Mazda): 0.6 (Hank, Honda) aidxidconfwitnesscar 111 0.6 CathyHonda 121 0.4 CathyMazda Saw Drives Suspects 0.12 0.24 0.6

55 55 Query Translation Persistent results: Query Q into result table R 1.Run query Q’ to produce “super-result” R Q’ ≈ Q but adds aid/xid’s of source tuples, joins lineage tables for lineage() predicates 2.Group R into alternatives, generate new xid’s 3.Materialze lineage data into lin:R 4.Compute confidences? (optional) 5.Add/update metadata: schemas, confidence info, lineage structure Transient results: stop at 2, return cursor over new xid’s

56 56 The Trio System: Version 1 R aidxidC Relational DBMS create trio table T(A,B) select C into R... Trio Metadata Trio API SQL commands Result cursors Traverse lineage T aidxidAB Lin:R aidFrtableaidTo

57 57 The Trio System: Version 1 Trio API and translator (TriQL parser in Python) Trio API and translator (TriQL parser in Python) Trio Metadata TrioExplorer (GUI client) TrioExplorer (GUI client) Trio Stored Procedures Encoded Data Tables Lineage Tables Standard SQL Data & system columns A-IDs for alternatives “Verticalized” through shared X-IDs Confidences One lineage table per derived table Connecting A-IDs Table types Schema-level lineage structure conf() lineage() TriQL queries DDL commands Schema browsing Table browsing Explore lineage On-demand confidence computation Command-line client Command-line client Relational DBMS SPI-Plugin for PostgreSQL Replacing “Fetch-All” SPI-Plugin for PostgreSQL Replacing “Fetch-All”

58 58 The Trio System: Version 2 Trio API and translator TrioExplorer (GUI client) TrioExplorer (GUI client) Command-line client Command-line client Trio DBMS Specialized Trio Data Structures Specialized Trio Processing & Optimizing Standard SQLDirect function calls

59 59 Efficient Confidence Computation Previous approach (probabilistic databases) Each operator computes confidences during query execution Only certain (“safe”) query plans allowed In ULDBs Confidence of alternative A is function of confidences in λ(A) Our approach Decoupled data and confidence computation Allow any query plan for data computation Compute confidences on-demand based on lineage

60 60 CIDs & Confidence Memoization Identify set of Closest Independent Descendants (CIDs) for each relation in the schema-level lineage DAG  relations with no shared ancestors Batched confidence computations in SQL Select unions of distinct base alternatives from all root-to-leaf lineage paths With CIDs and memoization enabled – if confidences are memoized then stop at CIDs – else traverse down to the leafs (base data) and memoize at CIDs (ongoing work)

61 61 “Probabilistic TPC-H” Setup Horizontal split of TPC-H tables Uniform-random number of alternatives [1-10] Uniform confidences Partsupps Lineitems Orders O2O2 O2O2 O1O1 O1O1 O O L L P P OL LP OLP 1.5M 1.0M0.1M 1.0M 0.5M 0.7M O3O3 O3O3 L2L2 L2L2 L1L1 L1L1 L3L3 L3L3 L4L4 L4L4 P2P2 P2P2 P1P1 P1P1 P3P3 P3P3 O O L L P P CID(OLP) = {O, L, P} 0.8M 6M 1.5M

62 62 Confidence Computation Times

63 63 Per-Tuple vs. Batching (no CIDs)

64 64 Memoization & Join Multiplicity

65 65 Current Topics System Full query language Nice interface Performance experiments Demo applications Algorithms: confidence & lineage computation, extraneous data, membership questions Minimize lineage traversal Confidence memoization Efficient batch computations

66 66 Future Directions Theory, Model, Algorithms Unlimited opportunities System Storage, indexing, partitioning Statistics and query optimization More features Top-k by confidence Incomplete relations; continuous uncertainty; correlated uncertainty External lineage; update lineage; versioning

67 Search “stanford trio” Current Trio team: Parag Agrawal, Anish Das Sarma, Raghotham Murthy, Michi Mutsuzaki, Shubha Nabar, Tomoe Sugihara, Martin Theobald, Jennifer Widom Special thanks to: Ashok Chandra, Alon Halevy, Jeff Ullman, Omar Benjelloun but don’t forget the lineage…


Download ppt "Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS Martin Theobald Jennifer Widom Stanford University."

Similar presentations


Ads by Google