Trio: A System for Data, Uncertainty, and Lineage Jennifer Widom et al Stanford University.


1 Trio: A System for Data, Uncertainty, and Lineage Jennifer Widom et al Stanford University

2 2 Depiction of Trio Project

3 3 Some Context: Trio Project
They’re building a new kind of DBMS in which
1. Data
2. Accuracy
3. Lineage
are all first-class interrelated concepts

4 4 The “Trio” in Trio
1. Data
   Student #123 is majoring in Econ: (123, Econ) ∈ Major
2. Uncertainty
   Student #123 is majoring in Econ or CS: (123, Econ ∥ CS) ∈ Major
   With confidence 60% student #456 is a CS major: (456, CS 0.6) ∈ Major
3. Lineage
   456 ∈ HardWorker derived from: (456, CS) ∈ Major; “CS is hard” ∈ some web page

5 5 Depiction Data Uncertainty Lineage (“sourcing”)

6 6 Original Motivation for the Project New Application Domains Many involve data that is uncertain ( approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate,...) Many of the same ones need to track the lineage (provenance) of their data Neither uncertainty nor lineage is supported in current database systems

7 7 Importance of Lineage [technical] Lineage: 1.Enables simple and consistent representation of uncertain data 2.Correlates uncertainty in query results with uncertainty in the input data 3.Can make computation over uncertain data more efficient [fluffy] Applications use lineage to reduce or resolve uncertainty

8 8 Sample Applications Information extraction Find & label entities in unstructured text Often probabilistic Information integration Combine data from multiple sources Inconsistencies Scientific experiments Inexact/incomplete data Many levels of “derived data products”

9 9 Sample Applications Sensor data management Approximate readings Missing readings Levels of data aggregation Deduplication (“data cleaning”) Object linkage, entity resolution Often heuristic/probabilistic Approximate query processing Fast but inexact answers

10 10 The “Usual” DBMS Features (From first lecture of any Intro to Databases class) 1.Efficient, 2.Convenient, 3.Safe, 4.Multi-User storage of and access to 5.Massive amounts of 6.Persistent data

11 11 Completeness vs. Closure
[Figure: all sets-of-instances, containing the representable sets-of-instances, with operations Op1 and Op2 drawn as arrows]
Completeness: blue = yellow
Closure: arrow stays in blue
Proposition: An incomplete representation is still interesting if it’s expressive enough and closed under all required operations

12 12 Operations: Semantics
Easy and natural (re)definition for any standard database operation (call it Op)
[Diagram: D has possible instances I1, I2, …, In; applying Op to each instance gives J1, J2, …, Jm; D’ is the representation of those instances (the up-arrow from D’ to J1, …, Jm); Op’ is the direct implementation taking D to D’]
Closure: up-arrow always exists
Note: Completeness ⇒ Closure

13 13 Incompleteness
Three possible instances of a relation (person, day):
Instance1: (Jennifer, Monday), (Mike, Tuesday)
Instance2: (Jennifer, Monday)
Instance3: (Mike, Tuesday)
Representation with ‘?’ annotations:
(Jennifer, Monday) ?
(Mike, Tuesday) ?
generates 4th instance: empty relation

14 14 Non-Closure Under Join
(person, day): (Mike, {Monday, Tuesday})
⋈
(day, food): (Monday, chicken), (Tuesday, fish)
Result has two possible instances:
Instance1 (person, day, food): (Mike, Monday, chicken)
Instance2 (person, day, food): (Mike, Tuesday, fish)
Not representable with or-sets and ?

15 15 Another “Trio” in Trio 1.Data Model Simplest extension to relational model that’s sufficiently expressive 2.Query Language Simple extension to SQL with well-defined semantics and intuitive behavior 3.System A complete open-source DBMS that people want to use

16 16 Another “Trio” in Trio 1.Data Model Uncertainty-Lineage Databases (ULDBs) 2.Query Language TriQL 3.System Trio-One — built on top of standard DBMS

17 17 Remainder of Talk 1.Data Model Uncertainty-Lineage Databases (ULDBs) 2.Query Language TriQL 3.System Trio-One — built on top of standard DBMS 4. Demo

18 18 Quote from Jennifer We are not about machine learning or probabilistic reasoning! We are about efficient and convenient storage, manipulation, and retrieval of large data sets (with uncertainty and lineage in them)

19 19 Running Example: Crime-Solving Saw (witness, color, car) // may be uncertain Drives (person, color, car) // may be uncertain Suspects (person) = π person (Saw ⋈ Drives)

20 20 In Standard Relational DBMS
Saw (witness, color, car)
Amy    red    Mazda
Betty  blue   Honda
Carol  green  Toyota
Drives (person, color, car)
Jimmy  red    Toyota
Billy  blue   Honda
Frank  red    Mazda
Frank  green  Mazda
Create Table Suspects as Select person From Saw, Drives Where Saw.color = Drives.color And Saw.car = Drives.car
Suspects
Frank
Billy
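The standard-relational version of the crime-solving query can be reproduced with any SQL engine. The sketch below uses Python's built-in sqlite3 module as a stand-in DBMS (an assumption; the slides do not name one) and loads the Saw and Drives rows from the slide. The function name find_suspects is hypothetical.

```python
import sqlite3

def find_suspects():
    """Run the slide's Suspects query over certain (ordinary) relations."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE Saw (witness TEXT, color TEXT, car TEXT)")
    cur.execute("CREATE TABLE Drives (person TEXT, color TEXT, car TEXT)")
    cur.executemany(
        "INSERT INTO Saw VALUES (?, ?, ?)",
        [("Amy", "red", "Mazda"), ("Betty", "blue", "Honda"), ("Carol", "green", "Toyota")],
    )
    cur.executemany(
        "INSERT INTO Drives VALUES (?, ?, ?)",
        [
            ("Jimmy", "red", "Toyota"),
            ("Billy", "blue", "Honda"),
            ("Frank", "red", "Mazda"),
            ("Frank", "green", "Mazda"),
        ],
    )
    # Same statement as on the slide: join on both color and car.
    cur.execute(
        "CREATE TABLE Suspects AS "
        "SELECT person FROM Saw, Drives "
        "WHERE Saw.color = Drives.color AND Saw.car = Drives.car"
    )
    rows = cur.execute("SELECT person FROM Suspects").fetchall()
    conn.close()
    return sorted(row[0] for row in rows)

suspects = find_suspects()
print(suspects)
```

Carol's green Toyota matches no driver, so only Billy and Frank end up in Suspects, as on the slide.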

21 21 Data Model: Uncertainty An uncertain database represents a set of possible instances Amy saw either a Honda or a Toyota Jimmy drives a Toyota, a Mazda, or both Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3 Hank is a suspect with confidence 0.7

22 22 Their Model for Uncertainty 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences

23 23 Our Model for Uncertainty
1. Alternatives: uncertainty about value
2. ‘?’ (Maybe) Annotations
3. Confidences
Saw (witness, color, car)
Amy    red, Honda ∥ red, Toyota ∥ orange, Mazda
Three possible instances

24 24 Our Model for Uncertainty
1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence
3. Confidences
Saw (witness, color, car)
Amy    red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty  blue, Acura  ?
Six possible instances

25 25 Our Model for Uncertainty
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences: weighted uncertainty
Saw (witness, color, car)
Amy    red, Honda 0.5 ∥ red, Toyota 0.3 ∥ orange, Mazda 0.2
Betty  blue, Acura 0.6  ?
Six possible instances, each with a probability
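The count of six possible instances, each with a probability, can be checked by brute-force enumeration. The sketch below is illustrative only and rests on two assumptions not spelled out on the slide: different tuples choose their alternatives independently, and a ‘?’ tuple is absent with probability one minus the sum of its confidences. The names choices and possible_instances are hypothetical.

```python
from itertools import product

# Saw relation from the slide: witness -> list of ((color, car), confidence).
saw = [
    ("Amy", [(("red", "Honda"), 0.5), (("red", "Toyota"), 0.3), (("orange", "Mazda"), 0.2)]),
    ("Betty", [(("blue", "Acura"), 0.6)]),
]

def choices(alternatives):
    """List the options for one tuple; None stands for 'tuple absent' (the '?')."""
    options = list(alternatives)
    missing = 1.0 - sum(conf for _, conf in alternatives)
    if missing > 1e-9:  # assumption: leftover probability mass means '?'
        options.append((None, missing))
    return options

def possible_instances(relation):
    """Enumerate every possible instance together with its probability."""
    per_tuple = [
        [(witness, alt, p) for alt, p in choices(alts)] for witness, alts in relation
    ]
    result = []
    for combo in product(*per_tuple):
        prob = 1.0
        rows = []
        for witness, alt, p in combo:
            prob *= p  # assumption: tuples are independent
            if alt is not None:
                rows.append((witness,) + alt)
        result.append((rows, prob))
    return result

instances = possible_instances(saw)
for rows, prob in instances:
    print(round(prob, 2), rows)
```

Amy contributes three choices and Betty two (present or absent), giving 3 × 2 = 6 instances whose probabilities sum to 1.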

26 26 Models for Uncertainty Our model (so far) is not especially new We spent some time exploring the space of models for uncertainty Tension between understandability and expressiveness – Our model is understandable – But it is not complete, or even closed under common operations

27 27 Our Model is Not Complete or Closed
Saw (witness, car)
Cathy  Honda ∥ Mazda
Drives (person, car)
Jimmy, Toyota ∥ Jimmy, Mazda
Billy, Honda ∥ Frank, Honda
Hank, Honda
Suspects = π person (Saw ⋈ Drives)
Suspects
Jimmy  ?
Billy ∥ Frank  ?
Hank  ?
Does not (CANNOT) correctly capture possible instances in the result

28 28 Lineage to the Rescue
Lineage (provenance): “where data came from”
Internal lineage
External lineage
In Trio: A function λ from data elements to other data elements (or external sources)

29 29 Example with Lineage
ID  Saw (witness, car)
11  Cathy  Honda ∥ Mazda
ID  Drives (person, car)
21  Jimmy, Toyota ∥ Jimmy, Mazda
22  Billy, Honda ∥ Frank, Honda
23  Hank, Honda
Suspects = π person (Saw ⋈ Drives)
ID  Suspects
31  Jimmy  ?
32  Billy ∥ Frank  ?
33  Hank  ?
λ(31) = (11,2),(21,2)
λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)
λ(33) = (11,1), 23
Correctly captures possible instances in the result
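One way to see why lineage fixes the representation is to enumerate every choice of base alternatives and compare two ways of computing Suspects: following the λ entries from the slide, and running the join directly in each possible world. This is a minimal sketch, not Trio's implementation; the function names are hypothetical, and λ(33) is written with (23, 1) because tuple 23 has a single alternative.

```python
from itertools import product

# Base x-tuples: id -> alternatives (alternative indexes are 1-based, as on the slide).
saw = {11: [("Cathy", "Honda"), ("Cathy", "Mazda")]}
drives = {
    21: [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    22: [("Billy", "Honda"), ("Frank", "Honda")],
    23: [("Hank", "Honda")],
}
# Suspects alternatives with their lineage: (person, set of (base id, alternative index)).
suspects = [
    ("Jimmy", {(11, 2), (21, 2)}),
    ("Billy", {(11, 1), (22, 1)}),
    ("Frank", {(11, 1), (22, 2)}),
    ("Hank", {(11, 1), (23, 1)}),
]

def worlds():
    """Yield every assignment of one chosen alternative per base x-tuple."""
    base = {**saw, **drives}
    ids = sorted(base)
    for picks in product(*(range(1, len(base[i]) + 1) for i in ids)):
        yield dict(zip(ids, picks))

def suspects_via_lineage(world):
    """A suspect is present exactly when all alternatives in its lineage were chosen."""
    chosen = set(world.items())
    return frozenset(person for person, lineage in suspects if lineage <= chosen)

def suspects_via_join(world):
    """Evaluate pi_person(Saw join Drives) directly on one possible world."""
    saw_rows = [saw[i][world[i] - 1] for i in saw]
    drives_rows = [drives[i][world[i] - 1] for i in drives]
    return frozenset(p for (_, s_car) in saw_rows for (p, d_car) in drives_rows if s_car == d_car)

results = {suspects_via_lineage(w) for w in worlds()}
for r in sorted(results, key=sorted):
    print(sorted(r))
```

Under these assumptions the two computations agree in every world, and combinations such as {Jimmy, Hank}, which the ‘?’-only representation on slide 27 would allow, never appear.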

30 30 Example with Lineage
ID  Saw (witness, car)
11  Cathy  Honda ∥ Mazda
ID  Drives (person, car)
21  Jimmy, Toyota ∥ Jimmy, Mazda
22  Billy, Honda ∥ Frank, Honda
23  Hank, Honda
Suspects = π person (Saw ⋈ Drives)
ID  Suspects
31  Jimmy  ?
32  Billy ∥ Frank  ?
33  Hank  ?

31 31 Trio Data Model 1.Alternatives 2.‘?’ (Maybe) Annotations 3.Confidences 4.Lineage ULDBs are closed and complete Uncertainty-Lineage Databases (ULDBs)

32 32 Formal Definition of ULDB

33 33 Formal definition of Completeness

34 34 Proof of Completeness Proof: Construct R with x-relations S1, …, Sn, corresponding to R1, …, Rn. Construct an extra relation PW that encodes the possible instances; PW contains exactly one x-tuple: (1) | (2) | ... | (m). Each Si is constructed as follows: for every Pj, each tuple t in Ri forms a maybe x-tuple with just one alternative with value t. Duplicates within and across possible instances are preserved in Si. We add (j) in PW to the lineage of alternatives in tuples copied from Pj. This now exactly encodes the data in each of the possible instances.

35 35 Proof, Continued The correct lineage is obtained as follows: we look at the lineage λj in Pj and mimic it in the x-tuples it contributes in S1 through Sn. For example, if λj(t1) = {t2} in Pj, where t1 is a tuple in R1 and t2 is a tuple in R2, then the x-tuple that t2 gave in S2 is added to the lineage of the x-tuple from t1 in S1. As a final step, we remove the extra relation PW but retain its symbols as external lineage. Therefore, each possible LDB of D now has the same schema as each Pj, and represents exactly the same data and internal lineage.
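The data part of this construction can be sketched for a single relation. The code below is a simplified illustration under stated assumptions: it handles one relation only, it skips the step that mimics internal lineage between relations, and the dictionary layout and the names encode and decode are hypothetical. The three instances are the ones from the Incompleteness slide.

```python
def encode(instances):
    """Turn possible instances P1..Pm into maybe x-tuples whose lineage points at PW."""
    x_tuples = []
    for j, instance in enumerate(instances, start=1):
        for t in instance:
            # One maybe x-tuple per tuple per instance; duplicates are preserved.
            x_tuples.append({"value": t, "maybe": True, "lineage": ("PW", j)})
    pw_alternatives = list(range(1, len(instances) + 1))  # the x-tuple (1)|(2)|...|(m)
    return x_tuples, pw_alternatives

def decode(x_tuples, j):
    """Recover possible instance Pj: keep the x-tuples whose lineage is alternative j of PW."""
    return [x["value"] for x in x_tuples if x["lineage"] == ("PW", j)]

instances = [
    [("Jennifer", "Monday"), ("Mike", "Tuesday")],
    [("Jennifer", "Monday")],
    [("Mike", "Tuesday")],
]
x_tuples, pw = encode(instances)
recovered = [decode(x_tuples, j) for j in pw]
print(recovered)
```

Choosing alternative j of PW selects exactly the tuples copied from Pj, so the three instances are recovered and the unwanted empty instance is not generated.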

36 36 ULDBs: Minimality A ULDB relation R represents a set of possible instances Does every tuple in R appear in some possible instance? (no extraneous tuples) Does every maybe-tuple in R not appear in some possible instance? (no extraneous ‘?’s) Also Data-minimality Lineage-minimality

37 37 Data Minimality Examples
Extraneous ‘?’
...
10  Billy, Honda ∥ Frank, Mazda
...
20  Billy ∥ Frank  ?  (extraneous)
...
λ(20,1) = (10,1); λ(20,2) = (10,2)

38 38 Data Minimality Examples
Extraneous tuple
[Figure: x-tuples “Diane, Mazda ∥ Acura”, “Diane, Acura”, “Diane, Mazda”, and “Diane”, three of them annotated ‘?’; one tuple is marked extraneous]

39 39 Querying ULDBs Simple extension to SQL Formal semantics, intuitive meaning Query uncertainty, confidences, and lineage TriQL

40 40 Simple TriQL Example
ID  Saw (witness, car)
11  Cathy  Honda ∥ Mazda
ID  Drives (person, car)
21  Jimmy, Toyota ∥ Jimmy, Mazda
22  Billy, Honda ∥ Frank, Honda
23  Hank, Honda
Create Table Suspects as Select person From Saw, Drives Where Saw.car = Drives.car
ID  Suspects
31  Jimmy  ?
32  Billy ∥ Frank  ?
33  Hank  ?
λ(31) = (11,2),(21,2)
λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)
λ(33) = (11,1), 23

41 41 Formal Semantics
Relational (SQL) query Q on ULDB D
[Diagram: D has possible instances D1, D2, …, Dn; Q on each instance gives Q(D1), Q(D2), …, Q(Dn); D’ is the representation of those instances; the implementation of Q (operational semantics) takes D to D’ (D + Result)]

42 42 TriQL: Querying Confidences Built-in function: Conf() SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car AND Conf(Saw) > 0.5 AND Conf(Drives) > 0.8
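The effect of the Conf() predicates can be approximated in a few lines. The sketch below is not TriQL and not Trio's implementation: it assumes Conf(X) refers to the confidence of the alternative being considered, it ignores result confidences, ‘?’ and lineage, and both the data and the function name confident_suspects are hypothetical.

```python
# Each x-tuple is a list of alternatives; each alternative carries a confidence.
# Hypothetical data, not taken from the slides.
saw = [
    [("Amy", "Honda", 0.6), ("Amy", "Toyota", 0.4)],
]
drives = [
    [("Jimmy", "Honda", 0.9)],
    [("Billy", "Honda", 0.7)],
    [("Frank", "Toyota", 0.95)],
]

def confident_suspects(saw, drives, saw_min=0.5, drives_min=0.8):
    """Keep persons whose matching Saw and Drives alternatives both clear the thresholds."""
    result = []
    for saw_tuple in saw:
        for drives_tuple in drives:
            for (_, saw_car, saw_conf) in saw_tuple:
                for (person, drives_car, drives_conf) in drives_tuple:
                    # Join predicate plus the two Conf() filters from the query.
                    if saw_car == drives_car and saw_conf > saw_min and drives_conf > drives_min:
                        result.append(person)
    return result

print(confident_suspects(saw, drives))
```

With this data only Jimmy qualifies: Billy's Drives alternative is below 0.8, and the Toyota sighting that would implicate Frank is below 0.5.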

43 43 TriQL: Querying Lineage Built-in join predicate: Lineage() SELECT Suspects.person FROM Suspects, Saw WHERE Lineage(Suspects,Saw) AND Saw.witness = ‘Amy’ X ==> Y shorthand for Lineage(X,Y)

44 44 Operational Semantics Over standard relational database: For each tuple in cross-product of X 1, X 2,..., X n 1.Evaluate the predicate 2.If true, project attr-list to create result tuple SELECT attr-list FROM X 1, X 2,..., X n WHERE predicate

45 45 Operational Semantics Over ULDB: For each tuple in cross-product of X 1, X 2,..., X n 1.Create “super tuple” T from all combinations of alternatives 2.Evaluate predicate on each alternative in T ; keep only the true ones 3.Project attr-list on each alternative to create result tuple 4.Details: ‘?’, lineage, confidences SELECT attr-list FROM X 1, X 2,..., X n WHERE predicate
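The steps above can be sketched for the query SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car, using the Cathy/Jim/Bill/Hank data from the example slides. This is a simplified illustration, not Trio's code: lineage and confidences (step 4) are left out, and the rule used for ‘?’ (mark the result as a maybe-tuple when an input was one or when some combination was filtered out) is an assumption. The name select_person is hypothetical.

```python
from itertools import product

# x-tuples as dicts: a list of alternatives plus a maybe ('?') flag.
saw = [{"alts": [("Cathy", "Honda"), ("Cathy", "Mazda")], "maybe": False}]
drives = [
    {"alts": [("Jim", "Mazda"), ("Bill", "Mazda")], "maybe": False},
    {"alts": [("Hank", "Honda")], "maybe": False},
]

def select_person(saw, drives):
    """Evaluate the join query x-tuple by x-tuple, following the slide's steps."""
    result = []
    for s, d in product(saw, drives):  # each tuple in the cross-product
        # Step 1: "super tuple" of all combinations of alternatives.
        super_tuple = list(product(s["alts"], d["alts"]))
        # Step 2: evaluate the predicate on each alternative; keep only the true ones.
        kept = [(sa, da) for sa, da in super_tuple if sa[1] == da[1]]
        if not kept:
            continue  # no alternative survives, so no result tuple
        # Step 3: project the attribute list (person) on each kept alternative.
        people = [da[0] for _, da in kept]
        # Step 4 (simplified, assumed rule): decide the '?' annotation.
        maybe = s["maybe"] or d["maybe"] or len(kept) < len(super_tuple)
        result.append({"alts": people, "maybe": maybe})
    return result

for x_tuple in select_person(saw, drives):
    print(" ∥ ".join(x_tuple["alts"]), "?" if x_tuple["maybe"] else "")
```

For the first pair of x-tuples only the two Mazda combinations survive, giving Jim ∥ Bill; for the second only the Honda combination survives, giving Hank. Both result tuples are marked ‘?’ because some combinations were dropped.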

46 46 Operational Semantics: Example
SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car
Saw (witness, car)
Cathy  Honda ∥ Mazda
Drives (person, car)
Jim ∥ Bill  Mazda
Hank  Honda

47 47 Operational Semantics: Example
SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car
Saw (witness, car)
Cathy  Honda ∥ Mazda
Drives (person, car)
Jim ∥ Bill  Mazda
Hank  Honda
Super tuple:
(Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)

50 50 Operational Semantics: Example
SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car
Saw (witness, car)
Cathy  Honda ∥ Mazda
Drives (person, car)
Jim ∥ Bill  Mazda
Hank  Honda
Super tuples:
(Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)
(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

53 53 Additional TriQL Constructs “Horizontal subqueries” Refer to tuple alternatives as a relation Aggregations: low, high, expected Unmerged (horizontal duplicates) Flatten, GroupAlts NoLineage, NoConf, NoMaybe Query-computed confidences Data modification statements

54 54 Trio-Specific Additional Features Lineage tracing On-demand confidence computation Coexistence checks Extraneous data removal  Interrelated algorithms

55 55 The Trio System Version 1 (“Trio-One”) On top of standard DBMS Surprisingly easy and complete, reasonably efficient

56 56 Trio-One Overview
Clients: Command-line client; TrioExplorer (GUI client)
Trio API and translator (Python): DDL commands, TriQL queries, Schema browsing, Table browsing, Explore lineage, On-demand confidence computation
Standard SQL
Standard relational DBMS, containing:
Encoded Data Tables: Partition and “verticalize”; Shared IDs for alternatives; Columns for confidence, “?”
Lineage Tables: One per result table; Uses unique IDs; Encodes formulas
Trio Metadata: Table types; Schema-level lineage structure
Trio Stored Procedures: Conf(), Lineage()

57 57 Strengths First DBMS with uncertainty and lineage Has many applications like I showed earlier Done by Stanford!

58 58 Weaknesses Paper has lots of definitions rather than explanations. Proofs are written as running text rather than as multiple lines with math symbols.

59 59 Future Work More forms of uncertainty Continuous uncertainty (intervals, Gaussians) Correlated uncertainty Incomplete relations More forms of lineage External lineage Update lineage Confidence-based queries Threshold; “Top-K”

60 60 Future Work Conjunctive lineage sufficient for most operations Disjunctive lineage for duplicate-elimination Negative lineage for difference General case after several queries: Boolean formula

61 Search “stanford trio”

