Presentation is loading. Please wait.

Presentation is loading. Please wait.

Trio: A System for Data, Uncertainty, and Lineage Search “stanford trio”

Similar presentations


Presentation on theme: "Trio: A System for Data, Uncertainty, and Lineage Search “stanford trio”"— Presentation transcript:

1 Trio: A System for Data, Uncertainty, and Lineage Search “stanford trio” http://i.stanford.edu/trio

2 2 People Current Jennifer Widom (faculty) Omar Benjelloun (post-doc) Parag Agrawal, Anish Das Sarma, Shubha Nabar (PhD) Michi Mutsuzaki (MS) Tomoe Sugihara (visitor) Incoming Martin Theobald (post-doc) Raghu Murthy (MS) Ander de Keijzer (visitor) Alums Alon Halevy, Ashok Chandra (visitors) Chris Hayworth (MS)

3 3 Why Uncertainty + Lineage? Many applications seem to need both From a technical standpoint, it turns out that lineage... 1.Enables simple and consistent representation of uncertain data 2.Correlates uncertainty in query results with uncertainty in the input data 3.Can make computation over uncertain data more efficient

4 4 Trio Components 1.Data Model ULDBs (Uncertainty-Lineage Databases): Simple extension to relational model 2.Query Language TriQL: Simple extension to SQL, well-defined semantics and intuitive behavior 3.System Version 1: Complete system and GUI built on top of conventional DBMS

5 5 Running Example: Crime-Solving Saw(witness,car) // may be uncertain Drives(person,car) // may be uncertain Suspects(person) = π person (Saw ⋈ Drives)

6 6 Our Model for Uncertainty 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences

7 7 Our Model for Uncertainty 1. Alternatives: uncertainty about value 2. ‘?’ (Maybe) Annotations 3. Confidences Saw (witness,car) (Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda) witnesscar Amy{ Honda, Toyota, Mazda } = Three possible instances

8 8 Six possible instances Our Model for Uncertainty 1. Alternatives 2. ‘?’ (Maybe): uncertainty about presence 3. Confidences Saw (witness,car) (Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda) (Betty, Acura) ?

9 9 Our Model for Uncertainty 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences: weighted uncertainty Saw (witness,car) (Amy, Honda): 0.5 ∥ (Amy,Toyota): 0.3 ∥ (Amy, Mazda): 0.2 (Betty, Acura): 0.6 ? Six possible instances, each with a probability

10 10 Models for Uncertainty Our model (so far) is not especially new We spent some time exploring the space of models for uncertainty [ICDE 06, journal] Tension between understandability and expressiveness – Our model is understandable – But it is not complete, or even closed under common operations

11 11 Our Model is Not Closed Saw (witness,car) (Cathy, Honda) ∥ (Cathy, Mazda) Drives (person,car) (Jimmy, Toyota) ∥ (Jimmy, Mazda) (Billy, Honda) ∥ (Frank, Honda) (Hank, Honda) Suspects Jimmy Billy ∥ Frank Hank Suspects = π person (Saw ⋈ Drives) ? ? ? Does not correctly capture possible instances in the result CANNOT

12 12 Lineage to the Rescue Lineage Captures “where data came from” In Trio: A function λ from alternatives to other alternatives (or external sources)

13 13 Example with Lineage IDSaw (witness,car) 11 (Cathy, Honda) ∥ (Cathy, Mazda) IDDrives (person,car) 21 (Jimmy, Toyota) ∥ (Jimmy, Mazda) 22 (Billy, Honda) ∥ (Frank, Honda) 23(Hank, Honda) IDSuspects 31Jimmy 32 Billy ∥ Frank 33Hank ? ? ? Suspects = π person (Saw ⋈ Drives) λ (31) = (11,2),(21,2) λ (32,1) = (11,1),(22,1); λ (32,2) = (11,1),(22,2) λ (33) = (11,1), 23 Correctly captures possible instances in the result

14 14 Uncertainty-Lineage Databases (ULDBs) 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences 4. Lineage ULDBs are closed and complete [VLDB 06]

15 15 ULDBs: Lineage Conjunctive lineage sufficient for most operations Duplicate-elimination: Disjunctive lineage Difference: Negative lineage General case after multiple operations/queries: Boolean formula

16 16 ULDBs: Interesting Questions Data-minimality: extraneous alternatives, extraneous “?” Lineage-minimality: harder Membership: tuple and table, some-instance and all-instances Coexistence: multiple tuples Extraction: remove tables, retain possible- instances

17 17 Example: Extraneous Data (Diane, Mazda) ∥ (Diane, Acura) Diane extraneous (Diane, Mazda)(Diane, Acura) ? ??

18 18 Example: Coexistence Mazda Acura (Diane, Mazda) ∥ (Diane, Acura) (Diane, Mazda)(Diane, Acura) ? ?? ? Can’t coexist

19 19 Querying ULDBs: Semantics Query Q on ULDB D D D D 1, D 2, …, D n possible instances Q on each instance representation of instances Q(D 1 ), Q(D 2 ), …, Q(D n ) D’ implementation of Q operational semantics D + Result

20 20 Querying ULDBs: TriQL Basic TriQL: SQL with new semantics Obeys commutative diagram for uncertain data Tracks lineage Query results: new table or on-the-fly Implemented TriQL: also built-in predicates conf(), lineage(), lineage*()

21 21 Additional TriQL Constructs [Language manual on web site] “Horizontal subqueries” Refer to tuple alternatives as a relation Unmerged (horizontal duplicates) Flatten, GroupAlts NoLineage, NoConf, NoMaybe Query-specified confidences [done] Data modification statements

22 22 Confidence Computation Confidences computed on-demand based on lineage —Confidence of alternative A is function of confidences in λ * (A) —Permits any query plan for data computation Default probabilistic interpretation, but queries can override SELECT person, min(conf(Saw),conf(Drives)) as conf FROM Saw, Drives WHERE Saw.car = Drives.car

23 23 Trio System: Version 1 Standard relational DBMS Trio API and translator (Python) Trio API and translator (Python) Command-line client Command-line client Trio Metadata TrioExplorer (GUI client) TrioExplorer (GUI client) Trio Stored Procedures Encoded Data Tables Lineage Tables Standard SQL “Verticalize” Shared IDs for alternatives Columns for confidence,“?” One per result table Uses unique IDs Table types Schema-level lineage structure conf() lineage() “==>” lineage*() “==>>” DDL commands TriQL queries Schema browsing Table browsing Explore lineage On-demand confidence computation

24 24 Current & Future Topics Algorithms: confidence computation, coexistence extraneous data Minimize lineage traversal Memoization Batch operations System Full query language More internal processing ? – Storage and indexing – Statistics and query optimization

25 25 Current & Future Topics Top-K by confidence Extend basic uncertainty model —Incomplete relations —Continuous uncertainty —Correlated uncertainty ? External lineage, update lineage, versioning


Download ppt "Trio: A System for Data, Uncertainty, and Lineage Search “stanford trio”"

Similar presentations


Ads by Google