Principles of Managing Uncertain Data


1 Principles of Managing Uncertain Data
Course Introduction
Faculty of Computer Science, Technion – Israel Institute of Technology
Winter

2 Assumed Background
Databases: relational model, database querying, SQL, relational algebra, schema, integrity constraints (e.g., functional dependencies)
Algorithms and complexity: asymptotic running time, polynomial time, NP, completeness, reduction
Basic probability theory: probability spaces, random variables, conditional probability

3 Requirements
Home assignments: 5 x dry (20% each)
Mandatory attendance: contact me in advance if you are having a problem attending a specific lecture

4 Historical Perspective

5 Pre-Relational Databases
Cross-application solutions for data storage and access were proposed as early as the 1960s. Examples:
The CODASYL committee standardized a network data model (the CODASYL Data Model)
  A network of entities linked to each other, very similar to object-oriented models
Integrated Data Store (IDS, Charles Bachman)
  High-performance graph database from 1964 (!)
IBM’s Information Management System (IMS), driven by the Apollo program
  Hierarchical data model, index and transaction support
(Photo: C. W. Bachman)

6 Codd’s Vision (1)
1970: Codd invents the relational database model
Idea: interface via First-Order Logic!
  Data = collection of relations, interconnected via keys
  Relations conform to a schema
  Questions via a query language over the schema
  System translates queries into actual execution plans
Principle: separate logical from physical layers
Work done in IBM San Jose, now IBM Almaden
[E. F. Codd: A Relational Model of Data for Large Shared Data Banks. In Communications of the ACM 13(6): (1970) ]
(Photo: Edgar F. Codd ( ))

7 Codd’s Vision (2)
 : Codd introduced the relational algebra and the relational calculus
  Algebraic and logical QLs, respectively
  Proved their equal expressive power
[E. F. Codd: Relational Completeness of Data Base Sublanguages. In: R. Rustin (ed.): Database Systems: 65-98]
(Photo: Edgar F. Codd ( ))

8 Codd Catches On (1)
1973: Michael Stonebraker and Eugene Wong implement Codd’s vision in INGRES
  Commercialized in 1983
  Evolved to Postgres (now PostgreSQL) in 1989
(Photos: M. Stonebraker, E. Wong)

9 Codd Catches On (2)
1974: A group from the IBM San Jose lab implements Codd’s vision in System R, which evolved to DB2 in 1983
  SQL initially developed at IBM by Donald D. Chamberlin and Raymond F. Boyce
  [Chamberlin, Boyce: SEQUEL: A Structured English Query Language. SIGMOD Workshop, Vol : ]
1977: Influenced by Codd, Larry Ellison founds Software Development Labs
  Becomes Relational Software in 1979
  Becomes Oracle Systems Corp (1982), named after its Oracle database product
(Photos: R. F. Boyce ( ), D. D. Chamberlin, J. Gray, L. Ellison, P. G. Selinger)

10 Selected Database Research Topics*
(Figure: research topics placed along a timeline with marks at 1980, 1990 and 2000. Topics as they appear on the slide:)
System Design: distributed, storage, in-memory, recovery
Database Security
Further XML: query eval / optimize, compression
Entity Resolution
Views: view-based access, incremental maintain
Information Extraction from Web/text
Query Languages: Codasyl, SQL, recursion, nesting
Database Privacy
Crowdsourcing: utilizing crowd input in databases
System Optimization: caching & replication, indexing, clustering
Data Models: streaming data, graph data
Schema Design: ER models, normal forms, dependency
Social Networks & Social Media
DB Uncertainty: inconsistency & cleaning, probabilistic DB
Benchmarking
Transaction & concur.
Data Models: Semantic Web (RDF, ontologies), NoSQL (doc, graph, key-value)
Heterogeneity: data integration, interoperability
DB Performance: query process & opt., evaluation methods
DB & IR: DB for search, search for DB
Analytics (OLAP)
Data Models: OO, geo, temporal
Data Models: multimedia, DNA, text, XML
DB & ML & AI
Schema Matching & Discovery
Provenance / lineage: model / compute
Logic: deductive (Datalog), integrity/constraints
Data Exchange
Mining & Discovery: discovering association rules
Cloud Databases
Ranking & personalization
Incompleteness (null)
Column Stores
* Based on SIGMOD session topics from DBLP

11 Publication Venues for DB Research
Conferences, general:
  SIGMOD: ACM Special Interest Group on Management of Data (since 1975)
  VLDB: Intl. Conf. on Very Large Databases (since 1975)
  ICDE: IEEE Intl. Conf. on Data Engineering (since 1984)
  EDBT: Intl. Conference on Extending Database Technology (since 1988)
Conferences, theory oriented:
  PODS: ACM Symp. on Principles of Database Systems (since 1982)
  ICDT: Intl. Conference on Database Theory (since 1986)
Journals:
  TODS: ACM Transactions on Database Systems (since 1976)
  VLDBJ: The VLDB Journal (since 1992)
  SIGMOD REC: ACM SIGMOD Record (since 1969)

12 Turing Awards for DB Technology
1973: Charles W. Bachman
1981: Edgar F. Codd
1998: Jim Gray
2014: Michael Stonebraker

13 Some Modern Database Content
Knowledge Bases Business DBs Sensing Data Integration Signal / Image Processing OCR / Image Text Analytics / NLP Web Pages Social Media Financial Reports Gov Reports Med Reports

14 Knowledge Bases
Examples: MPI YAGO, Microsoft Probase, Google Knowledge Graph, DBPedia, Stanford DeepDive, CMU NELL, Freebase, . . .
(Figure labels: Attribute, Value, Instance, Concept, Relationship, Probability. Example shown: the instance Israel with concepts country 0.4, location 0.35, Person 0.2)

15 Relating to Big Data
Volume, Velocity, Variety, Veracity, Value
Missing information, conflicting information, probabilistic information

16 Uncertainty is Popular in DB Research
VLDB 2014 Ten Year Best Paper: Nilesh Dalvi and Dan Suciu: Efficient Query Evaluation on Probabilistic Databases
PODS 2014 Keynote: Leonid Libkin: Incomplete data: what went wrong, & how to fix it
SIGMOD/PODS 2014 Workshop on Big Uncertain Data: Kimelfeld (DB) and Kersting (AI)
ICDT 2013 Test-of-Time Award: Ronald Fagin, Phokion Kolaitis, Renee Miller, and Lucian Popa: Data Exchange: Semantics and Query Answering

17 2016 Dagstuhl Perspective Workshop
Gathering of top database theorists, evaluating the field and planning ahead

18 In the Course
Foundations of database complexity
  Data/combined complexity, join acyclicity, hypertree width
Principled, application-independent paradigms for managing uncertainty in data
  Incomplete / inconsistent / probabilistic databases
Two key aspects for every paradigm:
  Representation & semantics: How do we represent what we know? What is missing? What is conflicting? What is our confidence?
  Query evaluation: What is the meaning of query answering in the presence of uncertainty? What is the computational complexity?

19 Basic Database Concepts

20 Schema and Databases
A database schema is a finite set of relation names, each mapped into a relation schema
Example: Student(sid,name,year) , Course(cid,topic) , Studies(sid,cid)
A (database) instance over a schema consists of a relation for each relation schema
Student (sid, name, year): 861 Alma 2; 753 Amir 1; 955 Ahuva
Course (cid, topic): 23 PL; 45 DB; 76 OS
Studies (sid, cid): 861 23 45 753 955 76

21 Integrity Constraints in Databases
Student (ID, name, addr): 1234 Avia Haifa; 2345 Boris Nesher
Course (number, name, lecturer): 363 DB Anna; 319 PL Barak
Took (sID, cNum, grade): 1234 363 95; 2345 319 73
Constraints:
  A student cannot get two grades for the same course
  Grade must be > 53 (check constraint)
  No two tuples have the same ID (key constraint)
  Courses with the same number have the same name (functional dependency)
  sID is a Student.ID; cNum is a Course.number (referential constraint)

22 What are Integrity Constraints?
Schema-level (data-independent) specifications on how records should behave beyond the relational structure (e.g., students with the same ID have the same name, take the same courses, etc.)
DBMS guarantees that constraints are always satisfied, by disabling actions that cause violations
What if we get data that violates the constraints to begin with?? Wait for “inconsistent databases”
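As a rough illustration of a DBMS refusing actions that would violate constraints, the following sketch declares the constraints from the Student / Course / Took example in SQLite via Python's sqlite3 module. The choice of SQLite, the helper name rejected, and the four violating statements are assumptions made for this example; they are not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
CREATE TABLE Student(ID INTEGER PRIMARY KEY, name TEXT, addr TEXT);
CREATE TABLE Course(number INTEGER PRIMARY KEY, name TEXT, lecturer TEXT);
CREATE TABLE Took(
    sID   INTEGER REFERENCES Student(ID),     -- referential constraint
    cNum  INTEGER REFERENCES Course(number),  -- referential constraint
    grade INTEGER CHECK (grade > 53),         -- check constraint
    PRIMARY KEY (sID, cNum)                   -- one grade per student and course
);
INSERT INTO Student VALUES (1234,'Avia','Haifa'), (2345,'Boris','Nesher');
INSERT INTO Course VALUES (363,'DB','Anna'), (319,'PL','Barak');
INSERT INTO Took VALUES (1234,363,95), (2345,319,73);
""")

def rejected(sql):
    """Hypothetical helper: True if the DBMS refuses the statement."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# Illustrative statements, each violating one constraint.
violations = {
    "second grade for the same course": "INSERT INTO Took VALUES (1234, 363, 80)",
    "grade too low": "INSERT INTO Took VALUES (1234, 319, 50)",
    "duplicate student ID": "INSERT INTO Student VALUES (1234, 'Dana', 'Haifa')",
    "unknown course": "INSERT INTO Took VALUES (2345, 999, 90)",
}
results = {label: rejected(sql) for label, sql in violations.items()}
print(results)
```

In this sketch the functional dependency "courses with the same number have the same name" needs no separate declaration, because number is the key of Course.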

23 Why Integrity Constraints?
Maintenance: consistency assured without custom code
Development complexity: no reliance on consistency tests
  But exceptions need to be handled
Optimization: operations may be optimized if we know that some constraints hold (e.g., once a sought student ID is found, you can stop; you won’t find it again)

24 Querying: Which Courses Did Avia Take?
S (ID, name, addr): 1234 Avia Haifa; 2345 Boris Nesher
C (number, name, lecturer): 363 DB Anna; 319 PL Barak
T (sID, cNum, grade): 1234 363 95 319 82 2345 73
Assembly: ... mov $1, %rax; mov $1, %rdi; mov $message, %rsi; mov $13, %rdx; syscall; mov $60, %rax; xor %rdi, %rdi
SQL: SELECT C.name FROM S,C,T WHERE S.name = 'Avia' AND S.ID = T.sID AND T.cNum = C.number
Algebra (RA): π_{C.name}(σ_{S.name='Avia', number=cNum, ID=sID}(S⨉C⨉T))
Python:
  for s in S:
    for c in C:
      for t in T:
        if s.name == 'Avia' and s.ID == t.sID and t.cNum == c.number:
          print(c.name)
Logic (RC): {⟨x⟩ | ∃y,a,z,l,g [S(y,'Avia',a) ∧ C(z,x,l) ∧ T(y,z,g)]}
Logic Programming (Datalog): Q(x) ← S(y,'Avia',a), C(z,x,l), T(y,z,g)

25 Relational Algebra
Primitive operators: Projection (π), Selection (σ), Renaming (ρ), Union (∪), Difference (\), Cartesian Product (×)
Natural join (⨝) can be defined using σ, ×, π
Conjunctive Queries (CQ): ⨝, σ, π
Unions of CQs (UCQs, “positive RA”): ⨝, σ, π, ∪
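To make the operators concrete, here is a minimal sketch of selection, projection, renaming and Cartesian product over relations stored as lists of dicts, with natural join defined on top of them. The representation, the function names, the prefixed attribute names (sname, cname) and the Took rows (taken from the integrity-constraints example) are assumptions made for illustration.

```python
from itertools import product as cartesian

def select(rel, pred):                      # σ
    return [t for t in rel if pred(t)]

def project(rel, attrs):                    # π (set semantics: duplicates removed)
    out = []
    for t in rel:
        row = {a: t[a] for a in attrs}
        if row not in out:
            out.append(row)
    return out

def rename(rel, mapping):                   # ρ
    return [{mapping.get(a, a): v for a, v in t.items()} for t in rel]

def times(r, s):                            # ×, assumes disjoint attribute names
    return [{**t, **u} for t, u in cartesian(r, s)]

def natural_join(r, s):                     # ⨝ expressed with ρ, ×, σ and π
    if not r or not s:
        return []
    common = [a for a in r[0] if a in s[0]]
    s_primed = rename(s, {a: a + "'" for a in common})
    matched = select(times(r, s_primed),
                     lambda t: all(t[a] == t[a + "'"] for a in common))
    return project(matched, list(r[0]) + [a for a in s[0] if a not in common])

S = [{"ID": 1234, "sname": "Avia", "addr": "Haifa"},
     {"ID": 2345, "sname": "Boris", "addr": "Nesher"}]
C = [{"number": 363, "cname": "DB", "lecturer": "Anna"},
     {"number": 319, "cname": "PL", "lecturer": "Barak"}]
T = [{"sID": 1234, "cNum": 363, "grade": 95},
     {"sID": 2345, "cNum": 319, "grade": 73}]

# π_cname(σ_{sname='Avia', number=cNum, ID=sID}(S × C × T))
answer_product = project(
    select(times(times(S, C), T),
           lambda t: t["sname"] == "Avia" and t["number"] == t["cNum"] and t["ID"] == t["sID"]),
    ["cname"])

# The same query with natural joins, after renaming T's attributes to match S and C.
T2 = rename(T, {"sID": "ID", "cNum": "number"})
answer_join = project(
    select(natural_join(natural_join(S, T2), C), lambda t: t["sname"] == "Avia"),
    ["cname"])
print(answer_product, answer_join)
```

Both formulations use only π, σ, ρ, × and ⨝, so the query is a conjunctive query in the sense of the slide.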

26 RC Query
Schema: Person(id, gender, country), Parent(parent, child), Spouse(person1, person2)
{ (x,u) | Person(u, 'female', 'Canada') ⋀ ∃y,z [Parent(y,x) ⋀ Parent(z,y) ⋀ ∃w [Parent(z,w) ⋀ y≠w ⋀ (u=w ⋁ Spouse(u,w))] ] }
Which relatives does this query find?
(Figure: family tree over the variables z, y, w, u, x)

27 Domain Independence
What is the meaning of the following?
{ (x) | ¬Person(x, 'female', 'Canada') }
{ (x,y) | ∃z [Spouse(x,z) ⋀ y=z] }
{ (x,y) | ∃z [Spouse(x,z) ⋀ y≠z] }
Schema: Person(id, gender, country), Parent(parent, child), Spouse(person1, person2)
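One way to see why the first query is problematic: its answer changes when the underlying domain changes, even though the database stays the same. The sketch below evaluates it over two domains and contrasts it with a guarded variant. The Person tuples, the extra domain values and both function names are hypothetical.

```python
# Hypothetical instance of Person(id, gender, country).
person = {(1, "female", "Canada"), (2, "male", "Israel")}

def not_female_canadian(domain):
    """{ (x) | ¬Person(x,'female','Canada') } with x ranging over `domain`."""
    return {x for x in domain if (x, "female", "Canada") not in person}

def guarded(domain):
    """{ (x) | ∃g,c Person(x,g,c) ⋀ ¬Person(x,'female','Canada') }."""
    return {x for x in domain
            if any(t[0] == x for t in person) and (x, "female", "Canada") not in person}

active_domain = {v for t in person for v in t}   # values occurring in the database
larger_domain = active_domain | {3, "Mars"}      # same database, bigger domain

print(not_female_canadian(active_domain) == not_female_canadian(larger_domain))
print(guarded(active_domain), guarded(larger_domain))
```

The unguarded query depends on the domain, while the guarded one returns the same answer for both domains, which is the behavior domain independence asks for.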

28 Equivalence Between RA and D.I. RC
Theorem: RA and domain-independent RC have the same expressive power. More formally, on every schema S:
  For every RA expression E there is a domain-independent RC query Q such that Q≡E
  For every domain-independent RC query Q there is an RA expression E such that Q≡E

29 Complexity of Joins
Which join is more complicated? In what sense?
Acyclic / Path: ⋀_{1⩽i<n} R(X_i, X_{i+1})
Bounded treewidth (cycle): ⋀_{1⩽i<n} R(X_i, X_{i+1}) ⋀ R(X_n, X_1)
Clique: ⋀_{1⩽i<j⩽n} R(X_i, X_j)
(Figures: the graph of each query over the variables X1, . . . , X5)
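One standard way to tell the acyclic case apart from the others is the GYO reduction on the query's hypergraph: repeatedly delete variables that occur in a single atom and atoms whose variables are contained in another atom; the query is acyclic exactly when nothing remains. The slide does not name this procedure, so the sketch below is an illustration brought in here, with hypothetical function and variable names.

```python
def is_acyclic(atoms):
    """GYO reduction: True iff the hypergraph whose edges are the atoms' variable sets is acyclic."""
    edges = [set(a) for a in atoms]
    changed = True
    while changed:
        changed = False
        # Rule 1: drop variables that occur in exactly one edge.
        for e in edges:
            for v in list(e):
                if sum(v in f for f in edges) == 1:
                    e.discard(v)
                    changed = True
        # Rule 2: drop one edge that is empty or contained in another edge.
        for i, e in enumerate(edges):
            if not e or any(i != j and e <= f for j, f in enumerate(edges)):
                edges.pop(i)
                changed = True
                break
    return not edges

n = 5
path = [(i, i + 1) for i in range(1, n)]                              # R(X_i, X_{i+1})
cycle = path + [(n, 1)]                                               # ... plus R(X_n, X_1)
clique = [(i, j) for i in range(1, n + 1) for j in range(i + 1, n + 1)]

print(is_acyclic(path), is_acyclic(cycle), is_acyclic(clique))
```

The test separates the path from the other two queries; distinguishing the cycle (treewidth 2) from the clique needs a finer measure such as treewidth, which this sketch does not compute.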

30 Incomplete Databases

31 Missing Information
Problem: pieces of data missing, but we need to keep whatever partial knowledge we have
Registrations (student, course): Ahuva PL
Courses (course, lecturer): PL Eran
A source tells us that Alon is a student of Keren. How can we represent it in our DB?
Registrations (student, course): Ahuva PL; Alon ⊥
Courses (course, lecturer): PL Eran; ⊥ Keren
⊥=NULL

32 SQL’s NULL
NULL is SQL’s special “missing value”
Same queries as complete tables, but SQL assigns a special behavior to logic over NULL
“Three-valued logic”: true, false, unknown
Alas, there are some issues...

33 Try It Yourself (psql)
CREATE TABLE Registrations( student varchar(40), course varchar(40));
INSERT INTO Registrations VALUES ('Ahuva','PL'), ('Alon',NULL);
CREATE TABLE Courses( course varchar(40), lecturer varchar(40));
INSERT INTO Courses VALUES ('PL','Eran'), (NULL,'Keren');
Registrations (student, course): Ahuva PL; Alon NULL
Courses (course, lecturer): PL Eran; NULL Keren
SELECT student, lecturer FROM Registrations R, Courses C WHERE R.course = C.course;
  Result (student, lecturer): Ahuva Eran
Of course, we've lost our initial association (join)...

34 Try More Yourself (psql)
Registrations (student, course): Ahuva PL; Alon NULL
Courses (course, lecturer): PL Eran; NULL Keren
SELECT student FROM Registrations;
  Result: Ahuva, Alon
SELECT student FROM Registrations WHERE course='PL';
  Result: Ahuva
SELECT student FROM Registrations WHERE course!='PL';
  Result: (empty)
SELECT student FROM Registrations WHERE course='PL' OR course!='PL';
  Result: Ahuva (Alon??)
Inconsistent logic... real problem!
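The psql session can be replayed without a PostgreSQL server. The sketch below runs the same statements against an in-memory SQLite database through Python's sqlite3 module; using SQLite instead of psql is an assumption of this example, and the helper run and the result variable names are hypothetical. For these comparisons SQLite applies the same three-valued logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Registrations(student varchar(40), course varchar(40));
INSERT INTO Registrations VALUES ('Ahuva','PL'), ('Alon',NULL);
CREATE TABLE Courses(course varchar(40), lecturer varchar(40));
INSERT INTO Courses VALUES ('PL','Eran'), (NULL,'Keren');
""")

def run(sql):
    """Hypothetical helper: run a query and return its rows in sorted order."""
    return sorted(conn.execute(sql).fetchall())

joined = run("SELECT student, lecturer FROM Registrations R, Courses C "
             "WHERE R.course = C.course")
everyone = run("SELECT student FROM Registrations")
is_pl = run("SELECT student FROM Registrations WHERE course = 'PL'")
not_pl = run("SELECT student FROM Registrations WHERE course != 'PL'")
either = run("SELECT student FROM Registrations "
             "WHERE course = 'PL' OR course != 'PL'")

for name, rows in [("join", joined), ("all", everyone), ("= PL", is_pl),
                   ("!= PL", not_pl), ("= PL OR != PL", either)]:
    print(name, rows)
```

For Alon, both course = 'PL' and course != 'PL' evaluate to unknown, so their disjunction is unknown as well and the row is filtered out, although every concrete course value would satisfy it.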

35 Labeled Nulls in “Naive” Tables
Just like nulls, but each null has a name
We do not know what the value is, but we do know that two nulls with the same name are the same
Registrations (student, course): Ahuva PL; Alon ⊥1, ⊥2
Courses (course, lecturer): PL Eran; ⊥1 Keren; ⊥2 Shaul
Join result (student, course, lecturer): Ahuva PL Eran; Alon ⊥1 Keren; ⊥2 Shaul
? =
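A minimal sketch of how a join can be evaluated over tables with labeled nulls: each null is treated as an ordinary value that is equal only to a null with the same label. The Null class, the function naive_join and the exact rows are assumptions for illustration (the rows loosely follow the tables on this slide).

```python
from dataclasses import dataclass

@dataclass(frozen=True, repr=False)
class Null:
    """A labeled null: equal only to a null carrying the same label."""
    label: int

    def __repr__(self):
        return f"⊥{self.label}"

registrations = {("Ahuva", "PL"), ("Alon", Null(1))}
courses = {("PL", "Eran"), (Null(1), "Keren"), (Null(2), "Shaul")}

def naive_join(regs, crs):
    # Nulls are compared like constants: ⊥1 == ⊥1, but ⊥1 != ⊥2 and ⊥1 != 'PL'.
    return {(s, c, l) for (s, c) in regs for (c2, l) in crs if c == c2}

result = naive_join(registrations, courses)
pairs = {(s, l) for (s, _, l) in result}
print(sorted(pairs))
```

Unlike the SQL NULL version, the association between Alon and Keren survives the join, because both tuples refer to the same unknown course ⊥1.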

36 Possible Worlds
(Figure: the incomplete table Registrations (student, course) with rows Ahuva PL; Alon ⊥1 ⊥2, shown next to examples of its possible worlds)
Closed-World Assumption: possible worlds such as Ahuva PL; Alon DB . . .
Open-World Assumption: possible worlds such as Ahuva PL; Alon DB AI; Avi ML . . .

37 Semantics of Query Answering
(Figure: an incomplete DB and its possible worlds; what is the answer to a query Q?)

38 Semantics of Query Answering
(Figure: an incomplete DB and its possible worlds; Q applied to each possible world gives answers a1, a2, a3, a4, . . . ; what is the overall answer?)

39 Semantics of Query Answering
(Figure: an incomplete DB and its possible worlds; Q applied to each possible world gives answers a1, a2, a3, a4, . . .)
Certain answers (“weak”): ∩ai
Represent {a1,a2,…} as an incomplete relation (“strong”)
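The certain-answer semantics can be sketched by brute force: enumerate the possible worlds, evaluate Q in each, and intersect the results. The example below does this under the closed-world assumption for a single null. The finite domain of candidate course names is a hypothetical simplification (in general the domain is infinite), and all names in the code are illustrative.

```python
from itertools import product

registrations = [("Ahuva", "PL"), ("Alon", "⊥1")]
courses = [("PL", "Eran"), ("⊥1", "Keren")]
nulls = ["⊥1"]
domain = ["PL", "DB", "AI"]  # hypothetical finite set of candidate values

def query(regs, crs):
    """Q: (student, lecturer) pairs that share a course."""
    return {(s, l) for (s, c) in regs for (c2, l) in crs if c == c2}

def possible_worlds():
    """Closed-world assumption: each world replaces every null by a domain value."""
    for values in product(domain, repeat=len(nulls)):
        v = dict(zip(nulls, values))
        yield ([(s, v.get(c, c)) for s, c in registrations],
               [(v.get(c, c), l) for c, l in courses])

answers = [query(r, c) for r, c in possible_worlds()]
certain = set.intersection(*answers)   # ∩ai
possible = set.union(*answers)
print(sorted(certain))
print(sorted(possible - certain))
```

The pairs (Ahuva, Keren) and (Alon, Eran) appear only in the world where ⊥1 is PL, so they are possible but not certain.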

40 Application: Data Exchange
Mapping ... Global Schema Messages Users Associations

41 The Clio Project
IBM + U. Toronto – tool for data exchange
Commercialized in IBM DB2

42 Formalism [Fagin et al. 05]
A schema mapping is defined by a source schema S, a target schema T, and a set Σ of logical assertions stating how S relates to T
S: StudLecturer(student, lecturer)
T: Registrations(student, course), Courses(course, lecturer)
Σ: StudLecturer(x,y) → ∃z Registrations(x,z) ⋀ Courses(z,y)
Source instance, StudLecturer (student, lecturer): Ahuva Shaul; Alon Keren
Target instance: ?? We don’t have z! So 2 options:
  Abort
  Do our best to max usability

43 Formalism [Fagin et al. 05]
A schema mapping is defined by a source schema S, a target schema T, and a set Σ of logical assertions stating how S relates to T
S: StudLecturer(student, lecturer)
T: Registrations(student, course), Courses(course, lecturer)
Σ: StudLecturer(x,y) → ∃z Registrations(x,z) ⋀ Courses(z,y)
Source instance, StudLecturer (student, lecturer): Ahuva Shaul; Alon Keren
Solution:
  Registrations (student, course): Ahuva ⊥1; Alon ⊥2
  Courses (course, lecturer): ⊥1 Shaul; ⊥2 Keren
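The solution on this slide can be produced mechanically: for every source tuple, invent a fresh labeled null for the existentially quantified z and emit the two target tuples. The sketch below hard-codes this single assertion; the function name exchange and the string encoding of nulls are assumptions for illustration, not the interface of Clio or any other tool.

```python
from itertools import count

def exchange(stud_lecturer):
    """Apply StudLecturer(x,y) → ∃z Registrations(x,z) ⋀ Courses(z,y) to each source tuple."""
    fresh = count(1)
    registrations, courses = [], []
    for student, lecturer in stud_lecturer:
        z = f"⊥{next(fresh)}"              # a fresh labeled null witnesses ∃z
        registrations.append((student, z))
        courses.append((z, lecturer))
    return registrations, courses

source = [("Ahuva", "Shaul"), ("Alon", "Keren")]
registrations, courses = exchange(source)
print(registrations)
print(courses)

# Joining the target tables on the nulls recovers exactly the source pairs.
recovered = {(s, l) for (s, c) in registrations for (c2, l) in courses if c == c2}
```

Using a distinct null per source tuple avoids claiming that Ahuva and Alon share a course, which the source does not say.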

44 Problems Studied in Data Exchange
Materialization
  Many solutions exist; what makes one solution “better” than another?
  Is there a “best” solution? How to find it?
Target query answering
  Given a source instance and a query over the target, evaluate the query (semantics / complexity)
Manipulating schema mappings
  Composition and inversion of mappings

45 Inconsistent Databases

46 Inconsistency
An inconsistent database contains inconsistent (or impossible) information
  Two students have the same ID
  A student gets credit for the same course twice
  A student takes a course that is not listed in the course database
  A student has a grade for this course but a grade is missing for an assignment
Modeling: (D,Σ) where D is a database and Σ is a set of required logical integrity constraints over DBs; alas, D violates Σ

47 Query Answering
Database D:
  Grades (student, course, grade): Ahuva PL 90; Alon PL 86; Alon PL 81
  Courses (course, lecturer): PL Eran; DC Keren
Integrity Constraints Σ: Functional Dependency: student, course → grade
SELECT student FROM Grades G, Courses C WHERE G.grade >= 85 AND G.course = C.course AND C.lecturer='Eran'
Answer: Ahuva; Alon ?

48 Query Answering
Database D:
  Grades (student, course, grade): Ahuva PL 90; Alon PL 86; Alon PL 81
  Courses (course, lecturer): PL Eran; DC Keren
Integrity Constraints Σ: Functional Dependency: student, course → grade
SELECT student FROM Grades G, Courses C WHERE G.grade >= 87 AND G.course = C.course AND C.lecturer='Eran'
Answer: Ahuva; Alon X

49 Query Answering
Database D:
  Grades (student, course, grade): Ahuva PL 90; Alon PL 86; Alon PL 81
  Courses (course, lecturer): PL Eran; DC Keren
Integrity Constraints Σ: Functional Dependency: student, course → grade
SELECT student FROM Grades G, Courses C WHERE G.grade >= 80 AND G.course = C.course AND C.lecturer='Eran'
Answer: Ahuva; Alon

50 Minimal Repairs [Arenas, Bertossi, Chomicki 99]
Definition: Let (D,Σ) be an inconsistent DB. A repair is a DB D', such that:
  D' is consistent (with respect to Σ)
  D' differs from D in a “minimal way”
Inconsistent database D, Grades (student, course, grade): Ahuva PL 90; Alon PL 86; Alon PL 81
Repair D'1, Grades (student, course, grade): Ahuva PL 90; Alon PL 86
Repair D'2, Grades (student, course, grade): Ahuva PL 90; Alon PL 81

51 Semantics of Query Answering
(Figure: an inconsistent DB and its repairs, which are consistent DBs; what is the answer to a query Q?)

52 Semantics of Query Answering
(Figure: an inconsistent DB and its repairs, which are consistent DBs; Q applied to each repair gives answers a1, a2, a3, a4, . . . , an; what is the overall answer?)

53 Semantics of Query Answering
(Figure: an inconsistent DB and its repairs, which are consistent DBs; Q applied to each repair gives answers a1, a2, a3, a4, . . . , an)
Consistent answers: ∩ai
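Consistent answers can be sketched by brute force on the Grades example: enumerate the repairs, run the query on each, and intersect. The code assumes subset repairs (maximal consistent subsets), which is one common reading of "minimal way", and it enumerates all subsets, so it is only meant for tiny examples. All function names are hypothetical.

```python
from itertools import combinations

grades = [("Ahuva", "PL", 90), ("Alon", "PL", 86), ("Alon", "PL", 81)]
courses = [("PL", "Eran"), ("DC", "Keren")]

def consistent(rows):
    """FD student, course → grade: no two rows agree on (student, course) and differ on grade."""
    return all(not (a[:2] == b[:2] and a[2] != b[2]) for a, b in combinations(rows, 2))

def repairs(rows):
    """Subset repairs: the maximal consistent subsets of the rows."""
    candidates = [set(c) for k in range(len(rows) + 1)
                  for c in combinations(rows, k) if consistent(c)]
    return [c for c in candidates if not any(c < d for d in candidates)]

def query(grade_rows, threshold):
    """Students with a grade >= threshold in a course taught by Eran."""
    return {s for (s, c, g) in grade_rows for (c2, l) in courses
            if g >= threshold and c == c2 and l == "Eran"}

def consistent_answers(threshold):
    return set.intersection(*(query(r, threshold) for r in repairs(grades)))

for threshold in (85, 87, 80):
    print(threshold, sorted(consistent_answers(threshold)))
```

With threshold 85, Alon is an answer in only one of the two repairs, so he is not a consistent answer; with threshold 80 he is an answer in both.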

54 Algorithms / Complexity
Koutris & Wijsen [2015]: For consistent query answering with key constraints, we know how Select-Project-Join (SPJ) queries without self-joins are classified into 3 categories:
  1. Rewriting: a rewritten query Q' is evaluated directly on the inconsistent DB (ignore inconsistency) and returns ∩ai
  2. Graph algorithm: an algorithm Alg over the inconsistent DB returns ∩ai
  3. coNP-complete (exptime under standard complexity assumptions)

55 Incorporating Preferences
Functional dependencies: course → lecturer, lecturer → course
Courses (course, lecturer): DB Keren DC Eran
What if we trust some tuples more than others?
Staworko, Chomicki, Marcinkowski: Prioritized repairing and consistent query answering in relational databases. Ann. Math. Artif. Intell. 64(2-3): (2012)

56 Probabilistic Databases

57 How to accommodate the probabilistic nature of data at the database & query level?
Table (Student, University): Ahuva Technion Alon HaifaU
Table (Employee, Employer, Role): Ahuva Intel Eng PM VP Alon Yahoo! Google
Queries:
  Find the students that are employed as engineers
  How many students work at Intel?
  Is any PM a Technion student?

58 How to accommodate the probabilistic nature of data at the database & query level?
Table (Student, University, Pr): Ahuva Technion Alon HaifaU; Pr 1.0 0.7 0.3
Table (Employee, Employer, Role, Pr): Ahuva Intel Eng PM VP Alon Yahoo! Google; Pr 0.7 0.2 0.1 0.4
Queries:
  Find the students that are employed as engineers: Ahuva (0.7), Alon (0.8)
  How many students work at Intel? Expectation =
  Is any PM a Technion student? Yes w/ prob 1-((1-0.2)*(1-0.7*0.1))

59 Semantics
(Figure: a probabilistic database is a space of ordinary DBs with probabilities p1, p2, p3, p4, . . . , pn)

60 Semantics of Query Answering
(Figure: a probabilistic database as a space of ordinary DBs with probabilities p1, p2, p3, p4, . . . , pn; what is the answer to a query Q?)

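61 Semantics of Query Answering
(Figure: a probabilistic database as a space of ordinary DBs with probabilities p1, p2, p3, p4, . . . , pn; Q applied to each DB gives answers a1, a2, a3, a4, . . . , an, carrying the same probabilities; what is the overall answer?)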

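62 Semantics of Query Answering
(Figure: a probabilistic database as a space of ordinary DBs with probabilities p1, p2, p3, p4, . . . , pn; Q applied to each DB gives answers a1, a2, a3, a4, . . . , an, carrying the same probabilities)
The answer A is one of:
  A representation of the probability space
  A mapping tuple → marginal probability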
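The "tuple → marginal probability" answer can be computed directly from the possible-world semantics on a tiny example: enumerate every world, weight it by its probability, and add that weight to each answer tuple the query returns there. The sketch assumes a tuple-independent database; the facts, their probabilities and all names are hypothetical and are not the tables shown earlier in the slides.

```python
from itertools import product

# Hypothetical tuple-independent database: each fact is present independently.
facts = {
    ("Student", "Ahuva", "Technion"): 0.9,
    ("Student", "Alon", "Technion"): 0.5,
    ("Eng", "Ahuva"): 0.7,
    ("Eng", "Alon"): 0.4,
}

def worlds():
    """Yield every ordinary DB (a set of facts) together with its probability."""
    items = list(facts.items())
    for keep in product([True, False], repeat=len(items)):
        prob, world = 1.0, set()
        for (fact, p), present in zip(items, keep):
            prob *= p if present else 1 - p
            if present:
                world.add(fact)
        yield world, prob

def query(world):
    """Q: Technion students that are employed as engineers."""
    return {f[1] for f in world
            if f[0] == "Student" and f[2] == "Technion" and ("Eng", f[1]) in world}

marginal = {}
total = 0.0
for world, prob in worlds():
    total += prob
    for answer in query(world):
        marginal[answer] = marginal.get(answer, 0.0) + prob

print(marginal)
```

The enumeration visits 2^n worlds for n facts, so it only illustrates the semantics; it is not a practical evaluation method.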

63 Algorithms for Query Answering
Dalvi & Suciu dichotomy: SPJ queries can be fully classified into:
  Queries that can be solved in polynomial time
    By repeated decomposition into simpler queries
  Queries for which answering is #P-hard
    Hence, cannot be computed in polynomial time under standard complexity assumptions
Heuristic via BDDs [Olteanu+]
Guaranteed approximation via sampling
  Additive approx. p±𝜀 is simple
  Multiplicative approx. (1±𝜀)p requires more work
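The additive approximation can be sketched as plain Monte Carlo sampling: draw random worlds, evaluate the Boolean query in each, and report the fraction of hits. By Hoeffding's inequality, n ≥ ln(2/δ) / (2𝜀²) samples give an estimate within ±𝜀 of p with probability at least 1−δ. The tuple-independent facts, their probabilities and the function names below are hypothetical.

```python
import math
import random

# Hypothetical tuple-independent database: each fact is present independently.
facts = {
    ("Student", "Ahuva", "Technion"): 0.9,
    ("Student", "Alon", "Technion"): 0.5,
    ("Eng", "Ahuva"): 0.7,
    ("Eng", "Alon"): 0.4,
}

def boolean_query(world):
    """Is some Technion student employed as an engineer?"""
    return any(f[0] == "Student" and f[2] == "Technion" and ("Eng", f[1]) in world
               for f in world)

def samples_needed(eps, delta):
    """Hoeffding bound on the number of samples for additive error eps."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def estimate(eps, delta, rng):
    n = samples_needed(eps, delta)
    hits = 0
    for _ in range(n):
        world = {f for f, p in facts.items() if rng.random() < p}  # sample one world
        hits += boolean_query(world)
    return hits / n

# For this small example the exact value is easy: the two students are independent.
exact = 1 - (1 - 0.9 * 0.7) * (1 - 0.5 * 0.4)
approx = estimate(eps=0.05, delta=0.01, rng=random.Random(0))
print(samples_needed(0.05, 0.01), exact, approx)
```

The sample count depends only on 𝜀 and δ, not on p. A multiplicative guarantee (1±𝜀)p is harder because the number of samples needed grows as p gets small.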

64 Probabilistic XML [Abiteboul, Kimelfeld, Sagiv, Senellart]:
Representation systems and XPath evaluation

65 Planned Schedule

66 Planned Schedule
Lecture  Date   Topic                                  Assignment
1        30/10  Intro
2        06/11  Database Query Languages
3        13/11  Querying Complexity
4        20/11  Acyclic Joins                          Assignment 1 Due 24/11
5        27/11  Complements
6        04/12  Incomplete Data                        Assignment 2 Due 08/12
7        11/12  Data Exchange
8        18/12                                         Assignment 3 Due 22/12
                No Lecture (Hanukkah)
9        01/01  Inconsistent Data
10       08/01  Consistent Query Answering             Assignment 4 Due 12/01
11       15/01  Probabilistic Databases
12       22/01  Inference on Probabilistic Databases   Assignment 5 Due 25/01
         26/01  Semester Ends

