Well-designed XML Data

Presentation transcript:

Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto

Outline. Part 1: Database normalization from the 1970s and 1980s. Part 2: Classical theory revisited: normalizing XML documents. Part 3: Classical theory redone: new justifications for normalization.

Part 1: Classical Normalization. Design: decide how to represent the information in a particular data model. Even for simple application domains there are many ways of representing the data of interest. We have to design the schema of the database: the set of relations, the set of attributes for each relation, and the set of data dependencies.

Designing a Database: An Example. Attributes: number, title, section, room. Data dependency: every course number is associated with only one title. Relational schema, BAD alternative: R(number, title, section, room), number → title. GOOD alternative: S(number, title), number → title; T(number, section, room), ∅ (no dependencies).

Problems with BAD: Update Anomaly. Instance of R(number, title, section, room): CSC258, Computer Organization, with sections 1 (LP266), 2 (GB258) and 3 (GB248); CSC434, Database Systems. The title of CSC258 is changed to Computer Organization I, so every tuple for CSC258 has to be updated. The instance stores redundant information.

Deletion Anomaly. Instance: CSC258, Computer Organization I, with sections 1 (LP266), 2 (GB258) and 3 (GB248); CSC434, Database Systems. CSC434 is not given in this term, so its tuples are deleted. Additional effect: all the information about CSC434, including its title, is deleted.

Insertion Anomaly. Instance: CSC258, Computer Organization I, with sections 1 (LP266), 2 (GB258) and 3 (GB248). A new course is created: (CSC336, Numerical Methods). Its tuple can only be stored with an unknown (?) section and room. The instance stores attributes that are not directly related.

Avoiding Update Anomalies. With the GOOD schema, S(number, title) holds (CSC258, Computer Organization) and (CSC434, Database Systems), and T(number, section, room) holds CSC258 with sections 1 (LP266), 2 (GB258) and 3 (GB248), plus the sections of CSC434. Title of CSC258 is changed to Computer Organization I: a single tuple of S changes, because the instance does not store redundant information. CSC434 is not given in this term: its sections are removed from T, but the title of CSC434 is not removed from the instance. A new course is created, (CSC336, Numerical Methods): it is added to S, and no information about sections has to be provided. Each relation stores attributes that are directly related.
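
The three anomalies can be replayed on small in-memory relations. The sketch below is only an illustration: the helper names (`decompose`, `join`, `rename_in_r`) are made up for this example, and the section and room of CSC434 are placeholders, since the slides do not show them.

```python
"""Compare R(number, title, section, room) with its decomposition into S and T."""

# Rows of the BAD schema R. The CSC434 section and room are placeholders
# (an assumption of this sketch, not data from the slides).
R = {
    ("CSC258", "Computer Organization", 1, "LP266"),
    ("CSC258", "Computer Organization", 2, "GB258"),
    ("CSC258", "Computer Organization", 3, "GB248"),
    ("CSC434", "Database Systems", 1, "ROOM?"),
}


def decompose(r):
    """Project R onto S(number, title) and T(number, section, room)."""
    s = {(number, title) for number, title, _, _ in r}
    t = {(number, section, room) for number, _, section, room in r}
    return s, t


def join(s, t):
    """Natural join of S and T on number."""
    titles = dict(s)  # valid because number -> title holds in S
    return {(n, titles[n], sec, room) for n, sec, room in t if n in titles}


def rename_in_r(r, number, new_title):
    """Retitle a course in R; return the new relation and the rows touched."""
    touched = sum(1 for row in r if row[0] == number)
    updated = {(n, new_title if n == number else title, sec, room)
               for n, title, sec, room in r}
    return updated, touched


S, T = decompose(R)
lossless = join(S, T) == R  # joining S and T gives back R

# Update: three rows change in R, one row changes in S.
_, touched_in_r = rename_in_r(R, "CSC258", "Computer Organization I")
touched_in_s = sum(1 for n, _ in S if n == "CSC258")

# Deletion: dropping all CSC434 sections from T keeps its title in S.
T_after_delete = {row for row in T if row[0] != "CSC434"}
title_survives = ("CSC434", "Database Systems") in S

# Insertion: a course without sections fits in S with no unknown values.
S_after_insert = S | {("CSC336", "Numerical Methods")}

print(lossless, touched_in_r, touched_in_s, title_survives)
```

The same update touches three rows in R and one in S, which is the redundancy the slides point at.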

Normalization Theory. Main idea: a normal form defines a condition that a well-designed database should satisfy. Normal form: a syntactic condition on the database schema, defined for a class of data dependencies. Main problems: how to test whether a database schema is in a particular normal form, and how to transform a database schema into an equivalent one satisfying a particular normal form.

Normalization Theory Today. Normalization theory for relational databases was developed in the 70s and 80s. Why do we need normalization theory today? New data models have emerged, such as XML, and XML documents can contain redundant information. Redundant information in XML documents can be discovered if the user provides semantic information, and it can then be eliminated.

Part 2: XML and Normalization. XML document: <courses> <course cno="CSC258"> <taken_by> <student sno="st1"> <name> Fox </name> <grade> B+ </grade> </student> </taken_by> </course> </courses>. A document can hold several courses: <courses> <course cno="CSC258"> … </course> <course cno="CSC434"> … </course> </courses>. DTD: courses → course*; course → @cno, taken_by; taken_by → student*; student → @sno, name, grade; name → #PCDATA; grade → #PCDATA.

XML Databases. XML schema: (D, Σ). D: courses → course*; course → @cno, taken_by; taken_by → student*; student → @sno, name, grade. Σ: two students with the same @sno value must have the same name.

Redundancy in XML. [Figure: XML tree with root courses and two course elements, @cno "CSC258" and "CSC434"; a student with @sno "st1" and name "Fox" appears under both, with grades "B+" and "A+".] As a first example, consider a university database. If we know that @sno is an identifier for students, we are storing redundant information: the student with number "st1" appears twice, and so does the name "Fox".

XML Database Normalization. Original DTD: courses → course*; course → @cno, taken_by; taken_by → student*; student → @sno, name, grade. Data dependency: two students with the same @sno value must have the same name. Normalized DTD: courses → course*, info*; course → @cno, taken_by; taken_by → student*; student → @sno, grade; info → @sno, name. Data dependency: @sno is the identifier of info elements.

A "Non-relational" Example. [Figure: DBLP tree. A conf element has a title ("ICDT") and issue children; each issue has article children; each article has an author, a title and a @year. The articles shown are by "Dong" (@year "1999") and "Jarke" (@year "1999"), plus a third article with @year "2001".]

XNF: XML Normal Form. It eliminates two types of anomalies. It was defined for XML functional dependencies: DBLP.conf.@title → DBLP.conf and DBLP.conf.issue → DBLP.conf.issue.article.@year.

Problems to Address. Functional dependencies for XML. Normal form for XML documents (XNF), which generalizes BCNF. Algorithm for normalizing XML documents. Implication problem for functional dependencies.

Framework: Paths in DTDs. Paths(D): all paths in a DTD D, for example courses.course, courses.course.@cno, courses.course.student.name and courses.course.student.name.S. We distinguish three kinds of elements: attributes (@), strings (S) and element types. FDs are defined by means of a relational representation of XML documents.

Framework: XML Trees. [Figure: XML tree with vertices v0 to v7. v0 is the courses root; v1 is a course with @cno "cs100"; v2 and v5 are its student children with @sno "123" and "456"; v3 and v4 are the name and grade children of v2, with strings "Fox" and "B+"; v6 and v7 are the name and grade children of v5, with strings "Smith" and "A-".]

Tree Tuples. Relational representation: tree tuples are mappings t : Paths(D) → Vertices ∪ Strings ∪ {⊥}. A tree tuple represents an XML tree: t(courses) = v0, t(courses.course) = v1, t(courses.course.@cno) = "cs100", t(courses.course.student) = v2, and t(p) = ⊥ for the remaining paths.

XML Tree: set of Tree Tuples. [Figure: an XML tree split into its tree tuples, one per student: one tuple through v2 ("123", "Fox", "B+") and one through v5 ("456", "Smith", "A-"), both sharing v0, v1 and @cno "cs100".]

Functional Dependencies for XML. Expressions of the form X → Y defined over a DTD D, where X, Y are finite non-empty subsets of Paths(D). An XML tree T can be tested for satisfaction of X → Y if X ∪ Y ⊆ Paths(T) ⊆ Paths(D). T ⊨ X → Y if for every pair u, v of tree tuples in T: u.X = v.X and u.X ≠ ⊥ implies u.Y = v.Y.

FD: Examples. University DTD: courses → course*; course → @cno, student*; student → @sno, name, grade. Two students with the same @sno value must have the same name: courses.course.student.@sno → courses.course.student.name.S. Every student can have at most one grade in every course: { courses.course, courses.course.student.@sno } → courses.course.student.grade.S.
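
The satisfaction test for these two dependencies can be sketched with Python's standard `xml.etree.ElementTree`. This is a simplified illustration, not the general construction: `tree_tuples` is a hypothetical helper hard-wired to the university DTD above, `None` stands in for ⊥, and the second course (`cs200`) is invented so that one student appears twice.

```python
import xml.etree.ElementTree as ET

# Sample document; the cs200 course is an assumption added for illustration.
DOC = """
<courses>
  <course cno="cs100">
    <student sno="123"><name>Fox</name><grade>B+</grade></student>
    <student sno="456"><name>Smith</name><grade>A-</grade></student>
  </course>
  <course cno="cs200">
    <student sno="123"><name>Fox</name><grade>A+</grade></student>
  </course>
</courses>
""".strip()

COURSE = "courses.course"
SNO = "courses.course.student.@sno"
NAME = "courses.course.student.name.S"
GRADE = "courses.course.student.grade.S"


def tree_tuples(root):
    """One tuple per (course, student) choice; None plays the role of ⊥."""
    tuples = []
    for course in root.findall("course"):
        for student in course.findall("student") or [None]:
            has = student is not None
            tuples.append({
                "courses": root,            # vertices are compared by identity
                COURSE: course,
                "courses.course.@cno": course.get("cno"),
                "courses.course.student": student,
                SNO: student.get("sno") if has else None,
                NAME: student.findtext("name") if has else None,
                GRADE: student.findtext("grade") if has else None,
            })
    return tuples


def satisfies(tuples, lhs, rhs):
    """T ⊨ lhs → rhs: tuples that agree on a non-null lhs agree on rhs."""
    for u in tuples:
        for v in tuples:
            same_lhs = all(u[p] is not None and u[p] == v[p] for p in lhs)
            if same_lhs and any(u[p] != v[p] for p in rhs):
                return False
    return True


tuples = tree_tuples(ET.fromstring(DOC))
print(satisfies(tuples, [SNO], [NAME]))           # same @sno, same name
print(satisfies(tuples, [COURSE, SNO], [GRADE]))  # one grade per course
print(satisfies(tuples, [SNO], [GRADE]))          # fails: "123" has two grades
```

Dropping courses.course from the left-hand side makes the grade dependency fail, which shows why the second FD needs both paths.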

Implication Problem for FD. Given a DTD D and a set of functional dependencies Σ ∪ {φ}: (D, Σ) ⊢ φ if for any XML tree T conforming to D and satisfying Σ, it is the case that T ⊨ φ. (D, Σ)+ = { φ | (D, Σ) ⊢ φ }. A functional dependency φ is trivial if it is implied by the DTD alone: (D, ∅) ⊢ φ.

XNF: XML Normal Form. XML specification: a DTD D and a set of functional dependencies Σ. A relational DB is in BCNF if for every non-trivial functional dependency X → Y in the specification, X is a key. (D, Σ) is in XNF if: for each non-trivial FD X → p.@l or X → p.S in (D, Σ)+, X → p is in (D, Σ)+.
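
For the relational half of this definition, the BCNF condition can be checked with the standard attribute-closure computation. The sketch below covers only that relational case (testing XNF would need an implication procedure for XML FDs, which is not shown); `closure` and `bcnf_violations` are names chosen for this example.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under FDs given as (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(schema, fds):
    """Non-trivial FDs X -> Y whose left-hand side X is not a superkey."""
    return [(lhs, rhs) for lhs, rhs in fds
            if not rhs <= lhs and not closure(lhs, fds) >= schema]


number_title = (frozenset({"number"}), frozenset({"title"}))

R = frozenset({"number", "title", "section", "room"})   # BAD alternative
S = frozenset({"number", "title"})                      # GOOD alternative
T = frozenset({"number", "section", "room"})            # GOOD alternative

print(bcnf_violations(R, [number_title]))  # number -> title violates BCNF in R
print(bcnf_violations(S, [number_title]))  # [] : number is a key of S
print(bcnf_violations(T, []))              # [] : T has no dependencies
```

Checking only the listed FDs is enough to decide whether the schema as a whole is in BCNF; checking a projected sub-schema would require the projected dependencies as well.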

Back to DBLP. DBLP is not in XNF: DBLP.conf.issue → DBLP.conf.issue.article.@year ∈ (D,Σ)+, but DBLP.conf.issue → DBLP.conf.issue.article ∉ (D,Σ)+. The proposed solution is in XNF.

Normalization Algorithm. The algorithm applies two transformations until the schema is in XNF. If there is an anomalous FD of the form DBLP.conf.issue → DBLP.conf.issue.article.@year, then apply the "DBLP example rule". Otherwise, choose a minimal anomalous FD and apply the "University example rule".

Normalizing XML Documents. Theorem: The decomposition algorithm terminates and outputs a specification in XNF. Furthermore, it does not lose information. [Figure: queries Q1 and Q2 translate between the unnormalized XML document and the normalized XML document.] Q1, Q2 are XQuery core queries.

Part 3: What was Missing? Justification! What is a good database design? Well-known solutions: BCNF, 4NF, … But what is it that makes a database design good? Elimination of update anomalies. Existence of algorithms that produce good designs: lossless decomposition, dependency preservation. Previous work was specific to the relational model. Classical problems have to be revisited in the XML context.

Justification of Normal Forms. It is problematic to evaluate XML normal forms. No XML update language has been standardized. No XML query language yet has the same "yardstick" status as relational algebra. We do not even know if implication of XML FDs is decidable! We need a different approach. It must be based on some intrinsic characteristics of the data. It must be applicable to new data models. It must be independent of query/update/constraint issues. Our approach is based on information theory.

Information Theory. Entropy measures the amount of information provided by a certain event. Assume that an event can have n different outcomes with probabilities p1, …, pn. Amount of information gained by knowing that outcome i occurred: $\log \frac{1}{p_i}$. Average amount of information gained (entropy): $\sum_{i=1}^{n} p_i \log \frac{1}{p_i}$. Entropy is maximal if each pi = 1/n, in which case it equals $\log n$.
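
These definitions take a few lines of Python. The sketch assumes base-2 logarithms, which matches the numeric values quoted on the later slides (log 5 ≈ 2.322); the function names are chosen for this example.

```python
import math


def information(p):
    """Information gained by learning that an outcome of probability p occurred."""
    return math.log2(1 / p)


def entropy(probs):
    """Average information; outcomes of probability 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


uniform5 = entropy([1 / 5] * 5)          # maximal for five outcomes: log2(5)
certain = entropy([1.0, 0.0, 0.0])       # a certain outcome carries no information
skewed = entropy([0.16] + [0.14] * 6)    # slightly below log2(7)

print(round(uniform5, 3), certain, round(skewed, 4))
```

The uniform case reaches log n, and any deviation from uniformity, as in the last distribution, lowers the entropy.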

Entropy and Redundancies. Database schema: R(A,B,C), A → B. Instance I: {(1, 2, 3), (1, 2, 4)}. First, take the position holding the value 3 in column C and pick a domain properly containing adom(I): {1, …, 6}. Probability distribution: P(4) = 0 and P(a) = 1/5, a ≠ 4. Entropy: log 5 ≈ 2.322. Next, take a position holding the value 2 in column B, with the same domain {1, …, 6}. Probability distribution: P(2) = 1 and P(a) = 0, a ≠ 2. Entropy: log 1 = 0.
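
A small sketch reproduces both numbers. It rests on assumptions reconstructed from the slide rather than stated on it: the instance is {(1, 2, 3), (1, 2, 4)}, a value is allowed at a position if the modified rows stay pairwise distinct and still satisfy A → B, and the allowed values are equally likely. The function names are hypothetical.

```python
import math


def valid(rows):
    """Rows of R(A, B, C) are pairwise distinct and satisfy A -> B."""
    if len(set(rows)) != len(rows):
        return False
    b_of = {}
    return all(b_of.setdefault(a, b) == b for a, b, _ in rows)


def simple_entropy(rows, row, col, domain):
    """Entropy of the uniform distribution over the values allowed at (row, col)."""
    allowed = 0
    for value in domain:
        candidate = [list(r) for r in rows]
        candidate[row][col] = value
        allowed += valid([tuple(r) for r in candidate])
    return math.log2(allowed)


I = [(1, 2, 3), (1, 2, 4)]
domain = range(1, 7)                      # {1, ..., 6}

c_entropy = simple_entropy(I, 0, 2, domain)   # position holding 3 in column C
b_entropy = simple_entropy(I, 0, 1, domain)   # position holding 2 in column B

print(round(c_entropy, 3), b_entropy)
```

The C position admits five values (4 is excluded because it would duplicate the second row), while the B position is forced to 2 by the dependency, so it carries no information.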

Entropy and Normal Forms. Let Σ be a set of FDs over a schema S. Theorem: (S,Σ) is in BCNF if and only if for every instance I of (S,Σ) and for every domain properly containing adom(I), each position carries a non-zero amount of information (entropy > 0). A similar result holds for 4NF and MVDs. This is a clean characterization of BCNF and 4NF, but the measure is not accurate enough ...

Problems with the Measure. The measure cannot distinguish between different types of data dependencies. It cannot distinguish between different instances of the same schema: for R(A,B,C), A → B, the instances {(1, 2, 3), (1, 2, 4)} and {(1, 2, 3), (1, 2, 4), (1, 2, 5)} both give entropy = 0.

A General Measure. Instance I of schema R(A,B,C), A → B: {(1, 2, 3), (1, 2, 4)}. Initial setting: pick a position p ∈ Pos(I) (in this example, a position holding the value 2 in column B) and pick k such that adom(I) ⊆ {1, …, k}. For example, k = 7. Computation: for every X ⊆ Pos(I) - {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}, by keeping the values at the positions in X and counting the ways to fill the remaining positions. For one such X: P(2 | X) = 48 / (48 + 6 × 42) = 0.16 and, for a ≠ 2, P(a | X) = 42 / (48 + 6 × 42) = 0.14, so Entropy ≈ 2.8057 (log 7 ≈ 2.8074). Value: we consider the average over all sets X ⊆ Pos(I) - {p}. Average: 2.4558 < log 7 (maximal entropy). It corresponds to conditional entropy. It depends on the value of k ...
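
The numbers on this slide can be reproduced by brute force. The sketch below is a reconstruction under stated assumptions, not the authors' code: p is the B position of the first row, completions must keep the two rows distinct and satisfy A → B, and the example set X keeps the C value of the first row and the A and B values of the second. All names are made up for this illustration.

```python
import math
from itertools import combinations, product

ROWS = [(1, 2, 3), (1, 2, 4)]                       # instance I of R(A, B, C)
POSITIONS = [(r, c) for r in range(2) for c in range(3)]


def valid(rows):
    """Rows are pairwise distinct and satisfy A -> B."""
    if len(set(rows)) != len(rows):
        return False
    b_of = {}
    return all(b_of.setdefault(a, b) == b for a, b, _ in rows)


def conditional_distribution(p, kept, k):
    """P(a | X): share of valid completions that put value a at position p."""
    free = [q for q in POSITIONS if q != p and q not in kept]
    counts = []
    for a in range(1, k + 1):
        n = 0
        for filling in product(range(1, k + 1), repeat=len(free)):
            grid = [list(row) for row in ROWS]      # kept positions keep their values
            grid[p[0]][p[1]] = a
            for (r, c), value in zip(free, filling):
                grid[r][c] = value
            n += valid([tuple(row) for row in grid])
        counts.append(n)
    total = sum(counts)
    return [n / total for n in counts]


def entropy(probs):
    return sum(q * math.log2(1 / q) for q in probs if q > 0)


def average_entropy(p, k):
    """Average of the entropy of P(. | X) over all X within Pos(I) - {p}."""
    others = [q for q in POSITIONS if q != p]
    subsets = [set(c) for n in range(len(others) + 1)
               for c in combinations(others, n)]
    return sum(entropy(conditional_distribution(p, x, k)) for x in subsets) / len(subsets)


p = (0, 1)                                          # B value of the first row
dist = conditional_distribution(p, {(0, 2), (1, 0), (1, 1)}, 7)
avg = average_entropy(p, 7)
print(round(dist[1], 2), round(dist[0], 2), round(entropy(dist), 4), round(avg, 4))
```

With k = 7 the example set X yields the 48 and 42 completions quoted above, and averaging over all 32 subsets gives the slide's 2.4558.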

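A General Measure. Previous value: $\mathrm{INF}^k_I(p \mid \Sigma)$. For each k, we consider the ratio $\mathrm{INF}^k_I(p \mid \Sigma) / \log k$: how close the given position p is to having the maximum possible information content. General measure: $\mathrm{INF}_I(p \mid \Sigma) = \lim_{k \to \infty} \mathrm{INF}^k_I(p \mid \Sigma) / \log k$.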

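Basic Properties. The measure is well defined: for every set of first-order constraints Σ defined over a schema S, every I ∈ inst(S,Σ), and every p ∈ Pos(I), the limit $\mathrm{INF}_I(p \mid \Sigma) = \lim_{k \to \infty} \mathrm{INF}^k_I(p \mid \Sigma) / \log k$ exists. Bounds: $0 \le \mathrm{INF}_I(p \mid \Sigma) \le 1$.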

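Basic Properties. The measure does not depend on a particular representation of constraints. If Σ1 and Σ2 are equivalent: $\mathrm{INF}_I(p \mid \Sigma_1) = \mathrm{INF}_I(p \mid \Sigma_2)$. It overcomes the limitations of the simple measure: for R(A,B,C), A → B, the measure of a position in column B is 0.875 in the instance {(1, 2, 3), (1, 2, 4)} and 0.781 in the instance {(1, 2, 3), (1, 2, 4), (1, 2, 5)}.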

Well-Designed Databases. Definition: A database specification (S,Σ) is well-designed if for every I ∈ inst(S,Σ) and every p ∈ Pos(I), $\mathrm{INF}_I(p \mid \Sigma) = 1$. In other words, every position in every instance carries the maximum possible amount of information. We would like to test this definition in the relational world ...

Relational Databases. Σ is a set of data dependencies over a schema S. Σ = ∅: (S,Σ) is well-designed. Σ is a set of FDs: (S,Σ) is well-designed if and only if (S,Σ) is in BCNF. Σ is a set of FDs and MVDs: (S,Σ) is well-designed if and only if (S,Σ) is in 4NF. Σ is a set of FDs and JDs: if (S,Σ) is in PJ/NF or in 5NFR, then (S,Σ) is well-designed. The converse is not true. A syntactic characterization of being well-designed is given in [AL03].

Relational Databases. The problem of verifying whether a relational schema is well-designed is undecidable. If the schema contains only universal constraints (FDs, MVDs, JDs, …), then the problem becomes decidable. Now we would like to apply our definition in the XML world ...

XML Databases. XML schema: (D,Σ). D is a DTD. Σ is a set of data dependencies over D. We would like to evaluate XML normal forms. The notion of being well-designed extends from relations to XML. The measure is robust; we just need to define the set of positions in an XML tree T: Pos(T).

Positions in an XML Tree. [Figure: the DBLP tree with its positions highlighted: the attribute and string values "ICDT", "Dong", "Jarke", the elided article titles ". . .", and the years "1999", "1999" and "2001".]

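Well-Designed XML Data. We consider k such that adom(T) ⊆ {1, …, k}. For each k: $\mathrm{INF}^k_T(p \mid \Sigma)$. We consider the ratio $\mathrm{INF}^k_T(p \mid \Sigma) / \log k$. General measure: $\mathrm{INF}_T(p \mid \Sigma) = \lim_{k \to \infty} \mathrm{INF}^k_T(p \mid \Sigma) / \log k$.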

XNF: XML Normal Form. For arbitrary XML data dependencies: Definition: An XML specification (D,Σ) is well-designed if for every T ∈ inst(D,Σ) and every p ∈ Pos(T), $\mathrm{INF}_T(p \mid \Sigma) = 1$. For functional dependencies: Theorem: An XML specification (D,Σ) is in XNF if and only if (D,Σ) is well-designed.

Normalization Algorithms. The information-theoretic measure can also be used for reasoning about normalization algorithms. For BCNF and XNF decomposition algorithms: Theorem: After each step of these decomposition algorithms, the amount of information in each position does not decrease.

Future Work. We would like to consider more complex XML constraints and characterize the good designs they give rise to. We would like to characterize 3NF by using the measure developed in this paper. In general, we would like to characterize "non-perfect" normal forms. We would like to develop better characterizations of normalization algorithms using our measure. Why is the "usual" BCNF decomposition algorithm good? Why does it always stop?