1
Managing XML and Semistructured Data
Lecture 10: Schema
Prof. Dan Suciu
Spring 2001
2
In this lecture
Schema for unordered data
  conformance and classification using simulation
  upper bound and lower bound schema
Resources
  MSL: A Model for W3C XML Schema, by Brown, Fuchs, Robie, Wadler, in WWW10, 2001.
  Subsumption for XML Types, by Kuper and Simeon, ICDT'2001.
  Data on the Web, by Abiteboul, Buneman, Suciu: chapter 7.
3
Schemas for Relational vs. SS Data
Schemas in relational data:
  Defined before the data
  Strictly enforced
Schemas in SS (semistructured) data:
  Defined after the data
  May not be enforced (some data doesn’t have a schema)
4
When the Schema is Created
Created by the user
  Before or after the data
  Does not need to be unique
Extracted from the data
  Unheard of in relational databases
Extracted from the query
  Like type inference in programming languages
Different schema formalisms for each task!
5
Schemas for Unordered Data
OEM
Cycles
Schema itself will be a graph
6
Schemas: An Example
Some database:
[Figure: OEM data graph rooted at &r1. Company objects &c1, &c2 with name, address and url edges (values “Widget”, “Trenton”, “Gadget”, “Paris”); person objects &p1, &p2, &p3 with name, position and phone edges (values “Smith”, “Manager”, “Jones”, “Dupont”, “Sales”); relationship edges employee, manages, c.e.o., works-for; and a description subtree &a1 to &a7 with edges procurement, salesrep, contact, task, eval, 1997, 1998 and values “on target”, “below target”.]
7
Graph Schemas
Upper Bound Schema
[Figure: schema graph with classes including Root, Employee, string and Any. Edge labels: person, company, works-for, managed-by, c.e.o. | employee, name | address | url, name | phone | position, description, and * on Any.]
8
The Two Questions to Ask
Conformance: does that data conform to this schema?
Classification: if so, then which objects belong to what classes?
9
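Graph Simulation
Definition. Given two edge-labeled graphs G1, G2, a simulation is a relation R between their nodes such that:
  if (x1, x2) ∈ R and (x1, a, y1) is an edge in G1,
  then there exists an edge (x2, a, y2) in G2 (same label) s.t. (y1, y2) ∈ R.
[Figure: edge x1 to y1 labeled a in G1, edge x2 to y2 labeled a in G2, with R relating x1 to x2 and y1 to y2.]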
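The definition can be checked mechanically. The following is a minimal Python sketch, assuming each graph is given as a set of (source, label, target) triples and R as a set of node pairs; the function name `is_simulation` and the sample graphs are illustrative and not part of the lecture.

```python
def is_simulation(R, G1, G2):
    """Return True if R (a set of (x1, x2) pairs) satisfies the
    simulation condition between edge sets G1 and G2.
    Edges are (source, label, target) triples (an assumed encoding)."""
    for (x1, x2) in R:
        for (src, a, y1) in G1:
            if src != x1:
                continue
            # Need an edge (x2, a, y2) in G2 with (y1, y2) in R.
            if not any(s == x2 and b == a and (y1, y2) in R
                       for (s, b, y2) in G2):
                return False
    return True


# Hypothetical example graphs.
G1 = {("r", "person", "p1"), ("p1", "name", "n1")}
G2 = {("Root", "person", "Person"),
      ("Person", "name", "Str"),
      ("Person", "age", "Int")}

R_good = {("r", "Root"), ("p1", "Person"), ("n1", "Str")}
R_bad = {("r", "Root"), ("p1", "Person")}  # (n1, Str) is missing

print(is_simulation(R_good, G1, G2))
print(is_simulation(R_bad, G1, G2))
```

Note that the empty relation is always a simulation, which is why the interesting object is the maximal one.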
10
Using Simulation
Data graph D, schema S
conformance: find the maximal simulation R from D to S
  Notation: D ≼ S
classification: check if (x, c) ∈ R
  Notation: x ≼ c
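One way to compute the maximal simulation is to start from all (data node, class) pairs and delete pairs that violate the simulation condition until nothing changes. The sketch below assumes plain labels on schema edges (no predicates), edges encoded as (source, label, target) triples, and that conformance means the two roots are related; these assumptions and all names (`maximal_simulation`, `classes_of`, the sample D and S) are illustrative.

```python
def maximal_simulation(D, S):
    """Largest simulation from data graph D to schema graph S.
    Both are sets of (source, label, target) triples (assumed encoding)."""
    d_nodes = {n for (src, _, dst) in D for n in (src, dst)}
    s_nodes = {n for (src, _, dst) in S for n in (src, dst)}
    # Start with every pair and refine downward.
    R = {(x, c) for x in d_nodes for c in s_nodes}
    changed = True
    while changed:
        changed = False
        for (x, c) in list(R):
            for (src, a, y) in D:
                if src != x:
                    continue
                # The edge (x, a, y) must be matched by some (c, a, c2).
                if not any(s == c and b == a and (y, c2) in R
                           for (s, b, c2) in S):
                    R.discard((x, c))
                    changed = True
                    break
    return R


def classes_of(R, x):
    """Classification: all classes c with (x, c) in R."""
    return {c for (x2, c) in R if x2 == x}


# Hypothetical data graph and schema.
D = {("r", "person", "p1"), ("r", "company", "c1"),
     ("p1", "name", "s0"), ("p1", "phone", "s1"),
     ("c1", "name", "s2")}
S = {("Root", "person", "Employee"), ("Root", "company", "Company"),
     ("Employee", "name", "Str"), ("Employee", "phone", "Str"),
     ("Company", "name", "Str"), ("Company", "address", "Str")}

R = maximal_simulation(D, S)
# Assumption: D conforms to S when the roots are related.
print(("r", "Root") in R)
print(classes_of(R, "p1"))
print(classes_of(R, "c1"))
```

In this example &c1 only has a name edge, so it is classified under both Company and Employee: the schema only bounds the edges from above.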
11
Example
[Figure: two panels. “Database”: the example data graph rooted at &r1 (companies &c1, &c2; persons &p1, &p2, &p3; string values &s0 to &s10; description subtree &a1 to &a7). “Graph Schema”: the upper bound schema with classes Root, Company, Employee, string and Any, and edge labels person, company, works-for, managed-by, c.e.o. | employee, name | address | url, name | phone | position, description, *.]
12
Formally
Graph schema S is a graph, s.t.:
  Nodes are called classes
  Edges are labeled with unary predicates, p(x)
Examples:
  person = “x = person”
  name | address = “x = name ∨ x = address”
  * = “true”
  int = “x ∈ int”
  ¬name = “x ≠ name”
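Predicates on schema edges can be modeled as Boolean functions on labels. The sketch below is one possible encoding in Python; the helper names (`is_label`, `not_label`, `any_label`, `targets`) and the small schema are hypothetical, chosen only to mirror the example predicates.

```python
def is_label(*names):
    """Predicate for 'name | address'-style alternatives."""
    return lambda x: x in names


def not_label(*names):
    """Predicate accepting any label except the given ones."""
    return lambda x: x not in names


def any_label(x):
    """Predicate for '*': always true."""
    return True


# Schema edges as (class, predicate, class) triples (assumed encoding).
schema = [
    ("Root", is_label("person"), "Person"),
    ("Person", is_label("name", "address"), "Str"),
    ("Person", is_label("description"), "Any"),
    ("Any", any_label, "Any"),
]


def targets(schema, cls, label):
    """Classes reachable from cls by a schema edge whose predicate
    accepts the given data label."""
    return {dst for (src, pred, dst) in schema
            if src == cls and pred(label)}


print(targets(schema, "Person", "name"))
print(targets(schema, "Any", "eval"))
print(targets(schema, "Person", "phone"))
```

With this encoding, matching a data edge against the schema replaces the label equality test by a call to the edge's predicate.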
13
Examples of Graph Schemas (What Do They Mean ?)
[Figure: five schema graphs S1 to S5 built from the edge labels person, name, age, description, part, subpart, price and *, with leaf classes string and int.]
S1 = there are persons; each can have several names and several ages, of types string and int respectively.
S2 = same as S1, plus persons can have description edges, under which we can have any edge with any label, except name and age.
S3 = describes a hierarchy of parts and subparts, arbitrarily deep. Leaf subparts (or parts) may have names and prices.
S4 = describes ANY database: the “universal schema”, ST.
S5 = persons may have name and age of types string and int, and under description there may be any structure.
14
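[Figure: a data graph D with person objects carrying name and age edges (values “Smith”, 55) and a schema S with person, name (string) and age (int) edges, illustrating that D conforms to S.]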
15
Any database conforms to ST!
“Universal schema”
[Figure: the schema ST, with a single edge predicate *.]
16
Schemas in SS Data vs. Relational Data
Relational data
  Each data instance has exactly one schema
Semistructured data
  One data instance has several schemas
17
The Classification Problem
[Figure: a data graph D with person objects carrying name and phone edges (values “John”, 1234, “Mary”), next to a schema with more than one person edge, whose leaves are of class string.]
Schema is nondeterministic: creates ambiguous classifications.
18
The Classification Problem
Definition. A schema S is deterministic if for every class c and every label a, there is at most one outgoing edge labeled a from c.
Fact: if S is deterministic and D is a tree, then each node is uniquely classified.
(When D is not a tree, this is not true.)
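The determinism condition is a simple count over (class, label) pairs. A minimal Python sketch, assuming schema edges carry plain labels and are encoded as (class, label, class) triples; `is_deterministic` and both sample schemas are illustrative names.

```python
from collections import Counter


def is_deterministic(S):
    """True if no class has two outgoing edges with the same label.
    S is a set of (class, label, class) triples (assumed encoding)."""
    counts = Counter((src, label) for (src, label, _) in S)
    return all(n <= 1 for n in counts.values())


# Hypothetical schemas: S_nd has two 'person' edges out of Root.
S_nd = {("Root", "person", "P1"), ("Root", "person", "P2"),
        ("P1", "name", "Str"), ("P1", "phone", "Str"),
        ("P2", "name", "Str")}
S_d = {("Root", "person", "P"),
       ("P", "name", "Str"), ("P", "phone", "Str")}

print(is_deterministic(S_nd))
print(is_deterministic(S_d))
```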
19
Deterministic Upper Bound Schemas
Given a schema S, we can always construct a deterministic approximation, Sd.
[Figure: S with two person edges, each followed by name and phone edges leading to string; Sd with a single person edge followed by name and phone edges leading to string.]
In general, Sd is obtained by the powerset construction, which is expensive.
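The powerset construction can be sketched in the same way as determinizing a finite automaton: each class of Sd is a set of classes of S. The Python sketch below assumes plain labels on edges and a known root class; `determinize` and the sample schema are hypothetical names, and the number of resulting classes can be exponential in the worst case.

```python
def determinize(S, root):
    """Subset (powerset) construction on a schema graph.
    S is a set of (class, label, class) triples (assumed encoding).
    Returns (edges of Sd, root of Sd); classes of Sd are frozensets."""
    start = frozenset([root])
    edges = set()
    seen = {start}
    todo = [start]
    while todo:
        cur = todo.pop()
        # Group the targets of all member classes by label.
        by_label = {}
        for (src, a, dst) in S:
            if src in cur:
                by_label.setdefault(a, set()).add(dst)
        for a, dsts in by_label.items():
            nxt = frozenset(dsts)
            edges.add((cur, a, nxt))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return edges, start


# Hypothetical nondeterministic schema with two 'person' edges.
S = {("Root", "person", "P1"), ("Root", "person", "P2"),
     ("P1", "name", "Str"), ("P1", "phone", "Str"),
     ("P2", "name", "Str")}

Sd, sd_root = determinize(S, "Root")
for edge in sorted(Sd, key=lambda e: (sorted(e[0]), e[1])):
    print(edge)
```

Here the two person classes collapse into one merged class that allows both name and phone, so Sd accepts at least everything S accepts.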
20
Lower Bound Schemas
Introduced (under a different name and formalism) in:
  Nestorov, Abiteboul, Motwani, Extracting Schema from Semistructured Data, SIGMOD 98
Goal: extract some “schema” from the data
We will see later why these are “lower bound schemas”
21
Lower Bound Schemas
Schema = datalog program with a special form
Classes = predicates
Company(x) :- link(x,"c.e.o.",y), Employee(y), link(x,"name",z), String(z), link(x,"address",u), String(u)
Employee(x) :- link(x,"works-for",y), Company(y), link(x,"managed-by",z), Employee(z), link(x,"name",u), String(u), link(x,"address",v), String(v)
Root(x) :- link(x,"company",y), Company(y), link(x,"person",z), Employee(z)
Maximal fixpoint, rather than minimal fixpoint! (next)
22
Datalog Maximal Fixpoint
Standard datalog semantics. Transform program to:
Company(x) ⇐ ∃y.∃z.∃u.(link(x,"c.e.o.",y), Employee(y), link(x,"name",z), String(z), link(x,"address",u), String(u))
Employee(x) ⇐ ∃y.∃z.∃u.∃v.(link(x,"works-for",y), Company(y), link(x,"managed-by",z), Employee(z), link(x,"name",u), String(u), link(x,"address",v), String(v))
Root(x) ⇐ ∃y.∃z.(link(x,"company",y), Company(y), link(x,"person",z), Employee(z))
Compute the minimal model. What is it? The empty set.
23
Datalog Minimal Fixpoint
Answer: under standard datalog semantics, the minimal model of the transformed program is the empty set!
Instead, compute the maximal model.
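The difference between the two fixpoints can be seen by iterating the rules from the empty interpretation (minimal) and from the full one (maximal). The following Python sketch hard-codes the three rules as lists of (label, class) requirements; the encoding, the function names (`step`, `fixpoint`) and the tiny database are assumptions made for illustration.

```python
# Each class requires, for every (label, target) pair, an outgoing edge
# with that label to a node in the target class.
RULES = {
    "Company": [("c.e.o.", "Employee"), ("name", "String"),
                ("address", "String")],
    "Employee": [("works-for", "Company"), ("managed-by", "Employee"),
                 ("name", "String"), ("address", "String")],
    "Root": [("company", "Company"), ("person", "Employee")],
}


def step(interp, links, strings):
    """One application of the rules to the interpretation interp."""
    def holds(cls, y):
        return y in strings if cls == "String" else y in interp[cls]

    nodes = {s for (s, _, _) in links}
    return {
        cls: {x for x in nodes
              if all(any(s == x and a == label and holds(target, y)
                         for (s, a, y) in links)
                     for (label, target) in body)}
        for cls, body in RULES.items()
    }


def fixpoint(start, links, strings):
    """Iterate step() from start until nothing changes."""
    interp = start
    while True:
        nxt = step(interp, links, strings)
        if nxt == interp:
            return interp
        interp = nxt


# Hypothetical database: p2 has no address, so it is not an Employee.
links = {
    ("r", "company", "c1"), ("r", "person", "p1"), ("r", "person", "p2"),
    ("c1", "c.e.o.", "p1"), ("c1", "name", "s1"), ("c1", "address", "s2"),
    ("p1", "works-for", "c1"), ("p1", "managed-by", "p1"),
    ("p1", "name", "s3"), ("p1", "address", "s4"),
    ("p2", "works-for", "c1"), ("p2", "managed-by", "p1"),
    ("p2", "name", "s5"),
}
strings = {"s1", "s2", "s3", "s4", "s5"}
all_nodes = {s for (s, _, _) in links}

minimal = fixpoint({c: set() for c in RULES}, links, strings)
maximal = fixpoint({c: set(all_nodes) for c in RULES}, links, strings)
print(minimal)
print(maximal)
```

Starting from the empty sets nothing can ever be derived, because Company needs an Employee and Employee needs a Company; starting from all nodes and shrinking keeps exactly the objects that have all required edges.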
24
Lower-Bound Schemas
Equivalent representation of schema:
[Figure: schema graph with classes Root, Company, Employee and string, and edge labels person, company, works-for, managed-by, c.e.o., name, address.]
25
Simulation Strikes Back
[Figure: two panels. “Lower Bound”: the lower bound schema with classes Root, Company, Employee and string, and edge labels person, company, works-for, managed-by, c.e.o., name, address. “Database”: the example data graph (companies &c1, &c2; persons &p1, &p2, &p3; string values &s0 to &s10; description subtree &a1 to &a7).]
A model of program P is precisely a simulation (and vice versa)
26
Summary: Lower vs. Upper Bound Schemas
Upper bound schemas
  Tell us what edges are allowed
Lower bound schemas
  Tell us what edges are required