IS432: Semi-Structured Data Dr. Azeddine Chikh. 1. Semi Structured Data Object Exchange Model.


1 IS432: Semi-Structured Data Dr. Azeddine Chikh

2 1. Semi Structured Data Object Exchange Model

3 Introduction: From a database perspective, the Web has generated an enormous demand for recently developed database-integration architectures such as data warehouses and mediation systems. It has also led to the development of the semistructured data model, with languages adapted to this model.

4 Introduction: The emergence of XML as a standard for data representation on the Web is expected to greatly facilitate the publication of electronic data by providing a simple syntax that is both human- and machine-readable.

5 Introduction: Although the document and database viewpoints were, until quite recently, irreconcilable, the technologies are now converging, driven by the development of XML for data on the Web and the closely related development of semistructured data in the database community.

6 Unstructured Data: Data can be of any type and does not necessarily follow any format or sequence. It does not follow any rules and is not predictable. Examples include text, video, sound, and images.

7 Structured Data: Data is organized in semantic chunks (entities). Similar entities are grouped together (relations or classes). Entities in the same group have the same descriptions (attributes). The descriptions for all entities in a group (the schema) have the same defined format, have a predefined length, are all present, and follow the same order.

8 Semi-Structured Data: The idea predates XML but not HTML. Data is available electronically in database systems, in file systems (e.g., bibliographic data, Web data), and in data exchange formats (e.g., EDI, scientific data). Semi-structured data is an attempt to reconcile the database and document "worlds". It is organized in semantic entities, and similar entities are grouped together. However, entities in the same group may not have the same attributes; the order of attributes is not necessarily important; not all attributes may be required; and the size and type of the same attribute may differ within a group.

9 Example of Semi-Structured Data:
   name: Azeddine CHIKH
   email: az_chikh@ksu.edu.sa, az_chikh@hotmail.com

   name:
      first name: Mourad
      last name: Benchikh
   email: m_benchikh@ksu.edu.sa

   name: Ashraf Youcef
   affiliation: IS Department
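To make the irregularity concrete, the three records on the slide can be written as plain Python dictionaries. This is only an illustrative sketch: the choice of dicts and lists, and the helper name `attribute_profile`, are assumptions and not part of the original slides.

```python
# The three records from the slide as nested Python dicts (an assumed encoding).
people = [
    {"name": "Azeddine CHIKH",
     "email": ["az_chikh@ksu.edu.sa", "az_chikh@hotmail.com"]},
    {"name": {"first name": "Mourad", "last name": "Benchikh"},
     "email": "m_benchikh@ksu.edu.sa"},
    {"name": "Ashraf Youcef", "affiliation": "IS Department"},
]


def attribute_profile(records):
    """Map each attribute name to the set of Python type names it takes."""
    profile = {}
    for record in records:
        for key, value in record.items():
            profile.setdefault(key, set()).add(type(value).__name__)
    return profile


profile = attribute_profile(people)
# "name" is a string in two records and a nested record in one;
# "email" is missing from the third record; "affiliation" appears only once.
for key, types in sorted(profile.items()):
    print(key, sorted(types))
```

The profile shows the properties listed on the previous slide: the same attribute has different types and sizes, and not every attribute is present in every entity.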

10 Semi-Structured Data Models: Based on labelled graphs rather than labelled trees. Used for data exchange among, and integration of, heterogeneous data sources. Schema information is in the edge labels, so the data is sometimes called schemaless or self-describing. Data is stored at the leaves.

11 Graph Terminology (1): A (directed) graph G = (N, E) consists of a set N of nodes and a set E of edges. Each edge in E is an (ordered) pair of nodes (x, y), where x is the source and y is the target. A path from x1 to xn is a sequence of edges (x1, x2), (x2, x3), ..., (xn-1, xn). The length of a path is the number of edges in it. A node r is a root for graph G if there is a path from r to every other node in G. A cycle is a path from a node to itself. A graph with no cycles is called acyclic.
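These definitions translate almost directly into code. The following is a minimal sketch, assuming a graph is given as a set of node names and a set of (source, target) pairs; the function names are illustrative and do not come from the slides.

```python
from collections import deque


def is_path(path, edges):
    """A path is a sequence of edges in E where each target is the next source."""
    if not path or any(edge not in edges for edge in path):
        return False
    return all(path[i][1] == path[i + 1][0] for i in range(len(path) - 1))


def reachable(edges, start):
    """Nodes that can be reached from start by following edges (start included)."""
    successors = {}
    for x, y in edges:
        successors.setdefault(x, []).append(y)
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in successors.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def roots(nodes, edges):
    """r is a root if there is a path from r to every other node."""
    return {r for r in nodes if reachable(edges, r) == set(nodes)}


def is_acyclic(nodes, edges):
    """Kahn's algorithm: a graph is acyclic iff all nodes can be peeled off."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for x, y in edges:
        successors[x].append(y)
        indegree[y] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return removed == len(nodes)


nodes = {"a", "b", "c"}
chain = {("a", "b"), ("b", "c")}          # a -> b -> c
loop = chain | {("c", "a")}               # adds a cycle a -> b -> c -> a
path = [("a", "b"), ("b", "c")]           # a path of length 2
```

In the chain, only `a` is a root; once the edge from `c` back to `a` is added, every node is a root and the graph is no longer acyclic.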

12 Graph Terminology (2): A graph is rooted if it has a single root. A tree is a rooted graph G in which there is a unique path from the root to every other node in G. A node is a leaf if it is not the source of any edge. Graphs can have node labels and/or edge labels. In an edge-labelled graph G = (N, E, FE), FE is an edge labelling function that maps each edge to a label. In a node-labelled graph G = (N, E, FN), FN is a node labelling function that maps each node to a label.
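A short sketch of these notions, assuming edges are (source, target) pairs and the edge labelling function FE is stored as a dictionary from edge to label. The node names and the `is_tree` test are illustrative assumptions. The test relies on the observation that, when every node is reachable from the root, paths from the root are unique exactly when the root has no incoming edge and every other node has exactly one.

```python
def leaves(nodes, edges):
    """A node is a leaf if it is not the source of any edge."""
    return set(nodes) - {x for x, _ in edges}


def is_tree(nodes, edges, root):
    """Check reachability from root plus the in-degree condition described above."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for x, y in edges:
        indegree[y] += 1
        successors[x].append(y)
    seen, stack = {root}, [root]
    while stack:
        for nxt in successors[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    if seen != set(nodes):
        return False
    return indegree[root] == 0 and all(
        indegree[n] == 1 for n in nodes if n != root
    )


# Hypothetical edge-labelled tree: r -book-> b, b -author-> a, b -title-> t.
nodes = {"r", "b", "a", "t"}
edges = [("r", "b"), ("b", "a"), ("b", "t")]
FE = {("r", "b"): "book", ("b", "a"): "author", ("b", "t"): "title"}

# Adding a second edge into "a" gives two paths from the root to "a".
shared = edges + [("r", "a")]
```

Storing FE as a dictionary keyed by (source, target) is a simplification: it cannot represent two parallel edges with different labels between the same pair of nodes.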

13 Object Exchange Model (OEM): The original OEM used only node labels; we use a variant in which the edges are labelled. An OEM data graph is a rooted, labelled, directed graph. Its edge labels map to strings. Only its leaf nodes have labels, which map to data values. There is no ordering on the edges leaving a node.

14 OEM Syntax: The example may be written as { book: { author: "Coetzee", title: "Disgrace", year: 1999 } }. These are simple label-value pairs. Labels can be repeated, e.g., for multiple authors. This is a serialization syntax for the graph. What about graphs that are not trees? Introduce object identifiers (oids) for nodes.
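As a rough illustration of the serialization idea, a tree-shaped OEM object can be held as a list of (label, value) pairs, which allows repeated labels, and printed in the brace syntax shown on the slide. The function `to_oem` and the list-of-pairs encoding are assumptions made for this sketch, not part of any OEM implementation.

```python
def to_oem(value):
    """Serialize a nested list of (label, value) pairs in the slide's brace syntax."""
    if isinstance(value, list):            # complex object: labelled children
        inner = ", ".join(f"{label}: {to_oem(child)}" for label, child in value)
        return "{ " + inner + " }"
    if isinstance(value, str):             # atomic string value
        return '"' + value + '"'
    return str(value)                      # atomic number


book = [("book", [("author", "Coetzee"),
                  ("title", "Disgrace"),
                  ("year", 1999)])]
text = to_oem(book)
print(text)

# Labels can be repeated, e.g. two author edges leaving the same node.
two_authors = [("book", [("author", "A"), ("author", "B")])]
print(to_oem(two_authors))
```

A list of pairs is used instead of a dictionary precisely because a dictionary could not hold two `author` entries. This encoding only covers trees; shared nodes need object identifiers.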

15 Example of OEM Data Graph (1) [figure not reproduced in transcript]

16 Example of OEM Data Graph (2) [figure not reproduced in transcript]

17 Example of OEM Syntax:
bib: &1 {
  paper: &2 {... },
  book: &3 {... },
  paper: &4 {
    author: &10 { firstname: &15 "Serge", lastname: &16 "Abiteboul" },
    author: &11 {... },
    title: &12 {... },
    pages: &13 { first: &17 122, last: &18 133 },
    references: &2,
    references: &3
  }
}
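The role of the oids can be sketched in code. Below, the objects from the slide are stored in a dictionary keyed by oid. A node is written out in full the first time it is reached and as a bare `&oid` reference afterwards. The dictionary encoding, the `ELIDED` marker, and the function names are assumptions for illustration; the parts shown as "..." on the slide are kept elided rather than filled in.

```python
ELIDED = None  # marks objects whose contents appear as "..." on the slide

objects = {
    1: [("paper", 2), ("book", 3), ("paper", 4)],
    2: ELIDED,
    3: ELIDED,
    4: [("author", 10), ("author", 11), ("title", 12), ("pages", 13),
        ("references", 2), ("references", 3)],
    10: [("firstname", 15), ("lastname", 16)],
    11: ELIDED,
    12: ELIDED,
    13: [("first", 17), ("last", 18)],
    15: "Serge", 16: "Abiteboul", 17: 122, 18: 133,
}


def serialize(oid, objects, seen=None):
    """Write a node in full on first visit, as a bare &oid reference afterwards."""
    seen = set() if seen is None else seen
    if oid in seen:
        return f"&{oid}"
    seen.add(oid)
    value = objects[oid]
    if value is None:
        return f"&{oid} {{... }}"
    if isinstance(value, list):
        inner = ", ".join(
            f"{label}: {serialize(target, objects, seen)}" for label, target in value
        )
        return f"&{oid} {{ {inner} }}"
    if isinstance(value, str):
        return f'&{oid} "{value}"'
    return f"&{oid} {value}"


def incoming(objects):
    """Count incoming edges per oid; a count above 1 means the graph is not a tree."""
    count = {oid: 0 for oid in objects}
    for value in objects.values():
        if isinstance(value, list):
            for _, target in value:
                count[target] += 1
    return count


text = "bib: " + serialize(1, objects)
in_degree = incoming(objects)
print(text)
```

Nodes &2 and &3 each have two incoming edges (one from &1 and one `references` edge from &4), which is exactly why the plain nested syntax is not enough and oids are needed.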

18 Characteristics of SSData: Structure is irregular: attributes (labels) may be missing or additional. Parts of the data lack structure, e.g., images; some may yield little structure, e.g., plain text. A-priori schema vs. a-posteriori data guide: in a database, fix the schema, then populate the database; on the Web, design the pages, then design a schema to facilitate access. The schema is large. The schema is often ignored, e.g., in information retrieval queries. The schema is rapidly evolving.

19 Schema Graphs: Given some semi-structured data, we might want to extract a schema that describes it. This is useful for browsing the data by types, optimizing queries by reducing the number of paths searched, and improving storage of data. A schema graph specifies what edges are permitted in a data graph: every path in the data graph occurs in the schema graph.
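The path condition can be checked directly on small graphs. The sketch below reads "path" as a sequence of edge labels starting at the root, which is an assumption. It represents each graph as a dictionary from node to a list of (label, target) pairs. The data graph `D` and schema graph `S` are hypothetical examples, not the graphs from the slides, and the length bound `max_len` is needed because graphs may contain cycles.

```python
def label_paths(graph, root, max_len):
    """All label sequences of length 1..max_len that start at root."""
    paths = set()
    frontier = {((), root)}
    for _ in range(max_len):
        next_frontier = set()
        for labels, node in frontier:
            for label, target in graph.get(node, []):
                next_frontier.add((labels + (label,), target))
        paths |= {labels for labels, _ in next_frontier}
        frontier = next_frontier
    return paths


# Hypothetical data graph D and schema graph S.
D = {
    "d0": [("bib", "&1")],
    "&1": [("paper", "&2"), ("book", "&3")],
    "&2": [("author", "&10")],
}
S = {
    "s0": [("bib", "B")],
    "B": [("paper", "P"), ("book", "K")],
    "P": [("author", "A"), ("title", "T")],
    "K": [("title", "T")],
}

conforms = label_paths(D, "d0", 3) <= label_paths(S, "s0", 3)

# A book with an "editor" edge uses a path the schema does not permit.
D_bad = dict(D)
D_bad["&3"] = [("editor", "&20")]
conforms_bad = label_paths(D_bad, "d0", 3) <= label_paths(S, "s0", 3)
```

For an acyclic data graph, choosing `max_len` at least as large as its longest path makes the comparison cover every path in the data.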

20 Example of a Schema Graph [figure not reproduced in transcript]

21 Data Graph Satisfying a Schema Graph: Given a data graph D and a schema graph S, D is an instance of S (or D satisfies S) if there exists a simulation R from D to S such that (root(D), root(S)) is in R. A simulation is a relation R between nodes such that if (u, v) is in R and an edge (u, x) labelled l is in D, then there exists an edge (v, y) labelled l in S such that (x, y) is in R. For our example: node &1 in D is related under R to the node at the target of the edge labelled bib in S; &2 and &4 are related to the node at the target of the edge labelled paper; &3 is related to the node at the target of the edge labelled book (note that these two cases need to satisfy the requirements of the edges labelled references as well); &10 and &11 are related to the node at the target of the edge labelled author.
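One common way to decide whether such a simulation exists is to start from the relation containing all pairs of nodes and repeatedly delete pairs that violate the condition. What remains is the largest simulation. The sketch below follows that idea. The graph encoding (node mapped to a list of (label, target) pairs), the function names, and the small graphs `D` and `S` are assumptions for illustration and are only loosely modelled on the slide's example.

```python
def all_nodes(graph):
    """Every node that appears as a source or as a target."""
    nodes = set(graph)
    for edges in graph.values():
        nodes.update(target for _, target in edges)
    return nodes


def greatest_simulation(data, schema):
    """Start with all pairs and delete violating pairs until nothing changes."""
    relation = {(u, v) for u in all_nodes(data) for v in all_nodes(schema)}
    changed = True
    while changed:
        changed = False
        for u, v in list(relation):
            for label, x in data.get(u, []):
                matched = any(
                    l == label and (x, y) in relation
                    for l, y in schema.get(v, [])
                )
                if not matched:        # edge (u, x) has no counterpart from v
                    relation.discard((u, v))
                    changed = True
                    break
    return relation


def satisfies(data, data_root, schema, schema_root):
    """D satisfies S if the roots are related by some simulation."""
    return (data_root, schema_root) in greatest_simulation(data, schema)


# Hypothetical data graph and schema graph.
D = {
    "d0": [("bib", "&1")],
    "&1": [("paper", "&2"), ("book", "&3"), ("paper", "&4")],
    "&4": [("author", "&10"), ("references", "&2"), ("references", "&3")],
}
S = {
    "s0": [("bib", "B")],
    "B": [("paper", "P"), ("book", "K")],
    "P": [("author", "A"), ("references", "P"), ("references", "K")],
    "K": [("author", "A")],
}

ok = satisfies(D, "d0", S, "s0")

# An "editor" edge on the book has no matching edge anywhere in S.
D_bad = dict(D)
D_bad["&3"] = [("editor", "&20")]
bad = satisfies(D_bad, "d0", S, "s0")
```

In the failing case the violation propagates upwards: once &3 can no longer be related to any schema node, &1 loses its match for the `book` edge, and so the roots are no longer related.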

22 A Less Specific Schema Graph [figure not reproduced in transcript]

23 Data Guides: A data guide is a concise and accurate summary of a data graph. Accurate: every path in the data occurs in the data guide, and vice versa. Concise: every path in the data guide occurs exactly once. The data guide is the most specific schema graph for a given data graph, i.e., there is a simulation from the data guide to every other schema graph that the data graph satisfies.
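A data guide with these two properties can be built with a subset construction, the same idea used to determinize a finite automaton: each data guide node stands for the set of data nodes reached by some label path. The following is a minimal sketch under that assumption. The data graph `D` is a hypothetical example and the function name `data_guide` is illustrative.

```python
def data_guide(data, root):
    """Subset construction: one guide node per set of data nodes reached by a path."""
    start = frozenset([root])
    guide = {}
    todo = [start]
    while todo:
        state = todo.pop()
        if state in guide:
            continue
        by_label = {}
        for node in state:
            for label, target in data.get(node, []):
                by_label.setdefault(label, set()).add(target)
        # One outgoing edge per label, so each label path occurs exactly once.
        guide[state] = {label: frozenset(ts) for label, ts in by_label.items()}
        todo.extend(guide[state].values())
    return start, guide


# Hypothetical data graph: node -> list of (label, target).
D = {
    "d0": [("bib", 1)],
    1: [("paper", 2), ("book", 3), ("paper", 4)],
    4: [("author", 10), ("author", 11), ("references", 2), ("references", 3)],
}

start, guide = data_guide(D, "d0")
for state, edges in guide.items():
    for label, target in edges.items():
        print(sorted(state, key=str), label, sorted(target, key=str))
```

Here the two `paper` edges leaving node 1 collapse into a single `paper` edge in the guide, whose target stands for the set {2, 4}. In the worst case this construction can produce exponentially many guide nodes, as with automaton determinization.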

24 Example of a Data Guide (1) [figure not reproduced in transcript]

25 Example of a Data Guide (2) [figure not reproduced in transcript]

26 References: Peter Buneman, tutorial on semi-structured data, Symposium on Principles of Database Systems, 1997, www.cis.upenn.edu/~db/tutorials.html. www-db.stanford.edu/lore/research/data.html. Abiteboul S., Buneman P., Suciu D., «Data on the Web - From Relations to Semistructured Data and XML», Morgan Kaufmann Publishers, San Francisco, California.

