CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 311 Database Systems I The Semistructured Data Model.


1 Database Systems I: The Semistructured Data Model

2 The Web Today
HTML documents: generated by humans or by applications, but consumed by humans only. Access is easy across platforms and across organizations. However, HTML carries only layout, no semantic information.
Limited application interoperability: HTML is not understood by applications; at most, some heuristic rules can be applied.
Database technology: SQL is a standard, but implementations still have many vendor-specific aspects.

3 XML Data Exchange Format
A standard from the W3C (World Wide Web Consortium, http://www.w3.org).
The mission of the W3C: "... developing common protocols that promote its evolution and ensure its interoperability ...".
Basic ideas:
- XML = data
- XML is generated by applications
- XML is consumed by applications
- Easy access across platforms and organizations

4 Paradigm Shift on the Web
For web search engines:
- From documents (HTML) to data (XML)
- From document management to document understanding (e.g., question answering)
- From information retrieval to data management
For database systems:
- From the relational (structured) model to semistructured data
- From data processing to data / query translation
- From storage to transport

5 The Semistructured Data Model
Developed by the database systems (DBS) community to address the following emerging issues:
- Data sets with non-rigid structure: biological data (sequence data, 3D data, text data, ... and their relationships) and Web data
- Integration of heterogeneous sources, not only but especially for Web data and biological data

6 The Semistructured Data Model
Data is self-describing, i.e. the data description is integrated with the data itself rather than kept in a separate schema.
The database is a collection of nodes and arcs (a directed graph).
Leaf nodes represent data of some atomic type (atomic objects, such as numbers or strings).
Interior nodes represent complex objects consisting of components (child nodes), which are connected to the node by arcs.
Arcs are directed and connect two nodes.

7 The Semistructured Data Model
Arc labels indicate the relationship between the two corresponding nodes.
The root node is the only interior node without in-arcs; it represents the entire database.
All database objects are children of the root node.
Every node must be reachable from the root.
A general graph structure is possible, i.e. the graph need not be a tree.

8 Graphical Representation
[Figure: the bibliography database drawn as a labeled directed graph. The arc Bib leads to the root object &o1. Arcs labeled paper, book, references, author, title, year, http, publisher, page, firstname, lastname, first and last connect complex objects (e.g. &o12, &o24, &o29, &o43) to their components and to atomic objects such as "Serge", "Abiteboul", 1997, "Victor", "Vianu", 122 and 133. The legend distinguishes complex objects from atomic objects.]

9 Textual Representation
Example:
Bib: &o1 {
  paper: &o12 { … },
  book: &o24 { … },
  paper: &o29 {
    author: &o52 "Abiteboul",
    author: &o96 { firstname: &o243 "Victor", lastname: &o206 "Vianu" },
    title: &o93 "Regular path queries with constraints",
    references: &o12,
    references: &o24,
    pages: &o25 { first: &o64 122, last: &o92 133 }
  }
}
The notation combines nested tuples, set values, and object identifiers (oids).
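The oid-based notation can be mirrored in a few lines of Python. This is a minimal sketch rather than anything from the slides: the dictionary layout and the helper name `reachable` are assumptions, the oids are written uniformly with the `&o` prefix, and the objects `&o12` and `&o24`, whose contents are elided on the slide, are left empty.

```python
# Each object is keyed by its oid. A complex object is a list of
# (label, target_oid) arcs; an atomic object is a plain Python value.
db = {
    "&o1": [("paper", "&o12"), ("book", "&o24"), ("paper", "&o29")],
    "&o12": [],  # contents elided on the slide; left empty (assumption)
    "&o24": [],  # contents elided on the slide; left empty (assumption)
    "&o29": [
        ("author", "&o52"), ("author", "&o96"), ("title", "&o93"),
        ("references", "&o12"), ("references", "&o24"), ("pages", "&o25"),
    ],
    "&o52": "Abiteboul",
    "&o96": [("firstname", "&o243"), ("lastname", "&o206")],
    "&o243": "Victor",
    "&o206": "Vianu",
    "&o93": "Regular path queries with constraints",
    "&o25": [("first", "&o64"), ("last", "&o92")],
    "&o64": 122,
    "&o92": 133,
}


def reachable(db, root):
    """Return the set of oids reachable from root by following arcs."""
    seen, stack = set(), [root]
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        node = db[oid]
        if isinstance(node, list):  # complex object: follow its arcs
            stack.extend(target for _, target in node)
    return seen


# The model requires every node to be reachable from the root.
all_reachable = reachable(db, "&o1") == set(db)

# &o12 has two in-arcs (from &o1 and from &o29), so this graph is not a tree.
in_arcs_o12 = sum(
    1
    for node in db.values()
    if isinstance(node, list)
    for _, target in node
    if target == "&o12"
)
```

Storing arcs as a list of pairs instead of a dictionary keeps repeated labels such as `paper` and `author`.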

10 Textual Representation
Simplified textual representation: the oids can be omitted.
{ paper: {
    author: "Abiteboul",
    author: { firstname: "Victor", lastname: "Vianu" },
    title: "Regular path queries …",
    page: { first: 122, last: 133 }
  }
}
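As a rough illustration, the oid-free notation maps onto nested Python values. The sketch below assumes a list of (label, value) pairs for complex objects, since a plain dictionary could not hold the label `author` twice; the helper name `children` is hypothetical.

```python
# A complex object is a list of (label, value) pairs because a label such
# as "author" may occur more than once; atomic values are plain str / int.
paper = [
    ("author", "Abiteboul"),
    ("author", [("firstname", "Victor"), ("lastname", "Vianu")]),
    ("title", "Regular path queries …"),
    ("page", [("first", 122), ("last", 133)]),
]
db = [("paper", paper)]


def children(obj, label):
    """Return all values attached to obj by an arc with the given label."""
    return [value for lab, value in obj if lab == label]


authors = children(paper, "author")  # one atomic and one complex author
first_page = children(children(paper, "page")[0], "first")[0]
```

Note that the two `author` values have different shapes (a string and a nested object), which is exactly the irregularity the model is meant to allow.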

11 Comparison with Relational Model
Characteristics of semistructured data:
- Missing attributes
- Additional attributes
- Multiple attribute values (set-valued attributes)
- Objects as attribute values
- No global schema
Only the first characteristic is supported by the relational model; all the others are not.

12 Comparison with Relational Model
Semistructured data: self-describing, irregular data, no a-priori structure.
Relational DB: separate schema, regular data, a-priori structure.

13 Comparison with Relational Model
Example: a relation with the attributes name and phone, written as semistructured data:
{ row: { name: "John", phone: 3634 },
  row: { name: "Sue", phone: 6343 },
  row: { name: "Dick", phone: 6363 } }
[Figure: the same data drawn as a graph with row, name and phone arcs leading to the values "John", 3634, "Sue", 6343, "Dick" and 6363.]
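The mapping from a relation to semistructured data is mechanical: every tuple becomes a `row` object and every attribute becomes a labeled arc. The following sketch assumes the (label, value) pair layout and the helper names `to_semistructured` and `render`, none of which appear in the slides.

```python
rows = [("John", 3634), ("Sue", 6343), ("Dick", 6363)]
columns = ("name", "phone")


def to_semistructured(columns, rows):
    """Turn each tuple into a 'row' object with one arc per attribute."""
    return [("row", list(zip(columns, row))) for row in rows]


def render(obj):
    """Print a nested (label, value) structure in the slide's notation."""
    if isinstance(obj, list):
        inner = ", ".join(f"{label}: {render(value)}" for label, value in obj)
        return "{ " + inner + " }"
    return f'"{obj}"' if isinstance(obj, str) else str(obj)


text = render(to_semistructured(columns, rows))
```

Because the attribute names travel with every row, the result is self-describing, at the cost of repeating `name` and `phone` for each tuple.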

14 XML
A W3C standard for an Extensible Markup Language.
Origins: structured text, SGML (Standard Generalized Markup Language).
Motivation:
- HTML describes presentation only; XML describes content and its meaning (semantics).
- HTML is a fixed language; XML allows you to define your own markup languages.

15 From HTML to XML
HTML describes the presentation / layout.

16 From HTML to XML
HTML example:
Bibliography
Foundations of Databases, Abiteboul, Hull, Vianu, Addison Wesley, 1995
Data on the Web, Abiteboul, Buneman, Suciu, Morgan Kaufmann, 1999

17 From HTML to XML
XML example: Foundations…, Abiteboul, Hull, Vianu, Addison Wesley, 1995, …
XML describes the content.

18 Elements
Tags: book, title, author, …
- start tag: <book>, end tag: </book>
- defined by the user / programmer (different from HTML!)
Elements: <book>…</book>, <author>…</author>
An element consists of a matching start tag and end tag and the enclosed content.
Elements can be nested, i.e. the content of one element can consist of a sequence of other elements.
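To make the notion of nested elements concrete, here is a small sketch that parses a bibliography entry with Python's standard `xml.etree.ElementTree` module. The tags `book`, `title` and `author` come from the slide; the enclosing `bibliography` element and the tags `publisher` and `year` are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Tag names other than book, title and author are assumed.
doc = """
<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>
"""

root = ET.fromstring(doc)
book = root.find("book")

# The content of <book> is a sequence of other elements, in document order.
child_tags = [child.tag for child in book]
authors = [a.text for a in book.findall("author")]
```

An application can pick out the authors by tag name alone, which is not possible with the layout-only HTML version of the same bibliography.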

19 Attributes
Attributes can be associated with any element.
They provide additional information about elements.
An attribute can have only one value.
Example: Foundations of Databases, Abiteboul, …, 1995
Attributes can also be used to connect elements.
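A short sketch of how attributes look to a parser, again using the standard `xml.etree.ElementTree` module. The attribute names `price` and `currency` and their values are hypothetical; the slide's own attribute example is not recoverable from the transcript.

```python
import xml.etree.ElementTree as ET

# price and currency are hypothetical attributes chosen for illustration.
doc = (
    '<book price="55" currency="USD">'
    "<title>Foundations of Databases</title>"
    "<author>Abiteboul</author>"
    "<year>1995</year>"
    "</book>"
)
book = ET.fromstring(doc)

price = book.get("price")   # attribute values are always strings
attrs = dict(book.attrib)   # all attributes of the element

# An attribute can have only one value: repeating it is a parse error.
try:
    ET.fromstring('<book price="55" price="60"/>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True
```

This contrasts with elements such as `author`, which may be repeated freely inside the same parent.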

20 Non-tree-like XML
So far: only tree-like XML documents, i.e. each element is nested within at most one other element.
Attributes can also be used to create non-tree XML documents.
Attributes with a domain of ID serve as primary keys of elements.
Attributes with a domain of IDREF serve as foreign keys referencing the ID of another element.

21 Non-tree-like XML
Example of a non-tree structure: Jane, Mary, John
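The ID / IDREF mechanism can be sketched as follows. The element and attribute names (`persons`, `person`, `name`, `id`, `mother`) and the family relationships are assumptions, since only the three names survive in the transcript. Without a DTD the standard `xml.etree.ElementTree` parser treats `id` as an ordinary attribute, so the sketch builds the ID index itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: 'id' plays the role of an ID attribute and
# 'mother' the role of an IDREF attribute.
doc = """
<persons>
  <person id="p1"><name>Jane</name></person>
  <person id="p2" mother="p1"><name>Mary</name></person>
  <person id="p3" mother="p1"><name>John</name></person>
</persons>
"""

root = ET.fromstring(doc)

# Index elements by their ID value, like a primary-key index.
by_id = {p.get("id"): p for p in root.iter("person")}


def mother_name(person):
    """Follow the IDREF in 'mother', like a foreign-key lookup."""
    ref = person.get("mother")
    return by_id[ref].find("name").text if ref is not None else None


mothers = {p.find("name").text: mother_name(p) for p in root.iter("person")}
```

Jane's element is nested in `persons` and also referenced by two other elements, so the document as a whole describes a graph rather than a tree.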

22 Namespaces
An XML document can involve tags that come from multiple sources.
One and the same tag can appear in more than one source.
Example (two fragments): Apples, Bananas / African Coffee Table, 80, 120

23 Namespaces
Name conflicts can be resolved by prefixing tag names according to their source.
Example (two fragments): Apples, Bananas / African Coffee Table, 80, 120
When prefixes are used in XML, a namespace must be defined for each prefix.
The namespace must be referenced (via a URI) in the start tag of an enclosing element.

24 Namespaces
...
Or alternatively:
...

25 Namespaces
A URI is a Uniform Resource Identifier, typically a URL.
The document referenced by the URI describes the meaning of the tags in the namespace.
This description is informal and is not used by the XML parser.
The description can even be empty.
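The prefix mechanism can be tried out with the standard `xml.etree.ElementTree` module. In this sketch the prefixes `h` and `f`, the `example.org` URIs and all tag names are assumptions; only the text values come from the slides. The parser never fetches the URIs, it only uses them as names.

```python
import xml.etree.ElementTree as ET

# Prefixes, URIs and tag names are hypothetical; only the values
# (Apples, Bananas, African Coffee Table, 80, 120) are from the slides.
doc = """
<root xmlns:h="http://example.org/html"
      xmlns:f="http://example.org/furniture">
  <h:table>
    <h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>
  </h:table>
  <f:table>
    <f:name>African Coffee Table</f:name>
    <f:width>80</f:width>
    <f:length>120</f:length>
  </f:table>
</root>
"""

root = ET.fromstring(doc)
ns = {"h": "http://example.org/html", "f": "http://example.org/furniture"}

cells = [td.text for td in root.findall("h:table/h:tr/h:td", ns)]
furniture_name = root.find("f:table/f:name", ns).text

# Internally each tag is qualified by its namespace URI, not by the prefix.
expanded_tags = [child.tag for child in root]
```

Both fragments use a `table` tag, yet the two are distinct after parsing because their namespace URIs differ.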

26 Well-Formed XML
A well-formed XML document satisfies the following conditions:
- It begins with a declaration that it is XML.
- It has a single root element that encloses the whole document.
- It consists of properly nested elements, i.e. the start tag and end tag of an element are within the same enclosing element.
standalone="yes" states that the document has no DTD.
In this mode, you can invent your own tags, as in the semistructured data model.

27 Well-Formed XML
Example: Foundations…, Abiteboul, Hull, Vianu, Addison Wesley, 1995, …

28 Well-Formed XML
HTML browsers will display documents with errors (like missing end tags).
The W3C XML specification states that a program should stop processing an XML document if it finds an error.
The main reason is that XML is consumed by programs rather than by humans (as HTML is).
W3C provides a validator that checks whether an XML document is well-formed.
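The strict error handling is easy to observe with any conforming parser. The sketch below uses Python's standard `xml.etree.ElementTree`, not the W3C validator mentioned on the slide; the helper name `is_well_formed` and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text, False if it stops on an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<book><title>Foundations</title></book>"
missing_end_tag = "<book><title>Foundations</book>"  # <title> never closed
two_roots = "<book/><book/>"                          # no single root element
bad_nesting = "<a><b></a></b>"                        # tags overlap

results = [is_well_formed(t) for t in (good, missing_end_tag, two_roots, bad_nesting)]
```

Unlike an HTML browser, the parser makes no attempt to repair the three faulty documents; it raises an error at the first violation.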

29 Valid XML
The validator can also check whether an XML document is valid, i.e. conforms to a Document Type Definition (DTD).
A DTD specifies the allowable tags and how they can be nested.
XML with a DTD is no longer semistructured (self-describing).
However, a DTD is less rigid than the schema of a relational DB. E.g., a DTD allows missing and multiple attributes / elements.

30 Document Type Definitions
Document Type Definition (DTD): a set of rules (a grammar) specifying elements, attributes and all other aspects of XML documents.
For each element, specify its name and content type.
The content type can be, e.g.,
- #PCDATA (character string),
- other elements,
- a regular expression made of the above content types, using
  * = zero or more occurrences
  ? = zero or one occurrence
  + = one or more occurrences
  , = sequence of elements
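Since DTD content types are regular expressions over element names, their effect can be imitated with an ordinary regex over the sequence of child tags. This is only a sketch of the idea, not DTD validation: the content model (title, author+, publisher?, year) for `book` is hypothetical, and Python's standard library does not validate documents against a DTD.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content model for book: (title, author+, publisher?, year)
# Each child tag is written as "tag " so that the regex sees whole names.
CONTENT_MODEL = re.compile(r"title (author )+(publisher )?year ")


def matches_model(element):
    """Check the sequence of child tags against the content model."""
    child_sequence = "".join(child.tag + " " for child in element)
    return CONTENT_MODEL.fullmatch(child_sequence) is not None


ok = ET.fromstring(
    "<book><title>T</title><author>A</author><author>B</author>"
    "<year>1995</year></book>"
)
no_author = ET.fromstring("<book><title>T</title><year>1995</year></book>")
wrong_order = ET.fromstring(
    "<book><author>A</author><title>T</title><year>1995</year></book>"
)

checks = [matches_model(e) for e in (ok, no_author, wrong_order)]
```

The first document omits the optional publisher and repeats author, which the model permits; the other two violate `+` and the `,` sequence order respectively.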

31 Document Type Definitions
Specification of an element type: <!ELEMENT name content-type>
Specification of attributes: <!ATTLIST element-name attribute-name type default>
An attribute is declared either #REQUIRED or #IMPLIED (optional).

32 Document Type Definitions
ID: domain with unique values within the given document.
IDREF: references one ID.
IDREFS: references a list of IDs.
Example: ...

33 Document Type Definitions
The document type contains all corresponding element types: <!DOCTYPE name [ ... ]>
Use of a DTD by some document: reference the DTD in the document's opening line and set standalone="no".
Example

34 Example DTD: Product Catalog
<!DOCTYPE CATALOG [
<!ATTLIST PRODUCT
  NAME CDATA #IMPLIED
  CATEGORY (HandTool|Table|Shop-Professional) "HandTool"
  PARTNUM CDATA #IMPLIED
  PLANT (Pittsburgh|Milwaukee|Chicago) "Chicago"
  INVENTORY (InStock|Backordered|Discontinued) "InStock">
<!ATTLIST SPECIFICATIONS
  WEIGHT CDATA #IMPLIED
  POWER CDATA #IMPLIED>
<!ATTLIST OPTIONS
  FINISH (Metal|Polished|Matte) "Matte"
  ADAPTER (Included|Optional|NotApplicable) "Included"
  CASE (HardShell|Soft|NotApplicable) "HardShell">
<!ATTLIST PRICE
  MSRP CDATA #IMPLIED
  WHOLESALE CDATA #IMPLIED
  STREET CDATA #IMPLIED
  SHIPPING CDATA #IMPLIED>
]>

35 XML Schema
The successor of DTDs for specifying a schema for XML documents. A W3C standard.
Includes and extends the functionality of DTDs.
In particular, XML Schemas support data types. This makes it easier to validate the correctness of data and to work with data from a database.
XML Schemas are written in XML. You don't have to learn a new language and can use your XML parser to parse your Schema files.

36 Simple Elements
Simple elements contain only text.
They can have one of the built-in datatypes: xs:string, xs:decimal, xs:integer, xs:boolean, xs:date, xs:time.
Example
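Because a schema is itself an XML document, an ordinary XML parser can read it. The sketch below parses a small schema with the standard `xml.etree.ElementTree` module and lists the declared simple elements with their built-in datatypes. The element names `lastname`, `age` and `dateborn` are hypothetical; the code only reads the schema and does not validate any document against it.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"  # the XML Schema namespace URI

# Hypothetical declarations of three simple elements.
schema_text = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="lastname" type="xs:string"/>
  <xs:element name="age" type="xs:integer"/>
  <xs:element name="dateborn" type="xs:date"/>
</xs:schema>
"""

schema = ET.fromstring(schema_text)

# Collect name -> datatype for every top-level xs:element declaration.
declared = {
    decl.get("name"): decl.get("type")
    for decl in schema.findall(f"{{{XS}}}element")
}
```

With these declarations, an instance such as `<age>36</age>` would carry an integer rather than an arbitrary character string, which a DTD's #PCDATA cannot express.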

37 Simple Elements
Restrictions allow you to further constrain the content of simple elements.

38 Attributes
Attributes are specified using the xs:attribute element.
Attribute elements are nested within the definition of the element with which they are associated.
By default, attributes are optional. To make an attribute mandatory, specify use="required".
Attributes can have the same built-in datatypes as simple elements.

39 Complex Elements
Complex elements can contain other elements and can have attributes.
Nested elements need to occur in the order specified.
The number of repetitions of an element is controlled by the attributes minOccurs and maxOccurs. The default is one repetition.
A complex element with an attribute:
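The occurrence attributes can be read directly from a schema. The following sketch assumes a hypothetical `person` element with the children `full_name` and `child_name`; it extracts the minOccurs / maxOccurs bounds (defaulting to 1) and checks only the number of occurrences in an instance, not the order or the datatypes, so it is far from a full schema validator.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical complex element: one full_name, any number of child_name.
schema = ET.fromstring("""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="full_name" type="xs:string"/>
        <xs:element name="child_name" type="xs:string"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
""")


def occurrence_bounds(decl):
    """Return (min, max) for a declaration; max is None for 'unbounded'."""
    low = int(decl.get("minOccurs", "1"))
    high = decl.get("maxOccurs", "1")
    return low, (None if high == "unbounded" else int(high))


# Only the nested declarations carry a type attribute in this schema.
bounds = {
    decl.get("name"): occurrence_bounds(decl)
    for decl in schema.iter(XS + "element")
    if decl.get("type") is not None
}


def counts_ok(instance):
    """Check that each child occurs an allowed number of times."""
    for name, (low, high) in bounds.items():
        n = len(instance.findall(name))
        if n < low or (high is not None and n > high):
            return False
    return True


with_children = ET.fromstring(
    "<person><full_name>A</full_name>"
    "<child_name>B</child_name><child_name>C</child_name></person>"
)
no_name = ET.fromstring("<person><child_name>B</child_name></person>")
```

Here `counts_ok(with_children)` holds, while `counts_ok(no_name)` fails because `full_name` has the default of exactly one occurrence.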

40 Complex Elements
A complex element containing a sequence of nested (simple) elements:

41 Complex Elements
If you name the complex element, other elements can reference and include it:

42 XML Document With Schema
An XML document that uses a schema has to reference the schema in the schemaLocation attribute of its root element.
Example content: Tove, Jani, Reminder, Don't forget me this weekend!

43 Example XML Schema
…

44 XML vs. Semistructured Data
Both are best described by a graph.
Both are schema-less and self-describing (XML without DTD / XML Schema).
XML is ordered, semistructured data is not.
XML can mix text and elements, e.g.: Making Java easier to type and easier to type, Phil Wadler
XML has lots of other stuff: attributes, entities, processing instructions, comments.
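Mixed content and ordering can both be observed with the standard `xml.etree.ElementTree` module. In this sketch the tag names `talk`, `speaker`, `names` and `n` are assumptions; only the talk title and the speaker's name come from the slide.

```python
import xml.etree.ElementTree as ET

# Mixed content: the element holds text followed by a nested element.
doc = (
    "<talk>Making Java easier to type and easier to type "
    "<speaker>Phil Wadler</speaker></talk>"
)
talk = ET.fromstring(doc)
text_part = talk.text                  # the text before the first child
speaker = talk.find("speaker").text    # the nested element's content

# Order: the same children in a different order are different documents.
a = ET.fromstring("<names><n>Jane</n><n>Mary</n></names>")
b = ET.fromstring("<names><n>Mary</n><n>Jane</n></names>")
order_a = [n.text for n in a]
order_b = [n.text for n in b]
```

In the semistructured model the two `names` objects would be the same set of arcs; in XML they differ because child order is part of the data.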

45 Summary
Due to their variable and complex structure, Web documents cannot naturally be modeled using the relational model.
The Semistructured Data Model is a self-describing data model providing sufficient flexibility for representing Web documents.
One of the weaknesses of the Web is that (HTML) documents cannot be processed automatically.
The purpose of XML is to provide a way of recording the semantics of Web documents and their components. For this purpose, XML allows you to define your own application-specific tags.

46 Summary
XML documents are lists of elements and attributes.
Elements can be nested to form tree-like structures. Non-hierarchical structures are also possible.
Document type definitions (DTDs) are similar to but less restrictive than DB schemas, specifying rules that corresponding XML documents have to satisfy.
XML schemas are a more recent and more DB-like extension of DTDs.

