0360-569 Semantic Web (Winter 2007). A Report on DTD vs XML Schema: A Practical Study. By: Bex, G. J., Neven, F., Bussche, J. V. Presented by: Quazi Rahman.


1  0360-569 Semantic Web (Winter 2007)
A Report on DTD vs XML Schema: A Practical Study
By: Bex, G. J., Neven, F., Bussche, J. V.
Presented by: Quazi Rahman, Titas Mutsuddi

2  Outline
1. Introduction
2. Structural View of DTDs and XSDs
3. Dataset
4. Expressiveness of XSDs
5. Additional Features
6. Regular Expression Characterization
7. Schema and Ambiguity
8. Errors
9. Conclusion
10. References

3  1. Introduction
• DTD and XSD are two widely used schema languages for describing the content of XML documents.
• Although DTDs and XSDs differ syntactically, they are closely related on an abstract level.
• In this paper the authors present a comparative study of DTDs and XSDs. They try to answer two questions:
  - Which of the extra features, or extra expressiveness, of XML Schema that are not allowed in DTDs are effectively used in practice?
  - How sophisticated are the structural properties (the nature of the regular expressions) of the two formalisms?

4  1. Introduction (cont’d): Definition of DTD and XSD
• Both Document Type Definitions (DTDs) and XML Schema Definitions (XSDs) state which tags and attributes are used to describe the elements in an XML document, where each tag is allowed, which tags can appear within other tags, and so on.
• Applications use a document's DTD or XSD to read and display the document's contents properly.
• Changes in the format of the document can be made easily by modifying its DTD or XSD.

5  1. Introduction (cont’d): Merits and Demerits of DTD and XSD
• Shortcomings of DTDs
  - No support for namespaces
  - Limited support for data types
  - Limited support for cardinality
• Shortcomings of XSDs
  - They are more complex than DTDs
  - There are complaints about performance
• Merits of XSDs
  - XSDs are extensible to future additions:
    reuse a schema in other schemas;
    create new data types derived from the standard types;
    reference multiple schemas in the same document
  - XSDs are richer and more powerful than DTDs

6  1. Introduction (cont’d): Merits of XSDs
• XSDs are written in XML
  - No need to learn a new language
  - An XML editor can be used to edit schema files
  - An XML parser can be used to parse schema files
  - A schema can be transformed with XSLT
• XSDs support data types. It is easier to:
  - Describe allowable document content
  - Validate the correctness of data
  - Work with data from a database
  - Define data facets (restrictions on data)
  - Define data patterns (data formats)
  - Convert data between different data types
• XSDs support namespaces

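7  2. Structural View of DTD and XSD
• An XML document may be viewed as a finite ordered tree structure. An example:
    <store>
      <dvd>
        <title>Amelie</title>
        <price>17</price>
      </dvd>
      <dvd>
        <title>Good bye, Lenin</title>
        <price>20</price>
        <discount>20%</discount>
      </dvd>
    </store>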

8  2. Structural View of DTD and XSD (cont’d)
• Corresponding tree structure:
    store
      dvd
        title     "Amelie"
        price     "17"
      dvd
        title     "Good bye, Lenin"
        price     "20"
        discount  "20%"
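The tree view can be reproduced mechanically. The following is a minimal sketch in Python, assuming the store markup implied by the tree on this slide (the tag nesting is reconstructed, not quoted from the paper). The helper name label_tree is hypothetical, and text content is dropped so that only element labels remain.

```python
import xml.etree.ElementTree as ET

# Assumed markup for the store example (reconstructed from the tree slide).
DOC = """
<store>
  <dvd><title>Amelie</title><price>17</price></dvd>
  <dvd><title>Good bye, Lenin</title><price>20</price><discount>20%</discount></dvd>
</store>
"""

def label_tree(element):
    """Return (label, [child trees]) for an element, ignoring text content."""
    return (element.tag, [label_tree(child) for child in element])

tree = label_tree(ET.fromstring(DOC.strip()))
root_label, dvds = tree
print(root_label, [label for label, _ in dvds])
```

The (label, children) pairs keep document order, which matters because the tree is ordered.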

9  2. Structural View of DTD and XSD (cont’d)
• A DTD can be used to describe the previous document.
• For the tree above, let every node label be a member of some finite alphabet Σ.
• Definition 1. A DTD is a pair (d, s), where d is a function that maps Σ-symbols to regular expressions over Σ, and s ∈ Σ is the start symbol. A tree satisfies the DTD if its root is labeled by s and, for every node u with label a, the sequence a₁…aₙ of labels of its children matches the regular expression d(a).
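Definition 1 can be read as a recursive check. The sketch below is one possible encoding, not the paper's: trees are (label, children) pairs, and each d(a) is written as a Python regular expression over space-terminated labels. The names satisfies, DTD_RULES and STORE are illustrative, and the rules are the store/dvd rules of this section's example.

```python
import re

# d maps each label to a regular expression over child labels.
# Every child label is followed by one space, so "(dvd )+" means dvd+.
DTD_RULES = {
    "store": r"(dvd )+",
    "dvd": r"title price (discount )?",
    "title": r"",      # text-only elements have no element children
    "price": r"",
    "discount": r"",
}

def _children_ok(node, d):
    """Check the rule for this node, then recurse into its children."""
    label, children = node
    if label not in d:
        return False
    word = "".join(child[0] + " " for child in children)
    return (re.fullmatch(d[label], word) is not None
            and all(_children_ok(child, d) for child in children))

def satisfies(tree, d, s):
    """True if the root is labeled s and every node matches its rule d(a)."""
    return tree[0] == s and _children_ok(tree, d)

STORE = ("store", [
    ("dvd", [("title", []), ("price", [])]),
    ("dvd", [("title", []), ("price", []), ("discount", [])]),
])
print(satisfies(STORE, DTD_RULES, "store"))
```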

10  2. Structural View of DTD and XSD (cont’d)
• We can abstract a DTD by a set of rules of the form a → r, where a is an element and r is a regular expression over the alphabet of elements. For example:
    store → dvd+
    dvd → title price discount?
• Definition 2. A specialized DTD (SDTD) is a 4-tuple (Σ, Σ’, δ, μ), where Σ’ is an alphabet of types, δ is a DTD over Σ’ and μ is a mapping from Σ’ to Σ. Note that μ can be applied to a Σ’-tree as a relabeling of the nodes, thus yielding a Σ-tree. A Σ-tree t then satisfies the SDTD if t can be written as μ(t’), where t’ satisfies the DTD δ.

11  2. Structural View of DTD and XSD (cont’d)
• A simple example of an SDTD:
    store → (dvd₁ + dvd₂)* dvd₂ (dvd₁ + dvd₂)*
    dvd₁ → title price
    dvd₂ → title price discount
• Here, dvd₁ defines ordinary DVDs while dvd₂ defines DVDs on sale. The rule for store specifies that there should be at least one of the latter.
• Definition 3. A single-type SDTD is an SDTD (Σ, Σ’, (d, s), μ) with the property that no regular expression d(a) has occurrences of types of the form bᵢ and bⱼ with the same b but different i and j.
• The example above is not a single-type SDTD, as both dvd₁ and dvd₂ occur in the rule for store.

12  2. Structural View of DTD and XSD (cont’d)
• An example of a single-type grammar is given below:
    store → regulars discounts
    regulars → (dvd₁)*
    discounts → dvd₂ (dvd₂)*
    dvd₁ → title price
    dvd₂ → title price discount
• Although there are still two element definitions, dvd₁ and dvd₂, they can only occur in different contexts: regulars and discounts, respectively.
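The single-type condition of Definition 3 is easy to test rule by rule. The sketch below assumes a purely illustrative encoding in which a type is written as the element name followed by a numeric suffix (dvd1, dvd2) and an untyped name carries no suffix; is_single_type is a hypothetical helper, not part of any tool used in the paper.

```python
import re

def is_single_type(rules):
    """True if no rule body mentions two different types of the same element."""
    for body in rules.values():
        seen = {}  # element name -> type suffix first seen in this rule
        for name, suffix in re.findall(r"([a-z]+)(\d*)", body):
            if seen.setdefault(name, suffix) != suffix:
                return False
    return True

# The SDTD with dvd1 and dvd2 mixed in the rule for store.
MIXED = {
    "store": "(dvd1 + dvd2)* dvd2 (dvd1 + dvd2)*",
    "dvd1": "title price",
    "dvd2": "title price discount",
}

# The grammar that separates the two types by context.
SEPARATED = {
    "store": "regulars discounts",
    "regulars": "(dvd1)*",
    "discounts": "dvd2 (dvd2)*",
    "dvd1": "title price",
    "dvd2": "title price discount",
}

print(is_single_type(MIXED), is_single_type(SEPARATED))
```

The check looks only at each rule in isolation, which is all Definition 3 requires.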

13  2. Structural View of DTD and XSD (cont’d)
• A fragment of an XSD for the above DTD may be written as:

14  3. Dataset
• The authors gathered a representative sample of DTDs and XSDs for this comparative study, mostly from the online source xml.coverpages.org.
• They obtained 109 DTDs and 93 XSDs for this study.

15  4. Expressiveness of XSDs: Single-Type
• The authors tried to find out whether the expressive power of single-type SDTDs is actually used in real-world XSDs.
• Most XSDs define local tree languages, that is, languages that can be defined by DTDs.
• Only 5 out of the 30 XSDs used in this analysis, or only 15%, are true single-type SDTDs.
• All five XSDs were of the form:
    p → … a₁ …
    q → … a₂ …
    a₁ → expr₁
    a₂ → expr₂
  This means: when the parent of an a is p (or q), use the rule for a₁ (or a₂).

16  4. Expressiveness of XSDs (cont’d): Derived Types
• XML Schema provides two kinds of types: simple and complex.
• A simple type describes the character data an element can contain (like #PCDATA in DTDs).
• A complex type specifies which elements may occur as children of a given element.
• In XSDs, new types may be derived from existing types using two mechanisms:
  - Extension
  - Restriction

17  4. Expressiveness of XSDs (cont’d): Derived Types
• A simple type can be extended to a complex type to add attributes to elements.
• A complex type can be extended to add a sequence of additional elements to its content model, or to add attributes.
• A simple type can be restricted to limit the acceptable range of values for that type.
• A complex type can be restricted to limit the set of acceptable sub-trees.

Table 1: Relative use of derivation features in XSDs

                 Simple type (%)   Complex type (%)
    Extension          27                 37
    Restriction        73                  7

18  4. Expressiveness of XSDs (cont’d): Derived Types
Out of the 93 XSDs considered:
• Approximately one fifth (20%) do not construct new types through derivation at all.
• Extension is used to define additional attributes in 58% of cases, and to add new elements to a content model in 42%.
• Restriction of complex types is used in only 7%.
• Note that only 37% use extension of complex types, which parallels inheritance in object-oriented programming.
• Extension of simple types occurs in 27% of XSDs.
• Restriction of simple types is the most heavily used feature (73%), which reflects a shortcoming of DTDs, whose only text type is the unrestricted #PCDATA.

19  4. Expressiveness of XSDs (cont’d): Derived Types
• 6 XSDs use the feature of finalizing a type definition, that is, an attribute specifying that the type can be neither restricted nor extended.
• 11 XSDs use abstract type definitions, from which new types must be derived before they can be used.
• A derived type can occur anywhere in the content model where the original type is allowed, but this can be prevented by applying the block attribute to the original type. 2 XSDs use this blocking feature.
• The fixed attribute is used to indicate that an element or attribute is restricted to a specific value. Only a single XSD uses this feature.
• With the substitutionGroup feature, an element can be substituted by an element with another name. This feature is used by 10 XSDs.

20  5. Additional Features
• The &-operator, which specifies that all elements must occur but that their order is not significant, was available in SGML DTDs but is lost in XML DTDs (a₁ & a₂ & a₃ ≡ a₁a₂a₃ | a₁a₃a₂ | … | a₃a₂a₁). In XSDs this feature is restored by the xsd:all element. Only 4 XSDs use this operator.
• Elements of an XML document can be identified using an ID attribute and referred to by IDREF or IDREFS (also supported by DTDs). IDs are unique throughout the document. Only 6 XSDs use this feature.
• Referring to elements can also be accomplished by key/keyref pairs. Using a reference to a key implies that the element with the corresponding key should exist in the document. This is used by 4 XSDs.
• One important feature of XSDs is the use of namespaces. This allows elements and types that are defined elsewhere to be used in the current XSD. Apart from the obvious inclusion of the XML Schema namespace, 20 XSDs use this feature.
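The equivalence for the &-operator can be spelled out by enumerating orderings. A minimal sketch, assuming element names are plain strings; expand_all is a hypothetical helper used only to illustrate why an XML DTD without & needs n! alternatives for n elements.

```python
from itertools import permutations

def expand_all(*elements):
    """Expand a1 & a2 & ... & an into the disjunction of all orderings."""
    return " | ".join(" ".join(order) for order in permutations(elements))

expansion = expand_all("a1", "a2", "a3")
print(expansion)
```

With three elements the expansion has 6 alternatives; with ten it would already have 3,628,800, which is why a dedicated construct such as xsd:all is convenient.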

21  6. Regular Expression Characterization
• The second question the authors tried to answer is how sophisticated regular expressions tend to be in real-world DTDs and XSDs.
• For this analysis, the authors had to perform some preprocessing on the documents:
  - DTD element definitions were converted to a canonical form, for instance the form (c₁ | c₂)*, keeping only the structural DTD information.
  - XSDs were preprocessed into the canonical form using XSLT.
• For DTDs, a total of 11802 element definitions was reduced to 750 canonical forms, and for XSDs a total of 1016 element definitions was reduced to 138 canonical forms, totaling 838 for both types of schema.
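One way to picture the canonical form is as a renaming of element names that keeps the operators untouched. The sketch below is an assumption about what such a preprocessing step could look like, not the authors' actual procedure: names are replaced by c1, c2, ... in order of first appearance, and the sample content models (book, journal, title, author) are hypothetical.

```python
import re

def canonical(content_model):
    """Rename element names to c1, c2, ... in order of first appearance."""
    names = {}

    def rename(match):
        name = match.group(0)
        if name not in names:
            names[name] = f"c{len(names) + 1}"
        return names[name]

    # Names preceded by '#' (such as #PCDATA) are keywords, not elements.
    return re.sub(r"(?<![#\w])[A-Za-z_][\w.\-]*", rename, content_model)

print(canonical("(book | journal)*"))
print(canonical("(title, author, author?)"))
```

Two definitions with different element names but the same shape then collapse to a single canonical form, which is how many thousands of definitions can reduce to a few hundred forms.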

22  6. Regular Expression Characterization (cont’d)
• Definition 4. A base symbol is a regular expression a, a?, or a*, where a ∈ Σ; a factor is of the form e, e?, or e*, where e is a disjunction of base symbols. A simple regular expression is ε, ∅, or a sequence of factors, such as (a*+b*)(a+b)?b*(a+b)*.
• The authors introduce a uniform syntax to denote subclasses of simple regular expressions by specifying the allowed factors. They distinguish base symbols extended by ? or *. Further, they distinguish between factors with one disjunct and factors with arbitrarily many disjuncts; the latter is denoted by (+…). Finally, factors can again be extended by * or ?. For example, they write RE((+a)*, a?) for the set of regular expressions e₁…eₙ where every eᵢ is (a₁+…+aₙ)* for some a₁,…,aₙ ∈ Σ and n ≥ 1, or a? for some a ∈ Σ.
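Definition 4 is itself regular, so membership can be tested with a pattern over the expression's syntax. A minimal sketch, assuming single lowercase letters as symbols and + as disjunction (as in the notation of Definition 4); the special cases ε and ∅ are not handled, and is_simple is a hypothetical name.

```python
import re

BASE = r"[a-z][?*]?"                                  # a, a? or a*
FACTOR = rf"(?:{BASE}|\({BASE}(?:\+{BASE})*\)[?*]?)"   # e, e? or e*
SIMPLE = re.compile(rf"(?:{FACTOR})+")                 # a sequence of factors

def is_simple(expr):
    """True if expr is a sequence of factors in the sense of Definition 4."""
    return SIMPLE.fullmatch(expr.replace(" ", "")) is not None

print(is_simple("(a*+b*)(a+b)?b*(a+b)*"))   # the example from Definition 4
print(is_simple("((a+b)c)*"))               # concatenation under a star
```

The second expression is rejected because a factor may only contain a disjunction of base symbols, never a concatenation.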

23  6. Regular Expression Characterization (cont’d)
• The following table lists the possible factors in simple regular expressions and how they are denoted (a, a₁, …, aₙ ∈ Σ).

Table 2

    Factor               Abbr.
    a                    a
    a*                   a*
    a?                   a?
    (a₁ + … + aₙ)        (+a)
    (a₁ + … + aₙ)*       (+a)*
    (a₁ + … + aₙ)?       (+a)?
    (a₁* + … + aₙ*)      (+a*)
    (a₁* + … + aₙ*)*     (+a*)*

24  6. Regular Expression Characterization (cont’d)
• The authors analyzed the DTDs and XSDs to characterize their content models according to the subclasses defined above.
• The result is presented in Table 3, which lists the non-overlapping categories of expressions having a significant population (more than 0.5%).
• There are two major differences between DTDs and XSDs:
  - XSDs have more simpleType elements (#PCDATA). This may be because XSD introduces more distinct simple types, so that it is now possible to fine-tune the specification of an element’s content.
  - XSDs have fewer expressions in the category RE(a, (+a)*). This is most probably due to the nature of the XSDs in the sample, since those describing data are over-represented with respect to those describing meta-documents.

25  6. Regular Expression Characterization (cont’d)

Table 3: Relative occurrence of various types of regular expressions, given in % of element definitions

                               DTDs (%)   XSDs (%)
    #PCDATA                       34         48
    EMPTY                         16         10
    ANY                            1          0
    RE(a)                          5          5
    RE(a, a?)                      2         10
    RE(a, a*)                      8         10
    RE(a, a?, a*)                  1          4
    RE(a, (+a))                    3          3
    RE(a, (+a)?)                   0          1
    RE(a, (+a)*)                  20          2
    RE(a, (+a)?, (+a)*)            0          1
    RE(a, (+a*)*)                  0          2
    Total simple expressions      92         97
    Non-simple expressions         8          3

26  6. Regular Expression Characterization (cont’d)
• The authors compared DTDs and XSDs using several measures but did not observe any significant differences between them. More importantly, the comparisons make clear that the vast majority of expressions are simple, both in DTDs (92%) and in XSDs (97%).
• Some of the comparisons they carried out are:
  - Density
  - Width and depth of canonical forms
  - Simple content models
  - Star height

27  6. Regular Expression Characterization (cont’d)
• The density of a schema is defined as the number of elements occurring in the right-hand sides of its rules divided by the number of elements.
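As a worked instance of the density measure, take the store rules used earlier (store → dvd+, dvd → title price discount?). The sketch below assumes that occurrences are counted with multiplicity and that text-only elements have an empty right-hand side; both are reading choices, and the function name density is illustrative.

```python
import re

def density(rules):
    """Element occurrences on right-hand sides divided by number of elements."""
    occurrences = sum(len(re.findall(r"[A-Za-z_]\w*", body))
                      for body in rules.values())
    return occurrences / len(rules)

STORE_RULES = {
    "store": "dvd+",
    "dvd": "title price discount?",
    "title": "",       # text-only elements: empty right-hand side
    "price": "",
    "discount": "",
}

print(density(STORE_RULES))   # (1 + 3) occurrences over 5 elements
```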

28  6. Regular Expression Characterization (cont’d)
• The table below shows the fraction of DTDs and XSDs versus the fraction of their simple content models: the majority of documents have 90% or more simple content models.

29  6. Regular Expression Characterization (cont’d)
• The star height of a regular expression is the maximum nesting depth of Kleene stars occurring in the expression. Content models with star height larger than 1 are very rare.
• The larger share of star-height-1 expressions in DTDs is due to the abundance of RE(a, (+a)*) expressions in DTDs relative to XSDs.

Table 4: Star height observed in DTDs and XSDs

    Star height   DTDs (%)   XSDs (%)
    0                61         78
    1                38         17
    2                 1          4
    3                 0          0
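Star height can be computed in a single left-to-right scan, because concatenation and disjunction both take the maximum of their parts and only * adds a level. A minimal sketch, assuming a well-formed expression in which * is the only iteration operator (?, |, + and , are skipped); star_height is a hypothetical helper.

```python
def star_height(expr):
    """Maximum nesting depth of Kleene stars in a well-formed expression."""
    stack = [0]    # max star height seen so far in each open group
    last = None    # height of the most recently completed sub-expression
    for ch in expr:
        if ch == "(":
            stack.append(0)
            last = None
        elif ch == ")":
            last = stack.pop()
            stack[-1] = max(stack[-1], last)
        elif ch == "*":
            last += 1                      # the star wraps the preceding part
            stack[-1] = max(stack[-1], last)
        elif ch.isalnum():
            last = 0                       # a symbol has star height 0
    return stack[0]

print(star_height("title price discount?"))
print(star_height("(c1 | c2)*"))
print(star_height("(a* + b*)*"))
```

The last expression is the factor written (+a*)* in Table 2, the only listed factor shape with star height 2.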

30  7. Schema and Ambiguity
• The XML 1.0 specification by the W3C requires content models to be deterministic, or one-unambiguous.
• The authors checked whether the DTDs and XSDs in the study respect this requirement, using IBM’s XML Schema Quality Checker (SQC).
• The authors found that almost all of them follow the rule.
• Only 3 out of 93 XSDs have one or more ambiguous content models, of two canonical forms: c₁?(c₁|c₂)* and (c₁c₂)|(c₁c₃).

31  7. Schema and Ambiguity (cont’d)
• For DTDs, the first exception is a regular expression of the type (… | cᵢ | … | cᵢ | …)*. The authors consider it to be only a typo, not a design feature.
• The second type of ambiguous regular expression is of the form c₁c₂?c₂?. The designer’s intention was clearly to state that c₂ may occur zero, one or two times.
• This illustrates a shortcoming of DTDs that has been addressed in XSDs, as in the following example:
    <xsd:element name="c2" type="t2" minOccurs="0" maxOccurs="2"/>
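The ambiguous forms on these slides can be checked with the standard position-based (Glushkov-style) test: an expression is deterministic when no two positions carrying the same symbol compete, either at the start or after any given position. The sketch below is an illustration under an assumed tuple encoding of expressions, not the algorithm used by SQC; is_deterministic and the encoding are hypothetical.

```python
def is_deterministic(expr):
    """Position-based test for one-unambiguity of a regular expression.

    expr is ("sym", a), ("cat", e1, e2), ("alt", e1, e2), ("opt", e)
    or ("star", e).
    """
    labels = {}   # position -> symbol
    follow = {}   # position -> positions that may come right after it

    def walk(e):
        """Return (nullable, first, last) while filling labels and follow."""
        kind = e[0]
        if kind == "sym":
            p = len(labels)
            labels[p], follow[p] = e[1], set()
            return False, {p}, {p}
        if kind in ("opt", "star"):
            _, first, last = walk(e[1])
            if kind == "star":
                for p in last:
                    follow[p] |= first
            return True, first, last
        n1, f1, l1 = walk(e[1])
        n2, f2, l2 = walk(e[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        for p in l1:                      # kind == "cat"
            follow[p] |= f2
        return n1 and n2, (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2)

    def clash(positions):
        symbols = [labels[p] for p in positions]
        return len(symbols) != len(set(symbols))

    _, first, _ = walk(expr)
    return not clash(first) and not any(clash(f) for f in follow.values())

c1, c2, c3 = ("sym", "c1"), ("sym", "c2"), ("sym", "c3")
dtd_form = ("cat", c1, ("cat", ("opt", c2), ("opt", c2)))    # c1 c2? c2?
rewritten = ("cat", c1, ("opt", ("cat", c2, ("opt", c2))))   # c1 (c2 c2?)?
xsd_form_a = ("cat", ("opt", c1), ("star", ("alt", c1, c2))) # c1? (c1|c2)*
xsd_form_b = ("alt", ("cat", c1, c2), ("cat", c1, c3))       # (c1 c2)|(c1 c3)
print(is_deterministic(dtd_form), is_deterministic(rewritten))
```

The rewritten form c1 (c2 c2?)? accepts the same words as c1 c2? c2? but is deterministic, which is the effect that minOccurs="0" maxOccurs="2" expresses directly in an XSD.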

32  8. Errors
• The authors found a number of errors in the XSDs they retrieved:
  - Only 30 out of 93 XSDs were found to pass a conformance test by SQC, that is, to comply with the W3C specifications.
  - 19 XSDs were designed according to a version of the schema specification older than the 2001 specs. Some simple types have been omitted or added from one version of the specs to another, causing SQC to report errors.
  - Some errors concern violations of the Datatypes part of the specs, such as a regular expression wrongfully restricting xsd:string.
  - Some XSDs violate the specs by specifying a type attribute for a complexType element, or by leaving out the name attribute for a top-level complexType element.

33  9. Conclusion
• Many features defined in the XML Schema specification are not widely used yet, especially those related to object-oriented data modeling, such as derivation of complex types by extension.
• The expressive power of the XSDs under investigation is almost equivalent to that of DTDs, which means that, disregarding some exceptions, these XSDs could as well have been written as DTDs. This might show that the level of sophistication offered by XSDs is not necessary for most applications, at least until now.

34  9. Conclusion (cont’d)
• The data type part of the XML Schema specs is heavily used, since it alleviates a major shortcoming of DTDs, namely the lack of any way to specify the format and type of the text of an element. In XSDs this is accomplished by restricting a simple type.
• The content models specified in both DTDs and XSDs tend to be very simple. For XSDs, 97% of all content models can be classified as simple expressions.
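Restricting a simple type can be illustrated with a pattern facet. The sketch below is hypothetical: the type name discountType and its pattern are invented for the store example, and the check uses Python's re module as a stand-in for an XSD validator. XSD patterns and Python regular expressions differ in general, but they agree on a pattern as simple as this one.

```python
import re
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical simple type: digits followed by a percent sign, e.g. "20%".
DISCOUNT_TYPE = """
<xsd:simpleType name="discountType"
                xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="[0-9]+%"/>
  </xsd:restriction>
</xsd:simpleType>
"""

def pattern_of(simple_type_xml):
    """Extract the pattern facet from a simpleType restriction."""
    root = ET.fromstring(simple_type_xml.strip())
    return root.find(f"{XSD}restriction/{XSD}pattern").get("value")

def is_valid(value, simple_type_xml):
    """XSD patterns must match the whole value, hence fullmatch."""
    return re.fullmatch(pattern_of(simple_type_xml), value) is not None

print(is_valid("20%", DISCOUNT_TYPE), is_valid("twenty", DISCOUNT_TYPE))
```

A DTD could only declare discount as #PCDATA, so both values would be accepted there.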

35  10. References
1. Bex, G. J., Neven, F. and Bussche, J. V., DTDs versus XML Schema: A Practical Study. In Proceedings of the Seventh International Workshop on the Web and Databases, WebDB 2004, pages 79-84, Maison de la Chimie, Paris, France, June 17-18, 2004.
2. http://www.webopedia.com/TERM/D/DTD.html
3. http://searchwebservices.techtarget.com/sDefinition/0,,sid26_gci831325,00.html
4. http://en.wikipedia.org/wiki/XML_Schema
5. http://www.w3schools.com/schema/default.asp
6. http://www.w3schools.com/dtd/dtd_intro.asp
7. IBM Corp., XML Schema Quality Checker, 2003, http://www.alphaworks.ibm.com/tech/xmlsqc
8. R. Cover, The Cover Pages, 2003, http://xml.coverpages.org/
9. P. Biron and A. Malhotra, XML Schema Part 2: Datatypes. W3C, May 2001, http://www.w3.org/TR/xmlschema-2/
10. http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-01-02/03-01-02.pdf
