XML Document Type Definitions (DTDs)


1 XML Document Type Definitions (DTDs)

2 Objectives
The purpose of using schemas
The schema languages: DTD, XML Schema, RELAX NG
Regular expressions: a formalism commonly used in schema languages for defining schemas

3 XML Languages and schemas
XML language: a set of XML documents used in a domain that share a common set of elements and attributes. E.g. XHTML, MathML, SVG, CML, RecipeML.
schema: a formal definition of the syntax of an XML language. It defines the collection of elements that can be used in the language, together with all possible attributes and contents of each element.
schema language: a notation (or language) for writing schemas.

4 Validation
[Diagram: an instance document and a schema are input to a schema processor. A valid document yields a normalized instance document; an invalid one yields an error message.]

5 General Requirements for designing a schema language
Expressiveness Efficiency Comprehensibility

6 Regular Expressions
Commonly used in schema languages to describe sequences of characters or elements.
Σ: an alphabet (typically Unicode characters or element names)
σ ∈ Σ matches the string σ
α? matches zero or one α
α* matches zero or more α's
α+ matches one or more α's
α β matches any concatenation of an α and a β
α | β matches the union of α and β

7 Examples
A regular expression describing integers:
  0 | -?(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*
A regular expression describing the valid contents of table elements in XHTML:
  caption? ( col | colgroup )* thead? tfoot? ( tbody | tr )+
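The integer expression can be tried out directly. The following is a minimal sketch using Python's re module; the function name is_integer is made up for illustration, and the character classes [1-9] and [0-9] are just shorthand for the slide's explicit alternations.

```python
import re

# The slide's expression 0 | -?(1|...|9)(0|...|9)* with character classes.
INTEGER = re.compile(r"0|-?[1-9][0-9]*")

def is_integer(text: str) -> bool:
    """Return True if the whole string matches the integer expression."""
    return INTEGER.fullmatch(text) is not None

for sample in ["0", "-17", "42", "007", "-0", ""]:
    print(repr(sample), is_integer(sample))
```

Note that the expression rejects leading zeros and "-0", which is exactly what the two alternatives on the slide imply.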

8 DTD - Table of Contents
Introduction to DTD: an introduction to the XML Document Type Definition.
DTD - XML Building Blocks: what XML building blocks are defined in a DTD.
DTD Elements: how to define the elements of an XML document using DTD.
DTD Attributes: how to define the legal attributes of XML elements using DTD.
DTD Entities: how to define XML entities using DTD.

9 DTD – Document Type Definition
XML DTD is a subset of the DTD formalism from SGML a part of XML 1.0 A starting point for development of more expressive schema languages Considers elements, attributes, and character data only processing instructions and comments are mostly ignored because they are semantically not part of a document

10 Checking Validity with DTD
A DTD processor (also called a validating XML parser):
parses the input document (which includes checking well-formedness)
checks the root element name
for each element, checks its contents and attributes
checks uniqueness and referential constraints (ID/IDREF(S) attributes)

11 Internal subset (of a DTD)
This is an XML document with a Document Type Definition:
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
The DTD is interpreted like this: !ELEMENT note (in line 3) defines the element "note" as having four child elements, in this order: "to, from, heading, body"; and so on.

12 External subset (of a DTD)
This is the same XML document with an external DTD:
<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

13 note.dtd
This is a copy of the file "note.dtd" containing the Document Type Definition:
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
The first line is an optional text declaration; unlike the XML declaration of a document, it must include an encoding declaration.

14 Why use a DTD?
A DTD gives people a common format for interchanging data.
It provides an application-independent way of sharing data.
We can use a DTD to verify that an XML document we produced or received from the outside world is valid.

15 2.8 Document Type Declaration (cont’d)
Document Type Definition [28] doctypedecl ::= '<!DOCTYPE' S Name ( S ExternalID)? S? ('[' intSubset ']' S?)? '>' [28a] DeclSep   ::=   PEReference | S [28b] intSubset ::= (markupdecl | DeclSep)* [29] markupdecl ::= elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment Notes: DTD = internal subset + External Subset internal subset defined by intSubset; external subset defined by an external entity specified by ExternalID

16 [75] ExternalID ::= 'SYSTEM' S SystemLiteral[9]
2.8 External Subset [75] ExternalID ::= 'SYSTEM' S SystemLiteral[9] | 'PUBLIC' S PubidLiteral S SystemLiteral External Subset [30] extSubset ::= TextDecl? extSubsetDecl [31] extSubsetDecl ::= ( markupdecl | conditionalSect | DeclSep )* cf.: [28b] intSubset ::= (markupdecl | DeclSep)*

17 Document Type Declarations
Associates a DTD schema with the instance document 1. Contains both internal and external subsets <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN” [ <!ELEMENT ...> ] > <html>...</html> 2. External Subset only <?xml version="1.1"?> <!DOCTYPE collection SYSTEM " <collection>...</collection> 3. Internal subset only <!DOCTYPE rootElement [ ... ] > <rootElement>...</rootElement> public identifier system identifier (a URI)

18 The declarations can also be given locally, as in this example:
2.8 Example XML documents <?xml version="1.0"?> <!DOCTYPE greeting SYSTEM "hello.dtd"> <greeting>Hello, world!</greeting> The system identifier "hello.dtd" gives the URI of a DTD for the document. The declarations can also be given locally, as in this example: <?xml version="1.0" encoding="UTF-8" ?> <!DOCTYPE greeting [ <!ELEMENT greeting (#PCDATA)> ]>

19 XML building blocks (content part)
(The content parts of) XML documents are made up of the following building blocks: Elements, Tags -- Start Tag, End Tag -- Attributes, PCDATA, CDATA Section Processing Instruction, Comment Entities, Discussed in the previous lecture.

20 Entities (reviewed)
Entities are used to define common text, like macros in programming languages.
Entity references are references to entities. Format: if xxx is an entity name, then &xxx; is its entity reference. E.g. &nbsp; is used to insert an extra (non-breaking) space in an HTML document.
Entities are expanded when a document is parsed by an XML parser.
The following entities are predefined in XML:
Entity reference   Character
&lt;               <
&gt;               >
&amp;              &
&quot;             "
&apos;             '
More about entities later.

21 DTD – Element Declaration
Declaring an Element which may occur in the document Format: <!ELEMENT element-name element-content > Types of element contents: EMPTY – no contents ANY no restriction on contents MIXED -- allow character data (character data only) or (character data + elements) ElementOnly -- allow elements only

22 EMPTY element content
Declares an element with empty content.
Format: <!ELEMENT element-name EMPTY>
Example: <!ELEMENT img EMPTY>
Valid instances: <img/> <img></img>
Invalid instance: <img> </img> (x) (an EMPTY element may not contain anything, not even white space)

23 ANY element content
Declares an element that can contain any combination of elements and text data.
Declared with the 'ANY' keyword: <!ELEMENT name ANY>
Example: <!ELEMENT E1 ANY>
Valid instances (with respect to E1 only):
<E1>begin<E2/>middle<E3> fff </E3> end</E1>
<E1> dddd </E1>
<E1/>

24 Elements with MIXED contents
Two cases:
1. Elements that can contain only text: <!ELEMENT name (#PCDATA)>
2. Elements allowing text as well as elements: <!ELEMENT E0 (#PCDATA | E1 | E2 … )* >
Example:
<!ELEMENT note (#PCDATA)>
<!ELEMENT em EMPTY>
<!ELEMENT e1 (#PCDATA | note | em)* >
<!ELEMENT e1 (note | #PCDATA | em) > (X) wrong for two reasons: 1. no star; 2. #PCDATA is not in first position
Valid instance: <e1> ddd <em/> cd <note>ttt</note> <em/> </e1>
#PCDATA must appear first!

25 Elements that can contain element contents only
Issue: how to declare the possible sequences of content elements.
Solution: regular expressions over element names.
Definition:
1. CP ::= (name | choice | seq) ('+' | '*' | '?')?
2. choice ::= a list of two or more CPs separated by '|' and enclosed by '(' and ')'.
3. seq ::= a list of one or more CPs separated by ',' and enclosed by '(' and ')'.
ElementOnly elements: <!ELEMENT name children >, where children is any CP other than a bare name with an optional '+', '*' or '?'.

26 Recursive definition of CP, seq and choice:
Basis: if a is a name, then a, a?, a+, a* are CPs (content particle) basic CP Closure: if b is a seq or choice, then b, b?, b+, b* are CPs. if b1, b2,… bn (n > 1) are CPs, then (b1 | b2 | … | bn) is a choice. if b1, b2,… bn (n > 0) are CPs, then (b1 , b2 , … , bn) is a seq. a is a children if a is a non-basic CP (i.e., a CP but is not a basic CP). Examples of children: Illegal : <!ELEMENT e1 e2*>, <!ELEMENT e1 e2> Legal : <!ELEMENT e1 (e2)>,<… (e2+)>, <… (e2)?>

27 More examples
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT note (to, from, heading1 | heading2, body)> (X) ',' and '|' cannot be mixed in one group
<!ELEMENT note (to, from, (heading1 | heading2), body)> (O)
<!ELEMENT E1 ( (E1, E2) | (E1, E3, E2)) > (X, 1-ambiguous)
Rewritten as: <!ELEMENT E1 (E1, (E2 | (E3, E2)))> (O)
Although ( (E1, E2) | (E1, E3, E2)) is a choice particle, it is 1-ambiguous: on reading E1, a parser cannot tell which branch it is in without looking ahead. The XML specification therefore does not allow it as the content model of an element.
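Because an element-only content model is a regular expression over element names, it can be checked with an ordinary regex engine. The sketch below is an illustration, not how a real validating parser works: the helper names content_model_to_regex and children_match are invented here, and each child name is encoded as the name followed by one space.

```python
import re

def content_model_to_regex(model: str) -> re.Pattern:
    """Translate a DTD element-only content model into a Python regex.

    Every element name becomes the token "name " (name plus one space);
    ',' is plain concatenation; '|', '?', '*', '+' and parentheses keep
    their regular-expression meaning.
    """
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.\-]*|[(),|?*+]", model):
        if token == ",":
            continue                      # sequence = concatenation
        if token == "(":
            parts.append("(?:")
        elif token in ")|?*+":
            parts.append(token)
        else:
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))

def children_match(model: str, children: list) -> bool:
    """Check a list of child element names against a content model."""
    encoded = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(encoded) is not None

MODEL = "(to, from, (heading1 | heading2), body)"
print(children_match(MODEL, ["to", "from", "heading2", "body"]))
print(children_match(MODEL, ["to", "from", "body"]))
```

Python's engine backtracks, so it also accepts the 1-ambiguous model ( (E1, E2) | (E1, E3, E2)); the determinism rule exists so that simpler parsers can decide with one symbol of lookahead.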

28 3.2 Grammar of Element Type Declaration
[45] elementdecl ::= '<!ELEMENT' S Name S contentspec S? '>' [ VC: Unique Element Type Declaration] [46] contentspec ::= ‘EMPTY’ | ‘ANY’ | Mixed | children Examples: <!ELEMENT br EMPTY> <!ELEMENT container ANY> <!ELEMENT p (#PCDATA | emph)* > <!ELEMENT %name.para; %content.para; >

29 Official Grammar of ElementOnly content Models
[47] children ::= (choice | seq) ('?' | '*' | '+')? [48] cp ::= (Name | choice | seq) ('?' | '*' | '+')? [49] choice ::= '(' S? cp ( S? '|' S? cp )+ S? ')' [50] seq ::= '(' S? cp ( S? ',' S? cp )* S? ')' where each Name is the type of an element which may appear as a child. Examples: <!ELEMENT spec (front, body, back?)> <!ELEMENT div1 (head, (p | list | note)*, div2*)> <!ELEMENT dictionary-body p*> (x) Note: (x) <!ELEMENT spec body> (0) <!ELEMENT spec (body)> [49,50] has an additional VC: Proper Group/PE Nesting Validity constraint: Proper Group/PE Nesting 1. The ‘(‘ and ‘)’ in choice, seq, or Mixed constructs must occur in the same replacement text of a parameter entity. 2. For interoperability, if a parameter-entity reference appears in a choice, seq, or Mixed construct, its replacement text SHOULD contain at least one non-blank character, and neither the first nor last non-blank character of the replacement text SHOULD be a connector (| or ,).

30 Grammar of Mixed Content
Mixed-content Declaration
[51] Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*' | '(' S? '#PCDATA' S? ')'
Ex:
<!ELEMENT b (#PCDATA)>
<!ELEMENT p (ul , #PCDATA) > (x)
<!ELEMENT p ( ul | #PCDATA | ol) > (x)
<!ELEMENT p (#PCDATA | ul | ol) > (x)
<!ELEMENT p (#PCDATA | ul | ol)* > (o)
<!ELEMENT p ( #PCDATA |a |ul |b | i | em )*> (o)

31 (preface, toc, chapter+, index?) >
Attribute Definition ELEMENT declarations prescribe each element type that can appear in a document and define the permissible content of each element type. Ex: <!ELEMENT book (preface, toc, chapter+, index?) > To prescribe all attributes that can appear in the start tag of an element type, we use ATTLIST declaration.

32 ATTLIST declaration
To define the permissible attributes of the book element, we use:
1. <!ATTLIST book title CDATA #REQUIRED>
2. <!ATTLIST book isbn CDATA #IMPLIED>
1. and 2. can be merged as:
3. <!ATTLIST book title CDATA #REQUIRED isbn CDATA #IMPLIED >
Format: <!ATTLIST elm-name (attr-name attr-type attr-default-value)+ >
Note: an attribute has a name, a type and a default value, and belongs to an element.

33 Attribute types
Type                           Meaning
CDATA                          The value is character data.
(v1 | v2 | … | vk)             The value must be one of the listed name tokens v1 … vk.
ID                             The value is a unique id.
IDREF                          The value is a reference to an id.
IDREFS                         The value is a list of IDREFs.
NMTOKEN                        The value is a valid XML name token.
NMTOKENS                       The value is a list of name tokens.
ENTITY                         The value is the name of an (unparsed) entity.
ENTITIES                       The value is a list of names of (unparsed) entities.
NOTATION (v1 | v2 | … | vk)    The value must be one of the listed notation names v1 … vk.

34 Attribute-default value
Default value   Meaning
"v1"            The attribute has the default value v1; it can be overridden in the document.
#REQUIRED       The attribute must be given explicitly in the document.
#IMPLIED        The attribute does not have to appear in the document.
#FIXED "v1"     The attribute value is fixed to v1 and cannot be overridden. If specified in the document, its value must be "v1".

35 Attributes with default value
Ex1:
<!ELEMENT square EMPTY>
<!ATTLIST square width CDATA "0">
XML elements:
<square width="100"></square>
<square></square> <!-- has width attr with value 0 -->
Ex2:
<!ATTLIST payment type CDATA "check">
The XML elements below are equivalent:
<payment type="check"> … </payment>
<payment> … </payment>
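Attribute defaults can be observed with Python's bundled expat parser, which does not validate but does read the internal subset and report defaulted attributes to the start-element handler. This is a small sketch; the helper name root_attributes is invented for the illustration.

```python
from xml.parsers import expat

DOC = """<!DOCTYPE square [
  <!ELEMENT square EMPTY>
  <!ATTLIST square width CDATA "0">
]>
<square/>"""

def root_attributes(document: str) -> dict:
    """Return the attributes expat reports for the first element."""
    seen = []
    parser = expat.ParserCreate()
    # By default expat passes both specified and defaulted attributes.
    parser.StartElementHandler = lambda name, attrs: seen.append(dict(attrs))
    parser.Parse(document, True)
    return seen[0]

print(root_attributes(DOC))  # width is supplied by the ATTLIST default
print(root_attributes(DOC.replace("<square/>", '<square width="100"/>')))
```

An application reading the parsed document therefore cannot tell a defaulted width from one written out in the start-tag unless it asks the parser specifically.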

36 #IMPLIED attribute
Syntax: <!ATTLIST element-name attribute-name attribute-type #IMPLIED>
Ex: <!ATTLIST contact fax CDATA #IMPLIED>
Instances:
1. <contact fax=" " />
2. <contact />
Both 1 and 2 are valid, but they are not equivalent: in 1 the fax attribute is present; in 2 it is absent and has no value.

37 #REQUIRED attribute
Syntax: <!ATTLIST elm-name attr-name attr-type #REQUIRED>
Ex: <!ATTLIST person number CDATA #REQUIRED>
Instances:
<person number="5677"/>
<person/> <!-- (x) number attr must appear -->

38 #FIXED “value” attributes
Syntax: <!ATTLIST elm-name attr-name attr-type #FIXED "value" >
Ex: <!ATTLIST send sender CDATA #FIXED "Chen" >
Instances:
1. <send sender="Chen">…</send>
2. <send>…</send>
3. <send sender="Wang">…</send> (x)
Notes: 1. and 2. are equivalent. 3. is invalid.

39 Official Grammar of Attribute-List Declarations
[52] AttlistDecl ::= '<!ATTLIST' S Name AttDef* S? '>' [53] AttDef ::= S Name S AttType S DefaultDecl XML attribute types are classified into three kinds: string type (CDATA), enumerated types – name tokens or notations tokenized types (ID, IDREF,IDREFS, NMTOKEN…).

40 3.3.1 Attribute Types
[54] AttType ::= StringType | TokenizedType | EnumeratedType
[55] StringType ::= 'CDATA'
[56] TokenizedType ::= 'ID' | 'IDREF' | 'IDREFS' | 'ENTITY' | 'ENTITIES' | 'NMTOKEN' | 'NMTOKENS'
Notes:
ID, IDREF and IDREFS are used for cross references.
ENTITY and ENTITIES refer to external unparsed objects.
NMTOKEN and NMTOKENS restrict the attribute value to one or more Nmtokens.

41 ID and IDREF(S)
If an attribute is of type ID, the value of every occurrence of this attribute must be unique among all ID attribute values of the whole document. Every IDREF value (and every name in an IDREFS value) must match some ID value in the document.
Ex:
<!ATTLIST person age NMTOKEN #IMPLIED name ID #REQUIRED father IDREF #IMPLIED children IDREFS #IMPLIED >
<!ATTLIST student sid ID #REQUIRED >
Instance:
<family>
  <person name="p1" /><person name="p2" age="3" />
  <student sid="p1" /> <student sid="s2" />
  <person name="p0" father="p1" children="p2 s2 3"/>
</family>
name="p1" and sid="p1" violate the ID constraint. The token 3 in children="p2 s2 3" is also invalid: it matches no ID.
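The uniqueness and referential checks on ID and IDREF(S) attributes can be sketched in a few lines of Python. This is an illustration only: ElementTree does not validate, so the attribute types are copied by hand into a table (ATTR_TYPES, a name invented here), and the function check_ids is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute types taken from the two ATTLIST declarations in the example.
ATTR_TYPES = {
    ("person", "name"): "ID",
    ("person", "father"): "IDREF",
    ("person", "children"): "IDREFS",
    ("student", "sid"): "ID",
}

def check_ids(root) -> list:
    """Return a list of ID / IDREF(S) violations found under root."""
    errors, ids, refs = [], set(), []
    for elem in root.iter():
        for attr, value in elem.attrib.items():
            kind = ATTR_TYPES.get((elem.tag, attr))
            if kind == "ID":
                if value in ids:
                    errors.append(f"duplicate ID {value!r}")
                ids.add(value)
            elif kind in ("IDREF", "IDREFS"):
                refs.extend(value.split())
    # References are checked last because an ID may be declared later.
    errors.extend(f"dangling reference {r!r}" for r in refs if r not in ids)
    return errors

FAMILY = ET.fromstring(
    '<family>'
    '<person name="p1"/><person name="p2" age="3"/>'
    '<student sid="p1"/><student sid="s2"/>'
    '<person name="p0" father="p1" children="p2 s2 3"/>'
    '</family>'
)
print(check_ids(FAMILY))
```

All ID attributes share one namespace per document, which is why a person name and a student sid can clash.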

42 Notation
A notation in XML is a name used to identify a specific type of (non-XML) data such as ppt, pdf, word, jpeg, gif, etc.
Each notation must be declared and is associated with a system identifier and/or a public identifier.
<!NOTATION pdf PUBLIC "-//adobe//portable doc format//EN" " >
We may limit the value of an attribute to a notation name from a list of declared notation names.

43 3.3.1 Enumerated Attribute Types
[57] EnumeratedType ::= NotationType | Enumeration [58] NotationType ::= 'NOTATION' S '(' S? Name (S? '|' S? Name)* S? ')' [59] Enumeration ::= '(' S? Nmtoken (S? '|' S? Nmtoken)* S? ')’ [58] is used to limit the attribute value to be one of the listed notation names.

44 Enumerated attribute values
Syntax: <!ATTLIST elm-name attr-name (v1|v2|..) def-value>
Ex:
<!ATTLIST payment type (check|cash) "cash">
<!ATTLIST light color (red | yellow | green) #IMPLIED>
Instances:
<payment type="check"/> <payment type="cash"/>
<light color='red'/> <light/>

45 NOTATION attribute values
Syntax: <!ATTLIST elm-name attr-name NOTATION (v1|v2|..) def-value>
Ex:
<!ATTLIST file fileType NOTATION (word | pdf | ps) "word">
<!ATTLIST img type NOTATION (gif | png | bitmap) #IMPLIED>
Note: each notation (pdf, ps, gif, …) must be declared with <!NOTATION … > before it can be used.
Instances:
<file fileType="pdf"/>
<img type="gif"/>

46 Grammar of Attribute Defaults
[60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED' | (('#FIXED' S)? AttValue) Ex: <!ATTLIST termdef id ID #REQUIRED name CDATA #IMPLIED > <!ATTLIST file format NOTATION (ps | pdf | word ) #REQUIRED > <!ATTLIST list type (bullets | ordered | glossary) "ordered"> <!ATTLIST form method CDATA #FIXED "POST">

47 White Space and End-of-line Handling
The special attribute xml:space indicates whether white space in an element's content should be preserved by applications.
<!ATTLIST poem xml:space (default | preserve) 'preserve'>
White space inside tags is not significant; the following are equivalent (_ marks a space):
<e1_v1="abc"_ _ v2 ="def"_ _ _/>
<e1_ _v1="abc"_v2 ="def"/>
Every XML document must be normalized for end-of-line before parsing, to eliminate differences between operating systems:
#xD #xA --> #xA // \r\n replaced by \n
#xD --> #xA // a lone \r replaced by \n
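End-of-line normalization is a single substitution. A minimal sketch in Python, with the function name normalize_eol chosen here for illustration:

```python
import re

def normalize_eol(text: str) -> str:
    """Replace every \\r\\n pair and every lone \\r with a single \\n."""
    return re.sub(r"\r\n?", "\n", text)

# Windows, classic Mac OS and Unix line endings become the same text.
print(repr(normalize_eol("line1\r\nline2\rline3\nline4")))
```

The pattern tries the two-character form first, so a \r\n pair yields one line feed rather than two.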

48 2.12 Language Identification
The special attribute xml:lang may be inserted in documents to specify the language used inside an element.
In valid documents, this attribute, like any other, must be declared if it is used.
The values of the attribute are language identifiers as defined by [IETF RFC 1766], "Tags for the Identification of Languages".
Example:
<!ATTLIST poem xml:lang NMTOKEN 'fr'>
<!ATTLIST gloss xml:lang NMTOKEN 'en'>
<!ATTLIST note xml:lang NMTOKEN 'en'>

49 2.12 Language Identifications
<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p> <p xml:lang="en-GB">What colour is it?</p> <p xml:lang="en-US">What color is it?</p> <poem author=“李白” title=“將進酒” xml:lang=“zh-TW”> 君不見黃河之水天上來,奔流到海不復回。 君不見高堂明鏡悲白髮,朝如青絲暮成雪。 人生得意須盡歡,莫使金樽空對月。 天生我材必有用,千金散盡還復來。 烹羊宰牛且為樂,會須一飲三百杯。岑夫子,丹丘生,將進酒,杯莫停。 與君歌一曲,請君為我傾耳聽。鐘鼓饌玉何足貴,但願長醉不願醒。 古來聖賢皆寂寞,唯有飲者留其名。 陳王昔時宴平樂,斗酒十千恣歡謔。主人何為言少錢,徑須沽取對君酌。 五花馬,千金裘,呼兒將出換美酒,與爾同銷萬古愁。 </poem>

50 <poem author=“李白” title=“将进酒” xml:lang=“zh-CN”>
君不见黄河之水天上来,奔流到海不复回。 君不见高堂明镜悲白发,朝如青丝暮成雪。 人生得意须尽欢,莫使金樽空对月。 天生我材必有用, 千金散尽还复来。 烹羊宰牛且为乐,会须一饮三百杯。 岑夫子,丹丘生,将进酒,君莫停。 与君歌一曲, 请君为我侧耳听。 钟鼓馔玉不足贵,但愿长醉不愿醒。 古来圣贤皆寂寞,惟有饮者留其名。 陈王昔时宴平乐, 斗酒十千恣欢谑。 主人何为言少钱,径须沽取对君酌。 五花马,千金裘,呼儿将出换美酒,与尔同销万古愁。 </poem>

51 DTD-Entities
Entities are used to define shortcuts to common text, like macros in programming languages.
Entity references are references to entities: if name is the name of an entity, its reference is &name; (for a general entity) or %name; (for a parameter entity), never both.
Entities can be declared internal (content given in the declaration itself) or external (content stored outside the declaration).
Two more classifications follow later.

52 Internal Entity Declaration
Syntax: <!ENTITY entity-name "entity-value">
DTD example:
<!ENTITY p1 "Peter">
<!ENTITY birthday "2/12/2000">
XML example: <baby>&p1; &birthday;</baby>
Equivalent to: <baby>Peter 2/12/2000</baby>
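The expansion of internal entities can be observed with Python's xml.etree.ElementTree, which expands internal general entities declared in the internal subset. This is a small sketch; the surrounding DOCTYPE wrapper is added here so that the example is a complete document.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE baby [
  <!ENTITY p1 "Peter">
  <!ENTITY birthday "2/12/2000">
]>
<baby>&p1; &birthday;</baby>"""

baby = ET.fromstring(DOC)
# The application sees the replacement text, not the references.
print(baby.text)
```

Since the parser performs the substitution, the application cannot tell whether the text was typed out or came from entities.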

53 External Entity Declaration
advantage: reuse; more modular Syntax: <!ENTITY entity-name SYSTEM “aURI"> DTD Example: <!ENTITY writer SYSTEM " <!ENTITY copyright SYSTEM " XML example: <author>&writer;&copyright;</author>

54 Some large DTD Examples
XHTML 1.0 XHTML 1.0 DTD DocBook 5.0 DTD docbook.dtd SVG 1.1 svg 1.1 dtd

55 Structure of XML Documents
Logical Structure Elements Character data Physical Structure Entities External parsed entity Document entity Document Unit Sub-unit External unparsed entity

56 Each XML document has one entity called the document entity,
4. Physical Structures An XML document may consist of one or many storage units called entities; have content identified by name. may have an associated URI Each XML document has one entity called the document entity, the starting entity for the XML processor and may contain the whole document. the only kind of entities without a name. Entities may be either parsed or unparsed. unparsed --> not to be analyzed to XML processors. used for non-xml data (e.g. image file).

57 Properties of an entity
entity name: every entity except the document entity has a name.
entity reference: if xxx is the name of an entity, then &xxx; (or %xxx;) is its entity reference.
content:
  replacement text: the text to be substituted for all occurrences of its reference
  entity value: the literal value appearing in an entity declaration
internal or external:
  external: content comes from an external file
  internal: content is given as part of its declaration

58 general or parameter, parsed or unparsed
general or parameter:
  general: referenced and expanded in the document region
  parameter: expanded in the DTD region; hence its references can appear in the DTD region only
parsed or unparsed:
  parsed: part of an XML document
  unparsed: non-XML data, or XML data not intended to be processed by the XML parser. Unparsed entities are always external and general.
Note: since unparsed entities must be general and external, there are only 5 kinds of entities.

59 Parsed entity and unparsed entity
An unparsed entity is a resource whose contents are not to be processed by XML processor. has an associated notation, identified by name. must be an external entity (with publicId and/or SystemId) referenced by [entity] name (instead of entity reference) occurring only in the value of ENTITY or ENTITIES attributes. Parsed entities are entities whose contents need to be processed by XML Processor. referenced by using entity references. contents are referred to as its replacement text;

60 external general parsed entity. internal general parsed entity
Examples external general parsed entity. <!ENTITY legal SYSTEM " internal general parsed entity <!ENTITY nccu “National Chengchi University”> internal parameter parsed entity <!ENTITY % colorValues “(male | female)”> external general unparsed entity. <!NOTATION PDF SYSTEM “ <!ENTITY cover1 SYSTEM NDATA PDF> Note: Notation and unparsed entity are rarely used in practice.

61 Example of unparsed entity usage
<! DOCTYPE BookCategory [... <!ATTLIST BOOK cover ENTITY #REQUIRED format NOTATION (pdf | word | ps) “word” > <!NOTATION pdf SYSTEM “ <!ENTITY cover1 SYSTEM NDATA pdf> ]> ... <BookCategory> <BOOK cover=“cover1” format=“pdf” > … </BOOK> Type of cover1

62 General entity and parameter entity
Parameter entities are parsed entities for use in grammar (DTD ). referenced by the form: %name; General entities are entities for use in the document content. sometimes simply called entity. referenced by the form: &name; Comparisons: use different syntax in DTD for definition. use different forms of references recognized in different contexts (grammar v.s. data).

63 external general parsed entity. internal general parsed entity
Examples external general parsed entity. <!ENTITY legal SYSTEM " internal general parsed entity <!ENTITY nccu “National Chengchi University”> internal parameter parsed entity <!ENTITY % colorValues “(male | female)”> external parameter parsed entity. <!ENTITY % html40 SYSTEM " Notes: All parameter entities are parsed entities Parameter entities carry grammar information. General entities carry data contents.

64 4.1 Character and Entity References
A character reference refers to a specific character in the ISO/IEC 10646 character set, by its code point in decimal or hexadecimal.
Character Reference
[66] CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
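Both forms of character reference name a code point, once in decimal and once in hexadecimal. A short sketch using Python's xml.etree.ElementTree; the element name c is arbitrary.

```python
import xml.etree.ElementTree as ET

# &#65; (decimal) and &#x41; (hexadecimal) both denote "A";
# &#xA9; denotes the copyright sign.
elem = ET.fromstring("<c>&#65;&#x41;&#xA9;</c>")
print(elem.text)
print([hex(ord(ch)) for ch in elem.text])
```

No declaration is needed for character references, unlike entity references.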

65 4.1 Character and Entity References (cont’d)
[67] Reference ::= EntityRef | CharRef [68] EntityRef ::= '&' Name ';' [69] PEReference ::= '%' Name ';’

66 4.2 Entity Declarations
[70] EntityDecl ::= GEDecl | PEDecl
[71] GEDecl ::= '<!ENTITY' S Name S EntityDef S? '>'
[72] PEDecl ::= '<!ENTITY' S '%' S Name S PEDef S? '>'
[73] EntityDef ::= EntityValue[9] | ( ExternalID NDataDecl?)
[74] PEDef ::= EntityValue | ExternalID
Notes:
1. General entities can be referenced only in the non-DTD region.
2. Parameter entities are referenced in the DTD.
3. In [73], an EntityValue declares an internal entity, an ExternalID alone an external (parsed) entity, and an ExternalID followed by NDataDecl an unparsed entity.

67 Review of important literals
[9] EntityValue ::= ‘”’ ([^%&”] | PEReference | Reference)* ‘”’ | “’” ([^%&'] | PEReference | Reference)* “’” [10] AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'" [11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'") [12] PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'" [13] PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] |

68 4.2.1 Internal Entities
An entity defined by an EntityValue is called an internal entity.
The content of the entity is given in the declaration; there is no separate physical storage object.
Some processing of entity and character references in the literal entity value may be required to produce the correct replacement text.
An internal entity is always a parsed entity.
Example of an internal entity declaration:
<!ENTITY Pub-Status "This is a pre-release of the specification.">

69 4.2.2 External Entities
If the entity is not internal, it is an external entity.
External Entity Declaration
[75] ExternalID ::= 'SYSTEM' S SystemLiteral[11] | 'PUBLIC' S PubidLiteral S SystemLiteral
[76] NDataDecl ::= S 'NDATA' S Name [ VC: Notation Declared ]
If the NDataDecl is present, this is a general unparsed entity; otherwise it is a parsed entity.
[VC: Notation Declared]: the Name must match the declared name of a notation.
SystemLiteral is called the entity's system identifier, which is a URI.
PubidLiteral is called the entity's public identifier, which the XML processor may use to produce an alternative URI.

70 Examples of external entity declaration
<!ENTITY open-hatch SYSTEM " <!ENTITY open-hatch PUBLIC "-//Textuality//TEXT Standard open-hatch boilerplate//EN" " > <!ENTITY hatch-pic SYSTEM "../grafix/OpenHatch.gif" NDATA gif >

71 4.3 Parsed Entities 4.3.1 The Text Declaration
External parsed entities may each begin with a text declaration. Text Declaration [77] TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>' Notes: must be placed at the beginning of an external parsed entity if appearing.

72 4.3.2 Well-formed Parsed Entities
The document entity is well-formed if it matches the production labeled document[1] . An external general parsed entity is well-formed if it matches the production labeled extParsedEnt[78] . All external parameter entities are well-formed by definition. Well-Formed External Parsed Entity [78] extParsedEnt ::= TextDecl? content

73 4.3.2 Well-Formed Parsed Entities (cont’d)
An internal general parsed entity is well-formed if its replacement text matches the production labeled content[43].
All internal parameter entities are well-formed by definition.
A consequence of well-formedness in entities: the logical and physical structures in an XML document are properly nested; i.e., no start-tag, end-tag, empty-element tag, element, comment, processing instruction, character reference, or entity reference can begin in one entity and end in another (whatever begins in an entity must also end in it).

74 Examples
<!ENTITY e1 "<a a1='b1' " > (x)
<!ENTITY e2 " a2='b2' > " > (x)
<!ENTITY e3 "<a a1='b1' a2='b2' >" > (x)
<!ENTITY e4 "</a>" > (x)
&e1;&e2;test&e4; <!-- not well-formed -->
&e3;test&e4; <!-- not well-formed -->
<!ENTITY ok1 "<a a1='b1'> test </a>" > (o)
<!ENTITY ok2 "<a a1='b1'> &cnt; </a>" > (o)
<!ENTITY cnt "test" > (o)
&ok1; <!-- well-formed -->
&ok2; <!-- well-formed -->

75 4.3.3 Character Encoding in Entities
External parsed entities may each use a different encoding for their characters.
All XML processors must support UTF-8 and UTF-16.
An entity in any encoding other than UTF-8 or UTF-16 must declare its encoding in a text declaration.
Encoding Declaration
[80] EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
[81] EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')* /* Encoding name contains only Latin characters */
Examples: <?xml encoding='UTF-8'?> <?xml encoding='Big-5'?>

76 4.4 XML Processor Treatment of Entities and References
The contexts in which character references, entity references, unparsed entity names and notation names may appear: Reference in Content[43] : <e>…</e> in Attribute Value [10] : <e name1="…" > in Entity Value[9] : <!ENTITY e "…" > in DTD[28a] : <!DOCTYPE root SYSTEM "aURI" [ … ]> [Name] Occurs as Attribute Value [10] : <e name1="…" >

77 Context in which entity or character references may occur
1. Reference in Content: as a reference in content. Ex: <p>He said: &WhatHeSaid; </p>
2. Reference in Attribute Value: as a reference within either the value of an attribute in a start-tag or a default value in an attribute declaration; corresponds to the nonterminal AttValue.
   Ex: <A HREF='&home;/start.html'>
   Ex: <!ATTLIST A HREF CDATA '&home;/index.html'>
3. Occurs as Attribute Value: as a Name, not a reference, appearing as the value of an attribute declared as type ENTITY, ENTITIES or NOTATION.

78 4.4 Context in which entity or character references may occur
ex: <!ENTITY Apicture SYSTEM " NDATA GIF> <!ATTLIST World src ENTITY #REQUIRED> … <World src=’Apicture'> 4. Reference in Entity Value : as a reference in rule EntityValue. ex: <!ENTITY PLX "Perl &heart; XML!"> <!ENTITY % penty "Perl %x1;&x2; XML! "> 5. Reference in DTD : as a reference in internal or external subsets of the DTD, but outside of any EntityValue or AttValue. ex: <!ELEMENT %Para; (#PCDATA| %ParaBits; )*> %manyElements; <!ATTLIST … >

79 Example : Contexts in which entities or entity references occur
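<?xml version="1.0" ?>
<!DOCTYPE example SYSTEM "aURI" [
  <!ENTITY gEnty "_&gEnty2;(4)AAA%pEnty3;(4)_ ">
  <!NOTATION gif SYSTEM "gif uri" >
  <!ENTITY picEnty SYSTEM "aURI" NDATA gif>
  %e1Pe;(5)
  <!ATTLIST e1 icon ENTITY #FIXED 'picEnty'(3) name CDATA "__&refInAttvalue;(2)___" >
]>
<example> … &gEnty;(1) … <e1 icon="picEnty"(3) name="another &referenceInAttributeValue;(2)" /> &ReferenceInContent;(1) … </example>
Notes: the numbers in parentheses are annotations, not part of the document. They mark the five contexts: (1) reference in content, (2) reference in attribute value, (3) occurs as attribute value, (4) reference in entity value, (5) reference in DTD.
In the internal subset, a parameter-entity reference may appear only between markup declarations, as %e1Pe; does here, and not inside one. A reference such as %pEnty3; inside an entity value is therefore legal only in the external subset.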

80 4.4 Summary on entities
internal vs. external:
  internal ==> content given in the declaration
  external ==> content obtained from outside the declaration
  ex1: <!ENTITY Pub-Status "this is …">
  ex2: <!ENTITY % book-format SYSTEM " >
  ex3: <!ENTITY book1 SYSTEM "bybook.doc" NDATA WORD>
general vs. parameter entities:
  general ==> used in the document instance
  parameter ==> used in the document type declaration (DTD)
  ex1, ex3 ==> general; ex2 ==> parameter entity
parsed vs. unparsed entities:
  parsed ==> the XML processor will parse it ==> ex1, ex2
  unparsed ==> the XML processor need not parse it ==> ex3
Note: unparsed entities must be general and external.

81 4.5 Construction of Internal Entity Replacement Text
Two forms of the entity's value of an internal entity. literal entity value : the quoted string actually present in the entity declaration, corresponding to the non-terminal EntityValue. replacement text : the content of the entity, after replacement of character references and parameter-entity references. Notes: 1. General-entity references in literal entity value are not expanded to produce replacement text . 2. It is the replacement text of the entity that is substituted for every occurrence of its entity reference.

82 <!ENTITY % pub "Éditions Gallimard" >
4.5 Example <!ENTITY % pub "Éditions Gallimard" > <!ENTITY rights "All rights reserved" > <!ENTITY book "La Peste: Albert Camus, © 1947 %pub;. &rights;" > => Entity book has replacement text: “La Peste: Albert Camus, © 1947 Éditions Gallimard. &rights;” Note: No forward reference for PE is permitted. Hence entity ‘book’ could not be put before ‘pub’ entity.

83 Rules: from internal entity value to replacement text
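Each rule moves the cursor -| through the literal entity value: the text to its left is finished output, the text to its right is still to be read.
  normal character (c matches [^%&"'], or is a ' or " treated as data): a -| c b ==> a c -| b
  character reference (Included): a -| &#xxxx; b ==> a ch(xxxx) -| b
  parameter-entity reference (Included in Literal): a -| %pe; b ==> a -| rptxt'(pe) b
  general-entity reference (Bypassed): a -| &ge; b ==> a &ge; -| b
If -| b ==>* a -| e (where e is the empty string), then define rptxt(b) = a.
Notations:
  ch(xxxx): the character with code point #xxxx
  rptxt(entity): the replacement text of a general or parameter entity
  rptxt'(e): rptxt(e) with ' and " treated as normal literal characters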
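The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name replacement_text and the dictionary of parameter entities are hypothetical, and the entity values use the character references &#xC9; and &#xA9; instead of the literal É and © of the 4.5 example, so that the character-reference rule is exercised as well.

```python
import re

# The two kinds of reference that are expanded inside a literal entity value:
# character references and parameter-entity references.
EXPANDED = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|%([A-Za-z_:][\w.:-]*);")


def replacement_text(literal, parameter_entities):
    """Return the replacement text of a literal entity value.

    parameter_entities maps PE names to their replacement text
    (a hypothetical representation chosen for this sketch).
    """
    done = []        # text to the left of the cursor
    rest = literal   # text to the right of the cursor
    while rest:
        match = EXPANDED.match(rest)
        if match is None:
            # Normal character, or part of a bypassed general-entity reference.
            done.append(rest[0])
            rest = rest[1:]
        elif match.group(3) is not None:
            # PE reference: splice its replacement text in after the cursor.
            rest = parameter_entities[match.group(3)] + rest[match.end():]
        else:
            # Character reference: emit the referenced character.
            hex_digits, dec_digits = match.group(1), match.group(2)
            code = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
            done.append(chr(code))
            rest = rest[match.end():]
    return "".join(done)


pub = replacement_text("&#xC9;ditions Gallimard", {})
book = replacement_text(
    "La Peste: Albert Camus, &#xA9; 1947 %pub;. &rights;", {"pub": pub}
)
print(book)
```

The general-entity reference &rights; is copied through unchanged, which is the Bypassed rule; it would only be expanded later, when &book; is referenced in the document.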

84 internal parsed (general/parameter) entity
Contents of entities
internal parsed (general/parameter) entity:
  literal entity value: the quoted string (a) defined by the rules of EntityValue
  replacement text: rptxt(a)
external parsed (general/parameter) entity:
  literal entity value: the whole text in the entity
  replacement text: same as the entity value, with the optional text declaration stripped

85 Rules: (note the asymmetry b/t char and non-char inclusion)
4.4.2 Included
An entity is said to be included when its replacement text is retrieved and processed, in place of the reference itself, as though it were part of the document at the location the reference was recognized. The replacement text may contain character data (and markup, if it is a general entity), which must be recognized in the usual way.
Rules (note the asymmetry between character and non-character inclusion):
  a -| &#dddd; b ==> a ch(dddd) -| b // char inclusion
  a -| &ge; b ==> a -| rptxt(ge) b // ge inclusion
  a -| %pe; b ==> a -| rptxt(pe) b // pe inclusion; not used in xml processing

86 <!ENTITY AC "The &W3C; Advisory Council">
Example
Ex:
  <!ENTITY AC "The &W3C; Advisory Council">
  <!ENTITY W3C "WWW Consortium">
  -| &AC;</x>
  ==> -| The &W3C; Advisory Council</x>
  ==>* The -| &W3C; Advisory Council</x>
  ==> The -| WWW Consortium Advisory Council</x>
==> So, if <e1 at1="aaa&AC;zzz"/>, then e1 has attribute at1 with the value "aaaThe WWW Consortium Advisory Councilzzz".
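The same expansion can be observed with a real parser. The sketch below uses Python's standard xml.etree.ElementTree module, which expands internal general entities declared in the internal subset; the element name x and the attribute name at1 are chosen here only for illustration.

```python
import xml.etree.ElementTree as ET

# AC refers to W3C before W3C is declared. That is fine for general entities:
# the reference is bypassed in the entity value and only expanded on use.
DOCUMENT = """<!DOCTYPE x [
  <!ENTITY AC "The &W3C; Advisory Council">
  <!ENTITY W3C "WWW Consortium">
]>
<x at1="aaa&AC;zzz">&AC;</x>"""

root = ET.fromstring(DOCUMENT)
content = root.text              # expansion in element content
attribute = root.get("at1")      # expansion in an attribute value
print(content)
print(attribute)
```

No space appears around the expanded text in the attribute value: only parameter entities referenced in the DTD are padded with spaces.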

87 3.3.3 Attribute-value normalization
When: after end-of-line processing, but before the value is passed to the application.
0. End-of-line processing (#xD #xA ==> #xA; a lone #xD ==> #xA)
1. Start with an empty normalized value, nv = "", and repeat until the end of the input:
   character reference ==> append the referenced character to nv (e.g., &#38; ==> '&')
   entity reference ==> include it: recursively apply step 1 to the replacement text of the entity
   white space character (#x20, #xD, #xA, #x9) ==> append a space character (#x20) to nv
   any other character ==> append the character to nv
2. If the attribute type is not CDATA ==> remove leading and trailing spaces and replace each sequence of space (#x20) characters with a single space (#x20) character.
Notes:
1. Character references and entity references are not treated alike: a character reference to white space is appended unchanged, whereas white space reached through an entity reference becomes a space.
2. White space characters are normalized to spaces.
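The steps above can be sketched in Python as follows. This is a minimal illustration, not a parser: the function names are hypothetical, the entities argument is assumed to map each general entity name to its already computed replacement text, and error handling (undeclared entities, a < in the value) is left out.

```python
import re

WHITE_SPACE = "\x20\x0d\x0a\x09"
REFERENCE = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([A-Za-z_:][\w.:-]*);")


def _append_normalized(text, entities, out):
    """Step 1: scan text and append normalized characters to out."""
    pos = 0
    while pos < len(text):
        match = REFERENCE.match(text, pos)
        if match is None:
            char = text[pos]
            # White space becomes a single space; other characters are kept.
            out.append(" " if char in WHITE_SPACE else char)
            pos += 1
            continue
        hex_digits, dec_digits, name = match.groups()
        if name is not None:
            # Entity reference: recursively process its replacement text.
            _append_normalized(entities[name], entities, out)
        else:
            # Character reference: append the referenced character unchanged.
            code = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
            out.append(chr(code))
        pos = match.end()


def normalize_attribute(raw, entities, cdata=True):
    """Return the normalized value of the attribute value literal raw."""
    # Step 0: end-of-line processing applies to the literal input only.
    raw = raw.replace("\r\n", "\n").replace("\r", "\n")
    out = []
    _append_normalized(raw, entities, out)
    value = "".join(out)
    if not cdata:
        # Step 2: trim spaces and collapse runs of spaces.
        value = re.sub(" +", " ", value.strip(" "))
    return value


ENTITIES = {"d": "\r", "a": "\n", "da": "\r\n"}
print(repr(normalize_attribute("&d;&d;A&a;&a;B&da;", ENTITIES)))
print(repr(normalize_attribute("&d;&d;A&a;&a;B&da;", ENTITIES, cdata=False)))
```

Step 2 removes only #x20 characters, so a carriage return or line feed that entered through a character reference survives even for a non-CDATA attribute.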

88 Rules: attribute value normalization
-- a.k.a. from AttValue to normalized attribute value.
  normal char (c matches [^<&"'] and is not white space, or is a ' or " treated as data): a -| c b ==> a c -| b
  char reference (Included): a -| &#xxxx; b ==> a ch(xxxx) -| b
  (internal) ge reference (Included in Literal): a -| &ge; b ==> a -| rptxt'(ge) b
  white space, where w is one of (#x20, #xD, #xA, #x9) and D is a space: a -| w b ==> a D -| b
If -| b ==>* a -| e (where e is the empty string), then define nv1(b) = a.
  if CDATA ==> nv(b) = nv1(b)
  otherwise ==> nv(b) = nv1(b) with leading and trailing spaces removed and each sequence of space (#x20) characters replaced by a single space (#x20) character.

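89 <!ENTITY d "&#xD;"> => rptxt(d) = [cr] since -| &#xD; ==> [cr] -|
Examples
  <!ENTITY d "&#xD;"> => rptxt(d) = [cr] since -| &#xD; ==> [cr] -|
  <!ENTITY a "&#xA;"> => rptxt(a) = [lf] since -| &#xA; ==> [lf] -|
  <!ENTITY da "&#xD;&#xA;"> => rptxt(da) = [cr][lf] since -| &#xD;&#xA; ==> [cr] -| &#xA; ==> [cr][lf] -|
In the results below, [ ] is a space (#x20), [cr] is #xD and [lf] is #xA.
Attribute spec: att="[cr][lf][cr][lf]xyz" (two literal line breaks before xyz)
  after end-of-line processing: [lf][lf]xyz
  att is CDATA: "[ ][ ]xyz"
  att is NMTOKEN(S) (normalize non-CDATA type): "xyz"
Attribute spec: att="&d;&d;A&a;&a;B&da;"
  att is CDATA: "[ ][ ]A[ ][ ]B[ ][ ]"
  att is NMTOKEN(S): "A[ ]B"
Attribute spec: att="&#xD;&#xD;A&#xA;&#xA;B&#xD;&#xA;"
  att is CDATA: [cr][cr]A[lf][lf]B[cr][lf]
  att is NMTOKEN(S): [cr][cr]A[lf][lf]B[cr][lf]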

90 Same as Included except that
4.4.5 Included in Literal
Same as Included, except that a single or double quote character in the replacement text is always treated as a normal data character and will not terminate the literal.
Additional rules:
  a -| ' b ==> a ' -| b
  a -| " b ==> a " -| b
Example: this is well-formed:
  <!ENTITY % YN '"Yes"' >
  <!ENTITY WhatHeSaid "He said %YN;" >
while this is not:
  <!ENTITY EndAttr "27'" >
  <element attribute='a-&EndAttr;>

91 4.4.8 Included as PE
Same as Included, but the replacement text is enlarged by the attachment of one leading and one following space (#x20) character.
rule: a -| %pe; b ==> a -| D rptxt(pe) D b // where D is a space
ex:
  1. <!ENTITY % contents "ANY">
  2. <!ELEMENT e1%contents;>
Declaration 2 is equivalent to <!ELEMENT e1 ANY > rather than <!ELEMENT e1ANY>.

92 4.4 XML Processor Treatment of Entities and References

93 4.6 Predefined Entities
Entity and character references can both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (amp, lt, gt, apos, quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and must be treated as character data, so the numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.
  1. <!ENTITY lt "&#38;#60;"> // <
  2. <!ENTITY amp "&#38;#38;"> // &
  3. <!ENTITY gt "&#62;"> // >
  4. <!ENTITY apos "&#39;"> // '
  5. <!ENTITY quot "&#34;"> // "
Double escaping is required for < and & so that the replacement text is well-formed; it is harmless but not needed for >, ' and ".
ex: With amp declared as in 2., rptxt(amp) = "&#38;", so "-| AT&amp;T" ==>* "AT -| &amp;T" ==> "AT -| &#38;T" ==> "AT& -| T". If 2. were declared as "&#38;" instead, rptxt(amp) = "&", so "-| AT&amp;T" ==>* "AT -| &amp;T" ==> "AT -| &T" ==> error.
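The need for double escaping can be checked with Python's standard xml.etree.ElementTree parser. In this sketch the entity names safe and unsafe are hypothetical stand-ins for the two ways of declaring amp discussed above; the predefined entities themselves are not redeclared.

```python
import xml.etree.ElementTree as ET

# "safe" is double escaped, so its replacement text is "&#38;".
# "unsafe" is escaped once, so its replacement text is a bare "&".
TEMPLATE = """<!DOCTYPE t [
  <!ENTITY safe "&#38;#38;">
  <!ENTITY unsafe "&#38;">
]>
<t>AT&{name};T</t>"""


def parse_text(name):
    """Parse the template with a reference to the named entity."""
    return ET.fromstring(TEMPLATE.format(name=name)).text


safe_text = parse_text("safe")

try:
    parse_text("unsafe")
    unsafe_rejected = False
except ET.ParseError:
    # The replacement text "&" is not well-formed content.
    unsafe_rejected = True

print(safe_text, unsafe_rejected)
```

Declaring unsafe is not an error in itself; the parser only complains when the entity is referenced and its replacement text has to be parsed.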

94 From content to next character data in the content
  normal character (c matches [^&<]): a -| c b ==> a c -| b // after EOL processing
  character reference (Included): a -| &#xxxx; b ==> a ch(xxxx) -| b
  (internal or external) general-entity reference (Included): a -| &ge; b ==> a -| rptxt(ge) b
  beginning of markup (end of character data): a -| < b ==> (stop)
If -| b ==>* a -| e (where e is the empty string) or -| b ==>* a -| <g, then define nxt(b) = a.
Notation: nxt(cnt): the next character data of cnt, where cnt is a text that satisfies the grammar rule content.

95 4.7 Notation Declarations
Notations identify by name the format of unparsed entities, e.g., GIF, JPEG, DOC, BMP, …
Notation Declarations
  [82] NotationDecl ::= '<!NOTATION' S Name S (ExternalID | PublicID) S? '>'
  [83] PublicID ::= 'PUBLIC' S PubidLiteral
4.8 Document Entity
The document entity serves as the root of the entity tree and a starting point for an XML processor. Unlike other entities, the document entity has no name and might well appear on a processor input stream without any identification at all.

96 Appendix D. Expansion of Entity and Character References
<!ENTITY example "<p>An ampersand (&#38;#38;) may be escaped numerically (&#38;#38;#38;) or with a general entity (&amp;amp;).</p>" >
==> Entity example has the value (replacement text):
<p>An ampersand (&#38;) may be escaped numerically (&#38;#38;) or with a general entity (&amp;amp;).</p>
A reference in the document to "&example;" causes the text to be reparsed:
==> a p element with the content: An ampersand (&) may be escaped numerically (&#38;) or with a general entity (&amp;).
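The two rounds of expansion can be reproduced with Python's standard xml.etree.ElementTree parser. The sketch below uses the declaration as it appears in Appendix D of the XML 1.0 Recommendation, with three layers of escaping in the literal entity value; the wrapper element d is an assumption made only to have a document to parse.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE d [
<!ENTITY example "<p>An ampersand (&#38;#38;) may be escaped numerically (&#38;#38;#38;) or with a general entity (&amp;amp;).</p>">
]>
<d>&example;</d>"""

root = ET.fromstring(DOCUMENT)
# The replacement text is reparsed on reference, so <p> becomes a real element.
paragraph = root.find("p")
print(paragraph.text)
```

Each layer of &#38; is consumed by one round of processing: the first when the declaration is read, the second when the reference &example; is expanded in the document.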

97 3 <!ELEMENT test (#PCDATA) > 4 <!ENTITY % xx '&#37;zz;'>
D. More complex example
  1 <?xml version='1.0'?>
  2 <!DOCTYPE test [
  3 <!ELEMENT test (#PCDATA) >
  4 <!ENTITY % xx '&#37;zz;'>
  5 <!ENTITY % zz '&#60;!ENTITY tricky "error-prone" >' >
  6 %xx;
  7 ]>
  8 <test>This sample shows a &tricky; method.</test>
line 4 => the character reference &#37; is expanded at once, so xx has the value "%zz;"
line 5 => the character reference &#60; is expanded at once, so zz has the value '<!ENTITY tricky "error-prone" >'
line 6 => %xx; => %zz; => <!ENTITY tricky "error-prone" > is declared
line 8 => element test has the content: "This sample shows a error-prone method."

98 [61] conditionalSect ::= includeSect | ignoreSect
3.4 Conditional Sections
Conditional sections are portions of the document type declaration external subset which are included in, or excluded from, the logical structure of the DTD based on the keyword which governs them.
Conditional Section
  [61] conditionalSect ::= includeSect | ignoreSect
  [62] includeSect ::= '<![' S? 'INCLUDE' S? '[' extSubsetDecl ']]>'
  [63] ignoreSect ::= '<![' S? 'IGNORE' S? '[' ignoreSectContents* ']]>'
  [64] ignoreSectContents ::= Ignore ('<![' ignoreSectContents ']]>' Ignore)*
  [65] Ignore ::= Char* - (Char* ('<![' | ']]>') Char*)
Note: Nested conditional sections are allowed.

99 <!ENTITY % draft 'INCLUDE' > <!ENTITY % final 'IGNORE' >
3.4 Conditional Sections
Example:
  <!ENTITY % draft 'INCLUDE' >
  <!ENTITY % final 'IGNORE' >
  <![%draft;[
  <!ELEMENT book (comments*, title, body, supplements?)>
  ]]>
  <![%final;[
  <!ELEMENT book (title, body, supplements?)>
  ]]>

