
1 XML for Information Management – Day 3 Airi Salminen XML for Information Management University of Erlangen-Nuremberg Computational Linguistics Instructor: Professor Airi Salminen http://users.jyu.fi/~airi/ 26.4.-30.4.2010

2 Outline
1. Structured documents
2. Formal grammars in XML
3. Natural languages in XML documents
4. Adding meaning by markup
5. Text indexing
6. Logical structure of XML documents

3 1. Structured documents
Structured document:
‣ structure, content, and external presentation can be separated from each other and processed separately
‣ structural components have names
‣ structural components can be recognized by software modules
‣ it is possible to define the structure

4 1. Structured documents
Structured document: Structure, Content, Layout
‣ an open language standard, e.g., SGML, XML
‣ different languages for defining the layout, e.g., CSS and XSL for XML
‣ different languages for defining the structure, e.g., DTD, XML Schema, RELAX NG for XML

5 1. Structured documents
Structured document: Structure, Content, Layout
Example: DTD.txt, rhymes-with-ext-dtd.txt, rhymes-with-ext-dtd.xml, rhymes-style.txt, rhymes-style.css, rhymes-with-style-and-ext-dtd.xml, rhymes-with-style-and-ext-dtd.txt

6 1. Structured documents
Management of structured documents:
‣ document management
‣ management of the data contained in documents

7 1. Structured documents
Characteristics in the management of structured documents:
‣ Design. Adopting the approach of structured document management in an environment often requires careful planning before the creation of documents. Includes schema design and layout design.
‣ Content production. Content can be produced by different types of software, e.g., by a syntax-directed editor. Includes checking the validity against the schema.
‣ Evolution. Schema versioning, layout versioning.
‣ Operations. The most typical operation is some kind of transformation.
‣ Software. Many kinds of software systems are used.

8 2. Formal grammars in XML
A formal grammar is a way to describe the syntax of a language. It consists of:
‣ terminal symbols (alphabet)
‣ nonterminal symbols
‣ production rules
‣ start symbol
The language defined by a grammar consists of all those strings over the alphabet that can be generated by starting with the start symbol and then applying the production rules until no nonterminal symbols are present.
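The definition above can be tried out on a toy grammar. The following is a minimal sketch, not part of the original slides: the grammar S -> 'a' S 'b' | 'a' 'b' and the names RULES, START, and generate are assumptions chosen for illustration.

```python
# Toy grammar (hypothetical): S -> 'a' S 'b' | 'a' 'b'
# Keys of RULES are the nonterminal symbols; every other symbol is a terminal.
RULES = {"S": [["a", "S", "b"], ["a", "b"]]}
START = "S"


def generate(max_steps):
    """Return the terminal strings derivable in at most max_steps rule applications."""
    results = set()
    forms = [[START]]  # sentential forms that still contain a nonterminal
    for _ in range(max_steps):
        next_forms = []
        for form in forms:
            # Rewrite the leftmost nonterminal with every alternative.
            idx = next(i for i, sym in enumerate(form) if sym in RULES)
            for rhs in RULES[form[idx]]:
                new_form = form[:idx] + rhs + form[idx + 1:]
                if any(sym in RULES for sym in new_form):
                    next_forms.append(new_form)
                else:
                    results.add("".join(new_form))  # no nonterminals left
        forms = next_forms
    return sorted(results, key=len)


print(generate(3))
```

Each string returned contains only terminal symbols, so it belongs to the language defined by the toy grammar.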

9 2. Formal grammars in XML
In XML there are two kinds of formal grammars with their own notations:
‣ the grammar defining the XML syntax in the XML specification
‣ DTD

10 2. Formal grammars in XML
The XML specification uses the EBNF (Extended Backus-Naur Form) notation with the metasymbols ?, *, +, |, and ( ).
The syntax of XML 1.0 is described by production rules numbered from [1] to [89]. Some of the rules included in the first edition have been left out of later editions, and some others have been added, for example, [28a] and [28b].
The notation of the XML syntax is described in Section 6 of the specification: 6. Notation

11 2. Formal grammars in XML
A?      A is optional
A | B   A and B are alternatives
A+      A occurs once or more
A*      A may be missing or occurs once or more
A - B   A but not B
A B     B after A
( )     grouping
Example rules in XML 1.0:
document ::= prolog element Misc*
prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
Misc ::= Comment | PI | S
Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
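The Comment rule can be read as a regular expression. The sketch below is an approximation under one stated assumption: any character is treated as Char, whereas the specification restricts Char to the legal XML character ranges. The names COMMENT and is_comment are illustrative only.

```python
import re

# Approximation of:
#   Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
# Assumption: every character counts as Char.
#   [^-]    corresponds to (Char - '-')
#   -[^-]   corresponds to ('-' (Char - '-'))
COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def is_comment(text):
    """Return True if the whole string matches the approximated Comment rule."""
    return COMMENT.fullmatch(text) is not None


print(is_comment("<!-- a note -->"))
print(is_comment("<!-- a -- b -->"))
```

The rule excludes a double hyphen inside a comment and a hyphen directly before the closing '-->', which is why the second call is rejected.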

12 2. Formal grammars in XML
Production rules in a DTD: the element type declarations of a DTD do not describe the concrete syntax of elements, only their hierarchic structure. The details of the concrete syntax (begin-tag, end-tag, etc.) are described in the XML specification.

13 2. Formal grammars in XML
The XML specification defines the concrete syntax of XML documents. The distinction between the concrete and abstract syntax of XML is not quite clear. W3C has developed four slightly different models to describe the abstract syntax:
‣ XML Information Set
‣ DOM model
‣ XPath 1.0 model
‣ XQuery 1.0 and XPath 2.0 data model
Analysis of differences in the models: Salminen, A., & Tompa, F.W. (2001). Requirements for XML document database systems. Proc. of the ACM Symposium on Document Engineering (DocEng '01) (pp. 85-94). New York: ACM Press.

14 3. Natural languages in XML documents
Natural language may occur in XML marked-up text in the:
‣ content of elements
‣ markup
  ‣ element, attribute, and entity names
  ‣ attribute values
  ‣ comments

15 3. Natural languages in XML documents
Natural language in the markup is NOT utilized by the XML processor, BUT it can be utilized by:
‣ human individuals in
  ‣ reading the marked-up text
  ‣ information access
  ‣ communicating with other individuals about the schema or marked-up content
‣ some software applications, for example, text analysis software

16 4. Adding meaning by markup
It is important that the element and attribute names are meaningful to human readers.
The names are not useful in information access
Where wilt thou lead me? speak; I'll go no further. Mark me.

17 4. Adding meaning by markup
‣ Natural language in XML documents provides semantic information to human readers and for human communication.
‣ Meaningful markup is useful for human users in information retrieval and in specifying transformations.
‣ Markup may provide rich semantic and linguistic information.

18 4. Adding meaning by markup
Example of combining structural, semantic and linguistic markup:
She smelled like trees.
Example from Smith, J., Deshaye, J., & Stoicheff, P., Callimachus - Avoiding the pitfalls of XML for collaborative text analysis. Literary and Linguistic Computing 21 (2), 2006, 199-218.

19 4. Adding meaning by markup
Another markup for the same text:
She smelled like trees.
Example from Smith, J., Deshaye, J., & Stoicheff, P., Callimachus - Avoiding the pitfalls of XML for collaborative text analysis. Literary and Linguistic Computing 21 (2), 2006, 199-218.

20 4. Adding meaning by markup
Some other examples:
‣ http://nrrc.mitre.org/NRRC/Docs_Data/MPQA_04/approval_time.htm
‣ http://www.cs.cmu.edu/~awb/festival_demos/sable.html
‣ http://www.etang.umontreal.ca/bwp1800/essays/flanders_encoding4.html

21 4. Adding meaning by markup
‣ In the Semantic Web, semantic information about the meaning of the markup vocabulary of documents is available as additional metadata in a formal, standardized form.
‣ The concepts and meanings are defined in formal ontologies.
‣ Software applications can understand the meanings.

22 5. Text indexing
In information retrieval environments, collections of natural language documents are usually indexed; retrieval is based on the index terms included in the index.
Figure: documents, index, search engine, query, answer
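Index-based retrieval can be sketched with a small inverted index. This is an illustration under assumptions, not the indexing method of any particular search engine: the tokenization rule, the AND semantics of queries, and the names build_index, search, and docs are all hypothetical. The sample texts are taken from examples elsewhere in these slides.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple word tokens (an assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(docs):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "d1": "She smelled like trees.",
    "d2": "Where wilt thou lead me? speak; I'll go no further.",
    "d3": "The mountain cherry blossoms bloom and fade in a day.",
}
index = build_index(docs)
print(search(index, "bloom fade"))
```

The query is answered from the index alone; the document texts are read only once, when the index is built.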

23 6. Logical structure of XML documents
Components of the logical structure:
‣ declarations
‣ elements
‣ comments
‣ processing instructions

24 6. Logical structure of XML documents
document ::= prolog element Misc*
‣ prolog: declarations, comments, processing instructions
‣ element: elements, comments, processing instructions
‣ Misc*: comments, processing instructions
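The three parts of the document rule can be observed with a DOM parser. The sketch below uses xml.dom.minidom from the Python standard library. The sample document, the processing instruction target app, and the names SOURCE, NAMES, and top_level are assumptions made for illustration. The XML declaration itself is not exposed as a node by this parser.

```python
from xml.dom import minidom, Node

# Hypothetical document: comment and processing instruction in the prolog,
# then the document element, then a trailing comment (the Misc* part).
SOURCE = (
    '<?xml version="1.0"?>'
    '<!-- comment in the prolog -->'
    '<?app hint?>'
    '<poem><line>text</line></poem>'
    '<!-- comment after the document element -->'
)

NAMES = {
    Node.ELEMENT_NODE: "element",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
    Node.DOCUMENT_TYPE_NODE: "document type declaration",
}

doc = minidom.parseString(SOURCE)
# Children of the document node, in document order.
top_level = [NAMES.get(node.nodeType, "other") for node in doc.childNodes]
print(top_level)
```

Exactly one element appears at the top level; everything before it belongs to the prolog and everything after it to Misc*.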

25 6. Logical structure of XML documents
Declarations:
‣ XML declaration [23]
‣ document type declaration [28]
‣ markup declaration [29]
  ‣ to constrain the logical structure: element type declaration [45], attribute list declaration [52]
  ‣ to constrain the physical structure: entity declaration [70], notation declaration [82]
‣ encoding declaration [80]
‣ standalone document declaration [32]
‣ text declaration [77]

26 6. Logical structure of XML documents
Typical element type declarations:
‣ mixed content defined
‣ element content defined
‣ empty element defined

27 6. Logical structure of XML documents
empty element defined: two forms of the element allowed in a well-formed document:
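The two forms are the empty-element tag and a start-tag immediately followed by an end-tag. A minimal sketch with a hypothetical element type br (the name is an assumption, as are is_empty, short_form, and long_form) shows that a parser treats both forms alike:

```python
import xml.etree.ElementTree as ET

# Hypothetical element type "br", assumed to be declared as: <!ELEMENT br EMPTY>
short_form = ET.fromstring("<br/>")      # empty-element tag
long_form = ET.fromstring("<br></br>")   # start-tag immediately followed by end-tag


def is_empty(element):
    """True if the element has neither character data nor child elements."""
    return element.text is None and len(element) == 0


print(is_empty(short_form), is_empty(long_form))
```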

28 6. Logical structure of XML documents
element content: definition by content models with metasymbols
*    iteration (none or more)
+    iteration (once or more)
|    alternatives
?    optional
,    successive
( )  grouping
#PCDATA is not accepted in the content model!
Example from XHTML 1.0 Strict DTD:
<!ELEMENT table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
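A content model is a regular expression over child element names, so it can be checked with an ordinary regex engine. The sketch below is an illustration, not how a validating parser is implemented: model_to_regex and children_match are hypothetical helpers, and the additional XML requirement that content models be deterministic is ignored.

```python
import re


def model_to_regex(model):
    """Translate an element-content model into a regex over 'name ' tokens."""
    compact = re.sub(r"\s+", "", model)
    # Each element name becomes a group matching that name followed by one space.
    # The metasymbols ?, *, +, | and ( ) keep their meaning in the regex.
    named = re.sub(
        r"[A-Za-z_:][\w:.-]*",
        lambda m: "(?:%s )" % re.escape(m.group(0)),
        compact,
    )
    # ',' (successive) is plain concatenation in a regex.
    return re.compile(named.replace(",", ""))


def children_match(model, child_names):
    """Check a sequence of child element names against the content model."""
    sequence = "".join(name + " " for name in child_names)
    return model_to_regex(model).fullmatch(sequence) is not None


TABLE = "(caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))"
print(children_match(TABLE, ["caption", "tr", "tr"]))
print(children_match(TABLE, ["col", "colgroup", "tr"]))
```

The second sequence is rejected because col and colgroup are alternatives in the model and cannot be mixed.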

29 6. Logical structure of XML documents
mixed content: definition has basically two forms
(#PCDATA)
(#PCDATA | e1 | … | en)*
#PCDATA is always included in the content specification and comes first in the list of alternatives
examples:
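An instance of mixed content interleaves character data with child elements. The following sketch assumes a hypothetical declaration <!ELEMENT line (#PCDATA | emph)*>; the element names line and emph and the helper mixed_parts are illustrative and not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical declaration this instance would satisfy:
#   <!ELEMENT line (#PCDATA | emph)*>
line = ET.fromstring("<line>She smelled <emph>like</emph> trees.</line>")


def mixed_parts(element):
    """List the character data and child elements of an element in document order."""
    parts = []
    if element.text:
        parts.append(("text", element.text))
    for child in element:
        parts.append(("element", child.tag))
        if child.tail:  # character data that follows the child element
            parts.append(("text", child.tail))
    return parts


print(mixed_parts(line))
```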

30 6. Logical structure of XML documents
Attribute list declarations are used:
‣ to define the set of attributes pertaining to a given element type
‣ to establish type constraints for these attributes
‣ to provide default values for attributes

31 6. Logical structure of XML documents
Labels of the annotated attribute list declaration:
‣ element type
‣ attribute name
‣ attribute type: string
‣ constraint: the attribute must be specified for all elements of type poem

32 6. Logical structure of XML documents
Defining constraints
[60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED' | (('#FIXED' S)? AttValue)
‣ #REQUIRED: the attribute must always be provided in all elements of the given type
‣ #IMPLIED: the attribute can be provided in an element; no default value is provided
‣ AttValue: the default value is given between single or double quotes
‣ #FIXED AttValue: instances of the attribute must match the given default value

33 6. Logical structure of XML documents
Attribute types
[54] AttType ::= StringType | TokenizedType | EnumeratedType
tokenized types:
‣ ID: names that uniquely identify elements
‣ IDREF, IDREFS: references to ID type identifiers
‣ ENTITY, ENTITIES: entity names
‣ NMTOKEN, NMTOKENS: text tokens consisting of characters accepted in names
enumerated types:
‣ NOTATION: identifies notations
‣ enumeration

34 6. Logical structure of XML documents
<!DOCTYPE text [ <!ATTLIST line id ID #REQUIRED seeline IDREFS #IMPLIED> ]>
This is the first line
This is the second line, but look at the first too
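The element markup of this example did not survive in the transcript, so the sketch below reconstructs it under assumptions: the id values l1 and l2 and the seeline reference are hypothetical, as are the helper names. ElementTree is a non-validating parser, so the ID and IDREFS constraints are checked by hand.

```python
import xml.etree.ElementTree as ET

# Reconstructed instance; the id values "l1" and "l2" are assumptions.
SOURCE = """<!DOCTYPE text [
  <!ATTLIST line id      ID     #REQUIRED
                 seeline IDREFS #IMPLIED>
]>
<text>
  <line id="l1">This is the first line</line>
  <line id="l2" seeline="l1">This is the second line, but look at the first too</line>
</text>"""

root = ET.fromstring(SOURCE)


def duplicate_ids(root):
    """Return id values used more than once (ID values must be unique)."""
    seen, duplicates = set(), []
    for line in root.iter("line"):
        value = line.get("id")
        if value in seen:
            duplicates.append(value)
        seen.add(value)
    return duplicates


def dangling_references(root):
    """Return (source id, reference) pairs whose IDREFS token matches no ID."""
    ids = {line.get("id") for line in root.iter("line")}
    return [
        (line.get("id"), ref)
        for line in root.iter("line")
        for ref in line.get("seeline", "").split()  # IDREFS: space-separated names
        if ref not in ids
    ]


print(duplicate_ids(root), dangling_references(root))
```

A validating parser would report both kinds of problems as validity errors. Here both lists are empty because every id is unique and every reference resolves.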

35 6. Logical structure of XML documents
She smelled like trees.
XML-aware web browsers support the visualization of the tree structure: example

36 6. Logical structure of XML documents
Different abstract models describe the tree in slightly different ways.
This life of ours would not cause you sorrow if you thought of it as like the mountain cherry blossoms which bloom and fade in a day.

37 6. Logical structure of XML documents
Node types of XPath 1.0 (tree diagram of the poem document):
‣ Root node
‣ Element node: poem, line, line
‣ Attribute node: Author (Murasaki Shikibu), born (974)
‣ Text node: This life of ours would not cause you sorrow if you thought of it as like / the mountain cherry blossoms / which bloom and fade in a day.
‣ Comment node: The poem is translated from Japanese by Kenneth Rexroth
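The node types can be counted with a DOM parser. The sketch below relies on assumptions: the exact markup of the poem is not recoverable from the transcript, so the attribute names author and born and the split into two line elements are hypothetical. The DOM Document node stands in for the XPath root node, and attributes are counted separately because the DOM does not treat them as children.

```python
from collections import Counter
from xml.dom import minidom, Node

# Hypothetical markup of the poem; written without whitespace between
# elements so that no whitespace-only text nodes are created.
SOURCE = (
    '<poem author="Murasaki Shikibu" born="974">'
    '<!-- The poem is translated from Japanese by Kenneth Rexroth -->'
    '<line>This life of ours would not cause you sorrow</line>'
    '<line>if you thought of it as like the mountain cherry blossoms</line>'
    '</poem>'
)

KINDS = {
    Node.DOCUMENT_NODE: "root",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
}


def count_nodes(node, counts=None):
    """Walk the tree and count nodes by kind."""
    counts = Counter() if counts is None else counts
    counts[KINDS.get(node.nodeType, "other")] += 1
    if node.nodeType == Node.ELEMENT_NODE:
        counts["attribute"] += node.attributes.length
    for child in node.childNodes:
        count_nodes(child, counts)
    return counts


counts = count_nodes(minidom.parseString(SOURCE))
print(dict(counts))
```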

