Advanced Information Retrieval Chap. 06: Text and Multimedia Languages and Properties (Introduction, Metadata and Text) 6.4.


1 Advanced Information Retrieval Chap. 06: Text and Multimedia Languages and Properties (Introduction, Metadata and Text) 6.4

2 Text markup languages
SGML
HTML
XML

3 Document Markup languages
Text markup: additional information that is added to the text (but is not part of the text) to provide such things as
–Formatting instructions
–Structural information
–Semantics
Markup languages have evolved from providing instructions on the printing style of each document part to providing information on the function of each document part.
–Example: Instead of marking a section heading to be printed with large font size and boldface font, it is marked up as "section heading".
–Allows formatting to be interpreted differently in different situations.
–Allows documents originating from different sources to appear more uniform.

4 Document Markup languages (cont.)
Formatting markup languages
–TeX, Troff and similar typesetting languages interleave actual document text and formatting instructions. A processor reads such files and produces a formatted output suitable for printing.
Web markup languages:
–SGML: A metalanguage for structuring large documents. It defines a document structure and its associated markup conventions.
–HTML: A hypertext language used for linking and displaying Web documents. A browser reads such files and produces output suitable for display.
–XML: A subset of SGML, used for semantic markup. An increasingly popular language for data exchange.

5 Standard Generalized Markup Language (SGML)
SGML evolved from an earlier markup language called Generalized Markup Language (GML), created by IBM in the 1960s.
Strictly speaking, SGML is not a markup language but a metalanguage: a language for defining markup languages.
It is instantiated to specific languages by defining individual document types and their corresponding markup languages.
Well-known instance of SGML: HTML

6 SGML (cont.)
An SGML document consists of three major parts:
–1. The SGML declaration describes the document character set, the codes used to identify and delimit markup sequences, and so on.
–2. The Document Type Definition (DTD) defines the model for documents: the various elements of the document, how these elements relate to one another, what their possible attributes are, and so on.
–3. The document instance contains the marked-up contents of the document. The instance contains a reference to the DTD to be used in interpreting it.
SGML markup does not describe the semantics of the markup.
The semantics of elements and attributes are described in a separate document (or in comments embedded in the DTD).

7 SGML (cont.)
Basic DTD syntax:
–ELEMENT: defines a tag.
–ATTLIST: defines the possible attributes of a tag.
–PCDATA, NDATA: indicate ASCII text or binary data, respectively.
–"- -" indicates that both the begin and end tags are required; "- O" indicates that the begin tag is required and the end tag is optional, etc.
–The symbols |  ,  ?  *  + (the comma is itself one of the symbols) are regular expression syntax, indicating (respectively) disjunction, concatenation, zero or one occurrences, any number of occurrences, one or more occurrences.
–Much of the syntax of SGML is similar to the syntax of XML (to be discussed later in more detail).

8 SGML (cont.)
Example: SGML specification for electronic mail messages.
–DTD:
<!ELEMENT e-mail - - (prolog, contents)>
<!ELEMENT prolog - - (sender, address+, subject?, Cc*)>
<!ELEMENT (sender | address | subject | Cc) - O (#PCDATA)>
<!ELEMENT contents - - (par | image | audio)+>
<!ELEMENT par - O (ref | #PCDATA)+>
<!ELEMENT ref - O EMPTY>
<!ELEMENT (image | audio) - - (#NDATA)>
<!ATTLIST e-mail id ID #REQUIRED date_sent DATE #REQUIRED status (secret | public) public>
<!ATTLIST ref id ID #REQUIRED>
<!ATTLIST (image | data) id ID #REQUIRED>
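The regular-expression reading of a content model can be made concrete with a short sketch. The Python below is only an illustration, not an SGML parser: it translates a simple content model such as the prolog model from the e-mail DTD into a regular expression over the sequence of child element names. The helper names `model_to_regex` and `children_match` are invented for this example, and only names, parentheses and the operators `, | ? * +` are handled.

```python
import re

def model_to_regex(model):
    """Translate a simple DTD content model into a regex over child names.

    Every element name becomes a group matching 'name ' (the name plus one
    space), so | ? * + keep their regular-expression meaning and the comma
    becomes plain concatenation.
    """
    tokens = re.findall(r"[A-Za-z#][\w.-]*|[(),|?*+]", model)
    parts = []
    for tok in tokens:
        if tok == ",":
            continue  # concatenation is implicit in a regular expression
        if tok in "()|?*+":
            parts.append(tok)  # operators and parentheses pass through
        else:
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))

def children_match(model, children):
    """Check whether a list of child element names satisfies the model."""
    sequence = "".join(name + " " for name in children)
    return model_to_regex(model).fullmatch(sequence) is not None

PROLOG = "(sender, address+, subject?, Cc*)"
print(children_match(PROLOG, ["sender", "address", "address", "subject", "Cc"]))
print(children_match(PROLOG, ["sender", "subject"]))  # no address: rejected
```

The trailing space after each name keeps one element name from matching as a prefix of another.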

9 SGML (cont.)
An example document with the previous DTD (attribute values omitted):
<e-mail>
<prolog>
<sender> Pablo Neruda </sender>
<address> Federico Garcia Lorca </address>
<address> Ernest Hemingway </address>
<subject> Pictures of my house in Isla Negra </subject>
<Cc> Gabriel Garcia Marquez </Cc>
</prolog>
<contents>
<par> As promised in my previous letter, I am sending two digital pictures to show you my house and the splendid view of the Pacific Ocean from my bedroom (photo <ref>). </par>
<image> "photo1.gif" </image>
<image> "photo2.jpg" </image>
<par> Regards from the south, Pablo. </par>
</contents>
</e-mail>

10 Hypertext Markup Language
HTML is an instance of SGML, created in 1989.
HTML is the standard language for storing documents on the World Wide Web.
HTML follows the SGML conventions and has a DTD, but documents do not make explicit reference to this DTD.
HTML is under continuous evolution (currently, version 4).

11 HTML (cont.)
Some notable features of HTML:
–Tags that determine the way certain text, such as titles, is rendered on the screen.
–Tags that are links to other documents, letting users navigate from document to document.
–Markup for forms that let the user fill out information and electronically send or e-mail the data to the document author, initiate sophisticated searches, or order goods or services.
–Tags for embedding other types of media such as pictures or audio.
–Tags for embedding programs (using Java applets or JavaScript).
–Tags for storing metadata.

12 HTML (cont.)
Example: HTML document and how it is seen in the browser.
<html><head>
<title>HTML Example</title>
</head><body>
<h1>HTML Example</h1>
<p><hr><p>
HTML has many tags, among them:
<ul>
<li> links to other <a href="…">pages</a>
<li> paragraphs (p), headings (h1, h2, etc.)
<li> font types (b,i),
<li> horizontal rules (hr),
<li> indented lists and items (ul, li),
<li> images (img), tables, forms etc.
</ul><p><hr><p>
This page is always under construction.
</body></html>

13 HTML (cont.)
Cascading Style Sheets (CSS): Definition of style rules that tell a browser how to present a document.
CSS gives Web authors a powerful tool for improving the aesthetics of Web pages.
Example: The following code defines the color and font-size properties for H1 and H2 elements. It tells the browser to show level-one headings in an extra-large, red font, and to show level-two headings in a large, blue font.
<head>
<title>CSS Example</title>
<style>
h1 { font-size: x-large; color: red }
h2 { font-size: large; color: blue }
</style></head>

14 Extensible Markup Language (XML)
Defined by the WWW Consortium (W3C).
Originally intended as a document markup language to replace HTML as the language for publishing documents on the Web.
Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML.
Extensible (i.e., a meta-language): Users can add new tags, and separately specify how a tag should be handled for display.
The ability to specify new tags and to create nested tag structures makes XML useful for exchange of data (not just documents).
Much of the use of XML has been in data exchange applications, not as a replacement for HTML.

15 XML (cont.)
Data interchange is critical in today's networked world. Paper flow of information between organizations is being replaced by electronic flow of information.
–Banking: funds transfer.
–Order processing, especially inter-company orders.
–Scientific data: chemistry, genetics.
With XML, each application area sets its own standards for representing information.
Each XML-based standard defines the valid elements, using
–XML-type specification (DTD or XML schema) to specify the syntax.
–Textual descriptions of the semantics.
A wide variety of tools is available for parsing, browsing and querying XML documents or data.

16 Basic Structure
Basic syntax:
–ELEMENT: Section of data beginning with <tagname> and ending with matching </tagname>.
–Mixing text with elements is allowed.
Example: <university> 123-456 Johnson Information Systems This enrollment is still incomplete. 123-4567 INFS-623 </university>
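A minimal sketch of these rules, using Python's standard `xml.etree.ElementTree` parser. The inner tag names (`student`, `student-id`, `student-name`, `major`, `enrollment`, `course-id`) are assumptions made for illustration; only `university` and the data values come from the slide. The `is_well_formed` helper is likewise invented for this example.

```python
import xml.etree.ElementTree as ET

# Inner tag names are assumed; the slide's own markup was not preserved.
doc = """
<university>
  <student>
    <student-id>123-456</student-id>
    <student-name>Johnson</student-name>
    <major>Information Systems</major>
  </student>
  <enrollment>
    This enrollment is still incomplete.
    <student-id>123-4567</student-id>
    <course-id>INFS-623</course-id>
  </enrollment>
</university>
"""

root = ET.fromstring(doc)
student = root.find("student")
enrollment = root.find("enrollment")

student_name = student.findtext("student-name")
# Mixed content: text that sits directly inside <enrollment>,
# before its first child element.
note = enrollment.text.strip()
child_tags = [child.tag for child in enrollment]

def is_well_formed(text):
    """True when every start tag is closed by a matching, properly nested end tag."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(student_name, "|", note, "|", child_tags)
print(is_well_formed("<a><b></a></b>"))  # tags overlap instead of nesting
```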

17 Attributes
Elements can have attributes.
–Attributes are specified by name=value pairs inside the starting tag of an element.
–An element may have several attributes, but each attribute name may occur only once.
Example: 123-456 Johnson Information Systems </student>
–Note that the same information could also be specified with a subelement: graduate
–Suggestion: Use attributes for identifiers of elements, and use subelements for contents.
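The attribute rules can be checked with a small Python sketch. The attribute name `status` and the element names are assumptions for illustration; the value `graduate` is the one mentioned on the slide.

```python
import xml.etree.ElementTree as ET

# 'status' is an assumed name; the slide shows only the value "graduate".
as_attribute = ET.fromstring(
    '<student student-id="123-456" status="graduate">'
    "<student-name>Johnson</student-name></student>"
)
as_subelement = ET.fromstring(
    '<student student-id="123-456">'
    "<student-name>Johnson</student-name><status>graduate</status></student>"
)

# The same information, read from an attribute or from a subelement.
status_from_attribute = as_attribute.get("status")
status_from_subelement = as_subelement.findtext("status")

def parses(text):
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# An attribute name may occur only once per start tag, so this is rejected.
duplicate_ok = parses('<student status="graduate" status="alumni"/>')
print(status_from_attribute, status_from_subelement, duplicate_ok)
```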

18 Namespaces
XML data has to be exchanged between organizations.
–The same tag name may have different meanings in different organizations, causing confusion in exchanged documents.
–Specifying a unique string as an element name avoids such confusion.
–Better solution: use unique-name:element-name.
–Avoid using long unique names all over the document by using XML namespaces.
Example: …<GMU:student> Johnson Information Systems </GMU:student>…</university>
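A hedged sketch of how a namespace prefix is resolved, using Python's `xml.etree.ElementTree`. The namespace URI `http://www.gmu.edu` and the child element names are assumptions; the slide shows only the `GMU:student` tag and the data values.

```python
import xml.etree.ElementTree as ET

GMU = "http://www.gmu.edu"  # assumed namespace URI, for illustration only

doc = f"""
<university xmlns:GMU="{GMU}">
  <GMU:student>
    <GMU:student-name>Johnson</GMU:student-name>
    <GMU:major>Information Systems</GMU:major>
  </GMU:student>
  <student>an unqualified element with the same local name</student>
</university>
"""

root = ET.fromstring(doc)
ns = {"GMU": GMU}  # prefix-to-URI map used when searching

qualified = root.findall("GMU:student", ns)
unqualified = root.findall("student")

# The parser expands the short prefix to the full unique name.
expanded_tag = qualified[0].tag
name = qualified[0].findtext("GMU:student-name", namespaces=ns)
print(expanded_tag, name, len(unqualified))
```

The prefix is written only once per tag in the document, while the parser compares the long unique name, which is what keeps the two `student` elements apart.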

19 Schemas
Database schemas constrain what information can be stored, and the data types of stored values.
XML documents are not required to have an associated schema.
However, schemas are very important for XML data exchange, to allow a site to automatically interpret data received from another site.
Two mechanisms for specifying XML schema:
–Document Type Definition (DTD): Widely used.
–XML Schema: Newer, more complex.
We shall discuss only DTD schemas.

20 Document Type Definition (DTD)
A DTD specifies the type of an XML document. It constrains the structure of XML data by declaring
–The elements that can occur.
–The subelements that can/must occur inside an element, and how many times.
–The attributes that an element can/must have.
A DTD does not constrain data types:
–All values are represented as strings in XML.
DTD syntax:
–<!ELEMENT element (subelements-specification)>
–<!ATTLIST element (attributes)>

21 DTD (cont.)
Subelements are either
–Names of other elements.
–#PCDATA (parsed character data: character strings).
–EMPTY (no subelements) or ANY (anything can be a subelement).
Subelement specification may have regular expressions. Notation:
–| : alternatives
–+ : 1 or more occurrences
–* : 0 or more occurrences

22 DTD (cont.)
Example: University DTD with information on students, courses and enrollments.
<!DOCTYPE university [ ]>

23 DTD (cont.)
Attribute specifications include three components:
–Attribute name
–Attribute type: CDATA (character data); ID (identifier), IDREF (ID reference) or IDREFS (multiple IDREFs).
–Attribute value information: the value must be specified in each element (#REQUIRED), or there is a default value (value).
Example:
–<!ATTLIST student student-id ID #REQUIRED enrollments IDREFS #REQUIRED>

24 DTD (cont.)
Attributes of type ID, IDREF and IDREFS:
–An element can have at most one attribute of type ID.
–The ID attribute value of each element in an XML document must be distinct (hence, the ID attribute value is an identifier).
–An attribute of type IDREF must contain the ID value of an element in the same document.
–An attribute of type IDREFS contains a set of (0 or more) ID values.
–Each ID value must contain the ID value of an element in the same document.

25 DTD (cont.)
Example: A University DTD with ID and IDREF attribute types.
<!DOCTYPE university [
<!ATTLIST enrollment enrollment-no ID #REQUIRED enrolled-student IDREF #REQUIRED>
<!ATTLIST student student-id ID #REQUIRED enrollments IDREFS #REQUIRED>
… declarations for course-title, grade, student-name, and major …
]>

26 DTD (cont.)
Example: XML data for the previous DTD.
<university> Johnson Information Systems Information Retrieval B+ Database Systems A </university>
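Python's `xml.etree.ElementTree` does not validate against a DTD, so the ID and IDREF rules can be illustrated with a hand-written check. This is a sketch under assumptions: the attribute names come from the ATTLIST declarations in the university DTD, while the identifier values (`S1`, `E1`, `E2`), the document layout and the function name `check_ids` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Attribute roles taken from the ATTLIST declarations of the university DTD.
ID_ATTRS = {"student": "student-id", "enrollment": "enrollment-no"}
IDREF_ATTRS = {"enrollment": ["enrolled-student"]}
IDREFS_ATTRS = {"student": ["enrollments"]}

def check_ids(root):
    """Return a list of violations of the ID / IDREF / IDREFS rules."""
    errors = []
    seen = set()
    for elem in root.iter():  # pass 1: every ID value must be distinct
        attr = ID_ATTRS.get(elem.tag)
        if attr and attr in elem.attrib:
            value = elem.get(attr)
            if value in seen:
                errors.append(f"duplicate ID {value}")
            seen.add(value)
    for elem in root.iter():  # pass 2: every reference must resolve
        refs = [elem.get(a) for a in IDREF_ATTRS.get(elem.tag, []) if a in elem.attrib]
        for a in IDREFS_ATTRS.get(elem.tag, []):
            refs.extend(elem.get(a, "").split())  # IDREFS: space-separated list
        for ref in refs:
            if ref not in seen:
                errors.append(f"dangling reference {ref}")
    return errors

# Identifier values and nesting are hypothetical.
GOOD = """
<university>
  <student student-id="S1" enrollments="E1 E2">
    <student-name>Johnson</student-name><major>Information Systems</major>
  </student>
  <enrollment enrollment-no="E1" enrolled-student="S1">
    <course-title>Information Retrieval</course-title><grade>B+</grade>
  </enrollment>
  <enrollment enrollment-no="E2" enrolled-student="S1">
    <course-title>Database Systems</course-title><grade>A</grade>
  </enrollment>
</university>
"""
print(check_ids(ET.fromstring(GOOD)))
```

Like a DTD, this check only asks whether a referenced ID exists somewhere in the document; it does not check which kind of element carries it.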

27 DTD (cont.)
DTD has several limitations:
–No typing of text elements and attributes (all values are strings).
–Difficult to specify unordered sets of subelements (order is often irrelevant in databases).
–IDs and IDREFs are untyped; for example, the enrollments attribute of a student may contain a reference to another student, which is meaningless (enrollments should ideally be constrained to refer to enrollment elements).
XML Schema is a more sophisticated schema language which addresses these drawbacks of DTDs, and offers many more features.
XML Schema is significantly more complicated than DTDs, and is not yet widely used.

28 Querying and transforming XML data
Querying XML data and translating information from one XML schema to another are closely related, and are handled by the same tools.
Standard XML querying/translation languages
–XPath: Simple language consisting of path expressions.
–XSLT: Simple language designed for translation from XML to XML and XML to HTML.
–XQuery: An XML query language with a rich set of features.

29 Tree model of XML data
Query and transformation languages are based on a tree model of XML data.
An XML document is modeled as a tree, with nodes corresponding to elements and attributes.
–Element nodes have children nodes, which can be attributes or subelements.
–Text in an element is modeled as a text node child of the element.
–Children of a node are ordered according to their order in the XML document.
–Element and attribute nodes (except for the root node) have a single parent, which is an element node.
–The root node has a single child, which is the root element of the document.
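The tree model can be made visible with a short recursive walk. This is a sketch using Python's `xml.etree.ElementTree`, which stores attributes in a dictionary and element text in a `.text` field rather than as separate node objects; the function name `describe` and the sample document (including the ID value `S1`) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def describe(elem, depth=0):
    """Return one line per node: the element, then its attributes,
    its text, and its subelements in document order."""
    pad = "  " * depth
    lines = [f"{pad}element {elem.tag}"]
    for name, value in elem.attrib.items():
        lines.append(f"{pad}  attribute {name}={value}")
    if elem.text and elem.text.strip():
        lines.append(f"{pad}  text {elem.text.strip()!r}")
    for child in elem:  # children keep their order from the document
        lines.extend(describe(child, depth + 1))
    return lines

# Hypothetical fragment in the style of the university example.
doc = (
    '<university><student student-id="S1">'
    "<student-name>Johnson</student-name></student></university>"
)
lines = describe(ET.fromstring(doc))
print("\n".join(lines))
```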

30 Tree model of XML data (cont.)
The XML tree for the university example

31 XSLT
XSLT stands for Extensible Stylesheet Language Transformations.
It is used to transform XML documents into other kinds of documents, e.g. HTML, PDF, XML, …
XSLT uses two input files:
–The XML document containing the actual data
–The XSL document containing both the "framework" in which to insert the data, and XSLT commands to do so

32 XSLT Architecture
[Figure: the source XML document and the XSL stylesheet are fed to the XSL processor, which produces the target document.]

33 Some special transforms
XML to HTML: for old browsers
XML to LaTeX: for TeX layout
XML to SVG: graphs, charts, trees
XML to tab-delimited: for db/stat packages
XML to plain-text: occasionally useful
XML to XSL-FO formatting objects

34 XSLT Data Model
XSLT reads an XML document as a source tree
Transforms the document into a result tree
Transformations are specified in a stylesheet
To navigate the tree XSLT uses XPath

35 Introduction to XPath
XPath is a syntax for addressing parts of an XML document by
–describing paths through the document hierarchy
–specifying constraints to match against the document's structure
XSL uses XPath expressions to
–determine which elements match a template
–select nodes upon which to perform operations

36 XPath Basics
XPath expressions superficially resemble UNIX pathnames, e.g. poem/stanza/line refers to "all line elements which are children of stanza elements which are children of poem elements"
XPath expressions are evaluated relative to a "context node", which is analogous to the "current working directory" in UNIX or DOS. The XPath expression for this is "."

37 XPath Basics: a Simple Example
Consider the following XML document: Roses Ima Poet Roses are red violets are blue I'm a poet and you're not!

38 XPath Basics: a Simple Example (cont.)
The XPath "poem/stanza/line" selects, in the poem document, every line element that is a child of a stanza element that is a child of poem.

39 XPath Basics: wildcards
The XPath "poem/stanza/*" selects every element child of each stanza, whatever its name.

40 XPath Basics: descendants
The XPath "poem//punch" selects every punch element that is a descendant of poem, at any depth.

41 XPath Basics: sequencing
"poem/stanza/line[1]" selects the first line child of each stanza.

42 XPath Basics: sequencing (cont.)
"poem/stanza/line[position() = last()]" selects the last line child of each stanza.

43 XPath Basics: selecting text nodes
"poem/author/text()" selects the text inside the author element (Ima Poet) rather than the element itself.

44 XPath Basics: conditionals
"poem/stanza[punch]" selects each stanza that has at least one punch child.

45 XPath Basics: conditionals: equality
"//line[text()="I'm a poet"]" selects every line element, anywhere in the document, whose text is exactly I'm a poet.
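These selections can be tried with Python's `xml.etree.ElementTree`, which supports a limited subset of XPath. The sketch rests on several assumptions. The poem markup is a plausible reconstruction from the element names used in the slides (poem, title, author, stanza, line, punch), and the split into two stanzas is a guess. ElementTree evaluates paths relative to the element it is called on, so the leading `poem/` step is dropped. It has no `text()` node test, so `.text` and the `[.="..."]` predicate (Python 3.7 or later) stand in for it.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the poem document; the stanza split is a guess.
POEM = """
<poem>
  <title>Roses</title>
  <author>Ima Poet</author>
  <stanza>
    <line>Roses are red</line>
    <line>violets are blue</line>
  </stanza>
  <stanza>
    <line>I'm a poet</line>
    <punch>and you're not!</punch>
  </stanza>
</poem>
"""
poem = ET.fromstring(POEM)

def texts(path):
    """Text of every element selected by path, relative to <poem>."""
    return [e.text for e in poem.findall(path)]

all_lines = texts("stanza/line")           # poem/stanza/line
any_children = texts("stanza/*")           # poem/stanza/*
punches = texts(".//punch")                # poem//punch
first_lines = texts("stanza/line[1]")      # line[1]
last_lines = texts("stanza/line[last()]")  # line[position() = last()]
author_text = poem.findtext("author")      # poem/author/text()
stanzas_with_punch = poem.findall("stanza[punch]")
exact = texts(".//line[.=\"I'm a poet\"]")  # //line[text()="I'm a poet"]

print(first_lines, last_lines, author_text)
```

With a different stanza layout the positional results (`line[1]`, `line[last()]`) would change, which is why the document above is labeled as an assumption.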

46 A simple XSL example
File data.xml: <message>Hello World!</message>
File render.xsl: a stylesheet with a single template rule, examined on the following slides.

47 Stylesheet (.xsl file)
It is a well-formed XML document
It is a collection of template rules
A template rule consists of a pattern and a template
The pattern is specified in XPath and locates a node of the XML tree.
The located node is replaced by the template in the result tree

48 The .xsl file
An XSLT document has the .xsl extension
The XSLT document begins with: <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
Contains one or more templates, such as: <xsl:template match="/"> ... </xsl:template>
And ends with: </xsl:stylesheet>

49 Finding the message text
The template <xsl:template match="/"> says to select the entire file
–You can think of this as selecting the root node of the XML tree
Inside this template,
–<xsl:value-of select="message"/> selects the message child
–Alternative XPath expressions that would also work: ./message, /message/text(), ./message/text()

50 Putting it together
The XSL was the render.xsl stylesheet from the earlier slide.
The template that matches "/" chooses the root
The literal opening tags in the template are written to the output file
The contents of message is written to the output file
The literal closing tags in the template are written to the output file
The resultant file contains Hello World! wrapped in those tags
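Python's standard library has no XSLT processor, so the following is only an emulation of what the template does, written with `xml.etree.ElementTree`: read the source tree, build a result tree from literal elements, and copy in the value of message. The `html`, `body` and `h1` wrapper elements are an assumption; the slide's actual literal tags were not preserved.

```python
import xml.etree.ElementTree as ET

# Source tree: the content of data.xml.
source = ET.fromstring("<message>Hello World!</message>")

# Result tree: literal elements written by the template (wrapper assumed).
html = ET.Element("html")
body = ET.SubElement(html, "body")
h1 = ET.SubElement(body, "h1")

# Plays the role of the value-of step: copy the text of <message>.
h1.text = source.text

# Write the result tree out again as text.
result = ET.tostring(html, encoding="unicode")
print(result)
```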

51 How XSLT works
The XML text document is read in and stored as a tree of nodes
The template that matches "/" is used to select the entire tree
The rules within the template are applied to the matching nodes, thus changing the structure of the XML tree
–If there are other templates, they must be called explicitly from the main template
Unmatched parts of the XML tree are not changed
After the template is applied, the tree is written out again as a text document

52 Where XSLT can be used
A server can use XSLT to change XML files into HTML files before sending them to the client
A modern browser can use XSLT to change XML into HTML on the client side
–This is what we will mostly be doing in this class
Most users seldom update their browsers
–If you want "everyone" to see your pages, do any XSL processing on the server side

53 Modern browsers
Internet Explorer 6 best supports XML
Netscape 6 supports some of XML
Internet Explorer 5.x supports an obsolete version of XML
–If you must use IE5, the initial PI is different (you can look it up if you ever need it)

54 xsl:value-of
<xsl:value-of select="XPath expression"/> selects the contents of an element and adds it to the output stream
–The select attribute is required
–Notice that xsl:value-of is not a container, hence it needs to end with a slash
Example (from an earlier slide): <xsl:value-of select="message"/>

55 xsl:for-each
xsl:for-each is a kind of loop statement
The syntax is
<xsl:for-each select="XPath expression">
Text to insert and rules to apply
</xsl:for-each>
Example: to select every book ( //book ) and make an unordered list ( <ul> ) of their titles ( title ), use:
<ul>
<xsl:for-each select="//book">
<li> <xsl:value-of select="title"/> </li>
</xsl:for-each>
</ul>

56 Filtering output
You can filter (restrict) output by adding a criterion to the select attribute's value:
<xsl:value-of select="title[../author='Terry Pratchett']"/>
This will select book titles by Terry Pratchett

57 Filter details
Here is the filter we just used: <xsl:value-of select="title[../author='Terry Pratchett']"/>
author is a sibling of title, so from title we have to go up to its parent, book, then back down to author
This filter requires a quote within a quote, so we need both single quotes and double quotes
Legal filter operators are: = != &lt; &gt; (inside the stylesheet, < must be written as &lt;)
–Numbers should be quoted, but apparently don't have to be

58 But it doesn't work right!
Here's what we did:
<xsl:for-each select="//book">
<li> <xsl:value-of select="title[../author='Terry Pratchett']"/> </li>
</xsl:for-each>
This will output <li> and </li> for every book, so we will get empty bullets for authors other than Terry Pratchett
There is no obvious way to solve this with just xsl:value-of

59 xsl:if
xsl:if allows us to include content if a given condition (in the test attribute) is true
Example:
<xsl:for-each select="//book">
<xsl:if test="author='Terry Pratchett'">
<li> <xsl:value-of select="title"/> </li>
</xsl:if>
</xsl:for-each>
This does work correctly!
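The difference between filtering inside the output step and guarding the whole list item can be seen in a small Python analogue. This is not XSLT, only an `xml.etree.ElementTree` sketch of the same control flow; the `library` wrapper, the sample book data and both function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data invented for this sketch.
BOOKS = """
<library>
  <book><title>Mort</title><author>Terry Pratchett</author></book>
  <book><title>Emma</title><author>Jane Austen</author></book>
  <book><title>Sourcery</title><author>Terry Pratchett</author></book>
</library>
"""
root = ET.fromstring(BOOKS)

def list_with_filter_only(root, author):
    """Filter applied only to the value: an <li> is still made for every book."""
    ul = ET.Element("ul")
    for book in root.iter("book"):
        li = ET.SubElement(ul, "li")  # emitted unconditionally
        if book.findtext("author") == author:
            li.text = book.findtext("title")
    return ul

def list_with_if(root, author):
    """Condition wraps the whole item, as the xsl:if version does."""
    ul = ET.Element("ul")
    for book in root.iter("book"):
        if book.findtext("author") == author:
            ET.SubElement(ul, "li").text = book.findtext("title")
    return ul

broken = [li.text for li in list_with_filter_only(root, "Terry Pratchett")]
fixed = [li.text for li in list_with_if(root, "Terry Pratchett")]
print(broken)  # contains an empty item for the non-matching book
print(fixed)
```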

60 xsl:choose
The xsl:choose ... xsl:when ... xsl:otherwise construct is XSLT's equivalent of Java's switch ... case ... default statement
The syntax is:
<xsl:choose>
<xsl:when test="some condition"> ... some code ... </xsl:when>
<xsl:otherwise> ... some code ... </xsl:otherwise>
</xsl:choose>
xsl:choose is often used within an xsl:for-each loop

61 xsl:sort
You can place an xsl:sort inside an xsl:for-each
The select attribute of the sort tells what field to sort on
Example:
<ul>
<xsl:for-each select="//book">
<xsl:sort select="author"/>
<li> <xsl:value-of select="title"/> by <xsl:value-of select="author"/> </li>
</xsl:for-each>
</ul>
–This example creates a list of titles and authors, sorted by author
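As a rough Python analogue of sorting inside a loop (again an emulation with `xml.etree.ElementTree`, not XSLT), the books are ordered by the author field before the "title by author" items are produced. The sample data and the function name `titles_by_author` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data invented for this sketch.
BOOKS = """
<library>
  <book><title>Mort</title><author>Terry Pratchett</author></book>
  <book><title>Emma</title><author>Jane Austen</author></book>
  <book><title>Sourcery</title><author>Terry Pratchett</author></book>
</library>
"""

def titles_by_author(xml_text):
    """List 'title by author' strings, sorted on the author field."""
    root = ET.fromstring(xml_text)
    # The sort key plays the role of the sort's select attribute.
    books = sorted(root.iter("book"), key=lambda b: b.findtext("author"))
    return [f"{b.findtext('title')} by {b.findtext('author')}" for b in books]

items = titles_by_author(BOOKS)
print(items)
```

The key is the whole author string, so names sort by their first character as written; Python's sort is stable, so books by the same author keep their document order.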

62 xsl:text
<xsl:text>...</xsl:text> helps deal with two common problems:
–XSL isn't very careful with whitespace in the document
This doesn't matter much for HTML, which collapses all whitespace anyway (though the HTML source may look ugly)
<xsl:text> gives you much better control over whitespace; it acts like the <pre> element in HTML
–Since XML defines only five entities, you cannot readily put other entities (such as &nbsp;) in your XSL
&amp;nbsp; almost works, but &nbsp; is visible on the page
Here's the secret formula for entities: <xsl:text disable-output-escaping="yes">&amp;nbsp;</xsl:text>

63 Creating tags from XML data
Suppose the XML contains a page name (Dr. Abolhassani's Home Page) and its URL (http://sharif.edu/~abolhassani), each in its own element
And you want to turn this into a link whose text is Dr. Abolhassani's Home Page and whose target is that URL
We need additional tools to do this!

64 Creating tags--solution 1
Suppose the XML contains the page name Dr. Abolhassani's Home Page and the URL http://sharif.edu/~abolhassani
<xsl:attribute name="..."> adds the named attribute to the enclosing tag
The value of the attribute is the content of this tag
Result: a link with the text Dr. Abolhassani's Home Page

65 Creating tags--solution 2
Suppose the XML contains the page name Dr. Abolhassani's Home Page and the URL http://sharif.edu/~abolhassani
An attribute value template (AVT) consists of braces { } inside the attribute value
The content of the braces is replaced by its value
Result: a link with the text Dr. Abolhassani's Home Page
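Both solutions compute an attribute value from the source data. A Python analogue with `xml.etree.ElementTree` (an emulation, not XSLT) is sketched below. The source element names `page`, `name` and `url`, and the choice of an HTML `a` element with an `href` attribute as the target, are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Element names are assumed; the values come from the slide.
source = ET.fromstring(
    "<page><name>Dr. Abolhassani's Home Page</name>"
    "<url>http://sharif.edu/~abolhassani</url></page>"
)

a = ET.Element("a")
# The attribute value is computed from the source tree, which is what
# xsl:attribute or an attribute value template does in a stylesheet.
a.set("href", source.findtext("url"))
a.text = source.findtext("name")

link = ET.tostring(a, encoding="unicode")
print(link)
```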

66 Modularization
Modularization--breaking up a complex program into simpler parts--is an important programming tool
– In programming languages modularization is often done with functions or methods
– In XSL we can do something similar with xsl:apply-templates
For example, suppose we have a DTD for book with parts titlePage, tableOfContents, chapter, and index
– We can create separate templates for each of these parts

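67 Book example
A stylesheet with a separate template for each part of book; the template for tableOfContents outputs the heading "Table of Contents" and then applies templates to its children. Etc.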
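One possible shape for such a stylesheet is sketched below. It uses the part names from the previous slide (titlePage, tableOfContents, chapter, index); the HTML produced for each part is an assumption, since the slide's own markup is not preserved.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Root template: wraps everything and delegates to the
       templates for the individual parts. -->
  <xsl:template match="/">
    <html>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="titlePage">
    <h1><xsl:value-of select="."/></h1>
  </xsl:template>

  <xsl:template match="tableOfContents">
    <h2>Table of Contents</h2>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="chapter">
    <div><xsl:apply-templates/></div>
  </xsl:template>

  <xsl:template match="index">
    <h2>Index</h2>
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
```

Each template handles only its own part, so a change to how chapters are rendered, for example, does not touch the other templates.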

68 xsl:apply-templates
The <xsl:apply-templates> element applies a template rule to the current element or to the current element’s child nodes
If we add a select attribute, it applies the template rule only to the child that matches
If we have multiple <xsl:apply-templates> elements with select attributes, the child nodes are processed in the same order as the <xsl:apply-templates> elements

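69 Applying templates to children
The XML describes a book with the title "XML" and the author "Gregory Brill"
One template outputs the title and then applies templates to the author; a second template, matching the author, outputs "by" followed by the author's name
With the <xsl:apply-templates> line: XML by Gregory Brill
Without this line: XML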
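The example on this slide can be reconstructed roughly as follows. The element names book, title and author are assumptions; only the text content ("XML", "Gregory Brill", "by") survives in the transcript.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Assumed input (element names are hypothetical):
       <book>
         <title>XML</title>
         <author>Gregory Brill</author>
       </book> -->
  <xsl:template match="/">
    <html>
      <body>
        <b><xsl:value-of select="/book/title"/></b>
        <!-- "This line": hands the author element to its own
             template. Remove it and only the title is output. -->
        <xsl:apply-templates select="/book/author"/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="/book/author">
    <xsl:text> by </xsl:text>
    <i><xsl:value-of select="."/></i>
  </xsl:template>
</xsl:stylesheet>
```

Because the root template never applies templates to anything else, the author template runs only when the xsl:apply-templates line is present.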

70 Tools for XSL Development
There are a number of free and commercial XSL tools available
– XSLT processors:
  MSXML, which currently supports the latest XSLT specification (native Win32)
  Xalan from Apache (C++, Java)
– Editors and browsers:
  Internet Explorer 6.0
  XML Spy (commercial)

71 Cocoon
Cocoon is Apache’s dynamic XML publishing framework.
Cocoon uses XSLT.
Cocoon allows separation of content, logic and presentation, so that people can collaborate on a project without stepping on each other's toes, and it supports component-based web development.
Cocoon is a web application that runs using Apache Tomcat (Cocoon.war).

72 What Cocoon can do

73 Cocoon Pipeline Cocoon introduced the idea of a pipeline to handle a request. A pipeline is a series of steps for processing a particular kind of content.

74 Sitemap In Cocoon, the configuration of the pipelines that an application requires is defined in a file called the sitemap (sitemap.xmap).
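A pipeline in a sitemap might look roughly like the fragment below, written in Cocoon 2.x sitemap syntax. It is a sketch only: the URL pattern and the file names (hello.html, hello.xml, hello2html.xsl) are hypothetical, and the component declarations a complete sitemap needs are omitted on the assumption that they are inherited from a parent sitemap.

```xml
<?xml version="1.0"?>
<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">
  <map:pipelines>
    <map:pipeline>
      <!-- A request for hello.html is handled by these steps in order. -->
      <map:match pattern="hello.html">
        <!-- Step 1: read the source XML (hypothetical file). -->
        <map:generate src="hello.xml"/>
        <!-- Step 2: transform it with an XSLT stylesheet (hypothetical file). -->
        <map:transform src="hello2html.xsl"/>
        <!-- Step 3: serialize the result as HTML for the browser. -->
        <map:serialize type="html"/>
      </map:match>
    </map:pipeline>
  </map:pipelines>
</map:sitemap>
```

The generate, transform, serialize sequence is the series of processing steps that the pipeline slide describes.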

75 References
Specifications:
– http://www.w3.org/Style/XSL
– http://www.w3.org/TR/xslt
– http://www.w3.org/TR/xpath
– http://www.w3.org/TR/xsl
An excellent XSLT tutorial:
– http://www.cafeconleche.org/books/bible2/chapters/ch17.html
Another tutorial:
– http://www.w3schools.com/xsl
Microsoft (MSXML3):
– http://msdn.microsoft.com/xml
Saxon:
– http://saxon.sourceforge.net/
Xalan:
– http://xml.apache.org./xalan/overview.html

76 Extended document standards
You can define your own XML tag sets, but here are some already available:
– XHTML: HTML redefined in XML
– SMIL: Synchronized Multimedia Integration Language
– MathML: Mathematical Markup Language
– SVG: Scalable Vector Graphics
– DrawML: Drawing MetaLanguage
– ICE: Information and Content Exchange
– ebXML: Electronic Business with XML
– cXML: Commerce XML
– CBL: Common Business Library

