Processing of structured documents Spring 2003, Part 1 Helena Ahonen-Myka.

Slides:



Advertisements
Similar presentations
Mathematical Preliminaries
Advertisements

XML-XSL Introduction SHIJU RAJAN SHIJU RAJAN Outline Brief Overview Brief Overview What is XML? What is XML? Well Formed XML Well Formed XML Tag Name.
Copyright © 2003 Pearson Education, Inc. Slide 7-1 Created by Cheryl M. Hughes, Harvard University Extension School Cambridge, MA The Web Wizards Guide.
Copyright © 2003 Pearson Education, Inc. Slide 8-1 Created by Cheryl M. Hughes, Harvard University Extension School Cambridge, MA The Web Wizards Guide.
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Dr. Alexandra I. Cristea CS 253: Topics in Database Systems: XPath, NameSpaces.
22-Sep-06 CS6795 Semantic Web Techniques 0 Extensible Markup Language.
Programming Language Concepts
Data Structures Using C++
XML and Databases Exercise Session 3 (courtesy of Ghislain Fourny/ETH)
Name Convolutional codes Tomashevich Victor. Name- 2 - Introduction Convolutional codes map information to code bits sequentially by convolving a sequence.
Basic HTML Workshop LIS Web Team Spring 2007.
How to convert a left linear grammar to a right linear grammar
Factor P 16 8(8-5ab) 4(d² + 4) 3rs(2r – s) 15cd(1 + 2cd) 8(4a² + 3b²)
Squares and Square Root WALK. Solve each problem REVIEW:
XML: Extensible Markup Language
Dr. Alexandra I. Cristea XHTML.
Dr. Alexandra I. Cristea XPath and Namespaces.
25 seconds left…...
1 Let’s Recapitulate. 2 Regular Languages DFAs NFAs Regular Expressions Regular Grammars.
©Brooks/Cole, 2001 Chapter 12 Derived Types-- Enumerated, Structure and Union.
1 Decidability continued…. 2 Theorem: For a recursively enumerable language it is undecidable to determine whether is finite Proof: We will reduce the.
The Pumping Lemma for CFL’s
176 Formal Languages and Applications: We know that Pascal programming language is defined in terms of a CFG. All the other programming languages are context-free.
CS 330 Programming Languages 09 / 13 / 2007 Instructor: Michael Eckmann.
M.Sc. of Advanced Software Engineering CO7206 System Reengineering XML & AST Many Slides are by Georgios Koutsoukos.
1 COS 425: Database and Information Management Systems XML and information exchange.
XML Introduction What is XML –XML is the eXtensible Markup Language –Became a W3C Recommendation in 1998 –Tag-based syntax, like HTML –You get to make.
Chapter 3: Formal Translation Models
XML(EXtensible Markup Language). XML XML stands for EXtensible Markup Language. XML is a markup language much like HTML. XML was designed to describe.
XML –Query Languages, Extracting from Relational Databases ADVANCED DATABASES Khawaja Mohiuddin Assistant Professor Department of Computer Sciences Bahria.
Manohar – Why XML is Required Problem: We want to save the data and retrieve it further or to transfer over the network. This.
1 Introduction to Parsing Lecture 5. 2 Outline Regular languages revisited Parser overview Context-free grammars (CFG’s) Derivations.
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation Document Type Definition.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 XML Taken from Chapter 7.
XML Anisha K J Jerrin Thomas. Outline  Introduction  Structure of an XML Page  Well-formed & Valid XML Documents  DTD – Elements, Attributes, Entities.
CREATED BY ChanoknanChinnanon PanissaraUsanachote
XP 1 CREATING AN XML DOCUMENT. XP 2 INTRODUCING XML XML stands for Extensible Markup Language. A markup language specifies the structure and content of.
XML 1 Enterprise Applications CE00465-M XML. 2 Enterprise Applications CE00465-M XML Overview Extensible Mark-up Language (XML) is a meta-language that.
 XML is designed to describe data and to focus on what data is. HTML is designed to display data and to focus on how data looks.  XML is created to structure,
Processing of structured documents Spring 2002, Part 2 Helena Ahonen-Myka.
Of 33 lecture 3: xml and xml schema. of 33 XML, RDF, RDF Schema overview XML – simple introduction and XML Schema RDF – basics, language RDF Schema –
Grammars CPSC 5135.
Introduction to XML This presentation covers introductory features of XML. What XML is and what it is not? What does it do? Put different related technologies.
Lecture 16 Introduction to XML Boriana Koleva Room: C54
CMSC 330: Organization of Programming Languages Context-Free Grammars.
Jeff Ullman: Introduction to XML 1 XML Semistructured Data Extensible Markup Language Document Type Definitions.
McGraw-Hill/Irwin © 2004 by The McGraw-Hill Companies, Inc. All rights reserved. Understanding How XML Works Ellen Pearlman Eileen Mullin Programming the.
SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic.
The eXtensible Markup Language (XML). Presentation Outline Part 1: The basics of creating an XML document Part 2: Developing constraints for a well formed.
SDPL 2004Notes 2: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s.
Processing of structured documents Spring 2002, Part 1 Helena Ahonen-Myka.
1 Tutorial 14 Validating Documents with Schemas Exploring the XML Schema Vocabulary.
CS 157B: Database Management Systems II February 11 Class Meeting Department of Computer Science San Jose State University Spring 2013 Instructor: Ron.
1 A well-parenthesized string is a string with the same number of (‘s as )’s which has the property that every prefix of the string has at least as many.
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 4 1COMP9321, 15s2, Week.
Introduction to DTD A Document Type Definition (DTD) defines the legal building blocks of an XML document. It defines the document structure with a list.
XML CSC1310 Fall HTML (TIM BERNERS-LEE) HyperText Markup Language  HTML (HyperText Markup Language): December  Markup  Markup is a symbol.
SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s.
Martin Kruliš by Martin Kruliš (v1.1)1.
XML Extensible Markup Language
XML Notes taken from w3schools. What is XML? XML stands for EXtensible Markup Language. XML was designed to store and transport data. XML was designed.
1 Introduction to XML Babak Esfandiari. 2 What is XML? introduced by W3C in 98 Stands for eXtensible Markup Language it is more general than HTML, but.
XML: Extensible Markup Language
Chapter 3 – Describing Syntax
XML in Web Technologies
XML Data Introduction, Well-formed XML.
Compilers Principles, Techniques, & Tools Taught by Jing Zhang
COMPILER CONSTRUCTION
Presentation transcript:

Processing of structured documents Spring 2003, Part 1 Helena Ahonen-Myka

2 Course organization laudatur course, 2 cu lectures (in Finnish) Tue 12-14, Thu 10-12, A217 not obligatory exercise sessions Mon 16-18, Tue 14-16, C454 course assistant: Olli Lahti not obligatory project work included

3 Requirements Exam (Thu 6.3. at 16-20): 45 points Project (deadline Fri 14.3.): 15 points integrated into the exercise sessions obligatory to return a report; attending the exercise sessions voluntary Maximum of points: 60

4 Outline (preliminary) 1. Structure representations grammatical descriptions data model issues, information sets (XML DTD,) XML Schema 2. Processing, transferring XML data SAX, DOM Web services (SOAP, WSDL, UDDI)

5 Outline Traversing and querying structured documents XPath XML Query 4. XML Linking 5. Metadata: RDF

6 Prerequisites You should know the basics of XML DTD, elements, attributes, syntax XSLT (basics), formatting some programming experience is needed

7 Project work Project work is integrated into the weekly exercises A ”large” example that lets us play with the concepts and tools discussed in the course Each exercise session includes one subtask solution is discussed in the exercise session Solutions to the subtasks have to be presented as a report (written in HTML) Return a report by (as a URL; instructions are given later)

8 1. Structure descriptions Regular expressions, context-free grammars -> What is XML? (XML Document type definitions) data modelling, information sets XML Schema

9 Regular expressions A way to describe a set of strings over an alphabet (of chars, events, elements…) many uses: text searching (e.g. emacs, grep, perl) in grammatical formalisms (e.g. XML DTDs) relevant for document structures: what kind of structural content is allowed for different document components

10 Regular expressions A regular expression over alphabet  is either  (an empty set)  (epsilon; sometimes lambda ) a, where a   R | S (choice; sometimes R  S) R S (catenation) or R* (Kleene closure) where R and S are regular expressions

11 Regular expressions Regular expression E denotes a language (a set of strings) L(E): L(  ) =  (empty set) L(  ) = {  } (singleton set of empty string) L(a) = {a} (singleton set of a   ) L(R|S) = L(R)  L(S) = {w | w  L(R) or w  L(S)} L(RS) = L(R)L(S) = {xy | x  L(R) and y  L(S)} L(R*) = L(R)* = {x 1 …x n | x k  L(R), k=1,…,n; n  0}

12 Example structure of an article:  = {title, author, date, section} title followed by an optional list of authors, followed by an optional date, followed by one or more sections: title author* (date |  ) section section* common abbreviations: E? = (E |  ); E + = E E* -> title author* date? section +

L(title author* date? section+) includes: title author date section section section title section title author author section

14 Expressive power of regular expressions operations: Catenation -> sequential order Choice -> also optional parts Closure -> repetition, optional repetition Operations can be nested -> more complex expressions … but we cannot express nested structures -> context-free grammars

16 Context-free grammars Used widely for syntax specification (programming languages) G = (V, , P, S) V: the alphabet of the grammar G; V =   N  : the set of terminal symbols; N = V-  : the set of nonterminal symbols P: set of productions S  N: the start symbol

17 Productions and derivations Productions: A -> , where A  N,   V* e.g. A -> aBa (1) Let ,   V*. String  derives  directly,  => , if  =  A ,  =  for some ,   V*, and A ->  is a production of the grammar e.g. AA => AaBa (assuming prod. 1 above)

18 Language generated by a context-free grammar  derives ,  =>* , if there is a sequence of 0 or more direct derivations that transforms  to  The language generated by a CFG G: L(G) = {w   * | S =>* w} L(G) is a set of strings: to model structural elements, we consider parse trees

19 Parse trees of a CFG Aka syntax trees or derivation trees nodes labelled by symbols of V (or by  ): internal nodes by nonterminals, root by start symbol leaves using terminal symbols (or  ) parent with label A can have children labeled by X 1,…,X k only if A -> X 1 …X k is a production

20 CFGs for document structures Nonterminals represent document structures e.g. Ref -> AuthorList Title PublData AuthorList -> Author AuthorList AuthorList ->  problem: obscures the relation of elements (the last Author several hierarchical levels away from Ref) -> solution: extended CFGs

21 Extended CFGs (ECFGs) Like CFGs, but right-hand-sides of productions are regular expressions over V, e.g. Ref -> Author* Title PublData Let ,   V*. String  derives  directly,  => , if  =  A ,  =  for some ,   V*, and A -> E is a production such that   L(E) e.g. Ref => Author Author Author Title PublData

22 Language generated by an ECFG Defined similarly to CFGs Theorem: Languages generated by extended and ordinary CFGs are the same -> expressive power is the same

23 Parse trees of an ECFG Similar to parse trees of an ordinary CFG, except that… parent with label A can have children labeled by X 1,…,X k when A -> E is a production such that X 1 …X k  L(E) -> an internal node may have arbitrarily many children (e.g. Authors below a Ref node)

24 What is XML? metalanguage that can be used to define markup languages gives syntax for defining extended context free grammars (DTDs) XML documents that adhere to an ECFG are strings in that language document types (grammars)- document instances (strings in the language)

25 XML encoding of structure XML document is essentially a parenthesized linear encoding of a parse tree corresponds to a preorder walk start of inner node (element) A denoted by a start tag, end denoted by end tag leaves are content strings (or empty elements) + certain extensions (especially attributes) + certain restrictions

26 Terminal symbols in practice Leaves of parse trees are normally labeled by single characters (symbols of  ) too granular in practice for XML documents: instead, terminal symbols which stand for all values of a type e.g. #PCDATA in XML for variable length content of data characters richer data types in other XML schema formalisms

27 An example DTD <!DOCTYPE invoice [ <!ELEMENT invoice (orderDate, shipDate, billingAddress voice*, fax?)> ]>

Ashok Malhotra 123 IBM Ave. Hawthorne NY And a document:

29 Context-free vs. context- sensitive DTDs describe context-free languages e.g. element orderDate has always the same structure Some other schema declaration languages allow context-sensitive structures e.g. orderDate could be different for different products or text paragraph could have different structure restrictions in normal text and in a footnote