Documents and Document Modeling. Week 12 Lecture notes. INF 380E: Perspectives on Information


1 Documents and Document Modeling
Week 12 Lecture notes
INF 380E: Perspectives on Information

2 The Computational Perspective (review)
Computing systems let us store and manipulate information.
The only thing a computing system can do is follow instructions.
– An algorithm is a set of instructions: an ordered set of unambiguous, executable steps that defines a terminating process.
– A program is a representation of an algorithm encoded for use by a computing system.

3 Abstraction
“distinction between the external properties of an entity and the details of the entity’s internal composition”
Abstraction allows development of algorithmic processing for classes of objects.
We can define a data structure that encodes information, and algorithms to process and manipulate that information.

4 Abstraction
How do we decide on a general data structure for some variety of information object?
We determine the important, essential features that we need to represent
– in order to achieve our desired functions
and define an abstract class that generalizes over the individual objects.

5 Abstraction for document processing
Document processing systems use computers to manipulate and process information in documents
– frequently using XML and similar technologies
Document processing relies on document modeling
– “A set of techniques for designing systems and representing information in order to make more efficient and more functional the creation, management, and exploitation of document-like content”

6 Early Text Processing
Files with text and markup:
.pa odd;.font Times;.size 14;.it;.ce;.in +5 -5;.sk 3p b ;.sk 2p a;.kp next;.toc include; Assembling Your Silex […]
were “batch” processed, creating formatted text:
Assembling Your Silex

7 Abstracting from the specific
Phase 1: Simple Macros
– The text and macro “call” in the main source file
:format17;Assembling Your Silex [...]
– The macro “expansion” in a (possibly) separate location
format17 = { “.pa odd;.font Times;.size 14;.it;.ce;.in +5 -5;.sk 3p b ;.sk 2p a;.kp next;.toc include” }

8 to the general
Phase 2: Descriptive Markup
– The text and descriptive markup in the main source file
:title;Assembling Your Silex [...]
– The macro “expansion” in a (possibly) separate location
title = { “.pa odd;.font Times;.size 14;.it;.ce;.in +5 -5;.sk 3p b ;.sk 2p a;.kp next;.toc include” }

9 Indirection
Indirection in information systems allows for greater modularity
– e.g. separating the formatting instructions from the textual components they apply to
In this case, it was a first step towards recognizing the "document genre" and defining an abstract class around it.
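
The indirection described on slides 7 to 9 can be sketched in a few lines of Python. This is only an illustration, not the behavior of any historical text formatter: the `MACROS` table and the `expand` function are hypothetical names, and the `:name;text` call syntax is taken from the slides.

```python
# Hypothetical macro table: a descriptive name stands in for formatting commands.
MACROS = {
    "title": ".pa odd;.font Times;.size 14;.it;.ce;.in +5 -5;.sk 3p b;.sk 2p a;.kp next;.toc include",
}


def expand(line: str, macros: dict) -> str:
    """Replace a leading ':name;' call with the commands that name stands for."""
    if line.startswith(":") and ";" in line:
        name, text = line[1:].split(";", 1)
        if name in macros:
            return f"{macros[name]};{text}"
    # Lines without a known macro call pass through unchanged.
    return line


source = ":title;Assembling Your Silex"
print(expand(source, MACROS))
```

Changing how every title looks now means editing one entry in `MACROS`; the source text, which only says "this is a title", stays untouched.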

10 Typical text components
title
author
date
abstract
section, subsection, subsubsection
section title, subsection title … etc
paragraph
extract (long quotation)
equation
diagram
footnote

11 Genre-specific text components
Playscripts: act, scene, stage direction, line, character, cast list.
Poetry: title, author, verse, stanza, couplet, line, half-line
Scientific article: title, author, affiliation, address, date submitted, date revised, keywords, abstract, introduction, methodology, results, discussion, conclusion, diagram, equation, plate, graph, chart, bibliography, bibliography item, .... date

12 Document modeling
Analysis: Conceptually distinguish the “logical” structure of a document from its appearance.
– Sometimes called “document analysis”
Markup: Identify the logical components of a document with descriptive markup (“tags”) from a given markup vocabulary.
To exploit the marked-up document…
– Develop and apply a stylesheet that associates each markup tag (type) with the appropriate processing instructions
– Or invoke or adapt an existing stylesheet
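
As a rough sketch of the last step, a stylesheet can be thought of as a table from tag names to processing rules. The example below uses Python's standard `xml.etree.ElementTree`; the `STYLESHEET` table, the `render` function, the element names, and the sample document (including its author) are all made up for illustration and do not correspond to any real stylesheet language.

```python
import xml.etree.ElementTree as ET

# Hypothetical stylesheet: each tag (type) is associated with a formatting rule.
STYLESHEET = {
    "title": lambda text: text.upper(),
    "author": lambda text: f"by {text}",
    "para": lambda text: f"    {text}",
}


def render(xml_text, stylesheet):
    """Apply the rule for each top-level component, in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root:
        # Tags with no rule fall back to plain, unformatted text.
        rule = stylesheet.get(element.tag, lambda text: text)
        lines.append(rule((element.text or "").strip()))
    return lines


doc = (
    "<article><title>Assembling Your Silex</title>"
    "<author>A. Writer</author><para>Unpack the parts.</para></article>"
)
for line in render(doc, STYLESHEET):
    print(line)
```

Swapping in a different table gives a different presentation of the same marked-up document, which is the point of separating logical structure from appearance.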

13 Document modeling activity
Form a group with someone you didn't work with last week.
Identify the content objects / textual components of a recipe.
Do they consist of:
– other components
– character data
– a mixture of character data and other components
How do they occur?
– exactly once
– zero or once
– one or more times
– zero, one, or more times
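
The four occurrence patterns in the activity correspond to the occurrence indicators used in DTD content models (no indicator, `?`, `+`, `*`). A minimal Python sketch of checking them follows; the `RECIPE_MODEL` shown is an assumed answer chosen for illustration, not the model from the lecture.

```python
# DTD occurrence indicators mapped to (minimum, maximum) counts; None = unbounded.
OCCURRENCE = {
    "": (1, 1),      # exactly once
    "?": (0, 1),     # zero or once
    "+": (1, None),  # one or more times
    "*": (0, None),  # zero, one, or more times
}


def occurs_ok(count, indicator):
    """Return True if a component appearing `count` times satisfies the indicator."""
    low, high = OCCURRENCE[indicator]
    return count >= low and (high is None or count <= high)


# Hypothetical recipe model: component name -> occurrence indicator.
RECIPE_MODEL = {"title": "", "yield": "?", "ingredient": "+", "note": "*"}


def check(counts, model):
    """Check how often each component occurs against the model."""
    return {name: occurs_ok(counts.get(name, 0), ind) for name, ind in model.items()}


print(check({"title": 1, "ingredient": 3}, RECIPE_MODEL))
```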

14 <!DOCTYPE recipe [ ]>

15 Two kinds of abstraction in document modeling
Genre level
– Determining the "document model" for some type of documents
– Results in a schema -- an XML vocabulary for marking up members of that document genre
Document level
– Determining what documents are in general
– Results in the high-level XML data structure – a tree (a kind of directed acyclic graph)

16 Data structures
Generally, “the conceptual shape or arrangement of data”.
You want a data structure that
– fits the conceptual shape of your information
– lets you access and manipulate information according to your needs

17 The best abstraction is one that captures what the thing really is
"No hardware improvements or programming ingenuity can completely overcome a flawed representation."
We need representations that will aid us in “collecting, preserving, organizing (arranging), representing (describing), selecting (retrieving), reproducing (copying), and disseminating documents”.

18 The OHCO Vision of What Text is
Text is an Ordered Hierarchy of Content Objects, the grammar of which is determined by genre
– content objects = things such as chapters, paragraphs, sentences, stanzas, lines, speeches, equations, titles, headings, abstracts
– hierarchy = sentences inside paragraphs, paragraphs inside sections, sections inside chapters, etc., nesting with no overlaps
– ordered = objects precede or follow one another
Formal structure: tree with ordered branches
– syntax expressible with the context-free generative grammars developed in linguistics

19 The Two Things in the XML World
Document Instances
– particular documents, marked up with a markup language
Schemas
– One for each document type (class, category, genre)
– Often playing the role of data structure standard
– Defines a markup language for document structures by specifying its vocabulary and syntax (grammar), including: what elements can occur in documents of a particular type, what patterns these elements may form, and what other information can be included about these elements
Rules for applying the markup to documents are not a formal part of XML per se, but are... which kind of standard?
– (from Gilliland...)

20 Documents and Languages
An XML document is like a sentence in a formal language.
The schema (e.g. DTD) is a formal grammar for the language.
– DTDs are based on BNF meta-grammars.
– They define “context free grammars” (type 2 in the Chomsky hierarchy).
– If a document conforms to the schema it is in the language, otherwise it isn’t.
An XML document is a linearized parse tree
– using a form of the “labeled nested bracket” linearization technique.
Grammars are a technique for information modeling – for describing schemas and instances of schemas – that is well-suited for documents and text.
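
To make the "linearized parse tree" idea concrete, here is a small Python sketch that rewrites an XML document in the labeled bracket notation often used for parse trees in linguistics. The `to_brackets` function and the sample poem are illustrative assumptions; the parsing itself uses the standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET


def to_brackets(element):
    """Serialize an element as [label child child ...], keeping document order."""
    parts = [element.tag]
    if element.text and element.text.strip():
        parts.append(f'"{element.text.strip()}"')
    for child in element:
        parts.append(to_brackets(child))
        # Text that follows a child element still belongs to the parent.
        if child.tail and child.tail.strip():
            parts.append(f'"{child.tail.strip()}"')
    return "[" + " ".join(parts) + "]"


poem = ET.fromstring(
    "<poem><title>Silex</title><line>Assemble <italic>with care</italic></line></poem>"
)
print(to_brackets(poem))
# prints: [poem [title "Silex"] [line "Assemble" [italic "with care"]]]
```

The start and end tags of XML play the same role as the labeled opening bracket and the closing bracket: both are ways of writing a tree as a flat string.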

21 XML languages
XML is a “meta-grammar” that lets you define document markup languages.
The elements of an XML vocabulary are based on abstractions of the specific components that appear in some genre of document.
A document model is an abstract conceptualization of a class of documents
– it identifies the possible components of a document and the relationships those components may have.
A schema (small "s") is (more or less) a formal specification of a document model in a particular document model specification language.
EAD is an XML language
– with elements that are based on the components of a finding aid
TEI is an XML language
EML (Ecological Metadata Language) is an XML language

22 XML data structure
The underlying data structure for an XML document is a “tree”
– a directed acyclic graph
– with ordered branches
This hierarchical structure can be parsed to check validity against a grammar
– the grammar is an abstraction of a class of documents
Good for documents or data with hierarchical structure
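
A short Python sketch shows the tree that a parser builds from an XML document. It uses the standard `xml.etree.ElementTree` module; the `outline` function and the sample article are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Walk the tree depth-first, returning (depth, tag) pairs in document order."""
    rows = [(depth, element.tag)]
    for child in element:  # children are visited in their original order
        rows.extend(outline(child, depth + 1))
    return rows


doc = ET.fromstring(
    "<article><title/><section><title/><para/><para/></section></article>"
)
for depth, tag in outline(doc):
    print("  " * depth + tag)
```

Each element has exactly one parent and an ordered list of children, so the structure is a tree with ordered branches rather than an arbitrary graph.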

23 XML schemas

24 Validity and Well-formedness
Document instances can be
Well-Formed
– elements must be bounded by both start and end tags
– elements must nest, no overlaps
– attribute values must be quoted
– attribute/value assignments must not be “minimized”
– all “<” and “&” in content must be escaped
Valid
– Document instance correctly matches the rules given in a schema
– "If documents are of known types, a special-purpose program (called a parser), once provided with an unambiguous definition of a document type, can check that any document claiming to be of that type does in fact conform to the specification."
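
Well-formedness can be checked without any schema. The Python sketch below relies on the standard `xml.etree.ElementTree` parser, which raises `ParseError` on input that is not well-formed; the `is_well_formed` helper is a hypothetical name. Note that this parser does not validate against a DTD, so the sketch says nothing about validity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = [
    "<a><b>x</b></a>",    # properly nested
    "<a><b>x</a></b>",    # overlapping elements
    "<a href=x></a>",     # unquoted attribute value
    "<a>1 < 2</a>",       # unescaped "<" in content
    "<a>1 &lt; 2</a>",    # escaped "<" in content
]
for sample in samples:
    print(sample, is_well_formed(sample))
```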

25 Example DTD for poems
<!ELEMENT author (#PCDATA)>
<!ELEMENT line (#PCDATA | italic | persname)*>
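
The two declarations above can be read as rules about which child elements may appear where: an author contains only character data, while a line may mix character data with italic and persname elements. The following Python sketch hand-codes just those two rules as a lookup table. It is a simplification for illustration, not a DTD validator; `CONTENT_MODELS`, `violations`, and the sample poems are all assumptions.

```python
import xml.etree.ElementTree as ET

# Allowed child elements per element type; character data is allowed in both.
CONTENT_MODELS = {
    "author": set(),                 # character data only
    "line": {"italic", "persname"},  # character data mixed with these elements
}


def violations(root):
    """List child elements that the hand-coded content models do not allow."""
    problems = []
    for element in root.iter():
        allowed = CONTENT_MODELS.get(element.tag)
        if allowed is None:
            continue  # no rule for this element in the sketch, so it is not checked
        for child in element:
            if child.tag not in allowed:
                problems.append(f"<{child.tag}> not allowed in <{element.tag}>")
    return problems


good = ET.fromstring(
    "<poem><author>Anon</author>"
    "<line>To <persname>Ann</persname>, <italic>always</italic></line></poem>"
)
bad = ET.fromstring("<poem><author>A<italic>n</italic>on</author></poem>")
print(violations(good))
print(violations(bad))
```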

26 XML Processing
[Diagram. Inputs: Schema (DTD, XSD, RNG, etc.); Document Instance; Other information (e.g. stylesheet). Components: XML Parser; XML Application (e.g. formatter). Parser results: Well-formedness: Yes|No; Validity: Yes|No; Errors; expanded and reorganized parsed data. Output: e.g. formatted output.]

27 Discussion break: interoperability
Find a partner or two and discuss:
Recall the levels of interoperability
– if I send you a well-formed XML document, which level have we achieved?
– if I send you a schema and a valid XML document, which level have we achieved?

28 What is a document, though?
While DeRose, et al. look inside documents to ask "the question of essentials"
– "What is it which, if changed, makes a document essentially different, and what is it which can change, yet a document remains 'the same'?"
Buckland looks outside to ask how documents have been treated as an object of study.
– Just what is it that we are working on when we're “collecting, preserving, organizing (arranging), representing (describing), selecting (retrieving), reproducing (copying)”?

29 A challenge: the indeterminacy of documents
Objects can be treated as informative, as evidence of some assertion
– even if they were not created for that purpose,
– or even if they were not created by people at all.

30 Discussion
Let's discuss: Recall Suzanne Briet's requirements for something to be a document.
– Do these requirements fit your own intuitions?
– Does a digital environment present particular challenges for understanding or applying them?
– What readings or concepts from earlier in the semester might we apply here?

