Chapter 6: Text and Multimedia Languages and Properties
Introduction
A document has a given syntax and structure. It also has semantics, and may have a presentation style associated with it. Figure 6.1 depicts all these relationships. A document can also have information about itself, called metadata.
The syntax of a document can express different elements, such as structure, presentation style, and semantics. One or more of these elements may be given together; for example, a structural element (e.g. a section) can have a fixed formatting style.
The syntax of a document can be implicit in its content, or expressed in a declarative language or a programming language (PL). The current trend is to use languages that provide information on document structure, format, and semantics, and that are readable by both humans and computers. SGML is one such language.
Metadata
Metadata is data about data. Metadata associated with a text includes the author, the date of publication, the source of publication, the document length (in pages, words, or bytes), and the document genre (book, article, memo). The Machine Readable Cataloging Record (MARC) is the most widely used format for library records.
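The metadata fields listed above can be pictured as a simple record. The following is a minimal sketch only: the class name `TextMetadata`, the helper `describe`, and all field names and sample values are illustrative assumptions, and the layout is not the MARC format.

```python
from dataclasses import dataclass, asdict


@dataclass
class TextMetadata:
    """Descriptive metadata for a text document (hypothetical field names)."""
    author: str
    publication_date: str  # kept as a plain string, e.g. "1999-01-01"
    source: str
    length_pages: int
    length_words: int
    length_bytes: int
    genre: str             # e.g. "book", "article", "memo"


def describe(text: str, **fields) -> TextMetadata:
    """Build a record, deriving the word and byte counts from the text."""
    return TextMetadata(
        length_words=len(text.split()),          # whitespace-separated words
        length_bytes=len(text.encode("utf-8")),  # size once encoded as UTF-8
        **fields,
    )


# Sample values are made up for illustration.
record = describe(
    "Metadata is data about data.",
    author="A. Author",
    publication_date="1999-01-01",
    source="Example Press",
    length_pages=1,
    genre="memo",
)
print(asdict(record))
```

Only the word and byte lengths are derived from the content here; the remaining fields describe the document from outside and have to be supplied.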
On the Web, metadata is used for many purposes: cataloging, content rating (e.g. to protect children from reading certain types of documents), intellectual property rights, digital signatures (for authentication), privacy levels (who should or should not have access to a document), applications to electronic commerce (EC), etc.
A new standard for Web metadata is the Resource Description Framework (RDF). RDF allows the description of Web resources. A description consists of nodes and attached attribute/value pairs. A node can be any Web resource, that is, any URI, which includes URLs. Attributes are properties of nodes, and their values are text strings or other nodes.
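The node and attribute/value model described above can be sketched in a few lines. This is an illustration of the data model only, not of RDF syntax or of any RDF library: the classes `Node` and `Description`, the example URIs, and the attribute names `creator`, `title`, and `name` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Node:
    """A Web resource, identified by its URI."""
    uri: str


# A value is either a text string or another node.
Value = Union[str, Node]


class Description:
    """A set of (node, attribute, value) statements (hypothetical helper)."""

    def __init__(self) -> None:
        self.statements: List[Tuple[Node, str, Value]] = []

    def add(self, node: Node, attribute: str, value: Value) -> None:
        """Attach one attribute/value pair to a node."""
        self.statements.append((node, attribute, value))

    def values(self, node: Node, attribute: str) -> List[Value]:
        """Return every value attached to `node` under `attribute`."""
        return [v for n, a, v in self.statements if n == node and a == attribute]


# Example URIs and attribute names are invented for illustration.
report = Node("http://example.org/report.html")
author = Node("http://example.org/staff/jane")

desc = Description()
desc.add(report, "title", "Annual Report")  # value is a text string
desc.add(report, "creator", author)         # value is another node
desc.add(author, "name", "Jane Doe")

creator = desc.values(report, "creator")[0]
print(desc.values(creator, "name"))
```

Because a value may itself be a node, descriptions can be chained: the example follows the `creator` attribute of the report to a second node and then reads that node's `name`.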
Text
With the advent of computers, it became necessary to code text in binary digits. The first coding schemes were EBCDIC and ASCII. For the internationalization of languages such as Chinese or Japanese (Kanji), 16-bit Unicode (ISO 10646) exists.
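To make the coding schemes concrete, the sketch below encodes the same two Latin letters in ASCII and in EBCDIC, and two Kanji characters in a 16-bit Unicode encoding. It assumes Python's built-in `cp037` codec as one representative EBCDIC code page and `utf-16-be` as the 16-bit encoding; the helper `to_bits` and the sample strings are illustrative choices.

```python
def to_bits(data: bytes) -> str:
    """Render bytes as space-separated groups of eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in data)


latin = "IR"
kanji = "\u6f22\u5b57"  # the two characters of the word "kanji"

ascii_bytes = latin.encode("ascii")      # one byte per character
ebcdic_bytes = latin.encode("cp037")     # EBCDIC, code page 037 (assumed variant)
utf16_bytes = kanji.encode("utf-16-be")  # two bytes per character here

print("ASCII  :", to_bits(ascii_bytes))
print("EBCDIC :", to_bits(ebcdic_bytes))
print("UTF-16 :", to_bits(utf16_bytes))

# ASCII has no code for Kanji, so the encoding attempt fails.
try:
    kanji.encode("ascii")
    ascii_can_encode_kanji = True
except UnicodeEncodeError:
    ascii_can_encode_kanji = False

print("ASCII can encode Kanji:", ascii_can_encode_kanji)
```

The same letters map to different bit patterns under ASCII and EBCDIC, which is why a stored text is only readable when its coding scheme is known.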
Text Formats
There is no single format for text documents. In the past, IR systems would convert a document to an internal format, with the drawback that the content of the document could then no longer be changed. Current IR systems have filters to handle the most popular document formats, in particular Word, WordPerfect, and FrameMaker.
Other text formats exist for document interchange. Rich Text Format (RTF) is used by word processors and has an ASCII syntax. Portable Document Format (PDF) was developed for displaying and printing documents. Multipurpose Internet Mail Extensions (MIME) is used to encode electronic mail.
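As an illustration of MIME encoding, the sketch below builds a mail message with a plain-text body and a PDF attachment using Python's standard `email` package. The addresses, subject, file name, and the placeholder bytes standing in for a PDF file are all made up for the example.

```python
from email.message import EmailMessage

# Placeholder bytes standing in for a real PDF file (not a valid PDF).
pdf_bytes = b"%PDF-1.4\n% placeholder content for illustration\n"

msg = EmailMessage()
msg["From"] = "alice@example.org"
msg["To"] = "bob@example.org"
msg["Subject"] = "Report attached"

# Plain-text body of the message.
msg.set_content("The report is attached as a PDF.")

# Adding an attachment turns the message into a multipart container;
# the binary part is labeled with its media type and base64-encoded.
msg.add_attachment(
    pdf_bytes,
    maintype="application",
    subtype="pdf",
    filename="report.pdf",
)

encoded = msg.as_string()  # the text that would travel as the mail
attachments = list(msg.iter_attachments())

print(msg.get_content_type())
print(attachments[0].get_content_type(), attachments[0].get_filename())
```

Each part carries its own content type, so a receiving program can tell the text body from the attached document and decode the attachment back to the original bytes.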