Delivering textual resources. Overview Getting the text ready – decisions & costs Structures for delivery Full text Marked-up Image and text Indexed How.

Presentation transcript:

Delivering textual resources

Overview
Getting the text ready: decisions and costs
Structures for delivery:
  Full text
  Marked-up
  Image and text
  Indexed
How-to guidance for:
  Rekeying
  OCR

Getting the text ready: decisions
Choices:
  Full text: every character and word is searchable, viewable and reusable in digital form
  Marked-up: as above, but with markup added to enable structured searches and use (e.g. XML, SGML)
  Image and text: an image is all the viewer sees; the text is fully searchable but is not seen or reusable
  Indexed: images/files attached to an index or catalogue

Getting the text ready: costs
  Full text: generally expensive in time and resources, but this depends on the source; very cheap for born-digital material
  Marked-up: usually the most expensive, because skilled staff are needed for intellectual content markup, although some automated systems exist for format-based markup
  Image and text: comparatively cheap, but with some usability downsides
  Indexed: great if an index or catalogue already exists and files can simply be linked to records (e.g. MARC)

Full text
  Files (e.g. PDF, Word)
  Formatted text (e.g. HTML)
  Fully searchable
  Reusable: copy, edit, share
  Very high accuracy, i.e. users expect 100%
  Unstructured searches
  Results can be overwhelming
  Born digital: reformatting for delivery needs to be considered

Markup
  Advantage of structured search and use
  Complex to create specifications and workflow from scratch
  Delivery requires a description of the codes, rules and documents used
  Most projects will adapt a scheme that already exists:
    TEI: Text Encoding Initiative
    EAD: Encoded Archival Description
  Some automation is possible, and some system solutions enable this

Markup: examples
Thomas Knight was indicted for the wilful murder of Robert Ball. He stood charged on the Coroner's inquest for manslaughter, September 7.
Michael Ball. The deceased Robert Ball was my son; he was a clock-case maker; the prisoner and he had been fighting some time; he stood up against a wall; I said, Robert, will you fight any more? he said, yes; they fought again. I saw but little of it.

Markup
Two forms commonly used:
Layout and structure based (format)
Thomas Knight was indicted for the wilful murder of Robert Ball. He stood charged on the Coroner's inquest for manslaughter, September 7.
Michael Ball. The deceased Robert Ball was my son; he was a clock-case maker; the prisoner and he had been fighting some time; he stood up against a wall; I said, Robert, will you fight any more? he said, yes; they fought again. I saw but little of it.

Markup
Content based (function)
Thomas Knight was indicted for the wilful murder of Robert Ball. He stood charged on the Coroner's inquest for manslaughter, September 7.
Michael Ball. The deceased Robert Ball was my son; he was a clock-case maker; the prisoner and he had been fighting some time; he stood up against a wall; I said, Robert, will you fight any more? he said, yes; they fought again. I saw but little of it.
The two forms can be combined to deliver function and format at the same time
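A minimal sketch of the difference between the two forms, using a shortened version of the trial passage. The element and attribute names (`p`, `hi`, `person`, `role`, `offence`, `occupation`) are assumptions chosen for illustration; they are not taken from the slides or from any published tag set. Format markup records only how the text is laid out, so a structured query for the defendant finds nothing, while content markup answers it directly.

```python
import xml.etree.ElementTree as ET

# Format markup: records layout only (paragraphs, italics).
# Element names are illustrative, not a published tag set.
FORMAT_XML = """<text>
  <p>Thomas Knight was indicted for the wilful murder of Robert Ball.</p>
  <p><hi rend="italic">Michael Ball.</hi> The deceased Robert Ball was my son;
  he was a clock-case maker.</p>
</text>"""

# Content markup: records what each piece of text is.
CONTENT_XML = """<trial>
  <p><person role="defendant">Thomas Knight</person> was indicted for the wilful
  <offence>murder</offence> of <person role="victim">Robert Ball</person>.</p>
  <p><person role="witness">Michael Ball</person>. The deceased Robert Ball was my son;
  he was a <occupation>clock-case maker</occupation>.</p>
</trial>"""


def people_with_role(xml_text, role):
    """Structured search: names of all <person> elements with the given role."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("person") if el.get("role") == role]


def plain_text(xml_text):
    """Strip all tags and normalise whitespace, leaving the full text."""
    return " ".join("".join(ET.fromstring(xml_text).itertext()).split())


print(people_with_role(CONTENT_XML, "defendant"))  # ['Thomas Knight']
print(people_with_role(FORMAT_XML, "defendant"))   # []
print(plain_text(CONTENT_XML))
```

Either version still yields the same full text once the tags are stripped, which is why marked-up text keeps all the advantages of full text.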

Markup languages
  Markup is a language, not a programming tool
  All use tags or elements: software interprets those tags for display purposes and/or for search and retrieval
  Allows users (or communities of users) to create their own tag sets
  Markup can encode both logical and physical features of text

Markup languages
  SGML: Standard Generalised Markup Language (ISO standard, 1986)
    Parent of HTML and XML
  HTML: Hypertext Markup Language (first described in 1991)
    Markup of physical features of articles to enable Internet sharing of content; it is about the format of content
  XML: Extensible Markup Language (W3C Recommendation, 1998)
    "SGML lite", to enable generic Web use of powerful SGML features; it is about the function of content

XML: bits and pieces
  XML content (.xml)
  XML rules (.dtd)
    Schemas, e.g. TEI, METS
    DTDs = Document Type Definitions
  Namespaces (used when you want to combine sets of rules in a single document)

DTD explained
  A DTD is the formal definition of the elements, structures and rules for marking up a given type of XML document
  Think of it as an abstraction of the document structure:
    What tags and elements must/can be used
    How these tags and elements are structured in relation to each other
  Allows Internet browsers and other software to understand how to interpret XML content
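As an illustration, here is a small document that carries its DTD in an internal subset. The element names (`trial`, `defendant`, `charge`, `witness`) are hypothetical. Python's standard-library parser reads the declarations but does not validate against them, so the check function is a hand-written stand-in for what a validating parser would enforce from the content model `(defendant, charge, witness+)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document type: a trial has one defendant, one charge,
# then one or more witnesses, in that order.
DOCUMENT = """<!DOCTYPE trial [
  <!ELEMENT trial (defendant, charge, witness+)>
  <!ELEMENT defendant (#PCDATA)>
  <!ELEMENT charge (#PCDATA)>
  <!ELEMENT witness (#PCDATA)>
]>
<trial>
  <defendant>Thomas Knight</defendant>
  <charge>murder</charge>
  <witness>Michael Ball</witness>
</trial>"""


def follows_trial_model(xml_text):
    """Hand-written check mirroring the DTD content model above.

    The standard-library parser is non-validating, so this function
    only imitates the rule a validating parser would apply.
    """
    root = ET.fromstring(xml_text)
    names = [child.tag for child in root]
    return (
        root.tag == "trial"
        and len(names) >= 3
        and names[:2] == ["defendant", "charge"]
        and all(name == "witness" for name in names[2:])
    )


print(follows_trial_model(DOCUMENT))  # True

# Removing the required <charge> element breaks the content model.
broken = DOCUMENT.replace("<charge>murder</charge>", "")
print(follows_trial_model(broken))    # False
```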

XML: further bits and pieces
  Entities (.ent)
    Reusable data inside a DTD or within markup
    Think of entities as variables that can be used to define common text (e.g. copyright information). You can then use the entity anywhere you would normally use the text.
  Display (.css and .xsl)
    Cascading Style Sheets (.css)
    Extensible Stylesheet Language (.xsl)
      Used for transforming data to another structure
      Used for formatting objects
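A short sketch of an entity used as a variable. The entity name `rights`, its text and the `page` document are invented for the example; the entity is declared in the document's internal subset here rather than in a separate .ent file, so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

# The entity is declared once and can then be referenced anywhere
# the text would normally appear.
DOCUMENT = """<!DOCTYPE page [
  <!ENTITY rights "Copyright Example Library. All rights reserved.">
]>
<page>
  <body>Transcribed text of the page.</body>
  <footer>&rights;</footer>
</page>"""

root = ET.fromstring(DOCUMENT)
footer = root.find("footer").text

# The parser has replaced the reference with the declared text.
print(footer)
```

Changing the declaration changes every place the entity is referenced, which is the point of treating common text as a variable.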

Image and text
  The image is delivered; the text is fully searchable but not viewable
  Text is usually created by uncorrected OCR
  Different ways to do this:
    Use a PDF document with image and text
    Deliver an image with text that has been extracted to a searchable database, e.g. JSTOR
    Deliver an image with text that has very basic markup (possibly just pages defined) and is searched as XML
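A rough sketch of the second approach, where extracted text sits in a searchable index and only the page image is shown to the user. The file names and OCR text are made up, and a real service would use a database or search engine rather than an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical uncorrected OCR output, keyed by page-image file name.
OCR_PAGES = {
    "page_001.tif": "Thomas Knight was indicted for the wilful murder",
    "page_002.tif": "The deceased Robert Ball was my son",
}


def build_index(pages):
    """Map each lower-cased word to the set of page images containing it."""
    index = defaultdict(set)
    for image, text in pages.items():
        for word in text.lower().split():
            index[word.strip(".,;:?!")].add(image)
    return index


def search(index, word):
    """Return the page images to display for a one-word query."""
    return sorted(index.get(word.lower(), set()))


INDEX = build_index(OCR_PAGES)
print(search(INDEX, "Ball"))  # ['page_002.tif']
print(search(INDEX, "was"))   # ['page_001.tif', 'page_002.tif']
```

The user never sees the OCR text itself, so its errors affect which pages are found rather than what is read.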

Indexed
  Basically just linking text or document files to a subject index or resource catalogue
  Makes sense and is low cost where the index resource already exists
  Not so good if the index/catalogue has to be created, as this part is costly; in that circumstance XML might be better
  Delivered as a link within the index/catalogue that directs the user to the single text/document file
  Often used with MARC records or museum Content Management Systems

How-to guidance: Rekeying
  Single rekeying: one pass with checks. Generally 99.5% accurate
  Double rekeying: keyed twice, differences checked. Generally 99.99% accurate
  Rekeyers should key what they see, not what they think!
  Assume they know nothing
  Textual layout and structure provide clues for rekeyers
  Detail all the variations, special characters and spellings that you can
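The "differences checked" step of double rekeying can be sketched as a word-level comparison of the two keyed versions, and the accuracy figures can be turned into expected error counts. The sample keyings and the 100,000-character document size are assumptions for illustration only.

```python
import difflib


def keying_differences(first, second):
    """Return (first, second) word runs where two keyings disagree."""
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def expected_errors(characters, accuracy_percent):
    """Expected number of wrong characters at a given accuracy."""
    return characters * (100 - accuracy_percent) / 100


first = "he was a clock-case maker; the prisoner and he had been fighting"
second = "he was a clock-case maker; the prisoner and he had bean fighting"
print(keying_differences(first, second))  # [('been', 'bean')]

# For a hypothetical 100,000-character text: about 500 errors at 99.5%
# (single rekeying) against about 10 at 99.99% (double rekeying).
print(round(expected_errors(100_000, 99.5)))
print(round(expected_errors(100_000, 99.99)))
```

Only the flagged pairs need a human decision, which is how a second keying pass raises accuracy without a full proofread.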

How-to guidance: Rekeying
Example from the handout
Note:
  the detail
  the variations
  quality assurance

How-to guidance: OCR
Handout
Note the need to understand the nature of the document:
  nature of original
  nature of printing
  language
  uniformity
  text alignment
  complexity of alignment
  lines, graphics and pictures
  handwriting

OCR Quiz
  Look at the 4 examples on screen
  Make a note of any features you think might affect OCR accuracy
  Have a guess at what you think the accuracy, in % terms, might be
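One common way to put a number on OCR accuracy "in % terms" is character accuracy: the share of characters in a checked transcription that the OCR output gets right, counted with edit distance. This is a sketch of that measure, not necessarily the one used in the quiz, and the sample strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def character_accuracy(truth, ocr):
    """Percentage of the checked text that the OCR output reproduces."""
    if not truth:
        return 100.0 if not ocr else 0.0
    errors = edit_distance(truth, ocr)
    return max(0.0, 100.0 * (1 - errors / len(truth)))


# Typical OCR confusions: "u" read as "n", "m" read as "rn".
truth = "wilful murder"
ocr = "wilfnl rnurder"
print(edit_distance(truth, ocr))                 # 3
print(round(character_accuracy(truth, ocr), 1))  # 76.9
```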