Published by Horace Blake. Modified over 8 years ago.
1
Inferring Structure Information from Typography
Christian Fuß
Dipl.-Inform. Felix Gatzemeier
Michael Kirchhof
Dipl.-Inform. Oliver Meyer
Department of Computer Science III, RWTH Aachen
Rheinisch-Westfälische Technische Hochschule Aachen, Department of Computer Science III, Prof. Dr.-Ing. M. Nagl
September 14, 2000 / Digital Documents and Electronic Publishing 2000 (DDEP 2000)
2
Overview
Context
Deriving Structure Information:
» Partitioning
» Typographic abstraction
» Determine Type
Conclusion
Cooperation project of Prototype aTool in the WEP group of the Global-Info Project (www.global-info.org)
3
Today’s Publication Chain
[Diagram; surviving labels: Author, Proprietary document format, Writing, Copy Editing, Typesetting, Publisher, Standard format, Reading, Reader, Conversion, Web Publ.]
4
Classification of Submissions
[Tree diagram; surviving labels: Submissions, TeX, MS Word, Unformatted, Formatted, Correctly Formatted, Somehow Formatted, Structured (XML)]
5
Basic Assumptions
Textual Nature
Typographic markup
Consistent markup
Known target document type
6
Deriving Structure Information
In: MS Word document
Record Formatting (Format Tuples)
Locate the Elements
Reduce Format Tuples to Patterns
Determine Types
Out: XML document
Also interactively
7
Format Tuples
The basic typographic abstraction
FormatTuple("Is this a dagger?") = [Times, 22pt, regular, roman]
Here: Font, Size, Weight, Variation
Planned: Search expressions modulo Text
More general: Including regular expressions of text content or context.
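A format tuple as described on this slide can be sketched as a small immutable record. This is only an illustration: the `run` dictionary and the `format_tuple` helper are hypothetical names, not part of the aTool prototype, and reading the properties out of an actual MS Word document is not shown.

```python
from typing import NamedTuple


class FormatTuple(NamedTuple):
    """The four components named on the slide: font, size, weight, variation."""
    font: str
    size: str
    weight: str
    variation: str


def format_tuple(run: dict) -> FormatTuple:
    """Build a format tuple from a hypothetical record of character properties."""
    return FormatTuple(run["font"], run["size"], run["weight"], run["variation"])


# Hypothetical input record; a real tool would obtain this from the document.
run = {
    "text": "Is this a dagger?",
    "font": "Times",
    "size": "22pt",
    "weight": "regular",
    "variation": "roman",
}
ft = format_tuple(run)
print(list(ft))
```

Because the tuple is hashable and compares by value, it can serve directly as a key in a Format-To-Type Map.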
8
Locate the Elements
Tree-Partitioning of Formatted Character Streams on
» Format Tuple changes
» Paragraph breaks
Nesting of Inline Elements
[Example in four steps on "Is this a dagger?"; inline markup lost in extraction, only "ft1" survives]
Format-To-Type Map:
FormatTuple                        ElementType
ft1 (times, 22pt, reg, roman)      dummyType1
ft2 (times, 22pt, bold, roman)     dummyType2
ft3 (times, 22pt, reg, italic)     dummyType3
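The partitioning step can be illustrated with a minimal sketch that splits a formatted character stream at paragraph breaks and at format tuple changes. It assumes the stream is a list of (character, format tuple) pairs with "\n" as the paragraph break; the names `styled` and `partition` are made up for this example, and the nesting of inline elements mentioned on the slide is not covered (the result is a flat list of runs per paragraph).

```python
from itertools import groupby
from typing import NamedTuple


class FormatTuple(NamedTuple):
    font: str
    size: str
    weight: str
    variation: str


def styled(text, ft):
    """Attach one format tuple to every character of text."""
    return [(ch, ft) for ch in text]


def partition(stream):
    """Split a stream of (char, FormatTuple) pairs into paragraphs of runs.

    Paragraphs end at "\n"; within a paragraph, a new run starts whenever
    the format tuple changes. Returns a list of paragraphs, each a list of
    (FormatTuple, text) runs.
    """
    paragraphs, current = [], []
    for ch, ft in stream:
        if ch == "\n":
            if current:
                paragraphs.append(current)
            current = []
        else:
            current.append((ch, ft))
    if current:
        paragraphs.append(current)

    result = []
    for para in paragraphs:
        runs = [
            (ft, "".join(ch for ch, _ in group))
            for ft, group in groupby(para, key=lambda pair: pair[1])
        ]
        result.append(runs)
    return result


ft1 = FormatTuple("times", "22pt", "reg", "roman")
ft2 = FormatTuple("times", "22pt", "bold", "roman")
ft3 = FormatTuple("times", "22pt", "reg", "italic")

stream = (
    styled("Is ", ft1) + styled("this", ft2) + styled(" a ", ft1)
    + styled("dagger", ft3) + styled("?\n", ft1) + styled("Yes.", ft1)
)
paragraphs = partition(stream)
for para in paragraphs:
    print([text for _, text in para])
```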
9
Format patterns
Identity too restrictive
Wildcard generalization

            Is        this      a dagger?   ( , , )
Font        Times     Times     Times       *
Size        22pt      22pt      22pt        *
Weight      regular   bold      regular     bold
Variation   roman     roman     roman       *

( , a, b) = (a, a, b); (a, b, ) = (a, b, b)
( , a, ) propagated to paragraph level
Format-To-Type Map:
FormatPattern                   ElementType
fp1  (*, *, regular, *)         dummyType1
fp2  (*, *, bold, *)            dummyType2
fp2b (*, *, bold, roman)        dummyType2
fp3  (*, *, regular, italic)    dummyType3
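One way to read the wildcard generalization is as a function of an element and its two neighbours: a component that agrees with both neighbours carries no information and becomes "*", while a component that differs from a neighbour is kept. The sketch below follows that reading, which is an assumption; the slide does not spell out the rule. A missing neighbour is replaced by the element itself, as in the two equations on the slide. The names `generalize` and `matches` are hypothetical, and the case where both neighbours are missing (propagation to paragraph level) is not handled.

```python
WILDCARD = "*"


def generalize(left, element, right):
    """Reduce a format tuple to a pattern relative to its neighbours.

    left/right may be None (no neighbour); they are then replaced by the
    element itself. A component equal to both neighbours becomes "*".
    """
    left = element if left is None else left
    right = element if right is None else right
    return tuple(
        WILDCARD if l == e == r else e
        for l, e, r in zip(left, element, right)
    )


def matches(pattern, format_tuple):
    """A pattern matches a tuple if every component is "*" or equal."""
    return all(p == WILDCARD or p == v for p, v in zip(pattern, format_tuple))


ft_is = ("Times", "22pt", "regular", "roman")
ft_this = ("Times", "22pt", "bold", "roman")
ft_rest = ("Times", "22pt", "regular", "roman")

pattern_this = generalize(ft_is, ft_this, ft_rest)
pattern_is = generalize(None, ft_is, ft_this)
print(pattern_this)
print(pattern_is)
print(matches(pattern_this, ("Arial", "10pt", "bold", "italic")))
```

Under this reading the pattern for "this" comes out as (*, *, bold, *), which agrees with the last column of the table, and the pattern is loose enough to match bold text in any font or size.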
10
Determine Types
Replace dummy types in Format-To-Type Map
Preconfiguration by publisher
Controlled Learning from the author
FormatPattern              ElementType
(*, *, regular, *)         Body
(*, *, bold, *)            FirstTerm
(*, *, bold, roman)        FirstTerm
(*, *, regular, italic)    Emphasis
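The type determination step might look roughly like the following sketch, which looks up each format pattern in a preconfigured map, asks the author when a pattern is unknown, and writes the result as XML. Several things here are assumptions made for illustration: the `ask_author` callback stands in for an interactive dialog, the function names are hypothetical, and treating Body as the paragraph-level element with the other types nested inline is a guess at the target document type.

```python
import xml.etree.ElementTree as ET


def determine_type(pattern, type_map, ask_author):
    """Return the element type for a pattern, learning unknown ones."""
    if pattern not in type_map:
        # Controlled learning: the author supplies the type once,
        # and the map remembers it for later occurrences.
        type_map[pattern] = ask_author(pattern)
    return type_map[pattern]


def to_xml(runs, type_map, ask_author, paragraph_type="Body"):
    """Build one paragraph element from (pattern, text) runs."""
    para = ET.Element(paragraph_type)
    last_child = None
    for pattern, text in runs:
        element_type = determine_type(pattern, type_map, ask_author)
        if element_type == paragraph_type:
            # Plain paragraph text goes into .text or the previous child's .tail.
            if last_child is None:
                para.text = (para.text or "") + text
            else:
                last_child.tail = (last_child.tail or "") + text
        else:
            last_child = ET.SubElement(para, element_type)
            last_child.text = text
    return para


# Preconfiguration by the publisher (the italic pattern is left out on purpose).
publisher_map = {
    ("*", "*", "regular", "*"): "Body",
    ("*", "*", "bold", "*"): "FirstTerm",
    ("*", "*", "bold", "roman"): "FirstTerm",
}

runs = [
    (("*", "*", "regular", "*"), "Is "),
    (("*", "*", "bold", "*"), "this"),
    (("*", "*", "regular", "*"), " a "),
    (("*", "*", "regular", "italic"), "dagger"),
    (("*", "*", "regular", "*"), "?"),
]

# Stand-in for asking the author interactively.
xml_text = ET.tostring(
    to_xml(runs, publisher_map, ask_author=lambda pattern: "Emphasis"),
    encoding="unicode",
)
print(xml_text)
```

After the run, the map also contains the italic pattern, so the author is asked only once per unknown pattern.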
11
Further useable information
Allowed context from the DTD
Paragraph standard format
Text patterns
» Bullets
» Enumeration
» Whitespace
» ASCII Markup ( Is *this* a dagger? )
Format pattern match confidence
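The text patterns listed above could be detected with a few regular expressions. The sketch below is illustrative only: the concrete expressions, the hint names, and the `text_hints` function are assumptions, and the slide does not say how such hints would be weighed against the typographic evidence.

```python
import re

# Hypothetical patterns for three of the text cues named on the slide.
BULLET = re.compile(r"^\s*[-*•]\s+")
ENUMERATION = re.compile(r"^\s*\d+[.)]\s+")
ASCII_EMPHASIS = re.compile(r"\*(\S(?:[^*]*\S)?)\*")


def text_hints(paragraph):
    """Return the list of text-pattern hints found in one paragraph."""
    hints = []
    if BULLET.match(paragraph):
        hints.append("bullet")
    elif ENUMERATION.match(paragraph):
        hints.append("enumeration")
    if ASCII_EMPHASIS.search(paragraph):
        hints.append("ascii-emphasis")
    return hints


emphasized = ASCII_EMPHASIS.findall("Is *this* a dagger?")
print(emphasized)
print(text_hints("Is *this* a dagger?"))
print(text_hints("* first item"))
print(text_hints("1. Introduction"))
```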
12
Motivational aspects
Quick feedback on formal correctness
Publication preview while keeping format freedom
(Via XSL) flexible previews of other formats
New structure-based functionality:
» Structure editing
» Structure evaluation
» Document templates
13
Conclusion
Summary
» 4-step inference
   Record format tuples
   Locate the elements
   Reduce tuples to patterns
   Determine types
» Increase efficiency of publication chain
» Provide unobtrusive structuring for non-expert authors
Plans
» Cautious extension of inference
» Validation of document
» Evaluation with authors