Rheinisch-Westfälische Technische Hochschule Aachen, Department of Computer Science III, Prof. Dr.-Ing. M. Nagl, September 14, 2000 / Digital Documents.


1 Inferring Structure Information from Typography
Rheinisch-Westfälische Technische Hochschule Aachen, Department of Computer Science III, Prof. Dr.-Ing. M. Nagl
September 14, 2000 / Digital Documents and Electronic Publishing 2000
Christian Fuß, Dipl.-Inform. Felix Gatzemeier, Michael Kirchhof, Dipl.-Inform. Oliver Meyer
Department of Computer Science III, RWTH Aachen

2 Overview
(Slide footer: Department of Computer Science III, RWTH Aachen; DDEP 2000; Inferring Structure Information from Typography)
• Context
• Deriving Structure Information:
  » Partitioning
  » Typographic abstraction
  » Determine Type
• Conclusion
• Cooperation project of
• Prototype aTool in the WEP group of the Global-Info Project (www.global-info.org)

3 Today’s Publication Chain
[Diagram. Roles: Author, Publisher, Web Publ., Reader. Activities: Writing, Copy Editing, Typesetting, Conversion, Reading. Document formats: proprietary document format, standard format.]

4 Classification of Submissions
[Diagram. Submission formats: TeX, MS Word, Structured (XML). Formatting classes shown: Unformatted, Formatted, Correctly Formatted, Somehow Formatted.]

5 Basic Assumptions
• Textual Nature
• Typographic markup
• Consistent markup
• Known target document type

6 Deriving Structure Information
In: MS Word document
• Record Formatting (Format Tuples)
• Locate the Elements
• Reduce Format Tuples to Patterns
• Determine Types
Out: XML document
Also interactively

7 Format Tuples
• The basic typographic abstraction
• FormatTuple("Is this a dagger?") = [Times, 22pt, regular, roman]
• Here: Font, Size, Weight, Variation
• Planned: Search expressions modulo Text
• More general: Including regular expressions of text content or context.
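
The format tuple above could be modelled roughly as follows. This is a minimal sketch, not the aTool implementation: the `FormatTuple` field names, the dictionary layout of a text run, and the `format_tuple` helper are assumptions made for illustration only.

```python
from typing import NamedTuple


class FormatTuple(NamedTuple):
    """The four properties named on the slide: font, size, weight, variation."""
    font: str
    size: str
    weight: str      # "regular" or "bold"
    variation: str   # "roman" or "italic"

    def __str__(self) -> str:
        return f"[{self.font}, {self.size}, {self.weight}, {self.variation}]"


def format_tuple(run: dict) -> FormatTuple:
    """Derive a format tuple from a text run.

    The run is a hypothetical dictionary with the keys "font", "size_pt",
    and optional boolean flags "bold" and "italic"; a real word-processor
    API would expose these properties differently.
    """
    return FormatTuple(
        font=run["font"],
        size=f'{run["size_pt"]}pt',
        weight="bold" if run.get("bold") else "regular",
        variation="italic" if run.get("italic") else "roman",
    )


run = {"text": "Is this a dagger?", "font": "Times", "size_pt": 22}
print(format_tuple(run))
```

Using an immutable, hashable tuple means format tuples can later serve directly as dictionary keys in a format-to-type map.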

8 Locate the Elements
• Tree-Partitioning of Formatted Character Streams on
  » Format Tuple changes
  » Paragraph breaks
• Nesting of Inline Elements
  [Step-by-step example: "Is this a dagger?" is successively wrapped in elements named after format tuples such as ft1; the markup of the individual steps was lost in extraction.]
• Format-To-Type Map:
  FormatTuple                       ElementType
  ft1(times, 22pt, reg, roman)      dummyType1
  ft2(times, 22pt, bold, roman)     dummyType2
  ft3(times, 22pt, reg, italic)     dummyType3
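
The partitioning step above can be sketched as follows. This is a simplified illustration under stated assumptions: the character stream is a list of (character, format tuple) pairs, a paragraph break is encoded as a newline character, and the result is a two-level structure (paragraphs containing runs). The nesting of inline elements shown on the slide is not reproduced, and `partition` and `styled` are hypothetical names.

```python
from itertools import groupby
from typing import List, NamedTuple, Tuple


class FormatTuple(NamedTuple):
    font: str
    size: str
    weight: str
    variation: str


Run = Tuple[FormatTuple, str]   # (format tuple, text sharing that format)
Paragraph = List[Run]


def partition(stream: List[Tuple[str, FormatTuple]]) -> List[Paragraph]:
    """Split a formatted character stream on paragraph breaks and,
    within each paragraph, on format tuple changes."""
    paragraphs: List[Paragraph] = []
    # Outer split: consecutive newlines separate paragraphs.
    for is_break, chunk in groupby(stream, key=lambda item: item[0] == "\n"):
        if is_break:
            continue
        # Inner split: a new run starts whenever the format tuple changes.
        runs = [
            (fmt, "".join(ch for ch, _ in chars))
            for fmt, chars in groupby(chunk, key=lambda item: item[1])
        ]
        paragraphs.append(runs)
    return paragraphs


def styled(text: str, fmt: FormatTuple) -> List[Tuple[str, FormatTuple]]:
    """Helper for building test streams: attach one format to each character."""
    return [(ch, fmt) for ch in text]


REG = FormatTuple("times", "22pt", "reg", "roman")
BOLD = FormatTuple("times", "22pt", "bold", "roman")

stream = (styled("Is ", REG) + styled("this", BOLD) + styled(" a dagger?", REG)
          + styled("\n", REG) + styled("Yes.", REG))
for paragraph in partition(stream):
    print(paragraph)
```

Each distinct format tuple found this way would receive a placeholder entry (dummyType1, dummyType2, ...) in the Format-To-Type Map.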

9 Format patterns
• Identity too restrictive ⇒ wildcard generalization

  Is         this       a dagger?    ⇒ δ(Is, this, a dagger?)
  Times      Times      Times        *
  22pt       22pt       22pt         *
  regular    bold       regular      bold
  roman      roman      roman        *

• δ(ε, a, b) = δ(a, a, b); δ(a, b, ε) = δ(a, b, b)
• δ(ε, a, ε) propagated to paragraph level
• Format-To-Type Map:
  FormatPattern                  ElementType
  fp1(*, *, regular, *)          dummyType1
  fp2(*, *, bold, *)             dummyType2
  fp2b(*, *, bold, roman)        dummyType2
  fp3(*, *, regular, italic)     dummyType3
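
One reading of the wildcard generalization above is sketched below. The rule used here is an assumption inferred from the slide's single example, not a confirmed definition: a component is kept when the element differs from at least one neighbour in that component, and it becomes a wildcard when it agrees with both. A missing neighbour is replaced by the element itself, mirroring the two boundary equations; the case without any neighbour returns `None` so that a caller can handle it at paragraph level. The name `generalize` is hypothetical.

```python
from typing import Optional, Tuple

FormatTuple = Tuple[str, str, str, str]   # (font, size, weight, variation)
WILDCARD = "*"


def generalize(left: Optional[FormatTuple],
               element: FormatTuple,
               right: Optional[FormatTuple]) -> Optional[FormatTuple]:
    """Reduce an element's format tuple to a pattern relative to its neighbours.

    None stands for a missing neighbour at a paragraph boundary.
    """
    if left is None and right is None:
        # No context inside the paragraph: defer to the paragraph level.
        return None
    # Boundary rules: a missing neighbour is replaced by the element itself.
    left = element if left is None else left
    right = element if right is None else right
    return tuple(
        value if (value != l or value != r) else WILDCARD
        for l, value, r in zip(left, element, right)
    )


REG = ("Times", "22pt", "regular", "roman")
BOLD = ("Times", "22pt", "bold", "roman")

# "Is" (regular), "this" (bold), "a dagger?" (regular), as on the slide.
print(generalize(REG, BOLD, REG))    # ('*', '*', 'bold', '*')
print(generalize(None, REG, BOLD))   # ('*', '*', 'regular', '*')
```

Under this reading, the pattern records only what sets an element apart from its surroundings, so the same emphasis convention is recognized regardless of the base font or size.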

10 Determine Types
• Replace dummy types in Format-To-Type Map
• Preconfiguration by publisher
• Controlled Learning from the author

  FormatPattern              ElementType
  (*, *, regular, *)         Body
  (*, *, bold, *)            FirstTerm
  (*, *, bold, roman)        FirstTerm
  (*, *, regular, italic)    Emphasis
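
Replacing the dummy types could look roughly like the sketch below. The split between a publisher preconfiguration and a callback that asks the author is an assumption about how the two sources on the slide interact; `determine_types`, `PUBLISHER_CONFIG`, and `ask_author` are hypothetical names, and which patterns are preconfigured is chosen arbitrarily for the example.

```python
from typing import Callable, Dict, Tuple

FormatPattern = Tuple[str, str, str, str]

# Map as produced by the pattern reduction step, still with placeholder types.
FORMAT_TO_TYPE: Dict[FormatPattern, str] = {
    ("*", "*", "regular", "*"): "dummyType1",
    ("*", "*", "bold", "*"): "dummyType2",
    ("*", "*", "bold", "roman"): "dummyType2",
    ("*", "*", "regular", "italic"): "dummyType3",
}

# Assumed publisher preconfiguration: patterns with a known element type.
PUBLISHER_CONFIG: Dict[FormatPattern, str] = {
    ("*", "*", "regular", "*"): "Body",
    ("*", "*", "bold", "*"): "FirstTerm",
    ("*", "*", "bold", "roman"): "FirstTerm",
}


def determine_types(
    format_to_type: Dict[FormatPattern, str],
    preconfigured: Dict[FormatPattern, str],
    ask_author: Callable[[FormatPattern], str],
) -> Dict[FormatPattern, str]:
    """Replace each dummy type with a real element type.

    Preconfigured patterns are resolved silently; for every other pattern
    the author is asked once through the ask_author callback.
    """
    resolved: Dict[FormatPattern, str] = {}
    for pattern in format_to_type:
        if pattern in preconfigured:
            resolved[pattern] = preconfigured[pattern]
        else:
            resolved[pattern] = ask_author(pattern)
    return resolved


# A scripted "author" stands in for an interactive dialog.
resolved = determine_types(FORMAT_TO_TYPE, PUBLISHER_CONFIG,
                           ask_author=lambda pattern: "Emphasis")
for pattern, element_type in resolved.items():
    print(pattern, element_type)
```

Passing the author interaction in as a callback keeps the mapping logic testable without a user interface.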

11 Further useable information
• Allowed context from the DTD
• Paragraph standard format
• Text patterns
  » Bullets
  » Enumeration
  » Whitespace
  » ASCII Markup (Is *this* a dagger?)
• Format pattern match confidence
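
The text patterns listed above lend themselves to simple regular expressions. The sketch below is illustrative only: the concrete expressions, the hint names, and the `text_hints` function are assumptions and are not taken from the presented prototype.

```python
import re
from typing import List

# Assumed conventions: "-", "*" or "•" as bullet characters; "1.", "2)" or
# "a)" as enumeration labels; *word* as ASCII emphasis markup.
BULLET = re.compile(r"^\s*[-*•]\s+")
ENUMERATION = re.compile(r"^\s*(\d+|[a-zA-Z])[.)]\s+")
INDENTATION = re.compile(r"^[ \t]+")
ASCII_EMPHASIS = re.compile(r"\*(\S(?:[^*]*\S)?)\*")


def text_hints(paragraph: str) -> List[str]:
    """Collect structure hints from the plain text of one paragraph."""
    hints = []
    if INDENTATION.match(paragraph):
        hints.append("indented")
    if BULLET.match(paragraph):
        hints.append("bullet")
    elif ENUMERATION.match(paragraph):
        hints.append("enumeration")
    if ASCII_EMPHASIS.search(paragraph):
        hints.append("ascii-emphasis")
    return hints


print(text_hints("Is *this* a dagger?"))
print(ASCII_EMPHASIS.findall("Is *this* a dagger?"))
```

Such hints would complement the typographic patterns rather than replace them, for example by raising or lowering the confidence of a format pattern match.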

12 Motivational aspects
• Quick feedback on formal correctness
• Publication preview while keeping format freedom
• (Via XSL) flexible previews of other formats
• New structure-based functionality:
  » Structure editing
  » Structure evaluation
  » Document templates

13 Conclusion
• Summary
  » 4-step inference
    - Record format tuples
    - Locate the elements
    - Reduce tuples to patterns
    - Determine types
  » Increase efficiency of publication chain
  » Provide unobtrusive structuring for non-expert authors
• Plans
  » Cautious extension of inference
  » Validation of document
  » Evaluation with authors


