ITEC810 Final Report Inferring Document Structure Wieyen Lin/41348133 Supervised by Jette Viethen.


1 ITEC810 Final Report Inferring Document Structure Wieyen Lin/41348133 Supervised by Jette Viethen

2 Outline. Part A: Introduction, Related Work. Part B: Material, Methodology. Part C: Implementation, Conclusion.

3 Part A: Introduction

4 Introduction

5 Introduction (cont’d). Research objective: analyze a document image and detect its logical structure with annotated labels. Project scope: focus on academic articles. Source corpus: Association for Computational Linguistics (ACL) Anthology Corpus.

6 Related Work. Physical layout analysis: top-down methods, bottom-up methods. Logical structure analysis: syntactic methods, rule-based methods.

7 Part B: Methodology

8 Material: XML Source by Text. An example of an input file for the project.

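9 Methodology. Phase I: Aggregation of Homogeneous Blocks (physical structure). Step 1a: grouping texts into lines, which turns the XML source by text into an XML source by line. Step 1b: aggregating lines into blocks, which takes the XML source by line and yields the physical structure.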
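Step 1a can be illustrated with a minimal sketch. It assumes each text fragment from the XML source carries a position and a font size; the `Fragment` fields, the `group_into_lines` name, and the `y_tolerance` parameter are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with its position (hypothetical fields)."""
    text: str
    x: float          # left edge of the fragment
    y: float          # top edge of the fragment
    font_size: float


def group_into_lines(fragments, y_tolerance=2.0):
    """Group fragments whose vertical positions are close enough
    to be treated as the same line of text."""
    lines = []
    # Visit fragments from top to bottom, then left to right.
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # Compare against the first fragment of the current line.
        if lines and abs(lines[-1][0].y - frag.y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Restore reading order inside each line.
    return [sorted(line, key=lambda f: f.x) for line in lines]
```

The tolerance absorbs small baseline differences between fragments on the same line; a suitable value would depend on the corpus.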

10 Methodology (cont’d). Phase II: Detection of Logical Structure. Step 1b (aggregating lines into blocks) produces the XML source by block. Step 2: annotating each block with a logical label, which yields the logical structure.
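Step 2 could take the shape of a small rule-based labeler. The slides do not list the actual rules, so the patterns, label names, and the `label_block` function below are purely illustrative assumptions.

```python
import re


def label_block(text):
    """Assign a logical label to a block of text using
    illustrative rules (not the rules used in the project)."""
    stripped = text.strip()
    # A block consisting only of the word "Abstract".
    if stripped.lower() == "abstract":
        return "abstract_heading"
    # A block that contains nothing but digits.
    if re.fullmatch(r"\d+", stripped):
        return "page_number"
    # A short block starting with a section number such as "1" or "2.1".
    if re.match(r"\d+(\.\d+)*\.?\s+[A-Z]", stripped) and len(stripped.split()) <= 8:
        return "section_heading"
    return "paragraph"
```

Rules like these look only at the text of a block; font size and position on the page would be natural additional signals.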

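11 Methodology (cont’d). Algorithm for aggregating blocks in Phase II. Read in 3 lines at a time and check the dominant font size. Where the three lines do not all share one dominant font size, the flowchart distinguishes the patterns AAB, ABB, A1 B A2, and ABC. Where the three lines A1, A2, A3 share the same dominant font size, check the spacing: if s1 = s2, all three lines belong to the same block; if s1 > s2, A1 is separated from A2 and A3; in the remaining case, A1 and A2 are kept together and A3 is separated.
Legend. A, B, C: lines of text with different dominant font sizes. A1, A2: lines of text with the same dominant font size. s1: spacing between A1 and A2. s2: spacing between A2 and A3. Boxed A: lines that belong to the same block.
[Flowchart: Algorithm for aggregating blocks]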
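The three-line comparison of font size and spacing can be sketched as below. This is one possible reading of the flowchart, not the project's implementation: the `Line` fields, the overlapping window, and the `tol` tolerance are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """A line of text with layout data (hypothetical fields)."""
    text: str
    font_size: float  # dominant font size of the line
    top: float        # y-coordinate of the top of the line
    bottom: float     # y-coordinate of the bottom of the line


def aggregate_blocks(lines, tol=0.5):
    """Split lines into blocks using dominant font size and spacing."""
    n = len(lines)
    if n == 0:
        return []
    # breaks[i] is True when a block boundary lies between
    # lines[i] and lines[i + 1]; a font size change always breaks.
    breaks = [lines[i].font_size != lines[i + 1].font_size
              for i in range(n - 1)]
    # Slide a window of three lines over the page.
    for i in range(n - 2):
        a1, a2, a3 = lines[i:i + 3]
        if a1.font_size == a2.font_size == a3.font_size:
            s1 = a2.top - a1.bottom   # spacing between A1 and A2
            s2 = a3.top - a2.bottom   # spacing between A2 and A3
            if s1 > s2 + tol:
                breaks[i] = True      # A1 | A2 A3
            elif s2 > s1 + tol:
                breaks[i + 1] = True  # A1 A2 | A3
    # Build the blocks from the boundary flags.
    blocks = [[lines[0]]]
    for i in range(1, n):
        if breaks[i - 1]:
            blocks.append([lines[i]])
        else:
            blocks[-1].append(lines[i])
    return blocks
```

Treating spacings within `tol` of each other as equal keeps small measurement differences from splitting a paragraph.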

12 Part C: Outcomes

13 Current Outcome. Original PDF document and physical layout outcome in HTML.

14 Current Outcome (cont’d). Logical structure outcome in HTML.

15 Implementation: Class Diagram

16 Implementation: User Interfaces

17 Conclusion: Information Evaluation. Summary of detection results out of 40 randomly selected documents.

Error type | Errors found | Accuracy of detection
Incorrect title or missing title | 1 | 97.5% (39/40)
Incorrect Abstract heading or missing Abstract heading | 4 | 90.0% (36/40)
Incorrect Abstract or missing Abstract | 4 | 90.0% (36/40)
Incorrect affiliation(s) or missing affiliation(s) | 11 | 72.5% (29/40)
Missing >50% of page number(s) or erroneous page number(s) found | 15 | 62.5% (25/40)
Missing >50% of section heading(s) or erroneous section heading(s) found | 11 | 72.5% (29/40)
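The accuracy figures follow directly from the error counts: each is the share of the 40 documents without that error. A short check, with the function and dictionary names chosen only for illustration:

```python
TOTAL_DOCS = 40  # number of randomly selected documents

# Error counts as reported in the summary of detection results.
errors_found = {
    "Title": 1,
    "Abstract heading": 4,
    "Abstract": 4,
    "Affiliation(s)": 11,
    "Page number(s)": 15,
    "Section heading(s)": 11,
}


def accuracy(errors, total=TOTAL_DOCS):
    """Percentage of documents in which the element was detected correctly."""
    return 100.0 * (total - errors) / total


for label, count in errors_found.items():
    correct = TOTAL_DOCS - count
    print(f"{label}: {accuracy(count):.1f}% ({correct}/{TOTAL_DOCS})")
```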

18 Conclusion: Future Work. Improving algorithms: aggregation of homogeneous blocks; detection of Abstract heading, section headings, and paragraphs. Removing noise: incomplete table contents; incomplete mathematical formulas.

