Logical Structure Recovery in Scholarly Articles with Rich Document Features Minh-Thang Luong, Thuy Dung Nguyen and Min-Yen Kan.

1 Logical Structure Recovery in Scholarly Articles with Rich Document Features Minh-Thang Luong, Thuy Dung Nguyen and Min-Yen Kan

2 Logical structure annotation in ForeCiteReader. The view shows the object navigation interface, currently focusing on the list of figure captions. 10/27/2015

3 Section navigation in the ForeCiteReader environment with generic sections

4 Overview
– Methodology
  – Problem Formulation
  – Learning Model: CRF
  – Approach overview
  – Classification categories
– Raw-text features
– Rich document representation
– Experiments
– Further analysis

5 Problem Formulation
Two related subtasks:
– Logical structure (LS) classification: treat a scholarly document as an ordered collection of text lines and label each text line with a semantic category, e.g. title, author, address, etc.
– Generic section (GS) classification: take the header of each section of text in a paper and deduce the generic logical purpose of the section.
→ Both are sequence labeling tasks, addressed with CRFs.

6 Learning Model: CRF
CRF in simplified form:
P(y|x) = (1/Z(x)) · exp( Σ_j Σ_k λ_k f_k(y_{j-1}, y_j, x, j) )
where each f_k is a binary feature function; the f's cover both state functions (current label and observation) and transition functions (adjacent labels).
We utilize the CRF++ package (http://crfpp.sourceforge.net/). The input for line l_i to CRF++ is of the form "value_1 … value_m category_i".
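The CRF++ row format described above can be sketched in a few lines. This is a minimal illustration; the helper name and the feature values shown are assumptions, not the actual system's feature set:

```python
def crfpp_line(feature_values, category=None):
    """Format one text line as a CRF++ row: whitespace-separated
    feature values, with the gold label as the final column at
    training time (omitted at test time)."""
    cols = list(feature_values)
    if category is not None:
        cols.append(category)
    return "\t".join(cols)

# Hypothetical feature values for two consecutive lines of a paper.
rows = [
    crfpp_line(["Loc_0", "Align_center", "FontSize_largest", "Bold_yes"], "title"),
    crfpp_line(["Loc_0", "Align_center", "FontSize_base", "Bold_no"], "author"),
]
print("\n".join(rows))
```

At test time the label column is left off and CRF++ predicts it from the feature columns plus the learned transition weights.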

7 Approach overview

8 Classification categories - example

9 Classification categories – full sets
Logical structure subtask, 23 categories: address, affiliation, author, bodyText, categories, construct, copyright, email, equation, figure, figureCaption, footnote, keywords, listItem, note, page, reference, sectionHeader, subsectionHeader, subsubsectionHeader, table, tableCaption, and title.
Generic section subtask, 13 categories: abstract, categories, general terms, keywords, introduction, background, relatedWork, methodology, evaluation, discussions, conclusions, acknowledgments, and references.

10 Raw-text features – LS
ParsCit token-level features + our line-level features:
– Location: relative position within the document
– Number: patterns of subsections, subsubsections, categories, footnotes
– Punctuation: patterns of emails & web links; bracket numbering → equation
– Length: 1token, 2token, 3token, 4token, 5+token → identify the majority of lines as bodyText
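The line-level features above can be sketched as simple pattern checks. The regexes, bucket counts, and feature names here are our own illustrative choices, not the paper's exact feature definitions:

```python
import re

def line_features(line, index, total_lines):
    """Illustrative line-level features in the spirit of the slide:
    location, numbering, punctuation, and length."""
    feats = []
    # Location: relative position within the document, quantized to eighths.
    feats.append("Loc_%d" % (8 * index // max(total_lines, 1)))
    # Number: subsection-style ("2.1") and subsubsection-style ("2.1.3") numbering.
    if re.match(r"^\d+\.\d+\.\d+\s", line):
        feats.append("Num_subsubsection")
    elif re.match(r"^\d+\.\d+\s", line):
        feats.append("Num_subsection")
    # Punctuation: emails, web links, trailing bracket numbering (equations).
    if re.search(r"\S+@\S+\.\S+", line):
        feats.append("Punct_email")
    if re.search(r"https?://\S+", line):
        feats.append("Punct_weblink")
    if re.search(r"\(\d+\)\s*$", line):
        feats.append("Punct_eqnumber")
    # Length: 1token .. 4token buckets, then a catch-all 5+token bucket.
    n = len(line.split())
    feats.append("Len_%dtoken" % n if n < 5 else "Len_5+token")
    return feats

print(line_features("2.1 Learning Model (3)", 10, 80))
```

Each returned feature becomes one column of the CRF++ input row for that line.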

11 Raw-text features – GS
Naïve, yet effective features:
– Positions
– First and Second Words
– Whole Header
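These header features are simple enough to sketch directly. The quantization granularity and feature names below are assumptions for illustration:

```python
def header_features(header, index, num_headers):
    """Illustrative GS features: the header's relative position in the
    paper, its first and second words, and the whole header string."""
    tokens = header.lower().split()
    # Position: relative position among all section headers, quantized.
    feats = ["Pos_%d" % (4 * index // max(num_headers, 1))]
    # First and second words of the header.
    if tokens:
        feats.append("First_" + tokens[0])
    if len(tokens) > 1:
        feats.append("Second_" + tokens[1])
    # The whole header, joined into a single token.
    feats.append("Whole_" + "-".join(tokens))
    return feats

print(header_features("Related Work", 1, 6))
```

Despite their simplicity, positions and leading words carry most of the signal for mapping headers such as "Related Work" to the relatedWork generic section.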

12 Rich document representation – OCR output
Linearize the XML output into CRF features, e.g.:
"Don't-Look-Now,-But-We've-Created-a-Bureaucracy. Loc_0 Align_left FontSize_largest Bold_yes Italic_no Picture_no Table_no Bullet_no"
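The linearization step can be sketched as reading one OCR line element and emitting the hyphen-joined text plus feature tokens. The XML element and attribute names here are assumptions (the actual OCR engine's schema differs), and only a subset of the slide's features is shown:

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR XML for one text line of a paper.
xml = """<line align="left" fontsize="largest" bold="yes" italic="no">
Don't Look Now, But We've Created a Bureaucracy.
</line>"""

def linearize(elem, loc):
    """Turn one OCR line element into hyphen-joined text plus
    CRF feature tokens, as in the slide's example."""
    text = "-".join(elem.text.split())
    feats = [
        "Loc_%d" % loc,
        "Align_" + elem.get("align"),
        "FontSize_" + elem.get("fontsize"),
        "Bold_" + elem.get("bold"),
        "Italic_" + elem.get("italic"),
    ]
    return text + " " + " ".join(feats)

print(linearize(ET.fromstring(xml), 0))
```

Joining the words with hyphens keeps the line's text as a single CRF++ column, so the remaining columns stay aligned with the feature template.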

13 Rich document representation – OCR features
Position
– Alignment: left, center, right & justified
– Location: within-page location
Format
– FontSize: quantized based on frequency relative to the most common (base) size, e.g. smaller, base, larger, largest
– Bold
– Italic
Object
– Bullet
– Picture
– Table
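The frequency-based font-size quantization might be sketched as follows; the exact bucket names and the number of levels are our assumption, with only the idea (most frequent size = base, others labeled relative to it) taken from the slide:

```python
from collections import Counter

def quantize_font_sizes(sizes):
    """Quantize raw font sizes relative to the most frequent (base)
    size, yielding discrete labels like FontSize_base / FontSize_largest."""
    base = Counter(sizes).most_common(1)[0][0]  # most frequent size
    ranked = sorted(set(sizes))
    labels = {}
    for s in ranked:
        if s == base:
            labels[s] = "FontSize_base"
        elif s < base:
            # Nearest size below base is "smaller"; anything below that, "smallest".
            nearest_below = max(x for x in ranked if x < base)
            labels[s] = "FontSize_smaller" if s == nearest_below else "FontSize_smallest"
        else:
            labels[s] = "FontSize_largest" if s == max(ranked) else "FontSize_larger"
    return [labels[s] for s in sizes]

print(quantize_font_sizes([10, 10, 10, 12, 14, 9]))
```

Quantizing relative to the dominant body-text size makes the feature robust across papers with different absolute font sizes.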

14 Experiments – datasets
– LS: 20 ACM, 10 CHI 2008, and 10 ACL 2009 papers – fully labeled
– GS: 211 ACM papers – headers labeled
Note: the data is skewed across categories.

15 Experiments – metrics
TP: # correctly classified text lines (true positives); similarly FP, FN, and TN for false positives, false negatives, and true negatives.
Category-specific performance:
– Precision P = TP/(TP+FP), Recall R = TP/(TP+FN), F1 = 2 × P × R / (P + R)
Overall performance:
– Macro average: average of all category-specific F1 scores
– Micro average: percentage of correctly labeled lines
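The metric definitions above can be written out directly. A minimal sketch, with hypothetical gold/predicted label sequences:

```python
def evaluate(gold, pred):
    """Per-category precision/recall/F1 plus macro and micro averages,
    following the slide's definitions."""
    cats = sorted(set(gold))
    f1s = {}
    for c in cats:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(f1s.values()) / len(f1s)          # average of per-category F1
    micro = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)  # line accuracy
    return f1s, macro, micro

gold = ["bodyText", "bodyText", "title", "author"]
pred = ["bodyText", "title", "title", "author"]
f1s, macro, micro = evaluate(gold, pred)
print(macro, micro)
```

Note why both averages matter on skewed data: the micro average is dominated by frequent categories like bodyText, while the macro average weights rare categories such as footnote equally.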

16 Experiments – LS results
– LS_PC: baseline using only ParsCit features
– LS_PC+RT: LS_PC + raw-text features
– LS_PC+RT+RD: LS_PC+RT + rich document features (OCR)
Findings:
– LS_PC+RT and LS_PC+RT+RD outperform LS_PC by more than 10 F1 points
– LS_PC+RT+RD < LS_PC+RT: minor degradation for four categories
– LS_PC+RT+RD > LS_PC+RT: all other categories (many by > 4 F1 points)
– Large improvements for footnote and subsubsectionHeader

17 Experiments – GS results
– GS_maxent: maximum entropy based system (Nguyen and Kan, 2007)
– GS_CRF: our system
– GS_CRF > GS_maxent in all categories except background
– Large improvements for discussions

18 Further analysis – text features
All feature groups contribute to the final composite performance; the most influential is position.

19 Further analysis – rich document features
– Format contributes most to the macro average, while Object influences the micro average most.
– Format features help a wider spectrum of categories: paper metadata & section headers.
– Object features enhance fewer categories, but ones with a large amount of training data, e.g. listItem, table.

20 Further analysis – rich document features
– Most features improve both metrics, except Align & Table: a trade-off between the macro and micro averages.
– Location, Font, and Bullet are the most effective features within the position, format, and object groups, respectively.

21 Error analysis - LS

22 Error analysis – GS
– Whole header: headers with no tokens overlapping any memoized training instance → need to use the body text instead (future work)
– Similar relative positions of consecutive headers: background vs. method, method vs. discussions, & discussions vs. conclusions
– Dataset skew also has an impact: a large number of method headers, but far fewer background and discussions → many headers are mislabelled as method


24 Q & A Thank you!

