Improved Bibliographic Reference Parsing Based on Repeated Patterns Guido Sautter, Klemens Böhm ViBRANT Virtual Biodiversity.


1 Improved Bibliographic Reference Parsing Based on Repeated Patterns Guido Sautter, Klemens Böhm ViBRANT Virtual Biodiversity

2 Guido Sautter KIT Improved Bibliographic Reference Parsing Based on Repeated Patterns
Bibliographic References - Parsing
Why parse bibliographic references?
– Generation of BibTeX records, etc.
– Rendering in different styles
– Reconciliation
– …
→ Absolute necessity when compiling large bibliographies
Example reference: Thor, A.U., Cond, S.E. 2012. The article title. The Journal 7: 8-15
Fields labelled in the example: Author, Year, Title, Journal, Volume, Pagination
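
To illustrate the first use case on the slide above, here is a minimal sketch of turning already-parsed fields into a BibTeX record. The function name `to_bibtex`, the citation key `thor2012`, and the choice of the `@article` entry type are assumptions made for this example; the slides do not describe how records are generated.

```python
def to_bibtex(key, fields):
    """Render a dict of parsed fields as a BibTeX @article record (illustrative only)."""
    lines = [f"@article{{{key},"]
    for name, value in fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)

# Fields of the example reference from the slide, as a parser might return them.
parsed = {
    "author": "Thor, A.U. and Cond, S.E.",
    "year": "2012",
    "title": "The article title",
    "journal": "The Journal",
    "volume": "7",
    "pages": "8--15",
}
record = to_bibtex("thor2012", parsed)
print(record)
```

Once the fields are separated, rendering in other styles or reconciling duplicate entries works on the same dictionary rather than on the raw string.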

3 Bibliographic References - Examples
→ Diversity with regard to
– Reference style (order of fields, intermediate punctuation)
– Type of referenced work
Thor, AU, SE Cond (2012) The article title. The Journal 7: 8-15
Thor, AU, Cond, SE. The article title, The Journal 7 (2012): 8-15
Thor, A.U. 2012. The paper title. Proc. ICST 2012, Location.
Thor, AU, Cond, SE. 2012. The chapter title. In: Itor, ED (Ed.) The book title. Location: Publisher: 8-15
Thor, A.U. 2012. The book title, Publisher, Location, 151 pp.
Thor, AU, SE Cond, 2012. The 3rd article title. In: Itor, ED (Ed.) The 1st special issue. The Journal 7: 8-15

4 Bibliographic References - Fields
Fields present in references to (almost) all types of works:
– Authors (can be given in different styles)
– Year of publication (four-digit Arabic number)
– Title
Fields present in references to specific types of works:
– Publisher and Location / Journal name
– Pagination ((mostly) Arabic number or number range)
– Volume / issue / numero number (Arabic number)
– Volume title / Proceedings title
– Editors (can be given in different styles)
– URL / DOI / ISBN / ISSN

5 Overview
Bibliographic References
Previous Parsing Approaches
The RefParse Algorithm
Evaluation
Summary & Outlook

6 Pattern Based Parsers
Principle:
– Patterns match individual field values
– Meta patterns arrange field patterns
– One meta pattern per reference style
Most prominent: ParaCite (now offline)
Strengths:
– Numerical fields
– Author names
Weaknesses:
– Meta patterns to be created for every single reference style
– Combinatorial explosion with alternatives for individual fields
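
The principle on the slide above can be sketched with regular expressions: small field patterns, and one meta pattern that arranges them for a single reference style. All pattern definitions and the name `STYLE_A` are assumptions made for this example; they are not taken from ParaCite or RefParse.

```python
import re

# Field patterns: each matches the value of one field (hypothetical, simplified).
AUTHORS = r"(?P<authors>[A-Z][a-z]+, (?:[A-Z]\.)+(?:, [A-Z][a-z]+, (?:[A-Z]\.)+)*)"
YEAR = r"(?P<year>(?:19|20)\d{2})"
TITLE = r"(?P<title>[^.]+)"
JOURNAL = r"(?P<journal>[^.:]+?)"
VOLUME = r"(?P<volume>\d+)"
PAGES = r"(?P<pages>\d+(?:-\d+)?)"

# Meta pattern: arranges the field patterns for exactly ONE reference style.
STYLE_A = re.compile(rf"^{AUTHORS} {YEAR}\. {TITLE}\. {JOURNAL} {VOLUME}: {PAGES}$")

ref = "Thor, A.U., Cond, S.E. 2012. The article title. The Journal 7: 8-15"
match = STYLE_A.match(ref)
print(match.groupdict())

# The same work cited in another style is not matched by this meta pattern.
other_style = "Thor, AU, SE Cond (2012) The article title. The Journal 7: 8-15"
print(STYLE_A.match(other_style))  # None
```

The second call shows the weakness named on the slide: every additional style needs its own meta pattern, and each alternative for a field multiplies the number of arrangements to cover.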

7 Learning Based Parsers
Learn statistical models from pre-parsed references
– Hidden Markov Models
– Conditional Random Fields
– Finite State Transducers
– etc.
Strengths:
– Can handle all cases covered in training set
– No handcrafting of rules or patterns
Weaknesses:
– Need for training data covering all cases
– Usually do not exploit morphology
– Incremental training hard

8 Knowledge Based Parsers
Divide references into blocks at punctuation marks
Classify blocks by comparing them to knowledge base
Examples: FLUX-CiM, INFOMAP
Strengths:
– No handcrafting of rules or patterns
– Learn domain specific journal names, etc. very well
Weaknesses:
– Need for representative training data covering domain
– Abbreviations interfere with blocking
– Problems with numerical fields
– Problems with highly variable fields like author names
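
A minimal sketch of the blocking step shows why abbreviations are a problem. The function `naive_blocks` and its choice of separator characters are assumptions for illustration; they are not the blocking rules of FLUX-CiM or INFOMAP.

```python
import re

def naive_blocks(reference):
    """Split a reference into blocks at '.', ',', ':' and ';' (deliberately naive)."""
    return [part.strip() for part in re.split(r"[.,:;]", reference) if part.strip()]

blocks = naive_blocks("Thor, A.U. 2012. The paper title. Proc. ICST 2012, Location.")
print(blocks)
# ['Thor', 'A', 'U', '2012', 'The paper title', 'Proc', 'ICST 2012', 'Location']
```

The dots in the initials "A.U." and in the abbreviation "Proc." produce fragments that no longer correspond to whole field values, so comparing them to a knowledge base becomes unreliable.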

9 Alignment Based Parsers
Morphologically classify words, numbers, punctuation marks
Interpret sequence of classes as gene sequence
Try to align this sequence with learned one
Strengths:
– No handcrafting of rules or patterns
– Learn reference styles
Weaknesses:
– Need for representative training data covering many cases
– Abbreviations interfere with alignment
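
The following sketch illustrates the idea of turning a reference into a sequence of morphological classes and comparing sequences. The class symbols (`C`, `A`, `l`, `N`, `Y`) are invented for this example, only ASCII letters are handled, and `difflib.SequenceMatcher` merely stands in for a real sequence alignment algorithm.

```python
import re
from difflib import SequenceMatcher

# Words, numbers, and single punctuation marks (ASCII only, for simplicity).
TOKEN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")

def classify(token):
    """Map a token to a morphological class symbol (hypothetical class set)."""
    if token.isdigit():
        return "Y" if len(token) == 4 else "N"  # four-digit vs. other number
    if token.isalpha():
        if token.isupper():
            return "A"  # all-caps word, e.g. initials
        return "C" if token[0].isupper() else "l"  # capitalized vs. lower case
    return token  # each punctuation mark is its own class

def class_sequence(reference):
    return [classify(token) for token in TOKEN.findall(reference)]

def similarity(ref_a, ref_b):
    """Similarity of two class sequences (stand-in for an alignment score)."""
    return SequenceMatcher(None, class_sequence(ref_a), class_sequence(ref_b)).ratio()

same_style_a = "Thor, AU. The article title. The Journal 7 (2012): 8-15"
same_style_b = "Cond, SE. Another article title. Another Journal 7 (2012): 8-15"
other_style = "Thor, AU (2012) The article title. The Journal 7: 8-15"

print(" ".join(class_sequence(same_style_a)))
print(similarity(same_style_a, same_style_b))
print(similarity(same_style_a, other_style))
```

Two references in the same style yield the same class sequence even though their words differ, while a reference in another style aligns less well.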

10 Overview
Bibliographic References
Previous Parsing Approaches
The RefParse Algorithm
Evaluation
Summary & Outlook

11 RefParse: The Idea
Observation of previous approaches:
– For each field, some approach is strong
– Reference styles need to be in training set or created manually
Observation when gathering data:
– References rarely come individually
– Paper bibliographies are a common source
→ Lists of references following the same style
Idea:
– Exploit structural redundancy given in reference lists
– Use individual approaches for fields they handle best

12 Exploiting Redundancy
Get field values that patterns identify reliably:
– Author names (all possible styles)
– Numerical elements (year, volume, etc., pagination)
– Ambiguous numbers become candidates for all they match
Generate all possible field arrangements
Compare field arrangements across reference list …
… and pick the one that fits the best
Align references against one another …
… to infer meta pattern at runtime

13 Reference Alignment - Example
Thor, AU. The article title. The Journal 1998 (1987): 1997
(numbers labelled: Volume? Year? Page?)
Cond, SE. Another article title. Another Journal 7 (2012): 8-15
(numbers labelled: Volume, Year, Pages)
Only alignment with second reference disambiguates numbers in first one
Exploiting redundancy overcomes inherent weaknesses of reference-by-reference parsers
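
The example above can be reproduced with a small sketch: every number becomes a candidate for each field it could be, all arrangements are generated per reference, and only arrangements valid across the whole list survive. The candidate rules, the function names, and the strict set intersection are assumptions for this illustration; the slides only say that RefParse picks the arrangement that fits best, and numbers inside titles are not handled here.

```python
import re
from itertools import permutations

NUMBER = re.compile(r"\d+(?:-\d+)?")
FIELDS = ("volume", "year", "pages")

def candidates(number):
    """Field types a number could represent (hypothetical rules)."""
    if "-" in number:
        return {"pages"}  # a range can only be a pagination
    types = {"volume", "pages"}
    if len(number) == 4 and 1500 <= int(number) <= 2100:
        types.add("year")  # four-digit numbers may also be years
    return types

def arrangements(reference):
    """All assignments of distinct field types to the numbers, in reading order."""
    numbers = NUMBER.findall(reference)
    return [
        perm
        for perm in permutations(FIELDS, len(numbers))
        if all(field in candidates(num) for field, num in zip(perm, numbers))
    ]

def shared_arrangements(references):
    """Arrangements that are valid for every reference in the list."""
    common = None
    for reference in references:
        valid = set(arrangements(reference))
        common = valid if common is None else common & valid
    return common

refs = [
    "Thor, AU. The article title. The Journal 1998 (1987): 1997",
    "Cond, SE. Another article title. Another Journal 7 (2012): 8-15",
]
print(len(arrangements(refs[0])))  # 6: every ordering is possible
print(arrangements(refs[1]))       # [('volume', 'year', 'pages')]
print(shared_arrangements(refs))   # {('volume', 'year', 'pages')}
```

On its own, the first reference admits all six orderings; the second reference narrows the list down to a single arrangement, which then also resolves the first.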

14 Reference Alignment Result
After alignment steps, RefParse has identified
– Author lists, including style
– Years of publication
– Pagination (where present)
– Volume / issue / numero numbers (where present)
– Reference style (order of fields, intermediate punctuation)
Processing pipeline (diagram):
Reference List
1. Base Element Extraction
2a. Author List Assembly
2b. Author List Selection
3. Reference Style Inference
4. Volume Reference Extraction
5. Periodical / Publisher Extraction
6. Title Extraction
Parsed References

15 Handling Volume References
Embedded references to books or journal volumes
– In principle, references on their own (save for year)
– Extract and handle in recursive step
Thor, AU, Cond, SE. 2012. The chapter title. In: Itor, ED (Ed.) The book title. Location: Publisher: 8-15
Thor, AU, SE Cond, 2012. The 3rd article title. In: Itor, ED (Ed.) The 1st special issue. The Journal 7: 8-15
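
A minimal sketch of separating an embedded volume reference, assuming it is introduced by the literal marker "In:" as in both examples above. The function name and the marker-based split are assumptions; the slides do not say how RefParse locates the embedded reference.

```python
def split_volume_reference(reference, marker=" In: "):
    """Split a reference into its own part and an embedded volume reference, if any."""
    head, found, embedded = reference.partition(marker)
    return head, (embedded if found else None)

chapter = ("Thor, AU, Cond, SE. 2012. The chapter title. "
           "In: Itor, ED (Ed.) The book title. Location: Publisher: 8-15")
article = "Thor, AU, SE Cond (2012) The article title. The Journal 7: 8-15"

head, embedded = split_volume_reference(chapter)
print(head)      # Thor, AU, Cond, SE. 2012. The chapter title.
print(embedded)  # Itor, ED (Ed.) The book title. Location: Publisher: 8-15
print(split_volume_reference(article)[1])  # None
```

The embedded part has editors, a title, and a publisher of its own but no year, so it could be passed through the same field extraction again in a recursive step.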

16 Journal / Publisher Extraction
Morphologically, names of journal and publisher very similar (word block in title case)
Sometimes heavily abbreviated (dots interfere with blocking)
– Recognize title case abbreviation blocks
– Handle parts in brackets / quotes as single blocks
Use patterns to find candidates (optionally, use lexicons)
Choose candidate closest to volume number / pagination
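
The selection rule on the slide above can be sketched as follows: find title-case word blocks, allowing abbreviation dots, and keep the one that ends closest before the volume number. The pattern `CANDIDATE` and the function name are assumptions for illustration; brackets, quotes, and lexicons are left out.

```python
import re

# Title-case words, each optionally ending in an abbreviation dot (hypothetical pattern).
CANDIDATE = re.compile(r"[A-Z][a-z]*\.?(?: [A-Z][a-z]*\.?)*")

def closest_candidate(reference, volume_start):
    """Return the title-case block ending closest before the volume number."""
    matches = list(CANDIDATE.finditer(reference, 0, volume_start))
    if not matches:
        return None
    return min(matches, key=lambda m: volume_start - m.end()).group()

full = "Thor, A.U. 2012. The article title. The Journal 7: 8-15"
abbreviated = "Thor, A.U. 2012. The article title. Proc. Natl. Acad. Sci. 7: 8-15"

print(closest_candidate(full, full.index("7:")))                # The Journal
print(closest_candidate(abbreviated, abbreviated.index("7:")))  # Proc. Natl. Acad. Sci.
```

Treating an abbreviated word with its dot as part of the block keeps "Proc. Natl. Acad. Sci." together, which plain blocking at punctuation would split apart.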

17 Title Extraction – Finally
Title most important field of reference …
… but also most variable one → pattern matching hard
Having identified all other fields, however …
… title is what remains in middle of reference
Circumvents matching or aligning title
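
Title extraction by elimination can be sketched like this, assuming the character spans of all other fields are already known. Taking the longest uncovered stretch and trimming punctuation is a simplification chosen for this example, not necessarily what RefParse does.

```python
def remaining_text(reference, field_spans):
    """Return the longest stretch of text not covered by any identified field."""
    gaps, pos = [], 0
    for start, end in sorted(field_spans):
        if start > pos:
            gaps.append((pos, start))
        pos = max(pos, end)
    if pos < len(reference):
        gaps.append((pos, len(reference)))
    if not gaps:
        return ""
    start, end = max(gaps, key=lambda gap: gap[1] - gap[0])
    return reference[start:end].strip(" .,:;")

ref = "Thor, A.U., Cond, S.E. 2012. The article title. The Journal 7: 8-15"

# Spans of the fields identified in earlier steps (looked up by text for brevity).
identified = ["Thor, A.U., Cond, S.E.", "2012", "The Journal", "7", "8-15"]
spans = [(ref.index(value), ref.index(value) + len(value)) for value in identified]

title = remaining_text(ref, spans)
print(title)  # The article title
```

No pattern ever has to describe what a title looks like, which is the point made on the slide.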

18 Overview
Bibliographic References
Previous Parsing Approaches
The RefParse Algorithm
Evaluation
Summary & Outlook

19 Experimental Setup
Corpora:
– Cora Corpus: 500 individual references
– Plazi Corpus: ~25,000 references from ~1,000 documents
Experiments:
– RefParse without training (empty lexicons)
– RefParse with training (50% / 50% data split)
– ParsCit (model based parser for comparison)
– FreeCite (model based parser for comparison)

20 Experiments with Cora Corpus
→ RefParse clearly outperforms related approaches
→ Interestingly, accuracy lower with training (in a minute)

                         RefParse-g      RefParse-l      ParsCit       FreeCite
Word / Token             91.5%           89.8%           83.0%         83.8%
Field:
- Author / Editor        98.6% / 74.6%   98.6% / 78.6%   95.7% / 0%
- Title                  79.0%           74.5%           91.0%
- Year of Publication    98.8%           99.1%           96.7%
- Pagination             97.7%           97.0%           88.9%         1.6%
- Part Designators       96.0%           89.2%           66.7%         96.0%
- Volume Title           38.8%           38.6%           46.3%         50%
- Journal / Publisher    68.0%           61.6%           53.1%         54.2%
Instance                 58.4%           52.1%           23.4%         12.2%
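
The slides do not define the "Word / Token" and "Instance" rows. A common reading, assumed here, is that token accuracy is the share of tokens assigned the correct field label, and instance accuracy is the share of references in which every token is labelled correctly. The sketch below uses made-up labels purely to show how the two measures relate.

```python
def token_accuracy(gold, predicted):
    """Share of tokens whose predicted field label equals the gold label."""
    total = sum(len(labels) for labels in gold)
    correct = sum(
        g == p
        for gold_labels, pred_labels in zip(gold, predicted)
        for g, p in zip(gold_labels, pred_labels)
    )
    return correct / total

def instance_accuracy(gold, predicted):
    """Share of references in which all tokens are labelled correctly."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Two invented references with four tokens each; one token is mislabelled.
gold = [["author", "year", "title", "title"], ["author", "year", "title", "journal"]]
predicted = [["author", "year", "title", "title"], ["author", "year", "title", "title"]]

print(token_accuracy(gold, predicted))     # 0.875
print(instance_accuracy(gold, predicted))  # 0.5
```

Under this reading a single wrong token spoils a whole reference, which would explain why the instance figures sit well below the token figures.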

21 Experiments with Plazi Corpus
→ RefParse clearly outperforms related approaches
→ Again, accuracy lower with training (next slide)

                         RefParse-g      RefParse-l      ParsCit       FreeCite
Word / Token             94.3%           93.7%           78.9%         79.7%
Field:
- Author / Editor        97.2% / 83.7%   97.7% / 81.0%   88.3% / 0%    88.0% / 0%
- Title                  78.4%           78.5%           40.4%         32.4%
- Year of Publication    99.5%                           95.5%         89.7%
- Pagination             99.3%                           20.4%         0.3%
- Part Designators       97.7%           95.1%           42.0%         64.3%
- Volume Title           63.2%           52.5%           0.6%          0.3%
- Journal / Publisher    76.6%           75.5%           54.3%         44.3%
Instance                 69.9%           69.2%           65.6%         3.4%

22 Lexicons can be Harmful ?!
Observation in experiments: Accuracy for title and journal/publisher lower with lexicons
Totally counter-intuitive at first glance
What happens:
– Frequent infix of long, rare journal name found in lexicon …
– … and is taken as journal name proper …
– … preventing whole journal name from being found
→ Infix Match Problem
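
A small sketch of the infix match problem with a naive lexicon lookup. The long journal name "Hypothetical Regional Journal of Zoology Reports" is invented for this example, and the lookup function is an assumption rather than RefParse's actual lexicon handling.

```python
def find_journal(text, lexicon):
    """Naive lookup: return the longest lexicon entry contained in text, or None."""
    hits = [entry for entry in lexicon if entry in text]
    return max(hits, key=len) if hits else None

# The lexicon knows a frequent name, but not the long, rare one (invented below).
lexicon = {"Journal of Zoology"}
full_name = "Hypothetical Regional Journal of Zoology Reports"
ref = f"Thor, AU. 2012. The article title. {full_name} 7: 8-15"

hit = find_journal(ref, lexicon)
start = ref.index(hit)
print(hit)                      # Journal of Zoology
print(ref[:start])              # text left in front of the match
print(ref[start + len(hit):])   # text left behind the match
```

The lexicon hit covers only the middle of the real name. The words left in front of it are likely to be attached to the title and the word behind it is left unexplained, which matches the drop in title and journal/publisher accuracy reported for the runs with lexicons.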

23 Overview
Bibliographic References
Previous Parsing Approaches
The RefParse Algorithm
Evaluation
Summary & Outlook

24 Summary
RefParse algorithm:
– Combines strengths of previous approaches
– Processes whole reference lists
– Infers reference style by mutual alignment
– Independent of training data
RefParse clearly outperforms previous approaches
Lexicon lookup phenomenon: Infix Match Problem

25 Outlook
Overcome infix match problem
Improve overall accuracy in title and journal/publisher
– Blocking & block scoring (akin to knowledge based parsers)
– Exploiting redundancy to find separating punctuation
Gather experience in real-world deployment

26 Questions?
Guido Sautter, Klemens Böhm: Improved Bibliographic Reference Parsing Based on Repeated Patterns
ViBRANT Virtual Biodiversity

