Citation Extractor
Nguyen Bach, Sue Ann Hong, Ben Lambert
Extraction Task
Predicates: AuthorOf(Author, Paper), PublishedAt(Paper, Conference), IsPaper, IsAuthor, IsConference
“Citation” = “Pattern” –regular expression
Method Outline
–Seed the Citation DB (e.g. 5 citations)
–Query search engine (WIT) for known citations
–Retrieve web pages (HTML, text)
–Extract page-specific patterns using known citations
–Extract new citations using the new patterns; add them to the Citation DB and repeat
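The loop above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names `search`, `extract_patterns`, and `apply_patterns`, and the fixed-point stopping rule, are illustrative assumptions.

```python
def bootstrap(seed_citations, search, extract_patterns, apply_patterns, rounds=3):
    """Grow a citation DB from a few seeds by alternating search and extraction.

    Illustrative sketch: `search`, `extract_patterns`, and `apply_patterns`
    are assumed plug-in functions, not part of the described system.
    """
    citation_db = set(seed_citations)
    for _ in range(rounds):
        new_citations = set()
        for citation in citation_db:
            for page in search(citation):                       # query the web
                patterns = extract_patterns(page, citation_db)  # page-specific
                new_citations |= apply_patterns(page, patterns)
        if new_citations <= citation_db:                        # nothing new found
            break
        citation_db |= new_citations
    return citation_db
```

The fixed-point check makes the "short-lived bootstrapping" failure mode visible: with overly restrictive patterns, the first round yields nothing new and the loop stops immediately.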
Query: "multiple-goal recognition from low-level signals", "Xiaoyong Chai", "Qiang Yang", "AAAI 2005"
Page format: AUTHOR, AUTHOR: TITLE CONF.
4 patterns (one known field kept per pattern, the rest wildcarded):
–AUTHOR, [A-Za-z]+: [A-Za-z]+. [A-Za-z]+
–[A-Za-z]+, AUTHOR: [A-Za-z]+. [A-Za-z]+
–[A-Za-z]+, [A-Za-z]+: TITLE. [A-Za-z]+
–[A-Za-z]+, [A-Za-z]+: [A-Za-z]+. CONF
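Pattern induction of this kind can be sketched by escaping the page line and substituting the known citation fields with capture groups. This is an illustrative sketch under assumptions: the `FIELD_REGEX` character classes and the single-line page format are invented for the example, not taken from the system.

```python
import re

# Illustrative per-field wildcards (assumptions, not the system's actual classes).
FIELD_REGEX = {
    "author": r"[A-Za-z .\-]+",
    "title":  r"[^.:]+",
    "conf":   r"[A-Z]+ \d{4}",
}

def induce_pattern(line, fields):
    """Turn one page line plus known citation fields into a page-specific regex.

    fields: list of (role, text) pairs, each `text` appearing verbatim in `line`.
    Known fields become capture groups; everything else stays literal.
    """
    pattern = re.escape(line)
    for role, text in fields:
        pattern = pattern.replace(re.escape(text), f"({FIELD_REGEX[role]})", 1)
    return pattern

line = "Xiaoyong Chai, Qiang Yang: Multiple-Goal Recognition from Low-Level Signals. AAAI 2005"
pat = induce_pattern(line, [
    ("author", "Xiaoyong Chai"),
    ("author", "Qiang Yang"),
    ("title", "Multiple-Goal Recognition from Low-Level Signals"),
    ("conf", "AAAI 2005"),
])
```

Because `re.escape` is applied consistently to both the line and the fields, the substitution works even when titles contain regex metacharacters like the hyphen in "Multiple-Goal".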
Finding New Citations
Page format: AUTHOR, AUTHOR: TITLE CONF.
Applying the new patterns to matching lines on the page fills in the unknown fields: AUTHOR, CONF and AUTHOR, AUTHOR:
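The matching step might look like the sketch below: run every page-specific pattern over the page text and collect whatever matches as candidate citations. The regex and field layout here are illustrative assumptions, not the system's actual patterns.

```python
import re

def find_citations(page_text, patterns):
    """patterns: list of (regex, field_names) pairs.

    Returns a set of ((field, value), ...) tuples, one per pattern match.
    """
    found = set()
    for regex, fields in patterns:
        for m in re.finditer(regex, page_text):
            found.add(tuple(zip(fields, m.groups())))
    return found
```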
The Challenge: Patterns
–Where does a pattern begin and end? Start token? End token? HTML tags?
–Boundaries are difficult to find: token length vs. general NER?
–Are regexes sufficient? (though not really relevant for “self-supervised learning”)
–Incorporate NER as a source of possible ENTITY markers? Use labels like ‘AUTHOR’, ‘TITLE’, ‘CONF’, but with probabilities/confidence values
System Spits Out…
6 seeds → 60 citations
36 of these are partial citations:
–"Theory and Algorithms for Plan Merging", "Ming Li"
–"The Expected Value of Hierarchical Problem-Solving", "Fahiem Bacchus"
–"Handling feature interactions in process-planning"
14 of these are partial strings:
–"On D"
–"On t", "John Tromp", "Elizabeth Sweedyk", "Umest Vazirani"
–"An L", "Ronan Sleep"
–"To D"
No new conferences (end-token problem)
Bootstrapping, Short-Lived
Highly restrictive regexes
–No recovery from misses
–The more seeds and variety, the better
Stupid Little Things
–Mis-capitalization
–Variations in titles (‘-’ vs. ‘ ’)
–Etc., etc., etc.…
Why is this one hard?
Extensions ~ Improvements
Less strict string matching
–Not case- and punctuation-sensitive
Better boundary detection
–Start/end tokens, HTML wrapper detection?
Better pattern construction
–e.g. n authors, not 2
NER
–Helps find the right "window"
–A source of ENTITY markers: use labels like ‘AUTHOR’, ‘TITLE’, ‘CONF’ but with probabilities/confidence values
Evaluation against DBLP?
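The "less strict string matching" extension can be sketched as a normalization step applied before comparing seeds against page text. This is one plausible implementation, not the authors'; the exact normalization rules are assumptions.

```python
import re
import string

def normalize(s):
    """Canonical form ignoring case, punctuation, and whitespace variants."""
    s = s.lower()
    s = re.sub(r"[-_]", " ", s)    # treat '-' vs ' ' title variants as equal
    s = s.translate(str.maketrans("", "", string.punctuation))
    return " ".join(s.split())     # collapse runs of whitespace

def loose_match(a, b):
    """Match seed strings against page text up to normalization."""
    return normalize(a) == normalize(b)
```

Replacing hyphens with spaces before stripping punctuation keeps "process-planning" and "process planning" equal instead of collapsing the former into "processplanning".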
NER
Baseline model (news corpus):
–M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
–S. Awodey. Topological Representation of the Lambda Calculus. September Math. Struct. in Comp. Sci. (2000), vol. 10, pp
Adapted model (news + citation corpus):
–M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
–L. Birkedal. A General Notion of Realizability. December Proceedings of LICS 2000
NER
HMM-based model (Bikel et al., 1999)
–Baseline NER: 94% F-score
–Trained on 1.1 million words in the News and Broadcast News domains
–Apply the baseline model to recognize: Author, Conference, Location
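For intuition, a toy HMM tagger in the spirit of such a model can be written in a few lines. This is emphatically not Bikel's model: the two states, the hand-set probabilities, and the `emit` heuristic (initials and capitalized tokens look author-like) are all invented for illustration.

```python
import math

STATES = ("AUTHOR", "OTHER")
START = {"AUTHOR": 0.5, "OTHER": 0.5}
TRANS = {"AUTHOR": {"AUTHOR": 0.7, "OTHER": 0.3},
         "OTHER":  {"AUTHOR": 0.2, "OTHER": 0.8}}

def emit(state, token):
    """Toy emission model: initials and capitalized words look like names."""
    looks_name = token[0].isupper() or token.endswith(".")
    if state == "AUTHOR":
        return 0.8 if looks_name else 0.1
    return 0.4 if looks_name else 0.9

def viterbi(tokens):
    """Most likely state sequence under the toy HMM (log-space Viterbi)."""
    V = [{s: math.log(START[s]) + math.log(emit(s, tokens[0])) for s in STATES}]
    back = []
    for tok in tokens[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: V[-1][p] + math.log(TRANS[p][s]))
            row[s] = V[-1][prev] + math.log(TRANS[prev][s]) + math.log(emit(s, tok))
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    best = max(STATES, key=lambda s: V[-1][s])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Even this toy version shows the failure mode noted on the next slide: any capitalized word scores well under AUTHOR, so the tagger can be too aggressive.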
NER: Examples with the Baseline Model
–M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
–L. Birkedal. A General Notion of Realizability. December Proceedings of LICS 2000
–S. Awodey. Topological Representation of the Lambda Calculus. September Math. Struct. in Comp. Sci. (2000), vol. 10, pp
Good at detecting author-name boundaries, but sometimes too aggressive.
NER Adaptation
Goal: adapt the baseline model to work better in the citation domain.
Issue: no training data.
A solution:
–Take 300 citations
–Run the baseline model, then hand-correct its output
–Training: replicate the 300 citations 10×, then train the adapted model together with the Broadcast News corpus
NER: Examples with the Adapted Model
–M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
–L. Birkedal. A General Notion of Realizability. December Proceedings of LICS 2000
–D. Litman, D. Bhembe, C. P. Rose, K. Forbes-Riley, S. Silliman, & K. VanLehn (2004). Spoken Versus Typed Human and Computer Dialogue Tutoring, Proceedings of the Intelligent Tutoring Systems Conference.
How Can NER Help?
Provide the system with generic patterns:
–AUTHOR = M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93:
–CONFERENCE = International Conference on Acoustics, Speech
Then use specific rules to refine.
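The proposed combination, treating NER output as soft ENTITY markers with confidence values, can be sketched as a simple thresholding step. The threshold, the span format, and the `LITERAL` fallback label are illustrative assumptions.

```python
def refine_slots(ner_spans, threshold=0.8):
    """Promote NER spans to generic pattern slots only when confident.

    ner_spans: list of (label, text, confidence) triples from an NER tagger.
    Low-confidence spans fall back to literal text, which the page-specific
    rules can still match exactly.
    """
    slots = []
    for label, text, conf in ner_spans:
        if conf >= threshold:
            slots.append((label, text))        # generic slot, e.g. AUTHOR
        else:
            slots.append(("LITERAL", text))    # fall back to exact match
    return slots
```

This keeps an over-aggressive tagger from polluting the patterns: uncertain spans stay literal, and only high-confidence entities generalize.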
Lessons Learned (Another Boring Text Slide)
–Semi-structured text is surprisingly difficult to read
–Off-line training for wrappers and/or NER may help
–Very high-confidence rules are needed to ensure precision
–A continuously-running system needs robustness (internet/Google failures, unexpected errors, …)