Citation Extractor Nguyen Bach Sue Ann Hong Ben Lambert.

1 Citation Extractor Nguyen Bach Sue Ann Hong Ben Lambert

2 Extraction Task
AuthorOf(Author, Paper)
PublishedAt(Paper, Conference)
IsPaper, IsAuthor, IsConference
“Citation” = “Pattern”
– regular expression

3 Method Outline
[Diagram] Seed (e.g. 5 citations) → Citation DB → Query → Search (WIT) → Web pages (HTML, text) → Extract Patterns using known citations → Page-specific Patterns → Extract Citations using new patterns → Citations → Citation DB
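The loop in this outline can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `search`, `learn_patterns`, and `apply_patterns` are hypothetical callables standing in for the WIT search step and the two extraction steps, and the `Citation` tuple layout is an assumption.

```python
from typing import Callable, Iterable, List, Set, Tuple

Citation = Tuple[str, ...]  # assumed layout, e.g. (author, title, conference)


def bootstrap(
    seeds: Iterable[Citation],
    search: Callable[[Citation], List[str]],               # hypothetical: citation -> pages
    learn_patterns: Callable[[str, Citation], List[str]],  # hypothetical: page-specific patterns
    apply_patterns: Callable[[str, List[str]], List[Citation]],
    max_rounds: int = 3,
) -> Set[Citation]:
    """Grow a citation DB from a few seeds by alternating pattern and citation extraction."""
    db: Set[Citation] = set(seeds)
    frontier: Set[Citation] = set(db)  # citations not yet used as queries
    for _ in range(max_rounds):
        found: Set[Citation] = set()
        for citation in frontier:
            for page in search(citation):
                # Patterns are learned per page from a citation we already know.
                patterns = learn_patterns(page, citation)
                found.update(apply_patterns(page, patterns))
        frontier = found - db  # only genuinely new citations feed the next round
        if not frontier:
            break  # nothing new was found, so the bootstrap stops
        db |= frontier
    return db
```

The early exit when no new citations appear corresponds to the "short-lived" bootstrapping behaviour reported later in the deck.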

4 Query: "multiple-goal recognition from low-level signals" "Xiaoyong Chai" "Qiang Yang" "AAAI 2005"
Page: AUTHOR, AUTHOR: TITLE CONF.
4 Patterns:
– AUTHOR, (A-Za-z): (A-Za-z). (A-Za-z)
– (A-Za-z), AUTHOR: (A-Za-z). (A-Za-z)
– (A-Za-z), (A-Za-z): TITLE. (A-Za-z)
– (A-Za-z), (A-Za-z): (A-Za-z). CONF
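One way to derive such page-specific patterns from a known citation is sketched below. It is an assumption-laden illustration, not the system's code: `build_patterns` is a hypothetical helper, the lazy wildcard `.+?` stands in for the slide's `(A-Za-z)` placeholder, and the example line is made up to follow an "A, B: TITLE. CONF" layout.

```python
import re
from typing import List, Tuple

WILDCARD = r".+?"  # assumed stand-in for the slide's (A-Za-z) placeholder


def build_patterns(line: str, fields: List[Tuple[str, str]]) -> List[str]:
    """Return one regex per known field; that field's slot becomes a named group."""
    spans = []
    for label, value in fields:
        start = line.find(value)  # strict, case-sensitive lookup
        if start < 0:
            return []  # a known field is missing from the line: no pattern
        spans.append((start, start + len(value), label))
    spans.sort()
    # Refuse overlapping fields, since the delimiters between them would be ambiguous.
    for (_, prev_end, _), (next_start, _, _) in zip(spans, spans[1:]):
        if next_start < prev_end:
            return []

    patterns = []
    for target in range(len(spans)):
        parts, cursor = [], 0
        for i, (start, end, label) in enumerate(spans):
            parts.append(re.escape(line[cursor:start]))  # literal delimiter text
            if i == target:
                parts.append(f"(?P<{label}>{WILDCARD})")  # the labelled slot
            else:
                parts.append(f"({WILDCARD})")  # generalized slot
            cursor = end
        parts.append(re.escape(line[cursor:]))  # trailing literal text, if any
        patterns.append("".join(parts))
    return patterns


# Made-up page line in the layout shown on the slide.
line = "Xiaoyong Chai, Qiang Yang: Multiple-Goal Recognition from Low-Level Signals. AAAI 2005"
fields = [
    ("AUTHOR", "Xiaoyong Chai"),
    ("AUTHOR", "Qiang Yang"),
    ("TITLE", "Multiple-Goal Recognition from Low-Level Signals"),
    ("CONF", "AAAI 2005"),
]
for pattern in build_patterns(line, fields):
    print(pattern)
```

Because the lookup is exact, a title that differs only in capitalization or hyphenation yields no pattern at all, which is the brittleness the later slides discuss.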

5 Finding New Citations
Patterns:
– AUTHOR, (A-Za-z): (A-Za-z). (A-Za-z)
– (A-Za-z), AUTHOR: (A-Za-z). (A-Za-z)
– (A-Za-z), (A-Za-z): TITLE. (A-Za-z)
– (A-Za-z), (A-Za-z): (A-Za-z). CONF
AUTHOR, AUTHOR: TITLE CONF.
AUTHOR, CONF
AUTHOR, AUTHOR:
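Applying the four patterns to other lines of the same page might look like the sketch below. The regexes are hand-written equivalents of the slide's patterns with `.+?` assumed in place of `(A-Za-z)`, and `extract_citations` is a hypothetical helper name.

```python
import re
from typing import Dict, List

# Hand-written equivalents of the four slide patterns; `.+?` is an assumed wildcard.
PATTERNS = [
    r"(?P<AUTHOR>.+?), (.+?): (.+?)\. (.+?)",
    r"(.+?), (?P<AUTHOR>.+?): (.+?)\. (.+?)",
    r"(.+?), (.+?): (?P<TITLE>.+?)\. (.+?)",
    r"(.+?), (.+?): (.+?)\. (?P<CONF>.+?)",
]


def extract_citations(lines: List[str], patterns: List[str]) -> List[Dict[str, List[str]]]:
    """Collect the labelled slots of every pattern that matches a whole line."""
    results = []
    for line in lines:
        record: Dict[str, List[str]] = {}
        for pattern in patterns:
            match = re.fullmatch(pattern, line.strip())
            if match is None:
                continue  # strict matching: a near miss contributes nothing
            for label, value in match.groupdict().items():
                values = record.setdefault(label, [])
                if value not in values:
                    values.append(value)
        if record:
            results.append(record)
    return results
```

A line that fits the layout only partially produces no record here; a looser matcher would be needed to obtain the partial citations the deck reports.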

6 The Challenge: Patterns
Beginning and the end
– Start token? End token? HTML tags? → difficult to find: length of token vs. general NER?
– These things should be talked about while viewing the previous slide
Are regexes sufficient? (but not really relevant for “self-supervised learning”)
Incorporating NER as a source of possible ENTITY markers?
– Use like ‘AUTHOR’, ‘TITLE’, ‘CONF’ but with probabilities/confidence values

7 System Spits Out…
6 seeds → 60 citations
36 of these (partial citations)
– "Theory and Algorithms for Plan Merging ", " Ming Li"
– "The Expected Value of Hierarchical Problem-Solving ", " Fahiem Bacchus"
– "Handling feature interactions in process-planning "
14 of these (partial strings)
– "On D "
– "On t ", " John Tromp", " Elizabeth Sweedyk", " Umest Vazirani"
– "An L ", " Ronan Sleep"
– "To D "
No new conferences (end-token)

8 Bootstrapping, Short-Lived
Highly restrictive regexes
– No recovery
– The more seeds and variety, the better
Stupid Little Things
– Mis-capitalization
– Variations in titles (‘-’ vs. ‘ ’)
– Etc., etc., etc.

9 Why is this one hard?

10 Extensions ~ Improvements
Less strict string matching
– Not case and punctuation sensitive
Better boundary detection
– Start/end tokens, HTML wrapper detection?
Better pattern construction
– e.g. n authors, not 2
NER
– Help find the right "window"
– A source of ENTITY markers: use like ‘AUTHOR’, ‘TITLE’, ‘CONF’ but with probabilities/confidence values
Evaluation with DBLP?
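Two of these extensions, looser string matching and handling n authors, could be sketched as follows. Both helpers (`normalize`, `split_authors`) are hypothetical names, and the separator rules are assumptions rather than anything stated in the deck.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # '-' and other punctuation become spaces
    return " ".join(text.split())


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring case and punctuation differences."""
    return normalize(a) == normalize(b)


def split_authors(author_list: str) -> List[str]:
    """Split an author list of any length on ',', '&', or the word 'and'."""
    parts = re.split(r"\s*,\s*|\s+and\s+|\s*&\s*", author_list)
    return [part.strip() for part in parts if part.strip()]
```

With this, a query title such as "multiple-goal recognition from low-level signals" compares equal to a page title that differs only in capitalization or in using spaces instead of hyphens.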

11 NER
Baseline model (News corpus)
– M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
– S. Awodey. Topological Representation of the Lambda Calculus. September Math. Struct. in Comp. Sci. (2000), vol. 10, pp
Adapted model (News + citation corpus)
– M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
– L. Birkedal. A General Notion of Realizability. December Proceedings of LICS 2000

12 NER
HMM-based model (Bikel et al., 1999)
Baseline NER: 94% F-score
Trained on 1.1 million words from the News and Broadcast News domains
Apply the baseline model to recognize
– Author, Conference, Location

13 NER: Example with Baseline Model
– M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
– L. Birkedal. A General Notion of Realizability. December Proceedings of LICS 2000
– S. Awodey. Topological Representation of the Lambda Calculus. September Math. Struct. in Comp. Sci. (2000), vol. 10, pp
Good at detecting author-name boundaries, but sometimes too aggressive.

14 Adaptation NER
Goal: adapt the baseline model to work better in the citation domain.
Issue: no training data.
A solution:
– Take 300 citations
– Run the baseline model on them, then correct its output by hand
– Train: replicate the 300 corrected citations 10 times, then train the adapted model on them together with the broadcast news corpus
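Read as simple oversampling, the training-set construction could look like the sketch below. This is one interpretation of "multiply 300 citations by 10"; `build_adaptation_corpus` is a hypothetical helper and the training examples are represented as plain strings for illustration.

```python
from typing import List, Sequence


def build_adaptation_corpus(
    news_examples: Sequence[str],
    corrected_citations: Sequence[str],
    copies: int = 10,
) -> List[str]:
    """Combine the large news corpus with repeated copies of the small citation set."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # Repeating the in-domain examples gives them more weight against the
    # much larger out-of-domain news data.
    return list(news_examples) + list(corrected_citations) * copies
```

With 300 corrected citations and `copies=10`, the citation domain contributes 3,000 training examples on top of the news data.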

15 NER: Example with Adaptation Model
– M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
– L. Birkedal. A General Notion of Realizability. December Proceedings of LICS 2000
– D. Litman, D. Bhembe, C. P. Rose, K. Forbes-Riley, S. Silliman, & K. VanLehn (2004). Spoken Versus Typed Human and Computer Dialogue Tutoring, Proceedings of the Intelligent Tutoring Systems Conference.

16 How can NER help?
Provide the system with generic patterns.
– AUTHOR = M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93:
– CONFERENCE = International Conference on Acoustics, Speech
Then use specific rules to refine.
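A refinement rule of the kind mentioned here might trim an over-long AUTHOR span back to the items that actually look like names. The rule below is purely illustrative: `refine_author_span` and the "initials plus surname" regex are assumptions, not rules taken from the deck.

```python
import re
from typing import List

# Assumed name shape: one or more initials followed by a capitalized surname,
# e.g. "M. Woszczyna", "F. D. Buo", "N. Aoki-Waibel".
NAME = re.compile(r"^(?:[A-Z]\.\s?)+[A-Z][A-Za-z\-]+$")


def refine_author_span(span: str) -> List[str]:
    """Keep the leading comma-separated items that look like author names."""
    authors = []
    for part in span.split(","):
        part = part.strip()
        if not NAME.match(part):
            break  # first non-name item ends the author list (e.g. a year)
        authors.append(part)
    return authors
```

Applied to the AUTHOR span above, this rule would stop at "1994" and drop the trailing "JANUS 93:" fragment.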

17 Lessons Learned
Another Boring Text Slide
– Semi-structured text is surprisingly difficult to read
– Off-line training for wrappers and/or NER may help
– Need very high-confidence rules to ensure precision
– A continuously running system needs robustness (internet/Google failure, unexpected errors, …)
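For the robustness point, a continuously running extractor could wrap each network call in a retry helper along these lines. This is a generic sketch under stated assumptions: `with_retries` is a hypothetical helper, and the attempt count and backoff values are arbitrary defaults.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    delay: float = 1.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with a growing delay; re-raise the last error if all attempts fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(delay)  # wait before the next try
            delay *= backoff
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

A caller would pass the search request as a zero-argument function, for example `with_retries(lambda: search(query))`, where `search` is whatever function performs the web query.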

