Extracting Academic Affiliations Status Report Alicia Tribble Einat Minkov Andy Schlaikjer Laura Kieras.

1 Extracting Academic Affiliations Status Report Alicia Tribble Einat Minkov Andy Schlaikjer Laura Kieras

2 The Problem
Identify people who are affiliated with an academic institution
–Degrees earned
–Positions held (student, post-doc, faculty)
–Current position
Class of beliefs to be learned:
–affiliated(,, )

3 The System / Algorithm
[Diagram. Components: Patterns; Relations (facts); Query Generator (query pattern, query relation); Search Engine Interface; HTML files; Extract patterns / Extract relations; Assess patterns / Assess relations.]
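The component list on this slide suggests a bootstrap loop that alternates between patterns and relations. A minimal sketch of such a loop follows, assuming the query, search, and extraction steps are hidden behind two hypothetical callables (`find_relations`, `find_patterns`) and that assessment is a simple accept/reject predicate (`accept`). None of these names come from the slides, and the order of the two steps within an iteration is a guess.

```python
def bootstrap(seed_relations, seed_patterns, find_relations, find_patterns,
              accept=lambda item: True, iterations=2):
    """Alternate between pattern-driven and relation-driven search.

    find_relations(pattern) -> iterable of relations found via a pattern query.
    find_patterns(relation) -> iterable of patterns found via a relation query.
    accept(item)            -> True if the extracted item passes assessment.
    All three callables are hypothetical stand-ins for the slide's components.
    """
    relations, patterns = set(seed_relations), set(seed_patterns)
    for _ in range(iterations):
        # Query with each known pattern, extract and assess relations.
        relations |= {r for p in patterns for r in find_relations(p) if accept(r)}
        # Query with each known relation, extract and assess patterns.
        patterns |= {p for r in relations for p in find_patterns(r) if accept(p)}
    return relations, patterns
```

With `iterations=2` this matches the number of bootstrap iterations reported in the experimental settings, but the real system's assessment step is not described in enough detail to reproduce here.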

4 Algorithm Details
Pattern query formulation
–Replace each argument slot in the pattern string with the '*' operator
–Remove leading and trailing '*'s
–Wrap query string in quotes
–Example: " received his from " -becomes- '"received his * from"'
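The three formulation steps can be sketched as a small function. The slot markers appear to have been lost from this transcript, so the `<person>`, `<degree>`, and `<institution>` notation below is an assumption made only for illustration, as is the function name `pattern_to_query`.

```python
import re

# Assumed slot notation: any token of the form <name>. Not from the slides.
SLOT = re.compile(r"<[^<>]+>")

def pattern_to_query(pattern: str) -> str:
    """Turn a slot pattern into a quoted wildcard query string."""
    # Step 1: replace every slot with the search engine's '*' operator.
    tokens = SLOT.sub("*", pattern).split()
    # Step 2: remove leading and trailing '*' tokens.
    while tokens and tokens[0] == "*":
        tokens.pop(0)
    while tokens and tokens[-1] == "*":
        tokens.pop()
    # Step 3: wrap the remaining phrase in double quotes.
    return '"' + " ".join(tokens) + '"'

print(pattern_to_query("<person> received his <degree> from <institution>"))
# prints "received his * from" (including the double quotes)
```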

5 Algorithm Details
Relation Extraction (Slot filling)
–Find the relevant sentence/s on a page
–Alignment – slot filling
–Some cleanup – “he”, capitalization
–Examples:
Robertson, Ph.D. in ecology and evolutionary biology, Indiana University
Jeff, B.S., Bucknell University
Rex Jung, degree, University of New Mexico
Alavosius, BA in psychology, Clark University
Jacobs, B.E.E. degree, Cornell University
He, Associates Degree in Livestock Production, Northeast Community College
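One way to picture the alignment step is to compile a slot pattern into a regular expression with named groups and read the slot values off the match. This is a sketch under assumptions: the `<person>`/`<degree>`/`<institution>` slot notation, the helper names, and the cleanup rule (whitespace collapsing and trimming of stray punctuation) are all illustrative, since the slides only name the cleanup step. The test sentences are made up from the extracted examples listed above.

```python
import re

SLOT = re.compile(r"<(\w+)>")  # assumed slot notation, not from the slides

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Compile a slot pattern into a regex with one named group per slot."""
    pieces = SLOT.split(pattern)  # literal, slot, literal, slot, ..., literal
    regex = ""
    for i, piece in enumerate(pieces):
        if i % 2 == 0:
            regex += re.escape(piece)             # literal context text
        elif i == len(pieces) - 2 and pieces[-1] == "":
            regex += f"(?P<{piece}>[^,.;]+)"      # trailing slot: stop at punctuation
        else:
            regex += f"(?P<{piece}>.+?)"          # inner slot: shortest match
    return re.compile(regex)

def clean(value: str) -> str:
    """Guessed cleanup: collapse whitespace, trim stray separators."""
    return " ".join(value.split()).strip(" ,;:")

def fill_slots(pattern: str, sentence: str):
    """Return (person, degree, institution), or None if the pattern does not align.

    Assumes the pattern contains all three slots.
    """
    match = pattern_to_regex(pattern).search(sentence)
    if match is None:
        return None
    return tuple(clean(match.group(name)) for name in ("person", "degree", "institution"))
```

Note that a sentence starting with a pronoun yields "He" as the person value, which is the behaviour visible in the last extracted example.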

6 Algorithm Details
Relation query formulation
–All argument values become query terms
–Example: (William Cohen, Ph.D., Rutgers) -becomes- 'William Cohen Ph.D. Rutgers'
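Relation query formulation amounts to joining the argument values into one unquoted query string. A minimal sketch, with `relation_to_query` as a hypothetical helper name:

```python
def relation_to_query(relation) -> str:
    """Join all non-empty argument values into a single query string."""
    return " ".join(value.strip() for value in relation if value.strip())

print(relation_to_query(("William Cohen", "Ph.D.", "Rutgers")))
# prints: William Cohen Ph.D. Rutgers
```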

7 Algorithm Details
Pattern Extraction
–Build a regex from a relation, one per argument
(Mr\.|Mr|MR|M\.?+r\.?+|Dr\.?+|Mrs\.?+|MRS|Ms|MS)* ?+(Scott Fahlman|Scott|Fahlman)
([a-zA-Z]*? [dD]egree|[Dd]octoral [Dd]egree|PhD|Ph\.D\.|Doctorate|PHD)
(MIT)
–Apply regex to input and for every match, extract intermediate string and generalize
received her from
received his from
earned a from
s, MD
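The idea of matching a known relation in text and generalizing the text between its arguments can be sketched as follows. This is a simplified illustration, not the authors' regexes: it builds plain alternations from variant lists rather than the title-aware, possessive-quantifier expressions shown on the slide, it combines the three argument regexes into one, and the `<person>`/`<degree>`/`<institution>` slot notation and the `max_gap` limit are assumptions.

```python
import re

def alternation(variants) -> str:
    """Regex alternation over literal variants, longest first."""
    ordered = sorted(variants, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(v) for v in ordered) + ")"

def extract_patterns(text, person_variants, degree_variants,
                     institution_variants, max_gap=40):
    """Find the relation's arguments in order and generalize the text between them."""
    gap = r"(.{1,%d}?)" % max_gap  # short intermediate string, shortest match
    regex = re.compile(
        alternation(person_variants) + gap
        + alternation(degree_variants) + gap
        + alternation(institution_variants)
    )
    patterns = []
    for match in regex.finditer(text):
        before_degree, before_institution = match.group(1), match.group(2)
        # Generalize: keep the connecting text, replace arguments with slots.
        patterns.append(
            f"<person>{before_degree}<degree>{before_institution}<institution>"
        )
    return patterns

found = extract_patterns(
    "Scott Fahlman received his Ph.D. from MIT.",
    ["Scott Fahlman", "Scott", "Fahlman"],
    ["PhD", "Ph.D.", "Doctorate"],
    ["MIT"],
)
print(found)  # ['<person> received his <degree> from <institution>']
```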

8 Experimental Settings
Initial seeds
–Relations
affiliated('William Cohen', 'Ph.D.', 'Duke University')
affiliated('Tom Mitchell', 'Ph.D.', 'Stanford')
affiliated('Scott Fahlman', 'Ph.D.', 'MIT')
–Patterns
received his from
earned his from
earned a from
Testing and development performed with 2 bootstrap iterations, using only Google snippets

9 Results!
initial: patterns: 3, relations: 3
iteration 0: patterns: 6 (+3), relations: 13 (+3)
iteration 1: patterns: 14 (+9), relations: 0
total: patterns: 23, relations: 16

10 Interim Conclusions
Issue I: over-specificity of query arguments
Q: "Oren Etzioni" "Ph.D" "CMU"
But what if the actual relevant mention is:
A: "Oren Etzioni", "doctorate", "Carnegie Mellon University"?
Possible avenues:
–Larger dictionaries
–Unquote query arguments? (allow for some variation)
–Allow argument values to include arbitrary terms: "Oren * Etzioni"
This might introduce more noise and require additional queries to be issued per relation.
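The cost of the proposed relaxations can be made concrete by enumerating the query variants they would generate for one relation. The sketch below assumes a toy synonym dictionary (`SYNONYMS`) and a hypothetical helper `query_variants`; the entries are illustrative only and do not come from the project's dictionaries.

```python
from itertools import product

# Toy dictionary of surface variants; entries are assumptions for illustration.
SYNONYMS = {
    "Ph.D": ["Ph.D", "PhD", "doctorate"],
    "CMU": ["CMU", "Carnegie Mellon University"],
}

def query_variants(person: str, degree: str, institution: str):
    """Enumerate quoted-argument queries over dictionary and wildcard variants."""
    # Allow arbitrary terms between name tokens, e.g. "Oren * Etzioni".
    people = list(dict.fromkeys([person, " * ".join(person.split())]))
    degrees = SYNONYMS.get(degree, [degree])
    institutions = SYNONYMS.get(institution, [institution])
    return [
        f'"{p}" "{d}" "{i}"'
        for p, d, i in product(people, degrees, institutions)
    ]

variants = query_variants("Oren Etzioni", "Ph.D", "CMU")
print(len(variants))  # 12 queries for a single relation
```

Even with this tiny dictionary, one relation expands to twelve queries, which illustrates the additional query load noted on the slide.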

11 Interim Conclusions
Issue II: name and pronoun resolution
Q: "Oren Etzioni" "Ph.D" "CMU"
But what if the actual relevant mention is:
A: "He received his Ph.D from CMU in..."
Rate of occurrence of "S/he..." in extracted relations
–1 pattern, 50 queries: 56.8% (96/169)
Possible avenues:
–Identify homepages and extract names from titles, or other unambiguous sources on the page
–Simple pronoun resolution techniques? (For example, identify the immediately preceding name mention. This may require NER.)
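The "immediately preceding name mention" heuristic could look roughly like the sketch below. It substitutes a crude capitalized-word regex for real NER, so it would also pick up institution names and other capitalized phrases; the helper name `resolve_person`, the pronoun list, and the example text are all assumptions for illustration.

```python
import re

PRONOUNS = {"he", "she", "s/he"}
# Crude stand-in for NER: two or more consecutive capitalized words.
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

def resolve_person(person: str, text: str, mention_start: int):
    """Replace a pronoun with the nearest name-like mention before the sentence.

    mention_start is the character offset where the extracted sentence begins.
    Returns the original value for non-pronouns, or None if nothing precedes.
    """
    if person.lower() not in PRONOUNS:
        return person
    candidates = [m.group(0) for m in NAME.finditer(text, 0, mention_start)]
    return candidates[-1] if candidates else None

page = "Oren Etzioni is a researcher. He received his Ph.D from CMU."
print(resolve_person("He", page, page.index("He received")))  # Oren Etzioni
```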

12 Interim Conclusions
Issue III: compound sentences
Q: "Oren Etzioni" "Ph.D" "CMU"
But what if the actual relevant mention is:
A: "Oren Etzioni received his MS from, and his Ph.D from CMU"
Possible avenues:
–Extensions to the pattern extraction technique
–May require dependency parsing

13 Software / Resources
–A generic search framework which allows asynchronous processing of search tasks, as well as "filter" tasks (processing of resulting URLs)
–A URL caching implementation of Java 1.5's java.net.ResponseCache using Hibernate, supporting centralized caching and remote access

14 Generic Search Framework
[Diagram labels: Search, URL, Extraction, Result; Search Tasks, Filter Tasks; SearchProcessor; Extraction Search, Extraction Filter.]
Test run: 1 search, 50 URLs, 169 extractions, 15 seconds

15 Search Framework System Flow
[Diagram labels: Relation, Pattern, SearchProcessor, Validate.]

16 Extensions
–Dictionaries (next slide)
–Simple pronoun resolution
–Extraction validation metrics
–URL of professor’s personal home page
–Clustering of people / universities, or normalization of names
–Identify biography section of personal home pages
–Links incoming and outgoing from personal home page

17 Additional information
Dictionary of institution names
Tiny dictionary of degrees
–E.g. Ph.D., B.S., B. Tech., etc.
Map of domain names to institution names
–E.g. cmu.edu : Carnegie Mellon University
–This could be learned but we will leave that for another group!
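The domain-to-institution map could be applied to a page URL by checking successively shorter suffixes of the host name. A minimal sketch, assuming a hand-built dictionary with the single entry given on the slide; the helper name `institution_for_url` and the example URL are illustrative.

```python
from urllib.parse import urlparse

# Hand-built map; only the cmu.edu entry comes from the slide.
DOMAIN_TO_INSTITUTION = {
    "cmu.edu": "Carnegie Mellon University",
}

def institution_for_url(url: str):
    """Return the institution whose domain is a suffix of the URL's host, if any."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Try www.cs.cmu.edu, then cs.cmu.edu, then cmu.edu, then edu.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in DOMAIN_TO_INSTITUTION:
            return DOMAIN_TO_INSTITUTION[candidate]
    return None

print(institution_for_url("http://www.cs.cmu.edu/~wcohen/"))
# prints: Carnegie Mellon University
```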

18 Example extracted relations

