Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowsky.
Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowsky Johns Hopkins University Presented by Ben Wellington Right In Front of You
Outline Reasons for Research Information Extraction AutoSlog Projection Experiments/Results Conclusions
Current IE Systems Current Information Extraction (IE) systems are expensive –Parsing tools –Development texts –Specialized dictionaries Language and domain specific Not easily portable across languages
Resources Available Not all languages have equal resources Annotated corpora and text analysis tools are easily available for English For most of the world’s languages, these don’t exist.
Information Extraction Task Plane Crashes –Find victim, vehicle, location “AutoSlog-TS” –Gather statistics from a corpus of relevant texts (within the domain) and irrelevant ones (outside the domain) –Each pattern is a linguistic expression that can extract noun phrases from one of three syntactic positions: subject, direct object, or object of a prepositional phrase. –“<subject> crashed”, “hijacked <direct object>”
AutoSlog Used AutoSlog-TS on AP news stories –About plane crashes –Not about plane crashes Creates a list of extraction patterns, ranked according to their association with the domain –A human reviews the list to decide which are useful –For new text, use “Sundance”, a shallow parser. –“<subject> crashed”, “hijacked <direct object>”
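The ranking step can be illustrated with a small sketch. The slides only say that patterns are ranked by their association with the domain; the RlogF-style score used here (relevance rate times log2 of the relevant-text frequency) is an assumption about the scoring, and the function name, the `min_freq` cutoff, and the pattern counts are made up for illustration.

```python
import math
from collections import Counter


def rank_patterns(relevant_hits, irrelevant_hits, min_freq=2):
    """Rank patterns by an RlogF-style score (assumed, not taken from the slides).

    relevant_hits / irrelevant_hits: one entry per time a pattern fired
    in a relevant / irrelevant text.
    """
    rel = Counter(relevant_hits)
    irr = Counter(irrelevant_hits)
    scored = []
    for pattern, rel_freq in rel.items():
        total = rel_freq + irr[pattern]
        if total < min_freq:
            continue  # too rare to trust the statistics
        relevance_rate = rel_freq / total
        scored.append((pattern, relevance_rate * math.log2(rel_freq)))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


# Illustrative counts only.
relevant = (["<subject> crashed"] * 8
            + ["hijacked <direct object>"] * 4
            + ["<subject> said"] * 6)
irrelevant = ["<subject> said"] * 30 + ["hijacked <direct object>"]
ranking = rank_patterns(relevant, irrelevant)
for pattern, score in ranking:
    print(f"{score:5.2f}  {pattern}")
```

A domain-neutral pattern such as “<subject> said” fires often but mostly outside the domain, so it ends up at the bottom of the list that the human reviewer sees.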
Cross-Language Projection Yarowsky 2001 developed cross-language projection of –POS tagging –Base noun phrases –Named-entity tags –Morphological analysis. Atserias 1997 developed cross-language –Ontologies and WordNets
Mechanics of Projection Use Machine Translation tools to create an artificial parallel corpus –Adds error (-) –Frees cross-language projection research from existing bitexts, such as “Canadian Hansards” (+) –Can’t find “plane crash” bitexts readily. (+)
The Algorithm 1. Sentence-align the parallel corpus 2. Word-align the parallel corpus using the GIZA++ system 3. Transfer English IE annotations and noun-phrase boundaries 4. Train a standalone IE tagger on these projected annotations
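Step 3 can be sketched as follows, assuming word alignments are already available as (English index, French index) pairs. The hand-written alignment below stands in for real aligner output, and the function name, tag names, and the rule that the first aligned tag wins are illustrative assumptions, not details from the slides.

```python
def project_tags(src_tags, alignment, tgt_len, outside="O"):
    """Copy each source-side IE tag onto the target words aligned to it.

    alignment: iterable of (src_index, tgt_index) pairs.
    Unaligned target words keep the `outside` tag.
    """
    tgt_tags = [outside] * tgt_len
    for src_i, tgt_j in sorted(alignment):
        tag = src_tags[src_i]
        # Assumed tie-break: the first non-outside tag to reach a word wins.
        if tag != outside and tgt_tags[tgt_j] == outside:
            tgt_tags[tgt_j] = tag
    return tgt_tags


english = ["The", "plane", "crashed", "in", "Peru"]
english_tags = ["VEHICLE", "VEHICLE", "O", "O", "LOCATION"]
french = ["L'", "avion", "s'est", "écrasé", "au", "Pérou"]
# Hypothetical word alignment for this sentence pair.
alignment = [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4), (4, 5)]

projected = project_tags(english_tags, alignment, len(french))
print(list(zip(french, projected)))
```

The projected French tags then serve as (noisy) training data for the standalone tagger in step 4.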
Transformation-Based Learning TBL was discussed earlier in the course. The tagger applies a number of transformations of the form “in context C, if a word is tagged A, change its tag to B”. The rule learner starts with the initial tagging (each word assigned its most common tag), tries all possible transformations, and selects the one that produces the greatest improvement (maximizes errors corrected minus errors introduced). It then applies that transformation and repeats.
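A toy version of the greedy learning loop is sketched below. It assumes a single rule template (“if a word is tagged A and the previous word is C, change its tag to B”); the templates actually used in the work are not reproduced here, and all names and the example sentence are illustrative.

```python
from itertools import product


def apply_rule(words, tags, rule):
    """Apply one (from_tag, to_tag, prev_word) transformation to a tagging."""
    from_tag, to_tag, prev_word = rule
    new_tags = list(tags)
    for i in range(1, len(words)):
        if tags[i] == from_tag and words[i - 1] == prev_word:
            new_tags[i] = to_tag
    return new_tags


def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(words, initial_tags, gold, max_rules=5):
    """Greedily pick the rule with the largest net error reduction, then repeat."""
    tags = list(initial_tags)
    tagset = sorted(set(gold) | set(initial_tags))
    contexts = sorted(set(words))
    learned = []
    for _ in range(max_rules):
        best_rule, best_gain = None, 0
        for from_tag, to_tag, prev_word in product(tagset, tagset, contexts):
            if from_tag == to_tag:
                continue
            rule = (from_tag, to_tag, prev_word)
            # Net gain = errors corrected minus errors introduced.
            gain = errors(tags, gold) - errors(apply_rule(words, tags, rule), gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:
            break  # no transformation improves the tagging any further
        tags = apply_rule(words, tags, best_rule)
        learned.append(best_rule)
    return learned, tags


words = ["the", "plane", "crashed", "in", "Peru",
         "and", "the", "jet", "crashed", "in", "Chile"]
gold = ["O", "VEHICLE", "O", "O", "LOCATION",
        "O", "O", "VEHICLE", "O", "O", "LOCATION"]
initial = ["O"] * len(words)  # every word starts with the most common tag

rules, final_tags = learn_rules(words, initial, gold)
print(rules)
```

On this toy input the learner finds two rules, one tagging the word after “in” as a location and one tagging the word after “the” as a vehicle, after which no further rule reduces the error count.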
TBL Rules Subject Capture Rule Templates (no parser to find subjects)
TBL Rules Chaining Rule Templates (no parser to find noun-phrase boundaries)
The Experiment. English and French AP news stories about plane crashes. The texts in the two languages come from different years. English (420,000 words) French (150,000 words) Three university students were hired to mark location, vehicle and victim.
Now that we’re all in Agreement Annotator Agreement –Exact-word match –Exact-NP match –“Boeing 727” vs. “new Boeing 727” –16-31% for French, 24-27% for English –43-54% for French, 51-59% for English –(one French annotator marked 4.5 times as many locations as another)
TBL-Based IE Projection ~Equal to F-measure of monolingual English, ~9% lower than F-measure of monolingual French
Two’s Company… Two-step processes are better than three-step processes. –Mean F-measure drop was from 32.6 to 28.8 A collection of domain-specific English texts can be used to project and induce new IE systems, even without domain-specific texts in the foreign language
Strength in Numbers Used a voting system with the different techniques. Able to achieve F-measure of 48%, 3% higher than the strongest individual system. Only 4% lower than the French monolingual system.
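One simple way to combine several taggers is per-token voting, sketched below. The slides do not say how the votes were combined, so the strict-majority rule, the fallback to the outside tag, and the system outputs shown are all assumptions made for illustration.

```python
from collections import Counter


def vote(predictions, outside="O"):
    """Combine per-token tags from several systems by strict majority.

    predictions: list of tag sequences, one per system, all the same length.
    A token keeps a tag only if more than half of the systems agree on it;
    otherwise it falls back to `outside` (an assumed policy).
    """
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all systems must tag the same tokens")
    combined = []
    for token_tags in zip(*predictions):
        tag, count = Counter(token_tags).most_common(1)[0]
        combined.append(tag if count > len(predictions) / 2 else outside)
    return combined


# Hypothetical outputs of three systems on a four-token sentence.
system_a = ["VEHICLE", "O", "LOCATION", "O"]
system_b = ["VEHICLE", "O", "O", "VICTIM"]
system_c = ["O", "O", "LOCATION", "O"]

combined = vote([system_a, system_b, system_c])
print(combined)
```

Here the lone VICTIM prediction is outvoted, while VEHICLE and LOCATION survive because two of the three systems agree on them.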
Conclusion Used IE systems for English to automatically derive IE systems for a second language Very little human effort French and English are relatively close though… Improve performance by improving English performance, or minor specializations for the language.