Parsing in Multiple Languages


1 Parsing in Multiple Languages
Dan Bikel, University of Pennsylvania, August 22nd, 2003

2 Problems with many parser designs
Many parsing models are largely language-independent, but their implementations do not have a layer of abstraction for language-specific information
Most generative parsers do not have an easy method for experimenting with different parameterizations and back-off levels
Most parsers need a large database of smoothed probability estimates, but this database is often tightly coupled with the decoder
Parallel and/or distributed computing cannot easily be exploited
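As a rough illustration of the kind of abstraction these problems call for, here is a hypothetical Java sketch in which each conditional probability in the model declares its own chain of back-off contexts, so parameterizations can be changed without touching the decoder. All names and signatures here are invented for illustration; the engine's actual classes may differ.

```java
import java.util.List;

// Hypothetical sketch: a model component that declares its own back-off chain,
// decoupling parameterization choices from the decoder and the estimate store.
interface ProbabilityStructure {
    /** Number of back-off levels, ordered from most to least specific context. */
    int numBackoffLevels();

    /** The conditioning context (history) used at the given back-off level. */
    List<String> history(ParseEvent event, int backoffLevel);

    /** The predicted outcome (future) whose probability is being estimated. */
    String future(ParseEvent event);
}

/** Placeholder for whatever event objects the trainer and decoder exchange. */
interface ParseEvent { }
```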

3 Parsing Architecture for Speed and Flexibility
[Architecture diagram: a Switchboard (object server) coordinates DecoderServer 1 … DecoderServer N, each with its own ModelCollection, and CKY Client 1 … CKY Client N; language-specific information is isolated in a Language package.]

4 Architecture for Parsing II
Highly parallel, multi-threaded
Can take advantage of, e.g., a clustered computing environment
Fully fault-tolerant
Significant flexibility: layers of abstraction
Optimized for speed
Highly portable to new domains, including new languages
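To make the Switchboard/DecoderServer/CKY-client split from the previous slide concrete, here is a hedged Java RMI-style sketch. The interface names, methods, and fault-tolerance behavior shown are assumptions made for this illustration, not the engine's real remote interfaces.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;

// Hypothetical remote interfaces for the distributed parsing architecture.
public interface DecoderServer extends Remote {
    /** Serves smoothed probability estimates to CKY clients on demand,
        keeping the large estimate database out of the decoder itself. */
    double[] estimates(List<String> events) throws RemoteException;
}

interface Switchboard extends Remote {
    /** A CKY client asks the switchboard for the next sentence to parse. */
    String nextSentence(String clientId) throws RemoteException;

    /** The client returns a finished parse; if a client dies, the switchboard
        can reassign its unfinished sentence to another client (fault tolerance). */
    void deliverParse(String clientId, String parse) throws RemoteException;
}
```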

5 Layer of Abstraction: Language Package
Four easy-to-build components:
Treebank: encapsulates linguistic and treebank-specific information, e.g., isConjunction(tag), isConjunction(word), isNP(nonterminal), isSentence(nonterminal), isVerb(preterminal subtree)
Training: preprocessing, tree augmentations
Head finder: reads in a small head-rules file (35 lines for English)
Word features
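A minimal, hypothetical Java sketch of the Treebank and head-finder components follows. The predicate names mirror the slide's examples; everything else (signatures, helper types) is assumed for illustration and is not the engine's real API.

```java
import java.util.List;

// Hypothetical sketch of two of the four language-package components.
public interface Treebank {
    boolean isConjunction(String tag);      // coordinating-conjunction tag?
    boolean isConjunction(Word word);       // conjunction word form?
    boolean isNP(String nonterminal);
    boolean isSentence(String nonterminal);
    boolean isVerb(Tree preterminalSubtree);
}

interface HeadFinder {
    /** Index of the head child of a local tree (parent plus its children),
        determined by rules read from a small, per-language head-rules file. */
    int headChildIndex(String parent, List<String> childLabels);
}

/** Minimal placeholder types so the sketch is self-contained. */
interface Word { String form(); String tag(); }
interface Tree { String label(); List<Tree> children(); }
```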

6 Tree augmentation
Parsers operate over an augmented tree space, T+
Examples of tree augmentations:
Head lexicalization: NP → NP(thing,NN)
Argument identification: NP → NP-A
Chiang & Bikel (2002) provided treep:
A new, portable syntax for augmenting tree nodes
A method for re-estimating parser models in the augmented space such that P(S,T) is maximized
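As a concrete illustration of the two augmentations named above, here is a small Java sketch. It uses simple string relabelings and is illustrative only; treep and the parsing engine use their own tree representations and syntax.

```java
// Illustrative stand-ins for head lexicalization and argument identification.
final class Augmentations {
    /** Head lexicalization, NP -> NP(thing,NN): attach the head word and its tag. */
    static String lexicalize(String label, String headWord, String headTag) {
        return label + "(" + headWord + "," + headTag + ")";
    }

    /** Argument identification, NP -> NP-A: mark nodes filling argument slots. */
    static String markArgument(String label) {
        return label.endsWith("-A") ? label : label + "-A";
    }
}
```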

7 Rapid Portability to New Languages and New Data Sets
Bikel & Chiang (2000) described porting two parsing models developed for English to Chinese:
BBN model: LR 69.0, LP 74.8 (≤ 40 words)
Chiang model: LR 76.8, LP 77.8 (≤ 40 words)
Original design goal for the parsing engine: develop new language packages in 1–2 weeks
Developed the Chinese language package for the new engine in one and a half days
Results with the new parsing engine using CTB (100k words): LR 77.0, LP 81.6 (≤ 40 words, hand-segmented)
Latest results with the new engine using CTB 3.0 (250k words): LR 78.8, LP 82.4 (≤ 40 words, hand-segmented)

8 Tools in Use
English parser:
Gets state-of-the-art results
Training data: Penn Treebank, §§02–21, ~1 million words
Results: LR 89.90, LP on §00 (≤ 40 words)
Fast: trains in minutes; can use a cluster to parse an entire WSJ section in about 5 minutes
Chinese parser:
Dan Jurafsky and one of his grad students, Honglin Sun, at the University of Colorado (predicate-argument labeling for IE)
Kevin Knight and Philipp Koehn at ISI (MT)
In-house users:
Dan Gildea (MT)
Yuan Ding (MT)
Chinese Treebank annotators (bootstrapping treebank annotation)

9 Tools in Use II
Arabic parser:
Once representation/format issues were ironed out, porting to Arabic also proceeded rapidly
Encouraging preliminary results
Training data: 150k words
LR 75.6, LP 77.4 (≤ 40 words, mapped gold-standard tags)
Currently being used by Rebecca Hwa at the University of Maryland (MT)
We believe there is much room for improvement after analysis of the initial results, but performance is already more than good enough for bootstrapping treebank annotation

10 Tools in Use III
Classical Portuguese parser:
Rapid development of the language package (with the help of Tony Kroch)
Successfully used for bootstrapping a treebank for the Statistical Physics, Pattern Recognition and Language Change project at the State University of Campinas, São Paulo, Brazil

11 Obtaining Tools and Data
For my parsing engine (English, Arabic, Chinese, and soon Korean), please contact me
For Penn's Chinese segmenter, contact Bert Xue
Data: Treebanks and PropBanks

12 Future Work
Develop a Java version of David Chiang's and my tree augmentation program, treep, and incorporate it into the parsing engine
Develop user-level documentation for developing a new language package (in progress)
Provide a layer of abstraction for language-specific lexical resources, such as a semantic hierarchy (in progress)
Explore language-independent parsing-model changes, to find those that yield a positive effect across all languages
Employ the layers of abstraction for exploring the parameter space
Perform "regression tests" across treebanks/languages

13 fin

