NSF Partnership for Research and Education: Meaning Representation for Statistical Language Processing

TectoMT


1 TectoMT
TectoMT = highly modular software framework for integrating NLP components into larger NLP systems, aimed at (but not limited to) MT using tectogrammatics.
Other goals:
- to create a system for testing the true usefulness of various NLP tools within a real-life application
- to exploit the abstraction power of tectogrammatics
- to supply data and technology for other projects

2 Design decisions
- implementation in Perl under Linux; non-Perl tools are integrated via Perl wrappers
- focus on modularity: a simple, uniform API for all included modules
- maximum reuse of the PDT annotation scheme: linguistic layers, tree editor TrEd, XML data formats, tools for distributed processing...
- no requirements on methodology: rule-based and statistical components can be freely combined
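The "simple uniform API" idea can be illustrated with a minimal sketch. This is not TectoMT's actual interface (TectoMT is written in Perl); the names `Module`, `Tokenizer`, `Lowercaser`, and `run_pipeline` are hypothetical, and the two toy components only stand in for real analysis steps. The point is that when every module exposes the same method, rule-based and statistical components can be chained in any order.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Module(ABC):
    """Hypothetical uniform interface: every module updates a shared document."""

    @abstractmethod
    def process(self, doc: Dict) -> None:
        ...


class Tokenizer(Module):
    """Toy rule-based component: split the raw sentence on whitespace."""

    def process(self, doc: Dict) -> None:
        doc["tokens"] = doc["text"].split()


class Lowercaser(Module):
    """Toy stand-in for a later component that consumes the tokenizer's output."""

    def process(self, doc: Dict) -> None:
        doc["lowercased"] = [token.lower() for token in doc["tokens"]]


def run_pipeline(modules: List[Module], doc: Dict) -> Dict:
    """Apply the modules in order; each one sees the results of the earlier ones."""
    for module in modules:
        module.process(doc)
    return doc


result = run_pipeline([Tokenizer(), Lowercaser()], {"text": "She has never laughed"})
print(result["lowercased"])
```

Because the pipeline only depends on the shared `process` method, a wrapper around an external tool would plug in the same way as a component written from scratch.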

3 MT pyramid (in terms of PDT)
Key question in MT: optimal level of abstraction?
Our hypothesis: somewhere around tectogrammatics
- high generalization over different language characteristics
- but still computationally (and mentally!) tractable

4 MT pyramid in vivo
English-Czech translation in TectoMT (a sequence of around 80 modules is used):
She has never laughed in her new boss's office.
Nikdy se nesmála v úřadu svého nového šéfa.

5 Alignment of t-trees in TectoMT
- necessary for training t-layer transfer from parallel corpora
- current solution: set of human-designed features weighted by perceptron
Sample sentence pair:
It is extremely important that Iraq held elections to a constitutional assembly.
Je nesmírně důležité, že v Iráku proběhly volby do ústavního shromáždění.
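How a perceptron can weight hand-designed alignment features is sketched below. This is a generic structured-perceptron illustration, not the TectoMT aligner: the feature names `dict_match` and `same_depth`, their values, and the two training examples (built from words of the sample sentence pair) are invented toy assumptions.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]
Candidates = Dict[str, Features]  # Czech candidate node -> feature values


def score(weights: Dict[str, float], feats: Features) -> float:
    """Weighted sum of the feature values for one candidate node pair."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def best_candidate(weights: Dict[str, float], candidates: Candidates) -> str:
    """Pick the Czech node whose features score highest for the English node."""
    return max(candidates, key=lambda node: score(weights, candidates[node]))


def train(examples: List[Tuple[Candidates, str]], epochs: int = 5) -> Dict[str, float]:
    """Perceptron update: on a mistake, reward gold features, penalize predicted ones."""
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for candidates, gold in examples:
            predicted = best_candidate(weights, candidates)
            if predicted != gold:
                for name, value in candidates[gold].items():
                    weights[name] = weights.get(name, 0.0) + value
                for name, value in candidates[predicted].items():
                    weights[name] = weights.get(name, 0.0) - value
    return weights


# Toy training data (invented): candidates for "elections" and for "assembly".
elections = {
    "shromáždění": {"dict_match": 0.0, "same_depth": 1.0},
    "volby": {"dict_match": 1.0, "same_depth": 0.0},
}
assembly = {
    "volby": {"dict_match": 0.0, "same_depth": 1.0},
    "shromáždění": {"dict_match": 1.0, "same_depth": 1.0},
}
weights = train([(elections, "volby"), (assembly, "shromáždění")])
print(best_candidate(weights, elections), best_candidate(weights, assembly))
```

On this toy data the learned weights favor the dictionary feature; with real t-trees the feature set and training data would of course be far richer.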

6 TectoMT: Current State (1)
- around 5 developers
- around 200 modules, especially for Czech and English sentence analysis and synthesis
- many of them are Perl wrappers to previously existing NLP tools: Collins's parser, McDonald's parser, Hajič's morphology analysis and tagger, Brants's TnT tagger...
- intensive usage of existing linguistic data resources: PTB, BNC, PDT, PEDT, CNC...

7 TectoMT: Current State (2)
Applications implemented in TectoMT so far (April 2008):
- English-Czech MT (TectoMT participates in the WMT08 Shared Task)
- preprocessing of t-trees for manual annotations of the Prague Czech-English Dependency Treebank
- interactive Czech analysis in the tree editor TrEd
- English sentence generator in a dialog system
- building of a large parallel treebank from the parallel corpus CzEng

8 TectoMT: Future plans
- the most critical bottleneck: insufficient use of the target language model in tectogrammatical transfer (a log-linear model trained by perceptron will be used soon)
- many modules in the current system are based on simple heuristics; corpus-based alternatives should be sought for most of them
- tools for other languages will be added

