EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento.

1 EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

2 TextPro
A suite of modular NLP tools developed at FBK-irst
• TokenPro: tokenization
• MorphoPro: morphological analysis
• TagPro: Part-of-Speech tagging
• LemmaPro: lemmatization
• EntityPro: Named Entity recognition
• ChunkPro: phrase chunking
• SentencePro: sentence splitting
• Architecture designed to be efficient, scalable and robust
• Cross-platform: Unix / Linux / Windows / MacOS X
• Multi-lingual models
• All modules integrated and accessible through a unified command line interface

3 TagPro’s architecture
To build TagPro we used YamCha, an SVM-based machine learning environment. TagPro can exploit a rich set of linguistic features, such as morphological analysis, prefixes and suffixes.
[Architecture diagram; labels: Training data, Test data, Controller, Feature extraction (ortho, prefix, suffix, dictionary, morpho analysis), Feature selection, dictionary, MorphoPro, YamCha (Learning models, Classification), TagPro]

4 YamCha
• Created as a generic, customizable, open source text chunker
• Can be adapted to many other tag-oriented NLP tasks
• Uses a state-of-the-art machine learning algorithm (SVM)
• Can redefine:
  • context (window-size)
  • parsing direction (forward/backward)
  • algorithm for the multi-class problem (pairwise / one vs rest)
• Practical chunking time (1 or 2 sec. per sentence)
• Available as a C/C++ library

5 Support Vector Machines
Based on the Structural Risk Minimization principle (Vladimir N. Vapnik, 1995). SVMs map input vectors to a higher-dimensional space where a maximal separating hyperplane is constructed. Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data. The separating hyperplane is the one that maximizes the distance between the two parallel hyperplanes.
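The maximal-margin idea on this slide can be written as an optimization problem. The formulation below is the standard hard-margin SVM, given here as an assumption: the slides do not say which variant (hard or soft margin) or which kernel was used. The symbols w, b, \phi, x_i and y_i are introduced only for this illustration.

```latex
% Training pairs (x_i, y_i), with labels y_i in {-1, +1};
% \phi maps inputs to the higher-dimensional space.
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to}\quad & y_i \bigl( w \cdot \phi(x_i) + b \bigr) \ge 1,
  \qquad i = 1, \dots, n .
\end{aligned}
% The two parallel hyperplanes are w \cdot \phi(x) + b = \pm 1.
% Their distance is 2 / \lVert w \rVert, so minimizing \lVert w \rVert
% maximizes the margin around the separating hyperplane
% w \cdot \phi(x) + b = 0.
```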

6 YamCha: Setting Window Size
The default setting is "F:-2..2:0.. T:-2..-1". The window setting can be customized.
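As far as the YamCha documentation describes it, an "F" item selects static feature columns over a range of row offsets around the current token, and a "T" item selects previously assigned tags. The sketch below only parses such a string into ranges to make the default setting easier to read. The helper names `parse_range`, `parse_window` and `row_offsets` are hypothetical and are not part of YamCha.

```python
def parse_range(text):
    """Parse 'a..b', the open-ended 'a..', or a single 'a' into (start, end).

    end is None when the range is open-ended (e.g. '0..' = all columns from 0).
    """
    start, sep, end = text.partition("..")
    if not sep:  # a single number such as '0'
        return int(start), int(start)
    return int(start), (int(end) if end else None)


def parse_window(spec):
    """Split a window spec into static (F) and dynamic (T) ranges.

    Returns (static, dynamic):
      static  - list of (row_range, column_range) pairs
      dynamic - list of row ranges for previously assigned tags
    """
    static, dynamic = [], []
    for item in spec.split():
        kind, *fields = item.split(":")
        if kind == "F":
            rows, cols = (parse_range(f) for f in fields)
            static.append((rows, cols))
        elif kind == "T":
            dynamic.append(parse_range(fields[0]))
        else:
            raise ValueError(f"unknown window item: {item!r}")
    return static, dynamic


def row_offsets(row_range):
    """Expand a closed row range into the list of token offsets it covers."""
    start, end = row_range
    return list(range(start, end + 1))


static, dynamic = parse_window("F:-2..2:0.. T:-2..-1")
print(static)   # static feature ranges
print(dynamic)  # dynamic (tag) ranges
```

Read this way, the default uses all feature columns of the two tokens before and after the current one, plus the tags already assigned to the two preceding tokens.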

7 Training and Tuning Set
The Evalita development set was randomly split into 2 parts:
• Training: 89,170 tokens
• Tuning: 44,586 tokens

8 FEATURES
For each running word a rich set of features is extracted:
• WORD: the word itself (both unchanged and lower-cased)
  e.g. Autore → autore
• MORPHO: the morphological analysis (produced by MorphoPro)
  e.g. Autore → autore+n+m+sing
  Calcio → calcio calcio+n+m+sing calciare+v+indic+pres+nil+1+sing
• AFFIX: prefixes/suffixes (2, 3, 4 or 5 chars. at the start/end of the word)
  e.g. libro → {li, lib, libr, libro, ro, bro, ibro, libro}
• ORTHOgraphic information (e.g. capitalization, hyphenation)
  e.g. Oggi → C (capitalized), oggi → L (lowercased)
• GAZETTeers of proper nouns (154,000 proper names, 12,000 cities, 5,000 organizations and 3,200 locations)
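The WORD, AFFIX and ORTHO features above can be sketched in a few lines. This is a minimal illustration, not TagPro's code: the function names are hypothetical, the `__nil__` padding for short words follows the feature-extraction example elsewhere in the slides, and only the two orthographic classes shown on this slide (C and L) are modelled. MORPHO and GAZETT features are left out because they need MorphoPro and the gazetteers.

```python
NIL = "__nil__"  # placeholder used when a word is shorter than the affix length


def affixes(word, min_len=2, max_len=5):
    """Return (prefixes, suffixes) of length 2..5 taken from the lower-cased word."""
    lower = word.lower()
    lengths = range(min_len, max_len + 1)
    prefixes = [lower[:n] if len(lower) >= n else NIL for n in lengths]
    suffixes = [lower[-n:] if len(lower) >= n else NIL for n in lengths]
    return prefixes, suffixes


def ortho(word):
    """Orthographic class: 'C' for a capitalized word, 'L' otherwise.

    Assumption: only C and L appear in the slides; other classes
    (e.g. for hyphenation) are not modelled in this sketch.
    """
    return "C" if word[:1].isupper() else "L"


def static_features(word):
    """WORD (unchanged and lower-cased), AFFIX and ORTHO features for one token."""
    prefixes, suffixes = affixes(word)
    return [word, word.lower(), *prefixes, *suffixes, ortho(word)]


print(static_features("libro"))
print(static_features("Bettino"))
```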

9 Static vs Dynamic Features
• STATIC FEATURES
  • extracted for the current, previous and following word
  • WORD, MORPHO, AFFIXes, ORTHO, GAZET
• DYNAMIC FEATURES
  • decided dynamically during tagging
  • tags of the two tokens preceding the current token
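The difference between the two feature kinds shows up in the tagging loop: static features can be computed before tagging starts, while dynamic features are the tagger's own earlier decisions. The sketch below assumes forward (left-to-right) tagging. `tag_sentence` and `toy_classify` are hypothetical names, and the toy rule set merely stands in for the trained SVM classifier.

```python
NIL = "__nil__"  # padding for positions outside the sentence


def tag_sentence(tokens, classify, history=2):
    """Tag tokens left to right, feeding the last `history` predicted tags back in."""
    tags = []
    for i, token in enumerate(tokens):
        # Static features: known before tagging (here just the surrounding words).
        static = {
            "prev": tokens[i - 1] if i > 0 else NIL,
            "cur": token,
            "next": tokens[i + 1] if i + 1 < len(tokens) else NIL,
        }
        # Dynamic features: tags already assigned to the preceding tokens,
        # oldest first, padded at the start of the sentence.
        dynamic = [tags[i - k] if i - k >= 0 else NIL for k in range(history, 0, -1)]
        tags.append(classify(static, dynamic))
    return tags


def toy_classify(static, dynamic):
    """Placeholder rules standing in for the SVM; not a real model."""
    if static["cur"] == "l'":
        return "ART"
    if dynamic[-1] == "ART":  # the tag just assigned to the previous token
        return "ADJ"
    return "NN"


print(tag_sentence(["l'", "ex", "leader"], toy_classify))
```

Because each decision depends on earlier ones, a tagging error can propagate to the following tokens, which is one reason the parsing direction is a tunable setting.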

10 An Example of Feature Extraction
Input (token and tag): l' ART, ex ADJ, leader NN, socialista ADJ, Bettino NN_P, Craxi NN_P
Extracted features, one row per token (the last column is the tag):
l' l' l' __nil__ __nil__ __nil__ l' __nil__ __nil__ __nil__ L A N N N N N N N N N N N Y N N N N N N N N Y N O O O O ART
ex ex ex __nil__ __nil__ __nil__ ex __nil__ __nil__ __nil__ L N N N N N N N N N N Y 2 N N N Y N N N N N N N O O O O ADJ
leader leader le lea lead leade er der ader eader L N N N N N N N N N N Y N N Y 0 N N N N N N N N O O O O NN
socialista socialista so soc soci socia ta sta ista lista L N N N N N N N N N N Y 2 N Y 0 N N N N N N N N O O O O ADJ
Bettino bettino be bet bett betti no ino tino ttino C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-NAM NN_P
Craxi craxi cr cra crax craxi xi axi raxi craxi C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-SUR NN_P

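11 Finding the best features
Table (EAGLES TagSet); columns: Accuracy, UTAccuracy. The numeric values are missing from the transcript except for the fragment "+10,".
Rows: baseline; AFFIX +ORTHO; AFFIX +ORTHO +MORPHO; AFFIX +ORTHO +MORPHO +GAZETT
Baseline: WORD (both unchanged and lower-cased), window-size: +1,-1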

12 Finding the best window-size
Given the best set of features (F1=97.42) we tried to improve Accuracy by changing the window-size.
Table (EAGLES TagSet); columns: STAT, DYN, Accuracy. The row values are missing from the transcript except for the fragment "+1,".

13 Multi-class problem: pairwise / one vs rest
• one vs rest: fewer, bigger classifiers
• pairwise:
  • a classifier for each possible pair of classes
  • choose the classifier with best confidence
  • many relatively small classifiers
  • faster, less memory

EAGLES TagSet method    Accuracy
pairwise                97.65
one vs rest             97.78
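The "fewer, bigger" versus "many, small" contrast follows from simple counting: for K tags, one vs rest trains K classifiers on all the data, while pairwise trains K(K-1)/2 classifiers on two classes each. The sketch below shows the counts and one common way to combine pairwise decisions, majority voting. The voting rule is an assumption made for illustration (the slide mentions picking by confidence), the names are hypothetical, and 32 is just an example tag count.

```python
from collections import Counter
from itertools import combinations


def classifier_counts(num_classes):
    """Number of binary classifiers each multi-class strategy needs."""
    return {
        "one_vs_rest": num_classes,
        "pairwise": num_classes * (num_classes - 1) // 2,
    }


def pairwise_predict(classes, duel):
    """Combine pairwise decisions by majority vote.

    duel(a, b) is a stand-in for the binary classifier trained on classes
    a and b; it returns whichever of the two it prefers for the current token.
    """
    votes = Counter(duel(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


def toy_duel(a, b):
    """Placeholder binary decision: prefer 'NN', else the alphabetically first."""
    if "NN" in (a, b):
        return "NN"
    return min(a, b)


print(classifier_counts(32))
print(pairwise_predict(["ADJ", "NN", "ART"], toy_duel))
```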

14 Evaluating the best algorithm: PKI vs. PKE
YamCha uses two implementations of SVMs, PKI and PKE. Both are faster than the original SVMs.
• PKI (3-12 x faster) produces the same accuracy as the original SVMs.
• PKE ( x) approximates the original SVM: slightly less accurate but much faster.

EAGLES TagSet    Accuracy
PKI              97.78
PKE              97.64

15 Results on the development set
Table; columns: EAGLES, DISTRIB. The accuracy values are missing from the transcript.
• Accuracy
• Known Words: 40,320 (MorphoPro coverage: 96.20%): Accuracy
• Unknown Words: 4,396 (MorphoPro coverage: 84.41%): Accuracy

16 Test Results
Table; columns: TagSet, Accuracy, UTAccuracy; rows: EAGLES, DISTRIB. The numeric values are missing from the transcript.

17 Conclusions
• A statistical approach to PoS tagging for Italian based on YamCha / SVMs.
• Results confirm that SVMs can deal with a large number of features without overfitting.
• We used the same best configuration for both tagsets.
• No specific method was applied for classifying unknown words.
• Features:
  • AFFIX+ORTHO: over baseline
  • MORPHO: 2.13 improvement over AFFIX+ORTHO
  • GAZETteers do not contribute any further significant improvement
• Features for unknown words:
  • AFFIX+ORTHO:
  • MORPHO: ++7,62
• No benefit from a larger context (e.g. window-size +2,-2 and more)

18 TagPro
• TagPro is a system for PoS tagging based on YamCha.
• YamCha (Yet Another Multipurpose Chunk Annotator, by Taku Kudo) is a generic, customizable, open source text chunker. It is based on Support Vector Machines (SVMs).
• TagPro exploits a rich set of linguistic features, such as morphological analysis, prefixes and suffixes.
• The system is part of TextPro, a suite of NLP tools developed at FBK-irst.

19 Confusion matrix
Row and column labels: PRON_PER, V_AVERE, V_PP, ADV, ART, ADJ, NN, NN_P, P_OTH, PREP_A, P_EOS, V_GVRB, PREP, CONJ_S, ADJ_DIM, V_MOD, CONJ_C, V_ESSERE, PRON_REL, PRON_DIM, PRON_IND, ADJ_IND, PRON_IES, V_CLIT, ADJ_NUM, C_NUM, ADJ_POS, ADJ_IES, PRON_POS, NULL, P_APO, INT
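A confusion matrix over these tags can be tallied directly from aligned gold and predicted tag sequences. The sketch below is a generic illustration with made-up toy data, not the evaluation script used for EVALITA; the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(gold, predicted):
    """Count (gold_tag, predicted_tag) pairs over two aligned tag sequences."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    matrix = confusion_matrix(gold, predicted)
    correct = sum(n for (g, p), n in matrix.items() if g == p)
    return correct / len(gold)


# Toy data for illustration only.
gold = ["ART", "ADJ", "NN", "NN_P"]
pred = ["ART", "NN", "NN", "NN_P"]

matrix = confusion_matrix(gold, pred)
for (g, p), n in sorted(matrix.items()):
    print(f"gold={g:5s} predicted={p:5s} count={n}")
print(accuracy(gold, pred))
```

Off-diagonal cells (gold tag different from predicted tag) show which tag pairs the tagger confuses most often.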

