EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento.

1 EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

2 TextPro
A suite of modular NLP tools developed at FBK-irst:
- TokenPro: tokenization
- MorphoPro: morphological analysis
- TagPro: Part-of-Speech tagging
- LemmaPro: lemmatization
- EntityPro: Named Entity recognition
- ChunkPro: phrase chunking
- SentencePro: sentence splitting
- Architecture designed to be efficient, scalable and robust
- Cross-platform: Unix / Linux / Windows / Mac OS X
- Multilingual models
- All modules integrated and accessible through a unified command-line interface

3 TagPro's architecture
To build TagPro we used YamCha, an SVM-based machine learning environment. TagPro can exploit a rich set of linguistic features, such as morphological analysis, prefixes and suffixes.
[Architecture diagram. Box labels: Controller; Feature selection; Feature extraction (ortho, prefix, suffix, dictionary, morpho analysis); dictionary; Learning; models; Classification; YamCha; Training data; Test data; TagPro; MorphoPro]

4 YamCha
- Created as a generic, customizable, open-source text chunker
- Can be adapted to many other tag-oriented NLP tasks
- Uses a state-of-the-art machine learning algorithm (SVM)
- Can redefine:
  - context (window size)
  - parsing direction (forward/backward)
  - algorithm for the multi-class problem (pairwise / one vs rest)
- Practical chunking time (1 or 2 sec. per sentence)
- Available as a C/C++ library

5 Support Vector Machines
- Based on the Structural Risk Minimization principle (Vladimir N. Vapnik, 1995).
- SVMs map input vectors to a higher-dimensional space where a maximal separating hyperplane is constructed.
- Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data.
- The separating hyperplane is the one that maximizes the distance between the two parallel hyperplanes.
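The geometric description of the margin can be written out for the standard hard-margin, linearly separable case. This is the textbook formulation, assumed here for illustration; the slide does not state which variant (for example soft margin or a specific kernel) TagPro actually uses. The symbols w, b, x_i and y_i in {-1, +1} are the usual weight vector, bias, training vectors and class labels.

```latex
\begin{aligned}
&\text{separating hyperplane: } \mathbf{w}\cdot\mathbf{x} + b = 0,\\
&\text{parallel hyperplanes: } \mathbf{w}\cdot\mathbf{x} + b = +1
  \quad\text{and}\quad \mathbf{w}\cdot\mathbf{x} + b = -1,\\
&\text{distance between them: } \frac{2}{\lVert \mathbf{w} \rVert},\\
&\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
  \quad\text{subject to}\quad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,
  \qquad i = 1,\dots,n.
\end{aligned}
```

Maximizing the distance 2/||w|| is equivalent to minimizing ||w||^2 / 2, which is why the optimization is usually stated in the second form.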

6 YamCha: Setting Window Size
The default setting is "F:-2..2:0.. T:-2..-1". The window setting can be customized.
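To make the default template concrete, the sketch below expands it into explicit positions. It assumes the usual reading of YamCha's template syntax, in which F:rows:columns selects static feature columns at token positions relative to the current token and T:rows selects previously assigned tags. The helper names parse_range and expand_template and the column count of 3 are illustrative and are not part of YamCha.

```python
def parse_range(spec, open_end=None):
    """Expand 'a..b', 'a..' (open-ended) or 'a' into a list of ints."""
    if ".." not in spec:
        return [int(spec)]
    lo, hi = spec.split("..", 1)
    if hi == "":
        if open_end is None:
            raise ValueError("open-ended range needs an upper bound")
        hi = open_end
    return list(range(int(lo), int(hi) + 1))


def expand_template(template, n_columns):
    """Return (static, dynamic) positions described by a template string.

    static  -> list of (relative_row, column) pairs
    dynamic -> list of relative rows whose already-assigned tag is used
    n_columns is the number of feature columns in the data (an assumption
    needed only to close an open-ended column range such as '0..').
    """
    static, dynamic = [], []
    for part in template.split():
        fields = part.split(":")
        if fields[0] == "F":
            rows = parse_range(fields[1])
            cols = parse_range(fields[2], open_end=n_columns - 1)
            static.extend((r, c) for r in rows for c in cols)
        elif fields[0] == "T":
            dynamic.extend(parse_range(fields[1]))
        else:
            raise ValueError(f"unknown template part: {part!r}")
    return static, dynamic


static, dynamic = expand_template("F:-2..2:0.. T:-2..-1", n_columns=3)
print(len(static), dynamic)
```

Under these assumptions, three feature columns give 15 static positions (five rows times three columns) and two dynamic tag positions, -2 and -1.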

7 Training and Tuning Set
The Evalita development set was randomly split into 2 parts:
- Training: 89,170 tokens
- Tuning: 44,586 tokens

8 FEATURES
For each running word, a rich set of features is extracted:
- WORD: the word itself (both unchanged and lower-cased), e.g. Autore -> autore
- MORPHO: the morphological analysis (produced by MorphoPro), e.g. Autore -> autore+n+m+sing; Calcio -> calcio calcio+n+m+sing calciare+v+indic+pres+nil+1+sing
- AFFIX: prefixes/suffixes (2, 3, 4 or 5 characters at the start/end of the word), e.g. libro -> {li, lib, libr, libro, ro, bro, ibro, libro}
- ORTHO: orthographic information (e.g. capitalization, hyphenation), e.g. Oggi -> C (capitalized), oggi -> L (lowercased)
- GAZETT: gazetteers of proper nouns (154,000 proper names, 12,000 cities, 5,000 organizations and 3,200 locations)
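The AFFIX and ORTHO features are simple enough to sketch directly. The code below is a minimal illustration, not TagPro's implementation: the function names are hypothetical, the "__nil__" padding for words shorter than the affix length and the lower-casing before affix extraction follow the example rows on slide 10, and the "X" fallback for other orthographic shapes is an invented placeholder, since the slides only document the C and L values.

```python
NIL = "__nil__"  # padding value seen in the slide 10 example rows


def affix_features(word, sizes=(2, 3, 4, 5)):
    """Return prefixes then suffixes of the given sizes, lower-cased."""
    w = word.lower()
    prefixes = [w[:n] if len(w) >= n else NIL for n in sizes]
    suffixes = [w[-n:] if len(w) >= n else NIL for n in sizes]
    return prefixes + suffixes


def ortho_feature(word):
    """Return 'C' for capitalized words and 'L' for lower-cased ones."""
    if word[:1].isupper():
        return "C"
    if word.islower():
        return "L"
    return "X"  # hypothetical catch-all, not documented in the slides


print(affix_features("libro"))
print(affix_features("l'"))
print(ortho_feature("Oggi"), ortho_feature("oggi"))
```

For "libro" this yields the same eight affixes as the slide's example; for a two-character token such as "l'" only the length-2 prefix and suffix are filled and the rest are padded.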

9 Static vs Dynamic Features
- STATIC FEATURES
  - extracted for the current, previous and following word
  - WORD, MORPHO, AFFIXes, ORTHO, GAZET
- DYNAMIC FEATURES
  - decided dynamically during tagging
  - tag of the two tokens preceding the current token
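The difference between the two feature kinds shows up in the tagging loop: static features can be computed for the whole sentence in advance, while dynamic features depend on tags the tagger has already produced. The sketch below assumes left-to-right (forward) tagging, uses only the word itself as a static feature, and takes the classifier as a hypothetical callable named classify; none of these names come from TagPro or YamCha.

```python
NIL = "__nil__"  # placeholder for positions outside the sentence


def tag_sequence(tokens, classify, window=1):
    """Greedy left-to-right tagging with static and dynamic features.

    classify(static, dynamic) is a stand-in for the trained model:
    static  = words in the window around the current token,
    dynamic = tags already assigned to the two preceding tokens.
    """
    tags = []
    for i in range(len(tokens)):
        static = [
            tokens[j] if 0 <= j < len(tokens) else NIL
            for j in range(i - window, i + window + 1)
        ]
        dynamic = [tags[j] if j >= 0 else NIL for j in (i - 2, i - 1)]
        tags.append(classify(static, dynamic))
    return tags


# Toy classifier that records its inputs and returns numbered dummy tags.
calls = []

def toy_classify(static, dynamic):
    calls.append((static, dynamic))
    return "T%d" % len(calls)

print(tag_sequence(["l'", "ex", "leader"], toy_classify))
print(calls[2])
```

By the third token the dynamic features hold the tags produced for the first two, which is exactly what cannot be known before tagging starts.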

10 An Example of Feature Extraction
l' ART   ex ADJ   leader NN   socialista ADJ   Bettino NN_P   Craxi NN_P

l' l' l' __nil__ __nil__ __nil__ l' __nil__ __nil__ __nil__ L A N N N N N N N N N N N Y N N N N N N N N Y N O O O O ART
ex ex ex __nil__ __nil__ __nil__ ex __nil__ __nil__ __nil__ L N N N N N N N N N N Y 2 N N N Y N N N N N N N O O O O ADJ
leader leader le lea lead leade er der ader eader L N N N N N N N N N N Y N N Y 0 N N N N N N N N O O O O NN
socialista socialista so soc soci socia ta sta ista lista L N N N N N N N N N N Y 2 N Y 0 N N N N N N N N O O O O ADJ
Bettino bettino be bet bett betti no ino tino ttino C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-NAM NN_P
Craxi craxi cr cra crax craxi xi axi raxi craxi C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-SUR NN_P

11 Finding the best features
EAGLES TagSet                    Accuracy   UT Accuracy
baseline                         86.70      59.95
+AFFIX +ORTHO                    +8.56      +25.56
+AFFIX +ORTHO +MORPHO            +10.69     +33.18
+AFFIX +ORTHO +MORPHO +GAZETT    +10.72     +33.13
Baseline: WORD (both unchanged and lower-cased), window size +1,-1
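The gains in this table are stated relative to the baseline, so the absolute scores follow by simple addition. The short sketch below does that arithmetic with the figures from the slide (reading "+10,69" as 10.69); the variable and function names are illustrative only.

```python
BASELINE = (86.70, 59.95)  # (Accuracy, UT Accuracy) with WORD features only

# Gains over the baseline, as reported on the slide.
GAINS = {
    "+AFFIX +ORTHO": (8.56, 25.56),
    "+AFFIX +ORTHO +MORPHO": (10.69, 33.18),
    "+AFFIX +ORTHO +MORPHO +GAZETT": (10.72, 33.13),
}


def absolute_scores(baseline, gains):
    """Add each reported gain to the baseline, rounded to two decimals."""
    return {
        name: (round(baseline[0] + d_acc, 2), round(baseline[1] + d_ut, 2))
        for name, (d_acc, d_ut) in gains.items()
    }


scores = absolute_scores(BASELINE, GAINS)
for name, (acc, ut) in scores.items():
    print(f"{name:32s} {acc:6.2f} {ut:6.2f}")

# Contribution of MORPHO on top of AFFIX+ORTHO.
morpho_gain = round(GAINS["+AFFIX +ORTHO +MORPHO"][0] - GAINS["+AFFIX +ORTHO"][0], 2)
print("MORPHO over AFFIX+ORTHO:", morpho_gain)
```

The full feature set comes out at 97.42 accuracy, which is the starting point quoted on the window-size slide, and the MORPHO increment of 2.13 matches the figure in the conclusions.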

12 Finding the best window-size
Given the best set of features (accuracy 97.42), we tried to improve accuracy by changing the window size.
EAGLES TagSet
STAT     DYN    Accuracy
+1,-1           97.42
+2,-2    -2     -0.34
+1,-1    -2     +0.23
+1,-1    -3     +0.22

13 Multi-class problem: pairwise / one vs rest
- one vs rest: fewer, bigger classifiers
- pairwise:
  - a classifier for each possible pair of classes
  - choose the classifier with best confidence
  - many relatively small classifiers
  - faster, less memory
EAGLES TagSet
method         Accuracy
pairwise       97.65
one vs rest    97.78
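The "fewer, bigger" versus "many, small" contrast comes down to how many binary problems each strategy creates. The sketch below enumerates them for a small set of tags; the function names are hypothetical, and the tagset size of 32 used at the end is only an illustrative value.

```python
from itertools import combinations


def one_vs_rest_tasks(labels):
    """One binary problem per label: that label against all the others."""
    return [(label, "rest") for label in labels]


def pairwise_tasks(labels):
    """One binary problem per unordered pair of labels."""
    return list(combinations(labels, 2))


labels = ["ART", "ADJ", "NN", "NN_P"]
print(one_vs_rest_tasks(labels))
print(pairwise_tasks(labels))

# For K labels: K one-vs-rest problems, K * (K - 1) / 2 pairwise problems.
k = 32  # illustrative tagset size
print(len(one_vs_rest_tasks(range(k))), len(pairwise_tasks(range(k))))
```

Each one-vs-rest classifier is trained on all the data, while each pairwise classifier sees only the examples of its two classes, which is why the pairwise classifiers are individually smaller even though there are many more of them.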

14 Evaluating the best algorithm: PKI vs. PKE
YamCha provides two implementations of SVMs, PKI and PKE. Both are faster than the original SVMs.
- PKI (3-12x faster) produces the same accuracy as the original SVMs.
- PKE (10-300x faster) approximates the original SVM; it is slightly less accurate but much faster.
EAGLES TagSet    Accuracy
PKI              97.78
PKE              97.64

15 Results on the development set
                                                     EAGLES    DISTRIB
Accuracy                                             97.78     97.52
Known words: 40,320 (MorphoPro coverage: 96.20%)
  Accuracy                                           98.29     97.95
Unknown words: 4,396 (MorphoPro coverage: 84.41%)
  Accuracy                                           93.03     93.56
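The overall accuracy should be the token-weighted average of the known-word and unknown-word accuracies. The sketch below checks this with the figures from the slide; the function name is illustrative, and small differences are expected because the per-group accuracies are themselves rounded to two decimals.

```python
def overall_accuracy(groups):
    """Token-weighted mean of per-group accuracies.

    groups is a list of (token_count, accuracy_percent) pairs.
    """
    total = sum(count for count, _ in groups)
    return sum(count * acc for count, acc in groups) / total


KNOWN, UNKNOWN = 40320, 4396  # token counts from the slide

eagles = overall_accuracy([(KNOWN, 98.29), (UNKNOWN, 93.03)])
distrib = overall_accuracy([(KNOWN, 97.95), (UNKNOWN, 93.56)])

print(f"EAGLES:  {eagles:.2f} (reported 97.78)")
print(f"DISTRIB: {distrib:.2f} (reported 97.52)")
```

Both recomputed values land within about 0.01 of the reported overall accuracies, so the three rows of the table are mutually consistent.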

16 Test Results
TagSet     Accuracy   UT Accuracy
EAGLES     98.04      95.02
DISTRIB    97.68      94.65

17 Conclusions
- A statistical approach to PoS tagging for Italian based on YamCha / SVMs.
- Results confirm that SVMs can handle a large number of features without overfitting.
- We used the same best configuration for both tagsets.
- No specific method was applied for classifying unknown words.
- Features:
  - AFFIX+ORTHO: +8.56 over baseline
  - MORPHO: +2.13 over AFFIX+ORTHO
  - GAZETteers do not contribute any further significant improvement
- Features for unknown words:
  - AFFIX+ORTHO: +25.56
  - MORPHO: +7.62
- No benefit from a larger context (e.g. window size +2,-2 and more)

18 TagPro
- TagPro is a system for PoS tagging based on YamCha.
- YamCha (Yet Another Multipurpose Chunk Annotator, by Taku Kudo) is a generic, customizable, open-source text chunker based on Support Vector Machines (SVMs).
- TagPro exploits a rich set of linguistic features, such as morphological analysis, prefixes and suffixes.
- The system is part of TextPro, a suite of NLP tools developed at FBK-irst.

19 Confusion matrix
[Confusion matrix over 32 tags; rows and columns in the order: PRON_PER, V_AVERE, V_PP, ADV, ART, ADJ, NN, NN_P, P_OTH, PREP_A, P_EOS, V_GVRB, PREP, CONJ_S, ADJ_DIM, V_MOD, CONJ_C, V_ESSERE, PRON_REL, PRON_DIM, PRON_IND, ADJ_IND, PRON_IES, V_CLIT, ADJ_NUM, C_NUM, ADJ_POS, ADJ_IES, PRON_POS, NULL, P_APO, INT]
