A resource and tool for Super Sense Tagging of Italian Texts LREC 2010, Malta – 19-21/05/2010 Giuseppe Attardi* Alessandro Lenci* + Stefano Dei Rossi*


1 A resource and tool for Super Sense Tagging of Italian Texts LREC 2010, Malta – 19-21/05/2010 Giuseppe Attardi* Alessandro Lenci* + Stefano Dei Rossi* Simonetta Montemagni + Giulia Di Pietro* Maria Simi* * Università di Pisa + ILC-CNR, Pisa

2 Summary Why Super Sense tagging Preliminary results Improving an existing resource Building a new resource A new tagger for the task Discussion on the results Future work

3 Semantic tagging Named Entity Recognition (NER): simple ontologies (person, organization, location, …); limited semantic/syntactic coverage; high accuracy. Word Sense Disambiguation: identifying WordNet senses; tens of thousands of specific “word senses”; all open-class words covered, domain-independent; inadequate performance.

4 Super Senses Introduced by Ciaramita and Altun (2006). WordNet super senses: noun and verb synsets mapped to 41 general semantic classes (lexicographic categories); 26 noun categories, 15 verb categories. Example: “Clara Harris [person], one of the guests [person] in the box [artifact], stood up [motion] and demanded [communication] water [substance]”

5 Super Sense Tagging For English (Ciaramita and Altun, 2006): training on SemCor (Senseval-3); discriminative HMM, trained with an averaged perceptron algorithm; average F-score on 41 categories: 77.18. For Italian (Picca, Gliozzo, Ciaramita, 2008): trained on MultiSemCor (Bentivoglio et al.); average F-score on 41 categories: 62.90.

6 Improving MultiSemCor Problems: smaller size (64% of the English corpus); incomplete alignment (sense in English, no sense in Italian); PoS coarseness; word-by-word translation. Strategy: retagging, adding morphology. Results: average F-score 64.95 (same algorithm; 45 categories).

7 Further work Our requirements: integration of an SST tagger in the TANL pipeline; a useful model for annotating realistic Italian texts. Two directions for improvement: a brand new resource for SST; a new algorithm for SST, based on Maximum Entropy.

8 Building the new resource ISST (Italian Syntactic-Semantic Treebank): 305,547 tokens; 81,236 content words annotated at the lexico-semantic level, including ItalWordNet (IWN) senses; ILI* mapping from IWN to WordNet (WN) senses. * Inter-Lingual Index. [Diagram: ISST corpus → IWN → ILI → WordNet → super sense]

9 From Italian senses to English super senses Starting from an Italian sense S_i: 1. If S_i is in the ILI, return the corresponding English sense S_e. 2. If not, look for the first hypernym of S_i that is in the ILI and return the corresponding S_e. In both cases, return the super sense of S_e in WN. [Diagram: a token from the ISST corpus phrase “L’atmosfera di festa” is linked to ItalWordNet synsets, then through the ILI to a WordNet synset and its super sense. Synset / ILI pairs shown: n#24931 / #08770969; n#16564 / – (no ILI record); hypernym n#12484 / #04692559.]
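The two-step lookup on this slide can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the dictionaries `ili`, `hypernyms` and `wn_supersense`, the function name and every identifier in the toy data are hypothetical, and "first hypernym" is read here as the nearest ancestor found by a breadth-first walk.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def map_to_supersense(
    sense: str,
    ili: Dict[str, str],              # IWN synset id -> WN synset id (hypothetical layout)
    hypernyms: Dict[str, List[str]],  # IWN synset id -> direct hypernym ids
    wn_supersense: Dict[str, str],    # WN synset id -> super sense label
) -> Optional[Tuple[str, str]]:
    """Return (super sense, origin), where origin is 'direct' or 'hypernym'.

    Returns None when neither the sense nor any of its ancestors is in the ILI.
    """
    # Step 1: the sense itself has an ILI record.
    if sense in ili and ili[sense] in wn_supersense:
        return wn_supersense[ili[sense]], "direct"

    # Step 2: walk up the hypernym hierarchy, nearest ancestors first.
    seen = {sense}
    queue = deque(hypernyms.get(sense, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        if current in ili and ili[current] in wn_supersense:
            return wn_supersense[ili[current]], "hypernym"
        queue.extend(hypernyms.get(current, []))
    return None


# Toy data, invented for illustration only.
ILI = {"iwn:parent": "wn:target"}
HYPERNYMS = {"iwn:child": ["iwn:parent"]}
WN_SUPERSENSE = {"wn:target": "noun.state"}

print(map_to_supersense("iwn:child", ILI, HYPERNYMS, WN_SUPERSENSE))
```

Returning the origin alongside the label keeps the distinction between senses mapped directly and senses mapped through a hypernym, which is the split reported in the mapping statistics.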

10 ISST-SST after mapping

             Tokens with super sense         Tokens with        Tokens without
             direct ILI   ILI from hypernym  ambiguous          super sense
                                             super sense
noun         43,908       1,741              3,492              38,266
verb         10,088       60                 1,351              29,260
adjective    3,219        1,519              118                16,492
adverb       0            0                  0                  13,812
Total        57,215       3,320              4,961              97,830

11 Revision Mapping of adverbs to adv.all (~10,000). Listing of possible super senses: alternatives for nouns, 2-6; alternatives for verbs, 3-10. An ad hoc tool for revision. Difficulties: aspectual verbs, e.g. “continuare a …” (to continue to), “stare per …” (to be about to); support verbs, e.g. “prestare attenzione a …” (to pay attention to), “dar una mano …” (to lend a hand).

12 ISST-SST after revision

             Tokens with super sense   Tokens without super sense
noun         69,360                    11,545
verb         27,667                    7,075
adjective    17,478                    4,649
adverb       12,232                    1,596
Total        126,737                   24,865

13 Super Sense Tagger Adapting a generic chunker, part of the TANL pipeline. Maximum Entropy classifier: effective for chunking since it does not assume independence of features. Dynamic programming to select the tag sequence with the highest probability. The tagger is flexible and customizable for different tasks: specialization of class FeatureExtractor.
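The dynamic programming step can be illustrated with a small Viterbi-style search. This is a generic sketch, not the TANL implementation: it assumes the classifier yields a log-probability for each tag at each position (`emit`) and that tag-to-tag compatibility is given as log-scores (`trans`); the `O`/`B`/`I` tag names and all numbers are invented for the example.

```python
import math
from typing import Dict, List


def best_tag_sequence(
    emit: List[Dict[str, float]],        # per-token log-probability of each tag
    trans: Dict[str, Dict[str, float]],  # trans[prev][cur]: log-score of the transition
) -> List[str]:
    """Return the tag sequence with the highest total log-score."""
    if not emit:
        return []
    tags = list(emit[0])
    best = [{t: emit[0][t] for t in tags}]  # best[i][t]: best score of a path ending in t
    back: List[Dict[str, str]] = []         # back-pointers for positions 1..n-1
    for i in range(1, len(emit)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + trans[p][cur])
            scores[cur] = best[-1][prev] + trans[prev][cur] + emit[i][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


TAGS = ["O", "B", "I"]
# All transitions are neutral except O -> I, which is forbidden.
TRANS = {p: {c: 0.0 for c in TAGS} for p in TAGS}
TRANS["O"]["I"] = -math.inf
EMIT = [
    {"O": math.log(0.8), "B": math.log(0.1), "I": math.log(0.1)},
    {"O": math.log(0.1), "B": math.log(0.4), "I": math.log(0.5)},
]
print(best_tag_sequence(EMIT, TRANS))
```

In the toy input the second token locally prefers `I`, but the search selects `B` because an `I` cannot follow an `O`; choosing whole sequences rather than per-token maxima is what rules out such inconsistent outputs.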

14 Features No external resources, no first-sense heuristics. Local features: token attribute features (POSTAG at positions -2, -1, 0, 1, 2; CPOSTAG at positions -1, 0); form features (FORM matching ^\p{Lu} at positions -1, +1; FORM matching ^\p{Lu}*$ at position 0). Global features: whether a word in the document was previously annotated with a given tag.
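A minimal sketch of the local features listed on this slide, assuming each token is a dictionary with `FORM`, `POSTAG` and `CPOSTAG` keys. The function name, the feature-string format and the tag values in the example are hypothetical; the real tagger does this through its FeatureExtractor class, whose interface is not shown in the slides. Python's `re` module does not support `\p{Lu}`, so the Unicode category is checked through `unicodedata`. The global document-level feature is left out.

```python
import unicodedata
from typing import Dict, List, Optional

Token = Dict[str, str]  # assumed keys: FORM, POSTAG, CPOSTAG


def is_lu(ch: str) -> bool:
    """True if the character is in Unicode category Lu (uppercase letter)."""
    return unicodedata.category(ch) == "Lu"


def local_features(sentence: List[Token], i: int) -> List[str]:
    """Local features for the token at index i of the sentence."""

    def at(offset: int) -> Optional[Token]:
        j = i + offset
        return sentence[j] if 0 <= j < len(sentence) else None

    feats: List[str] = []
    # Token attribute features: POSTAG in a +/-2 window, CPOSTAG at -1 and 0.
    for off in (-2, -1, 0, 1, 2):
        tok = at(off)
        if tok is not None:
            feats.append(f"POSTAG[{off}]={tok['POSTAG']}")
    for off in (-1, 0):
        tok = at(off)
        if tok is not None:
            feats.append(f"CPOSTAG[{off}]={tok['CPOSTAG']}")
    # Form feature ^\p{Lu}: previous / next token starts with an uppercase letter.
    for off in (-1, 1):
        tok = at(off)
        if tok is not None and tok["FORM"] and is_lu(tok["FORM"][0]):
            feats.append(f"CAP[{off}]")
    # Form feature ^\p{Lu}*$: current token consists only of uppercase letters.
    if all(is_lu(ch) for ch in sentence[i]["FORM"]):
        feats.append("ALLCAPS[0]")
    return feats


# Example sentence; the tag values are placeholders, not the TANL tagset.
SENT = [
    {"FORM": "Clara", "POSTAG": "SP", "CPOSTAG": "S"},
    {"FORM": "ama", "POSTAG": "V", "CPOSTAG": "V"},
    {"FORM": "ROMA", "POSTAG": "SP", "CPOSTAG": "S"},
]
print(local_features(SENT, 1))
```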

15 Detailed results

16 Results for Italian Improvement for Italian due to: the new corpus; the different algorithm and the tuning of features.

                        Precision   Recall   F1
Italian, Picca et al.   62.26       63.57    62.90
Italian, ours           79.92       78.30    79.10
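As a quick consistency check on the figures above, assuming the F1 column is the harmonic mean of the precision and recall columns (the slides do not say how the scores were averaged across categories), the reported values can be recomputed as follows. The function name is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall, reported F1) as given in the results table.
reported = {
    "Picca et al.": (62.26, 63.57, 62.90),
    "ours": (79.92, 78.30, 79.10),
}
for name, (p, r, f) in reported.items():
    print(f"{name}: recomputed {f1_score(p, r):.2f}, reported {f:.2f}")
```

Under that assumption both recomputed values agree with the reported ones to within 0.01.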

17 Analysis of improvement Improvement due to new corpus  MultiSemCor vs ISST-SST, ME tagger  about +4.5 on the F1 score Improvement due to new algorithm and features  Ciaramita-Altun tagger vs ME tagger, on ISST-SST  about +10 on the F1 score

18 Conclusions Significant improvement in accuracy for super sense tagging. The tagger has been used to annotate the Italian Wikipedia. Examples of queries made possible on the semantic index: Who feels emotions? (the subject of a verb.emotion); What did Edison invent/create/discover …? (Edison as the subject of a verb.creation). Completion of the ISST-SST resource can further improve accuracy.

