MULTEXT-East Version 4: multilingual morphosyntactic specifications for lots of languages Tomaž Erjavec Department of Knowledge Technologies.

Presentation transcript:

MULTEXT-East Version 4: multilingual morphosyntactic specifications for lots of languages
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
Dublin, April 3rd, 2009

Erjavec: MULTEXT-East Version 4, Dublin
Overview of the talk
1. Part-of-speech tagging, tagsets and interoperability
2. MULTEXT(-East) morphosyntactic specifications
3. Languages, formats, transformations
4. An application: JOS resources for Slovene
5. Conclusions

Part-of-speech tagging
- The task of assigning the correct PoS tag to each word in a running text, e.g.
  Under/IN the/DT proposal/NN ,/, Delmed/NNP would/MD issue/VB about/IN 123.5/CD million/CD additional/JJ Delmed/NNP common/JJ shares/NNS to/TO Fresenius/NNP …
- Important HLT infrastructure
- Very useful annotations for linguists
- Some applications:
  - pre-processing step for further analyses: lemmas, syntactic structure, etc.
  - text indexing, e.g. nouns are more useful than verbs
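The word/TAG notation in the example above is easy to process mechanically. The following is a minimal sketch, assuming whitespace-separated tokens with the tag after the last slash; the function name parse_tagged is made up for this illustration.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so a token such as ',/,' works
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sample = "Under/IN the/DT proposal/NN ,/, Delmed/NNP would/MD issue/VB"
pairs = parse_tagged(sample)

# Text indexing use case from the slide: keep only the nouns (NN, NNP, NNS, ...)
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(nouns)
```

Selecting tokens by tag prefix, as in the last step, is the kind of filtering the text indexing application relies on.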

Methods of PoS tagging
- PoS tagging:
  - determine the ambiguity class of a word (saw: NN | VBD)
  - disambiguate to the correct tag in (local) context (I saw/VBD a saw/NN)
- Tagger training:
  - manually annotated corpus: source of probabilities for tags given a (local) context
  - + (lexicon: gives possible tags for each word-form)
- Popular taggers: TnT (HMM tagger), TreeTagger (decision trees), TBL (transformation-based tagging)
- Tagging usefulness as well as accuracy crucially depends on the tagset
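The two steps named above (look up the ambiguity class, then disambiguate in local context) can be sketched with a toy greedy tagger. This is not how TnT, TreeTagger or TBL work internally; the lexicon and the transition scores below are invented for the example, whereas a real tagger would estimate them from a manually annotated corpus.

```python
# Toy lexicon: word-form -> ambiguity class (all tags the word can take)
LEXICON = {"I": ["PRP"], "saw": ["NN", "VBD"], "a": ["DT"]}

# Invented scores for "tag given previous tag"; "<s>" marks the sentence start
TRANSITION = {
    ("<s>", "PRP"): 0.5,
    ("PRP", "VBD"): 0.6,
    ("PRP", "NN"): 0.1,
    ("VBD", "DT"): 0.5,
    ("DT", "NN"): 0.7,
    ("DT", "VBD"): 0.01,
}


def tag(words):
    """Greedy left-to-right tagging using only the previous tag as context."""
    previous = "<s>"
    tagged = []
    for word in words:
        candidates = LEXICON[word]  # step 1: ambiguity class
        # step 2: pick the candidate that scores best after the previous tag
        best = max(candidates, key=lambda t: TRANSITION.get((previous, t), 0.0))
        tagged.append((word, best))
        previous = best
    return tagged


result = tag("I saw a saw".split())
print(" ".join(f"{w}/{t}" for w, t in result))
```

A greedy choice is the simplest possible strategy; HMM taggers instead search for the best tag sequence over the whole sentence.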

English tagsets
- Tagging was first developed for English (Brown, CLAWS, PTB tagsets)
- English is an inflectionally very poor language, so tagsets are small: ~50 different tags
- Tags are typically synthetic, i.e. the tag does not transparently map to features, e.g.:
  to/TO (PoS?)  Delmed/NNP (number?)  shares/NNS (number?)

Tagsets for other languages
- Other languages will often have many more morphosyntactic features associated with a word, so tagsets will be larger
- e.g. Slovene nouns:
  - type: common, proper
  - gender: masculine, feminine, neuter
  - number: singular, dual, plural
  - case: nom., gen., dat., acc., loc., ins.
  - (animacy: yes, no)
  = 104 PoS tags just for nouns
- Russian, Czech, Slovene ~ word-level syntactic tags
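To see why such tagsets grow quickly, the sketch below multiplies out the four obligatory Slovene noun features listed above. The result is only an upper bound on feature combinations; the slide's figure of 104 presumably reflects the MSDs actually valid for the language (including animacy where it applies), so the two numbers need not agree. The name NOUN_FEATURES is made up for this illustration.

```python
from itertools import product

# Feature values as listed on the slide (animacy left out: it is optional)
NOUN_FEATURES = {
    "Type": ["common", "proper"],
    "Gender": ["masculine", "feminine", "neuter"],
    "Number": ["singular", "dual", "plural"],
    "Case": ["nominative", "genitive", "dative",
             "accusative", "locative", "instrumental"],
}

# Every combination of one value per feature
combinations = list(product(*NOUN_FEATURES.values()))
upper_bound = len(combinations)  # 2 * 3 * 3 * 6
print(upper_bound)
```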

PoS tags vs. MSDs
- PoS tags:
  - used in corpora for corpus annotations / tagging
  - typically synthetic
- Morphosyntactic Descriptions (MSDs):
  - used in inflectional lexica for lexical annotations / morphological analysis
  - typically analytic
- Relation of PoS tagsets to MSD tagsets/features:
  - in general: |PoS| < |MSD|
  - but in most MULTEXT-East languages: [PoS] [MSD]

Developing a multilingual morphosyntactic framework
- Interoperability: Tagsets developed for various languages (or even for the same language) have no connection with each other and are often poorly documented
- Best practice: Languages that do not yet have a tagset could benefit from an operational framework in which to model it

So, wouldn't it be nice to have:
- an open, standardised, documented, flexible model for MSD/PoS tagset design,
- that would be instantiated for lots of languages,
- and could be simply applied to any language?

EU standardisation efforts
- EAGLES: Expert Advisory Group for Language Engineering Standards ( )
- MULTEXT: Multilingual Text Tools and Corpora (1995)
- MULTEXT-East: MULTEXT for Central and Eastern European Languages:
  - Version 1: TELRI edition (1998)
  - Version 2: Concede edition (2002)
  - Version 3: TEI edition (2004)
  - Version 4: MondiLex edition (2009?)
- ISO / TC 37 / LMF / isoCat (2008)

MULTEXT-East morphosyntactic resources
- Basic Language Resource Kit:
  1. specifications: define features and MSDs
  2. lexica (~15,000 lemmas): triplets of word-form / lemma / MSD
  3. parallel corpus: MSD and lemma annotated
- Freely available for research

…: aligned and annotated

MULTEXT-East languages

The MULTEXT(-East) morphosyntactic specifications
- They specify that e.g. Ncmsn:
  - corresponds to the feature-structure [Noun, Type=common, Gender=masculine, Number=singular, Case=nominative]
  - is a valid MSD for Slovene
- Specifications consist of:
  - Front matter
  - Common part: common definitions for all languages (features)
  - Language particular parts: particulars for each language (MSD set)
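The mapping from an MSD such as Ncmsn to its feature-structure can be sketched as a positional lookup. The table below is a small illustrative excerpt for nouns only, written to be consistent with the MSDs quoted in this talk (Ncmsn, Ncmsg, Ncmsd); it is not copied from the specifications, and the names NOUN_POSITIONS and decode_msd are hypothetical. A hyphen is treated as "attribute not applicable", as in the Vm-----d example later in the talk.

```python
CATEGORIES = {"N": "Noun"}

# One (attribute, {code: value}) entry per MSD position after the category letter
NOUN_POSITIONS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_msd(msd):
    """Expand a noun MSD into (category, {attribute: value})."""
    category = CATEGORIES[msd[0]]
    features = {}
    for (attribute, values), code in zip(NOUN_POSITIONS, msd[1:]):
        if code == "-":  # position present but attribute not applicable
            continue
        features[attribute] = values[code]
    return category, features


category, features = decode_msd("Ncmsn")
print(category, features)
```

Note that the same letter can mean different things in different positions (n is neuter in position 2 and nominative in position 4), which is why the lookup has to be positional.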

V4 specs draft in HTML

Specifications in Version 4
- Encoded in XML / teiLite (in Version 3: LaTeX)
  - TEI = Text Encoding Initiative Guidelines P4
- Still book-like in form, to make authoring easier
- XSLT into other formats:
  - HTML
  - tabular mapping formats (e.g. MSD to features)
  - XML/TEI feature library
  - (OWL)

The common specifications
- Define categories (parts-of-speech)
- For each category define features, i.e. attributes and their values
- For each attribute-value specify for which languages it is appropriate
- Give positional mapping to MSDs:
  - each attribute is assigned a position
  - each attribute-value is assigned a one-character code
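The positional mapping can also be run in the other direction, from features to an MSD string. The sketch below assumes a noun-only excerpt of the attribute table and the usual convention that unused positions are written as hyphens, with trailing hyphens dropped; NOUN_SPEC and encode_msd are names invented for this example.

```python
# Attributes in positional order, each with its value -> one-character code map
NOUN_SPEC = [
    ("Type", {"common": "c", "proper": "p"}),
    ("Gender", {"masculine": "m", "feminine": "f", "neuter": "n"}),
    ("Number", {"singular": "s", "dual": "d", "plural": "p"}),
    ("Case", {"nominative": "n", "genitive": "g", "dative": "d",
              "accusative": "a", "locative": "l", "instrumental": "i"}),
]


def encode_msd(category_code, spec, features):
    """Build an MSD: category letter plus one code per attribute position."""
    codes = []
    for attribute, values in spec:
        if attribute in features:
            codes.append(values[features[attribute]])
        else:
            codes.append("-")  # attribute not specified for this word
    # Trailing hyphens carry no information, so they are dropped
    return (category_code + "".join(codes)).rstrip("-")


msd = encode_msd("N", NOUN_SPEC, {
    "Type": "common", "Gender": "masculine",
    "Number": "singular", "Case": "nominative",
})
print(msd)
```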

Common table (HTML)

Common table (source XML/teiLite)

Language particular sections
- Recap the feature definitions for the language
- Add combinations, i.e. feature co-occurrence restrictions
- Add lexicon, i.e. list of all valid MSDs for the language
- Possibly localise the features and codes
- Possibly give notes and examples

Combinations

Lexicon

JOS: Jezikoslovno označevanje slovenščine (linguistic annotation of Slovene)

JOS as a bridge to MULTEXT-East Version 4
[Diagram; box labels: FidaPLUS corpus; JOS corpora; MTE V3 slv specifications; JOS (slv) specifications; MTE V4 (slv) specifications; MTE V4 specifications]


JOS specifications
- XML/teiLite + XSLT transforms
- Allow reordering of attribute positions (Vm-----d Vmd)
- i18n / slv+eng:
  - translation: specifications
  - localisation: attributes, values, codes
  - localisation: TEI element names


MSD conversion tables
- Tabular UTF-8 files:
  - MSD-slv to -eng
  - MSD to features
  - Collating sequence
- e.g.
  01N  Somei  Ncmsn
  01N  Somer  Ncmsg
  01N  Somed  Ncmsd

  Ncmsn  Noun  Type=common  Gender=masculine  Number=singular  Case=nominative  Animacy=0
  Ncmsg  Noun  Type=common  Gender=masculine  Number=singular  Case=genitive  Animacy=0
  Ncmsd  Noun  Type=common  Gender=masculine  Number=singular  Case=dative  Animacy=0
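A table in the MSD-to-features layout shown above is simple to load into a lookup structure. This is a minimal sketch that assumes whitespace-separated columns (the real files may well be tab-separated) and uses only the three rows quoted on the slide; parse_feature_line and msd_to_features are names chosen for the example.

```python
def parse_feature_line(line):
    """Parse 'MSD Category Attr=value ...' into (msd, category, features)."""
    msd, category, *pairs = line.split()
    # Each remaining column is one Attribute=value pair
    features = dict(pair.split("=", 1) for pair in pairs)
    return msd, category, features


# The three example rows quoted on the slide
TABLE = """\
Ncmsn Noun Type=common Gender=masculine Number=singular Case=nominative Animacy=0
Ncmsg Noun Type=common Gender=masculine Number=singular Case=genitive Animacy=0
Ncmsd Noun Type=common Gender=masculine Number=singular Case=dative Animacy=0
"""

msd_to_features = {}
for row in TABLE.splitlines():
    msd, category, features = parse_feature_line(row)
    msd_to_features[msd] = (category, features)

print(msd_to_features["Ncmsg"][1]["Case"])
```

Values are kept as strings, so the Animacy=0 column is stored as "0" rather than being interpreted.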

Adding a new language
- XSLT scripts:
  - mtems-split.xsl: make a template for the language particular section of a new language
  - mtems-merge: merge a new language particular section into the common tables
- Maybe shortly to be tested on new Slavic languages in the scope of MondiLex

Critiques
- It's just an exercise in encoding anyway
- Same is different, different is same
- The Procrustean bed of standards
- Policy change: from unification to harmonisation (hippy school)

Conclusions
- Presented work-in-progress on standardisation of multilingual morphosyntactic specifications
- Specifications are a de-facto standard for several languages (Romanian, Slovene, Croatian)
- Could serve as a hub encoding for multilingual applications, e.g. MT
- and as a framework for new languages

Further work
- Finishing MTE V4!
- Distribution: LDC, ELDA
- Relation to ISO-TC37 standards: LMF, isoCAT
- Connecting to GOLD ontology
- Adding new languages:
  - Slavic completion
  - Western European: MULTEXT
  - Japanese: chasen tagset, jpWaC(-L2)
  - Irish?