Prim(j)ena MULTEXT-East standarda i normi TEI u izradi paralelnih korpusa Applikation des MULTEXT-East und der TEI-Normen bei der Erstellung von Parallelkorpora.

Presentation transcript:

Prim(j)ena MULTEXT-East standarda i normi TEI u izradi paralelnih korpusa
Applikation des MULTEXT-East und der TEI-Normen bei der Erstellung von Parallelkorpora
Application of MULTEXT-East and TEI in the compilation of parallel corpora
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana

BKS symposium, April 2007. Tomaž Erjavec, Dept. of Knowledge Technologies, Jožef Stefan Institute

Overview
1. The need for standardisation
2. Corpus encoding in TEI
3. MULTEXT-East morphosyntactic descriptions

Why standards (for digital language resources)?
- public documentation (+ software)
- (semi)automated validation
- application independent
- platform independent
- do not become obsolescent (as fast)
- However:
  – demand time to understand and use them
  – there are (too) many and not all are accepted
  – they are not perfectly tuned to application (overhead)

TEI: the Text Encoding Initiative
- TEI Guidelines are a vocabulary to describe text for scholarly purposes
- They consist of:
  – XML schemas
  – documentation
- P3 (1994), P4 (2002), P5 (0.9, 2007)
- being developed by the TEI Consortium
- large user base, web site, mailing list, tutorials, yearly meetings
- increasingly popular for digital libraries, text-critical editions, …, to a certain extent for corpora

Jp-Sl dictionary
akeru
あける
開ける
V1 trans.
あけます
あけて
あけない
odpreti
穴(あな)をあける narediti luknjo
窓(まど)を開ける odpreti okno
prim. 開く(あく) intr.
4

Example: MULTEXT-East “1984”, Serbian
Prvi deo
Bio je vedar i hladan aprilski dan; na časovnicima je izbijalo trinaest. Vinston Smit, brade zabijene u nedra da izbegne ljuti vetar, hitro zamače u staklenu kapiju stambene zgrade Pobeda, no nedovoljno hitro da bi sprećio jednu spiralu oštre prašine da uđe zajedno s njim.
…

MULTEXT-East
- MULTEXT-East: EU Project ( ) Multilingual Texts and Corpora for Eastern and Central European Languages
- Based on the results of EU MULTEXT (~West)
- To produce a harmonised BLARK for six languages:
  – morphosyntactic specifications (EAGLES / MULTEXT)
  – morphosyntactically annotated parallel corpus
  – inflectional lexica
  – multilingual comparable and speech corpora
  – language processing tools

History of MULTEXT-East resources
- First release 1998 on CD-ROM: already extended with new languages
- Resources available on the Web since 1998
- Second release 2002 (EU CONCEDE): re-encoding in XML/TEI, harmonisation
- Third release 2004: merge of first two releases, further languages
- Fourth release 2007 (?)

The Languages of MULTEXT-East
- Germanic: English
- Romance: Romanian
- Baltic:
  – Latvian
  – Lithuanian
- Finno-Ugric:
  – Estonian
  – Hungarian
- (BalkaNet):
  – Greek
  – Turkish
- Slavic:
  – Russian (East Slavic)
  – Czech (West Slavic)
  – Slovene (South West Slavic)
  – Resian (Slovene dialect)
  – Croatian (South West Slavic): Marko Tadić
  – Serbian (South West Slavic): C. Krstev, D. Vitas
  – Bulgarian (South East Slavic)
- In progress:
  – Macedonian
  – Persian

The MULTEXT morphosyntactic trinity
1. MULTEXT-East morphosyntactic specifications (Croatian, Serbian)
2. MULTEXT-East morphosyntactic lexica (Serbian)
3. MULTEXT-East morphosyntactically annotated "1984" corpus (Serbian)

1. Morphosyntactic specifications
- Based on EAGLES / MULTEXT
- Define PoS, their attributes and values
- The specs are a document containing:
  – introduction
  – common tables
  – language particular sections
- Written in LaTeX → PDF & HTML
- Derived XML/TEI encoding as feature structures
- In Version 4 specifications to be fully in TEI/XML
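To make "PoS, their attributes and values" concrete, here is a minimal sketch of how a positional MSD such as Ncfsn can be unpacked. It assumes only the first four noun attributes (type, gender, number, case) as they appear in Slovene noun tags; the table name NOUN_POSITIONS and the function decode_noun_msd are hypothetical, and the real specifications define more categories, attributes and language-specific values than shown here.

```python
# Hypothetical, partial attribute table for nouns: one entry per MSD position
# after the category letter "N". Only the values needed for tags like Ncfsn.
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {
        "n": "nominative", "g": "genitive", "d": "dative",
        "a": "accusative", "l": "locative", "i": "instrumental",
    }),
]


def decode_noun_msd(msd):
    """Turn a noun MSD string into an attribute-value dictionary."""
    if not msd.startswith("N"):
        raise ValueError("this sketch only handles noun MSDs: " + msd)
    features = {"category": "noun"}
    # Each character after the category letter fills one attribute slot.
    for (attribute, values), code in zip(NOUN_POSITIONS, msd[1:]):
        if code == "-":  # a hyphen marks an attribute that does not apply
            continue
        features[attribute] = values[code]
    return features


print(decode_noun_msd("Ncfsn"))
```

The same lookup-by-position idea extends to other parts of speech by adding one table per category letter.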

Example common table

Example language-specific table

2. The lexica
- Medium size morphosyntactic lexica
- Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.
- ~ all word-forms of cca lemmas
- Lexical entry is composed of three fields:
  – the word-form: the inflected form of the word
  – the lemma: the base-form of the word
  – the morphosyntactic description (MSD)

Example: Slovene lexicon
abeced     abeceda  Ncfdg
abeced     abeceda  Ncfpg
abeceda    =        Ncfsn
abecedah   abeceda  Ncfdl
abecedah   abeceda  Ncfpl
abecedam   abeceda  Ncfpd
abecedama  abeceda  Ncfdd
abecedama  abeceda  Ncfdi
abecedami  abeceda  Ncfpi
abecede    abeceda  Ncfpa
abecede    abeceda  Ncfpn
abecede    abeceda  Ncfsg
abecedi    abeceda  Ncfda
abecedi    abeceda  Ncfdn
…
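The three-field entry format can be read with a few lines of code. The sketch below assumes whitespace-separated fields and treats "=" in the lemma field as "lemma identical to the word-form", which is how the abeceda entry in the example reads; the names LexEntry and parse_lexicon_line are hypothetical.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    word_form: str  # inflected form as it occurs in text
    lemma: str      # base form of the word
    msd: str        # morphosyntactic description


def parse_lexicon_line(line):
    """Parse one 'word-form lemma MSD' line into a LexEntry."""
    word_form, lemma, msd = line.split()
    if lemma == "=":  # shorthand: the lemma is the word-form itself
        lemma = word_form
    return LexEntry(word_form, lemma, msd)


for raw in ["abeced abeceda Ncfdg", "abeceda = Ncfsn"]:
    print(parse_lexicon_line(raw))
```

Because one word-form can carry several MSDs (abecede has three in the example), a lookup structure built from these entries would map each word-form to a list of entries rather than to a single one.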

3. The “1984” corpus
- Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
- Structurally annotated
- Sentence aligned with English
- Words annotated with lemma and MSD
- Encoded in TEI P4 (XML)

Example linguistic encoding
Bil je jasen, mrzel aprilski dan in ure so bile trinajst.
…
Context disambiguated lemmas and MSDs
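The XML markup of this example did not survive in the transcript, so the following is only a sketch of what word-level annotation with lemma and MSD might look like and how it could be read. The element names s, w and c, the attribute names lemma and ana, the id value, and the two annotated tokens are all assumptions made for illustration, not the corpus's actual encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two words from the sentence above, each carrying
# an assumed lemma and MSD, plus one punctuation element.
SAMPLE = """<s id="demo.1">
  <w lemma="dan" ana="Ncmsn">dan</w>
  <w lemma="ura" ana="Ncfpn">ure</w>
  <c>.</c>
</s>"""


def read_tokens(xml_text):
    """Return (word-form, lemma, MSD) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma"), w.get("ana")) for w in root.iter("w")]


for token in read_tokens(SAMPLE):
    print(token)
```

Each triple has the same three fields as a lexicon entry, with the difference that the corpus keeps only the one lemma and MSD that fit the context.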

Utility of MULTEXT-East LRs
- Specifications became, for some, the “national” standard
- Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
- A base dataset for further annotation and experiments:
  – Word-sense disambiguation
  – WordNet development and evaluation
  – Syntactic parser induction
- Teaching aid in HLT courses
- ~ 100 registered users
- As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian, Bosnian?

Corpora using TEI+MULTEXT-East
- Reference corpus of Slovene: FIDA (100Mw), FIDA+ (600Mw) (+ other Sl. corpora)
- Croatian National Corpus: HNK (100Mw)
- Various Romanian corpora, …
- En-Sl parallel annotated corpus: SVEZ-IJS (10Mw)

Conclusions
- TEI provides a rich and flexible infrastructure to encode parallel corpora: meta-data, corpus and document structure, alignment, linguistic analysis
- MULTEXT-East provides a harmonised and common infrastructure for word-level morphosyntactic descriptions
- Both have already been used for a number of corpora
- Maybe also for BKS?

Thank you!