Prim(j)ena MULTEXT-East standarda i normi TEI u izradi paralelnih korpusa Applikation des MULTEXT-East und der TEI-Normen bei der Erstellung von Parallelkorpora.

Application of MULTEXT-East and TEI in the compilation of parallel corpora
Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute, Ljubljana
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/

BKS symposium April 2007 Tomaž Erjavec Dept. of Knowledge Technologies, Jozef Stefan Institute

Overview
1. The need for standardisation
2. Corpus encoding in TEI
3. MULTEXT-East morphosyntactic descriptions

Why standards (for digital language resources)?
– public documentation (+ software)
– (semi)automated validation
– application independent
– platform independent
– do not become obsolete (as fast)
– However:
    – they demand time to understand and use
    – there are (too) many and not all are accepted
    – they are not perfectly tuned to the application (overhead)

TEI: the Text Encoding Initiative
– TEI Guidelines are a vocabulary to describe text for scholarly purposes
– They consist of:
    – XML schemas
    – documentation
– P3 (1994), P4 (2002), P5 (0.9, 2007)
– being developed by the TEI Consortium
– large user base, web site, mailing list, tutorials, yearly meetings
– increasingly popular for digital libraries, text-critical editions, …, and to a certain extent for corpora

Jp-Sl dictionary
akeru あける 開ける
V1 trans.
あけます あけて あけない
odpreti
穴（あな）をあける narediti luknjo
窓（まど）を開ける odpreti okno
prim. 開く（あく） intr.
4

Example: MULTEXT-East “1984”, Serbian
Prvi deo
1.
Bio je vedar i hladan aprilski dan; na časovnicima je izbijalo trinaest.
Vinston Smit, brade zabijene u nedra da izbegne ljuti vetar, hitro zamače u staklenu kapiju stambene zgrade Pobeda, no nedovoljno hitro da bi sprećio jednu spiralu oštre prašine da uđe zajedno s njim.
…

MULTEXT-East
– MULTEXT-East: EU Project (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
– Based on the results of EU MULTEXT (~West)
– To produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
    – morphosyntactic specifications (EAGLES / MULTEXT)
    – morphosyntactically annotated parallel corpus
    – inflectional lexica
    – multilingual comparable and speech corpora
    – language processing tools

History of MULTEXT-East resources
– First release 1998 on CD-ROM: already extended with new languages
– Resources since 1998 available on the Web: http://nl.ijs.si/ME/
– Second release 2002 (EU CONCEDE): re-encoding in XML/TEI, harmonisation
– Third release 2004: merge of first two releases, further languages
– Fourth release 2007 (?)

The Languages of MULTEXT-East
– Germanic: English
– Romance: Romanian
– Baltic:
    – Latvian
    – Lithuanian
– Finno-Ugric:
    – Estonian
    – Hungarian
– (BalkaNet):
    – Greek
    – Turkish
– Slavic:
    – Russian (East Slavic)
    – Czech (West Slavic)
    – Slovene (South West Slavic)
    – Resian (Slovene dialect)
    – Croatian (South West Slavic), Marko Tadić
    – Serbian (South West Slavic), C. Krstev, D. Vitas
    – Bulgarian (South East Slavic)
– In progress:
    – Macedonian
    – Persian

The MULTEXT morphosyntactic trinity
1. MULTEXT-East morphosyntactic specifications (Croatian, Serbian)
2. MULTEXT-East morphosyntactic lexica (Serbian)
3. MULTEXT-East morphosyntactically annotated "1984" corpus (Serbian)

1. Morphosyntactic specifications
– Based on EAGLES / MULTEXT
– Define parts of speech (PoS), their attributes and values
– The specs are a document containing:
    – introduction
    – common tables
    – language-particular sections
– Written in LaTeX → PDF & HTML
– Derived XML/TEI encoding as feature structures
– In Version 4 the specifications are to be fully in TEI/XML
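The specifications assign each part of speech a fixed list of attributes with coded values, and a morphosyntactic description (MSD) such as Ncfsn packs those codes into one positional string. A minimal Python sketch of how such a table could be used to expand an MSD follows. It assumes one-letter positional codes with a hyphen for a non-applicable attribute; the table is a small hand-written excerpt for nouns only, made for this illustration, and `decode_msd` is a hypothetical helper, not part of any MULTEXT-East tool.

```python
# Illustrative excerpt only (assumption): the noun category with four
# attributes. The authoritative tables are in the specifications themselves.
NOUN_SPEC = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]

# Maps the first MSD character to a category name and its attribute table.
CATEGORIES = {"N": ("Noun", NOUN_SPEC)}


def decode_msd(msd):
    """Expand a positional MSD string into attribute-value pairs."""
    if not msd or msd[0] not in CATEGORIES:
        raise ValueError(f"unsupported MSD: {msd!r}")
    name, spec = CATEGORIES[msd[0]]
    features = {"PoS": name}
    # Pair each code after the category letter with its attribute slot.
    for code, (attribute, values) in zip(msd[1:], spec):
        if code == "-":  # hyphen: attribute not applicable (assumed convention)
            continue
        if code not in values:
            raise ValueError(f"unknown {attribute} code {code!r} in {msd!r}")
        features[attribute] = values[code]
    return features


print(decode_msd("Ncfsn"))
```

Keeping the table as data rather than code mirrors the split between common tables and language-particular sections: a language would only need to supply its own table.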

Example common table

Example language-specific table

2. The lexica
– Medium-size morphosyntactic lexica
– Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian
– ~ all word-forms of approx. 15,000 lemmas
– A lexical entry is composed of three fields:
    – the word-form: the inflected form of the word
    – the lemma: the base form of the word
    – the morphosyntactic description (MSD)

Example: Slovene lexicon
abeced      abeceda   Ncfdg
abeced      abeceda   Ncfpg
abeceda     =         Ncfsn
abecedah    abeceda   Ncfdl
abecedah    abeceda   Ncfpl
abecedam    abeceda   Ncfpd
abecedama   abeceda   Ncfdd
abecedama   abeceda   Ncfdi
abecedami   abeceda   Ncfpi
abecede     abeceda   Ncfpa
abecede     abeceda   Ncfpn
abecede     abeceda   Ncfsg
abecedi     abeceda   Ncfda
abecedi     abeceda   Ncfdn
…
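The three-field entries above lend themselves to a very small reader. The Python sketch below is one possible way to load them, under two assumptions: fields are separated by whitespace, and `=` in the lemma field means the lemma is identical to the word-form (as the entry for abeceda suggests). The function names `parse_lexicon` and `index_by_form` are hypothetical.

```python
from collections import defaultdict

# A few entries copied from the Slovene example.
SAMPLE = """\
abeced abeceda Ncfdg
abeced abeceda Ncfpg
abeceda = Ncfsn
abecedah abeceda Ncfdl
"""


def parse_lexicon(text):
    """Yield (word_form, lemma, msd) triples from three-field lines."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word_form, lemma, msd = line.split()
        if lemma == "=":  # assumed shorthand: lemma equals the word-form
            lemma = word_form
        yield word_form, lemma, msd


def index_by_form(entries):
    """Group entries by word-form; one form can carry several readings."""
    index = defaultdict(list)
    for word_form, lemma, msd in entries:
        index[word_form].append((lemma, msd))
    return dict(index)


lexicon = index_by_form(parse_lexicon(SAMPLE))
print(lexicon["abeced"])  # two readings: dual genitive and plural genitive
```

Indexing by word-form makes the ambiguity visible: a tagger has to choose among the readings of a form such as abeced from context.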

3. The “1984” corpus
– Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
– Structurally annotated
– Sentence aligned with English
– Words annotated with lemma and MSD
– Encoded in TEI P4 (XML)

Example linguistic encoding
Bil je jasen, mrzel aprilski dan in ure so bile trinajst. …
Context-disambiguated lemmas and MSDs
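Word-level annotation of this kind can be read back with any XML parser. The Python sketch below pulls (form, lemma, MSD) triples out of a small fragment. Everything in the fragment is an assumption made for illustration: the element names `s`, `w` and `c`, the attribute names `lemma` and `ana`, the sentence identifier, and the MSD values. None of it is copied from the corpus, and the helper `extract_tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment (not a full sentence): names and values are
# assumptions and should be checked against the actual corpus encoding.
FRAGMENT = """
<s id="example.1">
  <w lemma="dan" ana="Ncmsn">dan</w>
  <w lemma="ura" ana="Ncfpn">ure</w>
  <c>.</c>
</s>
"""


def extract_tokens(xml_text):
    """Return (form, lemma, msd) triples in document order.

    Punctuation elements carry no lemma or MSD, so they yield None for both.
    """
    root = ET.fromstring(xml_text)
    tokens = []
    for element in root.iter():
        if element.tag == "w":
            tokens.append((element.text, element.get("lemma"), element.get("ana")))
        elif element.tag == "c":
            tokens.append((element.text, None, None))
    return tokens


for form, lemma, msd in extract_tokens(FRAGMENT):
    print(form, lemma, msd)
```

Triples in this shape are what a tagger or lemmatizer would typically take as training and testing data.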

Utility of MULTEXT-East LRs
– Specifications became, for some, the “national” standard
– Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
– A base dataset for further annotation and experiments:
    – Word-sense disambiguation
    – WordNet development and evaluation
    – Syntactic parser induction
– Teaching aid in HLT courses
– ~ 100 registered users
– As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian, Bosnian?

Corpora using TEI+MULTEXT-East
– Reference corpus of Slovene: FIDA (100Mw), FIDA+ (600Mw) (+ other Sl. corpora)
– Croatian National Corpus: HNK (100Mw)
– Various Romanian corpora, …
– En-Sl parallel annotated corpus: SVEZ-IJS (10Mw)

Conclusions
– TEI provides a rich and flexible infrastructure to encode parallel corpora: meta-data, corpus and document structure, alignment, linguistic analysis
– MULTEXT-East provides a harmonised and common infrastructure for word-level morphosyntactic descriptions
– Both have already been used for a number of corpora
– Maybe also for BKS?

Thank you!
