The MULTEXT-East multilingual language resources Tomaž Erjavec Department of Knowledge Technologies Jožef Stefan Institute, Ljubljana

1 The MULTEXT-East multilingual language resources
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/

2 Graz Uni January 27 2006 Tomaž Erjavec Dept. of Knowledge Technologies, Jozef Stefan Institute
Overview
1. Introduction to Language Resources
2. MULTEXT-East: morphosyntactic resources for East-European languages
3. A tour of Slovene language resources
4. Conclusions

3 Introduction to Language Resources
LRs comprise two types of data:
– corpora: mono- or multilingual, reference or specialised, …, variously annotated
– lexica: vocabularies, morphosyntactic, syntactic, semantic (ontologies)
LRs, esp. corpora, are used for empirical language research:
– linguistic research: (annotated) corpus + (sophisticated) search engine
– human language technology R&D: testing and training dataset

4 Characteristics of LRs
Separate development for each language → great variation in availability between languages
Costly to produce, so should be widely available, but:
– “monopoly protection”
– problems of copyright
– lack of encoding standardisation
Good side:
– text is becoming increasingly easy to acquire (WWW)
– un- & semi-supervised ML methods give increasingly good results
Ideal: lots of different, large, high-quality, standardised, freely available, and supported LRs for all languages, multilingual and multimodal

5 History of LRs
70s: Chomskyan paradigm – no LRs
85-95: renaissance of empiricism (LR-based):
– became accepted in academic circles: corpus linguistics / (statistical) machine learning
– advances in standardisation: TEI, EAGLES
– large EU funded HLT/LR projects: EAGLES, MULTEXT,…
– EU Copernicus (1995,’97): MULTEXT-East, TELRI,…
– LR brokers: LDC (1992), ELRA (1995)
95-05: established field ~ old hat
– LREC: biennial conferences (1998-), LRE journal (2005)
– XML based standards: TEI, ISO, W3C
– national initiatives
– no more EU funding for LR collection or HLT R&D
– EU funding for component multimodal / multilingual technologies, standardisation and research infrastructures

6 MULTEXT-East resources
MULTEXT-East: Copernicus Joint Project COP 106 (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
Based on the results of EU MULTEXT (~West)
To produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
– corpus encoding standardisation (TEI / CES)
– multilingual parallel, comparable, speech corpora
– morphosyntactic specifications (EAGLES / MULTEXT)
– (inflectional) lexicon
– annotated corpus
– language processing tools

7 History of MULTEXT-East resources
First release 1998 on TELRI CD-ROM Vol II: already extended with new languages
Resources since 1998 available on the Web: http://nl.ijs.si/ME/
Second release 2002 in scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation
Third release 2004: merge of first two releases, further languages
Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bi-lateral projects

8 The Languages of MULTEXT-East
Germanic: English
Romance: Romanian
Baltic:
– Latvian
– Lithuanian
Finno-Ugric:
– Estonian
– Hungarian
Slavic:
– Russian (East Slavic)
– Czech (West Slavic)
– Slovene (South West Slavic)
– Resian (Slovene dialect)
– Croatian (South West Slavic)
– Serbian (South West Slavic)
– Bulgarian (South East Slavic)
In progress:
– Macedonian
– Persian

9 Version 3
Available on http://nl.ijs.si/ME/V3/
Some parts completely free, others free for research → licence
The web page gives:
– extensive documentation
– bibliography list
– web licence form
– resources

10 The MULTEXT morphosyntactic trinity
1. MULTEXT-East morphosyntactic specifications
2. MULTEXT-East morphosyntactic lexica
3. MULTEXT-East morphosyntactically annotated "1984" corpus

11 1. Morphosyntactic specifications
Based on EAGLES / MULTEXT
Define PoS, their attributes and values
The specs are a document containing:
– introduction
– common tables
– language particular sections
Written in LaTeX → PDF & HTML
Derived XML/TEI encoding as feature structures
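As a rough illustration of how a specification of this kind ties a PoS, its attributes and their values to a compact tag, the sketch below decodes a noun tag such as Ncfsn (the form used in the Slovene lexicon example). It assumes a positional encoding in which the first character is the PoS and each following character is the value of one attribute; the table covers only the first four noun positions and is an illustrative subset, not a copy of the specification. The names NOUN_ATTRIBUTES and decode_noun_msd are hypothetical.

```python
# Illustrative subset: the first four noun attributes, in positional order.
# The value codes are assumptions inferred from tags such as "Ncfsn";
# the full specification defines more categories, attributes and values.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd: str) -> dict:
    """Expand a positional noun tag into attribute-value pairs."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {"PoS": "Noun"}
    # Pair each attribute with the character at its position in the tag.
    for (name, values), code in zip(NOUN_ATTRIBUTES, msd[1:]):
        if code != "-":  # "-" is treated here as "attribute not applicable"
            features[name] = values.get(code, f"unknown({code})")
    return features


print(decode_noun_msd("Ncfsn"))
```

Keeping the attribute tables as plain data, separate from the decoding loop, mirrors the split between the specification (tables) and the tools that consume it.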

12 Example common table

13 Example language-specific table

14 Complexity

15 2. The lexica
Medium size morphosyntactic lexica
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.
~ all word-forms of approx. 15,000 lemmas
Lexical entry is composed of three fields:
– the word-form: the inflected form of the word
– the lemma: the base-form of the word
– the morphosyntactic description (MSD)

16 Example: Slovene lexicon
abeced abeceda Ncfdg
abeced abeceda Ncfpg
abeceda = Ncfsn
abecedah abeceda Ncfdl
abecedah abeceda Ncfpl
abecedam abeceda Ncfpd
abecedama abeceda Ncfdd
abecedama abeceda Ncfdi
abecedami abeceda Ncfpi
abecede abeceda Ncfpa
abecede abeceda Ncfpn
abecede abeceda Ncfsg
abecedi abeceda Ncfda
abecedi abeceda Ncfdn
…
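The three-field entry layout can be read with a few lines of code. The sketch below is a minimal reader, assuming whitespace-separated fields and assuming that "=" in the lemma field means the lemma is identical to the word-form (as the abeceda line suggests). LexEntry and parse_entry are hypothetical names, and the sample lines are copied from the example above.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    wordform: str  # the inflected form
    lemma: str     # the base form
    msd: str       # the morphosyntactic description


def parse_entry(line: str) -> LexEntry:
    """Split one lexicon line into word-form, lemma and MSD."""
    wordform, lemma, msd = line.split()
    if lemma == "=":  # assumed shorthand: lemma equals the word-form
        lemma = wordform
    return LexEntry(wordform, lemma, msd)


SAMPLE = """\
abeced abeceda Ncfdg
abeceda = Ncfsn
abecedah abeceda Ncfdl
"""

entries = [parse_entry(line) for line in SAMPLE.splitlines() if line.strip()]
for entry in entries:
    print(entry)
```

A lookup table from word-form to its possible (lemma, MSD) pairs can then be built by grouping the entries on wordform, which is the usual starting point for a tagger or lemmatizer.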

17 Lexicon sizes

18 The specification as TEI FS
… …

19 3. The “1984” corpus
Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
Structurally annotated
Sentence aligned with English
Words annotated with lemma and MSD
Encoded in TEI P4 (XML)

20 Example linguistic encoding
Bil je jasen, mrzel aprilski dan in ure so bile trinajst. …
Context disambiguated lemmas and MSDs
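A word-level annotation of this kind (each word carrying a lemma and an MSD) could be read as sketched below. The element and attribute names (s, w, c, lemma, ana) are assumptions chosen for illustration, not taken from the slide, and the sample sentence is hypothetical: it reuses the abeceda entry from the lexicon example rather than real corpus data.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one sentence with a word and a punctuation token.
# Tag and attribute names are assumptions, not the corpus's confirmed schema.
SAMPLE = (
    '<s id="demo.1">'
    '<w lemma="abeceda" ana="Ncfsn">abeceda</w>'
    '<c>.</c>'
    '</s>'
)


def read_tokens(sentence_xml: str) -> list:
    """Return (form, lemma, msd) triples; punctuation has no lemma or MSD."""
    root = ET.fromstring(sentence_xml)
    tokens = []
    for element in root:
        if element.tag == "w":
            tokens.append((element.text, element.get("lemma"), element.get("ana")))
        elif element.tag == "c":
            tokens.append((element.text, None, None))
    return tokens


print(read_tokens(SAMPLE))
```

Because the lemma and MSD are already disambiguated in context, each word yields exactly one triple, unlike a lexicon lookup, which may return several candidates for the same form.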

21 Quantifying the corpus

22 Utility of MULTEXT-East LRs
Specifications became, for some, the “national” standard
Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
A base dataset for further annotation and experiments:
– Word-sense disambiguation
– WordNet development and evaluation
– Syntactic parser induction
Teaching aid in HLT courses
~ 100 registered users
As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian

23 LRs @ JSI
Also ours: VAYNA, GORE, sloWNet
Contributors to: FIDA, DSI, FDV, JRC-ACQUIS
Contractors for: Inxight
Nice try: EU CULTACT

24 JSI know-how in corpus compilation
1. Encoding standardisation: XML, TEI, ISO
2. Up-conversion: character set, structure, meta-data
3. Linguistic annotation: token, lemma, MSD, alignment
4. Distribution via nl.ijs.si: concordancing, browsing, download
& teaching in these areas: ESSLLI, JSIPS, FF, NG

25 Slovene LRs @ SDJT

26 Conclusions
Introduced language resources, MULTEXT-East and Slovene LRs
Useful basis for empirical studies of the (Slovene) language
Of course, more resources are needed, but we are working on it: SDT, sloWNet, jaSlo, ACQUIS, MULTEXT-East
Further collaborations welcome…

27 Thank you!

