HLT at the AILab, IMCS, UL Artificial Intelligence Laboratory Institute of Mathematics and Computer Science University of Latvia www.ailab.lv 27.10.2004.

1 HLT at the AILab, IMCS, UL Artificial Intelligence Laboratory Institute of Mathematics and Computer Science University of Latvia www.ailab.lv 27.10.2004

2 Agenda: Brief history of the Laboratory; Corpus linguistics; MT modelling; Speech synthesis for Latvian; Development of electronic dictionaries; Computer-assisted teaching aids

3 History: IMCS (UL) has been working on automated processing of Latvian for more than 15 years. The first activities concerned the development of a Latvian character coding standard (1989). The AILab was founded in 1992. One of the main tasks of the Laboratory is to ensure that Latvian can be used and processed in computer systems. There are 5–10 people working at the Laboratory: research fellows and students of the Faculty of Physics and Mathematics (UL) and the Faculty of Philology (UL).

4 Building up Latvian Corpus: Collection of Latvian resources began in the late 1980s and early 1990s: at the very beginning texts were keyed in manually; later they were scanned and optically recognised, and a precision of 98.5% has been achieved. Ca. 20 million running words covering different types of texts. Pieces of classical Latvian literature: > 10 000 pages; end of the 19th, beginning of the 20th century.

5 Building up Latvian Corpus: Pieces of Latvian folklore: the biggest collection of Latvian beliefs; the biggest collection of Latvian fairy tales and legends. Electronic library of Latvian folkloristics. In collaboration with the Archives of Latvian Folklore, a collection of Latvian proverbs is being built (> 20 000 units). Latvian Culture (in Latvian and English, rich in pictures). Texts from the newspaper “Rīgas Balss” (1994–1997) in Latvian and Russian.

6 Building up Latvian Corpus: Issues: fragmentation, mark-up, character set/font compatibility, copyrights. Mark-up: plain text format; HTML; ca. 4 million running words with structured SGML mark-up; ca. 1 million running words transformed from HTML to XML. Work on developing software for semi-automated morpho-syntactic annotation is ongoing.

7 Building up Latvian Corpus: Tools: A morphological analyser has been developed, which returns all base forms of a given word form and vice versa. Using data from the “Reverse dictionary of Latvian”, experimental software for Latvian morphemic analysis has been created; it is being supplemented with new rules during the development of morpho-syntactic annotation tools. In collaboration with Stockholm University, work on the morphemic analysis system has been carried out on the basis of “A derivational dictionary of Latvian”.

8 Building up Latvian Corpus: Work on a pilot morpho-syntactically annotated corpus was started (1996–2000): it covers approximately 10 000 manually annotated words of modern written Latvian. An experimental mark-up transformation tool (HTML to structural XML) has been developed. IMCS has passed the 1st phase of the ESF project competition to receive funding for supplementing and balancing the corpus: + 20 million running words, both old and contemporary texts.

9 Work with Parallel Texts: IMCS took part in the EU joint action TELRI (Trans-European Language Resources Infrastructure) during 1995–2001. The Latvian translation of Plato’s “Republic” has been added to the versions in 14 other European languages, and a CD “East meets West” has been produced with these aligned parallel texts. Orwell’s novel “1984”, aligned at sentence level, is available from Tractor, the TELRI Research Archive of Computational Tools and Resources (www.tractor.de). The Vanilla aligner developed at Göteborg University (Danielsson and Ridings, 1999), which uses the algorithm of Gale and Church (1993) and works with character counts, is used to align these texts.
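
The length-based sentence alignment of Gale and Church (1993) referred to on this slide can be illustrated with a minimal sketch. It is not the aligner used at the Laboratory: the function names are hypothetical, the prior probabilities and the variance are the values commonly quoted from Gale and Church (1993) and should be treated as illustrative, and the input is simply a list of sentence lengths in characters for each language.

```python
import math

# Prior probability of each bead type (source sentences, target sentences).
# Illustrative values, as commonly quoted from Gale and Church (1993).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0    # assumed expected number of target characters per source character
S2 = 6.8   # assumed variance of that ratio


def bead_cost(len_src, len_tgt, prior):
    """Negative log-probability of pairing text of these character lengths."""
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-tailed probability of a deviation at least this large under N(0, 1).
    prob = math.erfc(abs(delta) / math.sqrt(2.0))
    if prob <= 0.0:
        return float("inf")
    return -math.log(prior * prob)


def align(src_lens, tgt_lens):
    """Return beads as (source indices, target indices) using dynamic programming."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                if i + di > n or j + dj > m:
                    continue
                step = bead_cost(sum(src_lens[i:i + di]),
                                 sum(tgt_lens[j:j + dj]), prior)
                if cost[i][j] + step < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = cost[i][j] + step
                    back[i + di][j + dj] = (di, dj)
    # Walk back from the end of both texts to recover the cheapest path.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads
```

Calling `align([20, 20], [40])` pairs both source sentences with the single target sentence, because one 2-1 bead is cheaper than a 1-1 bead plus an unmatched sentence. Only character counts are used, so no lexical knowledge of either language is needed.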

10 Work with Parallel Texts: Thanks to the collaboration with the Translation and Terminology Centre (www.ttc.lv), it is possible to work with English-Latvian parallel texts: a small pilot English-Latvian parallel corpus (legal texts) with ca. 100 000 words per language, aligned at sentence level, was built in 2001; a corpus-based analysis of English multi-word verb units and their Latvian translation equivalents has been carried out, as well as some translation studies; ways in which the information gained from parallel texts can be applied in MT systems have been examined.

11 Corpus of Early Written Latvian Texts: The project was initiated in the mid-1990s, when the most significant sources were keyed in. The aims of the corpus are: to promote and facilitate the diachronic study of Latvian; to offer computerised material to those interested in the development and varieties of the language; to serve as a basis both for the dictionary of the 17th century (in the near future) and for a Latvian thesaurus (in the far future). A pilot project on the development of an electronic dictionary of the early written texts was initiated in 2002, which contributed towards building up the first publicly available Latvian corpus.

12 Corpus of Early Written Latvian Texts: Statistics: consists of texts from the 16th to the 18th century; mainly ecclesiastical texts; ca. 900 000 running words. A number of conventions have been introduced for: coding special characters (different accents etc.); using compound characters of the Baltic Windows code page. Structural annotation: foreign language fragments; cross-references to other parts of the text; structural containers; mistakes and other elements.

13 Corpus of Early Written Latvian Texts: Acquisition of the corpus: text typing or scanning → adding some structural mark-up → automated verification and (pre)processing. On-line end-user tools: content navigation through different dimensions (periods of time, sources, authors and text types); search in the word form index by word pattern (several criteria and scope restrictions can be applied); a KWIC concordance; automatic context positioning of retrieved running words (by search in the index and concordance); some statistical tools: frequency lists, word lists and others.
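
Two of the end-user tools listed on this slide, the KWIC concordance and the frequency list, can be sketched in a few lines. This is only an illustration of the general technique and not the corpus software itself; the function names `kwic` and `frequency_list`, the whitespace tokenisation and the sample sentence are assumptions.

```python
from collections import Counter


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


def frequency_list(tokens):
    """Return (word form, count) pairs, most frequent first, ignoring case."""
    return Counter(token.lower() for token in tokens).most_common()


# Hypothetical sample text, tokenised on whitespace.
tokens = "tas ir labs un tas ir jauns".split()
for left, hit, right in kwic(tokens, "ir", window=1):
    print(f"{left:>10} [{hit}] {right}")
```

A real corpus tool would read from a prebuilt word form index rather than scanning the token list, but the output format (keyword centred with a fixed window of context) is the same.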

14 Corpus of Early Written Latvian Texts

15 Machine Translation: In 1993 work on a limited MT model was started. An interlingua MT model LATRA (SWETRA, Lund University) for translation of stock market texts (Latvian-English-Latvian) has been developed: it translates basic types of declarative sentences; the main problem is disambiguation. Supported by the Latvian Council of Science: “Limited Model of Automated Machine Translation System for Latvian” (1993–1996); “Development of Probabilistic Methods for Automated Disambiguation of Natural Language Texts and Applications for Machine Translation” (1997–1999).

16 Machine Translation: In 1997 the Laboratory joined the Universal Networking Language (UNL) project (www.undl.org). UNL is an artificial language intended to overcome the language barrier; it consists of a dictionary of universal words, relations, a knowledge base and attributes. An experimental grammar for deconversion of UNL texts into Latvian has been developed. “Automated synthesis of language independent text representation” was funded by the Latvian Council of Science. Perspective: semantic aspects of text analysis; development of a general-purpose translation system; implementation of the semantic types proposed in the SIMPLE (Semantic Information for Multifunctional Plurilingual Lexica) project.

17 Development of Electronic Dictionaries: A number of Latvian dictionaries have been keyed in or scanned. Some special computerised lexicons have been prepared to meet the needs of particular projects: for UNL, an English-Latvian machine-readable lexicon of ca. 10 000 entries has been made, in which the grammatical information of the Latvian entries is presented in a formalised way. There is an initiative to develop a new electronic dictionary covering as many Latvian words and their meanings as possible; to achieve this, the main lexicographical resources are being processed.

18 Development of Electronic Dictionaries: On-line versions of several dictionaries: an explanatory dictionary of contemporary Latvian (ca. 35 000 entries); Mühlenbach-Endzelin’s “Lettisch-deutsches Wörterbuch” (ca. 130 000 entries, very complex structure and character set); a Latvian-Russian-Latvian bilingual dictionary for students (ca. 70 500 entries); an internet Term Bank with ca. 115 000 Latvian terms and their translation equivalents (Russian, German, English, Latin), developed in 1998 for the Translation and Terminology Centre using the Trados MultiTerm platform.

19 Development of Electronic Dictionaries: There is a need to move to standardised dictionary encoding (and development), and to convert the existing lexicographical sources into a widely compatible format. Development of a universal, metamodel-based on-line environment for dictionary production and publishing that serves both parties, dictionary creators/providers and end-users (humans as well as software agents); funding from the Latvian Council of Science has been assigned. On the basis of the Latvian Corpus and various machine-readable/understandable dictionaries: extraction of the Latvian WordNet.

20 Work with Speech: In the project “ONOMASTICA-COPERNICUS” (1995–1997), ca. 250 000 Latvian proper names were transcribed using IPA. In 2001 work on the development of a corpus of spoken language was started (funded by the Latvian Council of Science): ca. 1 300 phrases, words and sentences spoken by 15 persons (5 men, 7 women and 3 children); an 8-hour recording of a seminar (2 women performing simultaneous interpretation); a special text of ca. 1 000 words read by 50 persons (29 women and 21 men). The speech is transcribed and text segmentation is performed using the Transcriber software (Edinburgh University).

21 Work with Speech: In order to use these data in speech synthesis and speech recognition systems, grapheme-phoneme transcription software has been developed: > 300 grapheme-to-phoneme rules. The machine-readable transcription represents: consonant assimilation in sonority; consonant assimilation in place of articulation; vocalization; vowel weakening. The machine-readable transcription does not represent: word stress, syllable intonation, sentence intonation.
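
To give a sense of what one rule of this kind might look like, here is a minimal sketch of regressive voicing assimilation in an obstruent cluster. It is a toy under stated assumptions, not one of the Laboratory's rules: reading "assimilation in sonority" as voicing assimilation is an assumption, the consonant table is deliberately incomplete, the output is a plain respelling rather than a phonemic notation, and the function name is hypothetical.

```python
# Toy table of voiced/voiceless obstruent pairs (incomplete, for illustration only).
VOICED_TO_VOICELESS = {"b": "p", "d": "t", "g": "k", "z": "s", "ž": "š"}
VOICELESS_TO_VOICED = {v: k for k, v in VOICED_TO_VOICELESS.items()}


def assimilate_voicing(word):
    """Make each obstruent agree in voicing with the obstruent that follows it."""
    chars = list(word)
    # Scan right to left so that a change can propagate through a longer cluster.
    for i in range(len(chars) - 2, -1, -1):
        current, following = chars[i], chars[i + 1]
        if following in VOICELESS_TO_VOICED and current in VOICED_TO_VOICELESS:
            chars[i] = VOICED_TO_VOICELESS[current]
        elif following in VOICED_TO_VOICELESS and current in VOICELESS_TO_VOICED:
            chars[i] = VOICELESS_TO_VOICED[current]
    return "".join(chars)


print(assimilate_voicing("labs"))     # laps
print(assimilate_voicing("atbilde"))  # adbilde
```

A full rule set would also need context conditions, exceptions and the other phenomena listed on the slide, which is presumably why the real system has several hundred rules.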

22 Work with Speech: The accuracy of the automatically obtained phonemic transcription is approximately 92%. An experimental Latvian TTS (Text-to-Speech) synthesizer has been developed. The speech segment database has been prepared. The program uses not only diphones and half phonemes, but also triphones, phonemes etc. The next stage will be the creation of a prosodic model in order to avoid monotonous synthesized speech and to develop a complete TTS system. Quality improvement and supplementation of the speech corpora are planned for next year.

23 Computer-assisted Teaching Aids: Since 1998 the AILab has been taking part in the project “Latvian education informatization system”. Special emphasis is put on non-native speakers of Latvian. For deaf students a sign language dictionary has been developed. A software tool helps learners to master the pronunciation of single sounds, sound combinations or even short words. A Latvian word analyser and synthesizer is available on-line.

24 Computer-assisted Teaching Aids: Development of e-books and e-courses of Latvian since 1998. E-course of the Latvian language for secondary schools: covers the theory of language widely; more than 600 interactive exercises with automatic testing. Latvian for primary schools: presented using animation, interactive exercises and tests. Teaching aids for foreigners: theory and exercises; methodological guidelines for Russian teachers; interactive course “What have you said?” (17 themes, games and exercises; animation and sound).

25 Conclusions & Future Perspective: Fields that are and will remain the most important areas of interest of the Artificial Intelligence Laboratory, IMCS, UL: building up Latvian corpora and tools for their analysis (spoken and written language; monolingual and multilingual); work on electronic dictionaries and software development; Latvian WordNet; MT modelling; Latvian text-to-speech synthesizer and speech recogniser.

26 Thank You!

