
Slide 1: The Basic Language Resources Kit (BLARK)
Steven Krauwer, Utrecht Institute of Linguistics UiL OTS / ELSNET
Hamburg, 22-11-2004

Slide 2: Overview
The BLARK enterprise
How to arrive at it
The Dutch Language Union approach
Refining the concept
Defining a BLARK
Main beneficiaries
References
Concluding remarks

Slide 3: The BLARK enterprise
Define the minimal set of language resources that is necessary to do any precompetitive R&D and professional education at all for a language (the Basic Language Resource Kit or BLARK)
Determine for each language which components are already available
Make a priority plan to complete the BLARK for each language
Ensure funding to get the work done

Slide 4: What are the components of a BLARK?
Lexicons (monolingual, multilingual, …)
Corpora (language, speech; annotated, unannotated; mono- and multilingual; mono- and multimodal; …)
Tools (annotation, exploration, …)
Modules (lemmatizers, parsers, speech recognizers, TTS, transcribers, translation, …)
…

Slide 5: What makes the BLARK enterprise special?
The idea is to make a common, generic BLARK definition, in principle applicable to all languages
The common definition will be based on experience with different languages, and will prevent reinvention of wheels
The common definition will ensure interoperability and interconnectivity (especially for multilingual or cross-lingual applications)

Slide 6: Other benefits
Experience from other languages will help in making cost estimations
Adoption of a BLARK common to all languages may help in persuading funders to support the creation of the BLARK
Adoption of a common BLARK may facilitate porting of knowledge and expertise between languages

Slide 7: Words of caution
A BLARK definition will evolve over time, as new applications, application environments and technologies come up
A BLARK definition should be seen as a template rather than a dictate, as different languages may have different specific requirements
BLARK completion priorities may differ from language to language (e.g. on economic, social or political grounds)

Slide 8: How to define a BLARK and assign priorities
Methodology proposed by the Dutch Language Union (DLU) (Binnenpoorte et al., LREC 2002):
– Identify a number of typical applications
– Determine for each of them which technologies (modules) are needed to build them (-, +, ++, +++)
– Identify for each module which resources it requires (-, +, ++, +++)
– Assign the highest priority to the resources that support the most applications
A minimal sketch of this priority computation is given below.
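The sketch below illustrates one way the DLU-style application => module => resource reasoning could be turned into a priority score. The application, module and resource names, the numeric weights for the -, +, ++, +++ scale, and the multiplicative scoring are all illustrative assumptions, not part of the DLU report.

```python
# Hypothetical sketch of a DLU-style priority computation (names and weights assumed).
from collections import defaultdict

# Mapping of the -, +, ++, +++ importance scale to numeric weights (assumption).
WEIGHTS = {"-": 0, "+": 1, "++": 2, "+++": 3}

# Which modules each typical application needs, and how badly (illustrative).
app_needs_module = {
    "spell checking":        {"tokeniser": "++", "morph analyser": "+++"},
    "information retrieval": {"tokeniser": "+++", "parser": "+"},
    "dialogue system":       {"parser": "++", "speech recogniser": "+++"},
}

# Which resources each module requires, and how badly (illustrative).
module_needs_resource = {
    "tokeniser":         {"written corpus": "++"},
    "morph analyser":    {"lexicon": "+++", "written corpus": "+"},
    "parser":            {"treebank": "+++", "lexicon": "++"},
    "speech recogniser": {"speech corpus": "+++", "lexicon": "++"},
}

def resource_priorities():
    """Score each resource by summing, over all applications, the product of how
    much the application needs a module and how much that module needs the
    resource; resources supporting the most applications end up on top."""
    scores = defaultdict(int)
    for app, modules in app_needs_module.items():
        for module, m_weight in modules.items():
            for resource, r_weight in module_needs_resource.get(module, {}).items():
                scores[resource] += WEIGHTS[m_weight] * WEIGHTS[r_weight]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    for resource, score in resource_priorities():
        print(f"{score:3d}  {resource}")
```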

Slide 9: Proposed DLU priorities for NLP
1. treebank
2. robust parsers
3. tokenisation and named entity recognition
4. semantic annotations for the treebank
5. translation equivalents
6. evaluation benchmarks

Slide 10: Proposed DLU priorities for speech
1. automatic speech recognition
2. application-specific speech corpora
3. multi-media speech corpora
4. tools for transcription of speech data
5. speech synthesis
6. benchmarks for evaluation

Slide 11: Next steps by the DLU
Make a survey of what exists and to what extent it is available (0-9 availability score)
Assign priorities (not just resources, but also an infrastructure for maintenance and distribution)
Secure funding from the Dutch and Flemish governments for a national programme
Issue calls for proposals for collaborative resources projects (1st call closed 2 November 2004)

Slide 12: Refining the concept
Items not really covered by the DLU teams:
– definition vs specification
– availability
– quality
– quantity
– standards
– support
These are addressed in the NEMLAR project

Slide 13: Definition / specification
It is not enough to say 'a written language corpus'; what about:
– size (types, tokens)
– encoding
– annotation
– text types
– representativeness
– domains
i.e. we need full specs
A sketch of such a spec as a simple data structure is given below.
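As an illustration of what "full specs" might mean in practice, here is a minimal data-structure sketch for a written corpus spec. The field names follow the slide; the class name and the example values are invented for illustration only.

```python
# Hypothetical sketch of a written-corpus specification as a data structure.
from dataclasses import dataclass, field
from typing import List

@dataclass
class WrittenCorpusSpec:
    name: str
    size_tokens: int                  # size in running words (tokens)
    size_types: int                   # size in distinct word forms (types)
    encoding: str                     # e.g. "UTF-8"
    annotation: List[str]             # e.g. ["POS", "lemma"]
    text_types: List[str]             # e.g. ["newspaper", "web"]
    representative_of: str            # what population the sample should reflect
    domains: List[str] = field(default_factory=list)

# Example instantiation (all figures illustrative):
example = WrittenCorpusSpec(
    name="general written corpus",
    size_tokens=1_000_000,
    size_types=80_000,
    encoding="UTF-8",
    annotation=["POS"],
    text_types=["newspaper"],
    representative_of="contemporary edited prose",
    domains=["news", "politics"],
)
```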

Slide 14: Availability
DLU: 0-9 scale, very impressionistic
Our proposal: 3 dimensions:
– accessibility
– cost
– modifiability
To each we assign a penalty score (0 is best)

Slide 15: Accessibility
3 classes, with associated penalties:
– (3) existing, but only company-internal
– (2) existing and freely usable for precompetitive research
– (1) existing and freely usable for all R&D

Slide 16: Cost
4 cost categories:
– (4) price over 10 keuro
– (3) price between 1 and 10 keuro
– (2) price between 100 and 1000 euro
– (1) less than 100 euro

Slide 17: Modifiability
3 categories:
– (3) black box: you get it as it is, but you cannot change or even inspect its internals
– (2) glass box: you cannot change it, but you can see what is inside
– (1) open resources: freely manipulable

Slide 18: Comments on availability
We can now express availability as a 3-digit score (accessibility, cost, modifiability), which should be rather easy to assign objectively
The lowest scores are the best
If the accessibility score is 3, the other scores don't mean very much
A sketch of this three-part score is given below.
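The following sketch pulls the penalties from slides 15-17 into one availability triple. The category labels are paraphrases of the slides, and the helper and type names are my own assumptions.

```python
# Minimal sketch of the availability triple (accessibility, cost, modifiability).
from typing import NamedTuple

ACCESSIBILITY = {                       # penalty 1 is best
    "company-internal only": 3,
    "free for precompetitive research": 2,
    "free for all R&D": 1,
}

MODIFIABILITY = {
    "black box": 3,                     # cannot inspect or change internals
    "glass box": 2,                     # can inspect, cannot change
    "open": 1,                          # freely manipulable
}

def cost_penalty(price_euro: float) -> int:
    """Map a price in euro onto the 4 cost categories of slide 16 (4 is worst)."""
    if price_euro > 10_000:
        return 4
    if price_euro > 1_000:
        return 3
    if price_euro >= 100:
        return 2
    return 1

class Availability(NamedTuple):
    accessibility: int
    cost: int
    modifiability: int

# Example: a glass-box lexicon, free for precompetitive research, priced at 5 keuro
lexicon = Availability(
    accessibility=ACCESSIBILITY["free for precompetitive research"],
    cost=cost_penalty(5_000),
    modifiability=MODIFIABILITY["glass box"],
)
print(lexicon)  # Availability(accessibility=2, cost=3, modifiability=2)
```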

Slide 19: Quality
We distinguish two types of quality: absolute (i.e. an inherent property of the resource) and relative (i.e. in relation to how you want to use it):
Absolute: standard-compliance and soundness
Relative: task-relevance and environment-relevance

Slide 20: Standard-compliance
Criterion: to what extent is the resource based on a common standard (formal or de facto)?
Possible values (penalty based):
– (3) no standard
– (2) standard, but not fully compliant
– (1) standard and fully compliant

Slide 21: Soundness
Criterion: to what extent is the resource based on well-defined specifications?
Values:
– (3) no specifications provided
– (2) specs provided, but not fully compliant
– (1) specs provided, fully compliant

Slide 22: Task-relevance
Criterion (relative): to what extent is the resource suited for a specific task X?
Values (3 binary values):
– contains all information needed for X (yes/no)
– has the proper size for X (yes/no)
– based on a relevant selection of items for X (yes/no)

Slide 23: Environment-relevance
Criterion: to what extent is the resource interoperable with its environment (other resources)?
Values (3 binary values):
– information matches (yes/no)
– size matches (yes/no)
– selection matches (yes/no)

Slide 24: Comments on quality
We can now express:
– absolute quality objectively, in terms of a pair of scores (standard-compliance, soundness); this score can be assigned by the provider
– relative quality (for our own purposes), in terms of two triples of yes/no answers (task-relevance, environment-relevance); this score can only be assigned by the user
Other attributes may be added, as long as they can be objectively assigned
A sketch of these quality scores is given below.
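The sketch below shows one way to represent the absolute pair and the two relative triples from slides 20-23. The class and field names are my own; only the scoring scheme itself comes from the slides.

```python
# Hypothetical sketch of the absolute and relative quality scores.
from dataclasses import dataclass

@dataclass
class AbsoluteQuality:
    """Assigned by the provider; lower penalties are better (1-3)."""
    standard_compliance: int   # 3 = no standard, 2 = not fully compliant, 1 = fully compliant
    soundness: int             # 3 = no specs, 2 = specs not fully met, 1 = specs fully met

@dataclass
class RelativeQuality:
    """Assigned by the user, for one task and one environment (yes/no per item)."""
    # task-relevance
    has_all_information: bool
    has_proper_size: bool
    relevant_selection: bool
    # environment-relevance
    information_matches: bool
    size_matches: bool
    selection_matches: bool

# Example: a standard-compliant treebank that turns out to be too small for the task
absolute = AbsoluteQuality(standard_compliance=1, soundness=2)
relative = RelativeQuality(
    has_all_information=True, has_proper_size=False, relevant_selection=True,
    information_matches=True, size_matches=False, selection_matches=True,
)
```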

Slide 25: Quantity
The DLU team did not try to formulate any quantitative requirements
We have tried to do this in the context of the NEMLAR project; see below for our tentative figures
Statistical approaches can swallow any amount of resources, and minimal figures are very hard to find
Our figure-finding exercise has been very much example driven

Slide 26: Standards
Very few formal standards exist, although there are some (cf. Romary & Ide at the LREC 2004 workshop; Monachini et al., 2003)
Evolving de facto standards include:
– bottom-up work by committees (TEI)
– top-down actions:
  – projects aiming at standards (e.g. EAGLES, ISLE)
  – example-setting R&D projects (e.g. Wordnet, Speechdat, Multext)
Our position: any standard is better than no standard at all

Slide 27: Defining a BLARK
Work carried out in the context of the NEMLAR project (www.nemlar.org), aimed at Arabic resources
The work described here is based on project deliverables (see site), summarized in an article by Maegaard, Krauwer, Choukri and Damsgaard presented at the NEMLAR conference in Cairo (Sep 2004)

Slide 28: Approach adopted
Same strategy as the Dutch Language Union (applications => modules => resources)
But with different results, because of differences in the social/economic situation and in language structure
Results follow, in terms of global definitions and tentative size indications (no specs provided at this stage, but the project is still ongoing)
Feedback is welcome!

Slide 29: Written resources (1)
Lexicon:
– For all components: 40 000 stems with POS & morphology
– For sentence boundary detection: a list of conjunctions and other sentence starters/stoppers
– For named entity recognition: 50 000 human proper names
– For semantic analysis: the same 40 000, with subcategorization and shallow lexical semantic info; possibly a WordNet

Slide 30: Written resources (2)
Bi-/multilingual lexicon:
– Same size as the monolingual lexicon
Thesauri, ontologies, wordnets:
– Thesaurus subtree with ca. 200-300 nodes for each domain
– Ontologies and wordnets ideally the same size as the lexicon

Slide 31: Written resources (3)
Corpora:
– For term extraction: 100 million words, unannotated
– For small applications: 0.5 million words, annotated
– For a statistical POS tagger: 1-3 million words (annotated)
– For sentence boundary detection: 0.5-1.5 million (annotated)
– For named entity recognition (statistics based): 1.5 million (annotated)
– For term extraction: 100 million (annotated)
– For co-reference resolution: 1 million (annotated)
– For WSD: 2-3 million (annotated)

Slide 32: Written resources (4)
Multilingual corpora:
– For alignment: 0.5 million words (tagged)
Multimodal corpora:
– For OCR (printed): ??
– For OCR (hand-written): ??

Slide 33: Spoken resources (1)
Acoustic data:
– For dictation: 50-100 speakers, 20 min each, fully transcribed, plus 10 speakers for testing
– For telephony: 500 speakers uttering 50 different sentences (speechdat, orientel based)
– For embedded speech recognition: data similar to Speecon
– For broadcast news transcription: 50-100 hours, well-annotated, plus 1000 hours of non-transcribed data; should come with 300 million words of non-annotated written text

Slide 34: Spoken resources (2)
Acoustic data (cont'd):
– For conversational speech: data similar to CallHome/CallFriend from LDC
– For speaker recognition: 500 speakers for training, 3 minutes each, transcribed, plus 100 speakers for testing
– For language/dialect identification: data similar to CallFriend, or from Broadcast News (esp. for variants of Arabic)
– For speech synthesis: male and female speakers, 15 hours, using a read text, phonetically balanced
– For formant synthesis: same as above, with hand-labelled formants

Slide 35: Spoken resources (3)
Multimodal corpora:
– For lip movement reading: similar to M2VTS, with some 50 faces
Written corpora for speech technologies:
– General: 300 million words, unannotated, preferably broadcast news or other press and media sources
– For phonetic lexicons and language models: 1-5 million words, annotated
– For Arabic: vowelized and non-vowelized corpus

Slide 36: What next? (1)
Check the definition and quantification for completeness and consistency, and correct where needed
Try to provide specs for every single item
Try to differentiate between general and Arabic-specific requirements in the definitions and specs

Slide 37: What next? (2)
For each language:
– Take the BLARK definition and specs
– Adapt to local conditions
– Make a survey of what exists and what has to be made
– Find the funds and build the BLARK for your language

Slide 38: Prescriptive / descriptive
Prescriptive:
– the BLARK definition tells you which ingredients you need
– the specification tells you what they should look like
Descriptive:
– a BLARK instantiation comes with a description of its components

Slide 39: Main beneficiaries (1)
Academic and industrial researchers: material to try out ideas and conduct pilot studies
Industrial developers: only for generic activities, since specific applications require more user and domain orientation
Educators: material for experimental work by students in labs

Slide 40: Main beneficiaries (2)
Probably not the main languages in Europe (EN, FR, GE), as they are pretty well covered anyway
Mostly the languages that are not supported by a strong market (because of small size or a poor economy)

Slide 41: References
Binnenpoorte et al., LREC 2002 (see also www.elsnet.org/dox/lrec2002-binnenpoorte.pdf)
ELRA Newsletter, vol. 3, no. 2, 1998 (see also www.elsnet.org/blark.html)
NEMLAR: see www.nemlar.org for:
– Arabic BLARK Report
– NEMLAR presentation at the Cairo conference
Romary & Ide, LREC 2004 (see also www.elsnet.org/lrec2004-roadmap/Romary-Ide.ppt)

Slide 42: Concluding remarks
The BLARK aims at providing a common definition of the notion of a 'minimal set of resources'
It should help language communities come closer to a level playing field, in spite of market forces
It should facilitate porting of expertise
It is necessarily dynamic, as technologies evolve rapidly

Slide 43: Thanks!
Contact: steven.krauwer@elsnet.org

