Standards and Tools: DOBES and CLARIN Views - resumé after about 8 years - Peter Wittenburg, André Moreira The Language Archive - Max Planck Institute.

1 Standards and Tools: DOBES and CLARIN Views - resumé after about 8 years - Peter Wittenburg, André Moreira The Language Archive - Max Planck Institute CLARIN European Research Infrastructure

2 Content
1. CLARIN vs. DOBES - differences?
2. Tools vs. Standards - differences?
3. Overall Comparison
4. TLA Team - Landscape and Strategy
5. Technology - Mainstream influences
6. Conclusions

3 DOBES vs. CLARIN
DOBES is about the documentation of endangered languages (like many other comparable initiatives)
- documentation teams are under time pressure
- thus efficiency is required (transcription: 1-35, translation: 1-25)
- can be facilitated by good tools
documentation certainly is for this generation of researchers, speech communities, students, public, etc. (primary focus of DOBES and teams)
documentation is also for future generations
- documents are part of our cultural heritage
- languages encode knowledge about natures and cultures
- historical material helps us find our identity
therefore DOBES has a short-term and a long-term challenge

4 DOBES vs. CLARIN
CLARIN is about an interoperable + persistent infrastructure for LRT
- landscape is fragmented and nothing fits together
- thus researchers working on data can't be efficient (knowledge workers spend 40% of their time on finding resources, making things compatible, etc.)
- can be facilitated by good standards and agreements
infrastructure certainly is for this/next generation of researchers, students, "citizen scientists", etc.
- enable "better" research if it is "data-driven"
infrastructure is also for future generations
- ensuring access to our research records
- lots of data is highly endangered!
- comparing "old" data with "new" data
therefore CLARIN has a short-term and a long-term challenge

5 DOBES vs. CLARIN: interoperability
DOBES
- community of documenting field linguists
- is interoperability an issue? well, I still don't know
  - interoperable with whom?
  - cross-corpus work based on data is still to come
  - of course some practical barriers (language)
CLARIN
- infrastructure covering "all" language resources & tools (named entity recognition relevant for everyone)
- is interoperability an issue? YES - it's in the focus
  - otherwise always barriers to tackling relevant questions
  - otherwise data-driven research too expensive
it seems that there is a clear difference in primary objectives here

10 DOBES and CLARIN
- researcher focus - DOBES: "comprehensive" documentation; CLARIN: give seamless access to all relevant data
- main characteristic - DOBES: efficiency in annotating, lexicon creation etc.; CLARIN: efficiency in finding things and combining them
- addressees - DOBES: communities, researchers, students, pupils, public; CLARIN: researchers, students, "citizen scientists"
- short-term task - DOBES: give access now; CLARIN: improve access now
- long-term task - DOBES: preserve cultural heritage (second priority); CLARIN: ensure access in future (part of the concept)
- interoperability - DOBES: not first priority; CLARIN: first priority
- Ulrike - Nicoletta - DOBES: "standard" a no topic; CLARIN: "standard" a major topic
thus very much in common - but also some differences

11 Tools vs. Standards
who dares to doubt that
- tools determine our "productivity"
- tools influence attractiveness of solutions
- people are used to tools - who wants to learn new stuff?
- tools need to be egocentrically built
- development is expensive (UI)
- fast development cycles are necessary
- SW management is very expensive and eats up person power
  - ~80% of all software developments fail
  - a lot of the SW developed will die quickly since there is not enough money to maintain it
- tools have a short lifecycle of about 10 years on average
(figure: functionality over time)

12 Tools vs. Standards
who dares to doubt that
- standards live almost forever
  - de facto lifetime comparatively high
- standards are in general not attractive for users
  - except for some XML "fans"
  - standards should be hidden and only experts need to read all documents
- standards building has some form of altruism (if big industry is not involved)
  - costs a lot of time and effort (ISO TC37/SC4 started 2002 at LREC)
  - risk of being quickly outdated
  - will a standard be accepted?
- implementing standards in tools can be expensive (moving target, complexity of standard, etc.)

15 Tools and Standards
- lifetime - Tools: comparatively short; Standards: comparatively long
- attractiveness - Tools: high; Standards: low
- creation costs - high
- maintenance costs - Tools: high; Standards: low
- short-term success - Tools: high; Standards: low (requires time)
- long-term "factor" - Tools: low; Standards: potentially high
thus tools are important for short term success
standards are important for long term success

16 all together
for CLARIN no separation - symbiosis between short-term tool support and long-term interoperability facilitation
for DOBES there seems to be a difference
- CLARIN - Tools: relevant for short and long term development (stability, generic, standards-based); Standards: relevant for interoperability on short and long term
- DOBES - Tools: clear interest in short term efficiency; Standards: relevant only for those who focus on long-term aspects

17 Landscape for TLA Team
- being archivist and providing access to stored material in DOBES (+MPI)
- being in the core of CLARIN/EUDAT infrastructure development
a few major questions:
- how can we preserve bit streams and interpretability over a long period?
- how can we give access to heterogeneous resources and also support resource creation and manipulation/enrichment?
  - have about 71 lexica (and many different annotation types)
  - 61 in the archive, 10 active in LEXUS
  - created by different tools, using different structures
  - using different categories (lexical attributes)
- how can we build "generic" tools and frameworks that can cope with heterogeneity - cannot build/maintain SW that is too specifically targeted?
- how can we build SW in a scenario where there are so many smart developers out there?
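The heterogeneity question on this slide (lexica created by different tools, with different structures and different lexical categories) can be illustrated with a small Python sketch. It is not the TLA software: every field marker, category name and sample entry below is a hypothetical assumption, chosen only to show how a "generic" tool might map tool-specific attributes onto shared categories and keep track of what it cannot map.

```python
# Hypothetical mapping from tool-specific field markers to shared category names.
# None of these identifiers come from the slides; they are illustrative only.
CATEGORY_MAP = {
    "lx": "lemma",
    "headword": "lemma",
    "ps": "partOfSpeech",
    "pos": "partOfSpeech",
    "ge": "gloss",
    "gloss_en": "gloss",
}


def normalise_entry(entry):
    """Split one lexical entry (a dict) into mapped and unmapped fields."""
    normalised = {}
    unmapped = {}
    for field, value in entry.items():
        category = CATEGORY_MAP.get(field)
        if category is None:
            # Keep unknown attributes instead of silently dropping them.
            unmapped[field] = value
        else:
            normalised[category] = value
    return normalised, unmapped


# Two invented entries, as they might come out of two different tools.
entry_a = {"lx": "kala", "ps": "n", "ge": "fish"}
entry_b = {"headword": "kala", "pos": "noun", "gloss_en": "fish", "tone": "HL"}

norm_a, rest_a = normalise_entry(entry_a)
norm_b, rest_b = normalise_entry(entry_b)
```

After normalisation both entries share the same category names, while the unmapped `tone` attribute of the second entry is preserved for manual review. Note that the values themselves ("n" vs. "noun") still differ, which is the kind of semantic mismatch a category registry is meant to address.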

18 Strategy for TLA Team
Rule 1: have a coherent archive of 34/75 TB
- i.e. convert "everything" to stable formats with explicit syntax/encoding and check quality
- otherwise long term curation and access too expensive
- costs for late curation and manual migration are extreme
Rule 2: base tool development on open and "generic" formats
- EAF for annotations turned out to be flexible enough over 10 years
- LMF is a flexible model for lexicon structures
  - "LEGO" approach makes some people frightened
  - but flexibility not even sufficient for field linguists
  - yet no agreement on an exchange format - a disaster
- ISOcat for registering semantics (is it generic enough?)
Rule 3: provide converters and interfaces for major tools/formats
- Toolbox, CLAN, Transcriber, PRAAT, other XML
- time consuming effort (cyclic flow almost impossible)
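Rules 2 and 3 can be read as a hub-style design: tools work on one generic format, and converters connect it to the external formats. The slide does not make this argument explicitly, so the following Python sketch is only an illustration under that assumption. It counts one-directional converters and assumes that import and export are both wanted for every format; the function names are hypothetical.

```python
def pairwise_converters(n_formats):
    """One-directional converters needed to link every format to every other."""
    return n_formats * (n_formats - 1)


def hub_converters(n_external_formats):
    """Converters needed when each external format only talks to one hub format
    (one importer and one exporter per external format)."""
    return 2 * n_external_formats


# The five external formats named on the slide.
external = ["Toolbox", "CLAN", "Transcriber", "PRAAT", "other XML"]

direct = pairwise_converters(len(external))   # all-to-all among the five
via_hub = hub_converters(len(external))       # each one to and from the hub
```

For the five formats listed this gives 20 direct converters against 10 via a hub, and the gap widens as formats are added. The count says nothing about information loss, which is why the slide's remark that a cyclic flow is almost impossible still applies.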

19 Is our Strategy Successful?
- very difficult to answer - what are the criteria?
- strategy allows us to be coherent with both DOBES and CLARIN
- strategy was broad enough to help establish TLA
- although LMF turned out to be very expensive for us
  - much time invested in participating in x meetings
  - little understanding from NLP hardcore guys
  - can't even claim to be 100% compliant, or?
  - some years of instability of the model, thus changes of code, thus slowing down development
  - invent own interchange format for archiving purposes (RELISH ??)
- modern lexica are complex objects with inclusions of objects (images, a/v fragments, internal and archived resources, etc.)
- finally an approach based on flexible standards will pay off, but it takes more time

20 Technology (IT) Issues
- technology innovation is moving ahead with the web as driving force
- designs and tools need to be web-ready
  - visibility from everywhere
  - access from everywhere
  - collaboration support
  - annotation (incl. relation drawing) support (there are so many knowledgeable people around)
- web-technology subject of high innovation rate
  - frequent re-design of components
  - what is the stable core to keep costs low and make code maintenance feasible?

21 Conclusions
- research communities naturally more interested in tools
- research infrastructure work needs to find a balance between short- and long-term aspects
- however, need to store data following general IT principles
  - explicit syntax, declared semantics, open formats
- need to build better tools to support standards and/or to convince companies to adopt standards
- but tool building based on standards can be more expensive and time consuming
- RELISH is very good for comparing TEI, LMF and LIFT
- RELISH is very good for comparing ISOcat and GOLD
- we need a strategy for TLA to support one (or two) exchange formats, and one needs to be based on a standard (data will go into the archive)

