e Content plus Standards: strength and limitations … LMF Nicoletta Calzolari

N. Calzolari [FLaReNet], NEERI Workshop, Helsinki, September 2009
Fostering Language Resources Network

Historical notes
After the Grosseto Workshop (1985): a turning point
In Europe, the so-called X-LEX projects: ACQUILEX, MULTILEX, GENELEX
and other lexical and text annotation/representation projects: NERC, ET-7, ET-10, DELIS
These saw the participation of many EU groups, linked by sharing similar approaches and visions
EAGLES, ISLE
Start of EAGLES: Zampolli breakfast meeting
EAGLES acronym … by Cencioni

Key issues: Do conditions exist for a standardisation effort?
Reusability as key concept (true also today):
To avoid duplication of efforts, costs, etc.
To allow synergies, integration, exchange of data, ...
To provide a model for new data creation & acquisition
Decide on feasible areas & state priorities (this is changing over time)
The feasibility of formulating consensual standards is a strong sign of maturity in the field: we can't propose standards if there are not enough results on which to base them
EAGLES was launched in '93

Some standard-related projects & initiatives
Defining standards/best practice:
TEI: creating standards for text annotation
NERC: creating the basis for bottom-up empirical harmonisation, based on extensive best-practice analysis
EAGLES: introducing a methodological model for standard work
ISLE: extending in topics & communities
LIRICS: preparing for international standards
ISO/TC 37/SC 4/WG 4: going to international standards, LMF … & many others
NEDO: porting to Asian languages
MultilingualWeb: new Thematic Network for relation with W3C

Some standard-related projects & initiatives (cont.)
Using standards/best practice:
MULTEXT & MULTEXT-EAST: applying to lexicons & text annotation, with EAGLES-compliant specs
PAROLE-SIMPLE lexicons: morphology, syntax & semantics: operational specs & constraints between lexical descriptors (12 languages)
EuroWordNets: a de-facto best practice
BOOTStrep: terminologies in the Bio-domain: BioLexicon
KYOTO: in the environment domain
PANACEA: in a platform for LR acquisition

Some standard-related projects & initiatives (cont.)
Promoting standards/best practice:
INTERA: for an EU repository of language data
ENABLER: to link EU & national initiatives
ELRA: the EU LR association
LanguageGrid: Japanese infrastructure for LR services
CLARIN: LR standards for the Humanities & Social Sciences
FLaReNet: LR standards for Human Language Technologies
T4ME NoE: for an Open Resource Infrastructure

Main Results in Lexicon & Corpus WGs: First Phase
Standard for morphosyntactic encoding of lexical entries, in a multi-layered structure, with applications for all the EU languages
Standard for subcategorisation in the lexicon: a set of standardised basic notions using a frame-based structure
Proposal for a basic set of notions in lexical semantics: focus on requirements of Information Systems and MT
Corpus Encoding Standard (CES) from TEI
Standard for morphosyntactic annotation of corpora, to ensure compatibility/interchangeability of concrete annotation schemata
Preliminary recommendations for syntactic annotation of corpora
Dialogue annotation, for integration of written and spoken annotation

Content vs. Format/Representation
Work on lexical description deals with two aspects:
Linguistic description of lexical items (content)
Formal representation of lexical descriptions (format)
EAGLES concentrated on linguistic content, not disregarding the formal representation of the proposal
TEI: more on format/representation issues
In LMF: on the abstract meta-model

Flexibility in the Recommendations, e.g. Morphosyntax
Recommendation   Level   Information type
Obligatory       L-0     Part-of-Speech
Recommended      L-1     Morphosyntactic agreement features
Optional         L-2     Language-specific (or refined) features
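The three recommendation levels can be read as a simple conformance check. The following is a minimal sketch, not part of the EAGLES recommendations themselves: the feature names (`pos`, `gender`, `number`, `declension_class`) and the `check_entry` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical feature inventories: the slide names only the three levels,
# not the concrete features, so these sets are illustrative assumptions.
L0_OBLIGATORY = {"pos"}                 # L-0: Part-of-Speech
L1_RECOMMENDED = {"gender", "number"}   # L-1: agreement features


@dataclass
class Entry:
    lemma: str
    features: dict = field(default_factory=dict)


def check_entry(entry):
    """Sort an entry's features into the three levels and report gaps."""
    names = set(entry.features)
    missing_obligatory = sorted(L0_OBLIGATORY - names)
    return {
        # Only L-0 decides validity; L-1 gaps are reported, not rejected.
        "valid": not missing_obligatory,
        "missing_obligatory": missing_obligatory,
        "missing_recommended": sorted(L1_RECOMMENDED - names),
        # Anything else is treated as an L-2 language-specific feature.
        "optional_features": sorted(names - L0_OBLIGATORY - L1_RECOMMENDED),
    }


entry = Entry("casa", {"pos": "noun", "gender": "feminine",
                       "number": "singular", "declension_class": "1"})
report = check_entry(entry)
print(report)
```

In this reading an entry without a part of speech is rejected, one without agreement features is accepted with a warning, and language-specific features pass through untouched.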

Strengths (from EAGLES-ISLE)
Standardisation as a necessary component of any strategic programme to create a coherent market
Leading industrials & academics participated (> 150 EU groups)
Bottom-up, community-created standards
To avoid wasting time reinventing basic/consolidated knowledge
May be true also for many humanities users, not interested in debates on specific lexical approaches
Work otherwise duplicated among many projects is done just once, in a collaborative manner (overall cost-effectiveness)
Allows the field to be more competitive:
Concentrate efforts on innovative areas
Engage in new/advanced technology

Why Standards for Language Resources? (from EAGLES-ISLE)
To ensure:
interoperability of systems (& data), through compatible interfaces
reusability and integrability of components
training based on consensual technical specifications and models (gold standards)
evaluation & validation based on agreed criteria
transition from prototypes to HLT products
Important for workflows; essential for a LR Infrastructure; for evaluation campaigns

The applications: requirements for systems & enabling technologies
Systems: Machine Translation, Information Extraction, Information Retrieval, Summarisation, Natural Language Generation
Enabling technologies: Word Clustering, Multiword Recognition + Extraction, Word Sense Disambiguation, Proper Noun Recognition, Parsing, Coreference, …
For HLT, knowledge of application requirements is essential

The Multilingual ISLE Lexical Entry (MILE)
General methodological principles (from EAGLES)
Basic requirements for the design of the MILE:
Discover and list the (maximal) set of basic notions needed to describe the MILE (up to which level is standardisation feasible?): granularity
The leading principle: the edited union of existing lexicons/models (redundancy is not a problem)
Modular & layered
Allow for underspecification (& hierarchical structure)

MILE: Modularity
The building-block model
[Diagram: Lexical entry 1, Lexical entry 2 and Lexical entry 3 built from shared Lexical Objects: syntactic frame, phrase, slot, Syn feature, Sem feature]
Independent, but interlinked, modules make it possible to express different dimensions of lexical entries

MILE Lexical Classes & Lexical Objects vs ISO LMF
Lexical Classes as the main building blocks of the lexical architecture
Building blocks allow two kinds of reusability:
intra-lexicon reusability (within the same lexicon)
inter-lexicon reusability (among different lexicons)
Define an ontology of lexical objects:
represent lexical notions such as semantic unit, syntactic feature, syntactic frame, semantic predicate, semantic relation, synset, etc.
specify the relevant attributes
define the relations with other classes
hierarchically structured
Status: Done in LMF; To be done … (in ISOCat?)
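The two kinds of reusability can be illustrated with shared object references. This is a minimal sketch only: the class names `SyntacticFrame` and `LexicalEntry`, their fields, and the example lemmas are assumptions for illustration, not the MILE or LMF class definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntacticFrame:
    """A reusable lexical object (hypothetical, simplified)."""
    frame_id: str
    slots: tuple


@dataclass
class LexicalEntry:
    lemma: str
    frame: SyntacticFrame   # a reference to a shared object, not a copy


# One building block, defined once.
transitive = SyntacticFrame("frame_transitive", ("subject:NP", "object:NP"))

# Intra-lexicon reuse: two entries of the same lexicon share the frame.
italian = [LexicalEntry("mangiare", transitive),
           LexicalEntry("leggere", transitive)]

# Inter-lexicon reuse: an entry of a different lexicon points to it too.
english = [LexicalEntry("eat", transitive)]

# All three entries resolve to a single frame object.
shared = {id(e.frame) for e in italian + english}
print(len(shared))
```

Because entries hold references rather than copies, a correction to the frame is made in one place and is visible to every entry that uses it.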

The MILE Data Categories
User-adaptability and extensibility
[Diagram: instances (instance_of) of MLC:SemanticFeature, split into Core and UserDefined: HUMAN, ARTIFACT, EVENT, ANIMAL, GROUP, AGE, MAMMAL]
Status: OK in ISOCat

MILE Lexical Data Category Registry
A library of pre-instantiated objects
Enables modular specification of lexical entities:
eliminate redundancy
identify lexical entries or sub-entries with shared properties
create ready-to-use packages that can be combined in different ways
Can be used off the shelf or as a departure point for the definition of new or modified categories
ISOCat; ISO Profiles
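A registry that can be used off the shelf and also extended by users might look like the sketch below. The `DataCategoryRegistry` class and its methods are hypothetical, and the split of the category names into core and user-defined is an assumption: the flattened slide does not show which names belong to which group.

```python
class DataCategoryRegistry:
    """Hypothetical registry holding core and user-defined data categories."""

    def __init__(self, core):
        self._core = set(core)
        self._user = set()

    def define(self, name):
        """Add a user-defined category; core categories cannot be redefined."""
        if name in self._core:
            raise ValueError(f"{name} is already a core category")
        self._user.add(name)

    def origin(self, name):
        """Return 'core' or 'user-defined'; raise KeyError if unknown."""
        if name in self._core:
            return "core"
        if name in self._user:
            return "user-defined"
        raise KeyError(name)


# Assumed split, for illustration only.
registry = DataCategoryRegistry(
    core=["HUMAN", "ARTIFACT", "EVENT", "ANIMAL", "GROUP"])
registry.define("MAMMAL")

print(registry.origin("HUMAN"), registry.origin("MAMMAL"))
```

Keeping the core set closed while leaving the user set open is one way to get both a shared vocabulary and extensibility.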

ISO LMF: Lexical Markup Framework
Designed to accommodate many models of lexical representation
Its pros:
Meta-model: abstract high-level specification (ISO 24613)
Data Category Registry: low-level specifications (ISO 12620)
Not a monolithic model, rather a modular framework
The LMF library provides the hierarchy of lexical objects (with structural relations among them)
The Data Category Registry provides a library of descriptors to encode linguistic information associated with lexical objects (N.B. Data Categories can also be user-defined)

ISO LMF
Structural skeleton, with the basic hierarchy of information in a lexical entry + various extensions
Modular framework
LMF specs comply with UML modelling principles
An XML DTD allows implementation
Builds on EAGLES/ISLE, LIRICS
The field is mature
New initiatives: NEDO (Asian languages), NICT Language Grid Service Ontology, ICT KYOTO, LexInfo, …

Mapping experiment (from Monica Monachini)
Entries from major existing lexicons mapped to LMF
Major best practices:
OLIF
PAROLE/SIMPLE
LC-Star (Speech Lexicon)
WordNet, EuroWordNet
FrameNet
BDef: formal database of lexicographic definitions derived from the Explanatory Dictionary of Contemporary French
To prove that the model is able to represent many best practices
To test the expressive potentialities, the adequacy of the architectural model & linguistic objects

BioLexicon (BL): SIMPLE model & ISO-LMF standard (from Monica Monachini)
A unique large-scale computational lexicon in the biomedical domain, in terms of coverage & typology of information
Populated with info from available biomedical resources
Semi-automatically populated from corpora: population toolkit available
Including both domain-specific & general-language words
Rich linguistic information ranging over different levels of linguistic description
Conformant to international lexical representation standards
Designed to meet bio-Text Mining requirements

Sense Representation
[Diagram: Sense activate_2, linked to Synset (activate), PredicativeRepresentation, SemanticFeature (SF_chemistry, SF_process), Collocation (S_protein), SemanticRelation (is_a: [SenseID], Typical_of: [SenseID])]

KYOTO SYSTEM (from Piek Vossen)
[Architecture diagram. Components shown: Source Documents; Linear MAF/SYNAF; Linear SEMAF; Linear Generic TMF; Linear Generic FACTAF; Term extraction (Tybot); Semantic annotation; Fact extraction (Kybot); Domain editing (Wikyoto); Wordnet and Domain Wordnet (LMF API); Ontology and Domain ontology (OWL API); Concept User; Fact User]

A common representation format: WordNet-LMF (from Monica Monachini)
[Class diagram with cardinalities. Classes shown: LexicalResource, GlobalInformation, Lexicon, LexicalEntry, Lemma, Sense, Synset, Definition, Statement, SynsetRelations/SynsetRelation, MonolingualExternalRefs/MonolingualExternalRef, SenseAxes/SenseAxis, InterlingualExternalRefs/InterlingualExternalRef, Meta]
Data Categories
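A skeleton document using the class names from the diagram can be sketched with Python's standard `xml.etree.ElementTree`. The element names come from the slide; the nesting chosen here, every attribute name (`writtenForm`, `synset`, `relType`, …) and all identifiers and values are assumptions for illustration and should be checked against the WordNet-LMF DTD before use.

```python
import xml.etree.ElementTree as ET

# Element names follow the slide; nesting, attributes and values are assumed.
root = ET.Element("LexicalResource")
ET.SubElement(root, "GlobalInformation", label="illustrative example")

lexicon = ET.SubElement(root, "Lexicon", language="en")

# A lexical entry with its lemma and one sense pointing at a synset.
entry = ET.SubElement(lexicon, "LexicalEntry", id="le_activate")
ET.SubElement(entry, "Lemma", writtenForm="activate", partOfSpeech="v")
ET.SubElement(entry, "Sense", id="activate_2", synset="syn_activate")

# The synset carries the definition and its relations to other synsets.
synset = ET.SubElement(lexicon, "Synset", id="syn_activate")
ET.SubElement(synset, "Definition", gloss="placeholder gloss")
relations = ET.SubElement(synset, "SynsetRelations")
ET.SubElement(relations, "SynsetRelation",
              target="syn_change", relType="is_a")

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The sense refers to the synset by identifier rather than containing it, which lets several entries share one synset.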

Centralized WordNet DC Registry (from Monica Monachini)
A list of 85 semantic relations, resulting from a mapping of the KYOTO WordNet grid
Inter-WN and Intra-WN relations

WordNet-LMF multilingual level: Cross-lingual relations (from Monica Monachini)
Example synsets: SWN 09686541-n, IWN 00001251-n, WN3.0 13480848-n
SenseAxis:
groups monolingual synsets corresponding to each other and sharing the same relations to English
specifies the type of correspondence
link to ontology/(ies)
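The grouping role of a sense axis can be sketched as a small lookup structure. The synset identifiers are copied from the slide; that they belong to a single axis, the `SenseAxis` field names, the relation label `eq_synonym` and the helper functions are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class SenseAxis:
    """Hypothetical, simplified view of a WordNet-LMF sense axis."""
    axis_id: str
    rel_type: str            # type of correspondence (label is assumed)
    synsets: dict            # wordnet label -> monolingual synset ID
    ontology_ref: str = ""   # optional link to an ontology concept


def build_index(axes):
    """Map every (wordnet, synset ID) pair to the axis that groups it."""
    index = {}
    for axis in axes:
        for wordnet, synset_id in axis.synsets.items():
            index[(wordnet, synset_id)] = axis
    return index


def translate(index, wordnet, synset_id, target_wordnet):
    """Return the corresponding synset in another wordnet, or None."""
    axis = index.get((wordnet, synset_id))
    if axis is None:
        return None
    return axis.synsets.get(target_wordnet)


axes = [SenseAxis("sa_001", "eq_synonym",
                  {"SWN": "09686541-n",
                   "IWN": "00001251-n",
                   "WN3.0": "13480848-n"})]
index = build_index(axes)
print(translate(index, "IWN", "00001251-n", "SWN"))
```

Going through the axis means no wordnet needs a direct link to every other wordnet; each only needs its own link to the shared axis.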

LexInfo & Previous Models (from Paul Buitelaar)
LingInfo: modelling morphosyntactic decomposition of (complex) terms [Buitelaar et al. 2006]
LexOnto: capturing syntactic behaviour and syntax-semantics links [Cimiano et al. 2007]
Lexical Markup Framework (LMF): ISO-standardised model for representing machine-readable lexica (agnostic about connection with ontology) [Francopoulo et al. 2007]
LexInfo: building on LMF as a core, develop a model which subsumes LingInfo and LexOnto for flexibly associating linguistic information with ontologies [Buitelaar, Cimiano, Haase, Sintek 2009]

LexInfo: Lexical Entry; Sub-Categorization Frames (from Paul Buitelaar)

Beyond MILE: future work
MILE Lexical Model oriented towards an Open Distributed Lexical Infrastructure:
Lexical Information Servers for multiple access to lexical information repositories
Enhance user-adaptivity, resource sharing, and cooperative creation of LR & LT
Develop integration and interchange tools

Some steps for a new generation of LRs
From huge efforts in building static, large-scale, general-purpose LRs
To dynamic LRs rapidly built on demand, tailored to specific user needs
From closed, locally developed and centralized resources
To LRs residing over distributed places, accessible on the web, choreographed by agents acting over them
From Language Resources To Language Services
BUT: need of tools to make this vision operational & concrete
Interoperability

Lexical WEB & Content Interoperability
As a critical step for semantic mark-up in the SemWeb
[Diagram: ComLex, SIMPLE, WordNets, FrameNet, NomLex, Lex_x, Lex_y connected through LMF with intelligent agents; Global WordNet GRID, BioLexicon, SIMPLE-WEB]
Standards for Interoperability: enough?

A new paradigm of R&D in LRs & LT
Distributed Language Services
Open & distributed infrastructures for LRs & LT
Adopting the paradigm of accumulation of knowledge so successful in more mature disciplines, based on sharing LRs & tools
Ability to build on previous achievements, with results accessible to various systems, allowing effective cooperation of many groups on common tasks:
Exchange and integrate information across repositories
Create new resources on the basis of existing ones
Compose new services on demand …
A new scenario implying:
content interoperability standards
development of architectures enabling accessibility
supra-national cooperation

A few issues for discussion: content, guidelines, tools, priorities, ...
For Semantic Web and content interoperability: is the field mature enough to converge also at the semantic/conceptual level (e.g. to automatically establish links among different languages)?
For the standards to have impact, ensure their usability & gain industry support by focusing on requirements of industrial applications
To have Guidelines which are a usable product (to assist in creation or adaptation of lexicons, to share resources, …)
Facilitate acceptance of the standards by providing an open-source reference implementation platform & tools, related web services and test suites
Relation with the Spoken language community
Define further steps necessary to converge on common priorities

Limits observed & needs of further work
For usability & operability of LMF:
Data Categories (DC) & others:
From Japanese NEDO: DCs not defined in LMF & LMF not operational
Asian, African DCs
Need of DCs organised in profiles (easy to use): IsoCat & Profiles
Need of an ontology of DCs with structure/dependencies, and constraints
Otherwise the model remains too abstract, and doesn't say anything on how to implement the different layers concretely
Link with Ontologies: relations Lexicons-Ontologies
Need of easy, user-friendly guidelines
Need of tools to make it operational, also for creating standard-compliant resources: more important than the model!
More dissemination, also with industry
Linguists may be (rightly for certain purposes) not interested
Younger colleagues not aware of the past work on standards
Need of operational definitions of interoperability
Need of stimuli also from the EC to produce standard-compliant resources (unless differently motivated)

Strengths
Good set of methodological principles: granularity of basic notions, …
Many languages already compliant with EAGLES morpho-syntax, etc.
Many projects today using LMF
Unified Lexicon experiment between Speechdat & Parole, at ELRA (possible because EAGLES compliant)
Web services to access LRs based on standards
Web-based platforms for LR integration
An open infrastructure of LRT needs standards
New topics being constantly added: Time, Space, …

Future requirements & planning
To make LMF usable and operational:
LMF User Guidelines with examples
Mapping of commonly used lexicons into LMF
Data categories for LMF lexicons
Tools related to LMF, with particular reference to the Lexus tool
Need to address another layer:
The ontological layer in a lexicon
How lexicons and ontologies are linked and information mapped from each other
An open space in a wiki environment:
to store guidelines, examples
to allow broad discussion on these topics
to ease dissemination of LMF

FLaReNet Mission: structure the area of LR & LT of the future
Worldwide Forum for LRs & LTs
Consolidate methods, approaches, common practices, architectures
Integrate so far partial solutions into broader infrastructures
A roadmap: a plan of coherent actions as input to policy development
For the EU, national organisations & industry
As a model for the LRs/LTs of the next years
Strengthening the language product market, e.g. for new products & innovative services
Identifying areas where consensus is achieved/emerging vs. areas where more discussion & testing is required
Indicating priorities
221 Individual Subscribers; 81 Institutional Members from 31 countries

Some results from FLaReNet Vienna Forum: Interoperability Session
Promote knowledge of standards in the community
Define specifications for tools supporting standards
Support workshops/tutorials on how to use standards
Start focusing on standards for more consensual areas & develop for these a toolkit that can be used off the shelf, so that we can move on to tackling the larger problems
Identify best practices in standards with respect to usability, usefulness, viability, outreach, etc.
Adopt a model for tool & resource development based on open & collaborative development, where the community as a whole contributes components, modules, etc. to a common framework

Some results from FLaReNet Vienna Forum: International Cooperation
Standards & Interoperability: topics for cooperation
A metadata catalogue should involve every party
Common repositories for LRT, universally & easily accessible
Try to connect ongoing work done by many groups
A shared repository of data formats and annotations, where to find the most frequently used and preferred schemes: a major help to achieve standardisation
For a new world-wide language infrastructure:
Create the means to plug together different LR & LT, in a web-based resource and technology grid
Access to LRT is critical: it involves, and has impact on, all the community
With the possibility to easily create new workflows
Create conditions to easily share and re-use technologies, to have more open (source) tools available for use also by under-funded groups

FLaReNet & the ORI (Open Resource Infrastructure) … at LREC
Special Highlight: Contribute to building the LREC2010 Map!
The time is ripe to launch an important initiative, the LREC2010 Map of Language Resources, Technologies and Evaluation.
The Map will be a collective enterprise of the LREC community, as a first step towards the creation of a very broad, community-built, Open Resource Infrastructure.
First in a series, it will become an essential instrument to monitor the field and to identify shifts in the production, use and evaluation of LRs and LTs over the years.
When submitting a paper (< 900!), from the START page fill in a very simple template to provide essential information about resources (in a broad sense, also technologies, standards, evaluation kits) either used for the work described or a new result of your research.
The Map will be disclosed at LREC, where some event(s) will be organised around this initiative.

Join FLaReNet!
We invite all interested players in the field to express their interest in becoming part of the Network
How to join? To be part of the FLaReNet Network, fill in the form available on the project website (

