… and Projects. Language Resources (lexicons, corpora, ontologies, …)

1 Language Resources (lexicons, corpora, ontologies, …)
Standards and language technologies (cont.) … and Projects
Nicoletta Calzolari, Istituto di Linguistica Computazionale - CNR - Pisa, with many others at ILC
Dottorato (PhD course), Pisa, May 2009

2 SIMPLE Model for a BioLexicon
Design a representational model for a BioLexicon, a comprehensive lexical resource:
  able to integrate terminological, lexical and ontological information
  compatible with international HLT standards (e.g. ISO)
  able to meet the domain-specific requirements
Implement a BioLexicon database, a container of lexical objects:
  to be filled with data provided by “populators” (EBI, UoM & CNR-ILC)
  able to be automatically incremented with new terms and linguistic information extracted from texts
From Valeria Quochi.

3 BioLexicon Building cycle
Term Repository: gather terms (EBI)
Bio-Lexicon Population: variants; syntactic info of terms (UoM)
Bio-Lexicon: conceptual model and physical DB (ILC)
Bio-events: extraction of bio-events (ILC)
Terminology to Ontology (Jena/Rennes/EBI)
From Valeria Quochi.

4 The BioLexicon: where from
Incremental population process.
Diagram labels:
  Existing repositories: chemical compounds, species names, diseases, enzymes, genes/proteins
  Subclustering of term variants
  New gene/protein names; MEDLINE; Named Entity Recognition; term mapping by normalisation
  Verbs, nouns, adjectives, adverbs (variants, inflected forms, derivative relations, ...)
  Manual curation; subcategorisation extraction; linguistic pre-processing; syntax-semantics mapping
  Manual annotation of a bio-event corpus; bio-event extraction
From Simonetta Montemagni.

5 BioLexicon Model: High-level lexical objects, Data Categories
Syntax; Semantics
DC selection, e.g.
  <feat att="POS" val="VVZ">
  <feat att="ConfScore" val="0.9">
  <feat att="source" val="UNIPROT" ……
From Valeria Quochi.
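The attribute/value style of the <feat> elements can be read back into a plain mapping. The following is only a minimal sketch: the wrapping element name LexicalObject and the self-closing form of the <feat> tags are assumptions made for the sake of well-formed XML, since the slide shows only the <feat> fragments.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element: the slide shows only the <feat> fragments,
# not the lexical object that contains them.
SNIPPET = """
<LexicalObject>
  <feat att="POS" val="VVZ"/>
  <feat att="ConfScore" val="0.9"/>
  <feat att="source" val="UNIPROT"/>
</LexicalObject>
"""


def read_features(xml_text):
    """Collect every <feat att=... val=...> pair into a dictionary."""
    root = ET.fromstring(xml_text)
    return {feat.get("att"): feat.get("val") for feat in root.iter("feat")}


features = read_features(SNIPPET)
print(features)
```

Values stay as strings here; converting ConfScore to a float would be a decision for the consuming application.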

6 GeneRegOnto – BioLex Concepts to Predicates
From Valeria Quochi.

7 NF-AT positively regulates IL2
Diagram labels:
  Ontology: Regulation, PositiveProtein, NegativeProtein, Transcription Factor, Protein; relations regulates, isregulatedby
  Lexicon: regulate, regulation, regulator, regulatee; PredRegulate, Arg0Regulate, Arg1Regulate; NF-AT, IL2
  Legend: bio-specific qualia relations; bio semantic entry; predicative argument structure; bio semantic roles; bio event concept; bio entity concept; bio relations
From Valeria Quochi.

8 BioLexicon
Diagram labels: SynBehaviour Lesion1; Sense Lesion1; Predicate LESION; SubcatFrame pp-of; SynArg Arg0 pp-of; SemArg Arg0 Pat; Activity; Protein
The pattern “lesion of PROTEIN” is not stored in the lexicon, but it can be calculated by accessing information scattered over various lexical objects (i.e. the syntactic unit lesion heads a pp-of corresponding to the patient argument, restricted by the ontological node PROTEIN).
All lexical items labelled as PROTEIN are candidates to fill this argument slot: “lesion of OmpC”, “lesion of OmpR”, etc. are all admitted instances of this “predicate”/pattern.
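How a pattern such as “lesion of PROTEIN” might be calculated from separate lexical objects can be sketched as follows. This is an illustration under stated assumptions, not the BioLexicon database schema: all dictionary keys and field names are hypothetical, and the mouse/SPECIES entry is an invented distractor; only the values Lesion1, LESION, pp-of, Arg0, Pat, PROTEIN, OmpC and OmpR come from the slide.

```python
# Toy lexicon fragment with hypothetical keys and field names.
LEXICON = {
    "syntactic_behaviours": {"Lesion1": {"lemma": "lesion", "subcat_frame": "pp-of"}},
    "subcat_frames": {"pp-of": {"syn_arg": "Arg0", "preposition": "of"}},
    "predicates": {
        "LESION": {
            "sense": "Lesion1",
            "sem_args": {"Arg0": {"role": "Pat", "restriction": "PROTEIN"}},
        }
    },
    # Invented distractor: "mouse" is labelled with a non-PROTEIN node.
    "ontology_labels": {"OmpC": "PROTEIN", "OmpR": "PROTEIN", "mouse": "SPECIES"},
}


def argument_restriction(lexicon, predicate_id):
    """Follow predicate -> sense -> subcat frame -> semantic argument."""
    predicate = lexicon["predicates"][predicate_id]
    behaviour = lexicon["syntactic_behaviours"][predicate["sense"]]
    frame = lexicon["subcat_frames"][behaviour["subcat_frame"]]
    sem_arg = predicate["sem_args"][frame["syn_arg"]]
    return behaviour["lemma"], frame["preposition"], sem_arg["restriction"]


def derive_pattern(lexicon, predicate_id):
    """Build the abstract pattern, e.g. lemma + preposition + ontological node."""
    lemma, preposition, restriction = argument_restriction(lexicon, predicate_id)
    return f"{lemma} {preposition} {restriction}"


def candidate_instances(lexicon, predicate_id):
    """List the pattern instantiated with every term carrying the required label."""
    lemma, preposition, restriction = argument_restriction(lexicon, predicate_id)
    return [
        f"{lemma} {preposition} {term}"
        for term, label in sorted(lexicon["ontology_labels"].items())
        if label == restriction
    ]


print(derive_pattern(LEXICON, "LESION"))
print(candidate_instances(LEXICON, "LESION"))
```

The point of the sketch is that nothing stores the pattern itself: it is assembled on demand from the syntactic behaviour, the subcategorisation frame and the semantic argument restriction.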

9 Good mapping of Relations
OBO Relations        Relations from Extended Qualia Structure
isA                  is_a
partOf               is_a_part_of
hasPart              has_as_part
GrainOf              …
hasGrain             …
componentOf          …
hasComponent         …
properPartOf         …
hasProperPart        …
locatedIn            …
locationOf           …
containedIn          …
contains             contains
adjacentTo           ?
derivesFrom          derived_from
precededBy           ?
participatesIn       ?
hasParticipant       ?
agentOf              …
hasAgent             ?
functionOf           is_the_activity_of
hasFunction          …
instanceOf           …
Qualia roles in the diagram: Agentive, Formal, Telic, Constitutive
GALEN: ‘reciprocal’ instead of ‘inverse’

10 Enhancing Semantic Relations
Source_Sense        Rel Type            Target_Sense
Phosphoglycolate    BelongsToSpecies    Mouse
Diagram labels: phosphoglycolate; BelongsToSpecies; mouse
From Valeria Quochi.

11 How to link Bio-Ontology and Bio-Lexicon
Place(s) of semantics in BootStrep:
  the Bio-Ontology holds domain-specific as well as general semantics (in terms of classes and relations between classes)
  the lexicon model comes with a semantic layer based on a linguistic ontology (SIMPLE-CLIPS Ontology)
Questions:
  What is the relation between the bio-ontology and the linguistic ontology?
  Do they overlap? What is the overlap/intersection? What is the difference?
  Is a mapping possible? What could a mapping look like?
Aim: bringing lexical semantics and ontological semantics together?

12 the BioLexicon Model & Standards
The Bio-Lexicon is based on the MILE metamodel and on the more recent ISO proposal of a Lexical Markup Framework (LMF).
Data Categories are drawn as far as possible from already existing repositories and standards (e.g. morphosyntactic data categories).
There is the need, however, to define a set of Data Categories specific to the biology domain (e.g. semantic roles and relations).

13 ISO Meta-model & Data Categories
An ISO standard for NLP lexica.
Definition of the Lexical Markup Framework: a general & abstract meta-model & a set of structural nodes relevant for linguistic description.
Objectives:
  design of the abstract lexical meta-model
  definition of the common set of related Data Categories
The field is mature.
From Monica Monachini.

14 ISO - LMF
Specifically designed to accommodate as many models of lexical representation as possible.
Its pros:
  Meta-model: a high-level specification (ISO 24613)
  Data Category Registry: low-level specifications (ISO 12620)
Not a monolithic model, rather a modular framework:
  the LMF library provides the hierarchy of lexical objects (with structural relations among them)
  the Data Category Registry provides a library of descriptors to encode linguistic information associated with lexical objects (N.B. Data Categories can also be user-defined)
  Core package: structural skeleton to represent the basic hierarchy of a lexicon
  Modular extensions: required to describe additional classes and relations
We moved into the realm of ISO LMF, which naturally lends itself to being used as the basis for designing a standard representation framework. Its main strength is the clear cut between high-level and low-level specifications.
It is not a monolithic model but a modular framework (many metaphors have been used, such as Lego bricks or IKEA-style kitchen units). LMF provides a library of objects that can be combined as one wants, provided that the structural relations are maintained; this provides the skeleton. The DCR provides the linguistic descriptors used to decorate the structural skeleton; they can be drawn from the standard ISO DCR or be user-defined.

15 ISO LMF – Lexical Markup Framework
Builds also on EAGLES/ISLE.
Structural skeleton, with the basic hierarchy of information in a lexical entry, plus various extensions.
LMF specs comply with UML modelling principles; an XML DTD allows implementation.
Here is a very high-level view of LMF as a set of packages, with the core package providing the basic hierarchy of the lexicon and the additional components used to describe additional classes and relations.
Diagram labels: ICT KYOTO; LIRICS; NEDO Asian Lang.; NICT Language-Grid Service Ontology

16 LMF: NLP Extension for Semantics

17 Lexical Entry
Diagram labels: LexicalEntry LEprotein; Lemma L_protein; SyntacticBehaviour SB_protein; RepresentationFrame RF_protein (DC: writtenForm = protein)
  <LexicalEntry rdf:ID="LEprotein">
    <hasSyntacticBehaviour rdf:resource="../../#SB_protein"/>
    <hasLemma>
      <Lemma rdf:ID="L_protein"/>
      <hasRepresentationFrame>
        <RepresentationFrame rdf:ID="RF_protein"/>
      </hasRepresentationFrame>
    </hasLemma>
  </LexicalEntry>

18 Event Representation through SemanticPredicate
SemanticPredicate SP_regulate
  SemanticArgument SP_TF_protein (DC: role=agent)
  SemanticArgument SP_Target Gene (DC: role=patient)

19 PredicativeRepresentation
Diagram labels: Sense activate_2; Synset activate; SemanticFeature SF_chemistry, SF_process; Collocation; SemanticRelation is_a: [SenseID], Typical_of: [SenseID]; S_protein; Sense Representation; PredicativeRepresentation
  <Sense rdf:ID="activate_2">
    <belongsToSynset rdf:resource="#activate"/>
    <hasSemanticRelation rdf:resource="#is_a_1"/>
    <hasSemanticRelation rdf:resource="#has_as_part_1"/>
    <hasSemanticRelation rdf:resource="#object_of_the_activity_1"/>
    <hasSemanticFeature rdf:resource="#SF_chemistry"/>
    <hasSemanticFeature rdf:resource="#SF_process"/>
  </Sense>

20 Example of Semantic Relation
Diagram labels: Sense S_cox15; SemanticRelation is_in; Sense S_chromosome19
  <SemanticRelation rdf:ID="is_in">
    <hasSourceSense>
      <Sense rdf:ID="S_cox15">
        <id rdf:datatype="">S_cox15</id>
      </Sense>
    </hasSourceSense>
    <hasTargetSense>
      <Sense rdf:ID="S_chromosome19">
        <id rdf:datatype="">S_chromosome19</id>
      </Sense>
    </hasTargetSense>
    <relationName rdf:datatype="">is_in</relationName>
  </SemanticRelation>
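A semantic relation of this shape reduces to a (source sense, relation name, target sense) triple. The sketch below shows one way to read it with Python's standard library. It assumes the standard RDF namespace declaration, which the slide fragment does not show, and it simplifies the Sense elements by dropping their id children.

```python
import xml.etree.ElementTree as ET

# Assumption: the rdf prefix is bound to the standard RDF namespace.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

DOC = f"""
<SemanticRelation xmlns:rdf="{RDF_NS}" rdf:ID="is_in">
  <hasSourceSense><Sense rdf:ID="S_cox15"/></hasSourceSense>
  <hasTargetSense><Sense rdf:ID="S_chromosome19"/></hasTargetSense>
  <relationName>is_in</relationName>
</SemanticRelation>
"""


def relation_triple(xml_text):
    """Return (source sense ID, relation name, target sense ID)."""
    root = ET.fromstring(xml_text)
    rdf_id = f"{{{RDF_NS}}}ID"  # ElementTree stores rdf:ID as {namespace}ID
    source = root.find("hasSourceSense/Sense").get(rdf_id)
    target = root.find("hasTargetSense/Sense").get(rdf_id)
    name = root.findtext("relationName")
    return (source, name, target)


print(relation_triple(DOC))
```

The same triple form matches the Source_Sense / Rel Type / Target_Sense table used for relations such as BelongsToSpecies.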

21 Example: How to encode Wordnet type of Info in LMF

22 XML based Abstract Lexicon Interchange Format
Mapping exercise: entries from existing lexicons have been mapped to LMF to prove that the model is able to represent many best practices and achieve unification.
Major best practices:
  OLIF
  PAROLE/SIMPLE
  LC-Star
  WordNet - EuroWordNet
  FrameNet
  BDef: formal database of lexicographic definitions derived from the Explanatory Dictionary of Contemporary French
  … others on the way …
From Monica Monachini.

23 Lexical WEB & Content Interoperability -> ‘Standards’
As a critical step for semantic mark-up in the SemWeb.
Diagram labels: NomLex; ComLex; WordNets; SIMPLE; FrameNet; Lex_x; Lex_y; LMF; with intelligent agents
Standards for Interoperability. Enough??

24 Need of tools to make this vision operational & concrete
New prototype “LeXFlow”: a web-based collaborative environment for semi-automatic management/integration of lexical resources
  enabling interoperability of distributed lexical resources accessed by different types of agents
  addressing semi-automatic integration of computational lexicons, with focus on linking and cross-lingual enrichment of distributed LRs
Case study: cross-fertilization between Italian and Chinese WordNets
From Language Resources To Language Services

26 Our WN case study
ItalWordNet (Roventini et al., 2003)
Academia Sinica Bilingual Ontological WordNet (Sinica BOW, Huang et al., 2004)
Both connected to Princeton WordNet (although to different versions)
Same set of semantic relations (the EWN ones)

27 Architecture for cooperative integration of lexicons
Diagram labels:
  Coordination: Agent Role1, Agent Role2, Agent Role3, Agent Role4; Web service interface
  Application: Simple-Wordnet Relation Calculator, MultiWordnet Relation Calculator, ILI Mapper, Relation Mapper; Web service interface
  Data: Italian Simple, Italian Wordnet, Chinese Wordnet

28 Basic assumptions behind MWN …
Interlingual level: an interlingua provides an indirect linkage between different WordNets. This is the Interlingual Index (ILI), an unstructured version of WordNet used in EuroWordNet.
Each synset in a wordnet WN_A is linked to at least one record of the ILI by means of a set of relations (eq_synonym, eq_near_synonym, …).
Synset correspondence: if a synset S_A and a synset S_B point to the same ILI record, they are correspondent.
Relation correspondence: if there are two synsets in WN_A and a relation between them, the same relation holds between the corresponding synsets in WN_B.
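The two correspondence assumptions can be sketched in a few lines. This is a toy illustration, not the LeXFlow implementation: the synset identifiers, ILI record names and the has_mero_part relation label are made up, loosely inspired by the strada/curva example used in the case study.

```python
from collections import defaultdict

# Toy data with hypothetical identifiers: each synset maps to the set of
# ILI records it is linked to.
WN_A_ILI = {"it:strada": {"ili:road"}, "it:curva": {"ili:bend"}}
WN_B_ILI = {"zh:dao_lu": {"ili:road"}, "zh:wan": {"ili:bend"}}
WN_A_RELATIONS = [("it:strada", "has_mero_part", "it:curva")]


def synset_correspondences(ili_a, ili_b):
    """Synset correspondence: S_A and S_B correspond if they share an ILI record."""
    by_record = defaultdict(set)
    for synset_b, records in ili_b.items():
        for record in records:
            by_record[record].add(synset_b)
    result = {}
    for synset_a, records in ili_a.items():
        result[synset_a] = set()
        for record in records:
            result[synset_a] |= by_record.get(record, set())
    return result


def project_relations(relations_a, correspondences):
    """Relation correspondence: copy each WN_A relation onto corresponding WN_B synsets."""
    projected = set()
    for source, relation, target in relations_a:
        for source_b in correspondences.get(source, ()):
            for target_b in correspondences.get(target, ()):
                projected.add((source_b, relation, target_b))
    return projected


CORRESPONDENCES = synset_correspondences(WN_A_ILI, WN_B_ILI)
PROPOSED = project_relations(WN_A_RELATIONS, CORRESPONDENCES)
print(PROPOSED)
```

Relations obtained this way are only proposals for the target wordnet; in practice they would still need validation against the relations that wordnet already encodes.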

29 A new proposed mero relation
Diagram labels:
  Italian synsets: parte, tratto (N#12348); passaggio, strada, via (N#1290); curvatura, svolta, curva (N#20944); carreggiata (N#21225)
  Italian relations: iperonimia/HYP (hypernymy); iponimia/HPO (hyponymy); meronimy/MPT
  ILI records: road, route; stretch; passage; roadway; bend, crook, turn; ILI1.6-???
  Chinese synsets: tong_dao (通道); che_dao (車道); dao_lu, dao, lu (道路, 道, 路); wan (彎)
  Chinese relations: 上位(泛稱)詞_為/HYP; 下位(特指)詞_為/HPO; 部件_部份詞_為/MPT
  Link types: Synonym; Derived; Reinforcement & validity

30 assimilare_5, assorbire_3, accettare_2, recepire_1
Diagram labels:
  Italian synsets: V#32925 studiare_3, imparare_1, apprendere_2; V#39802 prendere_3; V#32080 assimilare_5, assorbire_3, accettare_2, recepire_1; AG#42011 relativo_4
  ILI records: acquire_knowledge (v); absorb, assimilate (v); ingest, take_in (v); receive, have (v); imbibe (v); respective, several, various (a)
  Relations: eq_syn; eq_near_syn; has_hyponym (HPO); has_hyperonym (HYP); causes (CAU); Derived

31 For a Global WordNet Grid
This architecture for making distributed wordnets interoperable lends itself to different applications in LR processing:
  enrichment of existing lexical resources
  creation of new resources
  validation of existing resources
It can provide a platform for cooperative & collective creation & management of LRs, offering a web-based environment for the collaboration & interaction of distributed agents and resources.
It can be seen as the prototype of a web application supporting the GlobalWordNet Grid initiative, i.e. a shared multilingual knowledge base for cross-lingual processing based on distributed resources over the Grid.
New project: KYOTO

32 Distributed, diverse & dynamic data
(1) Distributed, diverse & dynamic data: environmental organizations, citizens, governments, companies
(2) Capture text: "Sudden increase of CO2 emissions in 2008 in Europe"
(3) Tybot (term yielding robot): CO2 emission
(4) Wikyoto: maintain terms & concepts
(5) Kybot (knowledge yielding robot): index facts
      Process: Emission
      Involves: CO2
      Property: increase, sudden
      When: 2008
      Where: Europe
(6) Text & Fact Index; Semantic Search
Knowledge resources in the diagram: Wordnets; Ontology (Top, Middle; Abstract, Physical, Process, Substance, H2O, CO2); Domain (CO2 Emission, H2O Pollution, Greenhouse Gas)
From Piek Vossen.

33 TEXT
Diagram labels:
  Annotation layers: Linear MAF (Morphological Annotation); Linear SYNAF (Syntactic Annotation); Linear SEMAF (Semantic Annotation); Linear DAF (Discourse Annotation); Linear Generic FACTAF
  Processing: Term Extraction (Tybot); Fact Extraction (Kybot); Generic TMF
  Resources: Wordnet; Domain Wordnet; ontology; domain ontology; Domain Terms
  Interfaces: LMF API; OWL API
  Scope: Language Specific; Language Neutral; Language Neutral & Specific
From Piek Vossen.

34 System components
Wikyoto = wiki environment for a social group:
  to model the terms and concepts of a domain and agree on their meaning, within the group and across languages and cultures
  to define the types of knowledge and facts of interest
Tybots = term extraction robots: extract term data from a text corpus
Kybots = knowledge yielding robots: extract facts from a text corpus
Linguistic processors:
  tokenizers, segmentizers, taggers, grammars
  named entity recognition
  word sense disambiguation
  generate a layered text annotation in Kyoto Annotation Format (KAF)
From Piek Vossen.

35 KYOTO SYSTEM
Diagram labels:
  Users: Concept User; Fact User
  Components: Term extraction (Tybot); Semantic annotation; Domain editing (Wikyoto); Fact extraction (Kybot)
  Formats: Linear SYNAF/SEMAF; Linear SEMAF; Generic TMF; Linear Generic FACTAF
  Resources: Wordnet; Domain Wordnet; Ontology; Domain ontology
  Interfaces: LMF API; OWL API
From Piek Vossen.

36 Fact mining by Kybots
Source documents, linguistic processors, morpho-syntactic analysis:
  [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP] NP
Fact analysis:
  [the emission]NP             Process: e1
  [of greenhouse gases]PP      Patient: s2
  [in agricultural areas]PP    Location: a3
Knowledge resources in the diagram: Ontology (Generic; Abstract, Physical, Process, Substance, Chemical Reaction; logical expressions); Wordnets & linguistic expressions (H2O, CO2); Domain (CO2 emission, water pollution; Patient)
From Piek Vossen.
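The step from a bracketed chunk analysis to a role-labelled fact can be sketched as below. This is a deliberately naive illustration and not how Kybots are implemented: the preposition-to-role table is a hypothetical rule invented for this one example, whereas the slide indicates that roles are derived with the help of wordnets and the ontology.

```python
import re

CHUNKED = "[[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP] NP"

# Hypothetical rule table, sufficient only for this example sentence.
ROLE_BY_PREPOSITION = {"of": "Patient", "in": "Location"}


def extract_fact(chunked):
    """Map the innermost NP/PP chunks to Process, Patient and Location slots."""
    # Match only innermost chunks: bracketed text with no nested brackets.
    chunks = re.findall(r"\[([^\[\]]+)\](NP|PP)", chunked)
    fact = {}
    for text, label in chunks:
        words = text.split()
        if label == "NP" and "Process" not in fact:
            fact["Process"] = words[-1]  # head noun of the first NP
        elif label == "PP":
            role = ROLE_BY_PREPOSITION.get(words[0])
            if role is not None:
                fact[role] = " ".join(words[1:])
    return fact


FACT = extract_fact(CHUNKED)
print(FACT)
```

A real system would replace the rule table with ontology-backed typing, for example checking that the filler of the Patient slot is a Substance.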

37 Contribution of KYOTO
KYOTO enables semantic search and fact extraction:
  software can partially understand language and exploit web 1 data
  understanding is helped by the terms and concepts defined for each language
KYOTO learns terms and concepts from text documents, stored as structures that people and computers understand:
  hundreds of thousands of sources in the environment domain
  in many different languages
  spread all over the world
  changing every day
KYOTO delivers a Web 2.0 environment for community-based control:
  connects people across languages and cultures
  establishes consensus and knowledge transition
Diagram labels: html; pdf; xls; environment facts; environment terms; concepts; Wordnet; Ontology; KYBOT; TYBOT; WIKYOTO
From Piek Vossen.

38 A common representation format: WordNet - LMF
UML-like diagram, classes: LexicalResource; GlobalInformation; Lexicon; SenseAxes; Meta; LexicalEntry; Synset; SenseAxis; Lemma; Sense; Definition; Statement; SynsetRelations; SynsetRelation; MonolingualExternalRefs; MonolingualExternalRef; InterlingualExternalRefs; InterlingualExternalRef; Data Categories
Here is a UML-like view of WordNet-LMF with its three main packages: the core one with general information, the semantic package with synsets, and the package used to express interlingual linkages.
From Monica Monachini.

39 Centralized WordNet DC Registry
A list of 85 semantic relations, the result of a mapping across the KYOTO WordNet grid (intra-WN and inter-WN).
For the DCR, we collected from the various monolingual wordnets of the project all the relations holding between synsets and tried to come up with a harmonised set of relations. This is maintained as a centralized repository in order to ensure coherence and consistency.
WordNet-LMF fully complies with LMF. We maintained adherence to its architectural principles: the main building blocks, the constitutive relations, and the way linguistic information is encoded by means of DCs. We only customized the representation of DCs, which are no longer represented as separate lexical objects but as attributes constrained under specific elements. This allowed us to obtain better parsing efficiency.
From Monica Monachini.

40 WordNet-LMF multilingual level - Cross-lingual synset relations
  <!ELEMENT SenseAxes (SenseAxis+)>
  <!ELEMENT SenseAxis (Meta?, Target+, InterlingualExternalRefs?)>
  <!ATTLIST SenseAxis
    id ID #REQUIRED
    relType CDATA #REQUIRED>
  <!ELEMENT Target EMPTY>
  <!ATTLIST Target ID CDATA #REQUIRED>
  <!ELEMENT InterlingualExternalRefs (InterlingualExternalRef+)>
  <!ELEMENT InterlingualExternalRef (Meta?)>
  <!ATTLIST InterlingualExternalRef
    externalSystem CDATA #REQUIRED
    externalReference CDATA #REQUIRED
    relType (at|plus|equal) #IMPLIED>
Example synsets: IWN <fuoco_1, fiamma_1> n; SWN <fuego_3, llama_1> n; English WN3.0 <fire_1 flame_1 flaming_1> n
SenseAxis: groups monolingual synsets corresponding to each other and sharing the same relations to English
relType: specifies the type of correspondence
InterlingualExternalRefs: link to ontology/(ies)
Here is the multilingual component, based on the notion of sense axis, which allows the interlingual approach to be encoded. On the left there are three different wordnets (fuoco, fuego, fire) with their WordNet-LMF representation. The sense axis groups together the monolingual synsets that correspond to each other and share the same correspondence relation from language X to English, i.e. both fuoco and fuego are eq_synonym with fire. In the external reference we encode the link to an ontology. This architecture, where at the monolingual level synsets preserve their ontological typing and at the multilingual level they are referenced to the same node in the same shared ontology, makes it possible to obtain indirectly a mapping between different ontological systems.
From Monica Monachini.
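A SenseAxis element that follows the DTD above could be assembled as in this sketch. The synset identifiers, the axis id and the ontology name and reference are hypothetical placeholders; the eq_synonym relation type is taken from the fuoco/fuego/fire example, and the child order (Target elements first, then InterlingualExternalRefs) follows the content model in the DTD.

```python
import xml.etree.ElementTree as ET


def build_sense_axis(axis_id, rel_type, target_ids, external_system, external_reference):
    """Build one SenseAxis grouping corresponding synsets across wordnets."""
    axis = ET.Element("SenseAxis", {"id": axis_id, "relType": rel_type})
    for target_id in target_ids:
        # Target is declared EMPTY: it only carries the synset identifier.
        ET.SubElement(axis, "Target", {"ID": target_id})
    refs = ET.SubElement(axis, "InterlingualExternalRefs")
    ET.SubElement(
        refs,
        "InterlingualExternalRef",
        {
            "externalSystem": external_system,
            "externalReference": external_reference,
            "relType": "equal",  # one of the enumerated values at | plus | equal
        },
    )
    return axis


# Hypothetical identifiers for the Italian, Spanish and English synsets.
AXIS = build_sense_axis(
    axis_id="sa_fire",
    rel_type="eq_synonym",
    target_ids=["iwn-fuoco_1-n", "swn-fuego_3-n", "eng-30-fire_1-n"],
    external_system="ExampleOntology",
    external_reference="Fire",
)
print(ET.tostring(AXIS, encoding="unicode"))
```

Because every wordnet in the axis points to the same external reference, two monolingual ontological typings can be compared indirectly through that shared node.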

41 Ultimate goal
Global standardization and anchoring of meaning such that:
  machines can start to approach text understanding -> the semantic web connects to the current web
  communities can dynamically maintain knowledge, concepts and their terms in an easy-to-use system
  cross-linguistic and cross-cultural sharing and communication of knowledge is enabled
Comparable to a formalization of Wikipedia for humans AND machines across languages.
From Piek Vossen.

42 Some steps for a “new generation” of LRs
From huge efforts in building static, large-scale, general-purpose LRs
To non-static LRs rapidly built on demand, tailored to specific user needs
From closed, locally developed and centralized resources
To LRs residing over distributed places, accessible on the web, choreographed by agents acting over them
From Language Resources To Language Services

43 Distributed Language Services
A long-term scenario implying content interoperability standards, supra-national cooperation and development of architectures enabling accessibility:
  create new resources on the basis of existing ones
  exchange and integrate information across repositories
  compose new services on demand
Collaborative & collective/social development and validation, cross-resource integration and exchange of information
Language Grid; Wiki

44 In the “Semantic Web” vision ...
… need to tackle the twofold challenge of content availability & multilinguality
Natural convergence with HLT:
  multilingual semantic processing
  ontologies
  semantic-syntactic computational lexicons

45 Language Tech … & … Knowledge, Content
Ready???
Knowledge Markup; Semantic Web; LT & LRs
How to cooperate??
Content; Interoperable LRs & LT

46 LR and the future of LT or Content Tech
The need for ever growing and richer LRs for effective multilingual content processing requires a change in the paradigm, & the design of a new generation of LRs based on open content interoperability standards.
The Semantic Web notion may be used to shape the LRs of the future, in the vision of an open space of sharable knowledge available on the Web for processing.
The effort of making available millions of “richly annotated words” for dozens of languages is not affordable by any single group.
This objective can only be achieved by creating integrated Open and Distributed Linguistic Infrastructures.
Not only linguistic experts can participate in these; they may also include designers, developers, users of content encoding practices, etc., in wiki mode.
Is the LR/LT field mature enough to broaden and open itself to the concept of a cooperative effort of different sets of communities?
Could a sort of large “Language Genome” initiative be effective? Storing lots of (annotated) facts.

47 Today, many vitality & success signs… for LRs
In Spoken, Written, Multimodal areas … in new emerging areas
Statistical approaches…
Different dimensions & layers: Content (Ontologies), Emotion, Time, …
For Evaluation; For Training
LREC (> 900 submissions); many LRs at COLING and even at ACL!!
ELRA (self-sustaining) & LDC
LRE (new Journal: N. Ide & NC)
ISO-TC37-SC4/WG4 (International Standards for LRs)
AFNLP…
FLaReNet
ESFRI - CLARIN (also political & strategic role)
New calls or initiatives in EU, US, ASIA, on LRs, interoperability, cooperation, …

48 BUT … an important point
In the ’90s there was a global vision of the field & its main components:
  Standards
  Creation of LRs
  Distribution
  then: Automatic acquisition
… towards the Infrastructure of LRs & LT
ELRA; LDC
While today:
  there is an ever increasing set of initiatives for new LRs, basic robust technologies, models??, algorithms
  we have a LR community culture
BUT sort of scattered, opportunistic, not much coherence

49 Content interoperability
Today … the wealth of data & of basic technologies is such that we should reflect again on the field as a whole & ask if
  Standards
  Creation of LRs
  Automatic acquisition
  Distribution
are still “the” important components, or how they have changed/must change:
  Content interoperability
  Collaborative creation & management
  Dynamic LRs
  Sharing
  …
Which new challenges towards a new & more mature infrastructure of LRs & LTs??

50 These dimensions
  Content interoperability
  Collaborative creation & management
  Dynamic LRs
  Sharing
  + Distributed architectures/infrastructures
Annotations: Need more; Technology exists
could be at the basis of a new Paradigm for LRs & LT & of a new Infrastructure ??

51 Many dimensions around the notion of language
Finally, we need to put together technical, organisational, strategic, economic and political issues of LRs.
Dimensions:
  Multilingualism
  Political issues, e.g. a commonly agreed list of minimal requirements for “national” LRs: BLARK
  Interdisciplinarity & Multidisciplinarity
  Cultural issues: Language … and cultural identity; Language … and the Humanities
  Economic, social issues: Applications, Services
  Technical issues
Two new European Infrastructural & Networking Initiatives
Need of bodies for a broad research agenda & strategic actions for LT & LRs (W/S/MM) based on all the dimensions

52 Which Communities?
Technologies exist, but the infrastructure that puts them together and sustains them is still missing
for which communities?
  core: Language Resources, Language Technologies, Standardisation
  Humanities, Social Sciences, Digital Libraries, Cultural Heritage
  Grid, Semantic Web, Ontologists, ICT
Enabling infrastructure: CLARIN (ResInfra); FLaReNet (Network) on Multilinguality
Focus on cooperation
for many application domains (eculture, egovernment, ehealth, …)

53 CLARIN Common Language Resources and Technologies Infrastructure
ESFRI Research Infrastructures: CLARIN, Common Language Resources and Technologies Infrastructure for the Humanities & Social Sciences.
Large-scale pan-European collaborative effort (31+ countries).
Make LRs & LTs available & readily usable to scholars of the humanities & social sciences (& all disciplines).
Need to overcome the present fragmented situation by harmonising structural and terminological differences.
Basis is a Grid-type infrastructure and Semantic Web technology.
The benefits of computer-enhanced language processing become available only when a critical mass of coordinated effort is invested in building an enabling infrastructure, which can provide services in the form of tools & resources as well as training & counselling across a wide span of domains.
The infrastructure will be based on a number of resource, service and expertise centres.

54 CLARIN Mission
Create a comprehensive and free-to-use distributed archive of LRs & LTs covering not only the languages of all member states, but also other languages studied and used in Europe.
Because the tools & resources will be interoperable across languages & domains, contribute to preserving and supporting the multilingual & multicultural European heritage.
An operational open infrastructure of web services will introduce a new paradigm of distributed collaborative development.
Allow many contributors to add all kinds of new services based on existing ones, thus ensuring reusability and allowing scaling up to suit individual needs.

55 How can we tackle these challenges?
J. Taylor: “eScience is about global collaboration in key areas of science and the next generation of infrastructures that will enable it”
Need to build new types of platforms to allow researchers:
  to combine existing resources easily into new ones
  to tackle the big challenges
  to increase the productivity of all interested researchers, since currently too much time is wasted on preparatory work
From P. Wittenburg.

56 eScience Vision
CLARIN establishes such a new generation of extended infrastructure.
Thus CLARIN is not about creating and building new language resources and technology, but about making them available and accessible as services in a stable and persistent infrastructure, to allow tackling the great challenges.
CLARIN:
Grid Project:
ISO TC37/SC4:
Standards Project:
From P. Wittenburg.

57 We have still a long path …
& also a “new project” in an e-Contentplus Call for a “Thematic Network on Language Resources”: FLaReNet
  to provide common recommendations (to the EC) for future actions
  to give priorities
Need of ‘visions’
In a global context, in cooperation with CLARIN & also with non-EU members

58 Which Communities?
LRs & LTs exist, but a global vision, policy and strategy is still missing
for which communities?
  core: Language Resources, Language Technologies, Standardisation
  Humanities, Social Sciences, Digital Libraries, Cultural Heritage
  Ontologists, Content
EU Forum: CLARIN (ResInf); FLaReNet (Network); Multilinguality
Focus on cooperation
for the EC and funding agencies
for many application domains (eculture, egovernment, ehealth, intelligence, domotics, content industry, …)

59 Fostering Language Resources Network
eContentplus
A new European Network for Language Resources – Nicoletta Calzolari (coord.)

60 FLaReNet Fostering Language Resources Network
A European forum to facilitate interaction among LR stakeholders.
The Network structure considers that LRs present various dimensions and must be approached from many perspectives:
  technical, but also
  organisational
  economic
  legal
  political
It also addresses multicultural and multilingual aspects, essential when facing access to and use of digital content in today’s Europe.
N. Calzolari

61 Organised in Thematic Working Groups
A layered structure, with leading experts & groups (national and European institutions, SMEs, large companies) for all relevant LR areas (about 40 partners), in collaboration with CLARIN to ensure coherence of LR-related efforts in Europe.
FLaReNet will consolidate existing knowledge, presenting it analytically and visibly.
It will contribute to structuring the area of LRs of the future by discussing new strategies to:
  convert existing and experimental technologies related to LRs into useful economic and societal benefits
  integrate so far partial solutions into broader infrastructures
  consolidate areas mature enough for recommendation of best practices
  anticipate the needs of new types of LRs
N. Calzolari

62 Thematic Areas
The Chart for the area of LRs in its different dimensions
Methods and models for LR building, reuse, interlinking and maintenance
Harmonisation of formats and standards
Definition of evaluation protocols and evaluation procedures
Methods for the automatic construction and processing of LRs
To build together:
  Evolving RoadMap
  Blueprint of actions and infrastructures
N. Calzolari

63 Objectives & expected results
The largest Network of LR and HLT players, with diverse approaches, efforts and technologies
Enable progress toward community consensus
Give an extended picture of LRs & recast its definition in the light of recent scientific, methodological, technological and social developments
Consolidate methods & approaches, common practices, frameworks and architectures
A “roadmap” identifying areas where consensus has been achieved or is emerging vs. areas where additional discussion and testing is required, together with an indication of priorities
Recommendations in the form of a plan of coherent actions for the EU and national organizations
A European model for the LRs of the next years
Ambitious!
N. Calzolari

64 Outcomes of FLaReNet
The outcomes will be of a directive nature:
  to help the EC and national funding agencies identify priority areas of LRs of major interest for the public that need public funding to develop or improve
A blueprint of actions will constitute input to policy development, both at EU and national level, for identifying new language policies that support linguistic diversity in Europe, in combination with strengthening the language product market, e.g. for new products & innovative services, especially for less technologically advanced languages.
N. Calzolari

65 These Initiatives, … together
Call for international cooperation also outside Europe, and will be relevant for setting up a global worldwide Forum of Language Resources and Language Technologies.
N. Calzolari
