N. Calzolari, 2nd KYOTO Workshop, Gifu, Japan, January 2011. Nicoletta Calzolari, Istituto di Linguistica Computazionale – CNR – Pisa


Presentation on theme: "N. Calzolari, 2nd KYOTO Workshop, Gifu, Japan, January 2011. Nicoletta Calzolari, Istituto di Linguistica Computazionale – CNR – Pisa". Presentation transcript:

1 N. Calzolari, 2nd KYOTO Workshop, Gifu, Japan, January 2011. Nicoletta Calzolari, Istituto di Linguistica Computazionale – CNR – Pisa, glottolo@ilc.cnr.it. The Future of KYOTO … with some historical notes to show a path along an evolving vision. Language Resources in today's EU context: META-SHARE, ...

2 Why are such needed LRs still lacking after 30 years of R&D in the field? (Old slide with Antonio Zampolli, ’80s/early ’90s.)
• 1) Because the main trend until the mid-’80s was to privilege the processing of so-called “critical” phenomena, studied by the dominant linguistic theories, rather than the deep analysis of the real uses of a language. As a result, CL was focusing on few examples (often artificially built), lexicons made of few entries (toy lexicons), and grammars with poor coverage.
• 2) Because large-scale LRs are costly & their production requires a big organizing effort.
Why do we still lack them??

3 Historical notes: … back from the early ’80s. Pioneering research: automatic acquisition of lexical information from MRDs. This was my first research & became central in the Pisa group (ACQUILEX); also Amsler, Briscoe, Boguraev, Wilks’ group, IBM, then Japanese groups, … The trend was “large-scale computational methods for the transformation of machine readable dictionaries into machine tractable dictionaries”, instead of relying on linguists’ introspection. It became evident that part of the results of meaning extraction, e.g. many meaning distinctions that could be generalised over lexicographic definitions and automatically captured, were unmanageable at the formal representation level and had to be blurred into unique features and values. Unfortunately, it is still difficult today to constrain word meanings within a rigorously defined organization: by their very nature they tend to evade any strict boundaries.

4 Historical notes: … back from the late ’80s. After acquisition from MRDs, automatic acquisition of information from texts. This trend has today become a consolidated & pervasive fact: from acquisition of “linguistic information” to acquisition of “general knowledge”, with more data-intensive, robust, reliable methods. Lessons learned: the need for adequate models to handle actual usage of language; the (in)adequacy of (current) lexicons; going from core sets to large coverage has implications not just in quantitative terms but, more interestingly, in terms of changes to the models and the strategies of processes.

5 After the “Grosseto Workshop” (1985), a turning point: MultiLex, GeneLex, AcquiLex, Xxx-Lex. A. Zampolli: Let’s be coherent: Xxx-Lex

6 ISO LMF: Lexical Markup Framework. Structural skeleton, with the basic hierarchy of information in a lexical entry, plus various extensions.
• Modular framework
• LMF specs comply with UML modelling principles
• An XML DTD allows implementation
Builds on EAGLES/ISLE. The field is mature. New initiatives: NEDO Asian Languages, NICT Language-Grid Service Ontology, ICT KYOTO, LIRICS, LexInfo, …

7 KYOTO: a search environment using semantic technologies, a “compass” for the web 2.0. Interdisciplinarity: scientific community (LRT, web technologies, knowledge engineers), companies, domain experts. Multilingualism: 7 languages (2 Asian languages). The “resource” perspective: needs to share lexical/knowledge bases & tools, both general & domain-related, in the form of lexical/ontological & software repositories. The Kyoto Core System is open & free.

8 Annotation Format (KAF), from Piek Vossen. Multi-level annotation format: stand-off annotation; uniform representation for 7 languages.
• Shared through the languages:
Text: tokenisation, sentences, paragraphs, with reference to the sources
Terms: words & multi-words, parts-of-speech, etc.
Chunks: constituents & syntagmatic realization
Dependencies: grammatical functions
● L1 – Semantic modules: Multiword tagging, Sense Tagging, Named Entity Recognition, OntoTagging
● L2 – Semantic module: event/fact extraction
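
The layering listed on this slide can be illustrated with a minimal Python sketch of stand-off annotation: the text layer only records offsets into the source, and each higher layer points at identifiers in the layer below instead of copying text. The class names, the identifier scheme (`w1`, `t1`) and the example sentence are assumptions made for illustration; this is not the actual KAF schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    """Text layer: a span in the source, identified by an id (hypothetical scheme)."""
    wid: str
    offset: int
    length: int


@dataclass(frozen=True)
class Term:
    """Term layer: a word or multi-word that refers to tokens by id only."""
    tid: str
    lemma: str
    pos: str
    token_ids: Tuple[str, ...]


def tokenize(text: str) -> List[Token]:
    """Whitespace tokenisation that keeps character offsets into the source."""
    tokens, cursor = [], 0
    for i, word in enumerate(text.split(), start=1):
        start = text.index(word, cursor)
        tokens.append(Token(f"w{i}", start, len(word)))
        cursor = start + len(word)
    return tokens


def surface(text: str, tokens: List[Token], term: Term) -> str:
    """Recover a term's surface form by following its token references."""
    by_id = {t.wid: t for t in tokens}
    spans = (by_id[w] for w in term.token_ids)
    return " ".join(text[s.offset:s.offset + s.length] for s in spans)


TEXT = "Sea level rise threatens coastal wetlands"
TOKENS = tokenize(TEXT)
TERMS = [
    Term("t1", "sea level rise", "N", ("w1", "w2", "w3")),  # multi-word term
    Term("t2", "threaten", "V", ("w4",)),
]

for term in TERMS:
    print(term.tid, term.lemma, "->", surface(TEXT, TOKENS, term))
```

Because the source text is never modified, further layers (chunks, dependencies, semantic tags) could be added in the same way by referring to term identifiers.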

9 KYOTO System & Adoption of Standards (from Piek Vossen). [Diagram labels: Source Documents; linear and generic annotation layers (MAF/SYNAF, SEMAF, TMF, FACTAF); Term extraction (Tybot); Semantic annotation; Fact extraction (Kybot); Domain editing (Wikyoto); Wordnet and Domain Wordnet (LMF API); Ontology and Domain ontology (OWL API); Concept User; Fact User.] Could be at the basis of a new standard?

10 A common representation format for WordNets: endow WordNet with a representation format allowing easy access, integration & interoperability among resources. [Diagram: Wn IT, Wn EN, Wn EU, Wn NL, Wn JP, Wn CH, Wn ES.]

11 [Class diagram, from Monica Monachini. Elements: LexicalResource, GlobalInformation, Lexicon, LexicalEntry, Lemma, Sense, Synset, Definition, Statement, SynsetRelations/SynsetRelation, MonolingualExternalRefs/MonolingualExternalRef, SenseAxes/SenseAxis, InterlingualExternalRefs/InterlingualExternalRef, Meta; links carry cardinalities (1..1, 0..1, 1..*, 0..*); Data Categories.]
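
As a rough illustration of the containment hierarchy named in this diagram, the following Python sketch models a lexical resource holding lexicons, which in turn hold lexical entries, senses and synsets. It is a simplification under stated assumptions: the field names, the identifiers and the toy English entry are hypothetical, and the cardinalities are not enforced because they cannot be recovered reliably from the flattened diagram.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    id: str
    definition: str = ""
    # relation type -> ids of target synsets (stands in for SynsetRelations)
    relations: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Sense:
    id: str
    synset: str  # id of the Synset this sense belongs to


@dataclass
class LexicalEntry:
    id: str
    lemma: str
    part_of_speech: str
    senses: List[Sense] = field(default_factory=list)


@dataclass
class Lexicon:
    language: str
    entries: List[LexicalEntry] = field(default_factory=list)
    synsets: Dict[str, Synset] = field(default_factory=dict)

    def synsets_for(self, lemma: str) -> List[Synset]:
        """Follow lemma -> senses -> synsets inside one monolingual lexicon."""
        return [
            self.synsets[sense.synset]
            for entry in self.entries if entry.lemma == lemma
            for sense in entry.senses if sense.synset in self.synsets
        ]


@dataclass
class LexicalResource:
    global_information: str
    lexicons: List[Lexicon] = field(default_factory=list)


# Toy data with made-up identifiers, for illustration only.
english = Lexicon(
    language="en",
    entries=[LexicalEntry("e1", "wetland", "n", [Sense("s1", "demo-0001-n")])],
    synsets={
        "demo-0001-n": Synset(
            "demo-0001-n",
            "low-lying land saturated with water",
            {"has_hyperonym": ["demo-0002-n"]},
        )
    },
)
resource = LexicalResource("toy WordNet-style resource", [english])

for synset in resource.lexicons[0].synsets_for("wetland"):
    print(synset.id, synset.relations)
```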

12 A list of 85 semantic relations as a result of a mapping of the KYOTO WordNet grid: Inter-WN and Intra-WN.

13 WordNet-LMF Multilingual level: Cross-lingual Relations (from Monica Monachini). Synset identifiers shown on the slide: SWN 09686541-n, IWN 00001251-n, WN3.0 13480848-n. Annotations on the slide: groups monolingual synsets corresponding to each other and sharing the same relations to English; link to ontology/(ies); specifies the type of correspondence. DTD fragment:
<!ATTLIST SenseAxis id ID #REQUIRED relType CDATA #REQUIRED>
<!ATTLIST Target ID CDATA #REQUIRED>
<!ATTLIST InterlingualExternalRef externalSystem CDATA #REQUIRED externalReference CDATA #REQUIRED relType (at|plus|equal) #IMPLIED>
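
To make the attribute declarations on this slide concrete, here is a hedged Python sketch that reads one `SenseAxis` element and checks it against them. Only the attribute names and the `(at|plus|equal)` value list come from the slide; the element nesting, the axis `id` and `relType` values, the `SWN-`/`IWN-` prefixes and the pairing of the slide's synset identifiers in one axis are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Enumerated values declared for InterlingualExternalRef/@relType on the slide.
ALLOWED_REF_TYPES = {"at", "plus", "equal"}

# Hypothetical instance: nesting and most values are illustrative assumptions.
SAMPLE = """
<SenseAxis id="sa_demo_1" relType="eq_synonym">
  <Target ID="SWN-09686541-n"/>
  <Target ID="IWN-00001251-n"/>
  <InterlingualExternalRef externalSystem="WN3.0"
                           externalReference="13480848-n"
                           relType="equal"/>
</SenseAxis>
"""


def read_sense_axis(xml_text: str) -> dict:
    """Parse one SenseAxis and apply the checks implied by the ATTLIST lines."""
    axis = ET.fromstring(xml_text.strip())
    for name in ("id", "relType"):  # both declared #REQUIRED
        if name not in axis.attrib:
            raise ValueError(f"SenseAxis is missing required attribute {name!r}")

    targets = [target.attrib["ID"] for target in axis.iter("Target")]

    external_refs = []
    for ref in axis.iter("InterlingualExternalRef"):
        rel = ref.get("relType")  # declared #IMPLIED, so it may be absent
        if rel is not None and rel not in ALLOWED_REF_TYPES:
            raise ValueError(f"unexpected relType {rel!r}")
        external_refs.append(
            (ref.attrib["externalSystem"], ref.attrib["externalReference"], rel)
        )

    return {
        "id": axis.attrib["id"],
        "relType": axis.attrib["relType"],
        "targets": targets,
        "external_refs": external_refs,
    }


print(read_sense_axis(SAMPLE))
```

A validating parser driven by the full DTD would be the more rigorous route; the manual checks here only cover the three declarations quoted on the slide.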

14 Complex picture! Is there anything we need to do for Interoperability?
Work within ISO:
• LMF: abstract meta-model for lexical representation
• Ontology Group or more Groups?
• Language Resource Ontologies: ontology of data categories
Real life:
• Lexicons (e.g. WordNets) that are called Ontologies
• Lexicons linked to Ontologies: to be used in applications, in multilingual systems, domains, …
• Work on “ontologising” Lexicons: to allow exploiting various relations, to make inferences, …
• Semantic Lexicons, with many types of relations among semantic units: these are often of “conceptual/world-knowledge” nature. Do we want DCs for these?
ISO SC 4/WG 4: Lexicon-Ontology relations. New work item: PWI 24622. KYOTO can contribute.

15 To explore the need of doing something within ISO about the relations between Lexicon and Ontology. Do we/ISO need to address another (lexical) layer?
• How lexicons and ontologies are linked and information mapped from one to the other
• The ontological layer in a/connected to a lexicon
Possible issues/questions:
• Is LMF enough to represent ontological links?
• How to connect the work being done in the ISO Lexical group and the ISO Ontology groups?
• Lexicon and Ontologies: separation? or lexicalised ontologies? or ontologised lexicons?
• Lexicon, Ontologies and Domains
• On a very different dimension: Ontology of lexical/semantic/conceptual categories? Standardised semantic categories, ontology labels?
• Relation to multilinguality ...
KYOTO can contribute.

16 Input to Multilingual Web: http://www.multilingualweb.eu/ The MultilingualWeb project is exploring standards and best practices that support the creation, localization and use of multilingual web-based information. It aims to raise the visibility of existing best practices and standards and identify gaps. The core vehicle for this is a series of four workshops, for networking across communities that span the various aspects involved. Next workshop on best practices aimed at development of Content for the Web, including creation of content ranging from personal authoring for blogs and social networking sites to development of large corporate or organizational enterprises: “Content on the Multilingual Web”, 4-5 April 2011, Pisa, Italy. KYOTO can contribute.

17 A new paradigm of R&D in LRs & LT. For a few years now: open & distributed linguistic infrastructures for LRs & LT. Adopting the paradigm of accumulation of knowledge, so successful in more mature disciplines, based on sharing LRs, tools & results. Ability to build on each other's achievements, allowing controlled & effective cooperation of many groups on common tasks (see the Human Genome Project), e.g. initiatives to achieve international consensus on annotation guidelines. Emerging concept of collective intelligence. Emphasize interoperability among LRs & LT.

18 Some steps for a “new generation” of LRs. From huge efforts building static, large-scale, general-purpose LRs to dynamic LRs rapidly built on demand, tailored to specific user needs. From closed, locally developed and centralized resources to LRs residing over distributed places, accessible on the web, choreographed by agents acting over them. From Language Resources to Language Services. Need of an infrastructure that makes this vision operational.

19 Lexical WEB, as a critical step for semantic mark-up in the Semantic Web. [Diagram: ComLex, SIMPLE, WordNets, FrameNet, NomLex, Lex_x, Lex_y, BioLexicon, SIMPLE-WEB and the Global WordNet GRID, with intelligent agents.] Standards for Content Interoperability: enough??

20 (Distributed) Language Services: content interoperability standards; supra-national cooperation; architectures enabling accessibility. Collaborative & collective/social development & validation, cross-resource integration & exchange of information. Create new resources on the basis of existing ones. Exchange & integrate information across repositories. Compose new services on demand. Can KYOTO contribute?

21 Which Communities? Many dimensions today. Communities: Language Resources, Language Technologies, Standardisation, Content/Ontologies, System developers, Integrators, SSH, EC, national funding agencies, industry. Many applications/domains: MT, CLIR, …, e-government, content industry, intelligence, e-culture, e-health, domotics, … Many LRs & LTs exist, but a global vision, policy & strategy is needed. [Diagram labels: core; EU Forum with focus on cooperation; FLaReNet Network; META-NET NoE; CLARIN (for SSH).] Need to consider together technical, organisational, strategic, economic, social, cultural, legal and political issues wrt LRs & LTs.

22 FLaReNet at a glance (Fostering Language Resources Network), http://www.flarenet.eu. An international Forum to facilitate interaction, to: overcome the fragmentation in LR & LT & recreate a community; anticipate the needs of new types of LR & LT & Language Infrastructures; create a shared policy for the next years; foster a European strategy for consolidating the sector. 98 Institutional Members from 33 countries; 351 Individual Subscribers. Community mobilisation is essential (also to prepare the ground for a RI). A “roadmap”: a plan of actions as input to policy development. A (EU) model for the LRs/LTs area of the next years. Ambitious!

23 Results from Vienna & Barcelona Forum: Shaping the Future of the Multilingual Digital Europe. Standards, Interoperability & Metadata are topics to be approached in cooperation. Create a shared repository of data formats, annotations, etc. as a major help to achieve standardisation. Common repositories for tools & language data should be established that are universally and easily accessible by everyone. Coordinate input to ISO/W3C standardisation work. Access to LRs is critical & should involve all the community. Need to create the means to plug together different LR & LT, in a web-based resource and technology “grid”, for a new world-wide language infrastructure.

24 2nd Blueprint. Provide feedback! http://www.flarenet.eu/sites/default/files/D8.2b.pdf
Result of a permanent and cyclical consultation:
• Inside the community it represents
• Outside it, through connections with neighbouring projects, associations, initiatives, funding agencies
Organised along three main “directions”:
• Infrastructural Aspects
• Research and Development
• Political and Strategic Issues
These reflect three major development factors that can boost or hinder the growth of the field of LRT.

25 Sources: many meetings. Operational Interoperability; Asian Collaboration Workshop; FL-SILT Workshop; Lexicon/Ontology Standards; NEERI; 2nd FLaReNet Forum; Less-resourced Languages; Automatic Acquisition; Legal Issues; Standards; International Cooperation.

26 3rd FLaReNet Forum. The European Language Resources and Technologies Forum: important role in defining recommendations. In Barcelona: 120 participants from 22 countries. Previous Proceedings & Reports on the web.
• Blueprint will be discussed
• Also for adoption & endorsement by FLaReNet Institutional Members
Define final recommendations.

27 Infrastructural Aspects (columns: Issue, Challenge, Recommended Actions)
Metadata. Interoperability of metadata sets: set up a global infrastructure of common and uniform and/or interoperable metadata sets. Metadata usable both by humans and by machines: create machine-understandable metadata with formal syntax and clear semantics. Automate the process of metadata creation: develop structured metadata.
Documentation. Reliable documentation of LRs according to common best practices: collect all possible and existing LR documentation; devise and adopt a widely agreed standard documentation template for all types of resources.

28 Political and Strategic dimensions (columns: Issue, Challenge, Recommended Actions)
Funding agencies' policies: devise models to allow different types of players easy access to resources; ensure that publicly funded resources are publicly available, either free of charge or at a small distribution cost; encourage/enforce use of best practices or standards in LR production projects; make sustainability and sharing/distribution plans mandatory in projects concerning LR production.
LR citation: appropriate citation of Language Resources, like traditional publications; develop a standard protocol for citing language resources.
KYOTO can be an example.

29 LRE Map: Why?? (www.resourcebook.eu) Problem: lack of information & documentation about resources is, in the e-science paradigm, a very critical issue. Non-documented resources don’t exist!! The Map as an answer to start to fill this gap, but also to encourage the needed “change in culture”. A collective enterprise: each researcher must become aware of the importance of his/her personal engagement in documenting resources, a task as important as creating new resources and not an accessory to be disregarded, as the necessary service to the whole community. Will become an essential instrument to monitor the field.

30 How many LRs & Types at LREC? Submissions: 1288; LR forms: 1994. Corpora: 785; Lexicons: 289; Tagger/Parser: 181; Annotation tool: 134; Ontology: 73; Evaluation data: 40; Annotation Guidelines: 35; ...
How many LRs & Types at COLING? Submissions: 880; LR forms: 735. Corpora: 359 (50%); Tagger/Parser: 81 (11%); Lexicons: 71 (10%); Evaluation data: 51 (7%); Ontology, Annotation tool, Evaluation tool, Tokenizer, NER: < 20 (2%).
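
The percentages on this slide can be recomputed from the raw counts. The short Python sketch below assumes that each percentage is the share of the total number of LR forms (735 for COLING, 1994 for LREC); the slide does not state the denominator explicitly, so that is an assumption.

```python
def shares(counts: dict, total: int) -> dict:
    """Share of each resource type, as a percentage rounded to the nearest integer."""
    return {name: round(100 * n / total) for name, n in counts.items()}


# Figures copied from the slide.
COLING_FORMS = 735
COLING = {"Corpora": 359, "Tagger/Parser": 81, "Lexicons": 71, "Evaluation data": 51}

LREC_FORMS = 1994
LREC = {
    "Corpora": 785,
    "Lexicons": 289,
    "Tagger/Parser": 181,
    "Annotation tool": 134,
    "Ontology": 73,
    "Evaluation data": 40,
    "Annotation Guidelines": 35,
}

print("COLING:", shares(COLING, COLING_FORMS))
print("LREC:  ", shares(LREC, LREC_FORMS))
```

Under that assumption the COLING figures for taggers/parsers, lexicons and evaluation data match the slide (11%, 10%, 7%), while corpora come out at about 49%, which the slide appears to round up to 50%. The same calculation puts corpora at roughly 39% of the LREC forms.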

31 Languages: But obviously … 170!! [Image courtesy of Wordle (http://www.wordle.net)]

32 Availability: freely available! The large majority of resources are freely available. [Chart comparing LREC and COLING; values shown: 3%, 15%, 25%.]

33 The Project META-NET
• META-NET is a Network of Excellence (coord. Hans Uszkoreit) dedicated to fostering the technological foundations of the European multilingual information society
Objectives:
• Prepare the ground for a large-scale concerted effort by building a strategic alliance of national and international research programmes, corporate users and commercial technology providers and language communities
• Strengthen the European research community through research networking and by creating new schemes and structures for sharing resources and efforts
• Build bridges by approaching open problems in collaboration with other research fields such as machine learning, social computing, cognitive systems, knowledge technologies and multimedia content
Final goal: META – The Multilingual Europe Technology Alliance

34 The META Alliance: language communities; policy makers and funding bodies; user industries; provider industries; language technology community; machine learning community; semantic technologies community; cognitive systems community; multimedia content technologies.

35 Founding Members
• Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany
• Barcelona Media – Centre d'Innovació, Spain
• Consiglio Nazionale delle Ricerche – Istituto di Linguistica Computazionale “Antonio Zampolli”, Italy
• Institute for Language and Speech Processing, R.C. “Athena”, Greece
• Charles University in Prague, Czech Republic
• Centre National de la Recherche Scientifique – Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur, France
• Universiteit Utrecht, The Netherlands
• Aalto University, Finland
• Fondazione Bruno Kessler, Italy
• Dublin City University, Ireland
• Rheinisch Westfälische Technische Hochschule Aachen, Germany
• Jožef Stefan Institute, Slovenia
• Evaluations and Language Resources Distribution Agency, France

36 Three Lines of Action. The META-NET objectives translate into three lines of action:

37 The Process (2010, 2011, 2012): communication within META-NET (META-VISION); communication in the wider LT community and among other stakeholders; communication to policy makers, funding bodies, public.

38 Data Intensive Sciences
• Data has become a key factor in LT R&D
• A few indicators: increasing size & importance of the LREC conference, corpora mailing list, etc.; citation ranks of publications on language resources
• Language research and language technology belong to the Data Intensive Sciences
• Expensive data become valuable through sharing
• However, the long-demanded and well-contemplated instruments for managing and sharing this data are still missing

39 META-SHARE: Key Features
• META-SHARE is an open, integrated, secure, interoperable exchange infrastructure (resp. Stelios Piperidis) for language data & tools for the Human Language Technologies domain
• Ever-evolving, scalable, including free and for-a-fee LRs/LTs and services
• Including legacy, contemporary and emerging datasets, tools and technologies
• A marketplace where language data & tools are documented, uploaded and stored in repositories, catalogued and announced, downloaded, exchanged, aiming to support a data economy (includes free and for-a-fee LRs/LTs and also services)
• Standards-compliant, overcoming format, terminological and semantic differences
• Based on distributed networked repositories accessible through common interfaces

40 What we’re offering
• A channel to share and distribute language data and tools
• Technical solutions for building your own repositories
• Protocols and mechanisms for making the descriptions of your resources (and the actual resources) harvestable
• Guidelines and recommendations on standards used in the LR production and documentation processes
• Recommendations on data and tools licensing issues
• Access to large catalogues of documented, high-quality resources, as well as the actual data and tools
KYOTO can be among the first.

41 Features: Single Sign-On; Easy Administration; Metadata Harvesting; Persistent Identifiers (PIDs); Intuitive Search; Open Source; Service-Oriented; Distributed; Replication/Backup; Reporting & Statistics.

42 v0 architecture

43 Challenges on the communication/mobilisation side (slide from N. Calzolari, Multilingual Web, Madrid, 2010)
• A change of culture: convincing arguments that data assets and their value do not necessarily grow if locked in the drawer
• Incentives and models that can convince data holders that there is life after the announcement of data existence and/or sharing (sharing does not necessarily mean for free, nor for unbridled use)
• Interoperability, common metadata, formats, etc.
• In other words, we need to create/reinforce a data economy based on widely agreed principles and rules, mutual understanding, sustainable and adaptive models, simplified copyright rules and licensing models
• The present time window seems appropriate
KYOTO can be a “model” for other projects to follow.

44 Collaborative Resources: LR building as a collaborative “common shared task”. New methodology of work: assemble a comprehensive “map of language data and mechanisms” for the planet’s languages (see LRE Map). Interoperability acquires even more value. Needs consensual planning of common strategies towards shared objectives: not just the sum of many individual efforts, but an organised, well-structured, collective enterprise. Similar to more mature sciences: physicists'/astronomers' experiments … of X,000 people working on the same big enterprise. META-SHARE is a big step that needs a real paradigm shift.

45 Preamble: we wanted more & more data ... Have we been too successful?!? Vision, main statement: where do we (try to) encode what we know about language properties? In annotations. BUT

46 Strategy: a Multilingual Annotation Plan as a Very Large International Initiative. Collaborative Resources: a new paradigm for a big language map. This means a change of mentality: going beyond “individual” research interests, from “my approach” to some “compromise” that allows going for big amounts, integration, building on each other, …

47 From no infrastructure ... to many infrastructures/networks. We were complaining there was no infrastructure ... Have we been too successful?? Now there are many infrastructural/networking initiatives: a very good opportunity, but only if we are able to act in a coordinated & coherent way. Otherwise we spoil & confuse the field.

