Presentation on theme: "N. Calzolari, 2nd KYOTO Workshop, Gifu, Japan, January 2011. Nicoletta Calzolari, Istituto di Linguistica Computazionale – CNR – Pisa". Presentation transcript:
Nicoletta Calzolari, Istituto di Linguistica Computazionale – CNR – Pisa. The Future of KYOTO … with some historical notes to show a path along an evolving vision. Language Resources in today's EU context: META-SHARE, ...
Why are such needed LRs still lacking after 30 years of R&D in the field? (Old slide with Antonio Zampolli, ’80s/early ’90s.) 1) Because the main trend until the mid-’80s was to privilege the processing of so-called “critical” phenomena, studied by the dominant linguistic theories, rather than to focus on the deep analysis of the real uses of a language. As a result, CL was focusing on: few examples, often artificially built; lexicons made of few entries (toy lexicons); grammars with poor coverage. 2) Because large-scale LRs are costly and their production requires a big organizing effort. Why do we still lack them??
Historical notes: … back from the early ’80s. Pioneering research: automatic acquisition of lexical information from MRDs. It was my first research and became central in the Pisa group (ACQUILEX), and also for Amsler, Briscoe, Boguraev, Wilks’ group, IBM, then Japanese groups, … The trend was “large-scale computational methods for the transformation of machine readable dictionaries into machine tractable dictionaries”, instead of relying on linguists’ introspection. It became evident that part of the results of meaning extraction, e.g. many meaning distinctions that could be generalised over lexicographic definitions and automatically captured, were unmanageable at the formal representation level and had to be blurred into unique features and values. Unfortunately, it is still difficult today to constrain word meanings within a rigorously defined organization: by their very nature they tend to evade any strict boundaries.
Historical notes: … back from the late ’80s. After acquisition from MRDs, automatic acquisition of info from texts. This trend has today become a consolidated and pervasive fact: from acquisition of “linguistic information” to acquisition of “general knowledge”, with more data-intensive, robust, reliable methods. Lessons learned: the need for adequate models to handle actual usage of language; the (in)adequacy of (current) lexicons; going from core sets to large coverage has implications not just in quantitative terms but, more interestingly, in terms of changes to the models and the strategies of processes.
After the “Grosseto Workshop” (1985): a turning point. MultiLex, GeneLex, AcquiLex, Xxx-Lex. A. Zampolli: “Let’s be coherent: Xxx-Lex”.
ISO LMF (Lexical Markup Framework). Structural skeleton, with the basic hierarchy of information in a lexical entry, plus various extensions. Modular framework. LMF specs comply with UML modelling principles; an XML DTD allows implementation. Builds on EAGLES/ISLE. The field is mature. New initiatives: NEDO Asian Languages, NICT Language-Grid Service Ontology, ICT KYOTO, LIRICS, LexInfo, …
KYOTO: the “resource” perspective. A search environment using semantic technologies, a “compass” for the Web 2.0. Interdisciplinarity: scientific community (LRT, web technologies, knowledge engineers), companies, domain experts. Multilingualism: 7 languages (2 Asian languages). Needs to share lexical/knowledge bases and tools, both general and domain-related, in the form of lexical/ontological and software repositories. The Kyoto Core System is open and free.
KYOTO Annotation Format (KAF), from Piek Vossen. A multi-level, stand-off annotation format giving a uniform representation for 7 languages. Layers shared across the languages: Text (tokenisation, sentences, paragraphs, with reference to the sources); Terms (words and multi-words, parts of speech, etc.); Chunks (constituents and syntagmatic realization); Dependencies (grammatical functions). L1 semantic modules: multiword tagging, sense tagging, named entity recognition, OntoTagging. L2 semantic module: event/fact extraction.
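The stand-off layering described on this slide can be illustrated with a minimal sketch. It is not the actual KAF schema: the class names, field names and identifiers (`Token`, `Term`, `Dep`, `w1`, `t1`, `rfunc`) are hypothetical and chosen only to show the idea that each layer points to the layer below by identifier, while the source text itself is never modified.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Token:
    """Text layer: one token with a character offset into the source."""
    tid: str
    text: str
    offset: int
    sent: int


@dataclass
class Term:
    """Term layer: a word or multi-word that refers to tokens by id only."""
    tid: str
    lemma: str
    pos: str
    span: List[str]


@dataclass
class Dep:
    """Dependency layer: a grammatical function between two terms."""
    head: str
    dependent: str
    rfunc: str


def tokenise(source: str) -> List[Token]:
    """Whitespace tokenisation that records offsets (hypothetical helper)."""
    tokens: List[Token] = []
    cursor = 0
    for i, word in enumerate(source.split(), start=1):
        offset = source.index(word, cursor)
        tokens.append(Token(f"w{i}", word, offset, sent=1))
        cursor = offset + len(word)
    return tokens


def surface(term: Term, tokens_by_id: Dict[str, Token]) -> str:
    """Recover the surface string of a term by following its token ids."""
    return " ".join(tokens_by_id[t].text for t in term.span)


source = "Sea levels rise"
tokens = tokenise(source)
by_id = {t.tid: t for t in tokens}

# The multi-word term t1 spans two tokens; the dependency layer refers to terms.
terms = [
    Term("t1", "sea level", "N", ["w1", "w2"]),
    Term("t2", "rise", "V", ["w3"]),
]
deps = [Dep(head="t2", dependent="t1", rfunc="subj")]
```

Because every layer is a separate list keyed by identifiers, a semantic module could add its own layer (for example sense tags on terms) without touching the text or the layers produced earlier, which is the property the slide attributes to stand-off annotation.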
KYOTO System & Adoption of Standards [diagram, from Piek Vossen]. Labels shown: source documents; linear and generic annotation formats (MAF/SYNAF, SEMAF, TMF, FACTAF); semantic annotation; term extraction (Tybot); fact extraction (Kybot); domain editing (Wikyoto); wordnet and domain wordnet (LMF API); ontology and domain ontology (OWL API); concept user; fact user. Could this be the basis of a new standard?
A common representation format for WordNets: endow WordNet with a representation format allowing easy access, integration and interoperability among resources (Wn IT, Wn EN, Wn EU, Wn NL, Wn JP, Wn CH, Wn ES).
[Class diagram, from Monica Monachini.] Elements shown: LexicalResource; GlobalInformation; Lexicon; LexicalEntry; Lemma; Sense; MonolingualExternalRefs / MonolingualExternalRef; Synset; Definition; Statement; SynsetRelations / SynsetRelation; SenseAxes / SenseAxis; InterlingualExternalRefs / InterlingualExternalRef; Meta (0..1 or 0..*); Data Categories.
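As a rough illustration of how the elements named in this diagram could fit together in XML, here is a minimal sketch built with Python's standard `xml.etree.ElementTree`. The element names come from the slide; the nesting shown, and every attribute name and value (`writtenForm`, `synset`, `relType`, the ids, the gloss), are assumptions made for the example and are not taken from the WordNet-LMF DTD.

```python
import xml.etree.ElementTree as ET


def build_resource() -> ET.Element:
    """Build a toy LexicalResource tree (assumed nesting, hypothetical attributes)."""
    root = ET.Element("LexicalResource")
    ET.SubElement(root, "GlobalInformation", label="toy example")

    # One monolingual lexicon with a single entry and the synset it points to.
    lexicon = ET.SubElement(root, "Lexicon", language="en")
    entry = ET.SubElement(lexicon, "LexicalEntry", id="le_river")
    ET.SubElement(entry, "Lemma", writtenForm="river")
    ET.SubElement(entry, "Sense", id="river_1", synset="en-syn-1")

    synset = ET.SubElement(lexicon, "Synset", id="en-syn-1")
    definition = ET.SubElement(
        synset, "Definition", gloss="a large natural stream of water"
    )
    ET.SubElement(definition, "Statement", example="the river was navigable")
    relations = ET.SubElement(synset, "SynsetRelations")
    ET.SubElement(
        relations, "SynsetRelation", target="en-syn-2", relType="has_hyperonym"
    )

    # Multilingual level: sense axes sit next to the lexicons, not inside them.
    axes = ET.SubElement(root, "SenseAxes")
    ET.SubElement(axes, "SenseAxis", id="sa_1", relType="eq_synonym")
    return root


root = build_resource()
xml_text = ET.tostring(root, encoding="unicode")
```

Serialising with `ET.tostring` and parsing the result back with `ET.fromstring` is a quick way to check that the toy tree is well-formed XML.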
A list of 85 semantic relations, resulting from a mapping of the KYOTO WordNet grid: inter-WN and intra-WN.
WordNet-LMF, multilingual level: cross-lingual relations (from Monica Monachini). The multilingual level groups monolingual synsets (SWN, IWN, ..., WN n) that correspond to each other and share the same relations to English, specifies the type of correspondence, and links to the ontology/ontologies.
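The grouping idea on this slide can be sketched as a small data structure. This is only an illustration under stated assumptions: the class `SenseAxis`, its fields, the function `find_equivalent`, and all synset and ontology identifiers are hypothetical, and the real WordNet-LMF encoding may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SenseAxis:
    """One cross-lingual grouping of corresponding monolingual synsets."""
    axis_id: str
    rel_type: str                      # type of correspondence
    targets: Dict[str, str]            # language code -> synset id
    ontology_ref: Optional[str] = None # optional link to an ontology class


def find_equivalent(
    synset_id: str, source_lang: str, target_lang: str, axes: List[SenseAxis]
) -> Optional[str]:
    """Return the synset of target_lang grouped with synset_id, if any."""
    for axis in axes:
        if axis.targets.get(source_lang) == synset_id:
            return axis.targets.get(target_lang)
    return None


axes = [
    SenseAxis(
        axis_id="sa_1",
        rel_type="eq_synonym",
        targets={"en": "en-syn-1", "it": "it-syn-7", "es": "es-syn-3"},
        ontology_ref="onto:River",
    )
]

italian_to_spanish = find_equivalent("it-syn-7", "it", "es", axes)
```

With one axis per group of corresponding synsets, any pair of languages can be connected without storing a separate mapping for each language pair.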
Complex picture! Is there anything we need to do for interoperability? Work within ISO: LMF, an abstract meta-model for lexical representation; an Ontology Group, or more groups?; Language Resource Ontologies, i.e. an ontology of data categories. Real life: lexicons (e.g. WordNets) that are called ontologies; lexicons linked to ontologies, to be used in applications, in multilingual systems, domains, …; work on “ontologising” lexicons, to allow exploiting various relations, making inferences, …; semantic lexicons with many types of relations among semantic units, which are often of a “conceptual/world-knowledge” nature. Do we want DCs for these? New work item (PWI): ISO SC 4/WG 4, Lexicon-Ontology relations. KYOTO can contribute.
To explore the need to do something within ISO about the relations between Lexicon and Ontology. Do we/ISO need to address another (lexical) layer? How are lexicons and ontologies linked, and how is information mapped from one to the other? The ontological layer in, or connected to, a lexicon. Possible issues/questions: Is LMF enough to represent ontological links? How to connect the work being done in the ISO Lexical group and the ISO Ontology groups? Lexicon and ontologies: separation, lexicalised ontologies, or ontologised lexicons? Lexicon, ontologies and domains. On a very different dimension: an ontology of lexical/semantic/conceptual categories? Standardised semantic categories, ontology labels? Relation to multilinguality ... KYOTO can contribute.
Input to Multilingual Web. The MultilingualWeb project is exploring standards and best practices that support the creation, localization and use of multilingual web-based information. It aims to raise the visibility of existing best practices and standards and to identify gaps. The core vehicle for this is a series of four workshops, for networking across communities that span the various aspects involved. The next workshop, “Content on the Multilingual Web” (4-5 April 2011, Pisa, Italy), is on best practices aimed at development of content for the Web, including creation of content ranging from personal authoring for blogs and social networking sites to development of large corporate or organizational enterprises. KYOTO can contribute.
A new paradigm of R&D in LRs & LT. For a few years now: open and distributed linguistic infrastructures for LRs & LT. Adopting the paradigm of accumulation of knowledge, so successful in more mature disciplines, based on sharing LRs, tools and results. Ability to build on each other's achievements, allowing controlled and effective cooperation of many groups on common tasks (see the Human Genome Project), e.g. initiatives to achieve international consensus on annotation guidelines. Emerging concept of collective intelligence. Emphasize interoperability among LRs & LT.
Some steps for a “new generation” of LRs. From huge efforts building static, large-scale, general-purpose LRs to dynamic LRs rapidly built on demand, tailored to specific user needs. From closed, locally developed and centralized resources to LRs residing in distributed places, accessible on the web, choreographed by agents acting over them. From Language Resources to Language Services. Need for an infrastructure that makes this vision operational.
Lexical WEB, as a critical step for semantic mark-up in the Semantic Web. [Diagram] Lexicons (ComLex, SIMPLE, WordNets, FrameNet, NomLex, Lex_x, Lex_y) with intelligent agents; Global WordNet GRID; BioLexicon; SIMPLE-WEB. Standards for content interoperability: enough??
(Distributed) Language Services. Content interoperability standards; supra-national cooperation; architectures enabling accessibility. Collaborative and collective/social development and validation, cross-resource integration and exchange of information. Create new resources on the basis of existing ones. Exchange and integrate information across repositories. Compose new services on demand. Can KYOTO contribute?
Which communities? Many dimensions, today. Core: Language Resources, Language Technologies, Standardisation, Content/Ontologies, System developers, Integrators, SSH. Also: EC, national funding agencies, industry. Many applications/domains: MT, CLIR, …, e-government, content industry, intelligence, e-culture, e-health, domotics, … Many LRs & LTs exist, but a global vision, policy and strategy is needed. Initiatives: CLARIN (for SSH), FLaReNet (Network), META-NET (NoE); EU Forum with focus on cooperation. Need to consider together technical, organisational, strategic, economic, social, cultural, legal and political issues with respect to LRs & LTs.
FLaReNet at a glance (Fostering Language Resources Network). An international Forum to facilitate interaction, to: overcome the fragmentation in LR & LT and recreate a community; anticipate the needs of new types of LR & LT and Language Infrastructures; create a shared policy for the next years; foster a European strategy for consolidating the sector. Institutional Members; from 33 countries; 351 Individual Subscribers. Community mobilisation is essential (also to prepare the ground for a RI). A “roadmap”: a plan of actions as input to policy development. A (EU) model for the LRs/LTs area of the next years. Ambitious!
Results from Vienna & Barcelona Forum: Shaping the Future of the Multilingual Digital Europe. Standards, interoperability and metadata are topics to be approached in cooperation: create a shared repository of data formats, annotations, etc. as a major help to achieve standardisation; coordinate input to ISO/W3C standardisation work. Access to LRs is critical and should involve the whole community: common repositories for tools and language data should be established that are universally and easily accessible by everyone. Need to create the means to plug together different LR & LT in a web-based resource and technology “grid”, for a new world-wide language infrastructure.
2nd Blueprint. Result of a permanent and cyclical consultation: inside the community it represents and outside it, through connections with neighbouring projects, associations, initiatives, funding agencies. Organised along three main “directions”: Infrastructural Aspects; Research and Development; Political and Strategic Issues. These reflect three major development factors that can boost or hinder the growth of the field of LRT. Provide feedback!
Sources: many meetings. Operational Interoperability Asian Collaboration Workshop FL-SILT Workshop Lexicon/Ontology Standards NEERI 2nd FLaReNet Forum Less-resourced Languages Automatic Acquisition Legal Issues Standards International Cooperation
3rd FLaReNet Forum. The European Language Resources and Technologies Forum: important role in defining recommendations. In Barcelona: 120 participants from 22 countries. Previous Proceedings & Reports are on the web. Define final recommendations. The Blueprint will be discussed, also for adoption and endorsement by FLaReNet Institutional Members.
Infrastructural Aspects (Issue / Challenge / Recommended Actions). Metadata. Interoperability of metadata sets: set up a global infrastructure of common and uniform and/or interoperable metadata sets. Metadata usable both by humans and by machines: create machine-understandable metadata with formal syntax and clear semantics. Automate the process of metadata creation: develop structured metadata. Documentation. Reliable documentation: reliable documentation of LRs according to common best practices. Collect documentation: collect all possible and existing LR documentation; devise and adopt a widely agreed standard documentation template for all types of resources.
Political and Strategic dimensions (Issue / Challenge / Recommended Actions). Funding agencies' policies: devise models to allow different types of players easy access to resources; ensure that publicly funded resources are publicly available, either free of charge or at a small distribution cost; encourage/enforce use of best practices or standards in LR production projects; make sustainability and sharing/distribution plans mandatory in projects concerning LR production. LR citation: appropriate citation of Language Resources, like traditional publications; develop a standard protocol for citing language resources. KYOTO can be an example.
LRE Map: Why?? Problem: lack of information and documentation about resources is, in the e-science paradigm, a very critical issue. Non-documented resources don’t exist!! The Map is an answer to start filling this gap, but also to encourage the needed “change in culture”. A collective enterprise: each researcher must become aware of the importance of his/her personal engagement in documenting resources, a task as important as creating new resources and not an accessory to be disregarded, as a necessary service to the whole community. It will become an essential instrument to monitor the field.
How many LRs & Types at LREC? Submissions: 1288; LR forms: [value missing]. Corpora: 785; Lexicons: 289; Tagger/Parser: 181; Annotation tool: 134; Ontology: 73; Evaluation data: 40; Annotation Guidelines: [value missing]. How many LRs & Types at COLING? Submissions: 880; LR forms: 735. Corpora: [value missing]%; Tagger/Parser: [value missing]%; Lexicons: [value missing]%; Evaluation data: [value missing]%; Ontology, Annotation tool, Evaluation tool, Tokenizer, NER: < [value missing]%.
Languages: But obviously … !! (image courtesy of Wordle, http://www.wordle.net)
Availability: freely available! The vast majority of resources are freely available. [Chart comparing LREC and COLING; values shown: 3%, 15%, 25%]
The Project META-NET. META-NET is a Network of Excellence (coord. Hans Uszkoreit) dedicated to fostering the technological foundations of the European multilingual information society. Objectives: prepare the ground for a large-scale concerted effort by building a strategic alliance of national and international research programmes, corporate users, commercial technology providers and language communities; strengthen the European research community through research networking and by creating new schemes and structures for sharing resources and efforts; build bridges by approaching open problems in collaboration with other research fields such as machine learning, social computing, cognitive systems, knowledge technologies and multimedia content. Final goal: META – The Multilingual Europe Technology Alliance.
The META Alliance: language communities; policy makers and funding bodies; user industries; provider industries; language technology community; machine learning community; semantic technologies community; cognitive systems community; multimedia content technologies.
Founding Members: Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany; Barcelona Media – Centre d'Innovació, Spain; Consiglio Nazionale delle Ricerche – Istituto di Linguistica Computazionale “Antonio Zampolli”, Italy; Institute for Language and Speech Processing, R.C. “Athena”, Greece; Charles University in Prague, Czech Republic; Centre National de la Recherche Scientifique – Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur, France; Universiteit Utrecht, The Netherlands; Aalto University, Finland; Fondazione Bruno Kessler, Italy; Dublin City University, Ireland; Rheinisch-Westfälische Technische Hochschule Aachen, Germany; Jožef Stefan Institute, Slovenia; Evaluations and Language Resources Distribution Agency, France.
Three Lines of Action. The META-NET objectives translate into three lines of action:
The Process: communication within META-NET (META-VISION); communication in the wider LT community and among other stakeholders; communication to policy makers, funding bodies and the public.
Data has become a key factor in LT R&D. A few indicators: increasing size and importance of the LREC conference, corpora mailing list, etc.; citation ranks of publications on language resources. Language research and language technology belong to the Data Intensive Sciences. Expensive data become valuable through sharing. However, the long-demanded and well-contemplated instruments for managing and sharing this data are still missing.
META-SHARE: Key Features. META-SHARE is an open, integrated, secure, interoperable exchange infrastructure (resp. Stelios Piperidis) for language data and tools for the Human Language Technologies domain: ever-evolving and scalable, including legacy, contemporary and emerging datasets, tools and technologies. A marketplace where language data and tools are documented, uploaded and stored in repositories, catalogued and announced, downloaded and exchanged, aiming to support a data economy (includes free and for-a-fee LRs/LTs and also services). Standards-compliant, overcoming format, terminological and semantic differences. Based on distributed networked repositories accessible through common interfaces.
What we’re offering. A channel to share and distribute language data and tools. Technical solutions for building your own repositories. Protocols and mechanisms for making the descriptions of your resources (and the actual resources) harvestable. Guidelines and recommendations on standards used in the LR production and documentation processes. Recommendations on data and tools licensing issues. Access to large catalogues of documented, high-quality resources, as well as the actual data and tools. KYOTO can be among the first.
Features: Single Sign-On; Easy Administration; Metadata Harvesting; Persistent Identifiers (PIDs); Intuitive Search; Open Source; Service-Oriented; Distributed; Replication/Backup; Reporting & Statistics.
Challenges on the communication/mobilisation side. A change of culture: convincing arguments that data assets and their value do not necessarily grow if locked in the drawer. Incentives and models that can convince data holders that there is life after the announcement of data existence and/or sharing (sharing does not necessarily mean for free, nor for unbridled use). Interoperability, common metadata, formats, etc. In other words, we need to create/reinforce a data economy based on widely agreed principles and rules, mutual understanding, sustainable and adaptive models, simplified copyright rules and licensing models. The present time window seems appropriate. KYOTO can be a “model” for other projects to follow. (Slide from N. Calzolari, Multilingual Web, Madrid, 2010.)
Collaborative Resources. LR building as a collaborative “common shared task”. New methodology of work. Assemble a comprehensive “map of language data and mechanisms” for the planet’s languages (LRE Map). Interoperability acquires even more value. Needs consensual planning of common strategies towards shared objectives: not just the sum of many individual efforts, but an organised, well-structured, collective enterprise, similar to more mature sciences (physicists’/astronomers’ experiments … of X,000 people working on the same big enterprise). META-SHARE is a big step that needs a real paradigm shift.
Preamble: We wanted more & more data... Have we been too successful?!? BUT ... Main statement (vision): Where do we (try to) encode what we know about language properties? In annotations.
Strategy: a Multilingual Annotation Plan as a very large international initiative. Collaborative Resources: a new paradigm for a big language map. This means a change of mentality: going beyond “individual” research interests, from “my approach” to some “compromise” that allows going for big amounts, integration, building on each other, …
From no infrastructure... to many infrastructures/networks. We were complaining there was no infrastructure... Have we been too successful?? Now there are many infrastructural/networking initiatives. A very good opportunity, but only if we are able to act in a coordinated and coherent way. Otherwise we spoil and confuse the field.