Language resources, standardization and modern trends in NLP Simon Krek Jožef Stefan Institute, Artificial Intelligence Laboratory, Slovenia.


1 Language resources, standardization and modern trends in NLP Simon Krek Jožef Stefan Institute, Artificial Intelligence Laboratory, Slovenia

2 COST Action

3 Working Groups / Objectives
- WG1: Integrated interface to European dictionary content
- WG2: Retro-digitized dictionaries
- WG3: Innovative e-dictionaries
- WG4: Lexicography and lexicology from a pan-European perspective

4 Innovative e-dictionaries
The third working group will focus on the development of digitally born dictionaries, focusing on the latest developments in e-lexicography and the interface between lexicography and computational linguistics. Work will be carried out on:
- the analysis of the possible impact of automatic acquisition of lexical data
- the analysis of the interface between dictionary and computational lexica (cf. wordnets) and syntactically and semantically annotated corpora (cf. FrameNet, SemCor, Senseval)
- the investigation of the possible use of dictionary content for computational linguistic applications

5 Electronic lexicography in the 21st century
- The first eLex conference: New challenges, new applications, Louvain-la-Neuve (Belgium), 22 to 24 October 2009
- The second eLex conference: New applications for new users, Bled (Slovenia), 10 to 12 November 2011
- The third eLex conference: Thinking outside the paper, Tallinn (Estonia), 17 to 19 October 2013
- The fourth eLex conference: Linking Lexical Data in the digital age, Herstmonceux Castle (UK), 11 to 13 August 2015

6 eLex 2011
Language data for digital natives: old wine in a new bottle or...?
- Text mining is a challenge
- Content is a problem
- Presentation is a bigger problem

7 What is in the middle?
- (Web, Mobile) Design
- Lexicography
- Natural Language Processing
- ?
- Text mining is a challenge
- Content is a problem
- Presentation is a bigger problem

8 Sinclair: Floating dictionary (2001)
»A few years ago I felt that the time was ripe to plan a new kind of dictionary, one that would never exist on paper, but would be automatic or almost automatic in its self-updating. It would, so to speak, float on top of a corpus, rather like a jellyfish, its tendrils constantly sensing the state of the language. As well as reporting on the settled usage and meanings of the words and phrases of a language, like a normal dictionary does, the floating dictionary, when interrogated, dips into the corpus and checks this information, offering instances that match its criteria for the senses; also it explores further to see if there are any instances that conflict with the criteria, and may signify a development of a sense or the emergence of a new usage altogether. Within the limits of its powers, it organises this evidence as a comment on the existing dictionary entry.«

9 Does dictionary content know itself?
- The LT community now has a basic idea of how to store various types of information
- So does the SW community: RDF, RDFa, RDFS, OWL, SKOS, and more
- Standardization in human-oriented dictionary encoding was never really successful (XML, TEI?)
- The question is: if different types of lexicographic information intended for human users have to know each other, will the format be dictated by LT standards? (Probably yes.)
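
To make the idea of machine-readable dictionary content concrete, here is a minimal sketch of one dictionary sense expressed as RDF-style triples using SKOS property names. It uses only the Python standard library. The example.org namespace, the entry identifier and the entry text are hypothetical and are not taken from any project mentioned in the slides.

```python
# Minimal sketch: one dictionary sense as RDF-style triples (stdlib only).
# The example.org namespace and the entry itself are hypothetical.
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/lexicon/"  # hypothetical namespace


def literal(text, lang):
    """Build a language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def entry_to_triples(entry_id, lemma, lang, definition, example):
    """Return (subject, predicate, object) triples for one sense."""
    subject = f"<{EX}{entry_id}>"
    return [
        (subject, f"<{RDF}type>", f"<{SKOS}Concept>"),
        (subject, f"<{SKOS}prefLabel>", literal(lemma, lang)),
        (subject, f"<{SKOS}definition>", literal(definition, lang)),
        (subject, f"<{SKOS}example>", literal(example, lang)),
    ]


def to_ntriples(triples):
    """Serialize triples in N-Triples style: one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = entry_to_triples(
    "jellyfish-1",
    "jellyfish",
    "en",
    "a free-swimming marine animal with a soft body and tentacles",
    "A jellyfish floated near the surface.",
)
print(to_ntriples(triples))
```

Once entries are stored as triples like these, a definition, an example and a label are separately addressable, which is the sense in which different types of lexicographic information could "know each other".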

10 Similar domain, different task
EU projects: http://www.xlike.org/, http://xlime.eu/
The goal of the XLike project is to develop technology to monitor and aggregate knowledge that is currently spread across mainstream and social media, and to enable cross-lingual services for publishers, media monitoring and business intelligence.
xLiMe proposes to extract knowledge from different media channels and languages and relate it to cross-lingual, cross-media knowledge bases. By doing this in near real-time we will provide a continuously updated and comprehensive view on knowledge diffusion across media.

11 Services
- Newsfeed: a clean, continuous, real-time aggregated stream of semantically enriched news articles from RSS-enabled sites across the world
  http://newsfeed.ijs.si/visual_demo/
  http://enrycher.ijs.si/
- EventRegistry: a system that can analyze news articles and identify world events; it can identify groups of articles in different languages that describe the same event
  http://eventregistry.org/
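
As a rough illustration of what "groups of articles in different languages that describe the same event" could mean computationally, the sketch below clusters articles greedily by the overlap of their concept sets. It is not EventRegistry's actual method. It assumes that semantic enrichment has already mapped each article to language-independent concept identifiers; the identifiers, the article IDs and the 0.5 threshold are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_into_events(articles, threshold=0.5):
    """Greedy single-pass clustering of (article_id, concept_set) pairs.

    Each article joins the most similar existing event if the similarity
    reaches the threshold; otherwise it starts a new event.
    """
    events = []  # each event: {"concepts": set, "articles": [ids]}
    for art_id, concepts in articles:
        best, best_sim = None, 0.0
        for event in events:
            sim = jaccard(concepts, event["concepts"])
            if sim > best_sim:
                best, best_sim = event, sim
        if best is not None and best_sim >= threshold:
            best["articles"].append(art_id)
            best["concepts"] |= concepts
        else:
            events.append({"concepts": set(concepts), "articles": [art_id]})
    return events


# Hypothetical articles in three languages, already mapped to shared concepts.
articles = [
    ("en-1", {"Q_earthquake", "Q_Nepal", "Q_Kathmandu"}),
    ("sl-1", {"Q_earthquake", "Q_Nepal", "Q_rescue"}),
    ("de-1", {"Q_election", "Q_UK"}),
    ("en-2", {"Q_election", "Q_UK", "Q_parliament"}),
]

for event in group_into_events(articles):
    print(event["articles"])
```

Because the comparison runs on shared concept identifiers rather than on words, the English and Slovene earthquake articles end up in one group even though they share no surface vocabulary.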

12 EventRegistry system architecture

13 ENeL perspective
Complex story about events = complex story about words/languages
- Cross-lingual horizontal axis: Slovene, Estonian, English, German, French, Hungarian, Croatian, Basque, Swedish, …
- Diachronic vertical axis: 2015, 1950, 1900, 1850, 1800, …

14 Cross-lingual synchronic horizontal axis
"Never without data"
- Existing lexical resources (dictionaries, BabelNet, AnyNet, Linked Data, etc.)
- Corpora, the Web and NLP
  - Definition extraction (and generation)
    - RANLP 2009, International workshop on definition extraction
    - Language Technology for eLearning (http://www.lt4el.eu/)
  - Extraction of grammatical or lexical information
    - Kookkurrenzdatenbank (http://corpora.ids-mannheim.de/ccdb/)
    - Sketch Engine (http://www.sketchengine.co.uk/)
  - Extraction of good (dictionary) examples
    - ENeL Vienna workshop
  - Extraction of translation equivalents
    - Linguee etc.
  - Extraction of Multi-word Expressions (Parseme)
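
Extraction of lexical information from corpora can be illustrated with a small collocation scorer. The sketch below counts adjacent word pairs in a toy corpus and ranks them with logDice, an association measure defined as 14 + log2(2·f(x,y) / (f(x) + f(y))). It is a simplification under stated assumptions: real tools typically work on lemmatized, parsed text and grammatical relations rather than raw adjacency, and the corpus and the `min_freq` cut-off here are hypothetical.

```python
import math
from collections import Counter


def logdice_bigrams(sentences, min_freq=2):
    """Score adjacent word pairs with logDice; return pairs sorted by score."""
    word_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenization
        word_freq.update(tokens)
        pair_freq.update(zip(tokens, tokens[1:]))

    scores = {}
    for (w1, w2), f_xy in pair_freq.items():
        if f_xy >= min_freq:  # skip pairs seen too rarely to be reliable
            dice = 2 * f_xy / (word_freq[w1] + word_freq[w2])
            scores[(w1, w2)] = 14 + math.log2(dice)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus.
corpus = [
    "strong tea is served with milk",
    "she drinks strong tea every morning",
    "a strong wind blew all night",
    "heavy rain fell all night",
    "heavy rain stopped the match",
]

for pair, score in logdice_bigrams(corpus):
    print(pair, round(score, 3))
```

In this toy corpus "heavy rain" scores higher than "strong tea" because "strong" also occurs with "wind", which lowers the pair's share of the individual word frequencies.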

15 Automatically Constructed Dictionary Content Complex multimodal information extraction

16 Explain, combine, exemplify
- Definitions
  - Found
  - Generated
- Combinations
  - Collocations
    - as subject
    - as object
  - Multi-word expressions
- Knowledge-Rich Contexts

17 Real-time data
- Streaming
- Twitter
- News Feeds

18 Sounds, graphics and visuals
- Sounds: Speech Synthesis, Recorded / Speech Recognition
- Graphics: Images, Videos

19 Multi-lingual, cross-lingual
- (Hidden) parallel corpora
- hub language

20 ENeL
- WG1: Integrated interface to European dictionary content
- WG2: Retro-digitized dictionaries
- WG3: Innovative e-dictionaries
- WG4: Lexicography and lexicology from a pan-European perspective

21 ENeL
- WG1: Integrated interface to European dictionary content
- WG2: Retro-digitized dictionaries
- WG3: Innovative e-dictionaries
- WG4: Lexicography and lexicology from a pan-European perspective

22 Retro-digitization
- Digital Agenda for Europe (Europe 2020 Strategy – one of the pillars)
- Commission’s Recommendation on the digitization and online accessibility of cultural material and digital preservation
  - Put in place solid plans for their investments in digitization and foster public-private partnerships to share the gigantic cost of digitization (recently estimated at € 100 billion).
  - Make 30 million objects available through Europeana by 2015, including all Europe's masterpieces which are no longer protected by copyright, and all material digitized with public funding.

23 Retro-digitized dictionaries
- encode and enrich dictionary data (standards and tools)
  (the question is: if different types of lexicographic information intended for human users will have to know each other – will the format be dictated by LT standards?)
  - definitions
  - examples
  - etymology
  - other types of information
- linking dictionary data with historical corpora: http://nl.ijs.si/imp/

24 Lexical Cloud

25 Integrated interface to European (dictionary / lexical) content
- Any dictionary
- Anypedia
- AnyNet
- Any corpus
- Any base

26 Conclusion
- any word/concept in any language on any device offers a story about its current life and its history
- what is a "concept" (in the sense of "event")? X-Nets? Wikipedia?
- what is the central format?
- what is the appropriate context? EU projects? ICT? Cultural Heritage? Infrastructure (e.g. Clarin)?
