CLARIN-PL – Research User-driven Language Technology Infrastructure
Maciej Piasecki, Wrocław University of Technology, G4.19 Research Group

1 CLARIN-PL – Research User-driven Language Technology Infrastructure. Maciej Piasecki, Wrocław University of Technology, G4.19 Research Group

2 Basic Notions
- Language Technology (LT)
  - language resources and tools
  - robust in terms of quality and coverage
  - multipurpose
  - component based
- Language Technology Infrastructure
  - a software framework (architecture or platform)
  - for combining language tools with language resources into processing chains (or pipelines)
  - the defined processing chains are then applied to language data sources
  - interoperability, also with external systems
Humanistyka Cyfrowa Warszawa CLARIN-PL
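The notion of a processing chain can be illustrated with a minimal Python sketch. The tools below (`split_sentences`, `tokenize`, `lowercase`) and the `chain` helper are hypothetical stand-ins invented for this illustration; they are not CLARIN-PL components, and real language tools would be far more robust.

```python
import re
from functools import reduce

def split_sentences(text):
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentences):
    # Naive tokenizer: runs of word characters or single punctuation marks.
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

def lowercase(tokenized):
    # Normalise every token to lower case.
    return [[t.lower() for t in sent] for sent in tokenized]

def chain(*tools):
    # Compose tools left to right: the output of one is the input of the next.
    return lambda data: reduce(lambda acc, tool: tool(acc), tools, data)

# Define the chain once, then apply it to a language data source.
pipeline = chain(split_sentences, tokenize, lowercase)
result = pipeline("Ala ma kota. Kot pije mleko!")
print(result)
```

Because every tool consumes the previous tool's output, a tool can be swapped for a better one without touching the rest of the chain, provided the data format between steps stays the same. That is the interoperability requirement in miniature.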

3 LT in Humanities and Social Sciences: Barriers
- Physical – language tools and resources are not accessible on the Internet
- Informational – descriptions are not available or there is no means of searching them
- Technological – lack of commonly accepted standards for LT, lack of a common platform, a variety of technological solutions, underpowered user computers
- Related to knowledge – the use of LT requires programming skills or knowledge of natural language engineering
- Legal – licences for language resources and tools (LRTs) limit their applications

4 CLARIN Support for Humanities & Social Sciences
- CLARIN is an ERIC-type consortium of
  - 11 countries (Austria, Bulgaria, Czech Republic, Denmark, Estonia, Germany, Lithuania, The Netherlands, Poland, Portugal, Sweden) and The Dutch Language Union
  - 1 observer: Norway
- Focus area:
  - Supporting research in Humanities and Social Sciences
  - Users: researchers, PhD students, students and scientific institutions
- CLARIN Mission
  - To significantly lower the barriers to the use of Language Technology in Humanities & Social Sciences (H&SS)
  - To facilitate or enable research methods based on automated analysis of text and speech resources

5 CLARIN Offer
- Integration of different LT components into one interoperable system
- Common, flexible meta-data standard (CMDI)
- Central searching for resources (Virtual Language Observatory)
- One sign on and one login into the distributed infrastructure
- Decreased Physical and Informational Barriers
- Common standards: promoting, co-ordinating, harmonising
- Web Services for Language Tools and Resources
- Decreased Technological Barrier
- Installation-free, access via Web Applications
- Decreased Knowledge Barrier
- Common licences and promotion of the open access
- Decreased Legal Barrier

6 CLARIN: Portal

7 CLARIN: Virtual Language Observatory

8 CLARIN: Federated Content Search – Searching Corpora

9 LTI Development Paradigms
- Bottom-up
  - a collected offer approach
  - based on linking together the already existing Language Resources and Tools
  - focused on accessibility, technical interoperability and processing chains
- Top-down
  - following the user-centred design paradigm
  - research applications for H&SS are a starting point
- Bi-directional
  - linking of Language Resources and Tools
  - combined with the development of research applications

10 Bi-directional LTI Development
- Idea
  - development of the necessary elements
    - a distributed network infrastructure
    - basic LT processing chain
  - combined with a user-centred approach to the development of research applications
- Top-down part
  - close co-operation with key users from the H&SS domain
  - loosely modelled on Agile-like lightweight software design methods, with an emphasis on prototyping
  - amendments to the shape of the technical basis: LRTs, standards
  - inspirations, identification of further user needs, next iterations

11 CLARIN-PL: the Consortium
- Polish scientific consortium
  - Wrocław University of Technology, G4.19 Research Group
  - Institute of Computer Science, Polish Academy of Sciences
  - Polish-Japanese Institute of Information Technology, Chair of Multimedia
  - University of Łódź, PELCRA group at Chair of English Language and Applied Linguistics
  - Institute of Slavic Studies, Polish Academy of Sciences
  - Wrocław University
- Goal: implementation of the Polish part of the CLARIN ERIC LTI
- Follows the bi-directional approach to LTI development

12 CLARIN-PL: Mission
- Starting point
  - Several publicly available language resources and tools for Polish,
  - But still many were lacking
  - Deeper technological barrier: restricted applications
- CLARIN-PL Pillars:
  - CLARIN-PL Language Technology Centre
    - the Polish node of the CLARIN distributed infrastructure
  - Complete set of the basic Language Resources & Tools for Polish
  - Research applications for H&SS
    - first set for key users and selected H&SS sub-domains.

13 CLARIN-PL Language Technology Centre
- Located at Wrocław University of Technology
- based on a modified D-Space system from Lindat (Czech CLARIN)
- One sign-on, one login (a member of the Pioneer.id Federation)
- Advanced repository system for language resources
- Persistent Identifiers for resources and tools
- Rich CMDI meta-data – CLARIN wide visibility in the central search
- Interface for Federated Content Search
- depositing service for researchers from H&SS
- application for the Data Seal of Approval
- Adherence to all CLARIN specifications about standards and protocols
- Web Services for LRTs:
  - the basic processing chain of Polish
- Prototype system for flexible composition of the natural language processing chains
- support for developers: SOAP & REST interfaces
- Web Applications for LRTs
- Knowledge Sharing: expertise and support for the users

14 CLARIN-PL: Language Resources
1. Polish Morphological Dictionary
2. Polish Speech Corpora
3. Annotated Polish Corpora
4. Bilingual Corpora
5. Polish Historical Corpus
6. Semantic lexicon
   - Wordnet for Polish
   - formal description of lexical meanings
7. Dictionary of Multiword Expressions
8. Bilingual semantic lexicon
9. Lexicon of Proper Names
10. Syntactic-semantic Valency Dictionary
11. Robust syntactic-semantic grammar

15 CLARIN-PL: Language Resources
1. Polish Morphological Dictionary
2. Polish Speech Corpora
3. Annotated Polish Corpora
4. Bilingual Corpora
5. Polish Historical Corpus
6. Semantic lexicon
   - plWordNet 3.0
   - formal description of lexical meanings
7. Dictionary of Multiword Expressions
8. Bilingual semantic lexicon
9. Lexicon of Proper Names
10. Syntactic-semantic Valency Dictionary
11. Robust syntactic-semantic grammar

16 CLARIN-PL: Language Resources
- Starting point – a set of large resources
  - a huge National Corpus of Polish (1 billion tokens)
  - plWordNet 2.1 – a very large wordnet for Polish
  - Korpus Politechniki Wrocławskiej – an open Polish corpus with rich annotation
- Expanded resources
  - plWordNet 3.0 – a huge semantic lexicon of Polish
    - a comprehensive description of the Polish lexico-semantic system (~ lemmas, ~ senses)
    - fully mapped to English Princeton WordNet
    - described formally by mapping to an ontology
  - Dictionary of multiword expressions described syntactically
  - NELexicon 2.0 – a huge lexicon of Polish Proper Names (2.5 million)

17 CLARIN-PL: Language Resources for Polish
- Expanded resources
  - Conversational corpus (following PELCRA and NKJP)
  - A large semantic valency lexicon for Polish predicative lexical units
- Newly built resources
  - Transcribed training-testing Polish speech corpus
  - Bi-lingual corpora:
    - Polish-English, Polish-Bulgarian-Russian, Polish-Lithuanian
  - Polish historical corpus (for the years )
  - Corpora annotated for: meta-data, anaphora, time expressions, spatial expressions, semantic relations and situations

18 plWordNet 2.2 in CLARIN-PL

19 plWordNet 2.2 in CLARIN-PL

20 CLARIN-PL: Language Tools for Polish
- Systems for searching corpora, especially Polish corpora
  - Spokes for conversational and bilingual corpora
  - Poliqarp 2.0 for richly annotated
  - Historical corpora [New]
- Text mining (information extraction)
  - Recognition and classification of Proper Names
  - Recognition of anaphoric links
  - Recognition and classification of time expressions and spatial expressions [New]
  - Situation recognition [New]
  - Extraction of multiword expressions (collocations)
- A generic set of morpho-syntactic tools for Polish that can be adapted to a domain specified by the user [New]

21 CLARIN-PL: Language Tools for Polish
- Word Sense Disambiguation based on plWordNet
- Shallow semantic parser [New]
- Deep syntactic-semantic parser [New]
- Tools for the extraction of the semantic-pragmatic information from documents and collections of documents, e.g.
  - keywords [New],
  - semantic relations between text fragments
  - and text summaries

22 Basic Language Tools for Polish
1. Segmentation into tokens and sentences
2. Morphological analysis
3. Morphological guessing of unknown words (both without context and context sensitive)
4. Morpho-syntactic tagging
5. Word Sense Disambiguation
6. Chunker and shallow syntactic parser
7. Named Entity Recognition and disambiguation
8. Co-reference and anaphora resolution
9. Temporal expression recognition
10. Semantic relation recognition
11. Event recognition
12. Shallow semantic parser
13. Deep syntactic parser with disambiguated output: dependency and constituent
14. Deep semantic parser

25 CLARIN-PL: Processing Chain for Polish

26 CLARIN-PL: Recognition and classification of Proper Names

27 Bi-directional - Top-down Part: First Applications
- Approaching users
  - already active, interested, working on large textual and speech resources, …
  - covering a maximal variety of research areas, e.g. linguistics, literary studies, psychology, political studies and sociology
  - matching the available language tools for Polish
  - the first set of several prototype applications illustrating the possibilities and facilitating identification of needs
- First applications
  - Spokes – searching corpora of conversational data
  - A system for collecting Polish text corpora from the Web
  - An open textometric and stylometric system focused on Polish
  - Semantic text classification for sociology
  - Literary Map

28 Spokes (University of Łódź)

29 System for Collecting Polish Text Corpora from the Web
- Requests from the users revealed gaps in the available technology
  - existing corpus building systems were too sensitive to text encoding errors found on the web
  - not designed for informal corpora like blogs
- A system for collecting Polish text corpora from the Web had to be constructed:
  - based on tools from the Masaryk University in Brno
  - detects texts containing a larger number of errors (by morphological analysis)
  - supports semi-automated extraction of texts from blogs, posts on forums, etc.
  - integrated with tools for processing
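One plausible reading of "detecting texts with a larger number of errors by morphological analysis" is to measure the share of tokens that a morphological analyser does not recognise. The sketch below assumes that reading. The `KNOWN_FORMS` set is a toy stand-in for a real analyser, and the 0.3 threshold is an arbitrary illustrative value; neither comes from the CLARIN-PL system.

```python
import re

# Toy stand-in for a morphological analyser: a set of known word forms.
# A real system would query an analyser for Polish instead (assumption).
KNOWN_FORMS = {"ala", "ma", "kota", "kot", "pije", "mleko"}

def unknown_ratio(text, known=KNOWN_FORMS):
    # Share of tokens that the "analyser" does not recognise.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in known)
    return unknown / len(tokens)

def is_noisy(text, threshold=0.3):
    # Flag a text when too many of its tokens are unrecognised.
    return unknown_ratio(text) > threshold

clean = "Ala ma kota. Kot pije mleko."
noisy = "Ala m@ k0ta. Kxt pjie mleko."
print(unknown_ratio(clean), is_noisy(clean))
print(unknown_ratio(noisy), is_noisy(noisy))
```

A corpus builder could use such a score either to discard badly encoded pages or, for informal sources like blogs, to route them to a more tolerant processing path.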

30 Open Textometric and Stylometric System
- System designed for characteristic features of Polish
  - like rich inflection, weakly constrained word order
- Based on several existing components including Stylo (Eder & Rybicki)
- Enabling the use of features defined on any level of the linguistic structure:
  - from the level of word forms
  - up to the level of the semantic-pragmatic structures.
- Available as a Web Application and a Web Service
- Stylometric techniques appear to be applicable in many tasks of H&SS
  - sociology (features that are characteristic of different subgroups), political studies (similarities and differences between political parties), literary studies …
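At the lowest level mentioned above, word forms, a stylometric comparison can be sketched as follows: each text is described by the relative frequencies of a few chosen words, and texts are compared by the distance between these profiles. This is a deliberately simplified illustration, not the method used by Stylo or by the CLARIN-PL system. The feature list, the sample texts and the choice of Manhattan distance are all assumptions made for the example.

```python
import re
from collections import Counter

def rel_freqs(text, features):
    # Relative frequency of each feature word in the text.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[f] / total for f in features]

def manhattan(u, v):
    # Sum of absolute differences between two frequency profiles.
    return sum(abs(a - b) for a, b in zip(u, v))

# Hypothetical feature set: three frequent Polish function words.
features = ["i", "w", "nie"]

a = rel_freqs("nie wiem i nie chce", features)
b = rel_freqs("w domu i w lesie", features)
c = rel_freqs("nie i nie w domu", features)

# Text c uses the feature words more like text a than text b does.
print(manhattan(a, c), manhattan(a, b))
```

Features from higher levels of linguistic structure, such as grammatical classes or semantic categories, would fit the same scheme: only the function producing the profile changes, not the comparison.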

31 Semantic Text Classification for Sociology
- Users: Collegium Civitas, Warsaw
- Goal
  - Support for large scale analysis of the source materials
  - Automatically annotate documents and text fragments with pre-defined semantic categories
  - Definition of categories by examples
  - Automated semantic grouping of documents and text fragments
- Support for
  - Corpus building
  - Manual annotation of the learning sub-corpus
  - Automated annotation process
  - Statistical analysis of the results
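"Definition of categories by examples" can be sketched as a small example-based classifier: each category is represented by the words of its manually annotated examples, and a new fragment receives the category whose profile it most resembles. The code below is a minimal bag-of-words illustration with cosine similarity. The categories, the English example sentences and the function names are invented for this sketch, and the actual system presumably relies on much richer features.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    # Word counts of a text, ignoring case.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(examples):
    # examples: {category: [texts]} -> one summed word profile per category.
    profiles = {}
    for category, texts in examples.items():
        profile = Counter()
        for t in texts:
            profile.update(bag_of_words(t))
        profiles[category] = profile
    return profiles

def classify(text, profiles):
    # Pick the category whose profile is most similar to the text.
    bow = bag_of_words(text)
    return max(profiles, key=lambda c: cosine(bow, profiles[c]))

# Hypothetical manually annotated learning sub-corpus.
examples = {
    "work": ["long hours at the office", "the job pays a low salary"],
    "family": ["my mother and my sister", "children visit their parents"],
}
profiles = train(examples)
label = classify("a new job at the office", profiles)
print(label)
```

The workflow mirrors the support steps listed on the slide: a small sub-corpus is annotated by hand, profiles are learned from it, and the remaining material is annotated automatically before statistical analysis.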

32 GeTClasS – Generalised Text Classification for Sociology

33 Literary Map
- Users: Digital Humanities Centre of The Institute of Literary Research (Polish Academy of Sciences)
- Goal
  - Support for using maps in literary criticism
  - Tool for the identification of all geographical names in a literary text (or a corpus) and mapping them onto a geographical map
- Tasks
  1. Identification and semantic classification of the referring language expressions
  2. Disambiguation of the referents
  3. Mapping the referents onto a map (geo-location)
  4. Recognition of the semantic relations and statistical analysis
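The tasks above can be sketched end to end in a few lines, under heavy simplifications: identification is reduced to exact lookup in a tiny gazetteer, disambiguation is skipped, and the statistics are plain mention counts. The gazetteer, its approximate coordinates and the function names are assumptions made for this illustration, not parts of the Literary Map tool.

```python
import re
from collections import Counter

# Toy gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "Warszawa": (52.23, 21.01),
    "Kraków": (50.06, 19.94),
    "Wrocław": (51.11, 17.03),
}

def find_places(text, gazetteer=GAZETTEER):
    # Task 1 (simplified): keep tokens that match a gazetteer entry exactly.
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t in gazetteer]

def map_points(text, gazetteer=GAZETTEER):
    # Tasks 3 and 4 (simplified): attach coordinates and count mentions.
    counts = Counter(find_places(text, gazetteer))
    return {name: (gazetteer[name], n) for name, n in counts.items()}

text = "Z Warszawy? Nie, Warszawa jest daleko. Kraków i znowu Kraków."
points = map_points(text)
print(points)
```

The inflected form "Warszawy" is missed by the exact lookup, which shows why a tool of this kind for Polish needs morphological analysis and proper name recognition rather than simple string matching.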

34 Literary Map

35 Conclusions
- Application of LT to research in Humanities & Social Sciences seems to be much more challenging than in commercial systems!
- LT for Polish has reached a stage at which valuable support can be provided for research applications
- The bi-directional approach combines
  - development of the basic, universal set of language tools and resources
  - with inspirations from the research applications

36 Thank you very much for your attention! Supported by the Polish Ministry of Science and Higher Education [CLARIN-PL]

37 Bi-directional: bottom-up part
- LRTs and LRT chains can be useful …
  - if the required tools and resources exist,
  - and they are robust!
- What is the minimal set of LRTs?
- What kind of LRTs can be called robust?
  - automated applications in H&SS seem to require high quality of language tools and mostly large coverage of resources
- BLARK – The Basic Language Resource Kit
  - “the minimal set of language resources that is necessary to do any precompetitive research and education at all” (Krauwer, 2003) and also basic processing chains
  - possible reference point to compare LRTs for different languages
PALC 2014 Łódź CLARIN-PL

