Telecom and Informatics 1 Linguateca activities Diana Santos Luís Costa www.linguateca.pt.

Slide 1: Linguateca activities
  Diana Santos, Luís Costa

Slide 2: Purpose of the talk
  Introduce Linguateca
    as a SINTEF project
    as an international organization
  Show work done
  Propose contact points with SINTEF and 4030

Slide 3: History at SINTEF
  May 1998: The Computational Processing of Portuguese project is launched (as a two-year special project)
  May 2000: The Computational Processing of Portuguese project is extended as an ordinary SINTEF project for three more years, whose goals also include the launching of a larger (virtual) organization
  February 2002: The name is changed to Linguateca and the whole project redesigned so that it should last until 2006

Slide 4: What is Linguateca
  Improve Portuguese processing
    Dissemination
    Resource creation
    Evaluation
  A virtual organization with four nodes
    Oslo, Braga, Lisbon, Oporto, …
    5 full-time, 3 part-time workers
    Collaboration partners in more locations: Odense, Lisbon, São Carlos, Porto Alegre, ...
  A follow-up of the Computational Processing of Portuguese project, created in 1998 by the then Ministry of Science and Technology

Slide 5: The Linguateca context
  Customers: the Portuguese authorities
  Primary users: the NLP, HLT, LE, CL community dealing with the Portuguese language
  Other users: researchers and teachers of Portuguese; IR people
  Goal: improve the work and the results of the product developers and language researchers, so that the whole Portuguese-speaking community could later on benefit
  NLP: natural language processing; HLT: human language technology; LE: language engineering; CL: computational linguistics

Slide 6: Assumptions of Linguateca
  First things first
    Find out what are the problems and bottlenecks of Portuguese processing
  International entities or bodies cannot solve our problems
    In any case not better than us
  Resource building is time consuming, and "market driven"
  Language (and not region, or nation) should be the unit for natural language processing
    So Brazil and Portugal should cooperate closely
  Public resources are a must for scientific progress
    There are enough barriers already

Slide 7: Linguateca activities
  Dissemination of information and resources on Portuguese processing
    Web catalogue with a dedicated search engine
    Forum and a contact service
  Creation of publicly available language resources
    Making the available resources more available: Web services
    Creating new ones: both Web and physical access
  Promotion of joint evaluation using the evaluation contest or evaluation campaign model
    Web site and discussion list [avalia]
    Organization of a workshop (June 2002) and a conference (AVALON'2003)
    Organization of the first evaluation contest for Portuguese: Morfolimpíadas

Slide 8: Dissemination: some numbers
  Size of site
    1,047 Web pages
    1,392 resource links
    643 own documentation
    723 publication entries
  Size of audience (1st May 2003)
    Number of visits: 926,887
    Number of queries to our on-line services: 50,954
  Size of recognition
    685 Web pointers to us
  Published papers or reports, and other presentations
    24 (+2) in Portuguese; 15 (+4) in English

Slide 9: Creation of language resources
  Copyright clearing
  Creation
  Programming resource-specific tools
  Testing
  Version handling
  Producing information and documentation
  Evaluation
    Does it meet the goals?
    Is it being correctly used?
  Giving support

Slide 10: Resource creation and dissemination
  AC/DC: querying a variety of (annotated) corpora developed outside (rights obtained), or in-house
  COMPARA: querying English-Portuguese parallel texts
  CETEMPúblico and CETENFolha: large amounts of newspaper language, divided into extracts and scrambled
  Floresta Sintá(c)tica: manually revised, syntactically analysed text
  Web services
    AC/DC service collection
    DISPARA
    Águia
    AneLL (Lisbon): morphosyntactic tagging of private texts
    GC (Oporto): comparable corpora environment (English-Portuguese)

Slide 11: COMPARA
  On-going collaboration with Ana Frankenberg-Garcia
  Text team (Lisbon) and engineering team (Oslo): communication; clearly defined workflow, with at least six steps for each text pair
  A general Web system for parallel corpora, DISPARA, evolved
  Currently 29 text pairs; 36 in the processing queue
  12,500 queries since May 2000 from all over the world

Slide 12: Floresta Sintá(c)tica
  The first treebank for Portuguese
  Collaboration with Eckhard Bick and the VISL project (Odense)
  Main activities: October 2000 to December 2001; a few things added afterwards
  Workflow: a complex process with several revision steps and three different automatic modules (a parser, a tree transducer and a CQP converter)
  Tools: Pica-Pau, a tree editor; Águia, a Web interface
  Resource: 1,500 trees (ca. 35,000 words) in phrase structure format and in CG dependency format, both Web searchable and downloadable
  Sub-projects: inter-annotator test; sentence separation evaluation; streamlined revision using Águia; use as gold standard in Morfolimpíadas
  Status: waiting for renovation; discussion in Avalon’2003

Slide 13: Evaluation
  The most challenging task
  History
    Tutorial on evaluation of NLP systems in Atibaia (Brazil), 2000
    Some papers on resource and problem evaluation (2001, 2002)
    Movement with a Web site and a dedicated mailing list in 2002
    Preparatory encounter dedicated to “joint evaluation”, June 2002
  Morfolimpíadas
    Trial: September to March 2003
    Contest: May-June 2003
  Avalon’2003
    Named entity recognition
    Portuguese IR
    Machine translation and alignment
    Some syntax evaluation

Slide 14: Morfolimpíadas: cooperatively evaluating morphological analysers for Portuguese
  Evaluation contest paradigm
    Importance for science and for community building
    Shared task, consensual result, objective measures, knowledgeable organization
  Why morphology
    Mildly inflected language (70 verb forms)
    Simple and well defined (?) problem, no infinite set of members
    Traditionally the first module in a set of NLP tools
    The task for which there was greater interest
  Goals
    Exemplify the paradigm with a relatively short schedule
    Assess the state of the art in morphology (also looking at tokenization)
    Measure the problem

Slide 15: Morfolimpíadas: overview
  Seven participating systems, out of [...] out there
    3 from Portugal, 2 from Brazil, 2 international
    5 “real” morphological analysers, 1 spellchecker and 1 stemmer
  Organization: Linguateca Oslo (+ Oporto + contractors)
  Setup
    Registration, providing some data
    Participants ran their systems over 80,000 running text words, in three different formats
  Processing: uts.SYSTEM.def.preze.ze.hi.gr.un.le

Slide 16: Zebras: transform into an internal format
  Every system (with a wildly different output format) is turned into “zebraic” format
  Every zebra output is apparently similar but intriguingly different
  Zebras may still require hienas to deal with complex issues (clitics and contractions)
  Zebra programming requires a full understanding of the high and low level details of the systems (underlying linguistic conception, tokenization behaviour)
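
The normalization idea on this slide can be illustrated with a minimal Python sketch. The slides do not show the actual "zebraic" format or any participant's output format, so both input formats below, the `Analysis` record and the function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One analysis of one form in a common internal format (hypothetical)."""
    form: str
    lemma: str
    pos: str


def parse_system_a(line):
    # Hypothetical format A: "form<TAB>lemma+pos|lemma+pos"
    form, rest = line.rstrip("\n").split("\t")
    analyses = []
    for alternative in rest.split("|"):
        lemma, pos = alternative.rsplit("+", 1)
        analyses.append(Analysis(form, lemma, pos.upper()))
    return analyses


def parse_system_b(line):
    # Hypothetical format B: "form [lemma] pos ; [lemma] pos"
    form, rest = line.strip().split(" ", 1)
    analyses = []
    for alternative in rest.split(";"):
        lemma, pos = alternative.strip().split("] ")
        analyses.append(Analysis(form, lemma.lstrip("["), pos.upper()))
    return analyses
```

One converter per system keeps the system-specific knowledge (tag names, separators, tokenization quirks) in one place, so that every later step only sees `Analysis` records.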

Slide 17: Further processing
  Grammatical analyses are turned into one analysis named GRAM
  Some sets of always ambiguous interpretations in the verbal paradigm are turned into one
    first and third person singular of some tenses
    personal and impersonal infinitive
    third person plural of Perfeito and Mais que perfeito
  Numbers are dealt with in a simple form
  Punctuation marks and proper names are handled to yield a hopefully more similar output
  Tokenization problems are dealt with to some extent
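
A minimal sketch of how always-ambiguous verbal readings might be collapsed into a single one. The tag names are hypothetical (the slides do not give the tag inventory), and mapping each member of a set to the merged tag, whether or not the system listed all members, is an assumption rather than the documented procedure.

```python
# Hypothetical tags; each systematically ambiguous set maps to one merged tag.
CANONICAL = {
    "IMPF_1S": "IMPF_1/3S",      # e.g. "cantava": 1st and 3rd person singular
    "IMPF_3S": "IMPF_1/3S",
    "INF_PERS": "INF",           # e.g. "cantar": personal and impersonal infinitive
    "INF_IMPERS": "INF",
    "PERF_3P": "PERF/MQP_3P",    # e.g. "cantaram": Perfeito and Mais que perfeito
    "MQP_3P": "PERF/MQP_3P",
}


def collapse(tags):
    """Replace each tag by its merged tag and drop the resulting duplicates."""
    collapsed = []
    for tag in tags:
        merged = CANONICAL.get(tag, tag)
        if merged not in collapsed:
            collapsed.append(merged)
    return collapsed
```

With this choice, a system that lists both readings and a system that lists only one of them end up with the same output, so neither is penalized for a distinction that the form itself never makes.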

Slide 18: Leoas: tearing files to pieces
  Distribution by text
  Distribution by variant
  Distribution by genre
  Distribution by medium
  Rationale: Is system performance correlated with type of text? Variant?
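
The splitting step can be sketched as a simple grouping by metadata. The token records and attribute names below ("variant", "genre") are assumptions for illustration; the slides do not describe how the metadata was actually stored.

```python
from collections import defaultdict


def split_by(tokens, attribute):
    """Group token forms by one metadata attribute of their source text."""
    groups = defaultdict(list)
    for token in tokens:
        groups[token[attribute]].append(token["form"])
    return dict(groups)
```

Running the same measures on each group then shows whether a system behaves differently on, say, Brazilian and European Portuguese texts.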

Slide 19: System’s signature
  No. of tokens
  No. of analyses
  Distribution of analyses per form
  Distribution of PoS ambiguity
  Distribution of lemma ambiguity
  No. of verbs
  No. of tokens which can be analysed as verbs
  No. of verb analyses
  No. of guessed analyses
  No. of derived analyses
  ...
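
Several of these counts can be computed directly from a normalized output. The sketch below assumes a hypothetical input shape, a list of (form, [(lemma, pos), ...]) pairs with one entry per token, and a hypothetical verb tag "V"; guessed and derived analyses are left out because the slides do not say how they were marked.

```python
from collections import Counter


def signature(output):
    """Compute a few signature figures for one system's normalized output."""
    return {
        "tokens": len(output),
        "analyses": sum(len(analyses) for _, analyses in output),
        # how many tokens received 1, 2, 3, ... analyses
        "analyses_per_form": dict(Counter(len(a) for _, a in output)),
        # how many tokens received 1, 2, 3, ... distinct PoS tags
        "pos_ambiguity": dict(Counter(len({pos for _, pos in a}) for _, a in output)),
        # how many tokens received 1, 2, 3, ... distinct lemmas
        "lemma_ambiguity": dict(Counter(len({lemma for lemma, _ in a}) for _, a in output)),
        "verb_tokens": sum(any(pos == "V" for _, pos in a) for _, a in output),
        "verb_analyses": sum(pos == "V" for _, a in output for _, pos in a),
    }
```

Because these figures need no reference answers, they can be compared across systems before any gold data exists.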

Slide 20: System comparison
  Qualitative: different kinds of information
  Using the raw output
    ranking systems in terms of tokenization, verbishness, etc.
    indexed per text genre, variant, etc.
  Using a golden list: a set of manually agreed upon “right answers” to input forms
    (Extremely) time consuming task
    Large room for disagreement
    Several decision sources (dictionaries, Web, own intuition)
    Large gray zones (foreign terms, colloquial language, specialized words, PoS classification vs. the flexibility of natural language, common faults, tokenization)
  Using sets of cleverly chosen forms from the automatic output conflation
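
A minimal sketch of scoring one system against a golden list. The slides do not name the measures used in the contest; precision and recall over (lemma, PoS) pairs are an assumption here, as are the data shapes (dictionaries mapping each form to a set of analyses).

```python
def score(system, gold):
    """Return (precision, recall) of a system's analyses over the gold forms."""
    true_pos = false_pos = false_neg = 0
    for form, right_answers in gold.items():
        proposed = system.get(form, set())
        true_pos += len(proposed & right_answers)
        false_pos += len(proposed - right_answers)
        false_neg += len(right_answers - proposed)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall
```

Reporting the two figures separately distinguishes a system that overgenerates analyses (low precision) from one that misses readings (low recall).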

Slide 21: Domadores: still more is required
  Partially agreeing pieces of information (systems more informative than others)
    ADJ of kind t3: Noun and Adj
    VPP vs ADJ: VPP and ADJ; only VPP
    Example: amada ADJ amado vs. amada VPP amar
  [Diagram labels: Adj t3; Adj related N; N related Adj; Adj t3; Adj N]
  Reducing information
  Adding information

Slide 22: Challenge(s) with Morfolimpíadas
  Produce informative and intuitively satisfying measures
  Satisfy participants while at the same time showing problems and remaining work
  Produce quantitative and qualitative data that can be used beyond the actual contest
  Make it interesting enough to have further contests in the future, with more participants (e.g. from industry) and maybe several tracks
  Reuse the experience gained in the organization of other evaluation contests?

Slide 23: Concluding remarks
  Research as a goal? No, this is a political, facilitating project
    Research as a precondition
    Research as a side effect
    (Development and maintenance) and (observation and contact) are the main keywords
  Evaluation of activity
    Remarkable increase in number of public resources
    Large maintained site with a considerable number of visits
    Occurrence of the first evaluation contest for Portuguese
  Problems
    Too few people for too large an endeavour