PaNoLa: Parsing Nordic Languages Eckhard Bick



PaNoLa Goals
● 1. Integrate existing and stimulate new Constraint Grammar research in the Nordic countries
● 2. Internet-based grammar teaching, applying the VISL model to different Nordic languages
● 3. Morphologically and syntactically annotated corpus data

Participants
● University of Southern Denmark (Eckhard Bick, Anette Wulff): Danish CG as well as CGs for 6 other languages
● Oslo University (Janne Bondi Johannessen, Kristin Hagen): Bokmål and Nynorsk CGs
● Helsinki University (Fred Karlsson): Finnish and Swedish CGs
● Göteborg University (Torbjörn Lager): µ-TBL system (corpus-trained automatic CG)
● Tartu University (Heli Uibo, Kaili Müürisep): Estonian CG
● Tromsø University (Trond Trosterud): Sami CG
● The Greenlandic Language Secretariat Oqaasileriffik (Per Langgård)
● Iceland University of Education (Jóhanna Karlsdottir)
● University of the Faroe Islands (Zakaris Hansen)

Project framework
● Funding: Nordic Council of Ministers
● Funded project periods:
  PaNoLa: January 2002 – December 2003: da, no, sv, fi
  PaNoLa-addon: 2004: is, fo, smi, kl
  PaNoLa-plus: 2005 (- 2006): is, fo, smi, kl
  planned: PaNoLa-neighbour: 2005/6 (- 2007): lit, lav, ru
● Historical basis and ongoing cooperation
[Timeline figure: PaNoLa, PaNoLa-addon, PaNoLa-plus, PaNoLa-neighbour; language groups da, no, sv, fi / is, fo, smi, kl / lit, lav, ru]

Project framework
● Network aspect: 4 workshops in Denmark, Norway, Iceland and Sweden
  Odense, May 2002
  Ustaoset, October 2002
  Reykjavik, June 2003
  Göteborg, October 2003
  Odense, October 2004
  Fefor, March 2005
  (Tallinn, April 2005)
  planned: Thorshavn, September 2005
● Administration, web server, data integration: VISL/ISK, University of Southern Denmark
● Satellite projects: e.g. Arboretum, GREI, Arborest

Constraint Grammar
● Rule- and lexicon-based robust parsing (Karlsson et al. 1995), methodological paradigm
● Shared conceptual and notational conventions, allowing productive research transfer
● Language-dependent differences: lexicon, rules (inter-Scandinavian comparative payoff?)
● Compiler and rule type differences
● Focus differences: tagging? Parsing? Semantics? Teaching? Corpus annotation? QA? NER? ...
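To make the rule-based disambiguation idea concrete, here is a minimal Python sketch of a single CG-style REMOVE rule. It is a toy under stated assumptions and does not reproduce any of the compilers named in this talk: the cohort representation, the function name `remove_reading`, the tag names and the English example words are all invented for illustration. It roughly corresponds to what CG notation would write as `REMOVE (V) IF (-1C (DET))`.

```python
from typing import Any, Dict, List, Set

# A "cohort" is one word form plus its set of candidate readings (tags).
# This dictionary layout is a hypothetical simplification.
Cohort = Dict[str, Any]


def remove_reading(cohorts: List[Cohort], target: str, left_tag: str) -> List[Cohort]:
    """Discard `target` where the left neighbour is unambiguously `left_tag`.

    The last remaining reading of a cohort is never removed, which is
    what keeps this style of parsing robust: every word keeps an analysis.
    """
    for i in range(1, len(cohorts)):
        readings: Set[str] = cohorts[i]["readings"]
        left: Set[str] = cohorts[i - 1]["readings"]
        if target in readings and len(readings) > 1 and left == {left_tag}:
            readings.discard(target)
    return cohorts


# Toy input: "walk" and "ends" are both noun/verb ambiguous.
sentence = [
    {"word": "the", "readings": {"DET"}},
    {"word": "walk", "readings": {"N", "V"}},
    {"word": "ends", "readings": {"N", "V"}},
]
remove_reading(sentence, target="V", left_tag="DET")
```

After the call, "walk" keeps only its noun reading, while "ends" stays ambiguous because its left neighbour is not a determiner. A real grammar would apply thousands of such rules, which is why the lexicon and rule set are the language-dependent parts.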

Rule formalism and architecture
[Diagram; recoverable labels, in slide order:]
● Compilers: cg1-compiler, cg2-compiler, visl-cg-compiler, cgx-compiler
● Grammars: Swe CG, Fin CG; Oslo-Bergen tagger; DanGram, Sami, other VISL languages; Est CG
● Compiler notes: Lingsoft-compatible; needs more rules than cg2; sets as targets; barrier conditions; “cg2-like” plus substitute operator for correcting hybrid input
● µ-TBL: automatic learning, local context, rule ordering; Swedish or language-independent trained CG
● Annotation levels: PoS, syntax, case roles
● Languages: da, smi, no, est, sv, fi
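The "barrier conditions" mentioned on this slide can be illustrated with a short Python sketch: a context test scans away from the current word for a target tag, but gives up if it meets a barrier tag first. This is only a sketch under stated assumptions. The function name `find_right`, the list-of-tag-sets representation and the tag names are hypothetical, and the choice to test the target before the barrier on each word is this sketch's own decision, not a statement about any particular compiler.

```python
from typing import List, Optional, Set


def find_right(cohorts: List[Set[str]], start: int, target: str,
               barrier: Optional[str]) -> Optional[int]:
    """Scan rightwards from `start` for a cohort carrying `target`.

    Returns the index of the first match, or None if a cohort carrying
    `barrier` is reached first or the sentence ends without a match.
    """
    for j in range(start + 1, len(cohorts)):
        tags = cohorts[j]
        if target in tags:
            return j
        if barrier is not None and barrier in tags:
            return None
    return None


# Toy tag sequence: DET ADJ N V N
tags = [{"DET"}, {"ADJ"}, {"N"}, {"V"}, {"N"}]
first_noun = find_right(tags, 0, "N", barrier="V")   # noun found before any verb
blocked = find_right(tags, 2, "N", barrier="V")      # the verb blocks the scan
unbounded = find_right(tags, 2, "N", barrier=None)   # no barrier: scan continues
```

Without barriers, an unbounded context has to be approximated with several fixed-position rules, which is one plausible reading of the slide's remark that the older formalism needs more rules.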

The Lexical Base
[Diagram; recoverable labels, in slide order:]
● TWOL; core lexicon + morphological analyser
● Systems: Swe CG, Fin CG, Oslo-Bergen tagger, DanGram, µ-TBL, Sami CG, Est CG
● Lexical resources: corpus dependent; valency potential (especially for verbs); semantic sets; NER; full semantic prototype lexicon

Theoretical Framework (Syntax)
[Diagram; recoverable labels:]
● Traditional CG: flat dependency, word-based form and function tags
● Converters and filters: cg2tree (MC) (visl-psg); dependency filter (SH); visl2penn (EB); visl2tiger (LN, EB, ...)
● Formats: treebank format; TIGER format; PENN format
● PSG grammar: Danish, Norwegian
● Editing tools; search interfaces
● Corpora and treebanks: Korpus90/2000, Oslo-Bergen Corpus, Arboretum, Redwood

Treebank data compatibility
[Conversion matrix between annotation formats. Formats: CG, CG-dep, VISL, VISL-dep, TIGER, TIGER-dep, MALT-dep, DTAG-dep. Converters are listed per source format, in slide order:]
● From CG: cg2dep; depsplicator; cg2visl (visl-psg + grammar); depsplicator; cg2visl | visl2tiger.pl; cg2visl | visl2tiger.pl | tiger2dep.pl; cg2dep | visldep2malt; depsplicator
● From CG-dep: visldep2malt
● From VISL: tree2cg; visl2tiger.pl; visl2tiger.pl | tiger2dep.pl; visl2tiger.pl | tiger2dep.pl | tigerdep2malt
● From VISL-dep: (none listed)
● From TIGER: tiger2dep.pl
● From TIGER-dep: tigerdep2malt, (NTN tools), (NTN tools)
● From MALT: (NTN tools)
● From DTAG: (NTN tools)
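The compatibility matrix chains single-step converters into pipelines such as `cg2visl | visl2tiger.pl | tiger2dep.pl`. The Python sketch below shows how such chains could be derived from a table of single steps by breadth-first search. The tool names come from the slide, but assigning each tool to a particular source and target format is an assumption based on one reading of the flattened table, and the `CONVERTERS` table and `find_pipeline` function are hypothetical names introduced here.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Single-step converters: source format -> [(target format, tool)].
# ASSUMPTION: these edges are inferred from the slide, not confirmed.
CONVERTERS: Dict[str, List[Tuple[str, str]]] = {
    "CG": [("CG-dep", "cg2dep"), ("VISL", "cg2visl")],
    "CG-dep": [("MALT-dep", "visldep2malt")],
    "VISL": [("TIGER", "visl2tiger.pl"), ("CG", "tree2cg")],
    "TIGER": [("TIGER-dep", "tiger2dep.pl")],
    "TIGER-dep": [("MALT-dep", "tigerdep2malt")],
}


def find_pipeline(source: str, target: str) -> Optional[List[str]]:
    """Return a shortest list of tools converting `source` to `target`.

    Returns an empty list if the formats are identical and None if no
    chain of known converters connects them.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, tools = queue.popleft()
        if fmt == target:
            return tools
        for next_fmt, tool in CONVERTERS.get(fmt, []):
            if next_fmt not in seen:
                seen.add(next_fmt)
                queue.append((next_fmt, tools + [tool]))
    return None


# Render a chain the way the slide writes it, with shell-style pipes.
print(" | ".join(find_pipeline("CG", "TIGER-dep")))
```

Under these assumed edges, the search reproduces two of the chains written out on the slide: CG to TIGER-dep via three tools, and CG to MALT-dep via `cg2dep | visldep2malt`.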

Accessibility
● Strong focus on making tools and corpora freely accessible on the internet
● Provide notational and complexity filters to bridge differences between different research and teaching traditions
● VISL's open source philosophy for reconciling academic and commercial use: free compilers and corpora, but allowing for the protection (i.e. commercializability) of grammars, lexica and end-user applications

Related applicative CG projects
● CG spell/grammar checking (No, Da): Lingsoft / Microsoft
● Named Entity Recognition (Da, No): Nomen Nescio (Nordic Network)
● Treebanks (Da Arboretum, Norwegian plans): Nordic Treebank Network
● Question Answering systems (Da): Aminova Dialogue Systems
● Teaching (e.g. VISL-GYM, VISL-HHX, GREI)

PaNoLa's other leg: CALL
Integrating and strengthening Nordic languages in the VISL grammar teaching system
● A unified system of grammatical categories and structural analysis for 22 languages (Dienhart 2000 and Bick 2001)
● Color codes and symbolic notation
● Systematic focus on form & function
● Preexisting server and programming infrastructure
● School and university teaching contacts at all levels
● Internet-based games and exercises
● Graded complexity filters

Notational harmonization vs. linguistic differences: the Greenlandic example

QUE:par
CJT:cl
=S:pron Suumuna #'Which/What'
=fA:icl
==Od:g
===D:n naasut #'of the plants'
===H:n qorsuttaat #'their green matter'
==P:v-pcp1 kiilorpassuakkaarlugu #'doing it by the kilo'
=A:g
==H:n nunamut #'the soil'
==D:n uumassuseqanngitsumut #'on the lifeless'
=P:v siaruartilertaraa #'makes it spread'
CJT:cl-
=fA:cl-
==S:n apullu #'and the snow'
CO:conj_lu
-CJT:cl
=-fA:cl
==P:v aanniariaraangat #'as often as it begins to melt'
=P:v siaruaatipallatsittarlugu #'makes it pour forth'
?

KAL22a) Suumuna naasut qorsuttaat kiilorpassuakkaarlugu nunamut uumassuseqanngitsumut siaruartilertaraa apullu aanniariaraangat siaruaatipallatsittarlugu?
(What was it that made kilo after kilo of green plant matter pour forth from the lifeless soil as soon as the weather became warm enough and the last remnants of snow were gone?)

Word-internal analyses:

==H:n nunamut #'on the soil'
===R:n('nuna') nuna-
===D:in('mut',fleksiver) -mut
==D:n uumassuseqanngitsumut
===R:v('uuma') uuma-
===D:in('ssusiq') -ssuse-
===D:iv('qar') -qa-
===D:iv('ngngit') -nngit-
===D:in('Tuq') -su-
===D:in('mut',fleksiver) -mut
==P:v aanniariaraangat
===R:v('aak') aan-
===D:iv('niar') -nia-
===D:iv('riar') -riar-
===D:iv('gaangat',fleksiver) -aangat
=P:v siaruaatipallatsittarlugu
==R:v('siaruar') siarua-
==D:iv('ute') -ati-
==D:iv('pallak') -pallat-
==D:iv('tit') -sit-
==D:iv('Tar') -tar-
==D:iv('lugu',fleksiver) -lugu
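The tree notation in this example encodes depth with leading `=` signs and labels each node as `FUNCTION:form`, optionally followed by the word and a `#` gloss. The Python sketch below reads that layout into a nested structure. It is a minimal illustration, not the VISL project's own tooling: the `Node` class, the `parse_visl` and `leaves` functions and the regular expression are assumptions made here, and the sketch ignores the markers for discontinuous constituents (such as the hyphens in `CJT:cl-` and `-CJT:cl`), treating them as part of the label.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# depth markers, FUNCTION:form, optional word, optional "#gloss"
LINE = re.compile(
    r"^(?P<depth>=*)(?P<function>[^:\s]+):(?P<form>\S+)"
    r"(?:\s+(?P<word>[^#\s]\S*))?(?:\s*#\s*(?P<gloss>.*))?$"
)


@dataclass
class Node:
    function: str
    form: str
    word: Optional[str] = None
    gloss: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def parse_visl(text: str) -> Node:
    """Build a tree from '='-indented lines; unparsable lines are skipped."""
    root = Node("ROOT", "root")
    stack = [(-1, root)]  # (depth, node) pairs along the current branch
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if match is None:
            continue
        depth = len(match["depth"])
        gloss = match["gloss"].strip(" '") if match["gloss"] else None
        node = Node(match["function"], match["form"], match["word"], gloss)
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((depth, node))
    return root


def leaves(node: Node) -> List[str]:
    """Collect the words of the tree in left-to-right order."""
    words = [node.word] if node.word else []
    for child in node.children:
        words.extend(leaves(child))
    return words


sample = """CJT:cl
=S:pron Suumuna #'Which/What'
=fA:icl
==Od:g
===D:n naasut
===H:n qorsuttaat
==P:v-pcp1 kiilorpassuakkaarlugu
=P:v siaruartilertaraa"""
tree = parse_visl(sample)
```

Because the word-internal analyses use the same depth convention one level further down, the same reader handles morpheme-level nodes such as `===R:n('nuna') nuna-` without changes.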

Greenlandic word-internal tree structures

Teaching corpora
● Pedagogically structured
● XML markup for teaching topic and didactic progression
● Finnish and Swedish modelled on Danish and Norwegian example files (comparative possibilities)
● Compatibility with and importability for research treebanks (e.g. Sofie)

Interactive teaching trees

Grammar Games: Labyrinth

Grammar Games: Word Fall

Integrating the CG and CALL legs
● Nordic CG expertise is used to provide live analyses as input for the teaching modules, if necessary by CGI communication between university servers, e.g. Oslo-SDU
● Descriptive harmonization issues (e.g. word class)
● Determine matching complexity (e.g. subclause analysis?)

CG leg evaluation
● CG grammars improve incrementally, so evaluation is less definite than for probabilistic systems, and can change over time.
● Results depend on tag granularity and test genre
● Some numbers:
  -- DanGram: F-score for PoS, 94.9 for function (Bick 2003)
  -- DanGram NER: 5% typing errors, 2% chunking errors
  -- Bokmål CG: 97.2% lexical F-score (Hagen & Johannessen 2003)
  -- Nynorsk CG: 96.2% lexical F-score
  -- SWECG 1.0: recall 99.7% at a precision of 95% (pre-PaNoLa)
  -- µ-TBL CG for Swedish: 98.1% lexical accuracy when allowing for 1.04 tags per word (Lager 1999)
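The figures above mix F-scores with separate recall and precision values, so a small worked example may help when comparing them. The Python sketch below computes the balanced F-score (harmonic mean of precision and recall) for the SWECG 1.0 figures. It assumes that "F-score" on this slide means the balanced F1 measure, which the slide itself does not state, and the function name `f_score` is introduced here for illustration.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall (both 0..1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# SWECG 1.0 figures from the slide: recall 99.7% at a precision of 95%.
swecg_f = f_score(precision=0.95, recall=0.997)
print(f"{swecg_f:.3f}")
```

The high recall with lower precision is typical when a system leaves some words with more than one tag; the µ-TBL figure makes that ambiguity explicit by reporting accuracy at 1.04 tags per word.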

Teaching leg evaluation
● GREI evaluation: improvement of grammatical skills after using VISL tools (104 children in 7th and 8th grade)
● Same-level tests before and after using VISL/GREI, with test and control groups
● Subjective results: all users thought VISL was more fun (games more than trees), and that their grammatical skills had improved
● Objective results: the test group performed 14.5% better than the control group in 7th grade, 7% better in 8th grade, and 12% better at the secondary level
● Differences were positive for both PoS and sentence analysis, but more marked for the latter

Teaching corpora differences across PaNoLa languages
● Preposition frequency: 11% (Bokmål), 11.4% (Danish), 13.4% (Nynorsk), 0.5% (Finnish)
● PoS: “klappe i”, “tage på”, “skrive noget om” are tagged as ADV in Danish, as PRP in Norwegian samples
● Danish infinitive markers ('at') tagged as CONJ in Norwegian
● Subclass solutions: e.g. Da/Fi distinction between adjunct and argument adverbials, not made by No/Se (fA/As/Ao vs. A)
● Tradition interference: Swedish analysis had zero constituents, because it was annotated according to the English VISL model
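A comparison such as the preposition frequencies above can be computed directly from a tagged corpus. The Python sketch below shows one way to do it, assuming the percentages are shares of all tokens (the slide does not say how they were counted). The `tag_share` function and the four-word Danish sample with its tags are hypothetical illustrations, not data from the PaNoLa corpora.

```python
from collections import Counter
from typing import Iterable, Tuple


def tag_share(tokens: Iterable[Tuple[str, str]], tag: str) -> float:
    """Percentage of tokens carrying `tag`; 0.0 for an empty corpus."""
    counts = Counter(pos for _, pos in tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * counts[tag] / total


# Hypothetical (word, PoS) sample; tags follow the slide's PRP label.
sample = [("Jeg", "PRON"), ("bor", "V"), ("i", "PRP"), ("Odense", "PROP")]
prp_percent = tag_share(sample, "PRP")
```

Such counts are only comparable across languages if the tagging conventions match, which is exactly the issue raised by the ADV versus PRP and infinitive-marker bullets.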

Outlook
● Continued development of Nordic Constraint Grammars and CG applications
● Ongoing CALL service for schools
● Presence of the CG paradigm in other Nordic networks
● “Post-PaNoLa”: VISL adaptations for other minor Nordic languages (Faroese, Icelandic, Sami, Estonian...)