1(18) GATE: A Unicode-based Infrastructure Supporting Multilingual Information Extraction Kalina Bontcheva, Diana Maynard, Valentin Tablan, Hamish Cunningham.

Presentation transcript:

1(18) GATE: A Unicode-based Infrastructure Supporting Multilingual Information Extraction
Kalina Bontcheva, Diana Maynard, Valentin Tablan, Hamish Cunningham
Department of Computer Science, University of Sheffield
Structure of the talk:
– A brief introduction to GATE
– Multilingual infrastructure in GATE
– Simple multilingual IE components

2(18) GATE is...
An architecture: a macro-level organisational picture for LE software systems.
A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
A development environment: for language engineers, computational linguists et al., a graphical development environment.
GATE comes with...
– Some free components...
– ...and wrappers for other people's components
– Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
Free software (LGPL). Download at

3(18) Architectural principles
Non-prescriptive, theory neutral (strength and weakness)
Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka...)
(Almost) everything is a component, and component sets are user-extendable
(Almost) all operations are available both from API and GUI

4(18) Component-based development
CREOLE – Collection of REusable Objects for Language Engineering:
– Java Beans: an OO way of chunking software
– GATE components: modified Java Beans with XML configuration
– The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
Three types: Language Resources, Processing Resources, Visual Resources
Why bother? Allows the system to load arbitrary language processing components

5(18) Language Resources (LRs)
LRs are documents, ontologies, corpora, lexicons, ...
LRs can be associated with DataStores (Oracle, PostgreSQL, XML, Java Serialisation)
Documents / corpora:
– Diverse document formats: text, HTML, XML, RTF, SGML
– Optional format-preserving markup analyse / save
Standoff annotation model (start, end, type, features), derivative of TIPSTER, compatible with ATLAS and XCES
Coping with diverse character encodings:
– New internationalised versions of the JVM support more than 100 different encodings.
– Other encodings: developing a system for user entry of mapping tables (removes programming from the process)
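The standoff annotation model on this slide can be illustrated with a minimal sketch. It is not GATE's API: the `Annotation` and `Document` classes and their method names are hypothetical, and only the (start, end, type, features) tuple comes from the slide. The point is that the text is stored once and annotations refer to it by character offsets.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One standoff annotation: offsets into the text, a type, and features."""
    start: int
    end: int
    type: str
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    """Hypothetical document: the text is never modified by annotating it."""
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, start, end, type_, **features):
        """Attach an annotation over text[start:end] and return it."""
        ann = Annotation(start, end, type_, dict(features))
        self.annotations.append(ann)
        return ann

    def covered_text(self, ann):
        """Recover the annotated string from the offsets."""
        return self.text[ann.start:ann.end]


# Python strings are indexed by Unicode code point, so offsets work the
# same way for Cyrillic and Latin text.
doc = Document("Шефилд is a city in England.")
city = doc.annotate(0, 6, "Location", kind="city")
country = doc.annotate(20, 27, "Location", kind="country")
```

Because the markup lives outside the text, several overlapping annotation layers can coexist without changing the original document.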

6(18) Processing Resources (PRs)
Algorithmic components known as PRs: beans with execute methods.
All PRs can handle Unicode data by default.
Clear distinction between code and data (simple repurposing)
freebies with GATE
Controllers: execute a set of PRs
– SerialController: sequential run of an arbitrary PR set
– SerialAnalyserController: analyser PRs over a corpus
– Conditional controllers: execution depends on features
– Parallel controller?
PRs + Controller = Applications
Application parameterisation state can be saved and restored, and used for embedding / batching
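The controller idea (PRs + Controller = Application) can be sketched in a few lines. This is an illustration under assumptions, not GATE code: `ProcessingResource`, `SerialPipeline`, `ConditionalPipeline` and the two toy resources are hypothetical names, and documents are plain dictionaries.

```python
class ProcessingResource:
    """Hypothetical base class: a component with an execute method."""

    def execute(self, doc):
        raise NotImplementedError


class Lowercaser(ProcessingResource):
    """Toy resource that lowercases the document text."""

    def execute(self, doc):
        doc["text"] = doc["text"].lower()


class WordCounter(ProcessingResource):
    """Toy resource that stores a whitespace word count as a feature."""

    def execute(self, doc):
        doc["features"]["words"] = len(doc["text"].split())


class SerialPipeline:
    """Runs every resource, in order, over every document of a corpus."""

    def __init__(self, resources):
        self.resources = list(resources)

    def run(self, corpus):
        for doc in corpus:
            for pr in self.resources:
                pr.execute(doc)


class ConditionalPipeline(SerialPipeline):
    """Each resource is paired with a predicate over the document features."""

    def run(self, corpus):
        for doc in corpus:
            for pr, should_run in self.resources:
                if should_run(doc["features"]):
                    pr.execute(doc)


corpus = [
    {"text": "GATE Handles Unicode", "features": {"lang": "en"}},
    {"text": "Привет Мир", "features": {"lang": "ru"}},
]
app = ConditionalPipeline([
    (Lowercaser(), lambda feats: feats["lang"] == "en"),  # English only
    (WordCounter(), lambda feats: True),                  # always runs
])
app.run(corpus)
```

Keeping the resources free of control logic is what lets the same components be recombined into different applications.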

7(18) Visual Resources (VRs)

8(18) VRs (2): Coreference

9(18) VRs (3): Syntax

10(18) Displaying Multilingual Data
GATE uses the standard (and imperfect) Java rendering engine for displaying text.

11(18) GATE Unicode Kit (GUK)
Complements Java’s facilities
Support for defining Input Methods (IMs)
Currently 30 IMs for 17 languages
Pluggable in other applications (e.g. JEdit, EUDICO)
Can use a virtual keyboard or standard layouts over QWERTY
IMs defined in plain text files
GUK comes with a standalone Unicode editor
Editing Multilingual Data
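As a rough illustration of an input method defined in a plain text file, the sketch below maps QWERTY key sequences to Unicode characters with longest-match lookup. The `keys -> output` table format and the function names are invented for this example; they are not the real GUK file format.

```python
# Hypothetical plain-text table: one "keys -> output" pair per line.
# This is NOT the real GUK file format, only an illustration.
TABLE = """
sh -> ш
s -> с
a -> а
"""


def load_table(text):
    """Parse the table into a dict from key sequence to output string."""
    mapping = {}
    for line in text.strip().splitlines():
        keys, out = (part.strip() for part in line.split("->"))
        mapping[keys] = out
    return mapping


def transliterate(typed, mapping):
    """Convert typed keys to output text, preferring the longest key match."""
    longest = max(len(keys) for keys in mapping)
    out, i = [], 0
    while i < len(typed):
        for size in range(min(longest, len(typed) - i), 0, -1):
            chunk = typed[i:i + size]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += size
                break
        else:
            # No mapping for this key: pass it through unchanged.
            out.append(typed[i])
            i += 1
    return "".join(out)


mapping = load_table(TABLE)
result = transliterate("sasha", mapping)
```

Longest-match lookup is one simple way to let multi-key sequences such as `sh` coexist with single-key mappings such as `s`.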

12(18) Processing Multilingual Data
All processing, visualisation and editing tools use GUK

13(18) Multilingual IE Components
The ANNIE system: a reusable and easily extendable set of components

14(18) The Unicode Tokeniser
A very portable component for multiple languages: splits text into typed tokens based on an FSM dynamically constructed from rules.
Rules are based on character categories defined by the Unicode standard, e.g.:
UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* > Token;orth=upperInitial;kind=word;
Output is generally localised by a later module (e.g. “don’t” … “do” “n’t”)
23 rules seem able to handle Indo-European languages without changes.
The English tokeniser: Unicode tokeniser + pattern grammar FST
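The rule shown on this slide can be approximated with Python's `unicodedata` module, which exposes the Unicode general category of each character. This is a sketch, not the GATE tokeniser: only the first rule comes from the slide, the other rules and the one-letter category codes are assumptions, and a regular expression over category codes stands in for the dynamically constructed FSM.

```python
import re
import unicodedata

# One-letter codes for the Unicode general categories used below
# (Lu = uppercase letter, Ll = lowercase letter, Pd = dash punctuation,
# Nd = decimal digit, Zs = space separator). Anything else becomes "?".
CODES = {"Lu": "U", "Ll": "L", "Pd": "D", "Nd": "N", "Zs": "S"}

# Each rule is a pattern over category codes plus the token features.
# The first rule mirrors the slide; the rest are illustrative assumptions.
RULES = [
    (re.compile(r"U[LD]*"), {"kind": "word", "orth": "upperInitial"}),
    (re.compile(r"L[LD]*"), {"kind": "word", "orth": "lowercase"}),
    (re.compile(r"N+"), {"kind": "number"}),
    (re.compile(r"S+"), {"kind": "space"}),
]


def category_string(text):
    """Translate text into a same-length string of category codes."""
    return "".join(CODES.get(unicodedata.category(ch), "?") for ch in text)


def tokenise(text):
    """Split text into (string, features) pairs using the first rule that matches."""
    cats = category_string(text)
    tokens, pos = [], 0
    while pos < len(text):
        for pattern, features in RULES:
            match = pattern.match(cats, pos)
            if match:
                end = match.end()
                break
        else:
            # No rule applies: emit a single-character fallback token.
            end, features = pos + 1, {"kind": "other"}
        tokens.append((text[pos:end], features))
        pos = end
    return tokens


tokens = tokenise("Шефилд 2003")
```

Because the rules refer to character categories rather than to specific letters, the same rule set applies unchanged to Cyrillic, Greek or Latin text.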

15(18) POS tagging in new languages
TIDES Surprise Language: Hepple tagger, but with a Cebuano/Hindi lexicon substituted for the English one
Used an empty ruleset since no training data was available
Used default heuristics (e.g. return NNP for capitalised words)
Very experimental, but reasonable results
67% correctness for Hindi and 75% for Cebuano
Adaptation time per language: 2 days
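The lexicon-plus-defaults setup described here can be sketched as follows. This is not the Hepple tagger: the function name, the toy lexicon entries and their tags, and the fallbacks other than NNP for capitalised words are assumptions made for illustration.

```python
def tag_tokens(tokens, lexicon):
    """Tag each token from the lexicon, falling back to default heuristics."""
    tagged = []
    for token in tokens:
        if token in lexicon:
            tag = lexicon[token]
        elif token.lower() in lexicon:
            tag = lexicon[token.lower()]
        elif token[:1].isupper():
            tag = "NNP"   # heuristic from the slide: capitalised word
        elif token.isdigit():
            tag = "CD"    # assumed default for numbers
        else:
            tag = "NN"    # assumed default for everything else
        tagged.append((token, tag))
    return tagged


# Placeholder lexicon with illustrative entries and tags;
# this is not the lexicon data used in the TIDES exercise.
toy_lexicon = {"ang": "DT", "sa": "IN"}
tagged = tag_tokens(["ang", "balay", "sa", "Cebu", "2003"], toy_lexicon)
```

With an empty ruleset, accuracy depends entirely on lexicon coverage and on how well the defaults suit the language.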

16(18) Porting NE grammars
Most English JAPE rules are based on POS tags and gazetteer lookup
Grammars can be reused for languages with similar word order, orthography etc.
No time to make a detailed study of Cebuano, but it is very similar in structure to English
Most of the rules were left as for English, with some adjustments, especially to handle dates
Used both English and Cebuano grammars and gazetteers, because NEs appear in both languages
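The idea of reusing one rule with gazetteers from both languages can be sketched with a regular expression in place of a JAPE grammar. The month lists are short illustrative samples, not the gazetteers used by the authors, and the day-first date order is an assumed example of the kind of adjustment mentioned on the slide.

```python
import re

# Illustrative gazetteer entries only (not the original lists).
MONTHS_EN = {"January", "February", "March"}
MONTHS_CEB = {"Enero", "Pebrero", "Marso"}


def build_date_pattern(*month_lists):
    """Build one date pattern from the union of several month gazetteers."""
    months = sorted(set().union(*month_lists), key=len, reverse=True)
    alternation = "|".join(re.escape(month) for month in months)
    return re.compile(
        rf"\b(?:(?:{alternation}) \d{{1,2}}, \d{{4}}"   # e.g. March 5, 2003
        rf"|\d{{1,2}} (?:{alternation}) \d{{4}})\b"     # assumed day-first order
    )


def find_dates(text, pattern):
    """Return every substring of text that the date pattern matches."""
    return pattern.findall(text)


pattern = build_date_pattern(MONTHS_EN, MONTHS_CEB)
dates = find_dates("Meeting on March 5, 2003 and 7 Marso 2003.", pattern)
```

Only the gazetteer contents change between languages here; the rule itself is shared, which is what makes porting cheap when word order and orthography are similar.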

17(18) TIDES Evaluation Results
Table columns: Entity; Cebuano P, R, F; English Baseline P, R, F
Table rows: Person, Org, Location, Date, Total
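For reference, precision (P), recall (R) and F-measure (F) for named entity output can be computed as in the sketch below. It assumes strict matching, where a predicted entity counts as correct only if its offsets and type both equal a gold entity; the evaluation reported on the slide may have scored partial matches differently. The spans are made-up toy data, not results from the evaluation.

```python
def precision_recall_f1(gold, predicted):
    """Strict-match P, R and balanced F over sets of (start, end, type) tuples."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations, invented for illustration only.
gold = {(0, 6, "Location"), (10, 15, "Person"), (20, 30, "Date"), (40, 45, "Org")}
predicted = {(0, 6, "Location"), (10, 15, "Person"), (20, 28, "Date")}
p, r, f = precision_recall_f1(gold, predicted)
```

In this toy case two of three predictions are correct and two of four gold entities are found, so precision is 2/3, recall is 1/2 and F is 4/7.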

18(18) Conclusion
GATE: a Unicode-based NLP infrastructure, particularly suitable for multilingual adaptation of IE systems
Requires little involvement of native speakers and very little annotated data for a basic job
Future work
– Improving multilingual support, e.g., morphology support, automatic language and encoding identification
– Learning gazetteer lists from annotated corpora