1(18) GATE: A Unicode-based Infrastructure Supporting Multilingual Information Extraction Kalina Bontcheva, Diana Maynard, Valentin Tablan, Hamish Cunningham.


1 1(18) GATE: A Unicode-based Infrastructure Supporting Multilingual Information Extraction Kalina Bontcheva, Diana Maynard, Valentin Tablan, Hamish Cunningham Department of Computer Science, University of Sheffield Structure of the talk: A brief introduction to GATE Multilingual infrastructure in GATE Simple multilingual IE components

2 2(18) GATE is... An architecture: a macro-level organisational picture for LE software systems. A framework: for programmers, GATE is an object-oriented class library that implements the architecture. A development environment: for language engineers, computational linguists et al, a graphical development environment. GATE comes with... Some free components... and wrappers for other people's components. Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc. Free software (LGPL). Download at

3 3(18) Architectural principles Non-prescriptive, theory neutral (strength and weakness) Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka...) (Almost) everything is a component, and component sets are user-extendable (Almost) all operations are available both from API and GUI

4 4(18) Component-based development CREOLE – Collection of REusable Objects for Language Engineering: Java Beans: an OO way of chunking software GATE components: modified Java Beans with XML configuration The minimal component = 10 lines of Java, 10 lines of XML, 1 URL Three types: Language Resources, Processing Resources, Visual Resources Why bother? Allows the system to load arbitrary language processing components

5 5(18) Language Resources (LRs) LRs are documents, ontologies, corpora, lexicons, …… LRs can be associated with DataStores (Oracle, PostgreSQL, XML, Java Serialisation) Documents / corpora: –Diverse document formats: text, html, XML, , RTF, SGML –Optional format-preserving markup analyse / save Standoff annotation model (start, end, type, features), derivative of TIPSTER, compatible with ATLAS and XCES Coping with diverse character encodings: New internationalised versions of JVM support >100 different encodings. Other encodings: developing system for user-entry of mapping tables (remove programming from the process)
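The standoff annotation model on this slide (start, end, type, features) can be illustrated with a minimal Python sketch. This is not GATE's Java API: the `Annotation` class, the `load_text` and `covered_text` helpers, and the feature names are hypothetical, chosen only to show that annotations point into the text by character offset and leave the text itself untouched.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A standoff annotation: it refers to the text by offsets only."""
    start: int                      # offset of the first character of the span
    end: int                        # offset just past the last character
    type: str                       # annotation type, e.g. "Person"
    features: dict = field(default_factory=dict)


def load_text(raw: bytes, encoding: str) -> str:
    """Decode bytes in a known encoding so that offsets count Unicode characters."""
    return raw.decode(encoding)


def covered_text(text: str, ann: Annotation) -> str:
    """Return the stretch of text that an annotation covers."""
    return text[ann.start:ann.end]


text = "Hamish Cunningham works in Sheffield."
annotations = [
    Annotation(0, 17, "Person", {"rule": "illustrative"}),
    Annotation(27, 36, "Location", {"locType": "city"}),
]

for ann in annotations:
    print(ann.type, repr(covered_text(text, ann)), ann.features)
```

Because the markup lives outside the text, several overlapping annotation layers can describe the same document without interfering with one another.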

6 6(18) Processing Resources (PRs) Algorithmic components known as PRs: beans with execute methods. All PRs can handle Unicode data by default. Clear distinction between code and data (simple repurposing). Freebies with GATE. Controllers: execute a set of PRs –SerialController: sequential run of an arbitrary PR set –SerialAnalyserController: runs analyser PRs over a corpus –Conditional controllers: execution depends on features –Parallel controller? PRs + Controller = Applications. Application parameterisation state can be saved and restored, and used for embedding / batching
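The idea that "PRs + Controller = Applications" can be sketched in a few lines of Python. Everything here is a hypothetical stand-in (`Document`, `WhitespaceTokeniser`, `TokenCounter`, `SerialPipeline`), not GATE's own classes; the sketch assumes only what the slide states, namely that each PR exposes an execute method and that a serial controller runs a list of PRs in order over each document of a corpus.

```python
class Document:
    """Hypothetical document: text plus features and standoff annotations."""
    def __init__(self, text):
        self.text = text
        self.features = {}
        self.annotations = []       # (start, end, type) tuples


class WhitespaceTokeniser:
    """Toy PR: adds a Token annotation for each whitespace-separated word."""
    def execute(self, doc):
        pos = 0
        for word in doc.text.split():
            start = doc.text.index(word, pos)
            end = start + len(word)
            doc.annotations.append((start, end, "Token"))
            pos = end


class TokenCounter:
    """Toy PR: records how many Token annotations earlier PRs produced."""
    def execute(self, doc):
        doc.features["tokenCount"] = sum(
            1 for ann in doc.annotations if ann[2] == "Token"
        )


class SerialPipeline:
    """Toy controller: runs every PR, in order, over every document."""
    def __init__(self, prs):
        self.prs = list(prs)

    def execute(self, corpus):
        for doc in corpus:
            for pr in self.prs:
                pr.execute(doc)


corpus = [Document("GATE handles Unicode text"), Document("Два слова")]
SerialPipeline([WhitespaceTokeniser(), TokenCounter()]).execute(corpus)
for doc in corpus:
    print(doc.text, doc.features)
```

Keeping the PR list as data of the controller is what makes it possible to swap, reorder or reconfigure components without touching their code.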

7 7(18) Visual Resources (VRs)

8 8(18) VRs (2): Coreference

9 9(18) VRs (3): Syntax

10 10(18) Displaying Multilingual Data GATE uses standard (& imperfect) Java rendering engine for displaying text.

11 11(18) GATE Unicode Kit (GUK) Complements Java’s facilities. Support for defining Input Methods (IMs). Currently 30 IMs for 17 languages. Pluggable in other applications (e.g. JEdit, EUDICO). Can use a virtual keyboard or standard layouts over QWERTY. IMs defined in plain text files. GUK comes with a standalone Unicode editor. Editing Multilingual Data

12 12(18) Processing Multilingual Data All processing, visualisation and editing tools use GUK

13 13(18) Multilingual IE Components The ANNIE system – a reusable and easily extendable set of components

14 14(18) The Unicode Tokeniser A very portable component for multiple languages: splits text into typed tokens using an FSM dynamically constructed from rules over the character categories defined by the Unicode standard, e.g.: UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* > Token;orth=upperInitial;kind=word; Output is generally localised by a later module (e.g. “don’t” … “do” “n’t”). 23 rules seem able to handle Indo-European languages without changes. The English tokeniser: Unicode tokeniser + pattern grammar FST
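The example rule above can be approximated in Python with the standard `unicodedata` module, whose general categories `Lu`, `Ll` and `Pd` correspond to UPPERCASE_LETTER, LOWERCASE_LETTER and DASH_PUNCTUATION. This is only a sketch of that single rule, written as a hand-coded scan and not as a compiled FSM; the function names and the dictionary layout of a token are assumptions made for illustration.

```python
import unicodedata


def match_upper_initial(text, start):
    """Match UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* at start.

    Returns the end offset of the match, or None if the rule does not apply.
    """
    if start >= len(text) or unicodedata.category(text[start]) != "Lu":
        return None
    end = start + 1
    while end < len(text) and unicodedata.category(text[end]) in ("Ll", "Pd"):
        end += 1
    return end


def upper_initial_tokens(text):
    """Collect every span matched by the rule, scanning left to right."""
    tokens = []
    i = 0
    while i < len(text):
        end = match_upper_initial(text, i)
        if end is None:
            i += 1                  # this one rule does not fire here
        else:
            tokens.append({
                "start": i,
                "end": end,
                "string": text[i:end],
                "kind": "word",
                "orth": "upperInitial",
            })
            i = end
    return tokens


for tok in upper_initial_tokens("Sheffield и София"):
    print(tok)
```

Because the test is on Unicode categories and not on a hard-coded A-Z range, the same rule fires on the Latin-script and the Cyrillic word alike, which is the portability the slide is pointing at. A full tokeniser would need the remaining rules to cover lowercase words, numbers, punctuation and spaces.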

15 15(18) POS tagging in new languages TIDES Surprise Language: used the Hepple tagger but substituted a Cebuano/Hindi lexicon for the English one. Used an empty ruleset since no training data was available. Used default heuristics (e.g. return NNP for capitalised words). Very experimental, but reasonable results: 67% correctness for Hindi and 75% for Cebuano. Adaptation time per language: 2 days
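A lexicon lookup with a capitalisation fallback, as described on this slide, might look roughly like the Python sketch below. It is not the Hepple tagger: the `tag` function, the toy English lexicon, and the choice of NN as the tag for unknown lowercase words are all assumptions. The only heuristic taken from the slide is returning NNP for capitalised words.

```python
def tag(tokens, lexicon, default_tag="NN"):
    """Tag tokens by lexicon lookup, falling back on simple heuristics.

    lexicon maps a word form to its tag; swapping in another language's
    lexicon is the only change needed to retarget the function.
    default_tag is an assumed fallback for unknown lowercase words.
    """
    tagged = []
    for tok in tokens:
        if tok in lexicon:
            tagged.append((tok, lexicon[tok]))
        elif tok[:1].isupper():
            tagged.append((tok, "NNP"))     # heuristic named on the slide
        else:
            tagged.append((tok, default_tag))
    return tagged


# Toy English lexicon for illustration only; a real run would load the
# lexicon of the target language instead.
toy_lexicon = {"the": "DT", "visited": "VBD"}

print(tag(["the", "Maria", "visited", "market"], toy_lexicon))
```

With an empty ruleset there is no contextual correction step, so the quality of the output depends entirely on lexicon coverage and on how well the fallback heuristics fit the language.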

16 16(18) Porting NE grammars Most English JAPE rules are based on POS tags and gazetteer lookup. Grammars can be reused for languages with similar word order, orthography etc. No time to make a detailed study of Cebuano, but it is very similar in structure to English. Most of the rules were left as for English, with some adjustments, especially to handle dates. Used both English and Cebuano grammars and gazetteers, because NEs appear in both languages
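Using the gazetteers of both languages side by side can be pictured with a short Python sketch. The `gazetteer_lookup` function and the list entries are illustrative assumptions, not GATE's gazetteer component or its actual lists; the sketch only shows that a token is checked against every list, so a name in either language produces a lookup hit.

```python
def gazetteer_lookup(tokens, gazetteers):
    """Return (index, token, entry_type) for tokens found in any gazetteer.

    gazetteers is a sequence of dicts mapping a lowercased entry to its type;
    the lists are consulted in order and the first hit wins.
    """
    hits = []
    for i, tok in enumerate(tokens):
        key = tok.lower()
        for entries in gazetteers:
            if key in entries:
                hits.append((i, tok, entries[key]))
                break
    return hits


# Illustrative entries only.
english_dates = {"monday": "date", "january": "date"}
cebuano_dates = {"lunes": "date", "enero": "date"}

print(gazetteer_lookup(["Monday", "Lunes", "Sheffield"],
                       [english_dates, cebuano_dates]))
```

Grammar rules written against the lookup type (here "date") do not need to know which language's list produced the hit, which is one reason the English rules could largely be left unchanged.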

17 17(18) TIDES Evaluation Results [Table: P, R and F per entity type (Person, Org, Location, Date, Total), for Cebuano and for the English Baseline; the values are not preserved in this transcript]
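Reading the column headers P, R and F as precision, recall and F-measure, the scores for one entity type could be computed along the lines of the Python sketch below. The slide does not say how matches were scored, so the sketch assumes strict matching (same offsets and same type) and the balanced F-measure; the function name and the sample annotations are hypothetical.

```python
def precision_recall_f(gold, predicted):
    """Strict-match precision, recall and balanced F-measure.

    gold and predicted are collections of (start, end, type) tuples; a
    prediction counts as correct only if it matches a gold item exactly.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up annotations, not results from the evaluation.
gold = {(0, 17, "Person"), (27, 36, "Location")}
predicted = {(0, 17, "Person"), (27, 36, "Org")}

print(precision_recall_f(gold, predicted))
```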

18 18(18) Conclusion GATE – a Unicode-based NLP infrastructure, particularly suitable for multilingual adaptation of IE systems Requires little involvement of native speakers and very little annotated data for a basic job Future work –Improving multilingual support, e.g., morphology support, automatic language and encoding identification –Learning gazetteer lists from annotated corpora

