©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials,

Slides:



Advertisements
Similar presentations
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Machine Learning PoS-Taggers COMP3310 Natural Language Processing Eric.
Advertisements

School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING PoS-Tagging theory and terminology COMP3310 Natural Language Processing.
1/(20) Introduction to ANNIE Diana Maynard University of Sheffield March 2004
An Introduction to GATE
University of Sheffield NLP Module 4: Machine Learning.
University of Sheffield NLP Module 11: Advanced Machine Learning.
QA-LaSIE Components The question document and each candidate answer document pass through all nine components of the QA-LaSIE system in the order shown.
ANNIC ANNotations In Context GATE Training Course 27 – 28 April 2006 Niraj Aswani.
Intelligent Information Retrieval CS 336 –Lecture 3: Text Operations Xiaoyan Li Spring 2006.
Part of Speech Tagging Importance Resolving ambiguities by assigning lower probabilities to words that don’t fit Applying to language grammatical rules.
©2012 Paula Matuszek CSC 9010: Text Mining Applications: Text Features Dr. Paula Matuszek (610)
Part II. Statistical NLP Advanced Artificial Intelligence Part of Speech Tagging Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most.
Information Retrieval in Practice
Search Engines and Information Retrieval
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Shallow Parsing.
Using Information Extraction for Question Answering Done by Rani Qumsiyeh.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dörre, Peter Gerstl, and Roland Seiffert Presented By: Jake Happs,
Modules, Hierarchy Charts, and Documentation
BIOI 7791 Projects in bioinformatics Spring 2005 March 22 © Kevin B. Cohen.
CS224N Interactive Session Competitive Grammar Writing Chris Manning Sida, Rush, Ankur, Frank, Kai Sheng.
Overview of Search Engines
©2012 Paula Matuszek CSC 9010: Text Mining Applications Fall, 2012 Introduction to GATE Dr. Paula Matuszek Taken partially from.
Logics for Data and Knowledge Representation Exercises: Languages Fausto Giunchiglia, Rui Zhang and Vincenzo Maltese.
Named Entity Recognition without Training Data on a Language you don’t speak Diana Maynard Valentin Tablan Hamish Cunningham NLP group, University of Sheffield,
Lemmatization Tagging LELA /20 Lemmatization Basic form of annotation involving identification of underlying lemmas (lexemes) of the words in.
Part II. Statistical NLP Advanced Artificial Intelligence Applications of HMMs and PCFGs in NLP Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme.
CSC 9010 Spring Paula Matuszek A Brief Overview of Watson.
Some Advances in Transformation-Based Part of Speech Tagging
Search Engines and Information Retrieval Chapter 1.
Probabilistic Model for Definitional Question Answering Kyoung-Soo Han, Young-In Song, and Hae-Chang Rim Korea University SIGIR 2006.
Survey of Semantic Annotation Platforms
ANNIC ANNotations In Context GATE Training Course October 2006 Kalina Bontcheva (with help from Niraj Aswani)
Information Extraction From Medical Records by Alexander Barsky.
Identifying Comparative Sentences in Text Documents
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
Czech-English Word Alignment Ondřej Bojar Magdalena Prokopová
©2003 Paula Matuszek CSC 9010: Information Extraction Dr. Paula Matuszek (610) Fall, 2003.
©2003 Paula Matuszek CSC 9010: Text Mining Applications Document Summarization Dr. Paula Matuszek (610)
Introduction to GATE Developer Ian Roberts. University of Sheffield NLP Overview The GATE component model (CREOLE) Documents, annotations and corpora.
©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. CSC 9010: Text Mining Applications.
IBM Research © Copyright IBM Corporation 2005 | A Development Environment for Configurable Meta-Annotators in a Pipelined NLP Architecture Youssef Drissi,
Sheffield -- Victims of Mad Cow Disease???? Or is it really possible to develop a named entity recognition system in 4 days on a surprise language with.
CSA2050 Introduction to Computational Linguistics Parsing I.
Lexico-semantic Patterns for Information Extraction from Text The International Conference on Operations Research 2013 (OR 2013) Frederik Hogenboom
MedKAT Medical Knowledge Analysis Tool December 2009.
March 2006Introduction to Computational Linguistics 1 CLINT Tokenisation.
Part-of-speech tagging
JAVA BEANS JSP - Standard Tag Library (JSTL) JAVA Enterprise Edition.
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.
Chunk Parsing. Also called chunking, light parsing, or partial parsing. Method: Assign some additional structure to input over tagging Used when full.
©2012 Paula Matuszek CSC 9010: Information Extraction Overview Dr. Paula Matuszek (610) Spring, 2012.
Word classes and part of speech tagging. Slide 1 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches.
1 Dictionary priorities, e- dictionaries of compounds, morphological mode Cvetana Krstev & Duško Vitas.
©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials,
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
Dictionary graphs Duško Vitas University of Belgrade, Faculty of Mathematics.
Part-of-Speech Tagging CSCI-GA.2590 – Lecture 4 Ralph Grishman NYU.
University of Sheffield NLP Module 1: Introduction to JAPE © The University of Sheffield, This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike.
Information Retrieval in Practice
Introduction to Machine Learning and Text Mining
What is SPaG? pelling unctuation nd rammar. What is SPaG? pelling unctuation nd rammar.
Social Knowledge Mining
Dept. of Computer Science University of Liverpool
Chunk Parsing CS1573: AI Application Development, Spring 2003
Information Retrieval and Web Design
Presentation transcript:

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, CSC 9010: Text Mining Applications Fall, 2012 ANNIE Introduction Dr. Paula Matuszek

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Recap l The goal of information extraction is to pull some well-defined information out of a large corpus of unstructured documents and put it in a more structured form, for easier access and understandability. l Typically you have –a domain model –a knowledge model –an extraction engine

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Examples l Some examples of information extraction in use: –Watson: –I2EOnDemand –ANNIE demo: –The MUC and TREC conferences sponsored by the Information Technology Laboratory of the National Institute of Standards (NIST)MUCTREC –The TIPSTER program of DARPATIPSTER

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Watson Information Sources l Software is built on top of UIMA: unstructured information management application: framework built by IBM and now open-source. l The information corpus was downloaded and indexed offline; no web access during the game. l For the game, Watson had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage, including the full text of Wikipedia. ( ) l Corpus was developed from a large variety of text sources: –baseline from wikipedia, Project Gutenberg, newspaper articles, thesauri, etc. –extend with web retrieval, extract potentially relevant text “nuggets”, score for informative, merge the best into corpus l Primary corpus started as unstructured text, not semantically tagged. l About 2% of Jeopardy! answers can be looked up directly. l Also leverages semistructured and structured sources such as Wordnet and Yago.

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, MUC tracks l MUC-1 (1987), MUC-2 (1989): Naval operations messages. l MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries. l MUC-5 (1993): Joint ventures and microelectronics domain. l MUC-6 (1995): News articles on management changes. l MUC-7 (1998): Satellite launch reports.

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, TREC Tracks l Focus is information retrieval, not just information extraction l Ongoing: find them at l Some of the 2012 tracks: –Crowdsourcing Track –Knowledge Base Acceleration Track –Legal Track –Medical Records Track –Microblog Track l Some earlier tracks –Entity Track –Chemical IR Track –Genomics Track

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Knowledge-Based Approaches l Systems like GATE and I2E use an approach based in natural language processing and modeling knowledge –domain model –knowledge model l There are also machine-learning approaches using classifiers and sequence models l And, of course, hybrid approaches. GATE includes some machine-learning based tools (not in ANNIE)

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Domain Model l Terms: enumerated strings which are members of some class l Classes: Categories of terms, such as “locations”, “proteins”, “diseases Both are often organized into an ontology l Extraction rules

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Terms, Classes, Ontologies l I2EOnDemand has a rich ontology in the medical domain; we used it for searching l In ANNIE these are represented by gazetteer lists and by an index file which describes all the gazetteers, including their major type and possibly minor type and language. l GATE also has a number of ontology plugins and tools

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Domain Rules l LHS matches some pattern in the domain, RHS does something l In ANNIE, RHS typically adds an annotation l In I2EOnDemand –tag or annotate members of classes –identify and extract various kinds of relations

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Knowledge Model l What you’re trying to extract l Can be pretty simple and generic: –tuples of the form entity-relation-entity –entity:class relations l Can be more detailed and specific: Org NameVillanova University Org AliasNova, Villanova Org Description“The local university” Org TypeEducational Institution Org LocaleVillanova, PA

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Knowledge Model in ANNIE l The knowledge model isn’t specified explicitly; it can basically be considered as all of the annotations which the various processing resources can produce. Each processing resource has an implicit knowledge model l Additional models are defined in JAPE, which can create additional annotations.

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Extraction Engine l Tool which applies rules to text and extracts matches –Tokenizer –Part of Speech (POS) Tagger –Term and class tagger –Rule engine: match LHS, execute RHS l Rule engine is iterative l May include an interactive component which is essentially a query engine against already extracted information

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, ANNIE/GATE Extraction Engine l The entire GATE system can be considered as an extraction engine (although it does a lot more) l Most of the processing resources in ANNIE and many others available in CREOLE are specific extraction engines l Some systems (eg, AeroText, OpenCalais) make an explicit distinction among the domain model, the knowledge model, and the extraction engine. l The distinction in ANNIE is less clear

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, On to ANNIE: l A family of Processing Resources for language analysis included with GATE l Stands for A Nearly-New Information Extraction system. l Using finite state techniques to implement various tasks: tokenization, semantic tagging, verb phrase chunking, and so on. l (LaSIE is the forerunner of ANNIE, focused specifically on information extraction for the TREC conferences)

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, ANNIE IE Modules

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, ANNIE Standard Components l These are what is loaded when you load ANNIE and run the default application –Document Reset –Tokenizer –Gazetteer: lists of entities –Sentence Splitter/Regex sentence splitter –Part of Speech Tagger –Named Entity Transducer –Orthomatcher

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Document Reset l Task: reset a document to its original state. Releases old annotations for garbage collection. l Parameters: –annotationTypes: specify annotations to remove. Default is all. –setsToKeep: –setsToRemove: –keepOriginalMarkupsAS: Boolean.

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, ANNIE Component: Tokenizer l Task: split text into simple tokens l In ANNIE, used as one piece of the English Tokenizer l Five types of tokens, some of which have attributes or subtypes l Uses tokenizer rules –left hand side (LHS): pattern to be matched –right hand side (RSH): action to be taken

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Default Token Types l Token Types. All have attributes string and length l Tokens –word. Any contiguous set of upper or lowercase letters, including a hyphen. –orthography attribute (orth): upperInitial, allCaps, lowerCase, mixedCaps –number. Any combination of successive digits. –symbol. Currency symbols, ^, +, =, etc.. –symbolkind attribute: currency –punctuation. –position attribute (position) startpunct, endpunct –subkind attribute (subkind) dashpunct l SpaceTokens –space. any non-control character –control. any control character

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Tokenizer Rule l Operations used on the LHS: – | (or) – * (0 or more occurrences) – ? (0 or 1 occurrences) – + (1 or more occurrences) l The RHS uses ’;’ as a separator, and has the following format: {LHS} > {Annotation type};{attribute1}={value 1};...;{attribute n}={value n}

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Example Tokenizer Rule –"UPPERCASE_LETTER" "LOWERCASE_LET TER"* –> –Token;orth=upperInitial;kind=word; –The sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type “Token”. The attribute “orth” (orthography) has the value “upperInitial”; the attribute “kind” has the value “word”.

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, English Tokenizer l Task: adapt generic tokenizer for POS tagger by dealing with constructs involving apostrophes –don’t --> do + n’t –you’ve --> you + ‘ve l Uses a JAPE transducer: Java Annotation Patterns Engine. –LHS: annotation pattern description –RHS: annotation manipulation statement l If you’re going to use the POS tagger, always use this tokenizer

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, ANNIE Component: Gazetteer l Task: identify entity names based on lists –Located in /plugins/ANNIE/resources/gazetee r –defined in lists.def, which gives name, major type, minor type, language l Example lists.def entries: –airports.lst:location:airport –city.lst:location:city –festival.lst:date:festival –day.list:date:day

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Gazetteer lists l The gazetteer lists used are plain text files, with one entry per line. l Typical lists include –named entities, such as locations, organizations, dates, names, –grammatical entities such as determiners –components of entities, such location prefixes –anything else that can usefully be enumerated l Don’t always have a minor type or language. Many are one-offs. (spur_ident)

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Gazetteer Parameters l Init-time parameters –listsURL: index file. Default is lists.def –encoding: default is UTF-8 –gazetteerFeatureSeparator. Not required. –caseSensitive (Boolean). Default is true. l Run-time parameters –document to process –annotationSetName: what to call the Lookup annotation set. Not required. –wholeWordsOnly: Default is true –longestMatchOnly: Default is true. EG: Amazon UK

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Gazetteer Editor l You can add or modify –the name of a list –the major and minor categories in the list –an entry in the list l You can delete a list by right-clicking the name l You can also add a new list l Or edit everything outside of GATE with a text editor.

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, ANNIE Component: Sentence Splitter l Task: just what it says: segments the text into sentences. l The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds. l This module is required for the POS tagger.

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, POS Tagger l Modified version (Hepple) of the Brill tagger, probably the best-known of the POS taggers –produces a POS tag as an annotation on each word or symbol –Uses a default lexicon and rule set –alternate lexicons for all upper-case and all lower-case corpora. l Must run English Tokenizer and Sentence Splitter first

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Partial List of Hepple (Penn treebank) Tags CC - coordinating conjunction: ‘and’, ‘but’, ‘nor’, ‘or’, ‘yet’, plus, minus, less. CD - cardinal number IN - preposition or subordinating conjunction JJ - adjective: Hyphenated compounds that are used as modifiers; happy-go-lucky. JJR - adjective - comparative: Adjectives with the comparative ending ‘-er’’. JJS - adjective - superlative: Adjectives with the superlative ending ‘-est’ (and ‘worst’)’. NN - noun - singular or mass NNS - noun - plural NP - proper noun - singular NPS - proper noun - plural POS - possessive ending: Nouns ending in ‘’s’ or ‘’’. PP - personal pronoun RB - adverb: most words ending in ‘-ly’. Also ‘quite’, ‘too’, ‘very’, others RBR - adverb - comparative: adverbs ending with ‘-er’ with a comparative meaning. RBS - adverb - superlative SYM - symbol: technical symbols or expressions that aren’t English words. UH - interjection: Such as ‘my’, ‘oh’, ‘please’, ‘uh’, ‘well’, ‘yes’. VBD - verb - past tense VBG - verb - gerund or present participle VBN - verb - past participle VBP - verb - non-3rd person singular present VB - verb - base form: subsumes imperatives, infinitives and subjunctives. VBZ - verb - 3rd person singular present

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Example Lexicon Entries l '30s CD NNS l & CC SYM l CD l Announcement NN l Figuring VBG l Frequently RB l... l A total of over 17,000 entries

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Brill Tagger l A Brill tagger starts by assigning all the known words from the lexicon l It next assigns NNP to capitalized unknown words and NN to everything else unknown. l Finally, a bunch of replacement rules change the assignment based on various features and context.

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Example Brill Tagger Rules l replace(pos 'NN' 'JJ') # [suffix#less#[0]] –"replace tag NN with JJ if the word in question ends in "less"". l replace(pos 'VB' 'NN') # [canHave#'NN'#[0] pos#'DT'#[~1]] –"replace tag VB with NN if the the word in question can have tag NN (according to the lexicon) and if the previous word is tagged DT" From

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, LOTS of Parameters l Init-Time: all optional –encoding - encoding to be used for reading rules and lexicons –lexiconURL - The URL for the lexicon file –rulesURL - The URL for the ruleset file l Run-time: –document - The document to be processed. Required –inputASName - Input annotation set. Optional –outputASName - Output annotation set. Optional –baseTokenAnnotationType - name of annotation type for tokens ( default = Token). Required. –baseSentenceAnnotationType - name of annotation type for sentences ( default = Sentence). Required. –outputAnnotationType - default = Token. Required. –failOnMissingInputAnnotations - Default is true.

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Named Entity Transducer l Task: find terms that suggest entities l Hand-crafted grammars define patterns over the annotations l Written in JAPE l Doesn’t require that you run anything else, but doesn’t do much if you don’t already have annotations. l Adds new token types

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Required Parameters l Initialization –encoding. default is UTF-8 –grammar. default is /plugins/ANNIE/resources/N E/main.jape (which loads a bunch more) l Runtime –Corpus name (will default if there’s only one)

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Example JAPE rules Phase:Unknown Input: Location Person Date Organization Lookup Rule: Known Priority: 100 ( {Location}| {Person}| {Date}| {JobTitle}):known --> {} Rule:Unknown Priority: 50 ( {Token.category == NNP}) :unknown --> :unknown.Unknown = {kind = "PN", rule = Unknown} (This is only a part of the actual rules)

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Orthomatcher l Task: add co-reference information to Named Entities based on orthographic information. In a typical text, a NE may be referred to multiple times, with abbreviations and acronyms –Creates match lists for tagged NEs –May assign an entity type to previously unclassified entities –Won’t reassign already classified entities l Without NE Transducer annotations doesn’t actually do anything.

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Required Parameters l Required Init-time –definitionFileURL. Default is /plugins/ANNIE/resources/othomatc her/listsNM.def. –encoding. Default is UTF-8. –minimumNicknameLikehihood: default is l Run-time –corpus or document –Can also have: AnnotationTypes: list of types to be processed. Default: Organization, Person, Location, Date. Not required.

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Co-reference Editor l Can modify the reference chains within a document –Delete an existing member of the chain –Add a new item to a chain l This will change all relevant chains: adding an item to a chain will change the chain for all the items in it This will not change NE types or other annotations; it doesn’t propagate.

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Some Additional ANNIE Modules l These modules are not included in the default ANNIE load, but they are available in processing resources –Morphological analyzer (stemmer). Add “root” annotation –Pronomial Coreference. Add “coreference” annotation l May need to load tools from CREOLE manager first

©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, Summary l The default ANNIE application uses a sequence of processing resources to create annotations over unstructured text l The tokenizer, sentence splitter, POS tagger and orthomatcher are (mostly) domain-independent l The gazetteer lists and Named Entity Transducer are domain-dependent; ANNIE comes with a reasonably good default set. l The domain information is largely captured in gazetteer lists and JAPE rules.