ANNIE and JAPE
GATE Training Course, 23 November 2006
Diana Maynard, Andrey Shafirin
GATE and Information Extraction
● Basic introduction to IE and GATE
● Overview of ANNIE
● JAPE: rule writing
● JAPE debugger
GATE and IE
● IE is one of the core tasks GATE is designed for
● IE is the basis for many other, more complex applications, e.g. semantic annotation
● The cornerstone of IE is Named Entity Recognition
A Typical IE System
1. Pre-processing
  – format detection
  – tokenisation
  – word segmentation
  – sense disambiguation
  – sentence splitting
  – POS tagging
2. Named entity detection
  – entity detection
  – coreference
Two Approaches to IE
Knowledge Engineering
● rule based
● developed by experienced language engineers
● makes use of human intuition
● obtains marginally better performance
● development can be very time consuming
● some changes may be hard to accommodate
Learning Systems
● use statistics or other machine learning
● developers do not need LE expertise
● require large amounts of annotated training data
● some changes may require re-annotation of the entire training corpus
Named Entity Recognition
● NER involves identifying proper names in texts and classifying them into a set of predefined categories of interest.
● Three universally accepted categories: person, location and organisation
● Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc.), addresses etc.
● Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
ANNIE
[Figure: ANNIE IE modules. Square boxes are processes, rounded ones are data.
Input: URL or text, in a document format (XML, HTML, SGML, …), read into a GATE Document.
Processes: Unicode Tokeniser, FS Gazetteer Lookup, Sentence Splitter, Hepple POS Tagger, Semantic Tagger, Ortho Matcher, Pronominal Coreferencer.
Data: Character Class Sequence Rules, Lists, JAPE Sentence Patterns, Brill Rules, Lexicon, JAPE IE Grammar Cascade, JAPE Grammar.
Output: GATE Document, XML dump of IE Annotations.]
Unicode Tokeniser
● Bases tokenisation on Unicode character classes
● Language-independent tokenisation
● Declarative token specification language, e.g.:
  "UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token; orthography=upperInitial; kind=word
● Look at the ANNIE English tokeniser and at tokenisers for other languages (in the plugins directory) for more information and examples
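The tokeniser rule above can be read as "one uppercase letter followed by any number of lowercase letters is a word token with upper-initial orthography". The following is a minimal Java sketch of that single rule using the JDK's own Unicode character classes. It is a toy illustration, not GATE's tokeniser: the class `ToyTokeniser`, its methods and the fallback label `other` are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only: mimics one tokeniser rule, it is not GATE's tokeniser. */
final class ToyTokeniser {

    /** True if text is one UPPERCASE_LETTER followed by zero or more LOWERCASE_LETTERs. */
    static boolean isUpperInitial(String text) {
        if (text.isEmpty() || Character.getType(text.charAt(0)) != Character.UPPERCASE_LETTER) {
            return false;
        }
        // Simplification: inspects UTF-16 chars, so characters outside the BMP are not handled.
        for (int i = 1; i < text.length(); i++) {
            if (Character.getType(text.charAt(i)) != Character.LOWERCASE_LETTER) {
                return false;
            }
        }
        return true;
    }

    /** Splits on whitespace and labels each word with the features the rule would assign. */
    static List<String> describe(String input) {
        List<String> out = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // "other" is a placeholder label invented for this sketch.
            String orth = isUpperInitial(word) ? "upperInitial" : "other";
            out.add(word + ": kind=word, orthography=" + orth);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describe("Diana Maynard uses GATE")) {
            System.out.println(line);
        }
    }
}
```

Because the test is expressed in terms of Unicode character classes rather than ASCII ranges, the same check works for accented or non-Latin capitals, which is the point of the language-independent design.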
Gazetteer
● Set of lists compiled into Finite State Machines
● 60k entries in 80 types, inc.: organization; artifact; location; amount_unit; manufacturer; transport_means; company_designator; currency_unit; date; government_designator; ...
● Each list has attributes MajorType, MinorType and Language:
  city.lst: location: city: english
  currency_prefix.lst: currency_unit: pre_amount
  currency_unit.lst: currency_unit: post_amount
● Attributes are used as input to JAPE grammars
● List entries may be entities or parts of entities, or they may contain contextual information (e.g. job titles often indicate people)
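To show how list attributes end up on Lookup annotations, here is a minimal Java sketch of a gazetteer lookup. It is a toy model under stated assumptions: it uses a hash map instead of a finite state machine, splits text on whitespace, and keeps only the longest entry at each position. The class `ToyGazetteer`, its methods and the sample entries are hypothetical; only the major and minor types come from the list definitions above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration only: a map-based lookup, not GATE's finite-state gazetteer. */
final class ToyGazetteer {
    // entry text -> {majorType, minorType}
    private final Map<String, String[]> entries = new HashMap<>();
    private int maxWords = 1;

    /** Registers the entries of one list together with the list's major and minor type. */
    void addList(String majorType, String minorType, String... listEntries) {
        for (String entry : listEntries) {
            entries.put(entry, new String[] {majorType, minorType});
            maxWords = Math.max(maxWords, entry.split(" ").length);
        }
    }

    /** Scans the text left to right and reports the longest matching entry at each position. */
    List<String> lookup(String text) {
        String[] words = text.trim().split("\\s+");
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int matchedLen = 0;
            for (int len = Math.min(maxWords, words.length - i); len >= 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                String[] types = entries.get(candidate);
                if (types != null) {
                    found.add("Lookup[" + candidate + "] majorType=" + types[0]
                            + " minorType=" + types[1]);
                    matchedLen = len;
                    break;
                }
            }
            i += Math.max(matchedLen, 1); // skip past the match, or move on one word
        }
        return found;
    }

    public static void main(String[] args) {
        ToyGazetteer gazetteer = new ToyGazetteer();
        gazetteer.addList("location", "city", "London", "New York");   // sample entries
        gazetteer.addList("currency_unit", "post_amount", "pounds");   // sample entries
        for (String lookup : gazetteer.lookup("She paid 20 pounds in New York")) {
            System.out.println(lookup);
        }
    }
}
```

A JAPE pattern such as `{Lookup.majorType == location}` then only needs to test the attributes, not the entry text itself.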
The Named Entity Grammar
● JAPE phases run sequentially and constitute a cascade of FSTs over annotations
● hand-coded rules are applied to annotations to identify NEs
● annotations come from the format analysis, tokeniser, POS tagger and gazetteer modules
● use of contextual information
● rule priority based on pattern length, rule status and rule ordering
● Common entities: persons, locations, organisations, dates, addresses.
Orthomatcher
● Orthographic coreference between annotations in the same document, e.g. Mr Brown, James Brown
● Matching rules are invoked between annotations of the same type, or between an existing annotation and an “Unknown” annotation
● The latter is the only case where an annotation type can be changed
● Lookup tables of aliases and exceptions (i.e. overriding of matching rules)
● Also pronominal coreference (see User Guide)
JAPE: a Jolly And Pleasant Experience
● Grammars (cascades of phases)
  – Phases (lists of rules)
● Rules
  – LHS (patterns)
  – RHS (actions)
● Priority
  – Implicit
    ● longest match
    ● first mention
  – Explicit
    ● priority
LHS of JAPE rules
● The LHS of the rule contains patterns to be matched, in the form of annotations (and optionally their attributes).
● Annotation types to be recognised must be declared at the beginning of the phase
● Annotations may be combined using traditional operators [ | * + ?]
● There is no negative operator
● More than one pattern can be matched in a single rule
● Left and right context (not to be annotated) can be matched
Examples of LHS patterns
  ({Lookup.majorType == location}) :loc
  ({Token.string == "in"} | {Token.string == "by"}) ({Year}) :date
  ( ({Lookup.majorType == jobtitle}):jobtitle {Surname} ):person
RHS of JAPE rules
  ({Lookup.majorType == location}) :loc
  -->
  :loc.Location = {kind = "city", rule = "Location1"}

  (
    ({Lookup.majorType == jobtitle}):jobtitle
    {Surname}
  ):person
  -->
  :jobtitle.JobTitle = {rule = "PersonJobTitle"},
  :person.Person = {kind = "Surname", rule = "PersonJobTitle"}
Complex RHS
● The plain JAPE RHS is quite limited in what it can do
● But you can use any Java you like on the RHS of the rule
● Useful for e.g. removing temporary annotations, and percolating and manipulating features from previous annotations
● It also means you can use JAPE for many things other than creating annotations, e.g. counting things, manipulating the text, adding annotations to the document, etc.
● And you don’t have to be a Java expert to do it.
● Although it helps to have friends who are…
Example of using Java in a rule
  Rule: FirstName
  ({Lookup.majorType == person_first}):person
  -->
  {
    // annotations bound to the :person label on the LHS
    gate.AnnotationSet person = (gate.AnnotationSet)bindings.get("person");
    gate.Annotation personAnn = (gate.Annotation)person.iterator().next();
    // copy the Lookup's minorType into a gender feature on the new annotation
    gate.FeatureMap features = Factory.newFeatureMap();
    features.put("gender", personAnn.getFeatures().get("minorType"));
    features.put("rule", "FirstName");
    outputAS.add(person.firstNode(), person.lastNode(), "FirstPerson", features);
  }
Available Java objects
● bindings: binding variables
● doc: GATE Document
● annotations: all GATE Document annotations
● inputAS, outputAS: phase input and output annotations
● ontology
See documentation for more details…
JAPE Application modes
● Brill (fires all matches)
● First (shortest match fires)
● Once (Phase exits after first match)
● All (as for Brill, but matching continues from the offset following the current one, not from the end of the last match)
● Appelt (priority ordering: longest match fires, then explicit rule priority, then first defined rule fires)
Note that prioritisation only operates within a single phase, not globally
Application Modes
[Figure: matches of the pattern {A}+ under the Appelt, Once, Brill, First and All modes]
Example: “China Sea”
  Rule: Location1
  Priority: 25
  (
    ({Lookup.majorType == loc_key, Lookup.minorType == pre})?
    {Lookup.minorType == country}
    ({Lookup.majorType == loc_key, Lookup.minorType == post})?
  ) :locName
  -->
  :locName.Location = {kind = "location", rule = "Location1"}

  Rule: Location2
  Priority: 20
  ({Lookup.majorType == location}) :location
  -->
  :location.Name = {kind = "location", rule=GazLocation}
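In Appelt mode, the choice between the two rules above follows the ordering given earlier: longest match first, then explicit priority, then the rule defined first. A minimal Java sketch of that ordering is shown below. It is a toy model, not GATE's matcher: the class `ToyAppelt`, the `Candidate` type and the use of character length to compare matches are assumptions for illustration. For the text "China Sea", Location1 can cover both words while Location2 can cover only "China", so length alone decides.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy illustration only: models the Appelt ordering, not GATE's rule matcher. */
final class ToyAppelt {

    /** One rule's match starting at the current position. */
    static final class Candidate {
        final String rule;
        final int length;    // length of the matched text in characters (assumed measure)
        final int priority;  // explicit Priority value of the rule
        final int order;     // position of the rule in the grammar file, 0 = first

        Candidate(String rule, int length, int priority, int order) {
            this.rule = rule;
            this.length = length;
            this.priority = priority;
            this.order = order;
        }
    }

    /** Longest match first, then highest priority, then earliest definition. */
    static final Comparator<Candidate> APPELT_ORDER =
            Comparator.comparingInt((Candidate c) -> -c.length)
                    .thenComparingInt(c -> -c.priority)
                    .thenComparingInt(c -> c.order);

    /** Returns the name of the rule that fires; the list must not be empty. */
    static String winner(List<Candidate> candidates) {
        return candidates.stream().min(APPELT_ORDER).get().rule;
    }

    public static void main(String[] args) {
        List<Candidate> chinaSea = Arrays.asList(
                new Candidate("Location1", "China Sea".length(), 25, 0),
                new Candidate("Location2", "China".length(), 20, 1));
        System.out.println(winner(chinaSea));
    }
}
```

Priority only matters when two candidates have the same length, which is why a shorter match can lose even when its rule has the higher Priority value.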
JAPE Hints and Tricks
● JAPE is quite limited in some respects as to what can be done
  – There is no negative operator
  – It can be slow if it is badly written, e.g. ({Token})*
  – Context is consumed, which can make rule-writing awkward
  – Priority can be difficult to set correctly
● But fear not, there is generally a sneaky way around it…
How to prevent a pattern from matching
  Rule: disablePattern
  Priority: 1000
  ( )
  -->
  {}
● Instead of using a negative operator, we can add a high-priority rule which does nothing when fired.
● This rule is preferred to a lower-priority rule which performs the intended action, so the action is carried out only when the high-priority pattern does not apply.
How to play with input annotations
  Input: Person Organisation VerbWork Split
  …
  Rule: RelationWorkIn
  ({Person} {VerbWork} {Organisation})
  -->
  {… /* create annotation of type “Relation” */ …}
● Use existing annotations to find relations
● We leave Token out of the Input list to allow more flexibility, i.e. there can be additional words between the annotations specified
● Including Split in the Input list ensures we don’t cross sentence boundaries
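The effect of the Input declaration can be sketched in a few lines of Java. This is a toy model, not GATE's matcher: annotations are reduced to their type names in document order, and the class `ToyInputFilter` and its method are hypothetical. Types not listed as input are invisible to the pattern, so intervening Tokens do not block a match, while a visible Split between two pattern elements does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy illustration only: shows how an Input list filters what a pattern can see. */
final class ToyInputFilter {

    /**
     * Counts matches of the pattern over the annotation types in document order,
     * after dropping every type that is not declared as input.
     */
    static int countMatches(List<String> annotationTypes, Set<String> input, List<String> pattern) {
        List<String> visible = new ArrayList<>();
        for (String type : annotationTypes) {
            if (input.contains(type)) {
                visible.add(type);
            }
        }
        int count = 0;
        int i = 0;
        while (i + pattern.size() <= visible.size()) {
            if (visible.subList(i, i + pattern.size()).equals(pattern)) {
                count++;
                i += pattern.size(); // matched annotations are consumed
            } else {
                i++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Set<String> input = new HashSet<>(
                Arrays.asList("Person", "Organisation", "VerbWork", "Split"));
        List<String> pattern = Arrays.asList("Person", "VerbWork", "Organisation");
        // Hypothetical sentence: a Person, two extra words, a VerbWork, one extra word, an Organisation.
        List<String> sentence = Arrays.asList(
                "Person", "Token", "Token", "VerbWork", "Token", "Organisation");
        System.out.println(countMatches(sentence, input, pattern));
    }
}
```

If Token were added to the input set, the same sentence would no longer match, because the Tokens would sit between the pattern elements.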
How to deal with overlapping annotations
● Because matched annotations are consumed, when two annotations overlap (e.g. in gazetteer lists), the second one will never be matched.
● E.g. for the string “hALCAM” with Lookups hAL, ALCAM, and CAM, ALCAM will never be matched
● The solution is to delete the annotations once matched, and then rerun the same grammar phase over the text
● The process may need to be repeated several times (the number of passes is determined by trial and error)
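The delete-and-rerun trick can be traced with a small Java sketch. It is a toy model under stated assumptions: each pass walks left to right, matches the next annotation that starts at or after the current position, and resumes from the end of that match. The class `ToyOverlap`, the `Span` type and the character offsets for “hALCAM” are made up for this illustration and are not GATE's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy illustration only: consumed matches hide overlapping annotations until a rerun. */
final class ToyOverlap {

    /** An annotation covering characters [start, end) of the text. */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** The three Lookups over "hALCAM" from the slide, with assumed offsets. */
    static List<Span> example() {
        return new ArrayList<>(Arrays.asList(
                new Span("hAL", 0, 3), new Span("ALCAM", 1, 6), new Span("CAM", 3, 6)));
    }

    /**
     * Runs one left-to-right pass. Every matched span is deleted from pending,
     * so a later pass can see the spans that this pass skipped over.
     */
    static List<String> runPass(List<Span> pending) {
        List<Span> sorted = new ArrayList<>(pending);
        sorted.sort(Comparator.comparingInt((Span s) -> s.start));
        List<String> matched = new ArrayList<>();
        int cursor = 0;
        for (Span span : sorted) {
            if (span.start >= cursor) {   // not hidden by an earlier, consumed match
                matched.add(span.text);
                cursor = span.end;        // matching resumes after the consumed text
                pending.remove(span);     // delete once matched
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<Span> pending = example();
        int pass = 1;
        while (!pending.isEmpty()) {
            System.out.println("pass " + pass++ + ": " + runPass(pending));
        }
    }
}
```

In this model the first pass consumes hAL and then CAM, and ALCAM only becomes matchable on the second pass, once the annotations that hid it have been deleted.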
More examples
● In the GATE User Guide under the section “Useful tricks with JAPE”
● Look in the ANNIE grammars and in the foreign language grammars: there are many examples of little tricks
● Check the GATE mailing list archives
Custom Processing Resource for your grammars
1. Java developer extends GATE's default JAPE Transducer, creating a Java class:
  package com.yourcompany;
  import gate.creole.Transducer;
  public class CustomTransducer extends Transducer {}
2. JAPE developer adds a definition to the plugin’s creole.xml, giving the resource name (My custom JAPE Transducer), the class (com.yourcompany.CustomTransducer) and the parameter types (java.lang.String, java.net.URL, java.lang.String)
3. GATE user opens the custom resource in the GATE GUI:
  – Right-click on “Processing Resources”
  – In the pop-up menu select “New >” --> “My custom JAPE Transducer”
JAPE debugger
● Speeds up the development of JAPE grammars
● Integrated in GATE GUI
● Friendly for non-experts
Allows you to:
● Inspect the pattern matching
● Find overridden rules
● Detect complex inter-rule influence
● And many other things
Inspection of pattern matching
Overridden rules
Inter-rule influence (finding the problem)
Inter-rule influence (what is that?)
Inter-rule influence (problem synopsis)
Text processed:
  … of the J. L. Kellog Graduate School of Management and the Indiana University School of Business …
Conflicting rule:
  Rule: NotPersonFull
  Priority: 80
  // Det + Surname
  // This rule was commented out because
  // J.L. Kellog was processed without J.
  //
  (
    {Token.category == DT} |
    {Token.category == PRP} |
    {Token.category == RB}
  )
  (
    (PREFIX)* (UPPER) (PERSONENDING)?
  ):foo
Shadowed rule:
  Rule: PersonFullExt
  Priority: 100
  // F.W. Jones
  // Fred Jones
  // Andrew "Flip" Filipowski
  // Andrew J. "Flip" Filipowski
  //({Token.category == DT})?
  (
    ((FIRSTNAME | FIRSTNAMEAMBIG))+
    (INITIALS)?
    ((FIRSTNAME | FIRSTNAMEAMBIG) )*
    (PREFIX)*
    ((UPPER)):surname
    (PERSONENDING)?
  ):person
  -->
Coming soon… JAPE4
What JAPE4 IS:
● a new version of the internal language in GATE release 4
● a language based on the original JAPE
● incorporates best practices from JAPE, Jape+ and Japec
● 3-5 times faster than JAPE
What JAPE4 IS NOT:
● an improved version of the original Jape, Jape+ or Japec, but rather a new language
● a language backward compatible with JAPE
In most cases it seems possible to modify original Jape, Jape+ or Japec grammars easily so that they are compatible with the JAPE4 specification.