ANNIE and JAPE GATE Training Course 23 November 2006 Diana Maynard Andrey Shafirin.


GATE and Information Extraction ● Basic introduction to IE and GATE ● Overview of ANNIE ● JAPE: rule writing ● JAPE debugger

GATE and IE ● IE is one of the core tasks GATE is designed for ● IE is the basis for many other, more complex applications, e.g. semantic annotation ● Cornerstone of IE is Named Entity Recognition

A Typical IE System 1. Pre-processing – format detection – tokenisation – word segmentation – sense disambiguation – sentence splitting – POS tagging 2. Named entity detection – entity detection – coreference

Two Approaches to IE
Knowledge Engineering: ● rule based ● developed by experienced language engineers ● makes use of human intuition ● obtains marginally better performance ● development can be very time consuming ● some changes may be hard to accommodate
Learning Systems: ● use statistics or other machine learning ● developers do not need LE expertise ● require large amounts of annotated training data ● some changes may require re-annotation of the entire training corpus

Named Entity Recognition ● NE recognition involves the identification of proper names in texts, and their classification into a set of predefined categories of interest. ● Three universally accepted categories: person, location and organisation ● Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc.), addresses etc. ● Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.

ANNIE (architecture diagram) ● Input: URL or text, in a document format (XML, HTML, SGML, …), converted to a GATE Document ● ANNIE IE modules, with the data each one uses: Unicode Tokeniser (Character Class Sequence Rules), FS Gazetteer Lookup (Lists), Sentence Splitter (JAPE Sentence Patterns), Hepple POS Tagger (Brill Rules, Lexicon), Semantic Tagger (JAPE IE Grammar Cascade), Ortho Matcher, Pronominal Coreferencer (JAPE Grammar) ● Output: GATE Document, XML dump of IE Annotations ● NOTE: in the diagram, square boxes are processes, rounded ones are data.

Unicode Tokeniser ● Bases tokenisation on Unicode character classes ● Language-independent tokenisation ● Declarative token specification language, e.g.:
    "UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token; orthography=upperInitial; kind=word
● Look at the ANNIE English tokeniser and at tokenisers for other languages (in the plugins directory) for more information and examples
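The idea behind the tokeniser rule above can be sketched in plain Java. This is not GATE code: the class and method names are hypothetical, and it only checks the one pattern shown on the slide (an uppercase letter followed by zero or more lowercase letters), using the Unicode character classes that the Java standard library exposes.

```java
// Illustrative only: a stand-alone check, not part of GATE or ANNIE.
class TokeniserSketch {

    /**
     * Returns true when the word is one UPPERCASE_LETTER followed by
     * zero or more LOWERCASE_LETTER characters, judged by Unicode class.
     */
    static boolean isUpperInitial(String word) {
        if (word.isEmpty()) {
            return false;
        }
        if (Character.getType(word.charAt(0)) != Character.UPPERCASE_LETTER) {
            return false;
        }
        for (int i = 1; i < word.length(); i++) {
            if (Character.getType(word.charAt(i)) != Character.LOWERCASE_LETTER) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "\u00dcber" starts with a non-ASCII uppercase letter (U with diaeresis).
        for (String w : new String[] {"Sheffield", "GATE", "gate", "\u00dcber"}) {
            System.out.println(w + " -> " + (isUpperInitial(w) ? "upperInitial" : "other"));
        }
    }
}
```

Because the test relies on character classes rather than on the letters A to Z, the same check works for any script that distinguishes case, which is the sense in which the tokenisation is language-independent.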

Gazetteer ● Set of lists compiled into Finite State Machines ● 60k entries in 80 types, inc.: organization; artifact; location; amount_unit; manufacturer; transport_means; company_designator; currency_unit; date; government_designator; ... ● Each list has attributes MajorType and MinorType (and Language):
    city.lst: location: city: english
    currency_prefix.lst: currency_unit: pre_amount
    currency_unit.lst: currency_unit: post_amount
● Attributes are used as input to JAPE grammars ● List entries may be entities or parts of entities, or they may contain contextual information (e.g. job titles often indicate people)
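As a rough illustration of the list definitions above, the following plain Java sketch splits one definition line into its file name and attributes. The class name and the decision to accept exactly three or four colon-separated fields are assumptions made for this example; GATE's own loader is not shown here and may accept other forms.

```java
// Illustrative only: a hypothetical parser for lines such as
// "city.lst: location: city: english". Not GATE's implementation.
class GazetteerListDef {
    final String file;
    final String majorType;
    final String minorType;
    final String language; // null when the line has no language field

    GazetteerListDef(String file, String majorType, String minorType, String language) {
        this.file = file;
        this.majorType = majorType;
        this.minorType = minorType;
        this.language = language;
    }

    /** Splits a definition line on colons, ignoring spaces around them. */
    static GazetteerListDef parse(String line) {
        String[] parts = line.trim().split("\\s*:\\s*");
        if (parts.length < 3 || parts.length > 4) {
            throw new IllegalArgumentException(
                "Expected file:majorType:minorType[:language] but got: " + line);
        }
        String language = parts.length == 4 ? parts[3] : null;
        return new GazetteerListDef(parts[0], parts[1], parts[2], language);
    }

    public static void main(String[] args) {
        GazetteerListDef def = parse("city.lst: location: city: english");
        System.out.println(def.file + " major=" + def.majorType
            + " minor=" + def.minorType + " lang=" + def.language);
    }
}
```

A JAPE pattern such as {Lookup.majorType == location} then tests the attribute values that come from these definitions.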

The Named Entity Grammar ● JAPE phases run sequentially and constitute a cascade of FSTs over annotations ● hand-coded rules applied to annotations to identify NEs ● annotations from format analysis, tokeniser, POS tagger and gazetteer modules ● use of contextual information ● rule priority based on pattern length, rule status and rule ordering ● Common entities: persons, locations, organisations, dates, addresses.

Orthomatcher ● Orthographic coreference between annotations in the same document, e.g. Mr Brown, James Brown ● Matching rules are invoked between annotations of the same type, or between an existing annotation and an “Unknown” annotation ● The latter is the only case where an annotation type can be changed ● Lookup tables of aliases and exceptions (i.e. overriding of matching rules) ● Also pronominal coreference (see User Guide)

JAPE: a Jolly And Pleasant Experience ● Grammars (cascades of phases) – Phases (lists of rules) ● Rules – LHS (patterns) – RHS (actions) ● Priority – Implicit: longest match, first mention – Explicit: priority

LHS of JAPE rules ● The LHS of the rule contains patterns to be matched, in the form of annotations (and optionally their attributes). ● Annotation types to be recognised must be declared at the beginning of the phase ● Annotations may be combined using traditional operators [ | * + ?] ● There is no negative operator ● More than one pattern can be matched in a single rule ● Left and right context (not to be annotated) can be matched

Examples of LHS patterns
    ({Lookup.majorType == location}) :loc
    ({Token.string == "in"} | {Token.string == "by"}) ({Year}) :date
    ( ({Lookup.majorType == jobtitle} ):jobtitle {Surname} ):person

RHS of JAPE rules
    ({Lookup.majorType == location}) :loc
    -->
    :loc.Location = {kind = "city", rule = "Location1"}
    ( ({Lookup.majorType == jobtitle} ):jobtitle {Surname} ):person
    -->
    :jobtitle.JobTitle = {rule = "PersonJobTitle"},
    :person.Person = {kind = "Surname", rule = "PersonJobTitle"}

Complex RHS ● The plain JAPE RHS is quite limited in what you can do ● But you can use any Java you like on the RHS of the rule ● Useful for e.g. removing temporary annotations, and percolating and manipulating features from previous annotations ● It also means you can use JAPE for many other things apart from just creating annotations, e.g. counting things, manipulating the text, adding annotations to the document, etc. ● And you don’t have to be a Java expert to do it. ● Although it helps to have friends who are…

Example of using Java in a rule
    Rule: FirstName
    ({Lookup.majorType == person_first}):person
    -->
    {
      gate.AnnotationSet person = (gate.AnnotationSet)bindings.get("person");
      gate.Annotation personAnn = (gate.Annotation)person.iterator().next();
      gate.FeatureMap features = Factory.newFeatureMap();
      features.put("gender", personAnn.getFeatures().get("minorType"));
      features.put("rule", "FirstName");
      outputAS.add(person.firstNode(), person.lastNode(), "FirstPerson", features);
    }

Available Java objects ● bindings: binding variables ● doc: GATE Document ● annotations: all GATE Document annotations ● inputAS, outputAS: phase input and output annotations ● ontology See documentation for more details…..

JAPE Application modes ● Brill (fires all matches) ● First (shortest match fires) ● Once (Phase exits after first match) ● All (as for Brill, but matching continues from offset following the current one, not from the end of the last match) ● Appelt (priority ordering: longest match fires, then explicit rule priority, then first defined rule fires) Note that prioritisation only operates within a single phase, not globally

Application Modes (diagram): matches of the pattern {A}+ over a sequence of A annotations under the Appelt, Once, Brill, First and All modes
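The difference between the modes can be made concrete with a toy simulation in plain Java. This is a simplified model under stated assumptions, not GATE's matcher: the text is a string, the only rule is {A}+ (one or more 'A' characters), and the class, enum and method names are hypothetical. Once is left out because the slide does not say which match fires before the phase exits.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models how a single rule {A}+ would fire over a
// string of 'A' characters under four of the application modes.
class ModesSketch {
    enum Mode { APPELT, FIRST, BRILL, ALL }

    /** Returns the matched substrings in the order the rule fires. */
    static List<String> apply(String text, Mode mode) {
        List<String> fired = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Length of the longest run of 'A' starting at pos.
            int run = 0;
            while (pos + run < text.length() && text.charAt(pos + run) == 'A') {
                run++;
            }
            if (run == 0) { // nothing matches here, move on
                pos++;
                continue;
            }
            switch (mode) {
                case APPELT: // longest match only, resume after it
                    fired.add(text.substring(pos, pos + run));
                    pos += run;
                    break;
                case FIRST: // shortest match only, resume after it
                    fired.add(text.substring(pos, pos + 1));
                    pos += 1;
                    break;
                case BRILL: // every match from pos, resume after the longest
                case ALL:   // every match from pos, resume at the next offset
                    for (int len = 1; len <= run; len++) {
                        fired.add(text.substring(pos, pos + len));
                    }
                    pos += (mode == Mode.BRILL) ? run : 1;
                    break;
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        for (Mode m : Mode.values()) {
            System.out.println(m + ": " + apply("AAA", m));
        }
    }
}
```

In this model the input "AAA" gives one match under Appelt (AAA), three single-character matches under First, three nested matches under Brill (A, AA, AAA), and six under All, since All restarts from every offset.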

Example: “China Sea”
    Rule: Location1
    Priority: 25
    (
      ({Lookup.majorType == loc_key, Lookup.minorType == pre})?
      {Lookup.minorType == country}
      ({Lookup.majorType == loc_key, Lookup.minorType == post})?
    ) :locName
    -->
    :locName.Location = {kind = "location", rule = "Location1"}
    Rule: Location2
    Priority: 20
    ({Lookup.minorType == location}) :location
    -->
    :location.Name = {kind = "location", rule=GazLocation}

JAPE Hints and Tricks ● JAPE is quite limited in some respects as to what can be done – There is no negative operator – It can be slow if it is badly written, e.g. ({Token})* – Context is consumed, which can make rule-writing awkward – Priority can be difficult to set correctly ● But fear not, there is generally a sneaky way around it…..

How to prevent a pattern from matching
    Rule: disablePattern
    Priority: 1000
    ( ) --> {}
● Instead of a negative operator, we can write a high-priority rule whose LHS matches the unwanted pattern and whose RHS does nothing when fired. ● This rule is preferred to a lower-priority rule which performs the intended action, so the latter fires only when the former pattern does not apply.
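Why the do-nothing rule wins can be sketched with the Appelt ordering described earlier in these slides (longest match, then explicit priority, then first defined rule). The Java below is a stand-alone model with hypothetical names, not GATE code; the rule name RealRule and its priority of 50 are invented for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Illustrative only: picks the rule that fires among candidates that
// match from the same start position, using the Appelt ordering.
class AppeltChoice {
    static final class Candidate {
        final String rule;
        final int length;   // how much text the match covers
        final int priority; // explicit Priority value of the rule
        final int order;    // position of the rule in the grammar file

        Candidate(String rule, int length, int priority, int order) {
            this.rule = rule;
            this.length = length;
            this.priority = priority;
            this.order = order;
        }
    }

    /** Longest match first, then highest priority, then earliest rule. */
    static Candidate choose(List<Candidate> candidates) {
        return Collections.min(candidates, Comparator
            .comparingInt((Candidate c) -> -c.length)
            .thenComparingInt(c -> -c.priority)
            .thenComparingInt(c -> c.order));
    }

    public static void main(String[] args) {
        Candidate real = new Candidate("RealRule", 2, 50, 0);
        Candidate block = new Candidate("disablePattern", 2, 1000, 1);
        // Same length, so priority decides and the empty rule fires.
        System.out.println(choose(Arrays.asList(real, block)).rule);
    }
}
```

Under this ordering the trick depends on match length: priority is consulted only among matches of equal length, so the blocking rule has to cover at least as much text as the rule it is meant to suppress.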

How to play with input annotations
    Input: Person Organisation VerbWork Split …
    Rule: RelationWorkIn
    ({Person} {VerbWork} {Organisation}) --> {… /* create annotation of type “Relation” */ …}
● Use existing annotations to find relations ● We leave Tokens out of the Input list to allow more flexibility, i.e. there could be additional words between the annotations specified ● Including Split ensures we don’t cross sentence boundaries

How to deal with overlapping annotations ● Because matched annotations are consumed, when two annotations overlap (e.g. in gazetteer lists), the second one will never be matched. ● E.g. for the string “hALCAM” with Lookups hAL, ALCAM, and CAM, ALCAM will never be matched ● The solution is to delete the annotations once matched, and then rerun the same grammar phase over the text ● The process may need to be repeated several times (the number is determined by trial and error)

More examples ● In the GATE User Guide under the section “Useful tricks with JAPE” ● Look in the ANNIE grammars and in the foreign language grammars – there are many examples of little tricks ● Check the GATE mailing list archives

Custom Processing Resource for your grammars
1. Java developer extends GATE's default JAPE Transducer by creating a Java class:
    package com.yourcompany;
    import gate.creole.Transducer;
    public class CustomTransducer extends Transducer {}
2. JAPE developer adds a definition to the plugin’s creole.xml, giving the resource name (My custom JAPE Transducer), the class (com.yourcompany.CustomTransducer) and three parameters of types java.lang.String, java.net.URL and java.lang.String
3. GATE user opens the custom resource in the GATE GUI: right-click on “Processing Resources”, then in the pop-up menu select “New >” --> “My custom JAPE Transducer”

JAPE debugger ● Speeds up the development of JAPE grammars ● Integrated in GATE GUI ● Friendly for non-experts Allows you to: ● Inspect the pattern matching ● Find overridden rules ● Detect complex inter-rule influence ● And many other things

Inspection of pattern matching

Overridden rules

Inter-rule influence (finding the problem)

Inter-rule influence (what is that?)

Inter-rule influence (problem synopsis)
Text processed: … of the J. L. Kellog Graduate School of Management and the Indiana University School of Business …
Conflicting rule: Rule: NotPersonFull Priority: 80 // Det + Surname // This rule was commented course //J.L. Kellog processed without J. // ( {Token.category == DT} | {Token.category == PRP} | {Token.category == RB} ) ( (PREFIX)* (UPPER) (PERSONENDING)? ):foo
Shadowed rule: Rule: PersonFullExt Priority: 100 // F.W. Jones Fred Jones // Andrew "Flip" Filipowski // Andrew J. "Flip" Filipowski //({Token.category == DT})? ( ((FIRSTNAME | FIRSTNAMEAMBIG))+ (INITIALS)? ((FIRSTNAME | FIRSTNAMEAMBIG) )* (PREFIX)* ((UPPER)):surname (PERSONENDING)? ):person -->

Coming soon: JAPE4
What JAPE4 IS: ● a new version of the internal language in GATE release 4 ● a language based on the original JAPE ● incorporates best practices from JAPE, Jape+ and Japec ● 3-5 times faster than JAPE
What JAPE4 IS NOT: ● an improved version of the original Jape, Jape+ or Japec, but rather a new language ● a language backward compatible with JAPE
In most cases it seems possible to modify original Jape, Jape+ or Japec grammars easily so that they comply with the JAPE4 specification.