1/(19) GATE Evaluation Tools GATE Training Course October 2006 Kalina Bontcheva
2/(19) System development cycle
1. Collect a corpus of texts
2. Manually annotate a gold standard
3. Develop the system
4. Evaluate performance
5. Go back to step 3, until the desired performance is reached
3/(19) Corpora and System Development
Gold standard data is created by manual annotation
Corpora are typically divided into a training and a testing portion
Rules and/or learning algorithms are developed or trained on the training portion
They are then tuned on the testing portion in order to optimise
– Rule priorities, rule effectiveness, etc.
– Parameters of the learning algorithm and the features used (typical routine: 10-fold cross-validation)
Evaluation set – the best system configuration is run on this data to obtain the final system performance
No further tuning once the evaluation set has been used!
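The train/tune/evaluate routine above can be sketched as follows. This is a minimal, generic illustration of a 10-fold cross-validation split over a list of documents, not GATE-specific code:

```python
def k_fold_splits(documents, k=10):
    """Yield (training, testing) partitions for k-fold cross-validation."""
    folds = [documents[i::k] for i in range(k)]  # round-robin fold assignment
    for i in range(k):
        testing = folds[i]
        training = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield training, testing

docs = [f"doc{n:02d}" for n in range(20)]
for training, testing in k_fold_splits(docs, k=10):
    # develop/tune on `training`, measure on `testing`
    assert not set(training) & set(testing)  # the two partitions never overlap
```

Each document appears in the testing portion exactly once across the k iterations, so every annotated document contributes to both development and tuning.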
4/(19) Some NE Annotated Corpora
MUC-6 and MUC-7 corpora - English
CoNLL shared task corpora:
http://cnts.uia.ac.be/conll2003/ner/ - NEs in English and German
http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish and Dutch
TIDES surprise language exercise (NEs in Cebuano and Hindi)
ACE - English - http://www.ldc.upenn.edu/Projects/ACE/
6/(19) The MUC-7 corpus
100 documents in SGML
News domain
Named Entities:
1880 Organizations (46%)
1324 Locations (32%)
887 Persons (22%)
Inter-annotator agreement very high (~97%)
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf
7/(19) The MUC-7 Corpus (2) CAPE CANAVERAL, Fla. &MD; Working in chilly temperatures Wednesday night, NASA ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission. Endeavour, with an international crew of six, was set to blast off from the Kennedy Space Center on Thursday at 4:18 a.m. EST, the start of a 49-minute launching period. The nine day shuttle flight was to be the 12th launched in darkness.
8/(19) ACE – Towards Semantic Tagging of Entities
MUC NE tags segments of text whenever that text represents the name of an entity
In ACE (Automated Content Extraction), these names are viewed as mentions of the underlying entities
The main task is to detect (or infer) the entities themselves from their mentions in the text
Rolls together the NE and coreference (CO) tasks
Domain- and genre-independent approaches
The ACE corpus contains newswire, broadcast news (ASR output and cleaned), and newspaper reports (OCR output and cleaned)
9/(19) ACE Entities
Dealing with:
– Proper names – e.g., England, Mr. Smith, IBM
– Pronouns – e.g., he, she, it
– Nominal mentions – e.g., the company, the spokesman
Identify which mentions in the text refer to which entities, e.g.:
– Tony Blair, Mr. Blair, he, the prime minister, he
– Gordon Brown, he, Mr. Brown, the chancellor
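The grouping of mentions into entities can be represented as a simple mapping from each entity to its list of mentions. A sketch using the example mentions from this slide (the data structure is illustrative, not ACE's actual annotation format):

```python
# Each entity groups its mentions; a mention records its surface form and
# its type (proper name, pronoun, or nominal mention).
entities = {
    "Tony Blair": [
        ("Tony Blair", "name"), ("Mr. Blair", "name"),
        ("he", "pronoun"), ("the prime minister", "nominal"),
        ("he", "pronoun"),
    ],
    "Gordon Brown": [
        ("Gordon Brown", "name"), ("he", "pronoun"),
        ("Mr. Brown", "name"), ("the chancellor", "nominal"),
    ],
}

def mentions_of(entity):
    """Return the surface forms that refer to the given entity."""
    return [text for text, _mention_type in entities[entity]]
```

Note that the same surface form ("he") can belong to different entities in different places; resolving which entity each mention refers to is the coreference part of the task.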
11/(19) Annotate Gold Standard – Manual Annotation in GATE GUI
12/(19) Ontology-Based Annotation (coming in GATE 4.0)
13/(19) Two GATE evaluation tools AnnotationDiff Corpus Benchmark Tool
14/(19) AnnotationDiff
Graphical comparison of two sets of annotations
Visual diff representation, similar to tkdiff
Compares one document at a time, one annotation type at a time
Gives scores for precision, recall, F-measure, etc.
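The scores reported here can be computed from counts of correct, spurious, and missing annotations. A minimal sketch in plain Python (the standard definitions, not GATE's own implementation):

```python
def precision_recall_f(correct, spurious, missing):
    """Compute precision, recall and F-measure from annotation match counts.

    correct  - response annotations that match a key (gold standard) annotation
    spurious - response annotations with no matching key (false positives)
    missing  - key annotations with no matching response (false negatives)
    """
    precision = correct / (correct + spurious) if correct + spurious else 0.0
    recall = correct / (correct + missing) if correct + missing else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# e.g. 80 correct, 20 spurious, 20 missing responses: each score is ~0.8
p, r, f = precision_recall_f(80, 20, 20)
```

This is the balanced F-measure (precision and recall weighted equally); AnnotationDiff also distinguishes partially correct matches, which this sketch omits.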
16/(19) Corpus Benchmark Tool
Compares annotations at the corpus level
Compares all annotation types at the same time, i.e. gives an overall score as well as a score for each annotation type
Enables regression testing, i.e. comparison of two different versions against the gold standard
Visual display; can be exported to HTML
Granularity of results: the user can decide how much information to display
Results in terms of precision, recall, and F-measure
17/(19) Corpus structure
The corpus benchmark tool requires a particular directory structure
Each corpus must have a clean and a marked directory
clean holds the unannotated versions of the documents, while marked holds the manually annotated (gold standard) ones
There may also be a processed subdirectory – this is a datastore (unlike the other two)
Corresponding files in each subdirectory must have the same name
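The filename-correspondence requirement can be checked with a short script before running the tool. This is an illustrative sketch, not part of GATE; it only compares the clean and marked directories (the processed datastore has its own internal format):

```python
import os

def check_corpus_layout(corpus_dir):
    """Report filename mismatches between the clean/ and marked/ directories."""
    clean = set(os.listdir(os.path.join(corpus_dir, "clean")))
    marked = set(os.listdir(os.path.join(corpus_dir, "marked")))
    missing_marked = clean - marked   # documents lacking a gold standard file
    missing_clean = marked - clean    # gold files lacking a clean original
    return missing_marked, missing_clean
```

Both returned sets should be empty for a well-formed corpus.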
18/(19) How it works
Uses the clean, marked, and processed directories
Corpus_tool.properties – must be in the directory from which GATE is executed
Specifies configuration information about:
– Which annotation types are to be evaluated
– The threshold below which to print out debug information
– The input set name and the key set name
Modes:
– Default – regression testing
– Human-marked against already stored, processed results
– Human-marked against current processing results
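The configuration file might look like the following sketch. The key names here are hypothetical placeholders inferred from the bullet points above, not GATE's documented property names – consult the GATE user guide for the actual format:

```
# Hypothetical Corpus_tool.properties sketch.
# Key names are illustrative placeholders, not GATE's documented ones.
annotTypes=Person;Organization;Location
threshold=0.5
annotSetName=
keySetName=Key
```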