University of Sheffield, NLP Introduction to Text Mining Module 4: Development Lifecycle (Part 1)

University of Sheffield, NLP Aims of this module
● Turning resources into applications: SLAM
● RichNews: multimedia application and demo
● Musing: business intelligence application
● KIM CORE Timelines application and demo
● GATE MIMIR: semantic search and indexing in use
● The GATE process

University of Sheffield, NLP Semantic Annotation for the Life Sciences

University of Sheffield, NLP Aim of the application
● Semantic annotation in the life sciences is much more than generic annotation of genes, proteins and diseases in text to support search
● There are many highly use-case-specific annotation requirements, and they demand a fresh look at how we drive annotation: our processes
● Many use cases are ad hoc and specialised
● Clinical research: new requirements every day
● How can we support this? What tools do we need?

University of Sheffield, NLP Background
● The user
  ● SLAM: South London and Maudsley NHS Trust
  ● BRC: Biomedical Research Centre
  ● CRIS: Case Register Information System
● February and March 2010
  ● Proof of concept around MMSE
  ● Requirements analysis, installation, adaptation
● Since 2010
  ● In production
  ● Cloud-based system
  ● Further use cases

University of Sheffield, NLP Clinical records
● Generic entities such as anatomical location, diagnosis and drug are sometimes of interest
● But many of the enquiries we have seen concern large numbers of very specific, ad hoc entities or events
● This example, with a UK National Biomedical Research Centre, concerns cognitive ability as shown by the MMSE score
● It illustrates a typical (but not the only) process

University of Sheffield, NLP Types of IE systems
● Deep or shallow analysis
● Knowledge engineering or machine learning approaches
  ● Supervised
  ● Unsupervised
  ● Active learning
● GATE is agnostic

University of Sheffield, NLP Supervised learning architecture

University of Sheffield, NLP Unable to assess MMSE but last one on 1/1/8 was 21/30

University of Sheffield, NLP Today she scored 5/30 on the MMSE I reviewed Mrs. ZZZZZ on 6th March

University of Sheffield, NLP A shallow approach
● Pre-processing, including morphological analysis
  ● “Patient was seen on” vs “I saw this patient on”
● POS tagging
  ● “patient was [VERB] on [DATE]”
● Dictionary lookup
  ● “MMSE”, “Mini mental”, “Folstein”, “AMTS”
● Coreference
  ● “We did an MMSE. It was 23/30”
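
The value of morphological analysis can be sketched with a toy lemma table (the table and function below are illustrative stand-ins, not GATE's morphological analyser): both phrasings reduce to the same root sequence, so a single rule on root=see covers them.

```python
# Sketch: a tiny lemma table maps inflected forms to roots, so one rule
# on root=see covers both "was seen on" and "saw ... on".
# The table is illustrative; GATE's morphological analyser does this properly.
lemmas = {"seen": "see", "saw": "see", "was": "be", "is": "be"}

def roots(text):
    # Lowercase each word and replace it with its root where one is known
    return [lemmas.get(w.lower(), w.lower()) for w in text.split()]

print(roots("Patient was seen on"))    # ['patient', 'be', 'see', 'on']
print(roots("I saw this patient on"))  # ['i', 'see', 'this', 'patient', 'on']
```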

University of Sheffield, NLP Annotations His MMSE was 23/30 on 15 January 2008.

University of Sheffield, NLP Annotations
His MMSE was 23/30 on 15 January 2008.

Id  Type      Start  End  Features
1   sentence  0      39
2   token     0      3    pos=PP
3   token     4      8    pos=NN
4   token     9      12   pos=VB, root=be
5   token     13     15   pos=CD, type=num
6   token     15     16   pos=SM, type=slash
7   token     16     18   pos=CD, type=num
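
The stand-off annotation model shown in the Annotations slides can be sketched in Python. The `Annotation` class and its field names are illustrative, not the actual GATE API; the point is that each annotation records a type, a character span into the unchanged text, and a feature map.

```python
from dataclasses import dataclass, field

# Stand-off annotation sketch: each annotation points into the unchanged
# source text via character offsets, and carries a feature map.
@dataclass
class Annotation:
    id: int
    type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)

text = "His MMSE was 23/30 on 15 January 2008."

annotations = [
    Annotation(1, "sentence", 0, 39),
    Annotation(2, "token", 0, 3, {"pos": "PP"}),
    Annotation(3, "token", 4, 8, {"pos": "NN"}),
    Annotation(4, "token", 9, 12, {"pos": "VB", "root": "be"}),
    Annotation(5, "token", 13, 15, {"pos": "CD", "type": "num"}),
    Annotation(6, "token", 15, 16, {"pos": "SM", "type": "slash"}),
    Annotation(7, "token", 16, 18, {"pos": "CD", "type": "num"}),
]

# The offsets recover the covered text without modifying the document
for a in annotations:
    print(a.id, a.type, repr(text[a.start:a.end]), a.features)
```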

University of Sheffield, NLP Dictionary lookup
His MMSE was 23/30 on 15 January
● “MMSE” → MMSE
● “January” → Month

University of Sheffield, NLP Limitations of dictionary lookup
● Dictionary lookup is designed for finding simple, regular terms and features
● False positives
  ● “He may get better”
  ● “Mother is a smoker”
  ● “He often burns the toast, setting off the smoke alarm”
● Cannot deal with complex patterns
  ● For example, recognising addresses using just a dictionary would be impossible
● Cannot deal with ambiguity
  ● I for iodine, or I for me?
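
The false-positive problem becomes concrete with a toy gazetteer. The entries and the matching below are deliberately naive sketches, not GATE's gazetteer: lookup alone cannot see who is being described or in what context a term occurs.

```python
import re

# A deliberately naive, case-insensitive gazetteer for smoking status.
# Entries are illustrative; a real gazetteer lists many terms per type.
gazetteer = {"smoker": "Smoking", "smoke": "Smoking", "smoking": "Smoking"}

def lookup(text):
    # Return every word whose lowercase form is a gazetteer entry
    return [(m.group(), gazetteer[m.group().lower()])
            for m in re.finditer(r"[A-Za-z]+", text)
            if m.group().lower() in gazetteer]

# Fires as intended:
print(lookup("Patient is a smoker"))
# False positives -- the hit is about the mother, or about toast:
print(lookup("Mother is a smoker"))
print(lookup("He often burns the toast, setting off the smoke alarm"))
```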

University of Sheffield, NLP Pattern matching
● The early components in a GATE pipeline produce simple annotations (Token, Sentence, dictionary lookups)
● These annotations have features (Token kind, part of speech, major type...)
● Patterns in these annotations and features can suggest more complex information
● We use JAPE, the pattern-matching language in GATE, to find these patterns

University of Sheffield, NLP Patterns
His MMSE was 23/30 on 15 January
● {number}{Month}{number} → Date (“15 January …”)
● {number}{slash}{number} → Score (“23/30”)
● {MMSE}{BE}{Score}{?}{Date} → MMSE with score and date

University of Sheffield, NLP Patterns are general
● MMSE was 23/30 on 15 January 2009
● Mini mental was 25/30 on 12/08/07
● MMS was 25/30 last week
● MMSE is 25/30 today
● With adaptation:
  ● MMSE 25 out of 30
  ● Long-range dependencies on dates
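
In GATE these patterns are JAPE rules over annotations; as a rough stand-in, the same idea can be sketched with regular expressions. The regexes and names below (`MMSE_RE`, `SCORE_RE`, `DATE_RE`) are our own illustrative approximations, not the production rules.

```python
import re

# Regexes standing in for the annotation types the JAPE rules match over
MMSE_RE  = r"(?:MMSE|Mini mental|MMS)"
SCORE_RE = r"(?P<score>\d{1,2})\s*(?:/|out of)\s*30"
DATE_RE  = (r"(?P<date>\d{1,2}(?:/\d{1,2}/\d{2,4}| [A-Z][a-z]+(?: \d{4})?)"
            r"|today|last week)")

# {MMSE}{BE}{Score}{?}{Date} as one expression, with the date optional
pattern = re.compile(
    MMSE_RE + r"\s*(?:was|is)?\s*" + SCORE_RE + r"\s*(?:on\s+)?" + DATE_RE + "?"
)

for s in ["MMSE was 23/30 on 15 January 2009",
          "Mini mental was 25/30 on 12/08/07",
          "MMS was 25/30 last week",
          "MMSE is 25/30 today",
          "MMSE 25 out of 30"]:
    m = pattern.search(s)
    print(s, "->", m.group("score"), m.group("date"))
```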

University of Sheffield, NLP MMSE pipeline
Import CSV into GATE → Tokenise → Sentence split → POS tag → Dictionary lookup → Date patterns → Score patterns → MMSE patterns → Export back to CSV
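
The pipeline can be sketched as a chain of functions over a shared document. The stage names follow the slide, but the bodies below are toy stand-ins (in GATE the work is done by processing resources), and only a few stages are implemented here.

```python
import csv, io, re

# Each stage takes a document dict and returns it extended with new data.
def tokenise(doc):
    doc["tokens"] = re.findall(r"\w+|/|\.", doc["text"])
    return doc

def sentence_split(doc):
    doc["sentences"] = [s for s in re.split(r"(?<=\.)\s+", doc["text"]) if s]
    return doc

def score_patterns(doc):
    # Find an n/30 score, as the Score patterns stage would
    m = re.search(r"\b(\d{1,2})/30\b", doc["text"])
    doc["mmse_score"] = m.group(1) if m else ""
    return doc

# POS tag, dictionary lookup, date and MMSE patterns elided in this sketch
STAGES = [tokenise, sentence_split, score_patterns]

def run_pipeline(csv_in):
    """Import CSV -> run each stage over each row's text -> export CSV."""
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(csv_in)):
        doc = {"text": row[0]}
        for stage in STAGES:
            doc = stage(doc)
        writer.writerow([row[0], doc["mmse_score"]])
    return out.getvalue()

result = run_pipeline("His MMSE was 23/30 on 15 January 2008.\n")
print(result)
```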

University of Sheffield, NLP Writing patterns
● Requires training
● Depending on time and skills, the domain expert may take on some rule writing
● Requirements are not always clear, and users do not always understand what the technology can do
● This needs a supporting process:
  ● Domain expert manually annotates examples
  ● Language engineer writes rules
  ● Measure accuracy of rules
  ● Repeat

University of Sheffield, NLP The process as agile development
● IE system development is often linear:
  ● Guidelines → annotate → implement
● This is similar to the “waterfall” method of software development:
  ● Gather requirements → design → implement
● This has long been known to be problematic
● In contrast, our approach is agile

University of Sheffield, NLP The process as agile development
● Recognise that requirements change
  ● Embrace that change; use it to drive development
● Users and software engineers work alongside each other to understand requirements
● Early and iterative delivery
● Feedback to collect further requirements
● Reduces cost of annotation

University of Sheffield, NLP A process
Gather examples → Write rules → Run over unseen documents → Human correction → Measure performance → Good enough? If no, examine errors and repeat

University of Sheffield, NLP Annotation Diff [screenshot]


University of Sheffield, NLP A process
Gather examples → Improve rules → Run over unseen documents → Human correction → Measure performance → Good enough? No: examine errors and repeat; Yes: done
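
The “measure performance” step in this loop can be sketched as strict span matching between gold and system annotations, yielding the precision/recall/F1 that GATE's Annotation Diff and QA tools report. The spans below are illustrative made-up data.

```python
# Strict matching: an annotation counts as correct only if its type,
# start and end all agree with a gold annotation.
def prf(gold, system):
    gold, system = set(gold), set(system)
    correct = len(gold & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative (start, end, type) spans
gold   = {(13, 18, "Score"), (22, 32, "Date")}
system = {(13, 18, "Score"), (0, 3, "Score")}   # one correct, one spurious
p, r, f1 = prf(gold, system)
print(p, r, f1)   # 0.5 0.5 0.5
```

When performance plateaus below the “good enough” threshold, it is these counts of missing and spurious annotations that the error-examination step drills into.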

University of Sheffield, NLP Supporting the process
● We need to train and implement the process
● We need tools to support this process:
  ● Quality assurance tools
  ● Workflow: GATE Teamware
  ● Annotation pattern search: Mimir
  ● Coupling search and annotation: Khresmoi
    ● Develop a pattern and run it over text
    ● Human correction
    ● Feedback

University of Sheffield, NLP GATE Teamware: defining workflows

University of Sheffield, NLP GATE Teamware: managing projects

University of Sheffield, NLP GATE Teamware: monitoring projects

University of Sheffield, NLP When is it good enough? [chart: performance (toward 100%) against effort]

University of Sheffield, NLP When is it good enough?
● Like diseases:
  ● a small number of very common ones
  ● lots of rare ones
● For a straightforward use case:
  ● 3 or 4 iterations
  ● plateau at around 90%

University of Sheffield, NLP Implications
● Ad hoc annotation and search is as important an approach as generic annotation
● We need tools and processes to support this style of annotation
● An agile annotation process involves users, helps us to elicit their requirements, and reduces the cost of annotation