Experiences with UIMA in NLP teaching and research
Manuela Kunze, Dietmar Rösner
University of Magdeburg, Knowledge Based Systems and Document Processing



Overview
- What is UIMA?
- First Experiments
- NLP Teaching
- Conclusion

UIMA: Unstructured Information Management Architecture
- a software architecture for developing and deploying unstructured information management (UIM) applications
- UIM application: a software system that analyses large volumes of unstructured information to discover, organize, and deliver relevant knowledge to the end user
- the software architecture specifies component interfaces, data representations, …

UIMA: Unstructured Information Management Architecture
- … interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata.
- … takes a CAS, analyzes its contents, and produces an enriched CAS. Analysis Engines can be recursively composed of other Analysis Engines (called an Aggregate Analysis Engine). Aggregates may also contain CAS Consumers.
- … may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from tags in the original HTML) into the CAS.
- … consume the enriched CAS that was produced by the sequence of Analysis Engines before it, and produce an application-specific data structure, such as a search engine index or database.
CAS: Common Analysis Structure
CPE: Collection Processing Engine
[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
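The division of labour described on this slide (a reader that produces CASes, analysis engines that enrich them, aggregates that nest engines, and a consumer that builds an application-specific result) can be illustrated with a minimal plain-Java sketch. All class and method names below (`PipelineSketch`, `Cas`, `Engine`, `Aggregate`, `read`, `consume`) are hypothetical stand-ins chosen for illustration; they are not the UIMA API, and the two toy engines are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for the UIMA roles; this is NOT the UIMA API.
class PipelineSketch {

    // Plays the role of a CAS: the document text plus the annotations added so far.
    static class Cas {
        final String text;
        final List<String> annotations = new ArrayList<>();

        Cas(String text) {
            this.text = text;
        }
    }

    // Plays the role of an Analysis Engine: takes a CAS and enriches it in place.
    interface Engine {
        void process(Cas cas);
    }

    // Plays the role of an Aggregate Analysis Engine: runs its delegates in order.
    // Because an Aggregate is itself an Engine, aggregates can be nested.
    static class Aggregate implements Engine {
        private final List<Engine> delegates;

        Aggregate(Engine... delegates) {
            this.delegates = Arrays.asList(delegates);
        }

        @Override
        public void process(Cas cas) {
            for (Engine engine : delegates) {
                engine.process(cas);
            }
        }
    }

    // Plays the role of a Collection Reader: wraps each document in a CAS.
    static List<Cas> read(List<String> documents) {
        List<Cas> cases = new ArrayList<>();
        for (String document : documents) {
            cases.add(new Cas(document));
        }
        return cases;
    }

    // Plays the role of a CAS Consumer: turns enriched CASes into an
    // application-specific structure (here simply one report line per document).
    static List<String> consume(List<Cas> cases) {
        List<String> lines = new ArrayList<>();
        for (Cas cas : cases) {
            lines.add(cas.text + " -> " + cas.annotations);
        }
        return lines;
    }

    // Wires reader, engines and consumer together.
    static List<String> run(List<String> documents) {
        Engine lengthEngine = cas -> cas.annotations.add("length=" + cas.text.length());
        Engine museumEngine = cas -> {
            if (cas.text.contains("Museum")) {
                cas.annotations.add("museum");
            }
        };
        // A nested aggregate, to show the recursive composition.
        Engine pipeline = new Aggregate(lengthEngine, new Aggregate(museumEngine));

        List<Cas> cases = read(documents);
        for (Cas cas : cases) {
            pipeline.process(cas);
        }
        return consume(cases);
    }

    public static void main(String[] args) {
        for (String line : run(Arrays.asList("Egyptian Museum", "Hotel"))) {
            System.out.println(line);
        }
    }
}
```

In the real framework the wiring is declared in XML descriptors rather than in code; the sketch only shows the data flow.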

UIMA: Unstructured Information Management Architecture
Analysis Engine (AE):
- a component that analyzes artifacts (e.g. documents) and infers information about them
- consists of two parts: Java classes (typically packaged as one or more JAR files) and AE descriptors (one or more XML files)
- the descriptors contain the configuration settings for the Analysis Engine as well as a description of the AE's input and output requirements

UIMA: Unstructured Information Management Architecture
[Diagram: how an analysis engine is put together from Java and XML parts]
- type system: define annotation type (name, features (begin, end, …))
- analysis engine: describe analysis engine (annotator class, input parameter, output of annotations, external resources)
- annotator: define an annotator; uses the annotation interface and processing resources
- external resources: interface, resources linked to a type system
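To make the idea of an annotation type with `begin` and `end` features concrete, here is a small plain-Java sketch of a regular-expression annotator. It does not use the UIMA classes: `RegexAnnotatorSketch`, `Annotation`, `annotate` and the `TIME` pattern are all hypothetical names and assumptions made for illustration. In UIMA itself the annotation type would be declared in an XML type system descriptor rather than written by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a hand-written annotation type and a regex-based annotator.
class RegexAnnotatorSketch {

    // An annotation is a typed span over the document text.
    static final class Annotation {
        final String type;
        final int begin; // offset of the first covered character
        final int end;   // offset just past the last covered character

        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }

        String coveredText(String document) {
            return document.substring(begin, end);
        }
    }

    // Assumed pattern for clock times such as "9am" or "5:30pm".
    static final Pattern TIME = Pattern.compile("\\b\\d{1,2}(?::\\d{2})?(?:am|pm)\\b");

    // Creates one annotation of the given type for every match of the pattern.
    static List<Annotation> annotate(String document, String type, Pattern pattern) {
        List<Annotation> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(document);
        while (matcher.find()) {
            found.add(new Annotation(type, matcher.start(), matcher.end()));
        }
        return found;
    }

    public static void main(String[] args) {
        String document = "The Egyptian Museum is open the hours: 9am-5pm daily";
        for (Annotation a : annotate(document, "Time", TIME)) {
            System.out.println(a.type + " [" + a.begin + ", " + a.end + ") " + a.coveredText(document));
        }
    }
}
```

Storing only offsets keeps the document text untouched, so several annotators can add their own spans to the same text independently.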

UIMA: Unstructured Information Management Architecture
Aggregate Analysis Engine:
- combines different analysis engines within one Analysis Engine
[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]

Overview
- Introduction
- First Experiments
- NLP Teaching
- Conclusion

First Experiments: UIMA vs. GATE
baseline:
- 2 persons, 2 systems, 1 corpus and 1 extraction task
- skills/experiences of the persons: [table with columns UIMA, GATE, Eclipse/Java and rows Person 1, Person 2; cell entries not recoverable]

Task of the Experiment
process a corpus of websites
- to detect and extract information relevant for tourists: opening times of museums, prices of hotels, …
corpus:
- 30 tourism web sites of Egypt
- additional 20 web sites of Washington, New York, London
output:
- Prolog facts for a reasoner
- Questions: Which museum is now open? …

Evaluation Topics/Points
- ease of getting acquainted with the system?
  - quality of documentation: completeness, clarity, up-to-date, …?
  - tutorials, use cases, …?
- processing and linguistic resources?
  - lexica, gazetteer lists, tools
- tools for resource maintenance and extension?
  - quality: self-explanatory, robust, comfortable
- speed of processing?
- single document vs. large corpora?
- limitations, suggestions for improvement?
- support for im-/export of a variety of document formats?

Excerpts from the Corpus
The Egyptian Museum is open the hours: 9am-5pm daily
The Military Museum is open the hours: Summer: 8am- 5:30pm; winter: 8am-4:30pm
Palace Museum is open the hours: 8am-5:30pm (summer) 8am-4:30pm (winter)
10am-2pm, 6pm-9pm Sat-Wed; 6pm-9pm Fri
…

UIMA Application
several annotators (like a pipeline):
- museum pattern
- time pattern
- interval of times
- restrictions
- museum information
- ...
example text: *Fraunces Tavern Museum* 54 Pearl St Tuesday-Friday, 12pm-5pm; …
how the annotators work:
- regular expressions
- window covering two time intervals and a restriction
- window covering a museum and opening hours
Prolog facts:
museumopen('Fraunces Tavern Museum ', ' T12:00:00',' T17:00:00').
museumopen('Fraunces Tavern Museum ', ' T12:00:00',' T17:00:00').
museumopen('Fraunces Tavern Museum ', ' T12:00:00',' T17:00:00').
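The step from a matched museum name and time interval to a Prolog fact might look roughly like the following plain-Java sketch. It is not the authors' implementation: the class `OpeningHoursSketch`, both regular expressions and the helper names are assumptions, the interval is assumed to be written with a hyphen, and the emitted fact carries only the clock times because the date part of the facts on the slide did not survive in the transcript.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: regex extraction of a museum name and one opening interval.
class OpeningHoursSketch {

    // Assumption: museum names are marked up as *Name* in the page text.
    private static final Pattern MUSEUM = Pattern.compile("\\*([^*]+)\\*");

    // Assumption: intervals look like "12pm-5pm" or "8am-5:30pm".
    private static final Pattern INTERVAL = Pattern.compile(
            "(\\d{1,2})(?::(\\d{2}))?(am|pm)\\s*-\\s*(\\d{1,2})(?::(\\d{2}))?(am|pm)");

    // Converts a 12-hour clock time to HH:mm:ss; minute may be null.
    static String to24h(String hour, String minute, String meridiem) {
        int h = Integer.parseInt(hour) % 12; // 12am -> 0, 12pm -> 0 before the pm shift
        if (meridiem.equals("pm")) {
            h += 12;
        }
        int m = (minute == null) ? 0 : Integer.parseInt(minute);
        return String.format(Locale.ROOT, "%02d:%02d:00", h, m);
    }

    // Returns a museumopen/3 fact, or null if no museum or interval is found.
    static String toPrologFact(String line) {
        Matcher museum = MUSEUM.matcher(line);
        Matcher interval = INTERVAL.matcher(line);
        if (!museum.find() || !interval.find()) {
            return null;
        }
        // Double any single quote so the name stays a valid quoted Prolog atom.
        String name = museum.group(1).trim().replace("'", "''");
        String from = to24h(interval.group(1), interval.group(2), interval.group(3));
        String to = to24h(interval.group(4), interval.group(5), interval.group(6));
        return "museumopen('" + name + "', '" + from + "', '" + to + "').";
    }

    public static void main(String[] args) {
        System.out.println(toPrologFact(
                "*Fraunces Tavern Museum* 54 Pearl St Tuesday-Friday, 12pm-5pm;"));
    }
}
```

A fuller version would also carry the restriction (here "Tuesday-Friday") into the fact, which is what the "window" annotators on the slide appear to be for.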

UIMA: Results
information annotated in the documents:
- names of museums, hotels
- times, time intervals
- time restrictions
- prices, intervals of prices (hotel prices)
- keywords for museum category
- names of pharaohs (annotated with a correction of misspellings)
information about hotels and museums is exported into Prolog facts and into a short textual summary
- templates filled with the detected information
  hotels: Price information about Cosmopolitan Hotel : $157
  museums: *** *Fraunces Tavern Museum* *** Open from 12:00:00 to 17:00:00; Restriction: Tuesday-Friday
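Filling such summary templates from detected values could be as simple as the following plain-Java sketch. The class and method names (`SummaryTemplateSketch`, `hotelSummary`, `museumSummary`) are hypothetical; only the two output formats are taken from the examples on the slide.

```java
// Illustration only: string templates filled with already-extracted values.
class SummaryTemplateSketch {

    // Template for a hotel price line.
    static String hotelSummary(String hotel, String price) {
        return "Price information about " + hotel + " : " + price;
    }

    // Template for a museum line; the restriction part is optional.
    static String museumSummary(String museum, String from, String to, String restriction) {
        StringBuilder sb = new StringBuilder();
        sb.append("*** *").append(museum).append("* *** ");
        sb.append("Open from ").append(from).append(" to ").append(to);
        if (restriction != null && !restriction.isEmpty()) {
            sb.append("; Restriction: ").append(restriction);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hotelSummary("Cosmopolitan Hotel", "$157"));
        System.out.println(museumSummary(
                "Fraunces Tavern Museum", "12:00:00", "17:00:00", "Tuesday-Friday"));
    }
}
```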

UIMA vs. GATE: Conclusion
no final judgement about whether to use GATE or UIMA
- it depends on your task:
  - task description
  - expected results
  - which processing resources are necessary
- and on your preferences for the interface:
  - prefer the Eclipse environment (or other Java editors)
  - prefer a comfortable GUI

UIMA vs. GATE: Conclusion
GATE:
- tools available
- comfortable GUI
UIMA:
- plain framework
- simplified definition of (complex) result structures
- simplified pre- and postprocessing of annotations
both are extensible
- e.g. for processing German documents

'German' Extension of Processing Resources
XDOC document suite
- tools for processing German documents
- tools implemented in Common Lisp
for UIMA
- Java reimplementation of the tools
- several analysis engines

XDOC in UIMA
annotation of
- part-of-speech (Morphix, heuristics)
- semantic categories
- named entities (vehicles, cities, …)
a coarse approach for classification of PP
- using maxent library

UIMA: Evaluation
documentation?
- good
- illustrative examples (tutorial)
- completeness: some topics are described only very briefly
- prior experience with Java programming and Eclipse is advantageous

UIMA: Evaluation
processing and linguistic resources?
- annotators only from the tutorial:
  - sentence annotation
  - word annotation
  - date/time annotators
  - examples for using regular expressions etc.
- external resources can be integrated:
  - lexical resources as external resources (text files)
  - existing processing resources
  - implementation of an interface is necessary

UIMA: Evaluation
tools for resource maintenance and extension?
- specific Eclipse component editors, or
- simple text editors

UIMA: Evaluation
speed of processing?
- faster than GATE?
- the CPE gives detailed information about the processing time of each module

UIMA: Evaluation
single docs vs. large corpora?
- Collection Reader
- document(s) from a directory

UIMA: Evaluation
limitations, suggestions for improvement?
no limitations:
- everything is possible, but implementation or interfacing has to be done by the user
wish:
- more processing and linguistic resources within the distribution

UIMA: Evaluation
im-/export of document formats?
- import: CAS Initializer
- export: CAS Consumer
  - transform annotations into any other format
  - export of document + annotations, or only annotations
- required: Java application


NLP Teaching
course: Information Extraction
aim of the course: to acquaint our students with information extraction as a basic NLP technology
- UIMA, GATE
students: computer science, data-knowledge engineering
skills of the students: Java programming

NLP Teaching
different corpora:
- news about the FIFA world cup 2006 in Germany
- descriptions of drugs
- announcements of new books, …
tasks for students:
- to develop different analysis engines and combine them
  - for annotation of URLs, addresses, names of players, results of games, …
  - using regular expressions, external resources, maximum entropy models

NLP Teaching
[figure]

UIMA: A Student's View
- easy to handle
- Java programming (environment)
problems of students:
- understanding the dependencies between the several descriptors
helpful for teaching (future work):
- a 'comparator' of different student solutions
- which solution is the best, relative to a 'master' solution
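The slide does not say how such a 'comparator' would measure closeness to the master solution. One plausible option, shown here purely as an assumption, is to score each student solution by exact-match precision, recall and F1 against the master's annotations. The class `SolutionComparatorSketch` and the `"Type:begin:end"` key format are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustration only: exact-match comparison of annotation sets.
class SolutionComparatorSketch {

    // Each annotation is encoded as "Type:begin:end" (an assumed key format).
    // Returns {precision, recall, f1} of the student set against the master set.
    static double[] score(Set<String> master, Set<String> student) {
        Set<String> common = new HashSet<>(student);
        common.retainAll(master); // annotations found in both solutions

        double precision = student.isEmpty() ? 0.0 : (double) common.size() / student.size();
        double recall = master.isEmpty() ? 0.0 : (double) common.size() / master.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        Set<String> master = new HashSet<>(Arrays.asList(
                "Time:39:42", "Time:43:46", "Museum:4:19"));
        Set<String> student = new HashSet<>(Arrays.asList(
                "Time:39:42", "Museum:4:19", "Museum:0:19"));
        double[] s = score(master, student);
        System.out.println(String.format(Locale.ROOT,
                "precision=%.2f recall=%.2f f1=%.2f", s[0], s[1], s[2]));
    }
}
```

Ranking the student solutions by F1 would then answer "which solution is the best"; a more lenient comparator could also give credit for overlapping spans.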


Conclusion
UIMA:
- easy to learn and to handle
- supports the management of
  - different annotations
  - different processing resources
- integration of external resources (processing resources as well as lexical resources)
- splitting of 'processing steps': reader, initializer, analysis engine, consumer
'wish-list':
- a kind of JAPE transducer
  - an interface to GATE's processing resources is available
- 'comparator' for evaluation of solutions


XDOC in UIMA
[figure]

Introduction
really?

Introduction
first experiments with UIMA
- processing tourism web sites, news about the FIFA world cup 2006 in Germany, …
- integration of tools from the XDOC document suite
using UIMA in a course on Information Extraction

Introduction
"IBM’s Unstructured Information Management Architecture (UIMA) is an architecture and software framework for creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities and integrating them with search technologies."
November 2005; Version of UIMA is available