Cognitive Computation Group Curator Overview December 3, 2013

Available from CCG in Curator
- Tokenization/Sentence Splitting
- Part Of Speech
- Chunking
- Lemmatizer
- Named Entity Recognition
- Coreference
- Semantic Role Labeling
- Wikifier
- 3rd party syntactic parsers:
  - Charniak
  - Stanford (dependency and constituency)

Academic research use of NLP tools
Find tools written in the language you’re programming with, e.g. Python, Java, Perl, C++…
…with a nice API

    public class myApp {
        POSTagger tagger;
        …
        public Result doSomething( String text ) {
            List taggedWords = tagger.tag( text );
            …
        }
    }

Using NLP tools (cont’d)
…OR maybe it’s written in OCaml, only runs from the command line, and writes to a file…
…so you write a shell script that runs the first tool and pipes its output to your tool…
…and write a parser to map from that output to your data structures…
…or maybe you could learn OCaml and write a web service wrapper...
Generally, people either:
- use a lot of file I/O and custom parsing, which is cumbersome and usually extremely non-portable;
- use a specific package in a specific language (e.g. NLTK) and stick to it; or
- write all their own tools.

The growing problem…
Usually, complex applications like QA benefit from using many NLP tools. For many tasks (e.g. POS, NER, syntactic parsing) there are numerous packages available from various research groups.
- But they use different languages…
- …and different APIs…
- …and you don’t know for certain which tool of each type would be the best, so you’d like to try out different combinations…
- …and as tools get more sophisticated, they tend to need more memory.
CCG tools: old NER: 1G; old Coref: 1G; SRL/Nom: 4G each; new NER: 6-8G; Wikifier: 8G…
Even if they are all in Java, you may not have a machine that can run them all in one VM.

CURATOR

Curator
[Figure: Curator architecture diagram; labeled components: NER, SRL, POS, Chunker, Cache, Curator]

What does the Curator give you?
- Supports distributed NLP resources
  - Single point of contact
  - Single set of interfaces
  - Common interchange format (Thrift)
  - Code generation in many programming languages (using Thrift)
- Programmatic interface
  - Defines set of common data structures used for interaction
- Caches processed data
- Enables highly configurable NLP pipeline
Overhead:
- Annotation is all at the level of character offsets: normalization/mapping to token level required
- Need to wrap tools to provide requisite data structures
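The character-to-token mapping mentioned under "Overhead" could look roughly like the following. This is a minimal sketch, not Curator code: the class and method names are hypothetical, and it assumes tokens are already available as non-overlapping {start, ending} character pairs with one-past-the-end indexing.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetToToken {
    /**
     * Returns the indexes of the tokens lying entirely inside the character
     * range [start, ending). Each token is a {start, ending} pair.
     */
    static List<Integer> tokensCovered(int[][] tokenSpans, int start, int ending) {
        List<Integer> covered = new ArrayList<>();
        for (int i = 0; i < tokenSpans.length; i++) {
            if (tokenSpans[i][0] >= start && tokenSpans[i][1] <= ending) {
                covered.add(i);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        // "The tree fell." split into The / tree / fell / .
        int[][] tokens = { {0, 3}, {4, 8}, {9, 13}, {13, 14} };
        // A character-level annotation over "The tree" covers characters 0 to 8.
        System.out.println(tokensCovered(tokens, 0, 8)); // prints [0, 1]
    }
}
```

A linear scan is enough for sentence-sized inputs; for long documents a binary search over the sorted token starts would be the natural refinement.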

Getting Started With the Curator
Installation:
- Download the Curator package and uncompress the archive
- Install prerequisites: Thrift, Apache Ant, Boost, MongoDB
- Run bootstrap.sh
The default installation comes with the following annotators (Illinois, unless mentioned):
- Sentence splitter and tokenizer
- POS tagger
- Lemmatizer
- Shallow Parser
- Named Entity Recognizer
- Coreference resolution system
- Stanford and Charniak parsers
- Semantic Role Labeler (+ Nominalized verb RL)

Basic Concept
Different NLP annotations can be defined in terms of a few simple data structures:
- Record: A big container to store all annotations of a text
- Span: A span of text (defined in terms of characters) along with a label (a single token, or a single POS tag)
- Node: A Span, a Label, and a set of children (indexes into a common list of Nodes)
- Labeling: A collection of Spans (POS tags for the text)
- Trees and Forests: A collection of Nodes (parse trees)
- Clustering: A collection of Labelings (co-reference)
Note: spans use one-past-the-end indexing
- “The” at the beginning of a sentence has character offsets ‘0,3’

Spans, Labelings, etc.
The Span is the basic unit of information in Curator’s data structures. A Span has a label, a pair of offsets (one-past-the-end; see the Labeling/Span example further on), and a key/value map to contain additional information.
While the different data structures (Labelings, Trees, etc.) are provided with specific uses in mind, there are no specific constraints on how any given application represents its information:
- Part of Speech will probably use the Span label to store POS information, but the key/value map could be used instead
- Coreference may store additional information about mentions in a mention chain in their key/value maps
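The two ways of storing POS information on a Span can be sketched as follows. The `Span` class here is a hypothetical stand-in written for illustration, not the Thrift-generated Curator class, and the `"pos"` key and `"token"` label are made-up names.

```java
import java.util.HashMap;
import java.util.Map;

final class SpanSketch {
    /** Hypothetical stand-in for a Span: label, [start, ending) offsets, key/value map. */
    static final class Span {
        final int start;
        final int ending; // one past the last character
        final String label;
        final Map<String, String> attributes = new HashMap<>();

        Span(int start, int ending, String label) {
            this.start = start;
            this.ending = ending;
            this.label = label;
        }

        String coveredText(String text) {
            return text.substring(start, ending);
        }
    }

    public static void main(String[] args) {
        String text = "The tree fell.";

        // Option 1: the POS tag is the Span label.
        Span byLabel = new Span(4, 8, "NN");

        // Option 2: the tag lives in the key/value map ("pos" is a made-up key).
        Span byAttribute = new Span(4, 8, "token");
        byAttribute.attributes.put("pos", "NN");

        System.out.println(byLabel.coveredText(text) + "/" + byLabel.label);
        System.out.println(byAttribute.coveredText(text) + "/" + byAttribute.attributes.get("pos"));
    }
}
```

Both lines print `tree/NN`; the only difference is where a consumer has to look for the tag, which is why an annotator's documentation needs to say which convention it uses.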

Example of a Labeling and Span
The tree fell.
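The slide's figure is not reproduced in this transcript. As a rough illustration, a POS Labeling over this sentence might look like the sketch below. The classes are hypothetical stand-ins, and the tokenization, the Penn Treebank-style tags, and the source name are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class LabelingSketch {
    /** Hypothetical Span: a label over the characters [start, ending). */
    static final class Span {
        final int start;
        final int ending;
        final String label;

        Span(int start, int ending, String label) {
            this.start = start;
            this.ending = ending;
            this.label = label;
        }
    }

    /** Hypothetical Labeling: a source name plus a list of Spans. */
    static final class Labeling {
        final String source;
        final List<Span> labels = new ArrayList<>();

        Labeling(String source) {
            this.source = source;
        }
    }

    static Labeling posLabeling() {
        Labeling pos = new Labeling("example-pos-tagger"); // made-up source name
        pos.labels.add(new Span(0, 3, "DT"));    // The
        pos.labels.add(new Span(4, 8, "NN"));    // tree
        pos.labels.add(new Span(9, 13, "VBD"));  // fell
        pos.labels.add(new Span(13, 14, "."));   // .
        return pos;
    }

    /** Renders each Span as coveredText/label, separated by spaces. */
    static String render(String text, Labeling labeling) {
        StringBuilder sb = new StringBuilder();
        for (Span s : labeling.labels) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(text.substring(s.start, s.ending)).append('/').append(s.label);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("The tree fell.", posLabeling()));
        // prints: The/DT tree/NN fell/VBD ./.
    }
}
```

Because the ending offset is one past the last character, each pair can be passed straight to `String.substring`.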

Example of a Tree and Node
The tree fell.
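The slide's figure is not reproduced in this transcript. The sketch below shows one way a parse of this sentence could be stored as a flat list of Nodes whose children are indexes into that same list, as described earlier. The `Node` class is a hypothetical stand-in and the bracketing is an assumed constituency parse, not output from any particular parser.

```java
import java.util.Arrays;
import java.util.List;

final class TreeSketch {
    /** Hypothetical Node: a label, a character span, and child indexes into the node list. */
    static final class Node {
        final String label;
        final int start;
        final int ending;
        final int[] children;

        Node(String label, int start, int ending, int... children) {
            this.label = label;
            this.start = start;
            this.ending = ending;
            this.children = children;
        }
    }

    /** Assumed parse of "The tree fell."; node 0 is the root. */
    static List<Node> exampleNodes() {
        return Arrays.asList(
            new Node("S", 0, 14, 1, 4, 6),
            new Node("NP", 0, 8, 2, 3),
            new Node("DT", 0, 3),
            new Node("NN", 4, 8),
            new Node("VP", 9, 13, 5),
            new Node("VBD", 9, 13),
            new Node(".", 13, 14));
    }

    /** Renders the subtree rooted at index as a bracketed string. */
    static String render(List<Node> nodes, int index, String text) {
        Node n = nodes.get(index);
        StringBuilder sb = new StringBuilder("(").append(n.label);
        if (n.children.length == 0) {
            sb.append(' ').append(text.substring(n.start, n.ending));
        } else {
            for (int child : n.children) {
                sb.append(' ').append(render(nodes, child, text));
            }
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(render(exampleNodes(), 0, "The tree fell."));
        // prints: (S (NP (DT The) (NN tree)) (VP (VBD fell)) (. .))
    }
}
```

Storing children as indexes rather than object references keeps the structure flat, which is what makes it easy to serialize in an interchange format.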

Example of a Clustering
John saw Mary and her father at the park. He was alarmed by the old man’s fierce glare.
Labeling 1: [E1; 0,4 (John)], [E1; 42,44 (He)]
Labeling 2: [E2; 9,13 (Mary)], [E2; 18,21 (her)]
Labeling 3: [E3; 18,28 (her father)], [E3; 60,71 (the old man)]
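A coreference Clustering for this text can be sketched as a list of Labelings, one per entity. To avoid hand-counting, the sketch computes each mention's one-past-the-end offsets from the text with `indexOf`. The classes and the `mention` helper are hypothetical, and using the first occurrence of each surface string only works because every mention here is the first occurrence of its string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ClusteringSketch {
    static final String TEXT =
        "John saw Mary and her father at the park. "
        + "He was alarmed by the old man's fierce glare.";

    /** Hypothetical Span: an entity label over the characters [start, ending). */
    static final class Span {
        final int start;
        final int ending;
        final String label;

        Span(int start, int ending, String label) {
            this.start = start;
            this.ending = ending;
            this.label = label;
        }
    }

    /** Builds a Span over the first occurrence of surface in text. */
    static Span mention(String text, String label, String surface) {
        int start = text.indexOf(surface);
        if (start < 0) {
            throw new IllegalArgumentException("not found: " + surface);
        }
        return new Span(start, start + surface.length(), label);
    }

    /** A Clustering as a list of Labelings; each Labeling is a list of Spans. */
    static List<List<Span>> clustering(String text) {
        List<List<Span>> clusters = new ArrayList<>();
        clusters.add(Arrays.asList(
            mention(text, "E1", "John"), mention(text, "E1", "He")));
        clusters.add(Arrays.asList(
            mention(text, "E2", "Mary"), mention(text, "E2", "her")));
        clusters.add(Arrays.asList(
            mention(text, "E3", "her father"), mention(text, "E3", "the old man")));
        return clusters;
    }

    public static void main(String[] args) {
        for (List<Span> labeling : clustering(TEXT)) {
            for (Span s : labeling) {
                System.out.println("[" + s.label + "; " + s.start + "," + s.ending
                    + " (" + TEXT.substring(s.start, s.ending) + ")]");
            }
        }
    }
}
```

Note that the spans for "her" and "her father" overlap; nothing in the data structures forbids that, since each Labeling is independent.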

Using Curator for Flexible NLP Pipeline
Setting up:
- Install Curator Server instance
- Install components (Annotators)
- Update configuration files
Use:
- Use libraries provided: curatorClient.provide() method
- Access Record field indicated by Component documentation/configuration

Record Data Structure

    struct Record {
      /** how to identify this record. */
      1: required string identifier,
      2: required string rawText,
      3: required map<string, Labeling> labelViews,
      4: required map<string, Clustering> clusterViews,
      5: required map<string, Forest> parseViews,
      6: required map<string, View> views,
      7: required bool whitespaced,
    }

- rawText contains the original text span
- Annotators populate one of the Views and assign a unique identifier (specified in the configuration file)

Annotator Example: Parser
- Will populate a View, named ‘charniak’
- Curator will expect a Parser interface from the annotator
- Client will expect prerequisites to be provided in other Record fields
- Specified via Curator server’s annotator configuration file:
    parser charniak mycharniakhost.uiuc.edu:8087 sentences:tokens:pos
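One way to picture the prerequisite check is sketched below. It assumes that the last field of the configuration entry (`sentences:tokens:pos`) is a colon-separated list of view names that must already be present in the Record; that reading, and the class and method names, are assumptions for illustration rather than a description of the Curator server's actual logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PrerequisiteCheck {
    /** Returns the required view names that are not yet available, in the order listed. */
    static List<String> missingViews(String requirementSpec, Set<String> availableViews) {
        List<String> missing = new ArrayList<>();
        for (String view : requirementSpec.split(":")) {
            if (!availableViews.contains(view)) {
                missing.add(view);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Suppose the record so far holds sentence and token views only.
        Set<String> available = new HashSet<>(Arrays.asList("sentences", "tokens"));
        System.out.println(missingViews("sentences:tokens:pos", available)); // prints [pos]
    }
}
```

Under this reading, a pipeline would produce the missing views first (here, `pos`) and only then call the parser.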

Using Curator (Java) snippet

    public void useCurator( String text ) throws TException {
        // First we need a transport
        TTransport transport = new TSocket(host, port);
        // we are going to use a non-blocking server so need framed transport
        transport = new TFramedTransport(transport);
        // Now define a protocol which will use the transport
        TProtocol protocol = new TBinaryProtocol(transport);
        // instantiate the client
        Curator.Client client = new Curator.Client(protocol);
        // ask the server which annotations it can provide
        transport.open();
        Map<String, String> avail = client.describeAnnotations();
        transport.close();
        for (String key : avail.keySet())
            System.out.println("\t" + key + " provided by " + avail.get(key));
        boolean forceUpdate = true; // force curator to ignore cache
        …

Curator snippet (Java)

        …
        // get an annotation source named as 'ner' in curator annotator
        // configuration file
        transport.open();
        Record record = client.provide("ner", text, forceUpdate);
        transport.close();
        // print each NER label with the text it covers
        for (Span span : record.getLabelViews().get("ner").getLabels()) {
            System.out.println(span.getLabel() + " : "
                + record.getRawText().substring(span.getStart(), span.getEnding()));
        }
        ...
    }

Curator snippet (php)

    function useCurator() {
        // set variables naming curator host and port, timeout, and text
        ...
        $socket = new TSocket($hostname, $c_port);
        $socket->setRecvTimeout($timeout*1000);
        $transport = new TBufferedTransport($socket, 1024, 1024);
        $transport = new TFramedTransport($transport);
        $protocol = new TBinaryProtocol($transport);
        $client = new CuratorClient($protocol);
        $transport->open();
        $record = $client->getRecord($text);
        $transport->close();
        …

Curator snippet (php)

        …
        // request each annotation in turn; the record accumulates the views
        foreach ($annotations as $annotation) {
            $transport->open();
            $record = $client->provide($annotation, $text, $update);
            $transport->close();
        }
        foreach ($record->labelViews as $view_name => $labeling) {
            $source = $labeling->source;
            $labels = $labeling->labels;
            $result = "";
            foreach ($labels as $i => $span) {
                $result .= "$span->label;";
                ...
            }
            ...

Benefits
From the perspective of the user (i.e., the developer of complex text processing applications):
- Programmatic interface in their language of choice
- Uniform mechanism for accessing a wide variety of NLP components
- Caching of annotations, which can be shared across a group
- Distribution of memory-hungry components across different machines, but with one point of access
- For the more adventurous, an extensible framework that can be changed via the specification of the underlying Thrift files

Edison
A Java library by Vivek Srikumar of CCG that…
- Simplifies access to Curator
- Defines useful NLP-friendly data structures
- Provides code for a lot of common NLP tasks, e.g. feature extraction, calculation of performance statistics, …
The link above provides examples for using Edison and Curator together