An Introduction to Machine Learning and Natural Language Processing Tools Vivek Srikumar, Mark Sammons (Some slides from Nick Rizzolo)

The Famous People Classifier
f( [photo] ) = Politician
f( [photo] ) = Athlete
f( [photo] ) = Corporate Mogul

Outline An Overview of NLP Resources Our NLP Application: The Fame classifier The Curator Edison Learning Based Java Putting everything together

An overview of NLP resources What can NLP do for me?

NLP resources (an incomplete list)
Cognitive Computation Group resources: Tokenization/Sentence Splitting, Part of Speech, Chunking, Named Entity Recognition, Coreference, Semantic Role Labeling
Others: Stanford parser and dependencies, Charniak parser

Tokenization and Sentence Segmentation
Given a document, find the sentence and token boundaries.
The police chased Mr. Smith of Pink Forest, Fla. all the way to Bethesda, where he lived. Smith had escaped after a shoot-out at his workplace, Machinery Inc.
Why?
- Word counts may be important features
- Words may themselves be the objects you want to classify
- "lived." and "lived" should give the same information
- Different analyses need to align if you want to leverage multiple annotators from different sources/tasks
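To make the problem concrete, here is a minimal sketch in plain Java (not the Illinois splitter) of a naive period-based sentence splitter; it wrongly breaks after "Mr." and "Fla.", which is exactly what a real segmenter has to avoid:
```java
import java.util.Arrays;
import java.util.List;

public class NaiveSentenceSplitter {
    // Naive rule: a period followed by whitespace ends a sentence.
    // This wrongly splits after "Mr." and "Fla.", which is why real
    // sentence segmenters model abbreviations and context.
    public static List<String> split(String text) {
        return Arrays.asList(text.split("(?<=\\.)\\s+"));
    }

    public static void main(String[] args) {
        String doc = "The police chased Mr. Smith of Pink Forest, Fla. all the way "
                   + "to Bethesda, where he lived.";
        split(doc).forEach(System.out::println);   // prints 3 "sentences" instead of 1
    }
}
```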

Part of Speech (POS)
Allows a simple abstraction for pattern detection.
- Disambiguate a target, e.g. "make (a cake)" vs. "make (of car)"
- Specify more abstract patterns, e.g. Noun Phrase: ( DT JJ* NN )
- Specify context in an abstract way, e.g. "DT boy VBX" for "actions boys do"; this expression will catch "a boy cried", "some boy ran", ...
Word: The  boy  stood  on  the  burning  deck
POS:  DT   NN   VBD    PP  DT   JJ       NN
Word: A   boy  rode  on  a   red  bicycle
POS:  DT  NN   VBD   PP  DT  JJ   NN
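A small illustration of the "abstract context" idea: scan parallel word/tag arrays for the pattern "DT boy VB*". The arrays below are hand-tagged toy data, not output from a real tagger:
```java
public class PosPatternDemo {
    public static void main(String[] args) {
        String[] words = {"a",  "boy", "cried", "and", "some", "boy", "ran"};
        String[] tags  = {"DT", "NN",  "VBD",   "CC",  "DT",   "NN",  "VBD"};

        // Look for "DT boy VB*": a determiner, the literal word "boy",
        // then any verb tag -- i.e. "actions boys do".
        for (int i = 0; i + 2 < words.length; i++) {
            if (tags[i].equals("DT")
                    && words[i + 1].equalsIgnoreCase("boy")
                    && tags[i + 2].startsWith("VB")) {
                System.out.println(words[i] + " " + words[i + 1] + " " + words[i + 2]);
            }
        }
        // Prints: "a boy cried" and "some boy ran".
    }
}
```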

Chunking
Identifies phrase-level constituents in sentences:
[NP Boris] [ADVP regretfully] [VP told] [NP his wife] [SBAR that] [NP their child] [VP could not attend] [NP night school] [PP without] [NP permission].
- Useful for filtering: identify e.g. only noun phrases, or only verb phrases
- Groups modifiers with heads; useful for e.g. mention detection
- Used as a source of features, e.g. distance (abstracts away determiners and adjectives, for example), sequence, ...
- More efficient to compute than a full syntactic parse
- Applications in e.g. Information Extraction: getting (simple) information about concepts of interest from text documents
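Chunkers are commonly trained to emit per-token BIO tags; the sketch below (with made-up tags for a toy sentence) shows how such tags collapse into phrase spans:
```java
import java.util.ArrayList;
import java.util.List;

public class BioToChunks {
    /** Collapse per-token BIO tags (B-NP, I-NP, O, ...) into "LABEL[words]" spans. */
    static List<String> chunks(String[] words, String[] bio) {
        List<String> out = new ArrayList<>();
        StringBuilder current = null;
        String label = null;
        for (int i = 0; i <= words.length; i++) {
            boolean inside = i < words.length && bio[i].startsWith("I-");
            if (current != null && !inside) {           // close the open chunk
                out.add(label + "[" + current + "]");
                current = null;
            }
            if (i < words.length && bio[i].startsWith("B-")) {
                label = bio[i].substring(2);
                current = new StringBuilder(words[i]);
            } else if (inside && current != null) {
                current.append(' ').append(words[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = {"Boris", "told", "his", "wife"};
        String[] bio   = {"B-NP",  "B-VP", "B-NP", "I-NP"};
        System.out.println(chunks(words, bio));  // [NP[Boris], VP[told], NP[his wife]]
    }
}
```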

Named Entity Recognition
Identifies and classifies strings of characters representing proper nouns:
[PER Neil A. Armstrong], the 38-year-old civilian commander, radioed to earth and the mission control room here: "[LOC Houston], [ORG Tranquility] Base here; the Eagle has landed."
- Useful for filtering documents: "I need to find news articles about organizations in which Bill Gates might be involved..."
- Disambiguates tokens: "Chicago" (team) vs. "Chicago" (city)
- Source of abstract features, e.g. "verbs that appear with entities that are Organizations", or "documents that have a high proportion of Organizations"

Coreference
Identify all phrases that refer to each entity of interest, i.e., group mentions of concepts:
[Neil A. Armstrong], [the 38-year-old civilian commander], radioed to [earth]. [He] said the famous words, "[the Eagle] has landed."
- The Named Entity Recognizer only gets us part-way...
- ...if we ask, "what actions did Neil Armstrong perform?", we will miss many instances (e.g. "He said...")
- A coreference resolver abstracts over different ways of referring to the same person
- Useful in feature extraction, information extraction

Parsers
Identify the grammatical structure of a sentence (full parse and dependency parse).
Dependency parse of "John hit the ball": "John" is the subject and "the ball" the object of "hit"; "the" is a modifier of "ball".
Parsers reveal the grammatical relationships between words and phrases.

Semantic Role Labeler
SRL reveals relations and arguments in the sentence (where relations are expressed as verbs).
It cannot abstract over the variability of expressing the relations, e.g. kill vs. murder vs. slay...

The fame classifier
Enough NLP. Let's make our $$$ with the fame classifier.

The Famous People Classifier
f( [photo] ) = Politician
f( [photo] ) = Athlete
f( [photo] ) = Corporate Mogul

The NLP version of the fame classifier
Each person is represented by all sentences in the news in which their name occurs:
- All sentences in the news in which the string Barack Obama occurs
- All sentences in the news in which the string Roger Federer occurs
- All sentences in the news in which the string Bill Gates occurs

Our goal: find famous athletes, corporate moguls and politicians.
- Athlete: Michael Schumacher, Michael Jordan, ...
- Politician: Bill Clinton, George W. Bush, ...
- Corporate Mogul: Warren Buffett, Larry Ellison, ...

Let’s brainstorm What NLP resources could we use for this task? Remember, we start off with just raw text from a news website

One solution: label entities using features defined on mentions.
1. Identify mentions using the named entity recognizer
2. Define features based on the words, parts of speech and dependency trees
3. Train a classifier
(Each entity, e.g. Barack Obama, is represented by all sentences in the news in which its name occurs.)

Where to get it: Machine Learning
Feature Functions + Learning Algorithm + Data
(each example, i.e. a person, is mapped to "politics", "sports", or "business")
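A rough sketch of those three ingredients in plain Java; the names here are illustrative and are not the LBJ or Edison API:
```java
import java.util.HashMap;
import java.util.Map;

public class FeatureFunctionSketch {
    /** Feature function: turn the text around a person into a bag-of-words vector. */
    static Map<String, Integer> features(String text) {
        Map<String, Integer> bag = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+"))
            if (!w.isEmpty())
                bag.merge(w, 1, Integer::sum);
        return bag;
    }

    public static void main(String[] args) {
        // Data: (text describing the person, gold label).
        String[][] data = {
            {"signed the budget bill in congress", "politics"},
            {"won the grand slam final", "sports"},
            {"acquired the startup for two billion", "business"},
        };
        // The learning algorithm's job (omitted here) is to fit a model
        // mapping feature vectors like these to the labels.
        for (String[] ex : data)
            System.out.println(ex[1] + " <- " + features(ex[0]));
    }
}
```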

A second look at the solution
1. Identify mentions using the named entity recognizer
2. Define features based on the words, parts of speech and dependency trees
3. Train a classifier
Tools: from the University of Illinois, the sentence and word splitter, part-of-speech tagger, and named entity recognizer; from Stanford University, the dependency parser (and the NLP pipeline).
These tools can be downloaded from the websites. Are we done? If not, what's missing?

We need to put the pieces together

The infrastructure
- The Curator: a common interface for different NLP annotators; caches their results
- Edison: a library for NLP representation in Java; helps with extracting complex features
- Learning Based Java: a Java library for machine learning; provides a simple language to define classifiers and perform inference with them

The infrastructure
Each infrastructure module has specific interfaces that the user is expected to use:
- The Curator specifies the interface for accessing annotations from the NLP tools
- Edison fixes the representation for the NLP annotations
- Learning Based Java requires training data to be presented to it through an interface called Parser

Curator A place where NLP annotations live

Big NLP
- NLP tools are quite sophisticated; the more complex, the bigger the memory requirement (NER: 1G; Coref: 1G; SRL: 4G, ...)
- If you use tools from different sources, they may be in different languages, or use different data structures
- If you run a lot of experiments on a single corpus, it would be nice to cache the results... and for your colleagues, nice if they can access that cache
Curator is our solution to these problems.

Curator (architecture diagram: the Curator sits between client code and the annotators, e.g. NER, SRL, POS and Chunker, and keeps a cache of their results)

What does the Curator give you?
- Supports distributed NLP resources: central point of contact, single set of interfaces, code generation in many programming languages (using Thrift)
- Programmatic interface: defines a set of common data structures used for interaction
- Caches processed data
- Enables a highly configurable NLP pipeline
Overhead:
- Annotation is all at the level of character offsets; normalization/mapping to the token level is required
- Need to wrap tools to provide the requisite data structures
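The caching idea, sketched in plain Java (this is not the Curator's Thrift interface, just the concept): each (view, text) pair is computed once and reused afterwards:
```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class CachingAnnotatorSketch {
    private final Map<String, String> cache = new HashMap<>();

    /** Return the cached annotation for (viewName, text), computing it once if absent. */
    public String annotate(String viewName, String text, Function<String, String> annotator) {
        return cache.computeIfAbsent(viewName + "::" + text, key -> annotator.apply(text));
    }

    public static void main(String[] args) {
        CachingAnnotatorSketch curatorLike = new CachingAnnotatorSketch();
        Function<String, String> slowPosTagger = text -> {
            System.out.println("  (running the expensive tagger)");
            return "DT NN VBD";   // stand-in for a real POS labeling
        };
        // First call hits the tagger, second call is served from the cache.
        System.out.println(curatorLike.annotate("POS", "The tree fell", slowPosTagger));
        System.out.println(curatorLike.annotate("POS", "The tree fell", slowPosTagger));
    }
}
```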

Getting Started with the Curator
Installation:
- Download the Curator package and uncompress the archive
- Run bootstrap.sh
The default installation comes with the following annotators (Illinois, unless mentioned):
- Sentence splitter and tokenizer
- POS tagger
- Shallow parser
- Named Entity Recognizer
- Coreference resolution system
- Stanford parser

Basic Concept
Different NLP annotations can be defined in terms of a few simple data structures:
- Record: a big container to store all annotations of a text
- Span: a span of text (defined in terms of characters) along with a label (a single token, or a single POS tag)
- Labeling: a collection of Spans (e.g. the POS tags for the text)
- Trees and Forests (parse trees)
- Clustering: a collection of Labelings (e.g. coreference)
Go here for more information:
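A rough Java approximation of how those structures nest; the real definitions are Thrift-generated, so the field names below are guesses for illustration only:
```java
import java.util.List;
import java.util.Map;

public class CuratorDataSketch {
    /** A labeled span of the raw text, in character offsets. */
    static class Span {
        int start, end;      // character offsets into the raw text
        String label;        // e.g. a token's surface form, or a POS tag
    }

    /** One annotation layer: a flat collection of spans (e.g. all POS tags). */
    static class Labeling {
        List<Span> labels;
    }

    /** A grouping of labelings, e.g. one coreference chain per cluster. */
    static class Clustering {
        List<Labeling> clusters;
    }

    /** The container for everything known about one text. */
    static class Record {
        String rawText;
        Map<String, Labeling> labelViews;      // "pos", "ner", "chunk", ...
        Map<String, Clustering> clusterViews;  // "coref", ...
        // parse trees and forests omitted in this sketch
    }
}
```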

Example of a Labeling: the labeled spans for the sentence "The tree fell."

Edison Representing NLP objects and extracting features

Edison
- An NLP data representation and feature extraction library
- Helps manage and use different annotations of text
Doesn't the Curator do everything we need?
- The Curator is a service that abstracts away different annotators
- Edison is a Curator client, and more...

Representation of NLP annotations
- All NLP annotations are called Views
- A View is just a labeled directed graph: nodes are labeled collections of tokens, called Constituents; labeled edges between nodes are called Relations
- All Views related to some text are contained in a TextAnnotation
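A simplified sketch of that graph model in plain Java (the real Edison classes carry much richer APIs; these are illustrative stand-ins):
```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ViewSketch {
    /** A node: a labeled span of tokens, identified here by token offsets. */
    static class Constituent {
        final String label;
        final int startToken, endToken;   // token span, not character span
        Constituent(String label, int start, int end) {
            this.label = label; this.startToken = start; this.endToken = end;
        }
    }

    /** A labeled edge between two constituents, e.g. a dependency relation. */
    static class Relation {
        final String label;
        final Constituent source, target;
        Relation(String label, Constituent source, Constituent target) {
            this.label = label; this.source = source; this.target = target;
        }
    }

    /** A view is just the graph: its constituents plus the relations between them. */
    static class View {
        final List<Constituent> constituents = new ArrayList<>();
        final List<Relation> relations = new ArrayList<>();
    }

    /** A TextAnnotation holds all the views for one piece of text. */
    static class TextAnnotation {
        final String text;
        final Map<String, View> views = new HashMap<>();
        TextAnnotation(String text) { this.text = text; }
    }
}
```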

Example of Views: Part of speech
"A tree fell": the part-of-speech view is a degenerate graph.
- No edges, because there are no relations
- This kind of View is represented by a subclass called TokenLabelView
- Note that constituents are token based, not character based
(Constituents: A/DT, tree/NN, fell/VBD)

Example of Views: Shallow Parse
"A tree fell": the shallow parse view is also a degenerate graph.
- No edges, because there are no relations
- This kind of View is represented by a subclass called SpanLabelView
(Constituents: [A tree] Noun Phrase, [fell] Verb Phrase)

Example of Views: Dependency Tree
"A tree fell": represented by a subclass of View called TreeView.
(Constituents: A, tree, fell; Relations: "tree" is the subject of "fell", and "A" is a modifier of "tree")

More about Views
- View represents a generic graph of Constituents and Relations
- Its subclasses denote specializations suited to specific structures: TokenLabelView, SpanLabelView, TreeView, PredicateArgumentView, CoreferenceView
- Each view allows us to query its constituents; useful for defining features!

Features
The library makes it easy to define complex features. Examples:
- POS tag for a token
- All POS tags within a span
- All tokens within a span that have a specific POS tag
- All chunks contained within a parse constituent
- All chunks contained in the largest NP that covers a token
- All co-referring mentions of chunks contained in the largest NP that covers this token
- All incoming dependency edges to a constituent
This enables quick feature engineering (a small sketch of the first two queries follows).
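A minimal sketch of those two queries over a hand-tagged sentence, using plain Java with parallel arrays rather than Edison's actual query methods:
```java
import java.util.ArrayList;
import java.util.List;

public class SpanFeatureSketch {
    // Hand-tagged toy data; tags are illustrative, not tagger output.
    static final String[] WORDS = {"The", "boy", "stood", "on", "the", "burning", "deck"};
    static final String[] POS   = {"DT",  "NN",  "VBD",   "IN", "DT",  "JJ",      "NN"};

    /** "All POS tags within a span": the tags for tokens in [start, end). */
    static List<String> posInSpan(int start, int end) {
        List<String> tags = new ArrayList<>();
        for (int i = start; i < end; i++) tags.add(POS[i]);
        return tags;
    }

    /** "All tokens within a span that have a specific POS tag." */
    static List<String> tokensWithTag(int start, int end, String tag) {
        List<String> toks = new ArrayList<>();
        for (int i = start; i < end; i++)
            if (POS[i].equals(tag)) toks.add(WORDS[i]);
        return toks;
    }

    public static void main(String[] args) {
        System.out.println(posInSpan(4, 7));           // [DT, JJ, NN]
        System.out.println(tokensWithTag(0, 7, "NN")); // [boy, deck]
    }
}
```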

Getting started with Edison
How to use Edison:
1. Download the latest version of Edison and its dependencies from the website
2. Add all the jars to your project
3. ????
4. Profit
A Maven repository is also available. See the Edison page for more details.

Demo 1
A basic Edison example, where we will:
1. Create a TextAnnotation object from raw text
2. Add a few views from the Curator
3. Print them on the terminal

Demo 2
A second Edison example, where we will:
1. Create a TextAnnotation object from raw text
2. Add a few views from the Curator
3. Print all the constituents in the named entity view

Let's recall our goal: label entities using features defined on mentions.
1. Identify mentions using the named entity recognizer
2. Define features based on the words, parts of speech and dependency trees
3. Train a classifier
(Each entity, e.g. Barack Obama, is represented by all sentences in the news in which its name occurs.)

Demo 3
Reading the fame classifier data and adding views.
Feature functions: what would be good features for the fame classification task?
Example sentences: "The US President Barack Obama said that he ...."; "President Barack Obama recently visited France."
Features for Barack Obama: US: 1, President: 2, said: 1, visited: 1, France: 1
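A sketch of the feature idea on this slide: count the words that occur in sentences containing the target string, excluding the string itself. Tokenization here is deliberately naive; in the demo these counts would come from the Curator/Edison views instead:
```java
import java.util.HashMap;
import java.util.Map;

public class MentionContextFeatures {
    /** Count every token in sentences containing the target, except the target itself. */
    static Map<String, Integer> contextCounts(String[] sentences, String target) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            if (!sentence.contains(target)) continue;
            String rest = sentence.replace(target, " ");
            for (String tok : rest.split("[^A-Za-z]+"))
                if (!tok.isEmpty())
                    counts.merge(tok, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] sentences = {
            "The US President Barack Obama said that he ....",
            "President Barack Obama recently visited France."
        };
        // Roughly reproduces the slide: US=1, President=2, said=1, visited=1, France=1
        // (plus function words, which a real feature set would filter out).
        System.out.println(contextCounts(sentences, "Barack Obama"));
    }
}
```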

Learning Based Java Writing classifiers

What is Learning Based Java?
A modeling language for learning and inference. It supports:
- Programming using learned models
- High-level specification of features and constraints between classifiers
- Inference with constraints
The learning operator:
- Classifiers are functions defined in terms of data
- Learning happens at compile time
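Once LBJ has compiled and trained a classifier, the application calls it like an ordinary function. The sketch below uses a local stub in place of the generated class; the name FameClassifier and the discreteValue signature are assumptions modeled on typical LBJ-generated classifiers, not code from this tutorial:
```java
public class FameApplicationSketch {
    /** Stand-in for the classifier LBJ would generate from the .lbj source; in the
     *  real setup the learned model is baked in at compile time and the method
     *  returns the predicted label. Name and signature here are assumptions. */
    static class FameClassifier {
        String discreteValue(Object example) { return "politician"; }   // stub prediction
    }

    public static void main(String[] args) {
        FameClassifier classifier = new FameClassifier();
        // The application just calls the learned model like any other function.
        System.out.println(classifier.discreteValue("Barack Obama"));
    }
}
```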

What does LBJ do for you? Abstracts away the feature representation, learning and inference Allows you to write learning based programs Application developers can reason about the application at hand

Our application
The same picture as before: Feature Functions + Learning Algorithm + Data → "politics" / "sports" / "business".
The Curator and Edison provide the data and feature functions; Learning Based Java provides the learning algorithm.

Demo 4 The fame classifier itself 1. The features 2. The classifier 3. Compiling to train the classifier

The Fame classifier Putting the pieces together

Recall our solution: label entities using features defined on mentions.
1. Identify mentions using the named entity recognizer
2. Define features based on the words, parts of speech and dependency trees
3. Train a classifier
(Each entity, e.g. Barack Obama, is represented by all sentences in the news in which its name occurs.)

The infrastructure
- Curator: provides access to the POS tagger, NER and the Stanford dependency parser; caches all annotations
- Edison: the NLP representation in our program; feature extraction
- Learning Based Java: the machine learning

Final demo Let’s see this in action

Links Cogcomp Software: Support: Download the slides and the code from

Running the test code on a Unix machine
Step 1: Train the classifier: $ ./compileLBJ entityFame.lbj
Step 2: Compile the other Java files: $ ant
Step 3: Test the classifier: $ ./test.sh data/test