Natural Language Processing at NYU: the Proteus Project

Presentation transcript:

Natural Language Processing at NYU: the Proteus Project
Ralph Grishman
September 2009

Proteus Project
Faculty: Ralph Grishman, Satoshi Sekine, Adam Meyers
http://nlp.cs.nyu.edu/

‘Just the Facts’
A vast amount of information is now available on-line in text form, but getting ‘the facts’ can be very hard and slow:
  Where has Secretary Clinton been over the last month?
  Which places on the East Coast have had swine flu outbreaks this month?
To move from search to question answering we need more than a bag of words; we need to figure out who-did-what-to-whom.

Understanding natural language isn’t easy
The rebels strafed the car …
  … with automatic weapons fire.
  … with the Minister and his deputy.
They …
  … died instantly.
  … were promptly arrested.
Understanding language requires a lot of knowledge.

How to get all this knowledge?
By hand … too expensive
Use weakly supervised learning
  Give a few examples (‘seeds’)
  Use a very large text corpus to learn similar examples

Knowledge Discovery: An Example
Goal:
  want to keep track of all the hirings and departures of executives
  need to find all the ways such events are described
Method:
  identify a few seed patterns
  retrieve documents containing patterns
  find subject-verb-object pattern with
    high frequency in retrieved documents
    relatively high frequency in retrieved docs vs. other docs
  add pattern to seed and repeat

#1: pick seed pattern
Seed: < person retires >

#2: retrieve relevant documents
Seed: < person retires >
Relevant documents:
  Fred retired. ... Harry was named president.
  Maki retired. ... Yuki was named president.
Other documents

#3: pick new pattern
Seed: < person retires >
< person was named president > appears in several relevant documents:
  Fred retired. ... Harry was named president.
  Maki retired. ... Yuki was named president.

#4: add new pattern to pattern set
Pattern set:
  < person retires >
  < person was named president >
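
Steps #1 to #4 can be sketched as a small loop. This is only an illustrative sketch, not the Proteus implementation: it assumes each document has already been reduced to a set of subject-verb-object pattern strings, and the function name bootstrap_patterns, the min_relevant threshold, and the scoring formula (share of occurrences in relevant documents, weighted by the log of the count) are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bootstrap_patterns(docs, seeds, iterations=3, min_relevant=2):
    """Grow a pattern set from seeds (hypothetical helper).

    docs: list of sets of pattern strings, one set per document.
    seeds: initial set of pattern strings.
    """
    patterns = set(seeds)
    for _ in range(iterations):
        # Step 2: a document is "relevant" if it contains any known pattern.
        relevant = [d for d in docs if d & patterns]
        others = [d for d in docs if not (d & patterns)]
        if not relevant:
            break

        # Count candidate patterns in relevant and in other documents.
        rel_counts = Counter(p for d in relevant for p in d if p not in patterns)
        other_counts = Counter(p for d in others for p in d)

        # Step 3: prefer patterns that are frequent in relevant documents
        # and relatively rare elsewhere (assumed scoring, see lead-in).
        best, best_score = None, 0.0
        for p in sorted(rel_counts):
            n_rel = rel_counts[p]
            if n_rel < min_relevant:
                continue
            share = n_rel / (n_rel + other_counts[p])
            score = share * math.log(1 + n_rel)
            if score > best_score:
                best, best_score = p, score

        if best is None:
            break
        # Step 4: add the new pattern and repeat.
        patterns.add(best)
    return patterns


docs = [
    {"person retires", "person was named president"},
    {"person retires", "person was named president", "company reports earnings"},
    {"company reports earnings"},
    {"company reports earnings", "team wins game"},
]
learned = bootstrap_patterns(docs, {"person retires"})
print(sorted(learned))
```

On this toy collection the loop adds < person was named president > and then stops, because no other candidate appears in at least two relevant documents.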

Results
For some event types, unsupervised learning can do as well as manual pattern development.
[Figure: recall and precision as a function of the number of iterations of the learner]
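
For reference, recall and precision for one iteration can be computed from the set of events the learned patterns extract and a hand-annotated gold set. The sketch below uses the standard definitions; the function name and the toy event identifiers are assumptions made for illustration and are not taken from the slides.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of extracted items."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    # Precision: fraction of extracted items that are correct.
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: fraction of gold items that were extracted.
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical event identifiers, for illustration only.
gold_events = {"e1", "e2", "e3", "e4"}
extracted = {"e1", "e9"}
print(precision_recall(extracted, gold_events))
```

Calling this after each iteration of the learner gives the two curves that such a plot tracks.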

Robust Learning
Quality of learned patterns is uneven
  ambiguity of language leads us to learn incorrect patterns
Need to identify cases of uncertainty
  Potential linguistic ambiguities
  With multiple classifiers using distinct features, cases where they disagree
Query user for selected uncertain examples
Weakly supervised learning + active learning = robust, rapid knowledge discovery
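
The idea of querying the user where classifiers disagree can be sketched as follows. This is a minimal illustration under stated assumptions: the two toy classifiers, the (subject type, verb) example format, and the budget parameter are hypothetical and stand in for whatever classifiers and features a real system would use.

```python
def select_uncertain(examples, classifiers, budget=2):
    """Return up to `budget` examples on which the classifiers disagree."""
    uncertain = []
    for ex in examples:
        # Collect the distinct labels assigned by the committee.
        labels = {clf(ex) for clf in classifiers}
        if len(labels) > 1:
            uncertain.append(ex)
    return uncertain[:budget]


# Two toy classifiers using distinct features (assumed for illustration):
# one looks only at the verb, the other only at the subject type.
def by_verb(ex):
    return ex[1] in {"retires", "resigns"}


def by_subject(ex):
    return ex[0] == "person"


examples = [
    ("person", "retires"),   # both say yes
    ("company", "retires"),  # disagreement
    ("person", "sleeps"),    # disagreement
    ("company", "reports"),  # both say no
]
to_label = select_uncertain(examples, [by_verb, by_subject])
print(to_label)
```

Only the examples returned here would be shown to the user, so labeling effort goes to the cases the learner is least sure about.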

For More Information
Project web site: nlp.cs.nyu.edu
Course: G22.2590 - Natural Language Processing (Spring 2010)