CS 4705 Robust Semantics, Information Extraction, and Information Retrieval.

Problems with Syntax-Driven Semantics
Syntactic structures often don't fit semantic structures very well:
–Important semantic elements are often distributed very differently in the trees of sentences that mean 'the same': "I like soup." vs. "Soup is what I like."
–Parse trees contain many structural elements not clearly important to making semantic distinctions
–Syntax-driven semantic representations are sometimes pretty verbose: V --> serves

Semantic Grammars
An alternative to modifying syntactic grammars to deal with semantics too: define grammars directly in terms of the semantic information we want to extract.
–Domain specific: rules correspond directly to entities and activities in the domain, e.g. "I want to go from Boston to Baltimore on Thursday, September 24th"
–Greeting --> {Hello|Hi|Um…}
–TripRequest --> Need-spec travel-verb from City to City on Date
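To make the TripRequest rule concrete, here is a minimal sketch of semantic-grammar-style matching with a regular expression; the city and date lexicons, the slot names, and the parsing function are all hypothetical stand-ins, not the grammar formalism an actual system would use:

```python
import re

# Hypothetical toy lexicons standing in for the City and Date non-terminals.
CITY = r"(Boston|Baltimore|Denver)"
DATE = r"(Thursday|Friday|September \d+(?:st|nd|rd|th)?)"

# TripRequest --> Need-spec travel-verb from City to City on Date
TRIP_REQUEST = re.compile(
    rf"(I want|I need|I'd like) to (go|fly|travel) "
    rf"from {CITY} to {CITY} on {DATE}", re.IGNORECASE)

def parse_trip_request(utterance):
    """Return the semantic slots directly, skipping full syntactic parsing."""
    m = TRIP_REQUEST.search(utterance)
    if m is None:
        return None
    return {"from_city": m.group(3), "to_city": m.group(4), "date": m.group(5)}

print(parse_trip_request("I want to go from Boston to Baltimore on Thursday"))
# {'from_city': 'Boston', 'to_city': 'Baltimore', 'date': 'Thursday'}
```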

Predicting User Input
Rely on knowledge of the task and (sometimes) constraints on what the user can do; this can handle very sophisticated phenomena, e.g. "I want to go to Boston on Thursday. I want to leave from there on Friday for Baltimore."
–TripRequest --> Need-spec travel-verb from City on Date for City
–A dialogue postulate maps the filler of 'from-city' to the previously specified to-city

Priming User Input
Users tend to use the vocabulary they hear from the system: lexical entrainment (Brennan & Clark '96).
–Reference to objects: "the scary M&M man"
–Re-use of system prompt vocabulary/syntax: "Please tell me where you would like to leave/depart from." --> "Where would you like to leave/depart from?"
–Explicit training vs. implicit training
–Training the user vs. retraining the system

Drawbacks of Semantic Grammars
–Lack of generality: a new grammar for each application, at a large cost in development time
–Can be very large, depending on how much coverage you want them to have
–If users go outside the grammar, things may break disastrously: "I want to leave from my house at 10 a.m." "I want to talk to a person."

Information Retrieval
How is IR related to NLP?
–It operates on language (speech or text)
–Does it use linguistic information? Stemming, bag-of-words approaches, very simple analyses
–Does it make use of document formatting? Headlines, punctuation, captions
Basic terminology:
–Collection: a set of documents
–Term: a word or phrase
–Query: a set of terms

But…what is a term?
–Stop list
–Stemming
–Homonymy, polysemy, synonymy

Vector Space Model
Simple versions represent documents and queries as feature vectors, with one binary feature for each term in the collection: is term t in this document (or in this query) or not?
–D = (t1, t2, ..., tn)
–Q = (t1, t2, ..., tn)
Similarity metric: how many terms does a query share with each candidate document?
Weighted terms (term-by-document matrix):
–D = (wt1, wt2, ..., wtn)
–Q = (wt1, wt2, ..., wtn)
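A tiny sketch of the binary variant over a toy collection, assuming whitespace tokenization; similarity is just the count of shared terms:

```python
def binary_vector(text, vocabulary):
    """One 0/1 feature per vocabulary term: is the term present in the text?"""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]

docs = ["the cat sat on the mat", "dogs chase cats"]
query = "cat on mat"

vocab = sorted({w for d in docs for w in d.lower().split()})
q_vec = binary_vector(query, vocab)

# Similarity = number of terms the query shares with each candidate document.
for doc in docs:
    d_vec = binary_vector(doc, vocab)
    shared = sum(q * d for q, d in zip(q_vec, d_vec))
    print(shared, doc)   # 3 for the first doc, 0 for the second
```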

How do we compare the vectors?
–Normalize each term weight by the number of terms in the document: how important is each t in D?
–Compute the dot product between the vectors to see how similar they are
–Cosine of the angle between them: 1 = identical, 0 = no common terms
How do we get the weights?
–Term frequency (tf): how often does term i occur in doc j?
–Inverse document frequency (idf): # docs / # docs term i occurs in
–tf.idf weighting: the weight of term i in doc j is the product of its frequency in j and the log of its idf in the collection, i.e. w_ij = tf_ij * log(N / n_i)
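Putting the two slides together, a compact sketch of tf.idf weighting plus cosine comparison, again assuming whitespace tokenization and no stop list or stemming:

```python
import math
from collections import Counter

def tfidf_vector(tokens, vocab, df, n_docs):
    """w_ij = tf_ij * log(N / n_i), following the slide's weighting."""
    tf = Counter(tokens)
    return [tf[t] * math.log(n_docs / df[t]) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [d.lower().split() for d in
        ["the cat sat on the mat", "dogs chase cats", "the mat is red"]]
vocab = sorted({w for d in docs for w in d})
df = {t: sum(t in d for d in docs) for t in vocab}   # document frequency

query = "cat on the mat".split()
q_vec = tfidf_vector(query, vocab, df, len(docs))
for d in docs:
    score = cosine(q_vec, tfidf_vector(d, vocab, df, len(docs)))
    print(round(score, 3), " ".join(d))
```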

Evaluating IR Performance
–Precision: # relevant docs returned / total # docs returned -- how often are you right when you say a document is relevant?
–Recall: # relevant docs returned / # relevant docs in the collection -- how many of the relevant documents do you find?
–F-measure combines P and R
–Are P and R equally important?
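The definitions in code, with the general F-measure written out; beta answers the slide's closing question (beta = 1 weights P and R equally, beta > 1 favors recall). The document IDs are invented:

```python
def precision(returned, relevant):
    return len(returned & relevant) / len(returned)

def recall(returned, relevant):
    return len(returned & relevant) / len(relevant)

def f_measure(p, r, beta=1.0):
    # F = (1 + beta^2) * P * R / (beta^2 * P + R)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

returned = {"d1", "d2", "d3", "d4"}   # 4 docs returned, 2 of them relevant
relevant = {"d2", "d4", "d7"}         # 3 relevant docs exist in the collection
p, r = precision(returned, relevant), recall(returned, relevant)
print(p, r, round(f_measure(p, r), 3))   # 0.5 0.666... 0.571
```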

Improving Queries
–Relevance feedback: users rate the retrieved docs
–Query expansion, by many techniques: add the top N docs retrieved to the query and resubmit the expanded query; expand with WordNet
–Term clustering: cluster the rows of terms in the term-by-document matrix to produce synonyms, and add them to the query
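A sketch of the "add top N docs" idea (pseudo-relevance feedback) against a toy engine; the corpus, the overlap-based ranking, and the crude length-based stop-word filter are all illustrative:

```python
from collections import Counter

DOCS = ["the space shuttle launch was delayed",
        "nasa delayed the shuttle launch again",
        "the recipe calls for fresh basil"]

def search(query_terms):
    """Toy engine: rank documents by term overlap with the query."""
    q = set(query_terms)
    return sorted(DOCS, key=lambda d: -len(q & set(d.split())))

def expand_query(query_terms, n_docs=2, n_terms=2):
    """Pseudo-relevance feedback: treat the top n_docs hits as relevant
    and add their n_terms most frequent unseen terms to the query."""
    top = search(query_terms)[:n_docs]
    counts = Counter(w for d in top for w in d.split()
                     if w not in query_terms and len(w) > 3)
    return list(query_terms) + [t for t, _ in counts.most_common(n_terms)]

print(expand_query(["shuttle", "launch"]))
# ['shuttle', 'launch', 'delayed', 'space']
```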

IR Tasks
–Ad hoc retrieval: 'normal' IR
–Routing/categorization: assign a new doc to one of a predefined set of categories
–Clustering: divide a collection into N clusters
–Segmentation: segment text into coherent chunks
–Summarization: compress a text by extracting summary items or eliminating less relevant items
–Question answering: find a span of text (within some window) containing the answer to a question

Information Extraction
Another 'robust' alternative: 'extract' particular types of information from arbitrary text or transcribed speech.
–Example: named entities -- people, places, organizations, times, dates ("MIPS Vice President John Hime")
–MUC evaluations
–Domains: medical texts, broadcast news (terrorist reports), …

Appropriate Where Semantic Grammars and Syntactic Parsers Are Not
Appropriate where the information needed is very specific and specifiable in advance:
–Question answering systems, gisting of news or mail, …
–Job ads, financial information, terrorist attacks
The input is too complex and far-ranging to build semantic grammars, but full-blown syntactic parsers are impractical:
–Too much ambiguity in arbitrary text: 50 parses or none at all
–Too slow for real-time applications

Information Extraction Techniques
Often use a set of simple templates or frames with slots to be filled in from the input text, ignoring everything else:
–"My number is"
–"The inventor of the wiggleswort was Capt. John T. Hart."
–"The king died in March of"
Context (neighboring words, capitalization, punctuation) provides cues to help fill in the appropriate slots.
How to do better than everyone else?
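A sketch of template-style slot filling with surface patterns, modeled loosely on the examples above; the frame names, the patterns, and the completed test sentences (the slide's examples are truncated, so the fill-ins here are invented) are all hypothetical:

```python
import re

# One surface-pattern template per fact type; everything else is ignored.
TEMPLATES = {
    "inventor_of": re.compile(
        r"[Tt]he inventor of the (?P<invention>\w+) was "
        r"(?P<inventor>(?:[A-Z][a-z]*\.? ?)+)\."),
    "death_date": re.compile(
        r"[Tt]he (?P<person>\w+) died in (?P<month>[A-Z][a-z]+) of (?P<year>\d{4})"),
}

def extract(text):
    """Fill template slots from raw text; unmatched text is simply skipped."""
    facts = []
    for name, pattern in TEMPLATES.items():
        for m in pattern.finditer(text):
            facts.append((name, m.groupdict()))
    return facts

text = ("The inventor of the wiggleswort was Capt. John T. Hart. "
        "The king died in March of 1744.")
print(extract(text))
# [('inventor_of', {'invention': 'wiggleswort', 'inventor': 'Capt. John T. Hart'}),
#  ('death_date', {'person': 'king', 'month': 'March', 'year': '1744'})]
```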

The IE Process
Given a corpus and a target set of items to be extracted:
–Clean up the corpus
–Tokenize it
–Do some hand labeling of the target items
–Extract some simple features: POS tags, phrase chunks, …
–Do some machine learning to associate features with the target items, or derive this association by intuition
–Use e.g. FSTs, simple or cascaded, to iteratively annotate the input, eventually identifying the slot fillers
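To illustrate the last step, a toy two-stage cascade in which regular expressions stand in for FSTs: each stage wraps what it recognizes in a bracketed label, and later stages match over the earlier annotations. The labels, patterns, and example sentence are invented for illustration:

```python
import re

# Each stage annotates the output of the previous one (a crude FST cascade).
STAGES = [
    ("TITLE", re.compile(r"\b(?:Capt|Dr|Mr|Ms)\.")),
    ("NAME",  re.compile(r"\[TITLE [^\]]+\](?: [A-Z][a-z]+| [A-Z]\.)+")),
]

def annotate(text):
    """Run the cascade; each pass wraps its matches in a labeled bracket."""
    for label, pattern in STAGES:
        text = pattern.sub(lambda m: f"[{label} {m.group(0)}]", text)
    return text

print(annotate("Reports say Capt. John T. Hart led the raid."))
# Reports say [NAME [TITLE Capt.] John T. Hart] led the raid.
```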

Domain-Specific IE from the Web (Patwardhan & Riloff '06)
The problem:
–IE systems are typically domain-specific: a new extraction procedure for every task
–Supervised learning depends on hand annotation for training
Goals:
–Acquire domain-specific texts automatically from the Web
–Identify domain-specific IE patterns automatically
Approach:

–Start with a set of seed IE patterns learned from a hand-labeled corpus
–Use these to identify relevant documents on the Web
–Find new seed patterns in the retrieved documents

MUC-4 IE Task
Corpus:
–1700 news stories about terrorist events in Latin America
–Answer keys specifying the information that should be extracted
Problems:
–All upper case
–50% of the texts are irrelevant
–Stories may describe multiple events
Best results:
–50-70% precision and recall with hand-built components
–41-44% recall and 49-51% precision with automatically generated templates

Procedure
–Apply pre-defined syntactic patterns to a training corpus of documents for which relevant/irrelevant judgments are known
–Count how often partial lexicalizations of each pattern (e.g. "was killed") appear in relevant vs. irrelevant documents

–Rank patterns by their association with the domain (frequency in domain documents vs. non-domain documents)
–Manually review the patterns and assign thematic roles to those deemed useful: from 40K+ patterns --> 291
–Now find similar Web documents
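A sketch of the ranking step under simplifying assumptions: patterns are plain strings, and the score is a smoothed ratio of relevant to irrelevant document hits (the paper's actual scoring function may differ):

```python
from collections import Counter

def rank_patterns(patterns, relevant_docs, irrelevant_docs):
    """Score each pattern by how much more often it fires in relevant docs."""
    rel = Counter(p for d in relevant_docs for p in patterns if p in d)
    irr = Counter(p for d in irrelevant_docs for p in patterns if p in d)
    scores = {p: (rel[p] + 1) / (irr[p] + 1) for p in patterns}  # add-one smoothing
    return sorted(scores.items(), key=lambda kv: -kv[1])

patterns = ["was killed", "was bombed", "was elected"]
relevant = ["the mayor was killed in the attack", "the embassy was bombed"]
irrelevant = ["the senator was elected by a wide margin"]
print(rank_patterns(patterns, relevant, irrelevant))
# [('was killed', 2.0), ('was bombed', 2.0), ('was elected', 0.5)]
```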

Domain Corpus Creation
–Create IR queries by crossing the names of 5 terrorist organizations (e.g. Al Qaeda, IRA) with 16 terrorist activities (e.g. assassinated, bombed, hijacked, wounded) --> 80 queries
–Restricted to CNN, English-language documents
–Eliminated TV transcripts
–Yield from 2 runs: 6,182 documents
–Cleaned corpus: 5,618 documents
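The crossing step is just a Cartesian product; a sketch with abbreviated, partly invented lists (the paper's full lists give 5 x 16 = 80 queries):

```python
from itertools import product

organizations = ["Al Qaeda", "IRA"]                   # 5 names in the paper
activities = ["assassinated", "bombed", "hijacked"]   # 16 in the paper
queries = [f"{org} {act}" for org, act in product(organizations, activities)]
print(len(queries), queries[:3])
# 6 ['Al Qaeda assassinated', 'Al Qaeda bombed', 'Al Qaeda hijacked']
```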

Learning Domain-Specific Patterns
–Hypothesis: new extraction patterns that co-occur with seed patterns from the training corpus will also be associated with terrorism
–Generate all extraction patterns in the CNN corpus (147,712)
–Compute the correlation of each with the seed patterns, based on frequency of co-occurrence in the same sentence; keep those that occur with some seed more often than chance
–Rank the new patterns by their seed correlations
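One simple reading of "more often than chance", sketched over toy sentences: compare the observed co-occurrence rate against what independence would predict (the paper's exact correlation measure may differ):

```python
def cooccurrence_score(pattern, seed, sentences):
    """Ratio of observed co-occurrence to what independence would predict."""
    n = len(sentences)
    p_both = sum(pattern in s and seed in s for s in sentences) / n
    p_pat = sum(pattern in s for s in sentences) / n
    p_seed = sum(seed in s for s in sentences) / n
    if p_pat == 0 or p_seed == 0:
        return 0.0
    return p_both / (p_pat * p_seed)   # > 1 means more often than chance

sents = ["rebels attacked the village and three people were killed",
         "the village held a festival",
         "gunmen attacked the convoy and the driver was killed"]
print(cooccurrence_score("attacked", "killed", sents))  # 1.5 (> 1: keep)
```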

Filter: measure semantic affinity -- how often does a pattern extract an entity of a particular category (e.g. victim, target)?
–Compute the semantic affinity of each extraction pattern with respect to 6 categories: target and victim, plus the distractors perpetrator, organization, weapon, and other
–E.g. frequency of extracting target / frequency of extracting any of the 6 categories, weighted by the log probability of target
Highly Ranked Patterns
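A sketch of the affinity computation under one reading of the slide's formula (the share of a pattern's extractions that fall in the category, weighted by a log count); the pattern and its extraction counts are invented:

```python
import math

CATEGORIES = ["target", "victim", "perpetrator", "organization", "weapon", "other"]

def semantic_affinity(extraction_counts, category):
    """extraction_counts: {category: how often the pattern extracted it}."""
    total = sum(extraction_counts.get(c, 0) for c in CATEGORIES)
    count = extraction_counts.get(category, 0)
    if count == 0 or total == 0:
        return 0.0
    # Share of extractions in this category, weighted by log frequency.
    return (count / total) * math.log2(count)

# Hypothetical counts for a pattern like "<subject> was bombed":
counts = {"target": 12, "victim": 2, "other": 1}
print(round(semantic_affinity(counts, "target"), 2))   # 2.87
```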

Remove patterns not strongly associated with the desired classes, and evaluate on MUC-4.
–Baseline: Recall 64% / Precision 43% on targets; Recall 50% / Precision 52% on victims

Results for Web-Learned Patterns
–Use the 396 terrorism extraction patterns learned from the MUC training set as seeds
–Produce a ranked list of new patterns from the Web, using a semantic affinity threshold of 3.0
–Choose the top N (50-300) patterns to add to the seed set
–Performance:

Combining IR and IE for QA
–Information extraction: AQUA

Summary
Many approaches to 'robust' semantic analysis:
–Semantic grammars targeting particular domains:
Utterance --> Yes/No Reply
Yes/No Reply --> Yes-Reply | No-Reply
Yes-Reply --> {yes, yeah, right, ok, "you bet", …}
–Information extraction techniques targeting specific tasks, e.g. extracting information about terrorist events from news
–Information retrieval techniques --> more like NLP