CS 4705 Robust Semantics, Information Extraction, and Information Retrieval

Problems with Syntax-Driven Semantics
Syntactic structures often don't fit semantic structures very well:
– Important semantic elements are often distributed very differently in the trees for sentences that mean 'the same thing': I like soup. / Soup is what I like.
– Parse trees contain many structural elements not clearly important to making semantic distinctions.
– Syntax-driven semantic representations are sometimes quite verbose: V --> serves

Alternatives?
– Semantic Grammars
– Information Extraction
– Information Retrieval

Semantic Grammars
An alternative to modifying syntactic grammars to deal with semantics too: define grammars specifically in terms of the semantic information we want to extract.
– Domain specific: rules correspond directly to entities and activities in the domain
  I want to go from Boston to Baltimore on Thursday, September 24th
– Greeting --> {Hello | Hi | Um…}
– TripRequest --> Need-spec travel-verb from City to City on Date
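To make the rule concrete, here is a minimal sketch of a semantic-grammar matcher in Python. The mini-lexicons for City and Date, the slot names, and the rule wording are illustrative assumptions, not part of the original slides.

```python
import re

# Hypothetical mini-lexicons standing in for the grammar's terminal classes.
CITIES = r"(?P<{slot}>Boston|Baltimore|Denver)"
DATE = (r"(?P<date>(?:Monday|Tuesday|Wednesday|Thursday|Friday)"
        r"(?:,?\s+\w+\s+\d{1,2}(?:st|nd|rd|th)?)?)")

# TripRequest --> Need-spec travel-verb from City to City on Date
TRIP_REQUEST = re.compile(
    r"I\s+(?:want|need|would like)\s+to\s+(?:go|travel|fly)\s+"
    r"from\s+" + CITIES.format(slot="from_city") +
    r"\s+to\s+" + CITIES.format(slot="to_city") +
    r"\s+on\s+" + DATE,
    re.IGNORECASE,
)

def parse_trip_request(utterance: str):
    """Return the semantic frame for a TripRequest, or None if the rule fails."""
    m = TRIP_REQUEST.search(utterance)
    return m.groupdict() if m else None

print(parse_trip_request(
    "I want to go from Boston to Baltimore on Thursday, September 24th"))
# {'from_city': 'Boston', 'to_city': 'Baltimore', 'date': 'Thursday, September 24th'}
```

Note how the rule yields the semantic frame directly: there is no intermediate parse tree to map into a meaning representation.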

Predicting User Input
Semantic grammars rely upon knowledge of the task and (sometimes) constraints on what the user can do, and when.
– This allows them to handle fairly sophisticated phenomena:
  I want to go to Boston on Thursday. I want to leave from there on Friday for Baltimore.
  TripRequest --> Need-spec travel-verb from City on Date for City
– A dialogue postulate maps the filler for 'from-city' to the pre-specified from-city.

Priming User Input
– Users will tend to use the vocabulary they hear from the system.
– Explicit training vs. implicit training
– Training the user vs. retraining the system

Drawbacks of Semantic Grammars
– Lack of generality: a new grammar is needed for each application, at a large cost in development time.
– Grammars can be very large, depending on how much coverage you want.
– If users go outside the grammar, things may break disastrously:
  I want to leave from my house.
  I want to talk to someone human.

Some Examples: Semantic Grammars

Information Extraction
Another 'robust' alternative. Idea: 'extract' particular types of information from arbitrary text or transcribed speech.
Examples:
– Named entities: people, places, organizations, times, dates
  MIPS Vice President John Hime
– MUC evaluations
– Domains: medical texts, broadcast news (terrorist reports), voicemail, ...

Appropriate where Semantic Grammars and Syntactic Parsers are Not
Appropriate where the information needs are very specific and specifiable in advance:
– Question answering systems, gisting of news or mail, …
– Job ads, financial information, terrorist attacks
Input is too complex and far-ranging to build semantic grammars, but full-blown syntactic parsers are impractical:
– Too much ambiguity for arbitrary text (50 parses, or none at all)
– Too slow for real-time applications

Information Extraction Techniques
Often use a set of simple templates or frames with slots to be filled in from the input text, ignoring everything else:
– My number is …
– The inventor of the wiggleswort was Capt. John T. Hart.
– The king died in March of …
Context (neighboring words, capitalization, punctuation) provides cues to help fill in the appropriate slots, as in the sketch below.
How do we do better than everyone else?
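As an illustration of template-based slot filling, here is a minimal sketch in Python. The slot names, patterns, and example message are assumptions for illustration; real IE systems use much richer contextual cues.

```python
import re

# Hypothetical slot templates: each maps a slot name to a contextual pattern.
TEMPLATES = {
    "phone_number": re.compile(r"my number is\s+((?:\d[\s.-]?){7,11})", re.I),
    "death_date":   re.compile(r"died in\s+(\w+(?:\s+of\s+\d{4})?)", re.I),
}

def fill_slots(text: str) -> dict:
    """Fill whatever slots match; ignore everything else in the text."""
    frame = {}
    for slot, pattern in TEMPLATES.items():
        m = pattern.search(text)
        if m:
            frame[slot] = m.group(1).strip()
    return frame

print(fill_slots("Hi, it's Ann. My number is 212 555 0123, call me back."))
# {'phone_number': '212 555 0123'}
```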

The IE Process
Given a corpus and a target set of items to be extracted:
– Clean up the corpus and tokenize it
– Do some hand labeling of target items
– Extract some simple features (POS tags, phrase chunks, …)
– Do some machine learning to associate features with target items, or derive this association by intuition
– Use e.g. FSTs, simple or cascaded, to iteratively annotate the input, eventually identifying the slot fillers (see the sketch below)
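Here is a minimal sketch of the cascaded idea: successive passes annotate the input, and later passes build on earlier annotations. The bracketing convention and the two toy passes are assumptions, not an actual FST toolkit.

```python
import re

# Pass 1: mark digit strings as numbers.
def tag_numbers(text: str) -> str:
    return re.sub(r"\b\d{3}[\s-]?\d{4}\b",
                  lambda m: f"[NUM {m.group(0)}]", text)

# Pass 2: build on pass-1 annotations; a NUM preceded by a
# lexical cue becomes a PHONE slot filler.
def tag_phones(text: str) -> str:
    return re.sub(r"(?:number is|call me at)\s+\[NUM ([^\]]+)\]",
                  lambda m: f"[PHONE {m.group(1)}]", text)

msg = "hi it's me my number is 555 0123 talk to you later"
print(tag_phones(tag_numbers(msg)))
# hi it's me my [PHONE 555 0123] talk to you later
```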

IE in SCANMail: Audio Browsing and Retrieval for Voicemail
Motivated by interviews, surveys, and usage logs identifying the problems of heavy voicemail users:
– It's hard to quickly scan through new messages to find the ones you need to deal with (e.g. during a meeting break)
– It's hard to find the message you want in your archive
– It's hard to locate the information you want in any message (e.g. the telephone number, caller name)

SCANMail Architecture
(Architecture diagram: Caller → SCANMail → Subscriber)

Corpus Details
– Recordings collected from 138 voicemail boxes of AT&T Labs employees
– 100 hours; 10,000 messages; 2,500 speakers
– Gender balanced; 12% non-native speakers
– Mean message duration 36.4 secs, median 30.0 secs
– Hand-transcribed and annotated with caller id, gender, age, and entity demarcation (names, dates, telephone numbers)

Transcription
gender: F    age: A    caller_name: NA    native_speaker: N    speech_pathology: N    sample_rate: 8000
label: " [ Greeting: hi R__ ] [ CallerID: it's me ] give me a call [ um ] right away cos there's [.hn ] I guess there's some [.hn ] change [ Date: tomorrow ] with the nursery school and they [ um ] [.hn ] anyway they had this idea [ cos ] since I think J__'s the only one staying [ Date: tomorrow ] for play club so they wanted to they suggested that [.hn ] well J2__ actually offered to take J__ home with her and then would she would meet you back at the synagogue at [ Time: five thirty ] to pick her up [.hn ] [ uh ] so I don't know how you feel about that otherwise Miriam and one other teacher would stay and take care of her till [ Date: five thirty tomorrow ] but if you [.hn ] I wanted to know how you feel before I tell her one way or the other so call me [.hn ] right away cos I have to get back to her in about an hour so [.hn ] okay [ Closing: bye [.nhn ] [.onhk ] ]"
duration: "50.3 seconds"

Demo
mail/demo.html
Audix extension: 8380
Audix password: (null)
SCANMail demo

SCANMail Demo: Number Extraction

SCANMail Access Devices
– PC
– Pocket PC
– Dataphone
– Voice Phone
– Flash

Finding Phone Numbers and Caller IDs (Jansche & Abney '02)
Goals: extract key information from messages (phone numbers, caller self-identifications) to present in headers, working from ASR transcripts.
Approach: supervised learning from transcripts
– Hand-crafted rules (good recall) propose candidates
– A statistical classifier (decision tree) prunes bad candidates
– Features exploit the structure of key elements (e.g. the length of phone numbers) and the surrounding context (e.g. self-ids occur at the beginning of the msg)

Location is key
Predict 1 = phrase begin, 2 = in phrase, 3 = neither.
Phone numbers:
– Rules convert candidates to a standard digit format
– Predict the start with rules and prune with the classifier
– Features: position in the msg, lexical cues, and the length of the digit string (0.94 F on human-labeled transcripts; 0.95 F on ASR)
Self-ids:
– Predict the start (97% begin 1-7 words into the msg) and then the length of the phrase (majority 2-4 words)
– Avoids the risk of relying on correct recognition of names
– Good lexical cues to the end of the phrase ('I', 'could', 'please') (0.71 F on human-labeled; 0.70 F on ASR)
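A minimal sketch of the propose-then-prune pattern described above. The digit-run rule, the features, and the pruning thresholds are illustrative assumptions, not the actual Jansche & Abney system.

```python
def propose_candidates(words):
    """High-recall rule: any maximal run of spoken digits is a candidate."""
    DIGITS = {"oh": "0", "zero": "0", "one": "1", "two": "2", "three": "3",
              "four": "4", "five": "5", "six": "6", "seven": "7",
              "eight": "8", "nine": "9"}
    candidates, run, start = [], [], None
    for i, w in enumerate(words + ["<end>"]):   # sentinel flushes a final run
        if w in DIGITS:
            if not run:
                start = i
            run.append(DIGITS[w])
        elif run:
            candidates.append((start, "".join(run)))
            run = []
    return candidates

def prune(candidates, msg_len):
    """Toy classifier stand-in: keep candidates whose features look like
    phone numbers (plausible digit-string length; lexical position in msg)."""
    return [(pos, num) for pos, num in candidates
            if len(num) in (7, 10) and pos > 0.1 * msg_len]

words = ("hi it's me call me back at five five five oh one two three "
         "okay bye").split()
print(prune(propose_candidates(words), len(words)))   # [(7, '5550123')]
```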

Information Retrieval
How is it related to NLP?
– It operates on language (speech or text)
– Does it use linguistic information? Stemming; a bag-of-words approach; very simple analyses
– Does it make use of document formatting? Headlines, punctuation, captions
Terminology:
– Collection: a set of documents
– Term: a word or phrase
– Query: a set of terms

But… what is a term?
– Stop list
– Stemming
– Homonymy, polysemy, synonymy
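A minimal sketch of term normalization under the first two ideas. The stop list and the crude suffix-stripping stemmer are toy assumptions; real systems use fuller resources such as the Porter stemmer.

```python
# Toy stop list and suffix-stripping stemmer for illustration only.
STOP = {"the", "a", "an", "of", "is", "to", "and", "in"}

def stem(word: str) -> str:
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def terms(text: str) -> list[str]:
    return [stem(w) for w in text.lower().split() if w not in STOP]

print(terms("The walking of the dogs is delayed"))
# ['walk', 'dog', 'delay']
```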

Vector Space Model
Simple versions represent documents and queries as feature vectors, with one binary feature for each term in the collection: is term t in this document (or in this query) or not?
D = (t1, t2, …, tn)
Q = (t1, t2, …, tn)
Similarity metric: how many terms does a query share with each candidate document?
Weighted terms: a term-by-document matrix
D = (wt1, wt2, …, wtn)
Q = (wt1, wt2, …, wtn)

How do we compare the vectors?
– Normalize each term weight by the number of terms in the document: how important is each t in D?
– Compute the dot product between vectors to see how similar they are
– Cosine of the angle: 1 = identical; 0 = no common terms
How do we get the weights?
– Term frequency (tf): how often does t occur in D?
– Inverse document frequency (idf): # docs / # docs term t occurs in
– tf·idf weighting: the weight of term i in doc j is the frequency of i in j times the log of i's idf in the collection
(A small tf·idf and cosine sketch follows below.)
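A minimal sketch of tf·idf weighting and cosine ranking over a toy collection, following the definitions above. The documents and query are assumptions for illustration.

```python
import math
from collections import Counter

docs = ["soup is what I like", "I like hot soup", "stocks fell sharply"]
query = "hot soup"

def tfidf_vector(text, docs):
    tf = Counter(text.lower().split())
    N = len(docs)
    vec = {}
    for term, freq in tf.items():
        df = sum(1 for d in docs if term in d.lower().split())
        if df:  # idf = log(N / df), per the slide's definition
            vec[term] = freq * math.log(N / df)
    return vec

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q = tfidf_vector(query, docs)
for d in docs:
    print(f"{cosine(q, tfidf_vector(d, docs)):.3f}  {d}")
# The second document, which shares both query terms, ranks highest.
```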

Evaluating IR Performance
– Precision: # relevant docs returned / total # docs returned (how often are you right when you say this document is relevant?)
– Recall: # relevant docs returned / # relevant docs in the collection (how many of the relevant documents do you find?)
– The F-measure combines P and R. Are P and R equally important?
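For reference, a sketch of these metrics, including the weighted F-measure that addresses the 'equally important?' question: beta weighs recall relative to precision, and beta = 1 gives the usual harmonic mean. The example counts are assumptions.

```python
def precision_recall_f(returned, relevant, beta=1.0):
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    p = hits / len(returned) if returned else 0.0
    r = hits / len(relevant) if relevant else 0.0
    b2 = beta * beta
    f = (1 + b2) * p * r / (b2 * p + r) if (p + r) else 0.0
    return p, r, f

# Toy example: the system returns docs 1-4; docs 2, 4, 7 are actually relevant.
print(precision_recall_f({1, 2, 3, 4}, {2, 4, 7}))
# (0.5, 0.666..., 0.571...)
```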

Improving Queries
– Relevance feedback: users rate retrieved docs
– Query expansion (many techniques): e.g. add the top N docs retrieved to the query and resubmit the expanded query
– Term clustering: cluster rows of terms to produce synonyms and add them to the query
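One classic way to implement relevance feedback, not named on the slide, is Rocchio's algorithm: move the query vector toward the rated-relevant documents and away from the non-relevant ones. A minimal sketch over sparse dict vectors, with conventional default weights.

```python
def rocchio(query_vec, relevant, nonrelevant,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback over dict-based sparse term vectors."""
    terms = set(query_vec)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        rel = (sum(d.get(t, 0.0) for d in relevant) / len(relevant)
               if relevant else 0.0)
        non = (sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
               if nonrelevant else 0.0)
        w = alpha * query_vec.get(t, 0.0) + beta * rel - gamma * non
        if w > 0:   # conventionally, negative weights are clipped to zero
            new_q[t] = w
    return new_q

q = {"soup": 1.0}
rel = [{"soup": 0.8, "hot": 0.6}]
non = [{"stocks": 0.9}]
print(rocchio(q, rel, non))   # {'soup': 1.6, 'hot': 0.45} (key order may vary)
```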

IR Tasks
– Ad hoc retrieval: 'normal' IR
– Routing/categorization: assign a new doc to one of a predefined set of categories
– Clustering: divide a collection into N clusters
– Segmentation: segment text into coherent chunks
– Summarization: compress a text by extracting summary items
– Question answering: find a stretch of text containing the answer to a question

Combining IR and IE for QA
(Slide figure: information extraction)

Summary
Many approaches to 'robust' semantic analysis:
– Semantic grammars targeting particular domains
  Utterance --> Yes/No Reply
  Yes/No Reply --> Yes-Reply | No-Reply
  Yes-Reply --> {yes, yeah, right, ok, "you bet", …}
– Information extraction techniques targeting specific tasks, e.g. extracting information about terrorist events from news
– Information retrieval techniques --> becoming more like NLP