ADGEN: Advanced Generation for Question Answering. Kevin Knight and Daniel Marcu, USC/Information Sciences Institute.

Presentation transcript:

ADGEN: Advanced Generation for Question Answering
Kevin Knight and Daniel Marcu
USC/Information Sciences Institute

Natural Language Generation for QA
Analysts create documents for other analysts; machines should also create documents for analysts. The goal is to produce new texts that:
– contain useful answers and ancillary material
– are brief
– are coherent at the text level
– are grammatical at the sentence level
These goals conflict, but we have no principled ways of reasoning about these trade-offs.

ADGEN Research Focus
Of the myriad variations of a text that the machine might produce for an analyst, only a fraction are coherent. What makes a text coherent? New approach:
– We have millions of examples of coherent texts
– We can validate ideas empirically, develop models
– We can train models automatically

Word-Level Language Models
Given an unordered bag of words, assign an order that yields a grammatical, sensible sentence. For example, given:
"any aware company interest isn't it of said takeover the"
produce:
"the company said it isn't aware of any takeover interest"
No algorithm for this "bag generation" task appears in linguistics texts, nor can one easily assemble an algorithm using published results as subroutines!
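To make the bag-generation task concrete, here is a minimal sketch (not the ADGEN system itself) of one way to attack it: a beam search over partial word orderings, scored by an add-one-smoothed bigram language model. The toy training corpus, the `order_bag` and `bigram_logprob` names, and the beam width are all illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus; in practice such models are trained on far larger collections.
toy_corpus = [
    "the company said it isn't aware of any takeover interest",
    "the company said it completed the takeover",
    "industry sources said the company isn't aware of the interest",
]

unigrams, bigrams = Counter(), Counter()
for line in toy_corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
V = len(unigrams)

def bigram_logprob(prev, word):
    """Add-one smoothed log P(word | prev)."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + V))

def order_bag(bag, beam_width=20):
    """Beam search over partial orderings of the bag, scored by the bigram model."""
    beam = [((), tuple(bag), 0.0)]                # (sequence, remaining words, score)
    for _ in range(len(bag)):
        candidates = []
        for seq, remaining, score in beam:
            prev = seq[-1] if seq else "<s>"
            for i, word in enumerate(remaining):
                candidates.append((
                    seq + (word,),
                    remaining[:i] + remaining[i + 1:],
                    score + bigram_logprob(prev, word),
                ))
        candidates.sort(key=lambda c: c[2], reverse=True)
        beam = candidates[:beam_width]
    # Close each finished hypothesis with the end-of-sentence transition.
    best = max(beam, key=lambda c: c[2] + bigram_logprob(c[0][-1], "</s>"))
    return " ".join(best[0])

bag = "any aware company interest isn't it of said takeover the".split()
print(order_bag(bag))   # with this toy data the search usually recovers the original ordering
```

The point is only that bag generation can be cast as search plus scoring; the quality of the result then depends entirely on the language model doing the scoring.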

Word-Level Language Models
Even if linguistic syntactic grammars were widely available, they would not distinguish between sensible sentences and nonsense ones, e.g.:
"the takeover said it isn't aware of any interest company"
However, statistical n-gram models (and other lexicalized models) perform surprisingly well by incorporating both syntactic and semantic constraints.
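A small, self-contained illustration of that claim, under an assumed toy corpus and add-one smoothing: the attested ordering accumulates bigrams that actually occur in text, while the scrambled version is forced through many unseen transitions and scores lower.

```python
import math
from collections import Counter

toy_corpus = [
    "the company said it completed the acquisition",
    "sources said the company isn't aware of any takeover interest",
]
unigrams, bigrams = Counter(), Counter()
for line in toy_corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
V = len(unigrams)

def sentence_logprob(sentence):
    """Sum of add-one smoothed bigram log-probabilities over the sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(p, w)] + 1) / (unigrams[p] + V))
        for p, w in zip(tokens, tokens[1:])
    )

sensible = "the company said it isn't aware of any takeover interest"
nonsense = "the takeover said it isn't aware of any interest company"
print(sentence_logprob(sensible))   # higher (less negative) score
print(sentence_logprob(nonsense))   # lower score: many unattested transitions
```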

Why care about bag generation?
– It's an acid test for any theory of language use.
– We can automatically generate problem instances.
– We can automatically evaluate proposed algorithms.
– Good solutions are directly applicable to answer generation/aggregation problems.
– Good solutions are also directly applicable to word-ordering problems in statistical machine translation (SMT) and meaning-to-text generation.

Text-Level Language Models
Given an unordered bag of answers/clauses/sentences, assign an order that yields a coherent text.
Typical discourse study: "if we scramble sentences in an English document, the result is not coherent, so text has structure…" Let's do something about it!
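As a toy sketch of text-level ordering (an assumption-laden stand-in, not the trained models described on the later slides), one can search over permutations of a handful of sentences and prefer the permutation whose adjacent sentences share the most content words.

```python
import re
from itertools import permutations

def overlap(a, b):
    """Crude adjacency score: number of content words shared by two sentences."""
    stop = {"the", "a", "an", "of", "and", "is", "it", "in", "to"}
    words = lambda s: set(re.findall(r"[a-z']+", s.lower())) - stop
    return len(words(a) & words(b))

def order_sentences(sentences):
    """Exhaustive search (feasible for a handful of sentences) for the best ordering."""
    return max(
        permutations(range(len(sentences))),
        key=lambda perm: sum(overlap(sentences[i], sentences[j])
                             for i, j in zip(perm, perm[1:])),
    )

shuffled = [
    "It costs about two million dollars.",
    "Revlon completed an acquisition of a cosmetics business.",
    "The cosmetics business includes worldwide rights to several brands.",
]
# Note: word overlap is symmetric, so it cannot say which of two sentences
# should come first; trained models use richer, directional features.
print(order_sentences(shuffled))   # indices of a proposed ordering
```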

Sample Problem
1. Terms weren't disclosed, but industry sources said the price was about $2.5 million.
2. Revlon is a cosmetics concern, and Beecham is a pharmaceutical concern.
3. Revlon Group Inc. said it completed the acquisition of the U.S. cosmetics business of Germaine Monteil Cosmetiques Corp., a unit of London-based Beecham Group PLC.
4. The sale includes the rights to Germaine Monteil in North and South America and in the Far East, as well as the worldwide rights to the Diane Von Furstenberg cosmetics and fragrance lines and U.S. distribution rights to Lancaster beauty products.

Sample Problem (solution)
Correct order: 3, 1, 4, 2

Is this problem too hard? People can do it.
News articles 2-10 sentences long:
– 50%: re-ordering matches the original
– 40%: one sentence out of place
– 10%: large mismatches, but judges preferred the original
Debriefings are very useful for getting insight.
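For automatic scoring of such re-orderings, one commonly used measure (not necessarily the one behind the percentages above) is Kendall's tau over sentence positions: +1 means identical order, -1 means fully reversed. A sketch, using the sample problem's reference order:

```python
from itertools import combinations

def kendall_tau(reference, proposed):
    """Kendall's tau between two orderings of the same items."""
    pos = {item: i for i, item in enumerate(proposed)}
    pairs = list(combinations(reference, 2))
    concordant = sum(1 for a, b in pairs if pos[a] < pos[b])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)

reference = [3, 1, 4, 2]          # original document order (sample problem)
proposed  = [3, 4, 1, 2]          # e.g. a human or system re-ordering
print(kendall_tau(reference, reference))   # 1.0: exact match
print(kendall_tau(reference, proposed))    # < 1.0: one adjacent pair swapped
```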

Models have multiple applications
Both word-level ordering and text-level ordering models apply to: Machine Translation, Meaning-to-Text Generation, Multi-document Summarization, Essay Grading, and possibly others.

Redundancy
A model of text coherence must deal with redundancy. This text is not coherent:
"Revlon Group Inc. said it completed the acquisition of the U.S. cosmetics business of Germaine Monteil Cosmetiques Corp. Terms weren't disclosed, but industry sources said the price was about $2.5 million. The sale includes the rights to Germaine Monteil in North and South America. Terms were not disclosed by either party. Revlon is a cosmetics concern, and Beecham is a pharmaceutical concern, and neither elected to disclose the terms of the acquisition."
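A rough sketch of one signal a coherence model could use to catch this kind of redundancy: pairwise lexical similarity between sentences, here plain cosine over bag-of-words counts. The tokenizer and the idea of thresholding the scores are illustrative assumptions standing in for the trained features listed on the Methods slide.

```python
import math
import re
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two bag-of-words count vectors."""
    tokenize = lambda s: re.findall(r"[a-z0-9']+", s.lower())
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

sentences = [
    "Terms weren't disclosed, but industry sources said the price was about $2.5 million.",
    "Terms were not disclosed by either party.",
    "Revlon is a cosmetics concern, and Beecham is a pharmaceutical concern.",
]
# Print every pairwise similarity; a coherence model could penalize texts
# containing sentence pairs whose similarity exceeds some tuned threshold.
for (i, a), (j, b) in combinations(enumerate(sentences), 2):
    print(i, j, round(cosine(a, b), 2))
```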

Contradiction
A model of text coherence must deal with contradiction. This text is not coherent:
"Revlon Group Inc. said it completed the acquisition of the U.S. cosmetics business of Germaine Monteil Cosmetiques. Terms weren't disclosed, but industry sources said the price was about $2.5 million. Revlon said it paid $2.2 million for Germaine Monteil."
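As a deliberately naive illustration (not the ADGEN method), the conflicting prices in the example above can be surfaced by extracting the dollar amounts and noticing that the text quotes more than one figure for what reads as the same acquisition.

```python
import re

sentences = [
    "Terms weren't disclosed, but industry sources said the price was about $2.5 million.",
    "Revlon said it paid $2.2 million for Germaine Monteil.",
]

# Collect (sentence index, dollar amount) pairs from the text.
amounts = []
for i, s in enumerate(sentences):
    for m in re.findall(r"\$[\d.]+ million", s):
        amounts.append((i, m))

# If the text mentions more than one distinct amount, flag a possible conflict.
if len({m for _, m in amounts}) > 1:
    print("possible contradiction:", amounts)
```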

Methods
Modeling of data in a one-billion-word corpus of English, as well as in topical multi-document collections:
– generative stories of how text gets produced
– probability values that combine naturally with each other
– strong local constraints expressed as conditional probabilities
– automatic training procedures
– statistical perplexity as a measure of how well the model fits the data (see the sketch after this list)
Features:
– word correlations, cue-phrase patterns, syntactic patterns, tense-specific patterns, semantic WordNet-based patterns, coreference patterns
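Here is a minimal sketch of the perplexity measure mentioned in the list, for an assumed add-one-smoothed bigram model trained on a tiny stand-in corpus; lower perplexity means the model finds the text less surprising.

```python
import math
from collections import Counter

train = [
    "revlon said it completed the acquisition of the cosmetics business",
    "terms were not disclosed but sources said the price was about two million",
]
heldout = ["revlon said the acquisition was completed"]

unigrams, bigrams = Counter(), Counter()
for line in train:
    tokens = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
V = len(unigrams)

def perplexity(lines):
    """exp of the average negative log-likelihood per predicted token."""
    total_logprob, total_tokens = 0.0, 0
    for line in lines:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            total_logprob += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + V))
            total_tokens += 1
    return math.exp(-total_logprob / total_tokens)

print(perplexity(train))     # the model fits its own training data relatively well
print(perplexity(heldout))   # typically higher: held-out text is harder to predict
```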

ADGEN in AQUAINT
1. Answer generation
– Input: collection of text fragments (including phrases and paragraphs)
– Fuse phrases into sentences, order sentences to form millions of possible texts
– Rank and select the most coherent presentation (sketched below)
2. Text improvement
– Input: existing text
– Apply probabilistic rewriting operations
– Select the rewrite that most improves coherence without sacrificing any of the basic material
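A schematic sketch of the "rank and select" step under strong simplifying assumptions: enumerate candidate orderings of the fused sentences and keep the one a coherence function scores highest. The word-overlap scorer here is a placeholder for the trained coherence models from the earlier slides.

```python
from itertools import permutations

def coherence_score(text_sentences):
    """Toy coherence: reward word overlap between adjacent sentences."""
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))
    return sum(overlap(a, b) for a, b in zip(text_sentences, text_sentences[1:]))

fragments = [
    "Revlon Group Inc. said it completed the acquisition.",
    "The acquisition includes worldwide rights to several cosmetics lines.",
    "Terms of the acquisition weren't disclosed.",
]

candidates = [list(p) for p in permutations(fragments)]   # millions, in the real task
best = max(candidates, key=coherence_score)
print("\n".join(best))
```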