ADGEN USC/ISI ADGEN: Advanced Generation for Question Answering Kevin Knight and Daniel Marcu USC/Information Sciences Institute.

1 ADGEN: Advanced Generation for Question Answering
Kevin Knight and Daniel Marcu
USC/Information Sciences Institute

2 Natural Language Generation for QA
Analysts create documents for other analysts; machines should also create documents for analysts.
Goal is to produce new texts that:
–contain useful answers and ancillary material
–are brief
–are coherent at the text level
–are grammatical at the sentence level
These goals conflict, but we have no principled ways of reasoning about these trade-offs.

3 ADGEN Research Focus
Of the myriad variations of a text that the machine might produce for an analyst, only a fraction are coherent. What makes a text coherent?
New Approach:
–We have millions of examples of coherent texts
–We can validate ideas empirically, develop models
–We can train models automatically

4 Word-Level Language Models
Given an unordered bag of words, assign an order that yields a grammatical, sensible sentence.
For example, given: “any aware company interest isn't it of said takeover the”
Produce: “the company said it isn't aware of any takeover interest”
No algorithm for this “bag generation” task appears in linguistics texts, nor can one easily assemble an algorithm using published results as subroutines!

5 Word-Level Language Models
Even if linguistic syntactic grammars were widely available, they would not distinguish between sensible sentences and nonsense ones, e.g.:
“the takeover said it isn't aware of any interest company”
However, statistical n-gram models (and other lexicalized models) perform surprisingly well by incorporating both syntactic and semantic constraints.
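
As a rough illustration of how an n-gram model can separate the sensible sentence from the nonsense one, here is a minimal sketch of a bigram model with add-one smoothing. The three-sentence training corpus and the names `CORPUS`, `train_bigrams`, and `log_prob` are assumptions made for this example; the slides do not specify the model or its training data.

```python
import math
from collections import Counter

# Toy training corpus: an assumption for illustration only. Real n-gram
# models are trained on far larger collections of text.
CORPUS = [
    "the company said it isn't aware of any takeover interest",
    "the company said it is aware of the takeover",
    "the board said it isn't aware of any interest",
]

def train_bigrams(sentences):
    """Count bigrams and their history words over whitespace-split sentences."""
    history_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        history_counts.update(tokens[:-1])             # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs
        vocab.update(tokens[1:])                       # words that can be predicted
    return history_counts, bigram_counts, len(vocab)

def log_prob(sentence, model):
    """Add-one smoothed bigram log-probability of a sentence."""
    history_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigram_counts[(prev, word)] + 1) / (history_counts[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )

MODEL = train_bigrams(CORPUS)
SENSIBLE = "the company said it isn't aware of any takeover interest"
NONSENSE = "the takeover said it isn't aware of any interest company"

print(log_prob(SENSIBLE, MODEL), log_prob(NONSENSE, MODEL))
```

On this toy data the sensible ordering receives the higher log-probability, because pairs such as "takeover said" and "interest company" never occur in the training sentences.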

6 Why care about bag generation?
It’s an acid test for any theory of language use.
We can automatically generate problem instances.
We can automatically evaluate proposed algorithms.
Good solutions are directly applicable to answer generation/aggregation problems.
Good solutions are also directly applicable to word-ordering problems in statistical machine translation (SMT) and meaning-to-text generation.
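
The points about automatic problem generation and automatic evaluation can be sketched as follows. This is an illustrative setup, not the authors' evaluation code: `make_instance` and `exact_match_rate` are hypothetical names, and exact match against the original sentence is only one possible scoring criterion.

```python
import random

def make_instance(sentence, rng):
    """Build one bag-generation problem: a shuffled bag plus the reference order."""
    reference = sentence.split()
    bag = reference[:]
    rng.shuffle(bag)
    return bag, reference

def exact_match_rate(order_fn, sentences, seed=0):
    """Fraction of sentences whose original word order order_fn recovers exactly."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    hits = 0
    for sentence in sentences:
        bag, reference = make_instance(sentence, rng)
        if order_fn(bag) == reference:
            hits += 1
    return hits / len(sentences)

# A deliberately weak baseline: put the words in alphabetical order.
def alphabetical(bag):
    return sorted(bag)

print(exact_match_rate(alphabetical, ["a b c", "b a c"]))
```

Any ordinary sentence yields a problem instance with a known answer, so no manual annotation is needed to compare candidate ordering algorithms.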

7 Text-Level Language Models
Given an unordered bag of answers/clauses/sentences/, assign an order that yields a coherent text.
Typical discourse study: “if we scramble sentences in an English document, the result is not coherent, so text has structure…”
Let’s do something about it!

8 Sample Problem
1. Terms weren't disclosed, but industry sources said the price was about $2.5 million.
2. Revlon is a cosmetics concern, and Beecham is a pharmaceutical concern.
3. Revlon Group Inc. said it completed the acquisition of the U.S. cosmetics business of Germaine Monteil Cosmetiques Corp., a unit of London-based Beecham Group PLC.
4. The sale includes the rights to Germaine Monteil in North and South America and in the Far East, as well as the worldwide rights to the Diane Von Furstenberg cosmetics and fragrance lines and U.S. distribution rights to Lancaster beauty products.

9 Sample Problem
1. Terms weren't disclosed, but industry sources said the price was about $2.5 million.
2. Revlon is a cosmetics concern, and Beecham is a pharmaceutical concern.
3. Revlon Group Inc. said it completed the acquisition of the U.S. cosmetics business of Germaine Monteil Cosmetiques Corp., a unit of London-based Beecham Group PLC.
4. The sale includes the rights to Germaine Monteil in North and South America and in the Far East, as well as the worldwide rights to the Diane Von Furstenberg cosmetics and fragrance lines and U.S. distribution rights to Lancaster beauty products.
Correct order: 3, 1, 4, 2

10 Is this problem too hard?
People can do it. News articles 2-10 sentences long:
–50%: re-ordering matches original
–40%: one sentence out of place
–10%: large mismatches, but judges preferred original
Debriefings are very useful for getting insight.
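
The three outcome categories above could be computed automatically along these lines. The slide does not say how "one sentence out of place" was judged, so the definition used here (the two orders agree once a single sentence is removed from both) is an assumption, and `classify_reordering` is a hypothetical name.

```python
def classify_reordering(proposed, original):
    """Compare a proposed sentence order with the original order.

    Sentences are represented by unique identifiers, e.g. their numbers.
    """
    if sorted(proposed) != sorted(original):
        raise ValueError("both orders must contain the same sentences")
    if proposed == original:
        return "match"
    # Assumed definition: one sentence is out of place if deleting it from
    # both orders leaves the remaining sentences in identical sequence.
    for sentence in original:
        rest_proposed = [s for s in proposed if s != sentence]
        rest_original = [s for s in original if s != sentence]
        if rest_proposed == rest_original:
            return "one out of place"
    return "large mismatch"

original = [3, 1, 4, 2]  # correct order from the sample problem
print(classify_reordering([3, 1, 4, 2], original))
print(classify_reordering([1, 3, 4, 2], original))
print(classify_reordering([2, 4, 1, 3], original))
```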

11 Models have multiple applications
Word-level ordering
Text-level ordering
Machine Translation
Meaning-to-Text Generation
Multi-document Summarization
Essay Grading
?

12 Redundancy
Model of text coherence must deal with redundancy. This text is not coherent:
Revlon Group Inc. said it completed the acquisition of the U.S. cosmetics business of Germaine Monteil Cosmetiques Corp. Terms weren't disclosed, but industry sources said the price was about $2.5 million. The sale includes the rights to Germaine Monteil in North and South America. Terms were not disclosed by either party. Revlon is a cosmetics concern, and Beecham is a pharmaceutical concern, and neither elected to disclose the terms of the acquisition.

13 Contradiction
Model of text coherence must deal with contradiction. This text is not coherent:
Revlon Group Inc. said it completed the acquisition of the U.S. cosmetics business of Germaine Monteil Cosmetiques. Terms weren't disclosed, but industry sources said the price was about $2.5 million. Revlon said it paid $2.2 million for Germaine Monteil.

14 Methods
Modeling of data in a one-billion word corpus of English, as well as in topical multi-document collections.
–generative stories of how text gets produced
–probability values that combine naturally with each other
–strong local constraints expressed as conditional probabilities
–automatic training procedures
–statistical perplexity as a measure of how well the model fits the data
Features
–Word correlations, cue-phrase patterns, syntactic patterns, tense-specific patterns, semantic WordNet-based patterns, coreference patterns
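
For reference, perplexity over N tokens is the exponential of the negative average log-probability the model assigns to each token; lower values mean the model fits the data better. The sketch below uses the standard definition. The function name and the example probabilities are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the per-token probabilities a model assigned to a text."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-avg_log_prob)

# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among four words, so its perplexity is 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that fits the same text better assigns higher probabilities
# and therefore gets a lower perplexity.
better = perplexity([0.5, 0.5, 0.25, 0.5])
print(uniform, better)
```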

15 ADGEN in AQUAINT
1. Answer generation
–Input: collection of text fragments (including phrases and paragraphs)
–Fuse phrases into sentences, order sentences to form millions of possible texts
–Rank and select most coherent presentation
2. Text improvement
–Input: existing text
–Apply probabilistic rewriting operations
–Select rewrite that most improves coherence without sacrificing any of the basic material
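
The "order, rank, and select" step of answer generation might be sketched as below. This is not the ADGEN system: `rank_orderings` is a hypothetical helper, exhaustive enumeration only works for a handful of fragments, and the hand-written `LINK_SCORES` stand in for a trained coherence model purely so the example runs.

```python
from itertools import permutations

def rank_orderings(fragments, coherence_score, top_k=1):
    """Enumerate every ordering of the fragments and return the top_k by score."""
    ranked = sorted(permutations(fragments), key=coherence_score, reverse=True)
    return [list(order) for order in ranked[:top_k]]

# Hand-written stand-in for a trained model (an assumption for this demo):
# a reward for placing sentence b directly after sentence a, keyed by the
# sentence numbers of the Revlon sample problem.
LINK_SCORES = {(3, 1): 1.0, (1, 4): 1.0, (4, 2): 1.0}

def toy_coherence(order):
    """Sum the rewards of all adjacent sentence pairs in the ordering."""
    return sum(LINK_SCORES.get(pair, 0.0) for pair in zip(order, order[1:]))

best = rank_orderings([1, 2, 3, 4], toy_coherence)[0]
print(best)
```

With millions of candidate texts, a real system would need a search procedure that avoids scoring every ordering; the brute-force loop here is only meant to show the rank-and-select idea.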

