1 Natural Language Processing for the Web Prof. Kathleen McKeown 722 CEPSR, Office Hours: Wed, 1-2; Tues 4-5 TA: Yves Petinot 719 CEPSR, Office Hours: Thurs 12-1, 8-9

2 Projects  Proposal due today  Hand in via courseworks (by midnight)

3 Today  Discussants – Each person should sign up for TWO papers  Automated Discovery and Analysis of Social Networks from Threaded Discussions (Kathy)  From Social Bookmarking to Social Summarization (Kathy)  Joint Group and Topic Discovery from Relations and Text (Kathy)  Discovering Authorities in Question Answering Communities Using Link Analysis (Kenny)  Discussants: Lauren, Weiwei

4 Automated Discovery and Analysis of Social Networks  Threaded discussion:  Online class discussion board  “…examining social networks – including the roles and positions of actors in a social network, their influence on others, and what exchanges support and sustain the network – is an important goal for understanding networked learning processes”  Social Network background  Most studies use meta-data about links  This work uses nl analysis of text within postings

5 Who talks to whom? (ties)  Chain network  Create a link from poster to previous poster  Create a link from poster to thread starter plus previous poster  Create a link from poster to all previous posters in a thread, decreasing weight with distance
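
A minimal sketch (my own illustration, not code from the paper) of the three tie-creation schemes above, over a thread represented as the ordered list of poster ids:

```python
from collections import defaultdict

def chain_ties(thread, scheme="previous", decay=0.5):
    """Build weighted poster-to-poster ties from one thread (slide 5).

    thread : poster ids in posting order, e.g. ["amy", "bob", "amy", "cal"]
    scheme : "previous"         - link each poster to the previous poster only
             "starter+previous" - link to the thread starter plus the previous poster
             "all_decay"        - link to all previous posters, weight decaying with distance
    """
    ties = defaultdict(float)
    for i, poster in enumerate(thread[1:], start=1):
        if scheme == "previous":
            ties[(poster, thread[i - 1])] += 1.0
        elif scheme == "starter+previous":
            ties[(poster, thread[0])] += 1.0
            if thread[i - 1] != thread[0]:
                ties[(poster, thread[i - 1])] += 1.0
        elif scheme == "all_decay":
            for dist, earlier in enumerate(reversed(thread[:i]), start=1):
                ties[(poster, earlier)] += decay ** (dist - 1)
    return dict(ties)

# the third scheme gives nearer posters more weight:
print(chain_ties(["amy", "bob", "cal"], scheme="all_decay"))
# {('bob', 'amy'): 1.0, ('cal', 'bob'): 1.0, ('cal', 'amy'): 0.5}
```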

6

7 Topic of this paper: identifying names in a posting  Possible names  Previous poster (to line, direct reference, indirect reference, subject of discussion)  someone else entirely (author),  current poster (self-reference, in address line, signature)

8 Methods for identifying “Who”  Use of name lists  Class roster  Use of titles (Prof.), addresses (dear)  Exclusion of 3-word capitalized sequences  Confidence level (“page” vs. “Page”)  Misspellings (manual review and edits)  Results: local system vs. LingPipe  Precision: .88 vs. .60  Recall: .66 vs. .68
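
A toy sketch of the roster-based name spotting described above; the roster, title cues, and confidence values are made up for illustration, and the real system also handles misspellings and address lines:

```python
import re

ROSTER = {"Anne Page", "Bob Li"}      # hypothetical class roster
TITLES = ("Prof.", "Dr.", "Dear")     # cues that the next capitalized word names a person

def find_names(posting):
    """Return roster names mentioned in a posting, with a crude confidence score."""
    # runs of 3+ consecutive capitalized words are likely titles of works, not people
    excluded = [m.span() for m in re.finditer(r"(?:[A-Z][a-z]+\s+){2,}[A-Z][a-z]+", posting)]
    found = {}
    for full in ROSTER:
        for name in set(full.split()) | {full}:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", posting):
                if any(s <= m.start() and m.end() <= e for s, e in excluded):
                    continue
                preceded_by_title = any(posting[:m.start()].rstrip().endswith(t) for t in TITLES)
                conf = 1.0 if name == full or preceded_by_title else 0.6
                found[full] = max(found.get(full, 0.0), conf)
    return found

print(find_names("Dear Prof. Page, Bob raised a good point about ranking."))
# {'Anne Page': 1.0, 'Bob Li': 0.6}
```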

9 Methods for identifying “ties”: links  Chance of a tie proportional to the number of times each mentions the other as addressee or subject  Add 1 to a poster and all names found in a posting  More than one name: link name to userid using collocational analysis  Association types: P (poster) vs. A (addressee)  Information exchange: essential social interaction  Measure via the content of the exchange (vs. network structure)  Information weight of an exchange (vs. discourse act such as announcement)  Yahoo Term Extractor to find content nouns

10 Example  Keep in mind that google and other search technology are still evolving and getting better. I certainly don't believe that they will be as effective as a library in 2-5 years, but if they improve significantly, it will continue to be difficult for the public to perceive the difference.

11 Evaluation  Metric: QAP correlation

12 From Social Bookmarking to Social Summarization  Exploit user-created content  Del.icio.us web page tags, Flickr, review sites  Approach expands on query-focused summarization  Extract bookmark tags for a page p: (b1, b2, ...)  Issue a search with the tags as the query  Extract the snippets associated with p in the results: S(bi, p)  Normalize snippets  Score each snippet according to frequency  Rank order as summary

13 Some details  Limit results of search to top N  Normalize by  Determining overlap (like cosine)  Match if overlap score above threshold T  Take shorter sentence of a match  Determine frequency of selected snippets in search result to rank
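
A rough end-to-end sketch of the pipeline from slides 12-13. The helpers get_tags and search are hypothetical stand-ins for the Del.icio.us and search-engine calls, and Jaccard word overlap stands in for the cosine-style snippet overlap:

```python
def social_summary(page_url, get_tags, search, top_n=50, overlap_threshold=0.5, max_snippets=5):
    """Tag-driven, query-focused summarization of a bookmarked page (slides 12-13)."""
    tags = get_tags(page_url)                          # bookmark tags (b1, b2, ...) for page p
    results = search(" ".join(tags))[:top_n]           # limit to the top-N search results
    snippets = [r.snippet for r in results if r.url == page_url]   # S(bi, p)

    def overlap(a, b):
        """Jaccard word overlap, standing in for the cosine-style overlap on slide 13."""
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / (len(wa | wb) or 1)

    # normalize: merge near-duplicate snippets, keeping the shorter sentence of a match
    unique, counts = [], []
    for s in snippets:
        for i, u in enumerate(unique):
            if overlap(s, u) >= overlap_threshold:
                unique[i] = min(u, s, key=len)
                counts[i] += 1
                break
        else:
            unique.append(s)
            counts.append(1)

    # score each normalized snippet by its frequency in the result set and rank-order
    ranked = sorted(zip(counts, unique), reverse=True)
    return [snippet for _, snippet in ranked[:max_snippets]]
```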

14

15 Evaluation  Baselines: OTS, MEAD  Metric: ROUGE  Set-up 1: SS used the full set of tags, average length = 24%  Problem?  Relative improvement:  SS over OTS: 31-39%  SS over MEAD: 24-29%

16  Set-up 2: Vary summary length from 10% to 50%

17 Set-up 3: Community-based summarization  Use tags generated by a specific community: a skier community (“skiing” as seed bookmark) vs. a travel community (“travel” as seed bookmark)  Evaluation: recall in the summary of terms in the seed set

18

19 Joint Group and Topic Discovery from Relations and Text  Example: legislative body and alliances  Different alliances may form depending on the resolution topic (taxation vs. foreign trade)  GT model  Discovery of groups guided by emerging topics  Discovery of topics guided by emerging groups  Example: resolutions that would have been assigned to one group based on topic may be assigned to a different one given voting patterns; distinct word-based topics may be merged if entities vote similarly.

20 GT Model  Simultaneously clusters entities into groups and words into topics  Data set: voting data from the US Senate and the UN General Assembly

21 Sentence extraction  Sparck Jones:  ‘what you see is what you get’, some of what is on view in the source text is transferred to constitute the summary

22 Background  Sentence extraction the main approach  Some more sophisticated features for extraction in recent years  Lexical chains, anaphoric reference, topic signatures  Machine learning models for learning an extraction summarizer (e.g., Kupiec)

23 Today’s systems  How can we edit the selected text?

24 Karen Sparck Jones Automatic Summarizing: Factors and Directions

25 Sparck Jones claims  Need more power than text extraction and more flexibility than fact extraction (p. 4)  In order to develop effective procedures it is necessary to identify and respond to the context factors, i.e. input, purpose and output factors, that bear on summarising and its evaluation. (p. 1)  It is important to recognize the role of context factors because the idea of a general-purpose summary is manifestly an ignis fatuus. (p. 5)  Similarly, the notion of a basic summary, i.e., one reflective of the source, makes hidden fact assumptions, for example that the subject knowledge of the output’s readers will be on a par with that of the readers for whom the source was intended. (p. 5)  I believe that the right direction to follow should start with intermediate source processing, as exemplified by sentence parsing to logical form, with local anaphor resolutions

26 Questions (from Sparck Jones)  Does the subject matter of the source influence summary style (e.g., chemical abstracts vs. sports reports)?  Should we take the reader into account, and how?  Is the state of the art sufficiently mature to allow summarization from intermediate representations and still allow robust processing of domain-independent material?

27  Consider the papers we read in light of Sparck Jones’ remarks on the influence of context:  Input  Source form, subject type, unit  Purpose  Situation, audience, use  Output  Material, format, style

28 Cut and Paste in Professional Summarization  Humans also reuse the input text to produce summaries  But they “cut and paste” the input rather than simply extract  Automatic corpus analysis (Ziff-Davis)  300 summaries, 1,642 sentences  81% of sentences were constructed by cutting and pasting

29 Major Cut and Paste Operations  (1) Sentence reduction

30 Major Cut and Paste Operations  (1) Sentence reduction

31 Major Cut and Paste Operations  (1) Sentence reduction  (2) Sentence Combination

32  (3) Generalization  "a proposed new law that would require Web publishers to obtain parental consent before collecting personal information from children" -> "legislation to protect children's privacy on-line"

33 Cut and Paste Based Single Document Summarization -- System Architecture (diagram)  Pipeline: input (single document) -> extraction -> extracted sentences -> sentence reduction / sentence combination -> generation -> output: summary  Resources: Ziff-Davis corpus, decomposition, lexicon, parser, co-reference

34 Sentence Reduction  Step 1: Use linguistic knowledge to decide what phrases MUST NOT be removed  Obligatory arguments of verbs are saved  Step 2: Determine what phrases are most important in the local context  Phrases with words that link forward or backward  Step 3: Compute the probabilities of humans removing a certain type of phrase  Step 4: Combine the three factors to decide
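
One way the four steps might combine into a per-phrase keep/drop decision; the weights and input representation are invented for illustration, and Jing's actual decision model is corpus-trained:

```python
def reduce_sentence(phrases, context_words, removal_prob, keep_threshold=0.5):
    """Phrase-by-phrase keep/drop decision for sentence reduction (slide 34).

    phrases       : list of dicts, e.g. {"text": "...", "role": "PP", "obligatory": False, "words": {...}}
    context_words : set of words that link to neighbouring sentences (local context)
    removal_prob  : dict mapping a syntactic role to P(humans remove that kind of phrase)
    """
    kept = []
    for p in phrases:
        # Step 1: obligatory verb arguments must never be removed
        if p.get("obligatory"):
            kept.append(p["text"])
            continue
        # Step 2: contextual importance - share of the phrase's words linking to the context
        context_score = len(p["words"] & context_words) / max(len(p["words"]), 1)
        # Step 3: corpus evidence - how often humans delete this kind of phrase
        corpus_keep = 1.0 - removal_prob.get(p.get("role", "other"), 0.5)
        # Step 4: combine the factors (equal weights, purely illustrative) and decide
        if 0.5 * context_score + 0.5 * corpus_keep >= keep_threshold:
            kept.append(p["text"])
    return " ".join(kept)

# toy usage: the prepositional phrase is neither obligatory nor contextually linked, so it is dropped
print(reduce_sentence(
    [{"text": "The bill", "obligatory": True, "words": {"bill"}},
     {"text": "passed", "obligatory": True, "words": {"passed"}},
     {"text": "despite heavy lobbying", "role": "PP", "obligatory": False, "words": {"lobbying"}}],
    context_words={"bill", "vote"},
    removal_prob={"PP": 0.8}))
```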

35 Sentence Fusion for Multi-document Summarization

36

37 Fusion

38 Sentence Fusion Computation: Content Selection  Common information identification  Alignment of constituents in parsed theme sentences: only some subtrees match  Bottom-up local multi-sequence alignment  Similarity depends on  Word/paraphrase similarity  Tree structure similarity

39  Sim(T, T’) = max(nodecompare(T, T’), Sim(T, children(T’)), Sim(children(T), T’))  nodecompare searches for the best possible alignment of all child nodes  Node similarity depends on the similarity between the words of atomic nodes
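
The recursion above, written out over simple (label, children) tuples; the greedy child alignment and exact-match word similarity below are simplifications of the paper's constituent alignment:

```python
def sim(t1, t2, word_sim):
    """Bottom-up similarity between two parse (sub)trees, as on slide 39.

    A tree is (label, [children]); a leaf has an empty child list.
    Sim(T, T') = max(nodecompare(T, T'), Sim(T, children(T')), Sim(children(T), T'))
    """
    best_vs_t2_children = max((sim(t1, c, word_sim) for c in t2[1]), default=0.0)
    best_vs_t1_children = max((sim(c, t2, word_sim) for c in t1[1]), default=0.0)
    return max(nodecompare(t1, t2, word_sim), best_vs_t2_children, best_vs_t1_children)

def nodecompare(t1, t2, word_sim):
    """Greedy child alignment (the paper searches for the best alignment of all children)."""
    if not t1[1] and not t2[1]:                 # two atomic nodes: compare the words
        return word_sim(t1[0], t2[0])
    score, used = 0.0, set()
    for c1 in t1[1]:
        cand = [(sim(c1, c2, word_sim), j) for j, c2 in enumerate(t2[1]) if j not in used]
        if cand:
            s, j = max(cand)
            score += s
            used.add(j)
    return score / max(len(t1[1]), len(t2[1]), 1)

# tiny usage example with exact word match as the word-level similarity
ws = lambda a, b: 1.0 if a == b else 0.0
tree_a = ("S", [("NP", [("dog", [])]), ("VP", [("barked", [])])])
tree_b = ("S", [("NP", [("dog", [])]), ("VP", [("howled", [])])])
print(sim(tree_a, tree_b, ws))
```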

40

41 Sentence Fusion: Generation  Fusion lattice computation  Choose a basis sentence  Add subtrees from fusion not present in basis  Add alternative verbalizations  Remove subtrees from basis not present in fusion  Lattice linearization  Generate all possible sentences from the fusion lattice  Score sentences using statistical language model
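
Only the last step, scoring linearizations with a statistical language model, is sketched here; the add-one-smoothed bigram model and the toy training corpus are placeholders for the large language model used in the real system:

```python
import math
from collections import Counter

def train_bigram(corpus_sentences):
    """Toy add-one-smoothed bigram language model over whitespace tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        toks = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(toks[:-1])              # history counts
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams) + 1
    def logprob(sentence):
        toks = ["<s>"] + sentence.lower().split() + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(toks, toks[1:]))
    return logprob

# rank candidate linearizations of a fusion lattice and keep the best-scoring one
logprob = train_bigram(["the storm grounded the helicopters",
                        "high winds grounded the navy helicopters"])
candidates = ["the storm grounded the navy helicopters",
              "the navy the storm helicopters grounded"]
print(max(candidates, key=logprob))
```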

42

43 Questions  Jing: Not a statistical approach, not learned. Is this OK? Does it buy us anything over the approaches using learning?  Barzilay: Also not statistical, OK? How does it compare with Jing? Is redundancy a good criterion for content selection? What could go wrong?

44 Sparck Jones claims  Need more power than text extraction and more flexibility than fact extraction (p. 4)  In order to develop effective procedures it is necessary to identify and respond to the context factors, i.e. input, purpose and output factors, that bear on summarising and its evaluation. (p. 1)  It is important to recognize the role of context factors because the idea of a general-purpose summary is manifestly an ignis fatuus. (p. 5)  Similarly, the notion of a basic summary, i.e., one reflective of the source, makes hidden fact assumptions, for example that the subject knowledge of the output’s readers will be on a par with that of the readers for whom the source was intended. (p. 5)  I believe that the right direction to follow should start with intermediate source processing, as exemplified by sentence parsing to logical form, with local anaphor resolutions

45 Supervised and Unsupervised Learning for Sentence Compression  J. Turner and E. Charniak

46 Knight and Marcu Model  Noisy Channel Model  Ziff-Davis corpus  Given a long sentence l, determine the short sentence s that maximizes P(s|l)  Bayes rule (written out below)  P(l) is constant across all long sentences, so it is dropped  Language model: a combination of a PCFG and a bigram model over s
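
For reference, the noisy-channel objective the slide alludes to, spelled out (a standard Bayes-rule rewrite, not quoted from the paper):

```latex
\hat{s}
  = \arg\max_{s} P(s \mid l)
  = \arg\max_{s} \frac{P(s)\,P(l \mid s)}{P(l)}
  = \arg\max_{s} \underbrace{P(s)}_{\text{source model}}\;\underbrace{P(l \mid s)}_{\text{channel model}}
```

Since P(l) is the same for every candidate s, it drops out of the argmax; what remains is the source (language) model times the channel (expansion) model.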

47 Two problems with K&M  Lack of training data – Why?  Probability model is ad hoc

48 Turner and Charniak Approach – K&M modification  Use syntactic language model  Slightly change channel model:  Parameter to encourage compression

49 Alternate models  “Special rule” additions + K&M variation  NP(1) -> NP(2) CC NP(3) compressed to NP(2)  Unsupervised version using the PTB: no parallel corpus  P(l|s) learned by comparing similar rules  NP -> DT JJ NN (3x)  NP -> DT NN (4x)  P(l|s) = 3/7 (worked out below)  Semi-supervised: fall back on the unsupervised model when there is no data from the supervised one  Constraints: complement/adjunct distinction: never allow deletion of a complement
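
A tiny worked version of the 3/7 example: estimating P(l|s) from treebank rule counts, assuming the pairing of each long rule with its compressed version has already been established (which is the hard part in the paper):

```python
from collections import Counter

# hypothetical rule counts harvested from treebank parses
rule_counts = Counter({
    ("NP", ("DT", "JJ", "NN")): 3,   # long version, seen 3 times
    ("NP", ("DT", "NN")): 4,         # short (compressed) version, seen 4 times
})

def p_long_given_short(long_rule, short_rule, counts):
    """Unsupervised channel estimate: P(l|s) = #long / (#long + #short)."""
    long_c, short_c = counts[long_rule], counts[short_rule]
    return long_c / (long_c + short_c)

print(p_long_given_short(("NP", ("DT", "JJ", "NN")), ("NP", ("DT", "NN")), rule_counts))
# -> 0.42857...  i.e. 3/7, matching the slide
```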

50 Results (evaluated using judges)

51 Questions  How does this compare with Jing?  Will the same manual rules be captured?  Verb arguments not deleted?  Context determines importance  What do the statistics capture that is not captured by the manual approach?  How about revisions other than reduction?

52 Compression Beyond Word Deletion  Cohn and Lapata

53 Goal and Approach  To delete, substitute, re-order  Collect a new corpus: why?  30 newspaper articles, 575 sentences  Is this adequate?  Extract compressions  Collect paraphrases using MT

54 Abstraction Example  High winds and snowfalls have, however, grounded at a lower level the powerful US Navy Sea Stallion helicopters used to transport the slabs.  Bad weather, however, has grounded the helicopters transporting the slabs.

55 Extraction of compression rules  Synchronous Tree Substitution Grammar  (S, S) -> (NP VBD NP, NP was VBN by NP)  Probabilistic (each grammar rule is assigned a learned weight)  Prediction: generation finds the best-scoring compression using the grammar rules  (Skip training section)

56 Extension (contribution)  Paraphrasing with their corpus is a problem  Learn paraphrase grammar rules  Parallel bilingual corpus  Learns over syntax tree fragments  Translate from English to French and back again -> an English paraphrase of the original  These rules are added into the extracted compression grammar
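
The pivot idea in miniature, with a deliberately unimplemented translate(text, src, tgt) stub standing in for whatever MT system is used; the actual method extracts paraphrase rules over aligned syntax-tree fragments rather than whole-sentence round trips:

```python
def translate(text, src, tgt):
    """Hypothetical MT call - swap in a real translation system here."""
    raise NotImplementedError

def pivot_paraphrase(sentence, pivot="fr"):
    """English -> pivot language -> English yields a paraphrase of the original."""
    return translate(translate(sentence, "en", pivot), pivot, "en")
```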

57 Combined grammar  Incorporates an n-gram language model as a feature  Helps prevent ungrammatical output  Like K&M and Turner and Charniak, a parameter to penalize short output  Union of the compression grammar, the paraphrasing grammar, and a COPY grammar derived from the source side
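
One way to picture the combined scoring: a weighted sum of the grammar-derivation score, an n-gram language-model feature, and a per-word term that keeps the output from being compressed too aggressively. The weights are placeholders, not values from the paper:

```python
def compression_score(derivation_weight, lm_logprob, n_output_words,
                      lm_weight=1.0, word_bonus=0.1):
    """Score one candidate compression under the combined model (slide 57).

    derivation_weight : summed weight of the compression / paraphrase / COPY rules used
    lm_logprob        : log-probability of the output under an n-gram language model,
                        which discourages ungrammatical output
    word_bonus        : added per output word, so very short (over-compressed)
                        outputs are penalized, as on the slide
    """
    return derivation_weight + lm_weight * lm_logprob + word_bonus * n_output_words
```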

58 Results  Table: Grammaticality, Importance, Compression Rate for the extract, abstract, and gold models (numeric scores not shown)  Examples (O = original, E = extract, A = abstract, G = gold):
O: The scheme was intended for people of poor or moderate means.
E: The scheme was intended for people of poor means.
A: The scheme was intended for poor people.
G: The scheme was intended for the poor.
O: He died last Thursday at his home from complications following a fall, said his wife author Margo Kurtz.
E: He died last at his home from complications following a fall, said wife, author Margo Kurtz.
A: His wife author Margo Kurtz died from complications after a decline.
G: He died from complications following a fall.

59 Questions  Is this comparable to K&M and Turner and Charniak?  Is it OK to take a risk?  What are the weak points?
