Web-Page Summarization Using Clickthrough Data. Jian-Tao Sun & Yuchang Lu, Tsinghua University, China; Dou Shen & Qiang Yang, Hong Kong University of Science & Technology.

Web-Page Summarization Using Clickthrough Data. Jian-Tao Sun & Yuchang Lu, Tsinghua University, China; Dou Shen & Qiang Yang, Hong Kong University of Science & Technology; Hua-Jun Zeng & Zheng Chen, Microsoft Research Asia. SIGIR 2005

2 INTRODUCTION The clickthrough data of a Web search engine can be used to improve Web-page summarization. The clickthrough data contain many users' knowledge of Web pages' contents. The query-word set of a page may cover the multiple topics of the target Web page.

3 INTRODUCTION Problems:  Web pages may have no associated query words if they are never visited by Web users through a search engine.  The clickthrough data are often very noisy. Using the ODP Web directory, we can build a thematic lexicon to solve these problems. We adapt two text-summarization methods to summarize Web pages:  The first approach is based on significant-word selection, adapted from Luhn's method.  The second is based on Latent Semantic Analysis (LSA).

4 RELATED WORK Amitay and Paris first extracted the text segments containing a link to a Web page, then chose the most accurate sentence from these segments as the snippet of the target page. Delort et al. proposed two enhanced Web-page summarization methods using hyperlinks. These methods obtain their knowledge directly from hyperlinks, which may be sparse and noisy.

5 RELATED WORK Sun et al. proposed a CubeSVD approach to utilizing the clickthrough data for personalized Web search. Liu et al. proposed a technique for categorizing Web-query terms from the clickthrough logs into a pre-defined subject taxonomy based on their popular search interests. Hulth et al. proposed to extract keywords using domain knowledge.

6 Adapted Significant Word (ASW) Method In Luhn's method, each sentence is assigned a significance factor, and the sentences with high significance factors are selected to form the summary. A set of significant words is selected according to word frequency in a document: words with frequency between a high-frequency cutoff and a low-frequency cutoff are selected as significant words.

7 Adapted Significant Word (ASW) Method Compute the significance factor: (1) Set a limit L for the distance at which any two significant words can be considered significantly related. (2) Find the portion of the sentence that is bracketed by significant words with no more than L non-significant words between any two of them. (3) Count the number of significant words in this portion and divide the square of this number by the total number of words in the portion; the result is the sentence's significance factor. We modify this method to use both the local contents of a Web page and the query terms collected from the clickthrough data.
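The three steps above can be sketched as follows; this is a minimal reading of Luhn's procedure, where tokenization, the significant-word set, and the default limit L = 4 are simplifying assumptions:

```python
def sentence_significance(sentence, significant, limit=4):
    """Luhn's significance factor for one tokenized sentence.

    Finds spans bracketed by significant words in which consecutive
    significant words are at most `limit` non-significant words apart,
    then scores each span as (significant words in span)^2 / span length
    and returns the best span's score.
    """
    positions = [i for i, w in enumerate(sentence) if w in significant]
    if not positions:
        return 0.0
    best, start = 0.0, 0
    for i in range(1, len(positions) + 1):
        # Close the current cluster when the gap exceeds the limit.
        if i == len(positions) or positions[i] - positions[i - 1] - 1 > limit:
            cluster = positions[start:i]
            span_len = cluster[-1] - cluster[0] + 1
            best = max(best, len(cluster) ** 2 / span_len)
            start = i
    return best
```

For example, a sentence whose two significant words are separated by two other words scores 2² / 4 = 1.0.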

8 Adapted Significant Word (ASW) Method After the significance factors of all words are calculated, we rank them and select the top N% as significant words. Then we employ Luhn's algorithm to compute the significance factor of each sentence. The word significance combines both sources: s_i = α·tf_i^p + (1 − α)·tf_i^q (1), where tf_i^p and tf_i^q denote the frequencies of the i-th term in the local text content of a Web page and in the query set, respectively, and α is a trade-off parameter combining the two significance measurements.

9 Adapted Latent Semantic Analysis (ALSA) Method Suppose there are m distinct terms in an n-document collection. The corpus can be represented by a term-document matrix X ∈ R^{m×n}, whose component x_ij is the weight of term t_i in document d_j. The Singular Value Decomposition (SVD) of X is given by: X = UΣV^T (2). In Equation 2, U and V are the matrices of the left and right singular vectors, and Σ is the diagonal matrix of singular values.

10 Adapted Latent Semantic Analysis (ALSA) Method LSA approximates X with a rank-k matrix X_k = U_k Σ_k V_k^T (3), obtained by setting the smallest r − k singular values to zero (r is the rank of X). That is, the documents are represented in the k-dimensional space spanned by the column vectors of U_k.
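The truncation in Equation (3) can be sketched with NumPy (the function name is mine, not from the slides):

```python
import numpy as np

def rank_k_approx(X, k):
    """Truncated SVD: keep only the k largest singular values,
    i.e. X_k = U_k @ diag(sigma_k) @ V_k^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Each document's k-dimensional coordinates are the corresponding column of Σ_k V_k^T, expressed in the basis given by the columns of U_k.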

11 Extraction-based summarization algorithm Gong et al. proposed an extraction-based summarization algorithm. First, a term-sentence matrix is constructed from the original text document. Next, LSA is conducted on this matrix. In the singular vector space, the i-th sentence is represented by the i-th column vector of V^T; each element of this vector measures the importance of the sentence with respect to the corresponding latent concept.

12 Extraction-based summarization algorithm In the last step, a document summary is produced incrementally. For the most important concept, the sentence with the largest importance factor is selected into the summary; then a second sentence is selected for the next most important concept. This procedure is repeated until a predefined number of sentences has been selected.
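A minimal sketch of this incremental selection, assuming a raw term-sentence count matrix; absolute values are used to sidestep the sign ambiguity of singular vectors, which is my simplification rather than a detail from the slides:

```python
import numpy as np

def lsa_summarize(term_sentence, num_sentences):
    """Gong-style extraction: for each leading latent concept (row of
    V^T), pick the not-yet-chosen sentence with the largest importance
    factor on that concept."""
    _, _, Vt = np.linalg.svd(term_sentence, full_matrices=False)
    chosen = []
    for concept in range(min(num_sentences, Vt.shape[0])):
        # Sentence j's importance on this concept is |Vt[concept, j]|.
        for j in np.argsort(-np.abs(Vt[concept])):
            if j not in chosen:
                chosen.append(int(j))
                break
    return chosen
```

The returned list holds sentence indices in concept order, i.e. the first entry covers the strongest latent topic.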

13 Our LSA-based summarization method Our LSA-based summarization method is a variant of Gong's method. We utilize the query-word knowledge by changing the term-sentence matrix: if a term occurs as a query word, its weight is increased according to its frequency in the query-word collection. With this approach, we expect to extract sentences whose topics are related to those reflected by the query words.

14 Our LSA-based summarization method According to our experiments, the usual weighting and normalization schemes lower the summarization performance, so a plain term-frequency (TF) representation without weighting or normalization is used for the sentences in Web pages. Terms in a sentence are augmented by query terms as follows: tf_i' = tf_i^s + β·tf_i^q (4), where β is a parameter used to tune the weights of query terms, tf_i^s is the frequency of term i in a sentence, and tf_i^q denotes its frequency in the query set.
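Building one column of the augmented term-sentence matrix might look like this; the reading that only terms actually present in the sentence get the query boost, and the function and parameter names, are my assumptions:

```python
from collections import Counter

def sentence_vector(sentence_terms, query_tf, vocab, beta=1.0):
    """One column of the term-sentence matrix: raw term frequency,
    boosted by beta times the term's query-set frequency for terms
    that occur in the sentence (Equation 4's combination)."""
    tf = Counter(sentence_terms)
    return [tf[t] + beta * query_tf.get(t, 0) if tf[t] else 0.0
            for t in vocab]
```

Stacking these columns for all sentences yields the matrix fed to the SVD-based extraction step.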

15 Summarize Web Pages Not Covered by Clickthrough Data According to our statistics, only 23.1% of the crawled (English) ODP pages were browsed and associated with query words. We therefore build a hierarchical lexicon from the clickthrough data and apply it to help summarize the remaining pages. We use TS(c) to denote the set of terms associated with category c; the thematic lexicon is thus a set of TS, organized using the ODP category structure.

16 Summarize Web Pages Not Covered by Clickthrough Data The lexicon is built as follows: First, the TS of each category is set to empty. Next, for each page covered by the clickthrough data, its query words are added to the TS of every category the page belongs to, as well as all parent categories; when a query word is added to a TS, its frequency is added to the word's existing weight in that TS. If a page belongs to more than one category, its query terms are added to the TS of all its categories. Finally, each term weight in each TS is multiplied by the term's Inverse Category Frequency (ICF), the reciprocal of the number of different categories of the hierarchical taxonomy in which the term occurs.
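The lexicon-building steps above can be sketched as follows; the data-structure choices (dicts of term weights, a parent map for the category tree) are illustrative assumptions:

```python
from collections import defaultdict

def build_lexicon(clicked_pages, parents):
    """Build the thematic lexicon: add each page's query-term
    frequencies to the TS of every category it belongs to and all
    ancestor categories, then multiply each weight by the term's
    inverse category frequency (ICF).

    clicked_pages: list of (query_term_freqs, categories) pairs.
    parents: dict mapping each category to its parent (None at a root).
    """
    ts = defaultdict(lambda: defaultdict(float))
    for term_freqs, categories in clicked_pages:
        for cat in categories:
            c = cat
            while c is not None:          # the category and all ancestors
                for term, freq in term_freqs.items():
                    ts[c][term] += freq
                c = parents.get(c)
    # ICF: reciprocal of the number of categories a term occurs in.
    cat_freq = defaultdict(int)
    for c in ts:
        for term in ts[c]:
            cat_freq[term] += 1
    for c in ts:
        for term in ts[c]:
            ts[c][term] /= cat_freq[term]
    return {c: dict(t) for c, t in ts.items()}
```

Terms that spread across many categories are thus down-weighted, while category-specific topic terms keep high weights.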

17 Summarize Web Pages Not Covered by Clickthrough Data For each page to be summarized, we first look up the lexicon for the TS matching the page's category, then apply the summarization methods proposed above: the term weights in the TS are used either to select significant words or to update the term-sentence matrix. If a page belongs to multiple categories, the corresponding TS are merged and their weights averaged. When a TS does not contain enough terms, the TS of its parent category is used.
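The lookup with merging and parent fallback might be sketched like this; the sufficiency threshold `min_terms` and the rule of falling back for all categories at once are my assumptions, as the slides do not specify them:

```python
def lookup_ts(categories, lexicon, parents, min_terms=5):
    """Fetch the TS for a page: merge the TS of all its categories
    with averaged weights; if too few terms result, fall back to the
    parent categories' TS (a simplified fallback rule)."""
    merged = {}
    for cat in categories:
        for term, w in lexicon.get(cat, {}).items():
            merged[term] = merged.get(term, 0.0) + w
    n = len(categories)
    merged = {t: w / n for t, w in merged.items()}
    if len(merged) < min_terms:
        parent_cats = [parents[c] for c in categories if parents.get(c)]
        if parent_cats:
            return lookup_ts(parent_cats, lexicon, parents, min_terms)
    return merged
```

The returned term weights then drive significant-word selection (ASW) or the matrix update (ALSA) exactly as for pages with their own query words.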

18 EXPERIMENTS Data Set A set of Web pages from the ODP directory was crawled. After removing pages in the “World” and “Regional” categories, we obtained 1,125,207 Web pages, 260,763 of which were clicked by Web users using 1,586,472 different queries. DAT1: 90 pages selected from the 260,763 browsed pages. DAT2: We preprocessed the 260,763 pages, extracted the description (“DESCRIPTION” metadata) of each page, and kept the pages with a description of over 200 characters and at least 10 sentences, from which 10,000 pages were randomly selected.

19 EXPERIMENTS Ideal summary DAT1: Three human evaluators were employed to summarize these pages. Each evaluator was asked to extract the most important sentences of a Web page, with no constraint on the number of sentences extracted. DAT2: We use the descriptions provided by the page editors as the ideal summaries.

20 EXPERIMENTS ROUGE Evaluation (Recall-Oriented Understudy for Gisting Evaluation) ROUGE is a software package that measures summarization quality by counting overlapping units, such as n-grams, word sequences, and word pairs, between the candidate summary and the reference summary. ROUGE-N is an n-gram recall measure defined as follows: ROUGE-N = Σ Count_match(gram_n) / Σ Count(gram_n), with both sums taken over the n-grams of the reference summary.

21 EXPERIMENTS N stands for n-gram, Count match (gram n ) is the maximum number of n-grams co-occurring in the candidate summary and the reference summary, Count (gram n ) is the number of n-grams in the candidate summary. In this paper, we only reported ROUGE-N where N=1.

22 EXPERIMENTS Results on DAT1

23 EXPERIMENTS

24 EXPERIMENTS Results on DAT1 without queries

25 EXPERIMENTS

26 EXPERIMENTS Results on DAT2

27 Discussions Experiments indicate that the clickthrough data are helpful for generic Web-page summarization and that both proposed methods leverage this knowledge source well. When Web pages are not covered by the clickthrough data, the lexicon-based approach achieves better results than pure-text-based summarizers: the thematic lexicon built from clickthrough data can discover the topic terms associated with a specific category, and the ICF-based approach can effectively weight the terms of that category.

28 CONCLUSIONS We leverage extra knowledge from clickthrough data to improve Web-page summarization. Two extraction-based methods are proposed to produce generic Web-page summaries. For the pages not covered by the clickthrough data, we build a thematic lexicon using the clickthrough data in conjunction with a hierarchical Web directory. The experimental results show significant improvements over summarizers that do not use clickthrough logs.

29 FUTURE WORK Automatically determine the trade-off parameters. Study how to leverage other types of knowledge hidden in the clickthrough data, such as word clusters and thesauri. Evaluate our methods using extrinsic evaluation metrics and much larger data sets.