1 Web-Page Summarization Using Clickthrough Data* JianTao Sun, Yuchang Lu Dept. of Computer Science TsingHua University Beijing 100084, China Dou Shen,

Presentation transcript:

1 Web-Page Summarization Using Clickthrough Data*
JianTao Sun, Yuchang Lu: Dept. of Computer Science, TsingHua University, Beijing 100084, China
Dou Shen, Qiang Yang: Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, HK
HuaJun Zeng, Zheng Chen: Microsoft Research Asia, 5F, Sigma Center, 49 Zhichun Road, Beijing, China
Presenter: Chen Yi-Ting

2 Reference
JianTao Sun, Yuchang Lu, Dou Shen, Qiang Yang, HuaJun Zeng, Zheng Chen. “Web-Page Summarization Using Clickthrough Data”, SIGIR’05, August 15-19.
H. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2), 1958.

3 Outline
Introduction
Summarize Web Pages using Clickthrough Data
–Empirical study on clickthrough data
–Adapted web-page summarization methods
–Summarize web pages not covered by clickthrough data
Experiments
Conclusions and future work

4 Introduction (1/2)
Why summarize Web pages?
Web-page summaries can be abstracts or extracts
A Web-page summary can also be either generic or query-dependent
–A query-dependent summary presents the information that is most relevant to the initial query
–A generic summary gives an overall sense of the document’s content
–A generic summary should meet two conditions: maintain wide coverage of the page’s topics and keep redundancy low at the same time
This paper focuses on extract-based generic Web-page summarization
The objective of this research is to utilize extra knowledge to improve Web-page summarization
Clickthrough data contains users’ knowledge of Web pages’ content
A user’s query words often reflect the true meaning of the target Web page’s content

5 Introduction (2/2)
This is a challenging task:
–Web pages may have no associated query words, since they are not visited by Web users through a search engine
–The clickthrough data are noisy
In this paper, a thematic lexicon (a hierarchy of query terms) is constructed
–The thematic lexicon can be used to complement scarce Web-page content even when no clickthrough data was collected for these pages
–The method can also help filter out noise in the query words of an individual Web page by using statistics over all Web pages of its category
Two text-summarization methods are adapted to summarize Web pages
–The first approach is based on significant-word selection, adapted from Luhn’s method
–The second approach is based on Latent Semantic Analysis (LSA)

6 Summarize web pages using clickthrough data (1/7)
Empirical study on clickthrough data
–Consider the typical search scenario: a user (u) submits a query (q) to a search engine, and the search engine returns a ranked list of Web pages. The user then clicks on the pages (p) of interest
–The clickthrough data can be represented by a set of triples ⟨u, q, p⟩
–The clickthrough data records how Web users find information through queries
–The collection of queries for a page is expected to reflect the topic of the target Web page well
–Two experiments:
  –To investigate whether the query words are related to the topics of the Web page (45.5% of keywords occur in the query words; 13.1% of query words appear as keywords)
  –To give evidence that clickthrough data is helpful for summarizing Web pages
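
The two percentages in the empirical study are overlap ratios between a page's keywords and the query words that led users to it. A minimal sketch of that measurement follows, assuming clickthrough records are plain (user, query, page) tuples and keywords are given as a set; the function names and the toy data are hypothetical, and the slides do not say how the per-page ratios were aggregated.

```python
from collections import defaultdict

def query_words_by_page(clicks):
    """Group query words by clicked page from (user, query, page) triples."""
    words = defaultdict(set)
    for _user, query, page in clicks:
        words[page].update(query.lower().split())
    return dict(words)

def overlap_ratios(keywords, query_words):
    """Return (share of keywords found among the query words,
    share of query words found among the keywords)."""
    common = keywords & query_words
    kw_share = len(common) / len(keywords) if keywords else 0.0
    qw_share = len(common) / len(query_words) if query_words else 0.0
    return kw_share, qw_share

# Toy clickthrough log (made up for illustration)
clicks = [
    ("u1", "cheap flights", "p1"),
    ("u2", "flights to paris", "p1"),
    ("u3", "python tutorial", "p2"),
]
page_queries = query_words_by_page(clicks)
kw_share, qw_share = overlap_ratios({"flights", "airline"}, page_queries["p1"])
```

Here one of the two keywords appears among the four distinct query words of page p1, so the two ratios differ even though they count the same overlap.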

7 Summarize web pages using clickthrough data (2/7)
Adapted Web-page Summarization Methods (suppose that we now have a set of query terms for each page)
–Adapted Significant Word (ASW) Method
  The first summarization method is adapted from Luhn’s algorithm, a classical algorithm designed for text summarization
  In Luhn’s method, each sentence is assigned a significance factor, and the sentences with high significance factors are selected to form the summary
  First, a set of significant words is constructed (according to word frequency in the document)
  Then the significance factor of a sentence is computed as follows:
  (1) Set a limit L for the distance at which any two significant words can be considered significantly related
  (2) Find a portion of the sentence that is bracketed by significant words that are not more than L non-significant words apart
  (3) Count the number of significant words contained in the portion and divide the square of this number by the total number of words within the portion
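
Steps (1) to (3) can be sketched as a single pass over the positions of the significant words. This is an illustrative reading of the slide, not the authors' code: the function name, the default value of `max_gap` (the limit L), and the choice to score a sentence by its best-scoring portion are assumptions.

```python
def luhn_sentence_score(words, significant, max_gap=4):
    """Best portion score in a sentence: (significant words)^2 / portion length.

    A portion starts and ends with a significant word, and consecutive
    significant words inside it are at most max_gap other words apart.
    """
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1                      # still inside the current portion
        else:
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1           # open a new portion
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))

sentence = ("the clickthrough data of a search engine "
            "improves web page summarization").split()
significant = {"clickthrough", "data", "search", "summarization"}
tight = luhn_sentence_score(sentence, significant, max_gap=2)
loose = luhn_sentence_score(sentence, significant, max_gap=4)
```

With `max_gap=2` the best portion is "clickthrough data of a search" (3 significant words over 5 words, 9/5); with `max_gap=4` the portion extends to "summarization" (4 significant words over 10 words, 16/10).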

8 Summarize web pages using clickthrough data (3/7)
Adapted Web-page Summarization Methods:
–Adapted Significant Word (ASW) Method
  To customize this procedure so that it leverages query terms for Web-page summarization, the significant-word selection method is modified
  The basic idea is to use both the local content of a Web page and the query terms collected from the clickthrough data to decide whether a word is significant
  After the significance factors of all words are calculated, the words are ranked and the top N% are selected as significant words
  Then Luhn’s algorithm is employed to compute the significance factor of each sentence
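
The slide does not give the formula for a word's significance factor. The sketch below assumes a simple linear mix of the word's relative frequency in the page and in the page's query words, controlled by a hypothetical parameter `alpha`, followed by the top-N% cut described above. The function name, `alpha`, and the tie-breaking rule are illustrative choices only.

```python
import math
from collections import Counter

def select_significant_words(page_words, query_words, alpha=0.5, top_percent=10):
    """Rank page words by a mix of page and query frequency; keep the top N%."""
    page_tf = Counter(page_words)
    query_tf = Counter(query_words)
    page_total = sum(page_tf.values()) or 1
    query_total = sum(query_tf.values()) or 1
    # Assumed scoring rule: alpha weighs page content against query terms
    scores = {
        w: alpha * page_tf[w] / page_total
           + (1 - alpha) * query_tf[w] / query_total
        for w in page_tf
    }
    keep = max(1, math.ceil(len(scores) * top_percent / 100))
    ranked = sorted(scores, key=lambda w: (-scores[w], w))  # ties: alphabetical
    return set(ranked[:keep])

page = "hotel booking offers hotel deals and travel tips for travel lovers".split()
queries = ["cheap", "hotel", "deals", "hotel", "booking"]
with_queries = select_significant_words(page, queries, alpha=0.5, top_percent=30)
page_only = select_significant_words(page, queries, alpha=1.0, top_percent=30)
```

In this toy page, "travel" is as frequent as "hotel", so a page-only ranking keeps it; once query terms are mixed in, "booking" and "deals" overtake it because users actually searched for them.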

9 Summarize web pages using clickthrough data (4/7)
Adapted Web-page Summarization Methods:
–Adapted Latent Semantic Analysis (ALSA) Method
  Gong et al. proposed an extraction-based summarization algorithm
  –First, a term-sentence matrix is constructed from the original text document
  –Next, LSA is conducted on the matrix
  –In the last step, a document summary is produced incrementally
  The proposed LSA-based summarization method is a variant of Gong’s method
  –It utilizes the query-word knowledge by changing the term-sentence matrix: if a term occurs as a query word, its weight is increased according to its frequency in the query-word collection
  –It is expected to extract sentences whose topics are related to the ones reflected by the query words
  –The term-frequency vector of each sentence can be weighted by different weighting (global weighting and local weighting) and normalization methods
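
The three steps of the LSA-based extraction can be sketched with a singular value decomposition. This is a minimal illustration that requires NumPy, not the published implementation: the function name is hypothetical, and comparing magnitudes (rather than signed values) is an assumption made here because the sign of a singular vector is arbitrary.

```python
import numpy as np

def lsa_select(term_sentence, k):
    """Pick up to k sentence indices, one per leading latent topic."""
    a = np.asarray(term_sentence, dtype=float)   # rows: terms, columns: sentences
    _u, _s, vt = np.linalg.svd(a, full_matrices=False)
    chosen = []
    for topic in vt[:k]:                          # one right singular vector per topic
        # Take the not-yet-chosen sentence with the largest magnitude on this topic
        for idx in np.argsort(-np.abs(topic)):
            if int(idx) not in chosen:
                chosen.append(int(idx))
                break
    return chosen

# Toy matrix with three sentences on unrelated topics (columns are orthogonal)
matrix = [
    [3, 0, 0],
    [3, 0, 0],
    [0, 1, 0],
    [0, 0, 2],
]
summary = lsa_select(matrix, k=2)
```

The summary grows one sentence per topic, in order of decreasing singular value, which matches the "produced incrementally" step. In the toy matrix the first sentence carries the most weight and the third the next most.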

10 Summarize web pages using clickthrough data (5/7)
Adapted Web-page Summarization Methods:
–Adapted Latent Semantic Analysis (ALSA) Method
  In this paper, a term frequency (TF) approach without weighting or normalization is used to represent the sentences in Web pages
  Terms in a sentence are augmented by query terms as follows: [formula not preserved in the transcript]
Advantages of the adapted methods
–The extra knowledge of query terms is utilized to help select significant words and to modify the page representation
–The approach can, to some extent, handle the noise in query words
–Finally, the ASW approach avoids a problem of Luhn’s method, where the frequency-cutoff method may lead to a large number of significant words for long pages
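
Because the augmentation formula is missing from the transcript, the sketch below only illustrates the stated idea: raw term frequencies form the term-sentence matrix, and a term that occurs as a query word gets a larger weight the more often it was queried. The multiplicative rule `1 + beta * query_frequency`, the parameter `beta`, and the function name are assumptions, not the paper's formula.

```python
from collections import Counter

def boosted_term_sentence_matrix(sentences, query_words, beta=1.0):
    """Build a term-sentence matrix of raw TF, boosted for query terms.

    sentences: list of token lists. Returns (vocabulary, matrix) where
    matrix[i][j] is the weight of vocabulary[i] in sentence j.
    """
    query_tf = Counter(query_words)
    vocab = sorted({w for s in sentences for w in s})
    matrix = []
    for term in vocab:
        boost = 1.0 + beta * query_tf[term]   # 1.0 for terms never queried
        matrix.append([s.count(term) * boost for s in sentences])
    return vocab, matrix

sentences = [
    ["cheap", "hotel", "deals"],
    ["hotel", "reviews", "and", "photos"],
]
vocab, matrix = boosted_term_sentence_matrix(sentences, ["hotel", "deals", "hotel"])
rows = dict(zip(vocab, matrix))
```

The resulting matrix can be passed to an LSA step unchanged; only the rows of queried terms ("hotel", "deals") are scaled up.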

11 Summarize web pages using clickthrough data (6/7)
Summarize Web Pages Not Covered by Clickthrough Data
–A hierarchical lexicon is built from the clickthrough data and applied to help summarize those pages
–All ODP Web pages have been manually organized into a hierarchical taxonomy
–For each category of the taxonomy, the lexicon contains all query terms that users have submitted to browse Web pages of this category
–The lexicon is built as follows:
  First, the term set TS corresponding to each category is set to empty
  Next, for each page covered by the clickthrough data, its query words are added to the TS of the page’s categories
  Finally, each term weight in each TS is multiplied by the term’s Inverse Category Frequency (ICF)
–For each Web page to be summarized, first look up the lexicon for the TS according to the page’s category
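
The three construction steps can be sketched as follows. The slide does not define ICF, so this sketch assumes it is computed by analogy with inverse document frequency, as log(number of categories / number of categories containing the term); the function name and the toy data are hypothetical, and any propagation of terms to parent categories is left out.

```python
import math
from collections import Counter, defaultdict

def build_lexicon(page_categories, page_query_words):
    """Map each category to {term: weight}, with weight = term count * ICF."""
    ts = defaultdict(Counter)                 # step 1: every TS starts empty
    for page, words in page_query_words.items():
        for category in page_categories.get(page, []):
            ts[category].update(words)        # step 2: add the page's query words
    n_categories = len(ts)
    # Number of categories whose TS contains each term
    cat_freq = Counter(term for terms in ts.values() for term in terms)
    return {                                  # step 3: multiply by the assumed ICF
        category: {
            term: count * math.log(n_categories / cat_freq[term])
            for term, count in terms.items()
        }
        for category, terms in ts.items()
    }

page_categories = {"p1": ["Travel"], "p2": ["Travel"], "p3": ["Computers"]}
page_query_words = {
    "p1": ["hotel", "cheap"],
    "p2": ["hotel", "free"],
    "p3": ["python", "free"],
}
lexicon = build_lexicon(page_categories, page_query_words)
```

Under this assumed ICF, "free" occurs in every category and ends up with weight 0, while "hotel" keeps a high weight in Travel because it is both frequent there and absent elsewhere.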

12 Summarize web pages using clickthrough data (7/7)
Summarize Web Pages Not Covered by Clickthrough Data
–The weights of the terms in TS can be used to select significant words or to update the term-sentence matrix
  If a page to be summarized has multiple categories, the corresponding TS are merged and the weights are averaged
  When a TS does not have sufficient terms, the TS of its parent category is used
–Two advantages:
  First, the category-specific TS provides a distribution of the topic terms in this category
  Second, some noisy terms that may be relatively frequent in one page’s query words are given a low weight through the use of statistics over all Web pages of this category
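
The lookup rules (merge and average over multiple categories, fall back to the parent when a TS is too small) can be sketched as below. The function names, the `min_terms` threshold, the `parent` mapping, and the choice to average over the number of categories are assumptions for illustration; the slides do not state what counts as "sufficient terms".

```python
def terms_for_category(category, lexicon, parent, min_terms):
    """Walk up the taxonomy until a term set with enough terms is found."""
    while category is not None:
        terms = lexicon.get(category, {})
        if len(terms) >= min_terms:
            return terms
        category = parent.get(category)      # None once the root is passed
    return {}

def terms_for_page(categories, lexicon, parent, min_terms=2):
    """Merge the term sets of all of a page's categories, averaging weights."""
    term_sets = [terms_for_category(c, lexicon, parent, min_terms)
                 for c in categories]
    if not term_sets:
        return {}
    totals = {}
    for terms in term_sets:
        for term, weight in terms.items():
            totals[term] = totals.get(term, 0.0) + weight
    # A term missing from one category counts as weight 0 in the average
    return {term: total / len(term_sets) for term, total in totals.items()}

lexicon = {
    "Travel": {"hotel": 2.0, "flight": 1.0},
    "Travel/Lodging": {"hostel": 1.0},
    "Shopping": {"hotel": 1.0, "deals": 3.0},
}
parent = {"Travel/Lodging": "Travel"}
page_terms = terms_for_page(["Travel/Lodging", "Shopping"], lexicon, parent)
```

In the example, "Travel/Lodging" has only one term, so its parent "Travel" is used instead before merging with "Shopping". The merged weights can then stand in for a page's own query-term frequencies when selecting significant words or updating the term-sentence matrix.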

13 Experiments (1/6)
Data Set
–The clickthrough data was collected from the MSN search engine
–A set of Web pages from the ODP directory was crawled
–This yielded 1,125,207 Web pages, 260,763 of which were clicked by Web users using 1,586,472 different queries
–Two different data sets were used for the experiments:
  (1) DAT1: 90 pages selected from the 260,763 browsed pages. Three human evaluators were employed to summarize these pages

14 Experiments (2/6)
Data Set
–Two different data sets were used for the experiments:
  (2) DAT2: 10,000 pages randomly selected from the 260,763 browsed pages. The description of each page, which is provided by the page editor to give a general description of the page, is also extracted and used as the ideal summary
Performance Evaluation
–Precision, Recall and F1
–ROUGE evaluation: N=1
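
Both evaluation measures are simple overlap counts. The sketch below assumes extracts are compared as sets of sentence indices for precision, recall and F1, and that ROUGE-1 is computed as unigram recall against a single reference summary; the function names and toy inputs are hypothetical, and the official ROUGE toolkit has options (stemming, stop words, multiple references) that are not modeled here.

```python
from collections import Counter

def precision_recall_f1(selected, reference):
    """Compare extracted sentence ids with the ids chosen in the ideal summary."""
    selected, reference = set(selected), set(reference)
    hits = len(selected & reference)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def rouge_1_recall(candidate_tokens, reference_tokens):
    """Share of reference unigrams that also occur in the candidate (clipped)."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / sum(ref.values()) if ref else 0.0

p, r, f1 = precision_recall_f1(selected=[0, 2, 5], reference=[0, 1, 2, 3])
rouge = rouge_1_recall(
    "cheap hotel deals in paris".split(),
    "hotel deals and hotel reviews".split(),
)
```

In the example, two of the three extracted sentences are in the four-sentence ideal summary, and two of the five reference unigrams are matched by the candidate.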

15 Experiments (3/6)
Experimental Results and Analysis
–On DAT1: (1) To investigate whether the adapted summarizers can benefit from the query terms associated with each page

16 Experiments (4/6)
Experimental Results and Analysis
–On DAT1: (2) To evaluate the proposed summarization methods using the thematic lexicon approach

17 Experiments (5/6)
Experimental Results and Analysis
–On DAT2:
  Only the ROUGE-1 measure is used for evaluation
  Since the descriptions are commonly short and the ROUGE-1 measure is recall-based, the summarization results are relatively poor
  The thematic lexicon-based methods still lead to better summaries than summarizers based on local textual content

18 Experiments (6/6)
Discussions
–ICF-based re-weighting is found to help discover the topic terms of a specific category
–The results support the hypothesis that clickthrough data can complement the textual content of Web pages for summarization tasks

19 Conclusions and Future Work
Extra knowledge from clickthrough data is leveraged to improve Web-page summarization
It would be interesting to propose a method that determines the parameters automatically
Future work also includes studying how to leverage other types of knowledge
