
1 Web-Page Summarization Using Clickthrough Data* JianTao Sun, Yuchang Lu, Dept. of Computer Science, TsingHua University, Beijing 100084, China; Dou Shen, Qiang Yang, Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, HK; HuaJun Zeng, Zheng Chen, Microsoft Research Asia, 5F, Sigma Center, 49 Zhichun Road, Beijing 100080, China. Presenter: Chen Yi-Ting

2 References
JianTao Sun, Yuchang Lu, Dou Shen, Qiang Yang, HuaJun Zeng, Zheng Chen, "Web-Page Summarization Using Clickthrough Data", SIGIR'05, August 15-19, 2005.
H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159-165, 1958.

3 Outline
Introduction
Summarize Web Pages using Clickthrough Data
–Empirical study on clickthrough data
–Adapted web-page summarization methods
–Summarize web pages not covered by clickthrough data
Experiments
Conclusions and future work

4 Introduction (1/2)
Why summarize web pages?
Web-page summaries can be abstracts or extracts
A Web-page summary can also be either generic or query-dependent
–A query-dependent summary presents the information most relevant to the initial query
–A generic summary gives an overall sense of the document's content
–A generic summary should meet two conditions: maintain wide coverage of the page's topics while keeping redundancy low
This paper focuses on extract-based generic Web-page summarization
The objective of this research is to utilize extra knowledge to improve Web-page summarization
Clickthrough data contains users' knowledge of Web pages' content
A user's query words often reflect the true meaning of the target Web page's content

5 Introduction (2/2)
This is a challenging task:
–Web pages may have no associated query words, since they are not visited by Web users through the search engine
–The clickthrough data are noisy
In this paper, a thematic hierarchy of query terms is constructed
The thematic lexicon can complement the scarcity of Web-page content even when no clickthrough data was collected for these pages
The method can help filter out noise in the query words of an individual Web page through the use of statistics over all Web pages of its category
Two text-summarization methods are adapted to summarize Web pages
–The first approach is based on significant-word selection, adapted from Luhn's method
–The second is based on Latent Semantic Analysis (LSA)

6 Summarize web pages using clickthrough data (1/7)
Empirical study on clickthrough data
–Consider the typical search scenario: a user (u) submits a query (q) to a search engine, the search engine returns a ranked list of Web pages, and the user clicks on the pages (p) of interest
–Clickthrough data can be represented as a set of (u, q, p) triples
–The clickthrough data records how Web users find information through queries
–The collection of queries is assumed to reflect the topics of the target Web page well
–Two experiments:
To investigate whether the query words are related to the topics of the Web page (45.5% of keywords occur in the query words; 13.1% of query words appear as keywords)
To give evidence that clickthrough data is helpful for summarizing Web pages
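The triple representation above can be sketched as follows; the field names and whitespace tokenization are illustrative assumptions, not the paper's exact data format.

```python
from collections import defaultdict

# Each clickthrough record is a triple (u, q, p): user u submitted
# query q and clicked page p. Aggregating over all triples yields,
# for each page, the collection of query terms users used to reach it.
def collect_query_terms(triples):
    page_terms = defaultdict(list)
    for user, query, page in triples:
        page_terms[page].extend(query.lower().split())
    return dict(page_terms)

# Hypothetical sample records for illustration.
triples = [
    ("u1", "web summarization", "p1"),
    ("u2", "clickthrough data", "p1"),
    ("u3", "latent semantic analysis", "p2"),
]
terms = collect_query_terms(triples)
```

The resulting per-page query-term collections are the extra knowledge the adapted summarizers consume.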

7 Summarize web pages using clickthrough data (2/7)
Adapted Web-page Summarization Methods (suppose we now have a set of query terms for each page):
–Adapted Significant Word (ASW) Method
The first summarization method is adapted from Luhn's algorithm, a classical algorithm designed for text summarization
In Luhn's method, each sentence is assigned a significance factor, and the sentences with high significance factors are selected to form the summary
First, a set of significant words is constructed (according to word frequency in the document)
Then the significance factor of a sentence is computed as follows:
(1) Set a limit L for the distance at which any two significant words can be considered significantly related
(2) Find the portion of the sentence that is bracketed by significant words not more than L non-significant words apart
(3) Count the number of significant words contained in the portion and divide the square of this number by the total number of words within the portion
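Steps (1)-(3) can be sketched as follows; this is a minimal reading of Luhn's procedure, and the detail of scoring every cluster and keeping the best is an assumption where the slide is ambiguous.

```python
# Luhn's significance factor for one sentence: scan clusters of
# significant words separated by at most L non-significant words,
# and score each cluster as
#   (number of significant words)^2 / (cluster length in words).
def significance_factor(sentence_words, significant, L=4):
    positions = [i for i, w in enumerate(sentence_words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start, count, prev = positions[0], 1, positions[0]
    for pos in positions[1:]:
        if pos - prev - 1 <= L:        # still within the same cluster
            count += 1
        else:                          # close the cluster and score it
            best = max(best, count * count / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    best = max(best, count * count / (prev - start + 1))
    return best
```

For example, a four-word span containing two significant words one word apart scores 2² / 3.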

8 Summarize web pages using clickthrough data (3/7)
Adapted Web-page Summarization Methods:
–Adapted Significant Word (ASW) Method
To customize this procedure to leverage query terms for Web-page summarization, the significant-word selection method is modified
The basic idea is to use both the local contents of a Web page and the query terms collected from the clickthrough data to decide whether a word is significant
After the significance factors of all words are calculated, rank them and select the top N% as significant words
Then Luhn's algorithm is employed to compute the significance factor of each sentence
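The modified selection step can be sketched as below. The exact way local term frequency and query-term frequency are combined is not given in the slides; the weighted sum with parameter alpha is a hypothetical stand-in.

```python
from collections import Counter

# ASW significant-word selection sketch: score each word by a blend of
# its local frequency on the page and its frequency among the page's
# clickthrough query terms, then keep the top N%. The alpha blend is
# an assumption, not the paper's formula.
def select_significant_words(page_words, query_words, top_percent=10, alpha=0.5):
    tf = Counter(page_words)               # local content frequency
    qf = Counter(query_words)              # clickthrough query frequency
    vocab = set(tf) | set(qf)
    score = {w: alpha * tf[w] + (1 - alpha) * qf[w] for w in vocab}
    ranked = sorted(vocab, key=lambda w: score[w], reverse=True)
    n = max(1, len(ranked) * top_percent // 100)   # keep the top N%
    return set(ranked[:n])
```

The selected set then feeds directly into the sentence-level significance factor from Luhn's algorithm.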

9 Summarize web pages using clickthrough data (4/7)
Adapted Web-page Summarization Methods:
–Adapted Latent Semantic Analysis (ALSA) Method
Gong et al. proposed an extraction-based summarization algorithm:
–First, a term-sentence matrix is constructed from the original text document
–Next, LSA is conducted on the matrix
–In the last step, a document summary is produced incrementally
The proposed LSA-based summarization method is a variant of Gong's method:
–It utilizes the query-word knowledge by changing the term-sentence matrix: if a term occurs as a query word, its weight is increased according to its frequency in the query-word collection
–It is expected to extract sentences whose topics are related to the ones reflected by the query words
–The term frequency vector of each sentence can be weighted by different weighting (global and local) and normalization methods
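The three steps of Gong-style extraction can be sketched as follows: build a term-sentence matrix, run SVD, and for each leading right singular vector (one "concept") pick the sentence with the largest component. The fallback when several concepts point at the same sentence is an addition, not part of the described method.

```python
import numpy as np

# Sketch of LSA-based extraction a la Gong et al., using a raw
# term-frequency term-sentence matrix (no weighting/normalization).
def lsa_summarize(sentences, num_sentences=2):
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in s.split():
            A[index[w], j] += 1.0          # raw term frequency
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    chosen, picked = [], set()
    for k in range(vt.shape[0]):           # one sentence per concept
        j = int(np.argmax(np.abs(vt[k])))
        if j not in picked:
            picked.add(j)
            chosen.append(j)
        if len(chosen) == num_sentences:
            break
    for j in range(len(sentences)):        # fill up if concepts repeated
        if len(chosen) == num_sentences:
            break
        if j not in picked:
            picked.add(j)
            chosen.append(j)
    return [sentences[j] for j in sorted(chosen)]
```

The ALSA variant differs only in how the matrix entries are weighted before the SVD.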

10 Summarize web pages using clickthrough data (5/7)
Adapted Web-page Summarization Methods:
–Adapted Latent Semantic Analysis (ALSA) Method
In this paper, a term frequency (TF) approach without weighting or normalization is used to represent the sentences in Web pages
Terms in a sentence are augmented by query terms as follows:
Advantages of the adapted methods:
–The extra knowledge of query terms is utilized to help select significant words and to modify the page representation
–The approach can, to some extent, handle the noise in the query words
–Finally, the ASW approach avoids a problem with Luhn's method: the frequency-cutoff method may produce too many significant words for long pages
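The augmentation formula itself is not shown on the slide. A plausible, purely hypothetical form consistent with "its weight is increased according to its frequency in the query-word collection" is sketched here.

```python
from collections import Counter

# Hypothetical augmentation (the slide omits the actual formula):
# each term's in-sentence frequency is increased by its frequency
# in the page's query-word collection.
def augmented_tf(sentence_words, query_words):
    tf = Counter(sentence_words)
    qf = Counter(query_words)
    return {w: tf[w] + qf[w] for w in tf}
```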

11 Summarize web pages using clickthrough data (6/7)
Summarize Web Pages Not Covered by Clickthrough Data
–Build a hierarchical lexicon using the clickthrough data and apply it to help summarize those pages
–All ODP Web pages have been manually organized into a hierarchical taxonomy
–For each category of the taxonomy, the lexicon contains all query terms that users have submitted to browse Web pages of that category
–The lexicon is built as follows: first, the term set (TS) corresponding to each category is set empty; next, for each page covered by the clickthrough data, its query words are added into the TS of its categories; at last, each term weight in each TS is multiplied by its Inverse Category Frequency (ICF)
–For each Web page to be summarized, first look up the lexicon for the TS according to the page's category
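The lexicon-building procedure can be sketched as follows. The concrete ICF form log(|C| / cf(t)), with cf(t) the number of categories whose TS contains t, is an assumption by analogy with IDF.

```python
import math
from collections import Counter, defaultdict

# Accumulate query terms into the term set (TS) of each clicked page's
# category, then re-weight each term by its Inverse Category Frequency.
# ICF is assumed to be log(|C| / cf(t)); terms appearing in every
# category thus get weight zero.
def build_lexicon(pages):
    # pages: list of (category, query_terms) pairs for clicked pages
    ts = defaultdict(Counter)
    for category, terms in pages:
        ts[category].update(terms)
    num_cats = len(ts)
    cf = Counter()
    for counter in ts.values():
        cf.update(set(counter))
    return {
        cat: {t: w * math.log(num_cats / cf[t]) for t, w in counter.items()}
        for cat, counter in ts.items()
    }
```

Under this assumed ICF, a term shared by all categories is down-weighted to zero while category-specific terms keep their weight.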

12 Summarize web pages using clickthrough data (7/7)
Summarize Web Pages Not Covered by Clickthrough Data
–Weights of the terms in a TS can be used to select significant words or to update the term-sentence matrix
If a page to be summarized has multiple categories, the corresponding TSs are merged together and their weights are averaged
When a TS does not have sufficient terms, the TS corresponding to its parent category is used
–Two advantages:
First, the category-specific TS provides a distribution of topic terms in the category
Second, noisy terms that may be relatively frequent in one page's query words will be given a low weight through the use of statistics over all Web pages of the category
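The lookup step with merging, averaging, and parent fallback can be sketched as below. Encoding categories as "/"-separated paths (so that "Arts/Music" has parent "Arts") and the minimum-size threshold are assumptions for illustration.

```python
# TS lookup for a page to be summarized: merge the TSs of its
# categories and average the weights; when a TS has too few terms,
# fall back to the parent category (assumed "/"-separated path).
def lookup_ts(lexicon, categories, min_terms=2):
    merged = {}
    for cat in categories:
        ts = lexicon.get(cat, {})
        if len(ts) < min_terms and "/" in cat:      # fall back to parent
            ts = lexicon.get(cat.rsplit("/", 1)[0], ts)
        for term, w in ts.items():
            merged.setdefault(term, []).append(w)
    return {t: sum(ws) / len(ws) for t, ws in merged.items()}
```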

13 Experiments (1/6)
Data Set
–The clickthrough data was collected from the MSN search engine
–A set of Web pages from the ODP directory was crawled
–1,125,207 Web pages were obtained, 260,763 of which were clicked by Web users using 1,586,472 different queries
–Two different data sets were used in the experiments:
(1) DAT1 consists of 90 pages selected from the 260,763 browsed pages. Three human evaluators were employed to summarize these pages

14 Experiments (2/6)
Data Set
–Two different data sets were used in the experiments:
(2) DAT2: 10,000 pages are randomly selected from the 260,763 to constitute the DAT2 data set. The ODP description of each page, provided by the page editor to give a general description of the page, is also extracted and used as the ideal summary
Performance Evaluation
–Precision, Recall, and F1
–ROUGE evaluation: N = 1
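The ROUGE-1 measure used here can be sketched as below: a recall-oriented unigram overlap between a candidate summary and the reference (ideal) summary. Whitespace tokenization and lowercasing are simplifications of the full ROUGE toolkit.

```python
from collections import Counter

# ROUGE-N with N=1: the fraction of reference unigrams (counted with
# multiplicity) that also appear in the candidate summary.
def rouge_1(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())
```

Because the denominator is the reference length, short ideal summaries (like ODP descriptions) make high scores hard to reach, which is relevant to the DAT2 results below.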

15 Experiments (3/6)
Experimental Results and Analysis
–On DAT1: (1) To investigate whether the adapted summarizers can benefit from the query terms associated with each page

16 Experiments (4/6)
Experimental Results and Analysis
–On DAT1: (2) To evaluate the proposed summarization methods using the thematic lexicon approach

17 Experiments (5/6)
Experimental Results and Analysis
–On DAT2: Only the ROUGE-1 measure is used for evaluation
Since the descriptions are commonly short and the ROUGE-1 measure is recall-based, the summarization results are relatively poor
The thematic lexicon-based methods can still lead to better summaries compared with summarizers based only on local textual content

18 Experiments (6/6)
Discussions
–ICF-based re-weighting is found to help discover topic terms of a specific category
–The results verify the hypothesis that clickthrough data can complement the textual contents of Web pages for summarization tasks

19 Conclusions and Future Work
Extra knowledge from clickthrough data is leveraged to improve Web-page summarization
It would be interesting to propose a method to determine the parameters automatically
To study how to leverage other types of knowledge


