
1 SIGIR 2004: Web-page Classification through Summarization
Dou Shen, Zheng Chen*, Qiang Yang
Presentation: Yao-Min Huang
Date: 09/15/2004

2 Outline
- Motivation
- Related work
- Architecture overview
- Summarizer methods (1~4)
- Experiments
- Conclusion
- Future work

3 Motivation
- To help web users find the desired information:
  - Browse: navigate through hierarchical collections
  - Search: submit a query to a search engine
- Much work has been done on Web-page classification (e.g., hyperlink-based approaches).
- Summarization is a good method to filter the noise from a web page.

4 Related Work
Overview of summarization
- Goal of summarization: summary generation methods seek to identify document contents that convey the most "important" information within the document.
- Types of summarization:
  - Indicative vs. informative
  - Extraction vs. abstraction
  - Generic vs. query-oriented
  - Unsupervised vs. supervised
  - Single-document vs. multi-document

5 Related Work (Cont.) -- Summarization in IR
- Methods:
  - Unsupervised methods: cluster and select
  - Supervised methods
- Applications:
  - Generic summaries for indexing in information retrieval (Tetsuya Sakai, SIGIR 2001)
  - Term selection in relevance feedback for IR (A. M. Lam-Adesina, SIGIR 2001)

6 Architecture Overview
(Diagram) Training and testing pages are first summarized, using Luhn's method, LSA, the supervised summarizer, page-layout analysis, or human-authored summaries, plus an ensemble of the automatic summarizers. The resulting training and testing summaries are fed to a classifier (NB/SVM), and classification is evaluated with 10-fold cross validation.

7 Summarizer 1: Adapted Luhn's Method (IBM Journal 1958)
- Assumption: the more "significant words" there are in a sentence and the closer together they are, the more meaningful the sentence is.
  - A word is significant if its frequency lies between a high-frequency cutoff and a low-frequency cutoff.
- Approach: the sentences with the highest significance factor are selected to form the summary.
- Example: — — — [ # — — # — # # # ] — — — #
  - Significance factor = 5*5/8 = 3.125 (5 significant words (#) within a bracketed span of 8 words).
  - Limit L = 2: two significant words are considered significantly related if no more than L non-significant words separate them; this determines the bracketed span (the trailing # falls outside it).
(A minimal code sketch of the significance factor follows.)
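Below is a minimal sketch of the significance-factor computation in Python; the tokenization, the significant-word set, and the limit L = 2 are illustrative assumptions rather than the authors' exact settings.

```python
def significance_factor(sentence_tokens, significant_words, limit=2):
    """Luhn-style score: (# significant words in a bracketed span)^2 / span length.

    A span runs from one significant word to a later one, provided consecutive
    significant words are separated by at most `limit` non-significant words.
    """
    flags = [tok in significant_words for tok in sentence_tokens]
    best = 0.0
    start = None   # index where the current span opened
    count = 0      # significant words in the current span
    gap = 0        # non-significant words since the last significant word
    for i, is_sig in enumerate(flags):
        if is_sig:
            if start is None:
                start, count, gap = i, 1, 0
            else:
                count, gap = count + 1, 0
            span_len = i - start + 1
            best = max(best, count * count / span_len)
        elif start is not None:
            gap += 1
            if gap > limit:            # too many plain words: close the span
                start, count, gap = None, 0, 0
    return best

# Example from the slide: [ # - - # - # # # ] has 5 significant words in a
# span of 8 tokens, so the factor is 5*5/8 = 3.125.
sent = ["w", "w", "w", "S", "w", "w", "S", "w", "S", "S", "S", "w", "w", "w", "S"]
print(significance_factor(sent, {"S"}))  # 3.125
```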

8 Summarizer 1: Adapted Luhn's Method (Cont.)
- Original method: build the significant-word pool from the web page itself, score the sentences in this page against it, and extract the summary.
- Adapted method: build one significant-word pool per category (Cat 1 ... Cat m) from the training pages.
  - For a training page, its sentences are scored against the pool of the page's own category to produce the summary.
  - For a testing page, sentence scores are averaged over the pools of all categories to produce the summary.

9 Summarizer 2: Latent Semantic Analysis (SIGIR 2001)
- A fully automatic mathematical/statistical technique for extracting and inferring relations of expected contextual usage of words in passages of discourse.
- Overview: given an m x n term-by-sentence matrix A, compute the singular value decomposition A = U Σ V^T, where
  - Σ = diag(σ1, ..., σr, 0, ..., 0), the sorted singular values, with r = rank(A);
  - U (m x n): its column vectors are the left singular vectors (salient patterns among the terms);
  - V (n x n): its column vectors are the right singular vectors (salient patterns among the sentences).

10 Summarizer 2: Latent Semantic Analysis (Cont.)
(Diagram: the decomposition A = U Σ V^T illustrated as a product of matrices.)
- Select the sentence that has the largest index value in the right singular vector (a column vector of V).
(A small numpy sketch follows.)
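A small numpy sketch of this selection step, assuming (as in Gong & Liu's SIGIR 2001 procedure) that one sentence is picked per leading right singular vector; the toy sentences and the binary term weighting are illustrative assumptions.

```python
import numpy as np

sentences = [
    "web page classification with summarization",
    "latent semantic analysis extracts salient sentences",
    "naive bayes and svm classifiers are trained on summaries",
]

# Term-by-sentence matrix A (rows = terms, columns = sentences), binary weights.
vocab = sorted({w for s in sentences for w in s.split()})
A = np.array([[1.0 if w in s.split() else 0.0 for s in sentences] for w in vocab])

# A = U * Sigma * V^T ; the rows of Vt are the right singular vectors.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of sentences to extract
summary_idx = []
for i in range(k):
    # Pick the sentence with the largest (absolute) index value in the i-th
    # right singular vector, skipping sentences already chosen.
    order = np.argsort(-np.abs(Vt[i]))
    chosen = next(j for j in order if j not in summary_idx)
    summary_idx.append(chosen)

for j in sorted(summary_idx):
    print(sentences[j])
```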

11 Summarizer 3: Summarization by Page Layout Analysis (WWW10, 2001)
- In HTML content, a BO (Basic Object) is a non-breakable element within two tags or an embedded object.

12 Summarizer 3: Summarization by Page Layout Analysis
- Analyze the structure of Web pages
- Compute the similarity graph between objects
  - Nodes = objects
  - Weight of edge = similarity
- Get the core object
- Extract the content body (CB)
(Figure: example page regions such as the header, search box, main body, navigation list, and copyright notice.)

13 Summarizer 3: Summarization by Page Layout Analysis (Cont.)
Content Body (CB) detection algorithm:
1. Consider each selected object as a single document and build the TF*IDF index for the object.
2. Calculate the similarity between any two objects using the cosine similarity, and add an edge between them if their similarity is greater than a threshold.
3. The core object is defined as the object having the most edges.
4. Extract the CB as the combination of all objects that have edges linked to the core object.
Summary: all sentences that are included in the content body make up the summary of the Web page. (A rough code sketch follows.)
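A rough sketch of the content-body detection loop described above, using scikit-learn for the TF*IDF vectors and cosine similarities; the threshold value and the toy objects are illustrative assumptions, not the paper's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extract_content_body(objects, threshold=0.05):
    """objects: list of text blocks (objects of a page).
    Returns the indices of the content body (core object plus its neighbors)."""
    tfidf = TfidfVectorizer().fit_transform(objects)   # one TF*IDF vector per object
    sim = cosine_similarity(tfidf)

    # Add an edge between two objects if their similarity exceeds the threshold.
    n = len(objects)
    edges = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > threshold:
                edges[i].add(j)
                edges[j].add(i)

    core = max(edges, key=lambda i: len(edges[i]))     # object with the most edges
    return sorted({core} | edges[core])

objects = ["site navigation links home contact",
           "web page classification through summarization",
           "summarization removes noise and keeps the main topic",
           "copyright 2004 all rights reserved"]
print(extract_content_body(objects))
```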

14 Summarizer 3: Summarization by Page Layout Analysis (Cont.)
Example similarity matrix between objects:

       Obj1   Obj2   Obj3   Obj4
Obj1   1.00   0.03   0.08   0.00
Obj2          1.00   0.15   0.00
Obj3                 1.00   0.02
Obj4                        1.00

(Figure: the corresponding similarity graph over objects 1-4, from which the core object and the content body are identified.)

15 Naïve Bayes Classifier (lecture on ML & DM, Berlin 2004)
- Assume a target function f: X -> V, where each instance x is described by attributes (a1, a2, ..., an).
- The most probable value of f(x) is
  v_MAP = argmax_{vj in V} P(vj | a1, ..., an) = argmax_{vj in V} P(a1, ..., an | vj) P(vj)
- Naïve Bayes assumption: P(a1, ..., an | vj) = prod_i P(ai | vj)
- Naïve Bayes classifier: predict the target value/classification
  v_NB = argmax_{vj in V} P(vj) prod_i P(ai | vj)

16 Naïve Bayes: Example
Given a data set Z with 3-dimensional Boolean examples, train a naïve Bayes classifier to predict the classification. What is the predicted probability?

Attribute A   Attribute B   Attribute C   Classification D
F             T             F             T
F             F             T             T
T             F             F             T
T             F             F             F
F             T              T             F
F             F              T             F

17 Naïve Bayes: Example (Cont.)
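The worked solution on this slide is not recoverable from the transcript, so below is a minimal sketch that estimates the naïve Bayes probabilities from the toy data and scores a hypothetical query (A=F, B=T, C=F); the query itself is an assumption, since the original question is not shown.

```python
# Toy dataset from the slide: (A, B, C, D)
data = [("F","T","F","T"), ("F","F","T","T"), ("T","F","F","T"),
        ("T","F","F","F"), ("F","T","T","F"), ("F","F","T","F")]

def naive_bayes_predict(query):
    """Return unnormalized P(D=v) * prod_i P(attr_i | D=v) for v in {T, F}."""
    scores = {}
    for v in ("T", "F"):
        rows = [r for r in data if r[3] == v]
        prior = len(rows) / len(data)
        likelihood = 1.0
        for i, a in enumerate(query):           # attributes A, B, C
            count = sum(1 for r in rows if r[i] == a)
            likelihood *= count / len(rows)     # maximum-likelihood estimate, no smoothing
        scores[v] = prior * likelihood
    return scores

scores = naive_bayes_predict(("F", "T", "F"))   # hypothetical query
total = sum(scores.values())
for v, s in scores.items():
    print(f"P(D={v} | A=F, B=T, C=F) ~ {s / total:.3f}")
```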

18 Summarizer 4: Supervised Summarization (SIGIR 1995)
Features, given a sentence S_i:
- F_i1: the position of sentence S_i in a certain paragraph
- F_i2: the length of sentence S_i
- F_i3: Σ_w TF_w * SF_w (term frequency times sentence frequency)
- F_i4: the similarity (cosine) between S_i and the title
- F_i5: the similarity between S_i and all text in the page
- F_i6: the similarity between S_i and the meta-data in the page
- F_i7: the number of occurrences in S_i of words from a special word set (italic, bold, or underlined words)
- F_i8: the average font size of the words in S_i
(A simplified feature-extraction sketch follows.)
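A simplified sketch computing a few of the listed features (position, length, TF*SF, cosine similarity to the title); the HTML-dependent features (meta-data, special word set, font size) are omitted, the position is taken within the whole document rather than a paragraph, and the helper names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def sentence_features(sentences, title):
    """Return a partial feature vector (F1, F2, F3, F4) for each sentence."""
    bags = [Counter(s.lower().split()) for s in sentences]
    sf = Counter(w for bag in bags for w in bag)       # SF_w: sentences containing word w
    title_bag = Counter(title.lower().split())
    feats = []
    for pos, bag in enumerate(bags):
        f1 = pos + 1                                   # F_i1: position of the sentence
        f2 = sum(bag.values())                         # F_i2: length of the sentence
        f3 = sum(tf * sf[w] for w, tf in bag.items())  # F_i3: sum of TF_w * SF_w
        f4 = cosine(bag, title_bag)                    # F_i4: similarity to the title
        feats.append((f1, f2, f3, f4))
    return feats

print(sentence_features(["web page classification helps users",
                         "summarization removes noise"],
                        "web page classification through summarization"))
```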

19 Summarizer 4: Supervised Summarization (Cont.)
Classifier: each sentence is assigned a score by a naïve Bayes formula of the form
  P(s ∈ S | F_1, ..., F_k) = ( Π_j P(F_j | s ∈ S) ) * P(s ∈ S) / Π_j P(F_j)
where
- P(s ∈ S) stands for the compression rate (the prior probability that a sentence is included in the summary);
- P(F_j) is the probability of each feature j;
- P(F_j | s ∈ S) is the conditional probability of each feature j given that the sentence is in the summary.
(A toy scoring sketch follows.)
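A minimal sketch of this sentence scoring, assuming the features have already been discretized; the probability tables below are toy values, not estimates from the paper's corpus.

```python
def sentence_score(feature_values, p_feature_given_summary, p_feature, compression_rate):
    """P(s in S | F_1..F_k) proportional to P(s in S) * prod_j P(F_j | s in S) / P(F_j)."""
    score = compression_rate                    # P(s in S): the compression rate prior
    for j, value in enumerate(feature_values):
        score *= p_feature_given_summary[j][value] / p_feature[j][value]
    return score

# Toy probability tables for two discretized features with values "low"/"high".
p_f_given_s = [{"low": 0.2, "high": 0.8}, {"low": 0.4, "high": 0.6}]
p_f = [{"low": 0.5, "high": 0.5}, {"low": 0.5, "high": 0.5}]

print(sentence_score(("high", "high"), p_f_given_s, p_f, compression_rate=0.2))
```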

20 Ensemble of Summarizers
- The final score for each sentence is calculated by summing the individual scores obtained from each summarization method, weighted by w_i.
- Schema 1: the weight of each summarization method is set in proportion to its performance (its micro-F1 value).
- Schemas 2-5: the value of w_i (i = 1, 2, 3, 4) is raised to 2 in Schema 2-5 respectively, keeping the other weights at one.
(A small combination sketch follows.)
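A small sketch of the weighted combination: the ensemble score of a sentence is the weighted sum of its scores from the individual summarizers. The weights and scores below are illustrative, not the paper's tuned values, and the per-method scores are assumed to be comparable (e.g., already normalized).

```python
def ensemble_scores(per_method_scores, weights):
    """per_method_scores: {method: [score per sentence]}; returns combined scores."""
    n = len(next(iter(per_method_scores.values())))
    return [sum(weights[m] * per_method_scores[m][i] for m in per_method_scores)
            for i in range(n)]

scores = {"luhn": [0.9, 0.1, 0.4],
          "lsa": [0.7, 0.3, 0.2],
          "content_body": [1.0, 0.0, 1.0],
          "supervised": [0.6, 0.2, 0.5]}

# Schema-2-like weighting: double the first method's weight, keep the others at one.
weights = {"luhn": 2.0, "lsa": 1.0, "content_body": 1.0, "supervised": 1.0}
print(ensemble_scores(scores, weights))
```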

21 Experiment Setup
- Dataset:
  - 2 million Web pages from the LookSmart Web directory.
  - 500 thousand pages with manually created descriptions.
  - Randomly sampled 30% (153,019 pages).
  - Distributed among 64 categories (only the top two levels of categories on the LookSmart Web site).
- Classifiers: NB (naïve Bayes) and SVM (support vector machine).
- Evaluation:
  - 10-fold cross validation
  - Precision, recall, F1
  - Micro-averaging (gives equal weight to every document) vs. macro-averaging (a sketch of both is given below)
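For reference, a short sketch of micro- vs. macro-averaged F1 as used in the evaluation: micro-averaging pools the per-category counts (equal weight per document), while macro-averaging averages the per-category F1 scores (equal weight per category). The counts below are made up.

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(per_category_counts):
    """per_category_counts: list of (tp, fp, fn) tuples, one per category."""
    micro = f1(sum(c[0] for c in per_category_counts),
               sum(c[1] for c in per_category_counts),
               sum(c[2] for c in per_category_counts))
    macro = sum(f1(*c) for c in per_category_counts) / len(per_category_counts)
    return micro, macro

# Toy counts for three categories: (true positives, false positives, false negatives)
print(micro_macro_f1([(80, 20, 30), (10, 5, 15), (5, 10, 5)]))
```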

22 Experiment 1: Feasibility Study
- Baseline: the text remaining after removing the HTML tags.
- A human-authored summary is used as the ideal summary for the page.
- Conclusion: a good summary clearly improves classification performance.
(Figure: with the human summary, micro-F1 improves by 14.8% for NB and 13.2% for SVM over the baseline.)

23 Experiment 2: Evaluation of the Automatic Summarizers
- The unsupervised methods give similar improvements.
- The unsupervised methods are better than the supervised method.
- None of the automatic methods is as good as the human summary.

NB:
            microP  microR  micro-F1
Baseline    70.7    57.7    63.6
Human       81.5    66.2    73.0
Summ1       77.9    63.3    69.8
Summ2       77.2    62.7    69.2
Summ3       75.9    61.7    68.1
Summ4       75.2    60.9    67.3

SVM:
            microP  microR  micro-F1
Baseline    72.4    59.3    65.1
Human       82.1    66.9    73.7
Summ1       77.3    62.8    69.3
Summ2       78.6    63.7    70.3
Summ3       79.2    64.3    71.0
Supervised  76.3    61.8    68.3

Summ1 = Luhn; Summ2 = Content Body; Summ3 = LSA; Summ4 = Supervised

24 Experiment 2: Evaluation of the Automatic Summarizers (Cont.)
- The ensemble of summarizers achieves an improvement similar to that of the human summary.
(Figure: micro-F1 improvement over the baseline: 14.8% (human) vs. 12.9% (ensemble) with NB, and 13.2% (human) vs. 11.5% (ensemble) with SVM.)

25 Experiment 3: Parameter Tuning -- Compression Rate
- The compression rate is the main parameter under consideration.
- Most of the automatic methods achieve their best result at a compression rate of 20% or 30%.

Performance of CB with different thresholds (NB):
Threshold   0.20      0.15      0.10      0.05
CB          65.0±0.5  67.0±0.4  69.2±0.4  66.7±0.3

Performance at different compression rates (NB):
            10%       20%       30%       50%
Luhn        66.1±0.5  69.8±0.5  67.4±0.4  64.5±0.3
LSA         66.3±0.6  67.0±0.5  68.1±0.5  63.4±0.3
Supervised  66.1±0.5  67.3±0.4  64.8±0.4  62.9±0.3
Hybrid      66.9±0.4  69.3±0.4  71.8±0.3  67.1±0.3

26 Experiment 3: Parameter Tuning -- Weight Schema
- Schema 1: the weight of each summarization method is set in proportion to its performance.
- Schemas 2-5: the value of w_i (i = 1, 2, 3, 4) is raised to 2 in Schema 2-5 respectively, keeping the other weights at one.

            microP    microR    micro-F1
Origin      80.2±0.3  65.0±0.3  71.8±0.3
Schema1     81.0±0.3  65.6±0.3  72.5±0.3
Schema2     81.3±0.4  66.1±0.4  72.9±0.4
Schema3     79.5±0.4  64.4±0.4  71.2±0.4
Schema4     81.1±0.3  65.5±0.3  72.5±0.3
Schema5     79.7±0.4  64.7±0.4  71.4±0.4

27 Analysis: Why Does Summarization Help?
- Summarization can extract the main topic of a Web page while removing noise.

     # of pages   Total size (k)   Average size/page (k)
A    100          3121.0           31.2
B    500          5450.0           10.9

- A: 100 Web pages that are correctly labeled by all our summarization-based approaches but wrongly labeled by the baseline system.
- B: 500 pages sampled randomly from the testing pages.
- Conclusion: summarization is especially helpful for large web pages.

28 Conclusion
- Summarization techniques can be helpful for classification.
- A new summarizer based on Web-page structure analysis.
- A modification to Luhn's method.
- New features for supervised Web-page summarization.

29 Future Work
- Improve the summarization performance:
  - Take hypertext/anchor text into consideration.
  - Use the hyperlink structure.
  - Use query logs.
- Apply summarization to other applications, e.g., clustering.

30 Thanks

