# Measuring Contribution of HTML Features in Web Document Clustering

CLEI 2007 1 Measuring Contribution of HTML Features in Web Document Clustering. Oldemar Rodríguez (School of Mathematics, UCR, and Predisoft); Esteban Meneses (Computing Research Center, ITCR).

CLEI 2007 2 Motivation

CLEI 2007 3 Motivation Which HTML feature contributes most to good clustering results? Using symbolic objects to cluster web documents. 15th World Wide Web Conference (2006)

CLEI 2007 HTML Document Clustering Find meaningful groups from a web document collection. Effectively represent web document clusters for further analysis.

CLEI 2007 5 HTML Document

7 Classical Representations Different approaches for representing a web document.

CLEI 2007 8 Vectorial Representation Every document is represented by a vector in n-dimensional space. Bag of words scheme. Each variable represents the relative weight of a term in the document.
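A minimal sketch of the bag-of-words scheme described on this slide, assuming relative term frequency as the weight. The slides do not say which weighting the authors used, and the vocabulary, the sample text, and the helper name `term_weights` are hypothetical.

```python
from collections import Counter
import re

def term_weights(text, vocabulary):
    """Return the relative frequency of each vocabulary term in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    # Each coordinate is the share of in-vocabulary tokens taken by that term.
    return [counts[t] / total if total else 0.0 for t in vocabulary]

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["web", "document", "clustering", "html"]
doc = "Web document clustering groups each web document by content."
vector = term_weights(doc, vocabulary)
print(vector)
```

Word order is discarded here: only the counts survive, which is the limitation the symbolic representation later in the talk is meant to address.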

CLEI 2007 9 Symbolic Objects Real-life objects are too complex to be represented by points in a vectorial space [Bock & Diday, 2000]. Symbolic objects overcome this limitation by representing concepts rather than individuals. In a symbolic data array, each variable can have one of many data types: sets, intervals, histograms, trees, graphs, functions, fuzzy data, etc.

CLEI 2007 Symbolic Data Table

CLEI 2007 Symbolic Data Table: from relational data bases to symbolic data bases

Data (millions of individuals): Multivariate Numeric Analysis

| Individual | Age | Profession | Wage | Location |
|---|---|---|---|---|
| 3457 | 36 | Lawyer | 2,500.00 | San José |
| 1251 | 28 | Teacher | 1,750.00 | Alajuela |
| 3245 | 39 | Doctor | 2,400.00 | San José |
| 7635 | 33 | Teacher | 1,900.00 | Alajuela |
| 3245 | 35 | Engineer | 1,850.00 | Alajuela |
| 5367 | 27 | Engineer | 1,900.00 | Heredia |
| 6486 | 34 | Manager | 1,600.00 | Heredia |

Concepts (hundreds of concepts): Multivariate Symbolic Analysis

| Concept | Age | Profession | Wage (thousands) |
|---|---|---|---|
| San José | [36, 39] | {Law 50%, Doc 50%} | [2.4, 2.5] |
| Alajuela | [28, 35] | {Tea 66%, Eng 33%} | [1.75, 1.9] |
| Heredia | [27, 34] | {Eng 50%, Mgn 50%} | [1.6, 1.9] |
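The step from individuals to concepts on this slide can be sketched as a group-by on location: numeric variables become intervals and the categorical variable becomes a histogram. This is a minimal illustration using the rows shown on the slide. The function name `to_symbolic` and the dictionary layout are assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict

# (age, profession, wage, location) rows taken from the slide.
rows = [
    (36, "Lawyer", 2500.0, "San José"),
    (28, "Teacher", 1750.0, "Alajuela"),
    (39, "Doctor", 2400.0, "San José"),
    (33, "Teacher", 1900.0, "Alajuela"),
    (35, "Engineer", 1850.0, "Alajuela"),
    (27, "Engineer", 1900.0, "Heredia"),
    (34, "Manager", 1600.0, "Heredia"),
]

def to_symbolic(rows):
    """Aggregate individuals into one symbolic description per location."""
    groups = defaultdict(list)
    for age, profession, wage, location in rows:
        groups[location].append((age, profession, wage))
    table = {}
    for location, members in groups.items():
        ages = [m[0] for m in members]
        wages = [m[2] for m in members]
        jobs = Counter(m[1] for m in members)
        table[location] = {
            "age": (min(ages), max(ages)),      # interval-valued variable
            "profession": {job: n / len(members) for job, n in jobs.items()},  # histogram
            "wage": (min(wages), max(wages)),   # interval-valued variable
        }
    return table

symbolic = to_symbolic(rows)
for concept, description in symbolic.items():
    print(concept, description)
```

Each concept keeps a summary of its members rather than the members themselves, which is why the symbolic table is so much smaller than the relational one.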

CLEI 2007 12 Relational Data Base: 100% knowledge, 15 Gigabyte. Symbolic Data Base: 90% knowledge, 10.3 Megabyte.

CLEI 2007 13 Symbolic Representations A complex representation that takes into account: term frequency, word order and phrases.

CLEI 2007 14 The K-Means Clustering Method

CLEI 2007 15 But there are some problems...

CLEI 2007 16 Distance Measures

CLEI 2007 17 Theorem (Fisher's equality): Total inertia = between-class inertia + within-class inertia.
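The equality in this theorem can be checked numerically for ordinary numeric vectors. The sketch below assumes squared Euclidean distance and equal weights for all points. The sample clusters and the helper names are hypothetical.

```python
def mean(points):
    """Center of gravity (coordinate-wise mean) of a list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inertias(clusters):
    """Return (total, between-class, within-class) inertia of a partition."""
    all_points = [p for c in clusters for p in c]
    n = len(all_points)
    g = mean(all_points)  # global center of gravity
    total = sum(sq_dist(p, g) for p in all_points) / n
    between = sum(len(c) * sq_dist(mean(c), g) for c in clusters) / n
    within = sum(sq_dist(p, mean(c)) for c in clusters for p in c) / n
    return total, between, within

# Hypothetical two-cluster partition of five points in the plane.
clusters = [
    [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5)],
    [(8.0, 9.0), (9.0, 8.0)],
]
total, between, within = inertias(clusters)
print(total, between + within)
```

Because the total is fixed for a given data set, minimizing within-class inertia is the same as maximizing between-class inertia.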

CLEI 2007 18 Problems in the symbolic case: 1. Representing a class by its center of gravity, that is, by its vector of means. 2. What is the center of gravity?

CLEI 2007 What is the center of gravity?

21 Evaluation Criteria 1.Rand Index 2.Mutual Information 3.F-Measure 4.Entropy
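Two of the listed criteria can be sketched from their standard textbook definitions: the Rand index as the fraction of point pairs on which two labelings agree, and entropy as the size-weighted average of the class entropy inside each cluster. The slides do not state which exact variants the authors used, so these formulas are an assumption, and the sample labels are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs treated the same way by both labelings."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def clustering_entropy(labels_true, labels_pred):
    """Size-weighted average entropy of the true classes within each cluster."""
    n = len(labels_true)
    total = 0.0
    for cluster in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == cluster]
        counts = Counter(members)
        h = -sum((c / len(members)) * log2(c / len(members)) for c in counts.values())
        total += len(members) / n * h
    return total

# Hypothetical ground-truth classes and cluster assignments.
truth = ["a", "a", "a", "b", "b", "b"]
found = [0, 0, 1, 1, 1, 1]
print(rand_index(truth, found), clustering_entropy(truth, found))
```

A higher Rand index is better, while lower entropy is better, so the two criteria should be read in opposite directions when comparing representations.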

CLEI 2007 22 Experiments

CLEI 2007 23 Experiments

CLEI 2007 24 Experiments

CLEI 2007 25 Experiments

CLEI 2007 26 Conclusions Symbolic representations are richer and more flexible than classical representations. The text in the HTML document seems to be the most important factor for clustering HTML documents.

CLEI 2007 27 Thank you!
