CLEI 2007 1 Measuring Contribution of HTML Features in Web Document Clustering Oldemar Rodríguez School of Mathematics, UCR and Predisoft Esteban Meneses Computing Research Center, ITCR
7 Classical Representations Different approaches for representing a web document.
8 Vectorial Representation Every document is represented by a vector in n-dimensional space. Bag of words scheme. Each variable represents the relative weight of a term in the document.
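The bag-of-words vector described on this slide can be sketched in a few lines of Python. This is only an illustration: the slide does not say how the "relative weight" is computed, so the sketch assumes plain relative term frequency (term count divided by the number of tokens in the document). The function name, the vocabulary, and the sample sentence are hypothetical.

```python
from collections import Counter


def relative_term_weights(text, vocabulary):
    """Map a document to a vector with one component per vocabulary term.

    Assumption: the weight is the term's relative frequency in the document.
    Other weighting schemes (for example TF-IDF) are equally possible.
    """
    tokens = text.lower().split()   # naive whitespace tokenizer
    counts = Counter(tokens)        # word order is discarded: a bag of words
    total = len(tokens)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[term] / total for term in vocabulary]


# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["web", "document", "clustering"]
vector = relative_term_weights(
    "Web document clustering groups each web document", vocabulary
)
print(vector)
```

Each position in `vector` corresponds to one vocabulary term, so two documents become comparable points in the same n-dimensional space.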
9 Symbolic Objects Real-life objects are too complex to be represented by points in a vectorial space. [Bock & Diday, 2000] Symbolic objects overcome this limitation by representing concepts rather than individuals. In a symbolic data array each variable can have one of many data types: sets, intervals, histograms, trees, graphs, functions, fuzzy data, etc.
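To make the idea of a symbolic data array concrete, here is a minimal Python sketch in which one row describes a group of documents (a concept) rather than a single document. It uses three of the variable types listed on the slide: a set, an interval, and a histogram. The class name, field names, and sample data are hypothetical and are not taken from the presentation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SymbolicDocument:
    """One row of a symbolic data array (hypothetical layout)."""
    keywords: frozenset     # set-valued variable
    length_range: tuple     # interval-valued variable: (min, max) token count
    term_histogram: dict    # histogram-valued variable: term -> relative frequency


def summarize(documents):
    """Aggregate several tokenized documents into one symbolic description."""
    lengths = [len(doc) for doc in documents]
    counts = Counter(token for doc in documents for token in doc)
    total = sum(counts.values())
    return SymbolicDocument(
        keywords=frozenset(counts),
        length_range=(min(lengths), max(lengths)),
        term_histogram={term: n / total for term, n in counts.items()},
    )


# Hypothetical group of two tokenized documents.
group = [["web", "clustering"], ["web", "html", "features"]]
concept = summarize(group)
print(concept)
```

A classical vector would hold one number per variable, while here each variable keeps the variation inside the group, which is the sense in which the object describes a concept rather than an individual.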
26 Conclusions Symbolic representations are richer and more flexible than classical representations. The text in the HTML document seems to be the most important factor for clustering HTML documents.