Published by Autumn MacKenzie. Modified over 10 years ago.
CLEI 2007
Measuring Contribution of HTML Features in Web Document Clustering
Oldemar Rodríguez, School of Mathematics, UCR and Predisoft
Esteban Meneses, Computing Research Center, ITCR
Motivation
Which HTML feature is the most important for obtaining good clustering results?
"Using symbolic objects to cluster web documents", 15th World Wide Web Conference (2006).
HTML Document Clustering
Find meaningful groups in a web document collection.
Effectively represent web document clusters for further analysis.
HTML Document
Classical Representations
Different approaches for representing a web document.
Vectorial Representation
Every document is represented by a vector in n-dimensional space.
Bag-of-words scheme.
Each variable represents the relative weight of a term in the document.
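The bag-of-words scheme can be illustrated with a minimal Python sketch. The slide does not say which weighting was used, so relative term frequency (term count divided by document length) is an assumption here, and the function name `relative_term_weights`, the whitespace tokenizer, and the two toy documents are hypothetical.

```python
from collections import Counter


def relative_term_weights(text, vocabulary):
    """Return one weight per vocabulary term: term count / total tokens.

    Assumption: plain relative frequency; the slides do not specify
    the weighting scheme actually used.
    """
    tokens = text.lower().split()  # naive whitespace tokenizer
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return [0.0 for _ in vocabulary]
    return [counts[term] / total for term in vocabulary]


# Toy documents, invented for illustration only.
docs = ["web document clustering", "symbolic web objects web"]

# The vocabulary fixes the n dimensions of the vector space.
vocabulary = sorted({tok for d in docs for tok in d.lower().split()})

# One n-dimensional vector per document.
vectors = [relative_term_weights(d, vocabulary) for d in docs]
```

Each document becomes a point in a space with one dimension per vocabulary term, and word order is discarded.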
Symbolic Objects
Real-life objects are too complex to be represented by points in a vectorial space. [Bock & Diday, 2000]
Symbolic objects overcome this limitation by representing concepts rather than individuals.
In a symbolic data array each variable can have one of many data types: sets, intervals, histograms, trees, graphs, functions, fuzzy data, etc.
Symbolic Data Table
From relational databases to symbolic databases.

Multivariate Numeric Analysis (data: millions of individuals)
Individual | Age | Profession | Wage | Location
3457 | 36 | Lawyer | 2,500.00 | San José
1251 | 28 | Teacher | 1,750.00 | Alajuela
3245 | 39 | Doctor | 2,400.00 | San José
7635 | 33 | Teacher | 1,900.00 | Alajuela
3245 | 35 | Engineer | 1,850.00 | Alajuela
5367 | 27 | Engineer | 1,900.00 | Heredia
6486 | 34 | Manager | 1,600.00 | Heredia

Multivariate Symbolic Analysis (concepts: hundreds)
Individual | Age | Profession | Wage
San José | [36,39] | {Law, 50%, Doc, 50%} | [2,4 – 2,5]
Alajuela | [28,35] | {Tea, 66%, Eng, 33%} | [1,75 – 1,9]
Heredia | [27,34] | {Eng, 50%, Mgn, 50%} | [1,6 – 1,9]
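The step from individuals to concepts in the table above can be sketched in Python. The rows are copied from the slide; grouping by location, taking the minimum and maximum as interval bounds, and using relative frequencies for the profession histogram are assumptions about how the symbolic table was derived, and `to_symbolic` is a hypothetical helper name.

```python
from collections import Counter

# (age, profession, wage, location) rows taken from the slide's table.
rows = [
    (36, "Lawyer", 2500.00, "San José"),
    (28, "Teacher", 1750.00, "Alajuela"),
    (39, "Doctor", 2400.00, "San José"),
    (33, "Teacher", 1900.00, "Alajuela"),
    (35, "Engineer", 1850.00, "Alajuela"),
    (27, "Engineer", 1900.00, "Heredia"),
    (34, "Manager", 1600.00, "Heredia"),
]


def to_symbolic(rows):
    """Aggregate individuals into one symbolic object per location.

    Assumed aggregation: numeric variables become [min, max] intervals,
    the categorical variable becomes a relative-frequency histogram.
    """
    groups = {}
    for age, profession, wage, location in rows:
        groups.setdefault(location, []).append((age, profession, wage))

    table = {}
    for location, members in groups.items():
        ages = [m[0] for m in members]
        wages = [m[2] for m in members]
        counts = Counter(m[1] for m in members)
        table[location] = {
            "age": (min(ages), max(ages)),
            "profession": {p: c / len(members) for p, c in counts.items()},
            "wage": (min(wages), max(wages)),
        }
    return table


symbolic = to_symbolic(rows)
```

Seven individuals collapse into three concepts, which is the size reduction the slide illustrates at a much larger scale.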
Symbolic Data Base
Relational database: 100% knowledge, 15 gigabytes.
Symbolic database: 90% knowledge, 10.3 megabytes.
Symbolic Representations
A complex representation that takes into account term frequency, word order and phrases.
The K-Means Clustering Method
But there are some problems…
Distance Measures
Theorem: Fisher's Equality
Total inertia = between-class inertia + within-class inertia
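Fisher's equality can be checked numerically on a small example. The sketch below assumes points in Euclidean space, uniform weights of 1/n, and squared Euclidean distance. The slide states only the decomposition itself, so these choices, the function names, and the toy data are illustrative assumptions.

```python
def mean(points):
    """Center of gravity (vector of means) of a list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inertias(points, labels):
    """Return (total, between-class, within-class) inertia.

    Assumption: every point has weight 1/n.
    """
    n = len(points)
    g = mean(points)  # global center of gravity
    total = sum(sqdist(p, g) for p in points) / n

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    between = 0.0
    within = 0.0
    for members in clusters.values():
        gk = mean(members)  # class center of gravity
        between += len(members) / n * sqdist(gk, g)
        within += sum(sqdist(p, gk) for p in members) / n
    return total, between, within


# Toy data: two well-separated classes of two points each.
points = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
labels = [0, 0, 1, 1]
total, between, within = inertias(points, labels)
# total equals between + within, up to floating-point error
```

Because total inertia is fixed for a given data set, minimizing within-class inertia (the k-means objective) is equivalent to maximizing between-class inertia.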
Problems in the symbolic case:
1. Representing a class by its center of gravity, that is, by its vector of means.
2. What is the center of gravity?
What is the center of gravity?
Evaluation Criteria
1. Rand Index
2. Mutual Information
3. F-Measure
4. Entropy
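As an illustration of the first criterion, here is a minimal sketch of the Rand index in its standard, unadjusted form: the fraction of document pairs on which two partitions agree. Whether the experiments used this form or an adjusted variant is not stated in the slides, and the function name and sample labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which two partitions agree.

    A pair counts as an agreement when both partitions put the two
    items in the same group, or both put them in different groups.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        return 1.0  # fewer than two items: nothing to disagree on
    agree = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)


# Hypothetical reference classes and clustering output for four documents.
reference = [0, 0, 1, 1]
clustering = [0, 0, 1, 2]
score = rand_index(reference, clustering)  # 5 of the 6 pairs agree
```

The index depends only on how items are grouped, so renaming the cluster labels does not change the score.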
Experiments
Conclusions
Symbolic representations are richer and more flexible than classical representations.
The text of the HTML document seems to be the most important factor for clustering HTML documents.
Thank you!