Published by Autumn MacKenzie. Modified over 2 years ago.

1
CLEI
Measuring Contribution of HTML Features in Web Document Clustering
Oldemar Rodríguez, School of Mathematics, UCR and Predisoft
Esteban Meneses, Computing Research Center, ITCR

2
Motivation

3
Motivation
Which HTML feature is the most important to provide good clustering results?
Using symbolic objects to cluster web documents. 15th World Wide Web Conference (2006)

4
CLEI 2007
HTML Document Clustering
Find meaningful groups from a web document collection.
Effectively represent web document clusters for further analysis.

5
HTML Document

7
Classical Representations
Different approaches for representing a web document.

8
Vectorial Representation
Every document is represented by a vector in n-dimensional space.
Bag-of-words scheme.
Each variable represents the relative weight of a term in the document.
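
The bag-of-words scheme can be illustrated with a minimal sketch. It assumes that "relative weight" means a term's count divided by the document's total term count; the slide does not say which weighting the authors used, and the function name `term_weights` and the two toy documents are hypothetical.

```python
import re
from collections import Counter


def term_weights(text, vocabulary):
    """Return the relative frequency of each vocabulary term in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts[term] for term in vocabulary)
    if total == 0:
        # No vocabulary term occurs in the text: return the zero vector.
        return [0.0] * len(vocabulary)
    return [counts[term] / total for term in vocabulary]


# Toy collection (hypothetical); the vocabulary defines the n dimensions.
docs = ["web document clustering", "symbolic web objects web"]
vocabulary = sorted({w for d in docs for w in re.findall(r"[a-z]+", d.lower())})
vectors = [term_weights(d, vocabulary) for d in docs]
```

Each document becomes one point in a space with one axis per vocabulary term, and word order is discarded.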

9
Symbolic Objects
Real-life objects are too complex to be represented by points in a vectorial space. [Bock & Diday, 2000]
Symbolic objects overcome this limitation by representing concepts rather than individuals.
In a symbolic data array each variable can have one of many data types: sets, intervals, histograms, trees, graphs, functions, fuzzy data, etc.

10
Symbolic Data Table

11
Symbolic Data Table
From relational data bases to symbolic data bases

Multivariate Numeric Analysis (Data: Millions…)
Individual | Age | Profession | Wage     | Location
           |     | Lawyer     | 2,500.00 | San José
           |     | Teacher    | 1,750.00 | Alajuela
           |     | Doctor     | 2,400.00 | San José
           |     | Teacher    | 1,900.00 | Alajuela
           |     | Engineer   | 1,850.00 | Alajuela
           |     | Engineer   | 1,900.00 | Heredia
           |     | Manager    | 1,600.00 | Heredia

Multivariate Symbolic Analysis (Concepts: Hundreds…)
Individual | Age     | Profession         | Wage
San José   | [36,39] | {Law 50%, Doc 50%} | [2,4 – 2,5]
Alajuela   | [28,35] | {Tea 66%, Eng 33%} | [1,75 – 1,9]
Heredia    | [2,34]  | {Eng 50%, Mgn 50%} | [1,6 – 1,9]
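
The step from individuals to concepts can be sketched as a group-by that summarises each numeric variable as an interval and each categorical variable as a frequency distribution. This is only an illustration under those assumptions: it uses the profession, wage and location values from the slide, and the function name `to_symbolic` and the dictionary layout are hypothetical.

```python
from collections import Counter, defaultdict

# (profession, wage, location) rows taken from the slide's numeric table.
individuals = [
    ("Lawyer", 2500.00, "San José"),
    ("Teacher", 1750.00, "Alajuela"),
    ("Doctor", 2400.00, "San José"),
    ("Teacher", 1900.00, "Alajuela"),
    ("Engineer", 1850.00, "Alajuela"),
    ("Engineer", 1900.00, "Heredia"),
    ("Manager", 1600.00, "Heredia"),
]


def to_symbolic(rows):
    """Aggregate individuals into one symbolic object per location."""
    groups = defaultdict(list)
    for profession, wage, location in rows:
        groups[location].append((profession, wage))

    table = {}
    for location, members in groups.items():
        wages = [wage for _, wage in members]
        counts = Counter(profession for profession, _ in members)
        table[location] = {
            # Numeric variable summarised as an interval [min, max].
            "wage": (min(wages), max(wages)),
            # Categorical variable summarised as relative frequencies.
            "profession": {p: c / len(members) for p, c in counts.items()},
        }
    return table


symbolic = to_symbolic(individuals)
```

Seven rows collapse into three concepts, which is the kind of reduction the slide describes on a much larger scale.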

12
Symbolic Data Base
Relational Data Base: 100% knowledge, 15 Gigabyte
Symbolic Data Base: 90% knowledge, 10.3 Megabyte

13
Symbolic Representations
A complex representation that takes into account term frequency, word order and phrases.

14
The K-Means Clustering Method
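
The slide gives only the method's name, so the following is a minimal sketch of the standard K-means iteration on numeric vectors with squared Euclidean distance. It is not the authors' symbolic variant; the function names, the random initialisation and the toy points are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean (center of gravity) of a non-empty list."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, iterations=100, seed=0):
    """Alternate assignment and center update until labels stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct points as initial centers
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = mean(members)
    return labels, centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = k_means(points, k=2)
```

The update step relies on a mean being defined for every variable, which is straightforward for numeric vectors but not for intervals, sets or histograms.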

15
But there are some problems…

16
Distance Measures

17
Theorem: Fisher's Equality
Total inertia = between-class inertia + within-class inertia
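
In the usual numeric setting the equality can be written out as follows. The notation is an assumption, not taken from the slide: $n$ points $x_i$ with equal weights, classes $C_1,\dots,C_K$ of sizes $n_k$, class centers of gravity $g_k$ and overall center of gravity $g$.

```latex
\underbrace{\frac{1}{n}\sum_{i=1}^{n}\lVert x_i - g\rVert^{2}}_{\text{total inertia}}
\;=\;
\underbrace{\sum_{k=1}^{K}\frac{n_k}{n}\,\lVert g_k - g\rVert^{2}}_{\text{between-class inertia}}
\;+\;
\underbrace{\frac{1}{n}\sum_{k=1}^{K}\sum_{i\in C_k}\lVert x_i - g_k\rVert^{2}}_{\text{within-class inertia}}
```

Since the total inertia is fixed by the data, minimising the within-class inertia is equivalent to maximising the between-class inertia.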

18
Problems in the symbolic case:
1. Representing a class by its center of gravity, that is, by its vector of means.
2. What is the center of gravity?

19
What is the center of gravity?

21
Evaluation Criteria
1. Rand Index
2. Mutual Information
3. F-Measure
4. Entropy
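
As an illustration of the first criterion, the Rand index can be computed by counting the pairs of documents on which a clustering and the reference classes agree. This sketch uses the standard unadjusted definition; the slide does not say which variant the authors used, and the label lists are hypothetical.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which two partitions agree.

    A pair agrees when both partitions put its items together,
    or both put them apart.
    """
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


reference = [0, 0, 1, 1]
same_partition = rand_index(reference, [1, 1, 0, 0])  # labels renamed only
worse_partition = rand_index(reference, [0, 0, 0, 1])
```

The index depends only on which items are grouped together, so renaming the cluster labels does not change the score.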

22
Experiments

23
Experiments

24
Experiments

25
Experiments

26
Conclusions
Symbolic representations are richer and more flexible than classical representations.
The text in the HTML document seems to be the most important factor for clustering HTML documents.

27
Thank you!
