I-Know 2005 Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology
Thierry Despeyroux, Yves Lechevallier, Brigitte Trousse, Anne-Marie Vercoustre
Inria, Projet Axis
E_mail:
I-Know 2005 Scientific Activity Report at Inria
I-Know 2005 Homogeneous presentation
I-Know 2005 Some RA (activity report) figures
146 files
text lines
14.8 MB of data
one DTD
Optional sections
Free style and content
I-Know 2005 Grouping by Themes (2003)
I-Know 2005 Grouping by Themes (2004)
I-Know 2005 Problem
Presentation by research themes, which varies over time and is not politically neutral (funding, evaluation)
Is there any natural grouping?
What is the role of the different parts of the report in highlighting the themes?
I-Know 2005 Methodology
1. Select specific parts by using the XML structure
2. Select significant words by using a tool for syntactic typing and stemming (TreeTagger)
3. Cluster the documents into disjoint clusters
4. Evaluate those clusters
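Step 1 can be illustrated with a minimal sketch using Python's standard `xml.etree.ElementTree`. The element names (`raweb`, `presentation`, `foundations`, `keyword`) and the sample content are assumptions made for illustration; the slides do not give the actual DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical report fragment: the real DTD element names are not given in the slides.
SAMPLE = """<raweb>
  <presentation><p>Web usage mining and information systems.</p></presentation>
  <foundations>
    <keyword>data mining</keyword>
    <keyword>clustering</keyword>
    <p>We study clustering of usage data.</p>
  </foundations>
  <results><p>Not selected in this sketch.</p></results>
</raweb>"""


def select_text(root, section_tags):
    """Return the concatenated text found under the given sections."""
    parts = []
    for tag in section_tags:
        for section in root.iter(tag):
            # itertext() walks the section and all of its descendants.
            parts.append(" ".join(t.strip() for t in section.itertext() if t.strip()))
    return " ".join(parts)


def select_keywords(root, section_tag, keyword_tag="keyword"):
    """Return the keyword strings found inside one kind of section."""
    return [kw.text.strip()
            for section in root.iter(section_tag)
            for kw in section.iter(keyword_tag)
            if kw.text]


root = ET.fromstring(SAMPLE)
keywords = select_keywords(root, "foundations")  # keywords of one section type
text = select_text(root, ["presentation"])       # free text of one section type
```

Selecting by tag keeps the choice of report parts separate from the later word selection, so the same clustering code can be run on each selection.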
I-Know 2005 Various experiments
K-F: keywords from the foundations sections
K-all: all keywords
T-P: text in the presentation section
T-PF: text in the presentation and foundations sections
T-C: names of conferences, workshops, congresses, etc. in the bibliography
I-Know 2005 Clustering method
The objective of the third step is to cluster the documents into a set of disjoint classes, using the vocabularies selected for the five experiments. We use a partitioning method close to the k-means algorithm, where the distance between documents is based on word frequencies.
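The slides do not specify the exact variant or distance, so the following is only a sketch of plain k-means on relative word-frequency vectors with squared Euclidean distance. The toy documents, the choice of distance, and the explicit `init` parameter are assumptions, not the authors' implementation.

```python
from collections import Counter


def tf_vector(words, vocabulary):
    """Relative word frequencies of one document over a fixed vocabulary."""
    counts = Counter(words)
    total = sum(counts[w] for w in vocabulary) or 1
    return [counts[w] / total for w in vocabulary]


def sq_dist(a, b):
    """Squared Euclidean distance between two frequency vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, init, max_iter=100):
    """Plain k-means; `init` lists the indices of the documents used as initial centres."""
    centres = [list(vectors[i]) for i in init]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its nearest centre.
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centres[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # the partition is stable
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy documents (hypothetical, already reduced to significant words).
docs = [
    ["image", "vision", "image", "modeling"],
    ["vision", "image", "processing"],
    ["network", "traffic", "processor"],
    ["network", "code", "traffic", "traffic"],
]
vocabulary = sorted({w for d in docs for w in d})
vectors = [tf_vector(d, vocabulary) for d in docs]
labels = kmeans(vectors, k=2, init=[0, 2])
```

Every document receives exactly one label, which gives the disjoint classes required by the third step.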
I-Know 2005 K-F-a experiment: list of representative keywords
Class 1: 3d approximation, computer, differential, environment, modeling, processing, programming, vision
Class 2: computing, equation, grid, problem, transformation
Class 3: code, design, event, network, processor, time, traffic
Class 4: calculus, database, datum, image, indexing, information, integration, knowledge, logic, mining, pattern, recognition, user, web
Each cluster can be associated with the list of its most representative words. Those words can be read as a summary of the class.
I-Know 2005 Distribution of the clusters across the 2003 themes
I-Know 2005 Distribution of the 2003 themes across the clusters
I-Know 2005 Partition of the projects
Columns: Themes, Cluster_1, Cluster_2, Cluster_3, Cluster_4
3a: Eiffel HELIX LeD METISS MAIA Merlin Cordial Parole Symbiose ECOO ACACIA ATLAS AXIS Cordial Gemo in-situ MOSTRARE Orion PRIMA Smis WAM TEXMEX Cortex Orpailleur WAM I3D Atoll EXMO DREAM SIGNES
3b: Air2 Ariana IPARLA ALCOVE EVASION TEXMEX TEMICS Epidaure ISA LEAR Mirages Odysee Imedia e-motion PRIMA REVES siames VISTA artis
4a: BIPOP COMORE Miaou corida IS2 CONGE Fractales NUMOPT Metalau Sydoco Scilab macsi Imara Icare Sigma2
4b: ALADIN Bang Estime IDOPT Macs Mathfi Micmac OMEGA Opale Caiman Calvi Smash ScAIApplix sagep
I-Know 2005 External evaluation
The quality of the clusters can be evaluated by comparing the resulting clusters with the two lists of themes used by INRIA.
$n_{ik}$ is the number of research projects whose report is assigned to cluster $U_i$ and which are allocated to group $C_k$ (theme k).
$n_{i.}$ is the number of research reports in cluster $U_i$.
$n_{.k}$ is the number of research projects allocated to group $C_k$.
$n$ is the total number of research projects analysed.
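These counts form a contingency table between the two labelings. A minimal sketch follows; the cluster and theme labels are hypothetical and are not data from the study.

```python
from collections import Counter


def contingency(clusters, themes):
    """Return (n_ik, n_i., n_.k, n) for two labelings of the same projects."""
    n_ik = Counter(zip(clusters, themes))  # joint counts per (cluster, theme) pair
    n_i = Counter(clusters)                # row totals: size of each cluster
    n_k = Counter(themes)                  # column totals: size of each theme
    return n_ik, n_i, n_k, len(clusters)


# Hypothetical labels for six projects (not data from the study).
clusters = [1, 1, 2, 2, 2, 3]
themes = ["3a", "3a", "3b", "3b", "4a", "4a"]
n_ik, n_i, n_k, n = contingency(clusters, themes)
```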
I-Know 2005 Two evaluation measures
The F-measure proposed by (Jardine and Rijsbergen, 1963) combines the precision and recall between $U_i$ and $C_k$.
Recall is defined by $R(i,k) = n_{ik} / n_{i.}$
Precision is defined by $P(i,k) = n_{ik} / n_{.k}$
The F-measure between the a priori partition U into K groups and the partition C of INRIA projects produced by the clustering method is computed from these recall and precision values.
The corrected Rand index (CR) proposed by (Hubert and Arabie, 1985) compares two partitions.
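Both measures can be computed from the contingency counts. The sketch below uses the recall and precision defined on the slide. The way per-pair values are aggregated into one F-measure (a weighted maximum of harmonic means) is an assumption, since the formula is not preserved in the transcript. The corrected Rand index follows the standard Hubert and Arabie form. The function names are illustrative.

```python
from collections import Counter
from math import comb


def f_measure(u, c):
    """Assumed aggregation: sum_i (n_i./n) * max_k 2*R*P/(R+P)."""
    n = len(u)
    n_ik, n_i, n_k = Counter(zip(u, c)), Counter(u), Counter(c)
    total = 0.0
    for i in n_i:
        best = 0.0
        for k in n_k:
            nik = n_ik[(i, k)]
            if nik:
                r = nik / n_i[i]  # R(i,k) as defined on the slide
                p = nik / n_k[k]  # P(i,k) as defined on the slide
                best = max(best, 2 * r * p / (r + p))
        total += n_i[i] / n * best
    return total


def corrected_rand(u, c):
    """Corrected (adjusted) Rand index; assumes at least two objects."""
    n = len(u)
    n_ik, n_i, n_k = Counter(zip(u, c)), Counter(u), Counter(c)
    sum_ik = sum(comb(v, 2) for v in n_ik.values())
    sum_i = sum(comb(v, 2) for v in n_i.values())
    sum_k = sum(comb(v, 2) for v in n_k.values())
    expected = sum_i * sum_k / comb(n, 2)  # agreement expected by chance
    maximum = (sum_i + sum_k) / 2          # largest attainable agreement
    if maximum == expected:
        return 1.0  # degenerate case, treated as perfect agreement by convention
    return (sum_ik - expected) / (maximum - expected)


# Hypothetical labelings of four projects.
same_f = f_measure([0, 0, 1, 1], ["a", "a", "b", "b"])
same_cr = corrected_rand([0, 0, 1, 1], ["a", "a", "b", "b"])
cross_f = f_measure([0, 0, 1, 1], ["a", "b", "a", "b"])
cross_cr = corrected_rand([0, 0, 1, 1], ["a", "b", "a", "b"])
```

Identical partitions score 1 on both measures. The corrected Rand index is near 0 for chance-level agreement and can be negative when agreement is worse than chance, which makes it a useful complement to the F-measure.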
I-Know 2005 Conclusion
Combination of selection by structure and by linguistic terms
Evaluation of the clustering against an existing typology
The quality of the clustering strongly depends on the parts selected in the activity reports (which in turn indicates where the reports could be improved)
Future work:
Measuring the stability of the clusters when K varies
Evolution of the classes over time
Experiments with other collections