Challenges with XML Challenges with Semi-Structured collections Ludovic Denoyer University of Paris 6 Bridging the gap between research communities.


1 Challenges with XML Challenges with Semi-Structured collections Ludovic Denoyer University of Paris 6 Bridging the gap between research communities

2 Outline Motivations XML Mining Challenge Graph Labelling/WebSpam Challenge Conclusion and future work

3 General Idea The two challenges were proposed to attract researchers from different domains: ◦ Mainly Machine Learning and Information Retrieval Show IR researchers that ML methods are able to solve some of their problems Show ML researchers that IR tasks provide an interesting context for developing new general Machine Learning algorithms

4 General Idea Find generic tasks that correspond to: ◦ New real-world IR applications ◦ New generic ML problems To work together… To pool efforts… To solve these tasks faster… To compare the approaches…

5 Open questions in ML Structure+content classification Classification of interdependent variables Structured output classification

6 Open questions in IR Structure+content classification Classification of interdependent variables Structured output classification Semi-structured documents (XML) Interconnected documents Heterogeneous collections

7 Motivations Structured input classification Classification of interdependent variables Structured output classification Semi-structured documents (XML) Hyperlinked documents Heterogeneous collections XML Mining Challenge

8 Motivations Structured input classification Classification of interdependent variables Structured output classification Semi-structured documents (XML) Hyperlinked documents Heterogeneous collections WebSpam Challenge XML Mining Challenge

9 Motivations Information Retrieval Machine Learning Data Mining Web Proposed Challenges

10 Challenges XML Mining Challenge ◦ « Bridging the gap between Machine Learning and Information Retrieval » Graph Labelling Challenge ◦ Application to WebSpam detection

11 Outline Motivations XML Mining Challenge WebSpam Challenge Conclusion and future work

12 XML Mining Challenge Launched in 2005 ◦ PASCAL (Network of Excellence in ML) ◦ DELOS (Network of Excellence in Digital Libraries) Organized as an INEX track ◦ INEX: Initiative for the Evaluation of XML IR More than 50 different institutes involved One event each year at INEX (December) Biggest INEX track (after ad hoc retrieval) We are currently launching the 4th XML Mining track

13 XML Mining Challenge ML Goal ◦ Classification of large collections of structures IR Goal ◦ Classification of semi-structured collections ▪ Using both structure and content

14 Underlying idea Using both structure and content information
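To make "using both structure and content" concrete, here is a minimal sketch of one possible feature extraction for an XML document. It is only an illustration under assumed choices: the function name, the feature naming scheme (`path=`, `word=`, `tag:word`) and the sample document are hypothetical and are not taken from any track participant's system.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structure_content_features(xml_text):
    """Return a bag of features mixing structure (tag paths) and content (words)."""
    root = ET.fromstring(xml_text)
    features = Counter()

    def walk(node, path):
        current = f"{path}/{node.tag}"
        # Structure feature: the root-to-node tag path.
        features[f"path={current}"] += 1
        # Content features: words of the node's own text, alone and
        # paired with the enclosing tag. Tail text is ignored for brevity.
        for word in (node.text or "").lower().split():
            features[f"word={word}"] += 1
            features[f"{node.tag}:{word}"] += 1
        for child in node:
            walk(child, current)

    walk(root, "")
    return features


# Hypothetical document, for illustration only.
doc = "<article><title>XML mining</title><body><p>mining structured data</p></body></article>"
features = structure_content_features(doc)
```

The resulting counts could then be fed to any vector-based classifier; a content-only baseline would keep just the `word=` features.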

15 Collections Different collections have been used: ◦ 2005 ▪ Artificial collection ▪ Movie collection ◦ 2006 ▪ Scientific articles ▪ Wikipedia XML-based collection ◦ 2007 ▪ Wikipedia XML-based collection ▪ 96,000 documents in XML ▪ 21 categories

16 Submitted papers

17 Large variety of models Different existing ML methods have been applied: ◦ Self-Organizing Map ◦ SVM ◦ (Graph) Neural Network ◦ CRF ◦ Incremental Models ◦ … Some new models have been developed

18 Short Typology See Report on the XML Mining track – SIGIR Forum

19 Results - 2007 Classification
Authors                   Method                                 Micro recall   Macro recall
Zhang et al.              Kernel + SVM                           0.87           0.83
L. M. de Campos et al.    Graphical Models – Bayesian networks   0.78           0.76
Meenakshi et al.          Negative Category Document Frequency   0.78           0.75
…
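The slide does not define micro and macro recall. The sketch below assumes the usual definitions for single-label classification: micro recall pools all documents (so it equals the fraction of correctly classified documents), while macro recall averages the per-category recall, giving small categories the same weight as large ones. The labels in the example are made up.

```python
from collections import Counter


def micro_macro_recall(true_labels, predicted_labels):
    """Micro and macro recall, assuming exactly one label per document."""
    support = Counter(true_labels)  # number of documents per true category
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    micro = sum(hits.values()) / len(true_labels)
    macro = sum(hits[c] / support[c] for c in support) / len(support)
    return micro, macro


# Made-up labels: category "a" is large and well classified, "b" is small.
true_labels = ["a", "a", "a", "a", "b", "b"]
predicted = ["a", "a", "a", "a", "b", "a"]
micro, macro = micro_macro_recall(true_labels, predicted)
```

In this toy case the macro value is lower than the micro value because the single error falls in the small category.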

20 XML Structure Mapping task Proposed in 2006 ML task: Structured output classification ◦ Learning to transform trees IR application: Dealing with heterogeneous collections ◦ Learning to transform heterogeneous documents to a mediated schema
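As an illustration of what "transforming heterogeneous documents to a mediated schema" means, the sketch below maps two differently tagged documents onto one target schema. Everything in it is hypothetical: the source names, the tag names and the correspondence tables are invented, and the tables are written by hand, whereas the task is precisely to learn such a transformation. It also covers tag renaming only, not changes to the tree shape.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag correspondences, one table per source format.
# They are hand-written here; in the task they would have to be learned.
TAG_MAPS = {
    "source_a": {"doc": "article", "head": "title", "txt": "body"},
    "source_b": {"paper": "article", "name": "title", "content": "body"},
}


def to_mediated_schema(xml_text, source):
    """Rename every tag of a document according to its source's table."""
    root = ET.fromstring(xml_text)
    mapping = TAG_MAPS[source]
    for node in root.iter():
        # Unknown tags are kept unchanged.
        node.tag = mapping.get(node.tag, node.tag)
    return ET.tostring(root, encoding="unicode")


a = to_mediated_schema("<doc><head>T</head><txt>B</txt></doc>", "source_a")
b = to_mediated_schema("<paper><name>T</name><content>B</content></paper>", "source_b")
```

Both inputs end up with the same `article` / `title` / `body` structure, which is what lets a single retrieval system handle the whole collection.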

21 XML Structure Mapping A generic ML model able to solve this task has a lot of potential applications: ◦ Conversion between file formats ◦ Automatic translation ◦ Natural Language processing ◦ …

22 Conclusion Existing structured input models (kernel,…) have been tested on this task New specific models have been developed Difficult to know which model is the best ◦ Need to wait one more year The challenge has attracted researchers from different communities ◦ Each year, ML researchers come to INEX and: ▪ Discover a new domain ▪ Present advanced ML models to other researchers The collections are freely available and have been downloaded a hundred times ◦ …some articles are starting to appear in different conferences…

23 WebSpam Challenge PASCAL « Graph Labelling Challenge » Organized by: ◦ Ricardo BAEZA-YATES (Yahoo! Research Barcelona) ◦ Carlos CASTILLO (Yahoo! Research Barcelona) ◦ Brian DAVISON (Lehigh University, USA) ◦ Ludovic DENOYER (University Paris 6, France) ◦ Patrick GALLINARI (University Paris 6, France) The Web Spam Challenge 2007 was supported by PASCAL The Web Spam Challenge 2007 was also supported by the DELIS EU-FET research project

24 WebSpam Challenge Three Events: ◦ AirWeb workshop 2007 (WWW’07) ▪ May 2007 ▪ Web-oriented part ◦ GraphLab workshop 2007 – PKDD/ECML ▪ September 2007 ▪ ML-oriented part ◦ AirWeb workshop 2008 (WWW’08 ?)

25 WebSpam Challenge IR (Web) task: ◦ Detection of web spam ▪ Spam = any attempt to get “an unjustifiably favorable relevance or importance score for some Web pages, considering the page’s true value”

26 Example of spam

27 WebSpam Challenge ML task: ◦ Graph labelling ◦ Classification of interdependent variables
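To illustrate why the labels are "interdependent", here is a minimal sketch of one simple graph labelling scheme: neighbour-averaging label propagation. It is a generic textbook-style baseline chosen for illustration, not a method reported by the challenge participants. The graph, the node names and the choice to treat links as undirected are all assumptions.

```python
def propagate_labels(neighbours, known, iterations=50):
    """Spread known labels over a graph by repeated neighbour averaging.

    neighbours: node -> list of adjacent nodes (links treated as undirected).
    known: node -> 1.0 (spam) or 0.0 (normal) for the labelled nodes.
    Returns a spam score in [0, 1] for every node.
    """
    # Unlabelled nodes start from an uninformative score of 0.5.
    scores = {node: known.get(node, 0.5) for node in neighbours}
    for _ in range(iterations):
        updated = {}
        for node, adjacent in neighbours.items():
            if node in known or not adjacent:
                # Labelled nodes are clamped; isolated nodes cannot change.
                updated[node] = scores[node]
            else:
                updated[node] = sum(scores[n] for n in adjacent) / len(adjacent)
        scores = updated
    return scores


# Hypothetical chain of four hosts: spam - unknown - unknown - normal.
graph = {"s1": ["u1"], "u1": ["s1", "u2"], "u2": ["u1", "n1"], "n1": ["u2"]}
scores = propagate_labels(graph, {"s1": 1.0, "n1": 0.0})
```

The unlabelled host next to the spam host ends up with a higher score than the one next to the normal host: the prediction for each node depends on the labels of the nodes around it.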

28 Collection A collection of interconnected Web pages ◦ 77 million pages ◦ About 11,000 hosts ◦ Manually labeled as spam or normal (host level) Blind evaluation of models

29 Participants

30 Participants Why such an increase in ML participants at GraphLab?

31 GraphLab workshop at ECML/PKDD 2007 Collection has been fully preprocessed by the organizers ▪ Each node corresponds to a vector (in SVMLight format) based on the word distribution in each host/page ▪ The contingency matrix has been built One small collection with 9,000 nodes One large collection with 400,000 nodes 10% for training / 20% for validation / 70% for test You can easily apply your « relational » models to this corpus without knowing anything about text processing
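For readers unfamiliar with it, the SVMLight format stores one example per line as `<label> <index>:<value> ...`, optionally followed by a `#` comment. The sketch below parses such a line and draws a 10% / 20% / 70% split. It is only an assumption-laden illustration: the sample line is made up, integer class labels are assumed, and the split is drawn at random here, whereas the organizers presumably distributed a fixed one.

```python
import random


def parse_svmlight_line(line):
    """Parse '<label> <index>:<value> ... # comment' into (label, features)."""
    line = line.split("#", 1)[0].strip()  # drop the optional trailing comment
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return int(label), features


def split_10_20_70(items, seed=0):
    """Shuffle and cut into train (10%), validation (20%) and test (rest)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) // 10
    n_valid = len(shuffled) * 2 // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


# Made-up example line and node identifiers.
label, features = parse_svmlight_line("+1 3:0.5 17:1.25 # host 42")
train, valid, test = split_10_20_70(range(100))
```

Only the non-zero features are listed on each line, which keeps word-distribution vectors compact.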

32 Results Small collection (9,000 nodes)
Participants        Methods                    AUC
Abernethy et al.    Semi-supervised learning   95.2
Tang et al.         SVM                        95.1
Filoche et al.      Stacked Learning           92.7
Csalogany et al.    C4.5                       87.7
Tian et al.         Semi-supervised            86.3
…

33 Results Large collection (400,000 nodes)
Participants      Methods                    AUC
Weiss et al.      Semi-supervised learning   99.8
Filoche et al.    Stacked Learning           99.1
Tang et al.       SVM                        98.9
…
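The results above are reported as AUC, shown as percentages. As a reminder of what the measure captures, here is a minimal sketch using the standard pairwise definition: the AUC is the fraction of (spam, normal) pairs in which the spam node receives the higher score, with ties counted as one half. The labels and scores are made up, and the quadratic loop is chosen for readability; a rank-based computation would be preferable on a collection of 400,000 nodes.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (label 1 = spam)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y != 1]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # spam node correctly ranked above normal node
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and classifier scores.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
value = auc(labels, scores)
```

Because it depends only on the ranking of the scores, the AUC does not require choosing a spam/normal decision threshold.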

34 Conclusion on WebSpam Different pure ML methods used « as is » ◦ Semi-supervised methods ◦ Stacked Learning ◦ … Very good performance of ML models (equivalent to Web « hand-made » models)

35 Conclusion on WebSpam Development of an ML benchmark for graph labelling WebSpam also proposes interesting ML challenges that could be integrated in the challenge ◦ Learning with a few examples ◦ Large-scale problems ◦ Adversarial Machine Learning ◦ …

36 Conclusion The two challenges have proposed benchmarks for IR/Web applications and also for generic ML problems It is possible to mix researchers from different communities ML researchers dislike cleaning real collections ◦ you have to preprocess the collections ML researchers dislike large collections ◦ but this is changing…

37 Future work XML Mining will continue this year ◦ See http://xmlmining.lip6.fr ◦ The corpus will be preprocessed ? WebSpam challenge will also continue ◦ See http://webspam.lip6.fr ◦ We will see after WWW’08 whether we propose another GraphLab workshop (see http://graphlab.lip6.fr) ◦ Note that a new, larger corpus has been developed in 2008

38 Thank you for your attention (Thank you to the participants of the different challenges that are in the room)

