XML Retrieval: from modelling to evaluation Mounia Lalmas Queen Mary University of London qmir.dcs.qmul.ac.uk.


1 XML Retrieval: from modelling to evaluation Mounia Lalmas Queen Mary University of London qmir.dcs.qmul.ac.uk

2 Outline: Structured document retrieval; XML; Content-oriented XML retrieval; Evaluation

3 Outline: Structured document retrieval; XML; Content-oriented XML retrieval; Evaluation

4 Structured Document Retrieval. Traditional IR is about finding documents relevant to a user's information need, e.g. an entire book. SDR allows users to retrieve document components that are more focussed on their information needs, e.g. a chapter of a book instead of the entire book. The structure of documents is exploited to identify which document components to retrieve.

5 Structured Documents. Linear order of words, sentences, paragraphs …; hierarchy or logical structure of a book's chapters, sections …; links (hyperlinks), cross-references, citations …; temporal and spatial relationships in multimedia documents. [Figure: a book hierarchy (Book, Chapters, Sections, Paragraphs) and the World Wide Web]

6 Structured Documents. Explicit structure formalised through document representation standards (mark-up languages). Layout: LaTeX (publishing), HTML (Web publishing). Structure: SGML, XML (Web publishing, engineering), MPEG-7 (broadcasting). Content/semantic: RDF, DAML+OIL, OWL (semantic web). [Figure: World Wide Web, a sample structured document, SDR]

7 Outline: Structured document retrieval; XML; Content-oriented XML retrieval; Evaluation

8 XML: eXtensible Mark-up Language. Meta-language (user-defined tags) being adopted as the document format language by W3C. Used to describe content and structure (and not layout). Grammar described in a DTD (used for validation). [Example XML document, mark-up lost in extraction: Structured Document Retrieval; Smith John; Introduction into XML retrieval …. … …]

9 XML: eXtensible Mark-up Language. Use of XPath notation to refer to the XML structure. chapter/title: title is a direct sub-component of chapter. //title: any title. chapter//title: title is a direct or indirect sub-component of chapter. chapter/paragraph[2]: the second direct paragraph of any chapter. chapter/*: all direct sub-components of a chapter. [Example XML document, mark-up lost in extraction: Structured Document Retrieval; Smith John; Introduction into SDR …. …]
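
The XPath patterns on this slide can be tried out with a minimal sketch like the one below. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The sample document is hypothetical: the tag names (`book`, `chapter`, `section`, `paragraph`) and all text are assumptions made for illustration, not the slide's original example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; tag names and text are illustrative assumptions.
DOC = """
<book>
  <title>Structured Document Retrieval</title>
  <chapter>
    <title>Introduction</title>
    <paragraph>First paragraph.</paragraph>
    <paragraph>Second paragraph.</paragraph>
    <section>
      <title>Background</title>
      <paragraph>Nested paragraph.</paragraph>
    </section>
  </chapter>
</book>
"""

book = ET.fromstring(DOC)


def texts(path):
    """Return the text of every element matched by `path`, relative to <book>."""
    return [el.text for el in book.findall(path)]


direct_titles = texts("chapter/title")        # title as a direct child of chapter
all_titles = texts(".//title")                # any title below <book>
nested_titles = texts("chapter//title")       # direct or indirect sub-component
second_paras = texts("chapter/paragraph[2]")  # second paragraph child of each chapter
chapter_children = [el.tag for el in book.findall("chapter/*")]  # all direct children
```

ElementTree evaluates paths relative to an element, so the slide's `//title` is written here as `.//title`.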

10 Querying XML documents. Content-only (CO) queries: 'open standards for digital video in distance learning'. Content-and-structure (CAS) queries: //article[about(., 'formal methods verify correctness aviation systems')]/body//section[about(., 'case study application model checking theorem proving')]. Structure-only (SA) queries: /article//*section/paragraph[2]

11 Passage retrieval. Fixed-length (e.g. 300-word windows, overlapping). Discourse (e.g. sentence, paragraph): according to logical structure, but fixed. Semantic (e.g. TextTiling). Retrieval: e.g. rank a document based on its highest-ranking passage, or on the sum of ranking scores for all its passages. Deals principally with CO queries. [Figure: a document (doc) split into passages p1 to p6]
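
A minimal sketch of the fixed-length variant described on this slide: overlapping windows are scored individually, and the document score is either the best passage score or the sum over passages. The scoring function is a deliberately crude term count chosen for illustration; the function names and default parameters are assumptions, not part of the original talk.

```python
def windows(tokens, size=300, step=150):
    """Split a token list into fixed-length, overlapping windows.

    The last window may be shorter than `size`; every token is covered.
    """
    last_start = max(len(tokens) - size, 0)
    return [tokens[i:i + size] for i in range(0, last_start + step, step)]


def passage_score(passage, query_terms):
    """Toy score (assumption): number of passage tokens that are query terms."""
    return sum(1 for tok in passage if tok in query_terms)


def document_score(text, query, size=300, step=150, combine=max):
    """Score a document from its passages; `combine` is `max` or `sum`."""
    tokens = text.lower().split()
    query_terms = set(query.lower().split())
    scores = [passage_score(p, query_terms) for p in windows(tokens, size, step)]
    return combine(scores)


doc = "xml retrieval ranks xml elements while passage retrieval ranks text windows"
best_passage = document_score(doc, "xml retrieval", size=4, step=2, combine=max)
all_passages = document_score(doc, "xml retrieval", size=4, step=2, combine=sum)
```

Taking the maximum favours documents with one strongly matching passage, while the sum favours documents that match throughout (and, with overlapping windows, counts some tokens more than once).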

12 Database approaches to XML retrieval. Relational, OO, native. Flexibility, expressiveness, complexity; efficiency. Data-oriented retrieval: containment and not aboutness; no relevance-based ranking. Aims/challenges tend to focus on efficiency and performance. XQuery.

13 Outline: Structured document retrieval; XML; Content-oriented XML retrieval (a definition, challenges, approaches); Evaluation

14 Content-oriented XML retrieval. Return document components of varying granularity (e.g. a book, a chapter, a section, a paragraph, a table, a figure, etc.), relevant to the user's information need with regard to both content and structure.

15 Content-oriented XML retrieval. Retrieve the best components according to content and structure criteria. INEX: the most specific component that satisfies the query, while being exhaustive to the query. Shakespeare study: best entry points, which are components from which many relevant components can be reached through browsing. ???

16 Challenge 1: term weights. No fixed retrieval unit + nested document components: how to obtain document and collection statistics (e.g. tf, idf)? Which aggregation formalism to use? [Figure: an Article with Title, Section 1 and Section 2; term weights 0.9 XML, 0.5 XML, 0.2 XML, 0.4 retrieval, 0.7 authoring; unknown article-level weights ?XML, ?retrieval, ?authoring]

17 Challenge 2: augmentation weights. Nested document components: which components contribute best to the content of Article? How to estimate augmentation weights (e.g. size, number of children)? How to aggregate term and augmentation weights? [Figure: an Article with Title, Section 1 and Section 2; term weights 0.9 XML, 0.5 XML, 0.2 XML, 0.4 retrieval, 0.7 authoring; unknown article-level weights ?XML, ?retrieval, ?authoring]
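
The slide leaves the aggregation question open. As one possible answer, the sketch below propagates each child's term weight to its parent, scaled by a single augmentation weight, and combines the evidence with a probabilistic OR under an independence assumption. Everything here is assumed for illustration: the class and function names, the augmentation weight of 0.5, and the assignment of the figure's weights to Title, Section 1 and Section 2.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    weights: dict = field(default_factory=dict)    # term -> weight from the component's own text
    children: list = field(default_factory=list)   # nested Component objects


def aggregated_weight(comp, term, aug=0.5):
    """Weight of `term` in `comp`, including evidence propagated from children.

    Each child's aggregated weight is scaled by `aug` and combined with the
    component's own weight by a probabilistic OR (independence assumed).
    """
    miss = 1.0 - comp.weights.get(term, 0.0)
    for child in comp.children:
        miss *= 1.0 - aug * aggregated_weight(child, term, aug)
    return 1.0 - miss


# Assumed reading of the figure: which weight belongs to which component is a guess.
article = Component("article", children=[
    Component("title", {"XML": 0.9}),
    Component("section 1", {"XML": 0.5}),
    Component("section 2", {"XML": 0.2, "retrieval": 0.4, "authoring": 0.7}),
])

xml_weight = aggregated_weight(article, "XML")
retrieval_weight = aggregated_weight(article, "retrieval")
authoring_weight = aggregated_weight(article, "authoring")
```

A single global `aug` is the simplest choice; the slide's suggestions (size, number of children) would make it a per-component value instead.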

18 Challenge 3: component weights. Different types of document components: which component is a good retrieval unit? How to estimate component weights (frequency, user studies)? How to aggregate term, augmentation and component weights? [Figure: an Article with Title, Section 1 and Section 2; term weights 0.9 XML, 0.5 XML, 0.2 XML, 0.4 retrieval, 0.7 authoring; unknown article-level weights ?XML, ?retrieval, ?authoring]

19 Approaches … vector space model, probabilistic model, Bayesian network, language model, extending DB model, Boolean model, natural language processing, cognitive model, ontology, parameter estimation, tuning, smoothing, fusion, phrase, term statistics, collection statistics, component statistics, proximity search, logistic regression, belief model, relevance feedback

20 Vector space model. Separate indexes: article index, abstract index, section index, sub-section index, paragraph index. Each index produces an RSV, which is normalised; the normalised RSVs are then merged. tf and idf as for fixed and non-nested retrieval units (IBM Haifa, INEX 2003)
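
A minimal sketch of the "one index per component type, normalise, then merge" idea. Each index computes tf and idf over its own elements only, so statistics stay those of a flat, non-nested collection. The slide does not say how the RSVs are normalised; dividing by the best score in each index is an assumption made here, as are the function names, the simple tf-idf formula and the sample texts.

```python
import math
from collections import Counter


def build_index(elements):
    """elements: {element_id: text}. Returns per-element tf and index-wide idf."""
    tf = {eid: Counter(text.lower().split()) for eid, text in elements.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    idf = {term: math.log(1 + len(elements) / d) for term, d in df.items()}
    return tf, idf


def search(query, index):
    """Length-normalised tf-idf score (RSV) for each matching element of one index."""
    tf, idf = index
    terms = query.lower().split()
    scores = {}
    for eid, counts in tf.items():
        length = math.sqrt(sum(c * c for c in counts.values())) or 1.0
        score = sum(counts[t] * idf[t] for t in terms if t in counts) / length
        if score > 0:
            scores[eid] = score
    return scores


def merged_ranking(query, indexes):
    """Normalise RSVs per index (assumed: divide by the index maximum), then merge."""
    merged = {}
    for index in indexes.values():
        scores = search(query, index)
        top = max(scores.values(), default=0.0)
        for eid, score in scores.items():
            merged[eid] = score / top
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


indexes = {
    "article": build_index({"article[1]": "xml retrieval evaluation and xml authoring tools"}),
    "section": build_index({
        "article[1]/sec[1]": "xml retrieval",
        "article[1]/sec[2]": "authoring tools",
        "article[1]/sec[3]": "xml authoring",
    }),
}
ranking = merged_ranking("xml retrieval", indexes)
```

Without the normalisation step, scores from indexes with different idf distributions would not be comparable in the merged list.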

21 Language model. Labels of the ranking formula (the formula itself was lost in extraction): element language model, collection language model, smoothing parameter λ, element score, element size, article score, rank element. Query expansion with blind feedback. Ignore elements with ≤ 20 terms. A high value of λ leads to an increase in the size of retrieved elements. Results with λ = 0.9, 0.5 and 0.2 are similar. (University of Amsterdam, INEX 2003)
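
The formula on this slide did not survive extraction, so the following is only a sketch of a standard smoothed query-likelihood model built from the labels that remain: an element language model mixed with a collection language model through a smoothing parameter. It is not the University of Amsterdam system. The element-size and article-score components and the blind feedback step are omitted, and the names `lam` and `min_terms`, as well as the convention that `lam` weights the element model, are assumptions.

```python
import math
from collections import Counter


def element_score(query, element_text, collection_counts, collection_size, lam=0.5):
    """Log query likelihood under a mixture of element and collection models.

    P(t | e) = lam * tf(t, e) / |e| + (1 - lam) * cf(t) / |C|
    """
    counts = Counter(element_text.lower().split())
    size = sum(counts.values())
    score = 0.0
    for term in query.lower().split():
        p_element = counts[term] / size if size else 0.0
        p_collection = collection_counts[term] / collection_size
        p = lam * p_element + (1 - lam) * p_collection
        if p == 0.0:
            return float("-inf")   # term unseen in the whole collection
        score += math.log(p)
    return score


def rank(query, elements, lam=0.5, min_terms=20):
    """Rank elements with more than `min_terms` terms by smoothed query likelihood."""
    collection = Counter(t for text in elements.values() for t in text.lower().split())
    total = sum(collection.values())
    scored = [
        (eid, element_score(query, text, collection, total, lam))
        for eid, text in elements.items()
        if len(text.split()) > min_terms
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


elements = {
    "sec[1]": "xml retrieval with language models",
    "sec[2]": "authoring tools for xml",
    "p[1]": "xml",
}
ranking = rank("xml retrieval", elements, lam=0.5, min_terms=1)
```

The toy collection counts every element once; with nested elements, text would be counted several times, which is the collection-statistics problem raised under Challenge 1.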

22 Outline: Structured document retrieval; XML; Content-oriented XML retrieval; Evaluation

23 Evaluation of XML retrieval: INEX. Evaluating the effectiveness of content-oriented XML retrieval approaches. Collaborative effort: participants contribute to the development of the collection, queries and relevance assessments. Similar methodology as for TREC, but adapted to XML retrieval. 40+ participants worldwide. Workshop in Schloss Dagstuhl in December (20+ institutions).

24 INEX Test Collection. Documents (~500MB), which consist of 12,107 articles in XML format from the IEEE Computer Society; 8 million elements. INEX 2002: CO and 30 CAS queries; CO and CAS ad hoc retrieval tasks; inex_eval metric. INEX 2003: CO and 30 CAS queries; CO, SCAS and VCAS ad hoc retrieval tasks; CAS queries are defined according to an enhanced subset of XPath; inex_eval and inex_eval_ng metrics. INEX 2004 is just starting.

25 Relevance in INEX. Exhaustivity: how exhaustively a document component discusses the query (0, 1, 2, 3). Specificity: how focused the component is on the query (0, 1, 2, 3). Relevance: (3,3), (2,3), (1,1), (0,0), … Use of an online assessment tool to ensure exhaustive and consistent assessments (assessing a query takes a week!). Examples for an article and its sections: all sections relevant, article very relevant; all sections relevant, article better than sections; one section relevant, article less relevant; one section relevant, section better than article; …

26 Metrics. Recall/precision-based: quantisation functions to obtain one relevance value; expected search length; penalise overlap; consider size. Others: expected ratio of relevant; cumulated gain-based metrics; tolerance to irrelevance.
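
A quantisation function maps an (exhaustivity, specificity) assessment to a single relevance value, which a recall/precision-style metric can then consume. The sketch below shows a strict function and a graded one, used in a simplified precision-at-k. This is not the inex_eval metric: the graded table is an illustrative assumption whose values should be checked against the official INEX definitions, and the function names and sample assessments are invented for the example.

```python
def strict(exhaustivity, specificity):
    """Only highly exhaustive and highly specific components count as relevant."""
    return 1.0 if (exhaustivity, specificity) == (3, 3) else 0.0


# Illustrative graded table (assumption; verify against the official INEX definition).
GENERALISED = {
    (3, 3): 1.0,
    (2, 3): 0.75, (3, 2): 0.75, (3, 1): 0.75,
    (1, 3): 0.5, (2, 2): 0.5, (2, 1): 0.5,
    (1, 2): 0.25, (1, 1): 0.25,
}


def generalised(exhaustivity, specificity):
    """Graded relevance: partial credit for less exhaustive or less specific components."""
    return GENERALISED.get((exhaustivity, specificity), 0.0)


def precision_at(ranking, assessments, quantise, k):
    """Mean quantised relevance of the top-k components; unassessed ones count as (0, 0)."""
    top = ranking[:k]
    return sum(quantise(*assessments.get(comp, (0, 0))) for comp in top) / k


# Invented assessments: component id -> (exhaustivity, specificity).
assessments = {"article[1]": (3, 1), "sec[1]": (3, 3), "sec[2]": (1, 1)}
ranking = ["sec[1]", "article[1]", "sec[3]", "sec[2]"]

strict_p2 = precision_at(ranking, assessments, strict, k=2)
generalised_p4 = precision_at(ranking, assessments, generalised, k=4)
```

Note that this sketch rewards both `article[1]` and its descendant `sec[1]`; penalising such overlap is one of the open issues the slide lists.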

27 Lessons learnt. Good definition of relevance. Expressing CAS queries was not easy. Relevance assessment process must be improved. Further development on metrics needed. User studies required.

28 INEX Tracks: Relevance feedback; Interactive; Heterogeneous collection; Natural language query

29 Thank you

