XML Ranking Querying, Dagstuhl, 9-13 Mar, 2008. An Adaptive XML Retrieval System. Yosi Mass, Michal Shmueli-Scheuer, IBM Haifa Research Lab.


1 An Adaptive XML Retrieval System
Yosi Mass, Michal Shmueli-Scheuer
IBM Haifa Research Lab

2 The XML retrieval tasks
Query formulation
- CO – Content only
- CAS – Content and structure (NEXI)
Retrieval tasks
- Thorough: “find all highly exhaustive and specific elements”. Retrieval results can be (possibly overlapping) XML elements of varying granularity that fulfill the query.
- Focussed: “find the most exhaustive and specific element in a path”. No overlap in returned results.

3 Approaches for XML retrieval
- Index full documents: score documents, then score components inside the documents.
  Problem: works well for “fetch and browse” but not for the general thorough task.
- Index only leaf elements: score leaves and propagate scores along the XML tree.
  Problem: the weights used to propagate are either set manually by the user or set empirically.
- Index all elements into the same index: score all possible elements.
  Problem: distorted “element-level” statistics due to overlapping.
Can we fix the distorted statistics?
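
To make the overlap problem concrete, here is a minimal sketch (not taken from the slides) of how a single term occurrence inflates document frequency when every element goes into the same index. The function names are hypothetical, and elements are assumed to be identified by positional XPath strings from one document.

```python
from typing import List, Set


def ancestors_or_self(path: str) -> List[str]:
    """Return every prefix of a positional XPath, from the root down to the element."""
    steps = path.strip("/").split("/")
    return ["/" + "/".join(steps[: i + 1]) for i in range(len(steps))]


def shared_index_df(occurrence_paths: List[str]) -> int:
    """Count the indexed elements that contain the term when all elements share one index.

    Every ancestor of an element that holds the term also contains the term,
    so a single occurrence is counted once per nesting level.
    """
    containing: Set[str] = set()
    for path in occurrence_paths:
        containing.update(ancestors_or_self(path))
    return len(containing)
```

With one occurrence in `/article[1]/bdy[1]/sec[2]/p[1]`, the shared index sees four elements containing the term (the paragraph, its section, the body and the article), whereas an index holding only non-nesting elements would count it once.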

4 An adaptive XML retrieval system
- Split all collection elements into separate indices such that:
  - Coverage: each element is indexed in at least one index.
  - No overlap: elements in each index do not nest.
- Run the query on each index.
- Merge the results into a single result list.
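
The merge step could look roughly like the sketch below. The slides do not say how scores from different indices are made comparable, so the max-normalisation used here is an assumption, and `merge_results` and the `Result` type are hypothetical names.

```python
from typing import Dict, List, Tuple

Result = Tuple[str, float]  # (element XPath, score returned by one index)


def merge_results(per_index: Dict[int, List[Result]]) -> List[Result]:
    """Merge per-index result lists into one list ordered by descending score."""
    merged: List[Result] = []
    for results in per_index.values():
        if not results:
            continue
        top = max(score for _, score in results)
        for path, score in results:
            # Assumption: dividing by the best score of the index makes
            # scores from different indices comparable.
            merged.append((path, score / top if top > 0 else 0.0))
    # Highest score first; ties are broken by path so the order is deterministic.
    merged.sort(key=lambda r: (-r[1], r[0]))
    return merged
```

Any other normalisation (or none) can be swapped in without changing the surrounding flow of querying each index and then sorting the union.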

5 Split to indices - example
[Figure: two example document trees whose levels are assigned to Index 0 through Index 3. First tree: article[1], bdy[1], sec[1], sec[2], p[1], p[3]. Second tree: article[1], bdy[1], sec[1], ss1[1], ss1[2].]
Index 0: /article[1] (first tree); /article[1] (second tree)
Index 1: /article[1]/bdy[1] (first tree); /article[1]/bdy[1] (second tree)
Index 2: /article[1]/bdy[1]/sec[1], /article[1]/bdy[1]/sec[2] (first tree); /article[1]/bdy[1]/sec[1] (second tree)
Index 3: /article[1]/bdy[1]/sec[2]/p[1], /article[1]/bdy[1]/sec[2]/p[3] (first tree); /article[1]/bdy[1]/sec[1]/ss1[1], /article[1]/bdy[1]/sec[1]/ss1[2] (second tree)
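
As a sanity check on this example, the short sketch below tests the no-overlap property for the paths of a single document. The helper names are hypothetical; the only assumption is that elements are identified by positional XPath strings, so one element nests inside another exactly when its path extends the other's path by at least one step.

```python
from itertools import combinations
from typing import Iterable


def nests(a: str, b: str) -> bool:
    """True if one path is a proper ancestor of the other."""
    # The trailing "/" keeps sec[1] from being treated as an ancestor of sec[10].
    return b.startswith(a + "/") or a.startswith(b + "/")


def has_overlap(index: Iterable[str]) -> bool:
    """True if any two elements placed in the same index nest."""
    return any(nests(a, b) for a, b in combinations(set(index), 2))
```

For instance, `has_overlap` is false for Index 2 of the first tree (`sec[1]` and `sec[2]` are siblings) and becomes true as soon as `bdy[1]` is put in the same index as one of its sections.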

6 An adaptive indexing schema
SplitToIndices(doc, minCompSize, nInd)
1. Find all leaves in doc that are larger than minCompSize.
2. If no such leaves are found, return G_0 = {root}.
3. Let d be the length of the longest path among those leaves.
4. Create groups {G_0, …, G_{d-1}}, where each G_i contains the elements given by the XPath prefixes of length i of all matched leaves.
5. Remove repeating elements in each group.
6. Split the groups {G_0, …, G_{d-1}} into indices {I_0, …, I_{nInd-1}} (several strategies).
7. Return {I_0, …, I_{nInd-1}}.
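
The grouping part of SplitToIndices (everything up to removing repeated elements) might be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `build_groups` is a hypothetical name, the document is represented as a mapping from leaf XPath to leaf size in whatever unit minCompSize uses, and group i is taken to hold the prefixes with i + 1 steps so that group 0 is the root, matching the fallback case. The final split of groups into nInd indices is left out because the slide only says several strategies exist.

```python
from typing import Dict, List, Set


def build_groups(leaf_sizes: Dict[str, int], root: str,
                 min_comp_size: int) -> List[Set[str]]:
    """Group the ancestors-or-self of all large-enough leaves by depth."""
    # Keep only the leaves larger than min_comp_size.
    leaves = [path for path, size in leaf_sizes.items() if size > min_comp_size]
    # No qualifying leaf: only the root element is kept.
    if not leaves:
        return [{root}]
    split = [path.strip("/").split("/") for path in leaves]
    # d is the number of steps in the longest qualifying leaf path.
    d = max(len(steps) for steps in split)
    # One set per depth; using sets removes repeated elements automatically.
    groups: List[Set[str]] = [set() for _ in range(d)]
    for steps in split:
        for i in range(len(steps)):
            groups[i].add("/" + "/".join(steps[: i + 1]))
    return groups
```

Because each group holds elements of a single depth from the same document, no two elements in a group can nest, which is the no-overlap property the indices need.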

7 Examples – cut long paths
Minimal element: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]
Split to indices:
index 0: /article[1]
index 1: /article[1]/body[1]
index 2: /article[1]/body[1]/section[7]
index 3: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]
index 4: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]
index 5: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]
index 6: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]
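
One splitting strategy that reproduces this example is to keep the topmost and bottommost levels of a long path and drop the levels in between. The sketch below assumes the top part gets nInd // 2 levels and the bottom part gets the rest; that split point is inferred from this single example rather than stated on the slide, and `cut_long_path` is a hypothetical name.

```python
from typing import List


def cut_long_path(path: str, n_ind: int) -> List[str]:
    """Return the prefix of `path` assigned to each index, shallowest first."""
    steps = path.strip("/").split("/")
    d = len(steps)
    if d <= n_ind:
        # Short path: every level gets its own index.
        lengths = list(range(1, d + 1))
    else:
        # Assumption: keep n_ind // 2 top levels and fill the rest from the bottom.
        top = n_ind // 2
        bottom = n_ind - top
        lengths = list(range(1, top + 1)) + list(range(d - bottom + 1, d + 1))
    return ["/" + "/".join(steps[:n]) for n in lengths]
```

For the ten-step path above and `n_ind=7`, this keeps prefix lengths 1, 2, 3 and 7, 8, 9, 10, so the three levels between `section[7]` and the inner `tr[1]` are not indexed.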

8 Experiments
IEEE collection
- ,000 articles, 700MB
- Average document length ~41K
- Average depth
- topics from INEX 2005
Wikipedia collection
- 660,000 pages, 4.5GB
- Average document length 6.8K
- Average depth
- topics from INEX 2006

9 Coverage
For nInd=7 and minCompSize=10:
- 87% coverage of the IEEE collection recall base
- 75% coverage of the Wikipedia collection filtered recall base
The filtered recall base was generated by removing all link elements from the recall base.
We still miss some small elements and some in-between elements that have depth > 7.
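
A coverage figure of this kind could be computed roughly as below, assuming coverage means the fraction of recall-base elements that appear in at least one index. The slides do not give the exact definition or name the link tags, so both the formula and the `link_tags` parameter are assumptions, and all function names are hypothetical.

```python
import re
from typing import Iterable, List, Set


def last_tag(path: str) -> str:
    """Tag name of the last step of a positional XPath, without its [n] predicate."""
    return re.sub(r"\[\d+\]$", "", path.rsplit("/", 1)[-1])


def filter_links(recall_base: Iterable[str], link_tags: Set[str]) -> Set[str]:
    """Drop recall-base elements whose own tag is one of the caller-supplied link tags."""
    return {p for p in recall_base if last_tag(p) not in link_tags}


def coverage(recall_base: Iterable[str], indices: List[Set[str]]) -> float:
    """Fraction of recall-base elements found in at least one index."""
    recall = set(recall_base)
    if not recall:
        return 0.0
    indexed: Set[str] = set().union(*indices) if indices else set()
    return len(recall & indexed) / len(recall)
```

In a real collection the paths would also need a document identifier, since `/article[1]` occurs in every document; that detail is omitted here.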

10 Doc pivot
Some low-level indices hold only part of the collection's content and therefore have missing statistics.
Solution: compensate using the score of the containing document:
Score'(e) = docPivot * Score(doc(e)) + (1 - docPivot) * Score(e)
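
The doc pivot formula is a linear interpolation between the element's own score and the score of its containing document. A minimal sketch follows; the function name is hypothetical, and restricting docPivot to the range 0 to 1 is an assumption, since the slide does not state its range.

```python
def doc_pivot_score(element_score: float, doc_score: float, doc_pivot: float) -> float:
    """Score'(e) = docPivot * Score(doc(e)) + (1 - docPivot) * Score(e)."""
    # Assumption: docPivot is a mixing weight between 0 and 1.
    if not 0.0 <= doc_pivot <= 1.0:
        raise ValueError("doc_pivot is assumed to lie in [0, 1]")
    return doc_pivot * doc_score + (1.0 - doc_pivot) * element_score
```

With `doc_pivot=0` the element keeps its own score, and with `doc_pivot=1` it inherits the document score entirely, so the parameter controls how much the document-level statistics compensate for a sparse index.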

11 Elements distribution

12 Tuning number of indices
needle
Set minCompSize=10

13 Tuning minimum component size
Set number of indices nInd=7

14 Summary
- Adaptive indexing schema
  - Splits XML elements into separate indices
  - Same parameters for different collections
- XML retrieval system
  - Achieved by running existing IR engines on each index
- Can be used for CAS
- Relatively low MAep results
  - Does XML structure reflect any semantic structure?

15 Thank you!

