Evaluation of Relevance Feedback Algorithms for XML Retrieval
Silvana Solomon
27 February 2007
Supervisor: Dr. Ralf Schenkel
Outline
- Short introduction
- Motivation & goals
- Evaluating retrieval effectiveness
- INEX tool
- Evaluation methodology
- Results
Introduction
[Figure: an example XML article tree (frontmatter with author "Ian Ruthven", body with sections and subsections containing text such as "The IR process is composed ..." and "Figure 1 outlines ...", backmatter with citation "D. Harman"), illustrating the path to a result element versus the content of the result.]
The relevance-feedback loop:
(1) query is sent to the XML search engine
(2) results are returned
(3) user gives feedback on the results
(4) feedback component produces an expanded query
(5) results of the expanded query are returned
Motivation
- What is the best way to compare feedback algorithms?
- Standard evaluation tools cannot be applied directly to feedback results.
Goals:
- Analyze evaluation methods
- Develop an evaluation tool
Evaluating Retrieval Effectiveness
Ingredients:
- Document collection
- Set of topics
- Set of assessments (produced by human assessors)
- Metrics
INEX: INitiative for the Evaluation of XML Retrieval
- 2006 document collection: 600,000 Wikipedia documents
INEX Tool: EvalJ
- Tool for evaluating information retrieval experiments
- Implements a set of evaluation metrics
- Limitation: cannot measure the improvement of runs produced with feedback
RF Evaluation – Ranking Effect
[Figure: a baseline run (doc[1]/bdy[1], doc[2]/bdy[1], doc[3], doc[8]/bdy[1]/article[3], doc[4]/bdy[1]/article[1]/sec[6]) next to a feedback run; the user marks relevant results among the top of the baseline run, and the feedback run places these known relevant elements first.]
- Pushing the known relevant results to the top of the element ranking artificially improves recall-precision figures.
RF Evaluation – Feedback Effect
- Goal: measure the improvement on unseen relevant elements
- This cannot be tested directly: modify the feedback run, then evaluate only the untrained results
[Figure: a baseline run (doc[1]/bdy[1], doc[3], doc[2]/bdy[1], doc[8]/bdy[1]/article[3], doc[4]/bdy[1]/article[1]/sec[6]) next to a feedback run, after marking the top results relevant.]
Evaluation Methodology (1)
1. Standard text IR approach: freeze the known results at the top (independent-results assumption)
2. New approach: remove the known results + X from the collection:
- resColl-result: remove the results only (~document retrieval)
- resColl-desc: remove results + descendants
- resColl-anc: remove results + ancestors
- resColl-path: remove results + descendants + ancestors
- resColl-doc: remove the whole document containing known results
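The five removal modes can be sketched as predicates over element paths like those in the examples. A minimal sketch, assuming paths are slash-separated step lists; the function names are illustrative, not taken from EvalJ:

```python
def steps(path):
    """Split an element path like 'doc[2]/bdy[1]/article[1]' into steps."""
    return path.split("/")

def is_ancestor(a, b):
    """True if element a is a proper ancestor of element b."""
    sa, sb = steps(a), steps(b)
    return len(sa) < len(sb) and sb[:len(sa)] == sa

def removed_by(mode, known, candidate):
    """Would `candidate` be removed from the collection by the given
    resColl mode, given one known (assessed) result `known`?"""
    if candidate == known:
        return True                      # every mode removes the result itself
    if mode == "resColl-doc":
        return steps(candidate)[0] == steps(known)[0]   # same document
    anc  = is_ancestor(candidate, known)  # candidate lies above the result
    desc = is_ancestor(known, candidate)  # candidate lies below the result
    return {"resColl-result": False,
            "resColl-desc":   desc,
            "resColl-anc":    anc,
            "resColl-path":   anc or desc}[mode]
```

For example, with the known result doc[2]/bdy[1], resColl-path removes both the ancestor doc[2] and the descendant doc[2]/bdy[1]/article[1], while resColl-result removes only doc[2]/bdy[1] itself.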
Evaluation Methodology (2)
Freezing:
[Animation: the top-3 known results of the baseline run are frozen, one by one, at the top of the feedback run.]
Baseline run: doc[7]/bdy[1], doc[3], doc[2]/bdy[1], doc[8]/bdy[1]/article[3], doc[4]/bdy[1]/article[1]/sec[6]
Feedback run after freezing the top-3: doc[7]/bdy[1], doc[3], doc[2]/bdy[1], doc[2]/bdy[1]/article[1], doc[9], doc[4]/bdy[1]/article[2], doc[2]/bdy[1], doc[4]/bdy[1]/article[4]
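The freezing step above can be sketched as prepending the frozen top-k of the baseline run to the feedback run. A minimal sketch with the paths from the example; a full implementation might also need to handle duplicates of frozen entries further down the run:

```python
def freeze(baseline, feedback, k):
    """Freeze the top-k results of the baseline run at the top of the
    feedback run, as in the freezing example."""
    return baseline[:k] + feedback

baseline = ["doc[7]/bdy[1]", "doc[3]", "doc[2]/bdy[1]",
            "doc[8]/bdy[1]/article[3]", "doc[4]/bdy[1]/article[1]/sec[6]"]
feedback = ["doc[2]/bdy[1]/article[1]", "doc[9]", "doc[4]/bdy[1]/article[2]",
            "doc[2]/bdy[1]", "doc[4]/bdy[1]/article[4]"]

# the frozen run starts with the three known results, followed by the
# original feedback ranking
frozen = freeze(baseline, feedback, 3)
```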
Evaluation Methodology (3)
resColl-path:
[Animation: for each known result in the top-3 of the baseline run, the result, its descendants, and its ancestors are removed before evaluating the feedback run.]
Baseline run: doc[7]/bdy[1], doc[3], doc[2]/bdy[1], doc[8]/bdy[1]/article[3], doc[4]/bdy[1]/article[1]/sec[6]
Feedback run after removal: doc[2]/bdy[1]/article[1], doc[9], doc[4]/bdy[1]/article[2], doc[4]/bdy[1]/article[4]
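Following the definition on the Evaluation Methodology (1) slide (remove results + descendants + ancestors), applying resColl-path to a run amounts to dropping every entry that lies on the root path of a known result. A minimal, illustrative sketch (function names are not from EvalJ):

```python
def on_same_path(a, b):
    """True if one of the element paths a, b is an ancestor-or-self of
    the other, i.e. both lie on one root-to-element path."""
    sa, sb = a.split("/"), b.split("/")
    n = min(len(sa), len(sb))
    return sa[:n] == sb[:n]

def rescoll_path(run, known_results):
    """Drop every run entry that equals, contains, or is contained in a
    known result (resColl-path removal)."""
    return [e for e in run
            if not any(on_same_path(e, k) for k in known_results)]

known = ["doc[7]/bdy[1]", "doc[3]", "doc[2]/bdy[1]"]
run = ["doc[2]/bdy[1]/article[1]", "doc[9]", "doc[4]/bdy[1]/article[2]",
       "doc[2]/bdy[1]", "doc[4]/bdy[1]/article[4]"]

# doc[2]/bdy[1] (a known result) and doc[2]/bdy[1]/article[1] (its
# descendant) are both filtered out
filtered = rescoll_path(run, known)
```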
Best Evaluation Methodology?
[Figure: the example XML article tree from the introduction, with the elements removed by resColl-path highlighted.]
- resColl-path
Testing Evaluated Results
- Standard method: averaging over topics — problematic, as it can hide per-topic differences
[Table: per-topic average scores for the baseline run and the modified feedback run.]
- t-test & Wilcoxon signed-rank test: give the probability p that the baseline run is better than the feedback run
- The experiment is significant if p < 0.05 or p < 0.01
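A paired test over per-topic scores can be sketched as follows. This minimal sketch computes only the paired t statistic; a full test would look up the p-value in the t distribution with n-1 degrees of freedom (e.g. via scipy.stats.ttest_rel). The per-topic scores are hypothetical, for illustration only:

```python
import math

def paired_t(baseline, feedback):
    """Paired t statistic over per-topic scores of two runs.
    A positive value means the feedback run scored higher on average."""
    diffs = [f - b for b, f in zip(baseline, feedback)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# hypothetical per-topic scores for 5 topics
baseline = [0.30, 0.25, 0.40, 0.35, 0.20]
feedback = [0.38, 0.30, 0.45, 0.36, 0.31]

# compare |t| against the critical value of the t distribution
# with n-1 = 4 degrees of freedom to decide significance
t = paired_t(baseline, feedback)
```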
Results (1)
Evaluation mode: resColl-path
[Table: for each feedback file — TopX_CO_Content.xml, xfirm_r1_cosc3s.xml, xfirm_r1_cosc5.xml, xfirm_r1_cosc3.xml, xfirm_r1_coc3s3.xml, xfirm2_r2_cop4.xml, xfirm2_r2_cot40.xml, xfirm2_r2_cot10.xml, xfirm_r1_coc3.xml, xfirm_r1_coc10.xml — the INEX metric, absolute improvement, relative improvement, t-test, and Wilcoxon signed-rank (WSR) results; numeric values not recoverable from the extraction.]
Results (2)
- Comparison of evaluation techniques based on relative improvement w.r.t. the baseline run
[Table: best-performing runs per evaluation mode — freezing, resColl-anc, resColl-desc, resColl-doc, resColl-path, resColl-res — with entries among c3s, c5, and TopX; numeric values not recoverable from the extraction.]
Legend:
- TopX = TopX_CO_Content.xml
- c3 = xfirm_r1_cosc3.xml
- c3s = xfirm_r1_cosc3s.xml
- c5 = xfirm_r1_cosc5.xml
Conclusions & Future Work
- Evaluation based on different techniques & metrics
- Correct measurement of the improvement due to feedback
- Not solved: comparing several systems with different output
- Future work: possibly a hybrid evaluation mode