
1 XML Retrieval. Tarık Teksen Tutal, 21.07.2011

2 Information Retrieval: XML (Extensible Markup Language); XQuery; text-centric vs. data-centric XML.

3 Basic XML Concepts

4 XML: an ordered, labeled tree. Key concepts: XML element; XML attribute; XML DOM (Document Object Model), the standard for accessing and processing XML documents.

5 XML Structure: an example.
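As a stand-in for the slide's example, here is a small document modeled on the Shakespeare play used in the IIR textbook this talk follows; the slide's actual snippet is an assumption. It is written as a Python string so the later sketches can reuse it:

```python
# A small XML document standing in for the slide's example: an ordered,
# labeled tree with elements (play, act, scene, ...) and attributes (number).
# Illustrative only; the slide's actual snippet is assumed, not preserved.
DOC = """\
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail ...</verse>
    </scene>
  </act>
</play>"""
```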

6 XML DOM Object: the DOM object of the sample in the previous slide; nodes in a tree; parse the tree top-down.
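A minimal sketch of what the slide describes, using Python's standard xml.dom.minidom to build the DOM object for the sample above and parse the tree top-down:

```python
import xml.dom.minidom as minidom

def walk(node, depth=0):
    # Visit the DOM tree top-down: print element names and text leaves.
    if node.nodeType == node.ELEMENT_NODE:
        print("  " * depth + node.tagName)
        for child in node.childNodes:
            walk(child, depth + 1)
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        print("  " * depth + repr(node.data.strip()))

dom = minidom.parseString(DOC)  # DOC is the string from the slide 5 sketch
walk(dom.documentElement)       # every node is reachable top-down
```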

7 XPath: the standard for enumerating paths in an XML document collection; a query language for selecting nodes from an XML document; defined by the World Wide Web Consortium (W3C).
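As an illustration, Python's standard xml.etree.ElementTree understands a small subset of XPath, which is enough to show the idea (full XPath support needs a library such as lxml):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(DOC)  # DOC from the slide 5 sketch

# Path expression: every <title> directly under a <scene>, anywhere in the tree.
for title in root.findall(".//scene/title"):
    print(title.text)

# Attribute predicate: all scenes inside act I.
for scene in root.findall(".//act[@number='I']/scene"):
    print(scene.get("number"))
```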

8 Schema: puts constraints on the structure of allowable XML. Two standards for schemas: XML DTD and XML Schema.
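A brief sketch of the DTD flavor, validating the play document from the earlier sketches with the third-party lxml library; the DTD itself is illustrative, not from the slides:

```python
from io import StringIO
from lxml import etree

# An illustrative DTD constraining the play document from the slide 5 sketch.
dtd = etree.DTD(StringIO(
    "<!ELEMENT play (author, title, act+)>"
    "<!ELEMENT author (#PCDATA)>"
    "<!ELEMENT title (#PCDATA)>"
    "<!ELEMENT act (scene+)>"
    "<!ATTLIST act number CDATA #REQUIRED>"
    "<!ELEMENT scene (title, verse+)>"
    "<!ATTLIST scene number CDATA #REQUIRED>"
    "<!ELEMENT verse (#PCDATA)>"
))

root = etree.fromstring(DOC)  # DOC from the slide 5 sketch
print(dtd.validate(root))     # True if the document conforms to the DTD
```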

9 Challenges in XML Retrieval

10 Structured Document Retrieval Principle: a system should always retrieve the most specific part of a document answering the query. In a cookbook collection, if a user queries "apple pie", the system should return the relevant "Apple Pie" chapter of the book "Apple Desserts", not the entire book. If, however, the user queries just "apple", the book should be returned instead of a single chapter.

11 Indexing Unit. In unstructured retrieval, the indexing unit is given: files on a PC, pages on the Web, e-mail messages, etc. In structured retrieval, it must be chosen; the options are non-overlapping pseudodocuments, top-down, bottom-up, and indexing all elements.

12 Indexing Unit, non-overlapping pseudodocuments: drawback is that the resulting pseudodocuments are not coherent units.

13 Indexing Unit, top-down: start with one of the largest units (e.g., a book in a book collection), then postprocess search results to find, for each book, the subelement that is the best hit. This can fail to return the best element, since the relevance of a book is generally not a good predictor of the relevance of its subelements.

14 Indexing Unit, bottom-up: search all leaves, select the relevant ones, and extend them to larger units in postprocessing. This can fail to return the best element, since the relevance of a subelement is generally not a good predictor of the relevance of larger units.

15 Indexing Unit, index all elements: not useful for some element types (e.g., an ISBN number), and it creates redundancy, since elements at deeper levels are returned several times, once on their own and once inside each enclosing element.
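To make the redundancy concrete, here is a small sketch (reusing the play document from the slide 5 sketch) that treats every element as an indexing unit:

```python
import xml.etree.ElementTree as ET

def units(elem, path=""):
    # Yield (path, text) for every element: each one is an indexing unit.
    path = f"{path}/{elem.tag}"
    yield path, " ".join(" ".join(elem.itertext()).split())
    for child in elem:
        yield from units(child, path)

root = ET.fromstring(DOC)  # DOC from the slide 5 sketch
for path, text in units(root):
    print(f"{path}: {text[:40]}")
```

Running this shows the verse text once under /play/act/scene/verse and again inside /play/act/scene, /play/act, and /play, which is exactly the redundancy the slide warns about.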

16 Nested Elements: to get rid of redundancy, discard all small elements; discard all element types that users do not look at (determined from working XML retrieval system logs); discard all element types that assessors generally do not judge to be relevant (if relevance assessments are available); or keep only element types that a system designer or librarian has deemed to be useful search results.

17 Nested Elements: remove nested elements in a postprocessing step, or collapse several nested elements in the results list and then highlight the results.

18 Vector Space Model for XML Retrieval

19 Lexicalized Subtrees: the aim is to encode each word, together with its position within the XML tree, as a dimension of the vector space. Map XML documents to lexicalized subtrees: take each text node (leaf) and break it into multiple nodes, one for each word (e.g., split "Bill Gates" into "Bill" and "Gates"), then define the dimensions of the vector space to be lexicalized subtrees of documents, i.e., subtrees that contain at least one vocabulary term.

20 Lexicalized Subtrees

21 Queries and documents can be represented as vectors in this lexicalized-subtree space, and matches can then be computed, for example, by using the vector space formalism. Unstructured vs. structured retrieval in this formalism differs only in the dimensions: vocabulary terms vs. lexicalized subtrees.

22 Dimensions, the tradeoff: dimensionality of the space vs. accuracy of results. Restricting dimensions to vocabulary terms yields a standard vector space retrieval system that cannot match the structure of the query; a separate dimension for each lexicalized subtree makes the dimensionality of the space too large.

23 Dimensions, the compromise: index all paths that end with a single vocabulary term, i.e., XML-context/term pairs. A structural term <c, t> is a pair of an XML context c and a vocabulary term t.
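A minimal sketch of extracting structural terms under this compromise (the function name and tokenization are my own; the slides do not prescribe an implementation):

```python
import xml.etree.ElementTree as ET

def structural_terms(xml_string):
    # Yield <c, t> pairs: c is the root-to-element path (the XML context),
    # t is one word from that element's text node.
    root = ET.fromstring(xml_string)

    def walk(elem, path):
        path = path + (elem.tag,)
        for word in (elem.text or "").split():
            yield path, word.lower()
        for child in elem:
            yield from walk(child, path)

    yield from walk(root, ())

for context, term in structural_terms(DOC):  # DOC from the slide 5 sketch
    print("/".join(context), term)           # e.g. play/title macbeth
```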

24 Context Resemblance: measures the similarity between a path in a query and a path in a document:

CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|) if c_q matches c_d, and 0 otherwise,

where |c_q| and |c_d| are the number of nodes in the query path and the document path, respectively, and c_q matches c_d if and only if we can transform c_q into c_d by inserting additional nodes.

25 Context Resemblance, examples: CR(c_q4, c_d2) = 3/4 = 0.75 and CR(c_q4, c_d3) = 3/5 = 0.6.
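A direct implementation of CR as defined on slide 24, where a path is a sequence of node labels and the match test checks that the query path is a subsequence of the document path (the example paths below are hypothetical, chosen to reproduce the slide's 0.75):

```python
def matches(cq, cd):
    # cq matches cd iff cq can be turned into cd by inserting nodes,
    # i.e. cq is a subsequence of cd.
    it = iter(cd)
    return all(label in it for label in cq)

def context_resemblance(cq, cd):
    return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0

# |c_q| = 2 and |c_d| = 3 give (1 + 2) / (1 + 3) = 0.75, as on the slide.
print(context_resemblance(("book", "title"), ("book", "chapter", "title")))
```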

26 Document Similarity Measure: the final score for a document is a variant of the cosine measure, also called SimNoMerge. It is not a true cosine measure, since its value can be larger than 1.0.

27 Document Similarity Measure:

SimNoMerge(q, d) = Σ_{c_k ∈ B} Σ_{c_l ∈ B} CR(c_k, c_l) Σ_{t ∈ V} weight(q, t, c_k) weight(d, t, c_l) / sqrt(Σ_{c ∈ B, t ∈ V} weight(d, t, c)^2)

where V is the vocabulary of non-structural terms; B is the set of all XML contexts; and weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in query q and document d, respectively. Weights can be computed by standard weighting, e.g., idf_t x wf_{t,d}, where idf_t depends on which elements we use to compute df_t.

28 SimNoMerge Algorithm: ScoreDocumentsWithSimNoMerge(q, B, V, N, normalizer)
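As a hedged sketch of the scoring step, the fragment below computes SimNoMerge for a single document directly from the slide 27 formula; the dict-based data layout and names are my own, not the book's inverted-index procedure, and context_resemblance comes from the slide 25 sketch:

```python
from math import sqrt

def sim_no_merge(query, doc):
    # query, doc: dicts mapping (context, term) -> weight, where a context
    # is a tuple of node labels. Only the document vector is normalized,
    # which is why SimNoMerge is not a true cosine measure.
    normalizer = sqrt(sum(w * w for w in doc.values()))
    if normalizer == 0.0:
        return 0.0
    score = 0.0
    for (c_q, t_q), w_q in query.items():
        for (c_d, t_d), w_d in doc.items():
            if t_q == t_d:  # the inner sum of the formula runs over shared terms
                score += context_resemblance(c_q, c_d) * w_q * w_d
    return score / normalizer

# Hypothetical weights: the query asks for "macbeth" in a play/title context.
q = {(("play", "title"), "macbeth"): 1.0}
d = {(("play", "title"), "macbeth"): 0.8,
     (("play", "author"), "shakespeare"): 0.6}
print(sim_no_merge(q, d))  # 0.8
```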

29 Evaluation of XML Retrieval

30 INEX, the Initiative for the Evaluation of XML Retrieval: a yearly standard benchmark evaluation that has produced test collections (documents, sets of queries, and relevance judgments). It was originally based on an IEEE journal collection; since 2006 INEX has used the much larger English Wikipedia test collection. The relevance of documents is judged by human assessors.

31 INEX Topics: Content Only (CO) topics are regular keyword queries, as in unstructured IR; Content and Structure (CAS) topics add structural constraints to the keywords, which makes relevance assessments more complicated.

32 INEX Relevance Assessments: INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance. Component coverage evaluates whether the element retrieved is structurally correct; topical relevance evaluates how well its content answers the query.

33 INEX Relevance Assessments, component coverage:
- Exact coverage (E): the information sought is the main topic of the component, and the component is a meaningful unit of information.
- Too small (S): the information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information.
- Too large (L): the information sought is present in the component, but is not the main topic.
- No coverage (N): the information sought is not a topic of the component.
Topical relevance: highly relevant (3), fairly relevant (2), marginally relevant (1), nonrelevant (0).

34 Combining the Relevance Dimensions: not all combinations of the two dimensions are possible; for example, 3N (highly relevant but no coverage) cannot occur. The possible combinations are quantized onto a single relevance scale.
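As one concrete instance, INEX's strict quantization counts only the 3E combination as relevant; a minimal sketch (the generalized quantization assigns graded values to the other combinations, whose exact numbers I leave out rather than guess):

```python
def quantize_strict(relevance, coverage):
    # INEX strict quantization: 1 for highly relevant + exact coverage (3E),
    # 0 for every other (relevance, coverage) combination.
    return 1.0 if relevance == 3 and coverage == "E" else 0.0

print(quantize_strict(3, "E"), quantize_strict(2, "L"))  # 1.0 0.0
```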

35 INEX Evaluation Measures: precision and recall can be applied, either by summing the quantized grades or by mapping them to binary relevance. Overlap, however, is not accounted for: nested elements can appear in the same result list. A recent INEX focus is to develop algorithms and evaluation measures that return non-redundant result lists and evaluate them properly.

