XML Retrieval
Tarık Teksen Tutal, 21.07.2011



Information Retrieval: XML (Extensible Markup Language), XQuery, Text Centric vs. Data Centric

Basic XML Concepts

XML. An XML document is an ordered, labeled tree. Key concepts: XML element; XML attribute; XML DOM (Document Object Model), the standard for accessing and processing XML documents.

XML Structure: An Example

XML DOM Object. The XML DOM object of the sample in the previous slide: nodes in a tree; parse the tree top down.
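The top-down parse can be sketched with Python's standard-library DOM; the sample document and the element_paths helper are invented for illustration, not taken from the slides.

```python
from xml.dom.minidom import parseString

# A hypothetical document in the spirit of the cookbook example.
doc = parseString(
    "<book><title>Apple Desserts</title>"
    "<chapter><title>Apple Pie</title><p>Peel the apples.</p></chapter>"
    "</book>"
)

def element_paths(node, prefix=""):
    """Walk the DOM top down, collecting the path of every element node."""
    paths = []
    if node.nodeType == node.ELEMENT_NODE:
        prefix = prefix + "/" + node.tagName
        paths.append(prefix)
    for child in node.childNodes:
        paths.extend(element_paths(child, prefix))
    return paths

paths = element_paths(doc.documentElement)
# ['/book', '/book/title', '/book/chapter', '/book/chapter/title', '/book/chapter/p']
```

Each element's path from the root falls out of the traversal, which is exactly the information XML retrieval needs later for context/term pairs.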

XPath. Standard for enumerating paths in an XML document collection; a query language for selecting nodes from an XML document; defined by the World Wide Web Consortium (W3C).
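A small illustration of XPath-style node selection, using the limited XPath subset supported by Python's standard-library ElementTree (the document and paths are invented):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<book><title>Apple Desserts</title>"
    "<chapter><title>Apple Pie</title></chapter>"
    "<chapter><title>Apple Crumble</title></chapter></book>"
)

# ".//title" selects every title element anywhere below the root;
# "./chapter/title" selects only titles that are children of a chapter.
all_titles = [t.text for t in root.findall(".//title")]
chapter_titles = [t.text for t in root.findall("./chapter/title")]
# all_titles     -> ['Apple Desserts', 'Apple Pie', 'Apple Crumble']
# chapter_titles -> ['Apple Pie', 'Apple Crumble']
```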

Schema. Puts constraints on the structure of allowable XML. Two standards for schemas: XML DTD and XML Schema.

Challenges in XML Retrieval

Structured Document Retrieval Principle. A system should always retrieve the most specific part of a document answering the query. In a cookbook collection, if a user queries "Apple Pie", the system should return the relevant "Apple Pie" chapter of the book "Apple Desserts", not the entire book. In the same example, however, if the user queries "Apple", the whole book should be returned instead of a single chapter.

Indexing Unit. In unstructured retrieval the indexing unit is usually clear: files on a PC, pages on the Web, e-mail messages, etc. In structured (XML) retrieval there are several options: non-overlapping pseudodocuments, top down, bottom up, or indexing all elements.

Indexing Unit: Non-Overlapping Pseudodocuments. Drawback: the pseudodocuments are not coherent units.

Indexing Unit: Top Down. Start with one of the largest units (e.g., a book in a book collection), then postprocess search results to find for each book the subelement that is the best hit. This can fail to return the best element, since the relevance of a book is generally not a good predictor of the relevance of its subelements.

Indexing Unit: Bottom Up. Search all leaves, select the relevant ones, and extend them to larger units in postprocessing. This can also fail to return the best element, since the relevance of a subelement is generally not a good predictor of the relevance of larger units.

Indexing Unit: Index All Elements. It is not useful to index some elements (e.g., an ISBN number), and indexing everything creates redundancy: deeper-level elements are returned several times.

Nested Elements. To get rid of redundancy, several strategies are possible: discard all small elements; discard all element types that users do not look at (based on the logs of a working XML retrieval system); discard all element types that assessors generally do not judge to be relevant (if relevance assessments are available); or only keep element types that a system designer or librarian has deemed to be useful search results.

Nested Elements. Alternatively, remove nested elements in a postprocessing step, or collapse several nested elements in the results list and then highlight results.

Vector Space Model for XML Retrieval

Lexicalized Subtrees. Goal: encode each word together with its position within the XML tree as a dimension of the vector space. Map XML documents to lexicalized subtrees: take each text node (leaf) and break it into multiple nodes, one for each word (e.g., split "Bill Gates" into "Bill" and "Gates"), then define the dimensions of the vector space to be lexicalized subtrees of documents, i.e., subtrees that contain at least one vocabulary term.
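The leaf-splitting step can be sketched as follows; the word_leaves helper and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<book><author>Bill Gates</author></book>")

def word_leaves(elem, path=()):
    """Yield (path, word) pairs: one single-word leaf per word in each text node."""
    path = path + (elem.tag,)
    if elem.text and elem.text.strip():
        for word in elem.text.split():
            yield (path, word)
    for child in elem:
        yield from word_leaves(child, path)

leaves = list(word_leaves(root))
# [(('book', 'author'), 'Bill'), (('book', 'author'), 'Gates')]
```

The text node "Bill Gates" has become two lexicalized leaves, each carrying its position in the tree.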

Lexicalized Subtrees

In this representation, queries and documents can be represented as vectors, and matches can then be computed, for example, by using the vector space formalism. The difference between unstructured and structured retrieval in this formalism lies in the dimensions: vocabulary terms versus lexicalized subtrees.

Dimensions: Tradeoff. Dimensionality of the space vs. accuracy of the results. Restricting dimensions to vocabulary terms gives a standard vector space retrieval system, but it does not match the structure of the query. A separate dimension for each lexicalized subtree matches structure, but the dimensionality of the space becomes too large.

Dimensions: Compromise. Index all paths that end in a single vocabulary term, in other words all XML-context/term pairs. A structural term ⟨c, t⟩ is a pair of an XML context c and a vocabulary term t.
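A sketch of building an index over structural terms, with contexts written as path strings; the document contents and the structural_terms helper are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

docs = {
    1: "<book><title>Apple Pie</title></book>",
    2: "<play><title>Macbeth</title></play>",
}

def structural_terms(elem, context=""):
    """Yield (context, term) pairs, where context is the root-to-element path."""
    context = context + "/" + elem.tag if context else elem.tag
    if elem.text and elem.text.strip():
        for word in elem.text.split():
            yield (context, word)
    for child in elem:
        yield from structural_terms(child, context)

# Postings keyed by the structural term ⟨c, t⟩ rather than by t alone.
index = defaultdict(set)
for doc_id, xml in docs.items():
    for c, t in structural_terms(ET.fromstring(xml)):
        index[(c, t)].add(doc_id)

# index[("book/title", "Apple")] -> {1}
```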

Context Resemblance. To measure the similarity between a path in a query and a path in a document: CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|) if c_q matches c_d, and 0 otherwise. |c_q| and |c_d| are the number of nodes in the query path and document path, respectively. c_q matches c_d if and only if we can transform c_q into c_d by inserting additional nodes.

Context Resemblance. CR(c_q4, c_d2) = 3/4 = 0.75; CR(c_q4, c_d3) = 3/5 = 0.6.
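The slide's values can be reproduced with a short sketch, computing CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|) when c_q is an ordered subsequence of c_d. Representing paths as lists of tag names, and the concrete paths below, are assumptions chosen to match the 0.75 and 0.6 values.

```python
def matches(cq, cd):
    """True if cq can be transformed into cd by inserting additional nodes,
    i.e. cq is an ordered subsequence of cd."""
    it = iter(cd)
    return all(node in it for node in cq)

def cr(cq, cd):
    """Context resemblance: (1+|cq|)/(1+|cd|) if cq matches cd, else 0."""
    if matches(cq, cd):
        return (1 + len(cq)) / (1 + len(cd))
    return 0.0

cr(["book", "title"], ["book", "chapter", "title"])             # 0.75
cr(["book", "title"], ["book", "chapter", "section", "title"])  # 0.6
cr(["play", "title"], ["book", "title"])                        # 0.0 (no match)
```

Note the `node in it` idiom consumes the iterator, so order is respected: a match must visit the document path left to right.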

Document Similarity Measure. The final score for a document is a variant of the cosine measure, also called "SimNoMerge". It is not a true cosine measure, since its value can be larger than 1.0.

Document Similarity Measure. V is the vocabulary of non-structural terms; B is the set of all XML contexts; weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in query q and document d, respectively. The score is SimNoMerge(q, d) = Σ_{c_k ∈ B} Σ_{c_l ∈ B} CR(c_k, c_l) Σ_{t ∈ V} weight(q, t, c_k) · weight(d, t, c_l) / √(Σ_{c ∈ B, t ∈ V} weight(d, t, c)²). Standard weighting can be used, e.g. idf_t × wf_{t,d}, where idf_t depends on which elements we use to compute df_t.

SimNoMerge Algorithm: ScoreDocumentsWithSimNoMerge(q, B, V, N, normalizer)
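A simplified sketch of the scoring loop, not the book's inverted-index pseudocode: it assumes query and document weights are stored as {(context, term): weight} dictionaries, with contexts as tuples of tag names, and all data below is invented.

```python
from math import sqrt

def cr(cq, cd):
    """Context resemblance: (1+|cq|)/(1+|cd|) if cq is an ordered
    subsequence of cd, else 0."""
    it = iter(cd)
    if all(node in it for node in cq):
        return (1 + len(cq)) / (1 + len(cd))
    return 0.0

def sim_no_merge(q_weights, d_weights):
    """Sum CR-weighted products of term weights over all context pairs,
    normalized by the length of the document's weight vector."""
    normalizer = sqrt(sum(w * w for w in d_weights.values()))
    score = 0.0
    for (cq, tq), wq in q_weights.items():
        for (cd, td), wd in d_weights.items():
            if tq == td:  # only identical vocabulary terms contribute
                score += cr(cq, cd) * wq * wd / normalizer
    return score

q = {(("title",), "apple"): 1.0}
d = {(("book", "title"), "apple"): 0.8, (("book", "p"), "pie"): 0.6}
score = sim_no_merge(q, d)
# 0.8 * CR(title, book/title) = 0.8 * 2/3 ≈ 0.533
```

The score can exceed 1.0 because the query vector is not normalized, which is why the slide calls this a variant rather than a true cosine measure.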

Evaluation of XML Retrieval

INEX: the Initiative for the Evaluation of XML Retrieval. A yearly standard benchmark evaluation that has produced test collections (documents, sets of queries, and relevance judgments). Originally based on an IEEE journal collection; since 2006 INEX has used the much larger English Wikipedia test collection. The relevance of documents is judged by human assessors.

INEX Topics. Content Only (CO): regular keyword queries, as in unstructured IR. Content and Structure (CAS): structural constraints in addition to keywords; relevance assessments are more complicated for CAS.

INEX Relevance Assessments. INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance. Component coverage evaluates whether the element retrieved is "structurally" correct; topical relevance evaluates whether the element is about the topic of the query.

INEX Relevance Assessments. Component coverage: Exact coverage (E): the information sought is the main topic of the component and the component is a meaningful unit of information. Too small (S): the information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information. Too large (L): the information sought is present in the component, but is not the main topic. No coverage (N): the information sought is not a topic of the component. Topical relevance: highly relevant (3), fairly relevant (2), marginally relevant (1), and nonrelevant (0).

Combining the Relevance Dimensions. Not all combinations of the two dimensions are possible; for example, 3N (highly relevant but with no coverage) cannot occur. The combined grades are then mapped to a single scale by a quantization function.
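The strict quantization, which counts only exact-coverage, highly-relevant assessments as relevant, can be sketched as follows; other INEX quantizations assign graded values in between, which are not reproduced here.

```python
def q_strict(relevance, coverage):
    """Strict quantization: 1.0 only for a (3, 'E') assessment,
    i.e. highly relevant with exact coverage; 0.0 otherwise."""
    return 1.0 if (relevance, coverage) == (3, "E") else 0.0

q_strict(3, "E")  # 1.0
q_strict(2, "L")  # 0.0
```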

INEX Evaluation Measures. Precision and recall can be applied, using either summed grades or binary relevance. Overlap is not accounted for: nested elements can appear in the same search result. A recent INEX focus is to develop algorithms and evaluation measures that return non-redundant results lists and evaluate them properly.