XML Retrieval with slides of C. Manning und H.Schutze 04/12/2008.

Slides:



Advertisements
Similar presentations
Even More TopX: Relevance Feedback Ralf Schenkel Joint work with Osama Samodi, Martin Theobald.
Advertisements

XML: Extensible Markup Language
Chapter 5: Introduction to Information Retrieval
Modern information retrieval Modelling. Introduction IR systems usually adopt index terms to process queries IR systems usually adopt index terms to process.
Query Languages. Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
XML R ETRIEVAL Tarık Teksen Tutal I NFORMATION R ETRIEVAL XML (Extensible Markup Language) XQuery Text Centric vs Data Centric.
| 1 › Gertjan van Noord2014 Zoekmachines Lecture 4.
Ranking models in IR Key idea: We wish to return in order the documents most likely to be useful to the searcher To do this, we want to know which documents.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model.
CS276 Information Retrieval and Web Search Lecture 10: XML Retrieval.
Information Retrieval in Practice
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 10: XML Retrieval 1.
ISP 433/533 Week 2 IR Models.
CS276B Text Retrieval and Mining Winter 2005 Lecture 15.
DYNAMIC ELEMENT RETRIEVAL IN A STRUCTURED ENVIRONMENT MAYURI UMRANIKAR.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) IR Queries.
Ch 4: Information Retrieval and Text Mining
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Chapter 2Modeling 資工 4B 陳建勳. Introduction.  Traditional information retrieval systems usually adopt index terms to index and retrieve documents.
1 COS 425: Database and Information Management Systems XML and information exchange.
Information retrieval: overview. Information Retrieval and Text Processing Huge literature dating back to the 1950’s! SIGIR/TREC - home for much of this.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 10: XML Retrieval 1.
CS276B Text Retrieval and Mining Winter 2005 Lecture 12.
XML –Query Languages, Extracting from Relational Databases ADVANCED DATABASES Khawaja Mohiuddin Assistant Professor Department of Computer Sciences Bahria.
CS246 Basic Information Retrieval. Today’s Topic  Basic Information Retrieval (IR)  Bag of words assumption  Boolean Model  Inverted index  Vector-space.
Chapter 5: Information Retrieval and Web Search
1 Advanced Topics XML and Databases. 2 XML u Overview u Structure of XML Data –XML Document Type Definition DTD –Namespaces –XML Schema u Query and Transformation.
Overview of Search Engines
Chapter 4 Query Languages.... Introduction Cover different kinds of queries posed to text retrieval systems Keyword-based query languages  include simple.
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 6 9/8/2011.
Main challenges in XML/Relational mapping Juha Sallinen Hannes Tolvanen.
Search Engines and Information Retrieval Chapter 1.
Lecture 6 of Advanced Databases XML Schema, Querying & Transformation Instructor: Mr.Ahmed Al Astal.
CpSc 881: Information Retrieval. 2 IR and relational databases IR systems are often contrasted with relational databases (RDB). Traditionally, IR systems.
Basics of Information Retrieval Lillian N. Cassel Some of these slides are taken or adapted from Source:
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
1 Searching XML Documents via XML Fragments D. Camel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer Presented by Hui Fang.
Quality of a search engine Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 8.
Querying Structured Text in an XML Database By Xuemei Luo.
April 14, 2003Hang Cui, Ji-Rong Wen and Tat- Seng Chua 1 Hierarchical Indexing and Flexible Element Retrieval for Structured Document Hang Cui School of.
Copyrighted material John Tullis 10/17/2015 page 1 04/15/00 XML Part 3 John Tullis DePaul Instructor
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
ISP 433/533 Week 11 XML Retrieval. Structured Information Traditional IR –Unit of information: terms and documents –No structure Need more granularity.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Information retrieval 1 Boolean retrieval. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text)
ITCS 6265 Information Retrieval & Web Mining Lecture 18-A Fall 2009.
Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.
Algorithmic Detection of Semantic Similarity WWW 2005.
CS276 Information Retrieval and Web Search Lecture 10: XML Retrieval.
1 Information Retrieval LECTURE 1 : Introduction.
Information Retrieval Techniques MS(CS) Lecture 7 AIR UNIVERSITY MULTAN CAMPUS Most of the slides adapted from IIR book.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
XML Extensible Markup Language
Information Retrieval in Practice
Plan for Today’s Lecture(s)
XML: Extensible Markup Language
Text Based Information Retrieval
An Introduction to IR Chapter 10: XML Retrieval 9th Course,
XML Indexing and Search
IST 516 Fall 2011 Dongwon Lee, Ph.D.
CS 430: Information Discovery
Structure and Content Scoring for XML
Introduction to Information Retrieval
Chapter 5: Information Retrieval and Web Search
Structure and Content Scoring for XML
Information Retrieval and Web Design
Presentation transcript:

XML Retrieval with slides of C. Manning und H.Schutze 04/12/2008

Outline 04/12/20082 What is XML? Challenges in XML retrieval Vector space model for XML retrieval Evaluation of text-centric XML retrieval

What is XML? eXtensible Markup Language A framework for defining markup languages No fixed collection of markup tags Each XML language targeted for application All XML languages share features Enables building of generic tools 04/12/20083

XML Example FileCab This chapter describes the commands that manage the FileCab inet application. 404/12/2008

Basic Structure An XML document is an ordered, labeled tree character data leaf nodes contain the actual data (text strings) element nodes, are each labeled with a name (often called the element type), and a set of attributes, each consisting of a name and a value, can have child nodes 504/12/2008

Elements Elements are denoted by markup tags thetext Element start tag: foo Attribute: attr1 The character data: thetext Matching element end tag: 604/12/2008

Why Use XML? Represent semi-structured data data that are structured, but don’t fit relational model XML is more flexible than DBs XML is more structured than simple IR You get a massive infrastructure for free 704/12/2008

XML Schemas Schema = syntax definition of XML language Schema language = formal language for expressing XML schemas Examples Document Type Definition XML Schema (W3C) Relevance for XML IR Our job is much easier if we have a (one) schema 804/12/2008

Challenges in XML Retrieval 04/12/2008

Data vs. Text-centric XML Data-centric XML: used for messaging between enterprise applications Mainly a recasting of relational data Numerical and non-text data dominate Mostly stored in the databases Text-centric XML: used for annotating content Rich in text Demands good integration of text retrieval functionality Queries are user information needs. E.g., give me the Section (element) of the document that tells me how to change a brake light 1004/12/2008

Text search in RDB Highly structured text search problems are most efficiently handled by a relational database Information need: find all employees who are involved with invoicing SQL query: select lastname from employees where job_desc like 'invoic%'; satisfies the information need with high precision and recall. 04/12/ idlastnamejob_descsalary… …invoicing…

Structured Retrieval: Data Many structured data sources containing text are best modeled as structured documents rather than relational data. Search over structured documents: structured retrieval. Queries in structured retrieval can be either structured or unstructured (here we assume that the collection consists only of structured documents). Applications of structured retrieval include: digital libraries, patent databases, text with tagged entities (e.g. persons and locations), application output as marked up text. 04/12/200812

Structured Retrieval: Query Queries combine textual criteria with structural criteria. Information need: Articles about sightseeing tours of the Vatican and the Coliseum. Usability of structured queries knowledge of the data structure sightseeing AND (COUNTRY:Vatican OR LANDMARK:Coliseum) sightseeing AND (STATE:Vatican OR BUILDING:Coliseum) syntax of the query language (XQuery) low recall in Boolean model 04/12/ for $a in doc()//au let

Indexing unit: there is no document unit in XML Book? Chapter? How do we compute tf and idf? Global tf/idf over all text context is useless Indexing granularity Structured document retrieval principle. A system should always retrieve the most specific part of a document answering the query. IR XML Challenges: Indexing Units & Term Statistics 1404/12/2008

Partitioning an XML document into non-overlapping indexing units 1504/12/2008

IR XML Challenges: Schemas Schema heterogeneity Many schemas Schemas not known in advance Schemas change Users don’t understand schemas Need to identify similar elements in different schemas Example: name, surname, family name 1604/12/2008

Help user find relevant nodes in schema Author, editor, contributor, “from:”/sender What is the query language you expose to the user? Specific XML query language? No. Forms? Parametric search? A textbox? In general: design layer between XML and user IR XML Challenges: UI 17

Vector Spaces and XML

Vector spaces and XML Vector spaces – tried+tested framework for keyword retrieval Other “bag of words” applications in text: classification, clustering … For text-centric XML retrieval, can we make use of vector space ideas? Challenge: capture the structure of an XML document in the vector space. 19

Vector spaces and XML For instance, distinguish between the following two cases Book Title Author Bill GatesMicrosoft Book Title Author Bill Wulf The Pearly Gates 20

Book Title Author BillMicrosoft Book Title Author WulfPearly Gates The Bill Lexicon terms. Content-rich XML: representation 21

Encoding the Gates differently What are the axes of the vector space? In text retrieval, there would be a single axis for Gates Here we must separate out the two occurrences, under Author and Title Thus, axes must represent not only terms, but something about their position in an XML tree 22

Queries Before addressing this, let us consider the kinds of queries we want to handle Book Title Microsoft Book Title Author Gates Bill 23

Subtrees and structure Consider all subtrees of the document that include at least one lexicon term: Book Title Author BillMicrosoftGates BillMicrosoft Gates Title Microsoft Author Bill Author Gates Book Title Microsoft Bill Book Author Gates e.g. … 24

Structural terms Call each of the resulting (8+, in the previous slide) subtrees a structural term Note that structural terms might occur multiple times in a document Create one axis in the vector space for each distinct structural term Weights based on frequencies for number of occurrences (just as we had tf) All the usual issues with terms (stemming? Case folding?) remain 25

Here the structural terms containing to or be would have more weight than those that don’t Play Act To be or not to be Play Act be Play Act or Play Act not Play Act to Exercise: How many axes are there in this example? Example of tf weighting 26

Structural terms: docs+queries The notion of structural terms is independent of any schema/DTD for the XML documents Well-suited to a heterogeneous collection of XML documents Each document becomes a vector in the space of structural terms A query tree can likewise be factored into structural terms And represented as a vector Allows weighting portions of the query 27

The catch remains This is all very promising, but … How big is this vector space? Can be exponentially large in the size of the document Cannot hope to build such an index And in any case, still fails to answer queries like Book Gates (somewhere underneath) 28

Descendants Gates Author Book Author BillGates vs. Book Author BillGates FirstNameLastName No known DTD. Query seeks Gates under Author. 29

Handling descendants in the vector space Devise a match function that yields a score in [0,1] between structural terms E.g., when the structural terms are paths, measure overlap The greater the overlap, the higher the match score Can adjust match for where the overlap occurs Book Author Bill Book Author Bill LastName Book Bill vs. in 30

How do we use this in retrieval? First enumerate structural terms in the query Measure each for match against the dictionary of structural terms Just like a postings lookup, except not Boolean (does the term exist) Instead, produce a score that says “80% close to this structural term”, etc. Then, retrieve docs with that structural term, compute cosine similarities, etc. 31

Example of a retrieval step ST1 Doc1 (0.7)Doc4 (0.3)Doc9 (0.2) ST = Structural Term ST5 Doc3 (1.0)Doc6 (0.8)Doc9 (0.6) Index Query ST Match =0.63 Now rank the Doc’s by cosine similarity; e.g., Doc9 scores

Closing technicalities But what exactly is a Doc? In a sense, an entire corpus can be viewed as an XML document Corpus Doc1Doc2 Doc3Doc4 33

What are the Doc’s in the index? Anything we are prepared to return as an answer Could be nodes, some of their children … 34

What are queries we can’t handle using vector spaces? Find figures that describe the Corba architecture and the paragraphs that refer to those figures Requires JOIN between 2 tables Retrieve the titles of articles published in the Special Feature section of the journal IEEE Micro Depends on order of sibling nodes.  Requires to appear after a specific node.  Query tree cannot express relations that depend on node ordering. 35 …

Can we do IDF? Yes, but doesn’t make sense to do it corpus-wide Can do it, for instance, within all text under a certain element name say Chapter Yields a tf-idf weight for each lexicon term under an element Issues: how do we propagate contributions to higher level nodes. 36

Example Say Gates has high IDF under the Author element How should it be tf-idf weighted for the Book element? Should we use the idf for Gates in Author or that in Book ? Book Author BillGates 37

INEX: a benchmark for text-centric XML retrieval

INEX Benchmark for the evaluation of XML retrieval Analog of TREC (recall IIR8) Consists of: Set of XML documents Collection of retrieval tasks 39

INEX Each engine indexes docs Engine team converts retrieval tasks into queries In XML query language understood by engine In response, the engine retrieves not docs, but elements within docs Engine ranks retrieved elements

INEX assessment For each query, each retrieved element is human-assessed on two measures: Relevance – how relevant is the retrieved element Coverage – is the retrieved element too specific, too general, or just right  E.g., if the query seeks a definition of the Fast Fourier Transform, do I get the equation (too specific), the chapter containing the definition (too general) or the definition itself These assessments are turned into composite precision/recall measures 41

INEX corpus (Ad Hoc Track) Articles from IEEE Computer Society publications Wikipedia collection 2009 others 42

INEX topics Each topic is an information need, one of two kinds: Content Only (CO) – free text queries Content and Structure (CAS) – explicit structural constraints, e.g., containment conditions. 43

Sample INEX CO topic computational biology computational biology, bioinformatics, genome, genomics, proteomics, sequencing, protein folding Challenges that arise, and approaches being explored, in the interdisciplinary field of computational biology To be relevant, a document/component must either talk in general terms about the opportunities at the intersection of computer science and biology, or describe a particular problem and the ways it is being attacked. 44

INEX assessment Each engine formulates the topic as a query E.g., use the keywords listed in the topic. Engine retrieves one or more elements and ranks them. Human evaluators assign to each retrieved element relevance and coverage scores. 45

Assessments Relevance assessed on a scale from Irrelevant (scoring 0) to Highly Relevant (scoring 3) Coverage assessed on a scale with four levels: No Coverage (N: the query topic does not match anything in the element Too Large (L: The topic is only a minor theme of the element retrieved) Too Small (S: the element is too small to provide the information required) Exact Coverage(E). So every element returned by each engine has ratings from {0,1,2,3} × {N,S,L,E} 46

Combining the relevance/coverage assessments Define scores: 47

The Q-values Scalar measure of goodness of a retrieved elements Can compute Q-values for varying numbers of retrieved elements 10, 20 … etc. Means for comparing engines. 48

From Q-values to … ? INEX provides a method for turning these into precision-recall curves “Standard” issue: only elements returned by some participant engine are assessed Lots more commentary (and proceedings from previous INEX bakeoffs): 49

Resources Chapter 10 of IIR Resources at INEX 50