2 September 2005VLDB Tutorial on XML Full-Text Search XML Full-Text Search: Challenges and Opportunities Jayavel Shanmugasundaram Cornell University Sihem.

Presentation transcript:

2 September 2005, VLDB Tutorial on XML Full-Text Search
XML Full-Text Search: Challenges and Opportunities
Jayavel Shanmugasundaram, Cornell University
Sihem Amer-Yahia, AT&T Labs – Research

Outline
Motivation
Full-Text Search Languages
Scoring
Query Processing
Open Issues

Motivation
XML can represent a mix of structured and text information.
XML applications: digital libraries, content management.
XML repositories: IEEE INEX collection, LexisNexis, the Library of Congress collection.

XML in Library of Congress
[Example bill document; element tags not shown]
109th CONGRESS, 1st Session
H. R
IN THE HOUSE OF REPRESENTATIVES
May 26, 2005
Mr. Tierney (for himself, Ms. McCollum of Minnesota, Mr. George Miller of California) introduced the following bill; which was referred to the Committee on Education and the Workforce …

THOMAS: Library of Congress

INEX Data
[Example article document; element tags not shown]
K /K0271s-2004
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
/04/$20.00 © 2004 IEEE, Published by the IEEE Computer Society
Vol. 16, No. 2, FEBRUARY 2004, pp
A Graph-Based Approach for Timing Analysis and Refinement of OPS5 Knowledge-Based Systems
Albert Mo Kim Cheng, Senior Member, IEEE; Hsiu-yen Tsai
Abstract: This paper examines the problem of predicting the timing behavior of knowledge-based systems for real-…

Example INEX Query
//article[about(.//abs, "data mining")]//sec[about(., "frequent itemsets")]
Sections about frequent itemsets from articles with an abstract about data mining.
To be relevant, a component has to be a section about "frequent itemsets". For example, it could be about algorithms for finding frequent itemsets, or uses of frequent itemsets to generate rules. Also, the article must have an abstract about "data mining".
I need this information for a paper that I am writing. It is a survey of different algorithms for finding frequent itemsets. The paper will also have a section on why we would want to find frequent itemsets.

Challenges in XML FT Search
Searching over Semi-Structured Data
  –Users may specify a search context and return context.
Expressive Power and Extensibility
  –Users should be able to express complex full-text searches and combine them with structural searches.
Scores and Ranking
  –Users may specify a scoring condition, possibly over both full-text and structured predicates, and obtain top-k results based on query relevance scores.
  –The language should allow for an efficient implementation.

XML FT Search Definition
Context expression: XML elements searched:
  –pre-defined XML nodes.
  –XPath/XQuery queries.
Return expression: XML fragments returned:
  –pre-defined meaningful XML fragments.
  –XPath/XQuery to build answers.
Search expression: FT search conditions:
  –Boolean keyword search.
  –proximity distance, scoping, thesaurus, stop words, stemming.
Score expression:
  –system-defined scoring function.
  –user-defined scoring function.
  –query-dependent keyword weights.


Four Classes of Languages
Keyword search (INEX Content-Only Queries)
  –"book xml"
Tag + Keyword search
  –book: xml
Path Expression + Keyword search
  –/book[./title about "xml db"]
XQuery + Complex full-text search
  –for $b in /book let score $s := $b ftcontains "xml" && "db" distance 5

Outline
Motivation
Full-Text Search Languages
  –Simple Keyword Search
  –Tags + Keyword Search
  –Path Expressions + Keyword Search
  –XQuery + Complex Full-Text Search
Scoring
Query Processing
Open Issues

XRank [Guo et al., SIGMOD 2003]
[Figure: example XML document. Visible text: "XML and Information Retrieval: A SIGIR 2000 Workshop"; "David Carmel, Yoelle Maarek, Aya Soffer"; "XQL and Proximal Nodes"; "Ricardo Baeza-Yates"; "Gonzalo Navarro"; "We consider the recently proposed language …"; "Searching on structured text is becoming more important with XML …"; "The XQL language …"]

XRank [Guo et al., SIGMOD 2003]
[Figure: the same example XML document as on the previous slide]

XIRQL [Fuhr & Großjohann, SIGIR 2001]
[Figure: the same example XML document, with one element labeled "Index Node"]

Similar Notion of Results
Nearest Concept Queries
  –[Schmidt et al., ICDE 2002]
XKSearch
  –[Xu & Papakonstantinou, SIGMOD 2005]


XSearch [Cohen et al., VLDB 2003]
[Figure: the same example XML document, now also showing the text "XML Indexing"; one candidate answer is labeled: Not a "meaningful" result]


XPath [W3C 2005]
fn:contains($e, string) returns true iff $e contains string
//section[fn:contains(./title, "XML Indexing")]
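The behaviour of fn:contains on this slide can be imitated outside an XPath engine. The following is a minimal Python sketch, not an XPath implementation: the sample document and the helper name sections_with_title_containing are made up for illustration. It shows that the predicate is a plain, case-sensitive substring test with no notion of score.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the tutorial.
DOC = """
<workshop>
  <section><title>XML Indexing</title><p>Inverted lists over elements.</p></section>
  <section><title>Ranking</title><p>Scores for XML Indexing results.</p></section>
</workshop>
"""

def sections_with_title_containing(root, needle):
    """Imitate //section[fn:contains(./title, needle)] with a substring test."""
    hits = []
    for section in root.iter("section"):
        title = section.find("title")
        # fn:contains works on the string value of the node: all of its text.
        text = "".join(title.itertext()) if title is not None else ""
        if needle in text:
            hits.append(section)
    return hits

root = ET.fromstring(DOC)
matches = sections_with_title_containing(root, "XML Indexing")
print([m.find("title").text for m in matches])
```

Only the first section matches, because the phrase occurs in its title; the second section mentions the phrase only in its body. A lower-case query such as "xml indexing" matches nothing, which is one reason the later slides move to dedicated full-text predicates.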

XIRQL [Fuhr & Großjohann, SIGIR 2001]
Weighted extension to XQL (precursor to XPath)
//section[0.6 · .//* $cw$ "XQL" · .//section $cw$ "syntax"]

XXL [Theobald & Weikum, EDBT 2002]
Introduces similarity operator ~
Select Z
From
Where zoos.#.zoo As Z
  and Z.animals.(animal)?.specimen as A
  and A.species ~ "lion"
  and A.birthplace.#.country as B
  and A.region ~ B.content

NEXI [Trotman & Sigurbjörnsson, INEX 2004]
Narrowed Extended XPath I
INEX Content-and-Structure (CAS) Queries
//article[about(.//title, apple) and about(.//sec, computer)]


Schema-Free XQuery [Li, Yu, Jagadish, VLDB 2003]
Meaningful least common ancestor (mlcas)
for $a in doc("bib.xml")//author,
    $b in doc("bib.xml")//title,
    $c in doc("bib.xml")//year
where $a/text() = "Mary" and exists mlcas($a,$b,$c)
return {$b,$c}

XQuery Full-Text [W3C 2005]
Two new XQuery constructs
1) FTContainsExpr
  –Expresses "Boolean" full-text search predicates
  –Seamlessly composes with other XQuery expressions
2) FTScoreClause
  –Extension to FLWOR expression
  –Can score FTContainsExpr and other expressions

FTContainsExpr
//book ftcontains "Usability" && "testing" distance 5
//book[./content ftcontains "Usability" with stems]/title
//book ftcontains /article[author="Dawkins"]/title

FTScore Clause
FOR $v [SCORE $s]? IN [FUZZY] Expr
LET …
WHERE …
ORDER BY …
RETURN
In any order
Example
FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing"]
ORDER BY $s
RETURN $b

FTScore Clause
FOR $v [SCORE $s]? IN [FUZZY] Expr
LET …
WHERE …
ORDER BY …
RETURN
In any order
Example
FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing" and ./price < 10.00]
ORDER BY $s
RETURN $b

FTScore Clause
FOR $v [SCORE $s]? IN [FUZZY] Expr
LET …
WHERE …
ORDER BY …
RETURN
In any order
Example
FOR $b SCORE $s in FUZZY /pub/book[. ftcontains "Usability" && "testing"]
ORDER BY $s
RETURN $b

XQuery Full-Text Evolution
Quark Full-Text Language (Cornell)
TeXQuery (Cornell, AT&T Labs)
IBM, Microsoft, Oracle proposals
XQuery Full-Text (Second Draft)


Full-Text Scoring
Score value should reflect relevance of answer to user query. Higher scores imply a higher degree of relevance.
Queries return document fragments. Granularity of returned results affects scoring.
For queries containing conditions on structure, structural conditions may affect scoring.
Existing proposals extend common scoring methods: probabilistic or vector-based similarity.

Granularity of Results
Keyword queries
  –compute possibly different scores for LCAs.
Tag + Keyword queries
  –compute scores based on tags and keywords.
Path Expression + Keyword queries
  –compute scores based on paths and keywords.
XQuery + Complex full-text queries
  –compute scores for (newly constructed) XML fragments satisfying XQuery (structural, full-text and scalar conditions).

Outline
Motivation
Full-Text Search Languages
Scoring
  –Simple Keyword Search
  –Tags + Keyword Search
  –Path Expressions + Keyword Search
  –XQuery + Complex Full-Text Search
Query Processing
Open Issues

Granularity of Results
Document as hierarchical structure of elements as opposed to flat document.
  –XXL [Theobald & Weikum, EDBT 2002]
  –XIRQL [Fuhr & Großjohann, SIGIR 2001]
  –XRANK [Guo et al., SIGMOD 2003]
Propagate keyword weights along document structure.

XML Data Model
[Figure: the example workshop document drawn as a graph of elements (date "28 July …", "XML and …", "David Carmel …", "XQL and …", "Ricardo …"). Legend: containment edge, hyperlink edge]

XXL [Theobald & Weikum, EDBT 2002]
Compute similar terms with relevance score r1 using an ontology.
Compute tf*idf of each term for a given element content with relevance score r2.
Relevance of an element content for a term is r1*r2.
r1 and r2 are computed as a weighted distance in an ontology graph.
Probabilities of conjunctions multiplied (independence assumption) along elements of same path to compute path score.

Probabilistic Scoring
XIRQL [Fuhr & Großjohann, SIGIR 2001]
Extension of XPath.
Weighting and ranking:
  –weighting of query terms: P(wsum((0.6,a), (0.4,b))) = 0.6 · P(a) + 0.4 · P(b)
  –probabilistic interpretation of Boolean connectors: P(a && b) = P(a) · P(b)

XIRQL Example
Query:
  –"Search for an artist named Ulbrich, living in Frankfurt, Germany about 100 years ago"
Data:
  –"Ernst Olbrich, Darmstadt, 1899"
Weights and ranking:
  –P(Olbrich p Ulbrich) = 0.8 (phonetic similarity)
  –P(1899 n 1903) = 0.9 (numeric similarity)
  –P(Darmstadt g Frankfurt) = 0.7 (geographic distance)
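The two XIRQL combination rules (weighted sum of query terms, and product for a conjunction) are small enough to try out directly. The sketch below is illustrative only: the function names wsum and p_and are not XIRQL syntax, the term probabilities 0.5 and 0.9 are invented, and combining the three vague predicates of the example conjunctively is an assumption, since the slide lists the individual probabilities but not the final score.

```python
def wsum(weighted_probs):
    """Weighted sum: P(wsum((w1,a),(w2,b))) = w1*P(a) + w2*P(b)."""
    return sum(weight * prob for weight, prob in weighted_probs)

def p_and(*probs):
    """Conjunction under the independence assumption: P(a && b) = P(a)*P(b)."""
    result = 1.0
    for prob in probs:
        result *= prob
    return result

# Hypothetical term probabilities P(a) = 0.5 and P(b) = 0.9.
weighted = wsum([(0.6, 0.5), (0.4, 0.9)])

# Probabilities from the XIRQL example slide, combined as a conjunction
# (an assumption made for this sketch).
olbrich_score = p_and(0.8, 0.9, 0.7)

print(round(weighted, 3), round(olbrich_score, 3))
```

Under these assumptions the record "Ernst Olbrich, Darmstadt, 1899" would receive 0.8 · 0.9 · 0.7 = 0.504, so it is retrieved with a reduced score rather than rejected outright as a Boolean match would be.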

PageRank [Brin & Page 1998]
[Figure: node w with outgoing hyperlink edges, each labeled d/3]
d: Probability of following hyperlink
1-d: Probability of random jump

ElemRank [Guo et al., SIGMOD 2003]
[Figure: node w with hyperlink edges labeled d1/3, containment edges to subelements labeled d2/2, and an edge to its parent labeled d3]
d1: Probability of following hyperlink
d2: Probability of visiting a subelement
d3: Probability of visiting parent
1-d1-d2-d3: Probability of random jump
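The random-surfer reading of ElemRank can be turned into a short power iteration. This is a simplified sketch built from the probabilities named on the slide, not the formula from the XRank paper: the random jump is spread uniformly over all elements, elements lacking a particular edge type simply lose that share of probability, and the default values of d1, d2 and d3 are placeholders.

```python
def elem_rank(nodes, hyperlinks, children, d1=0.35, d2=0.25, d3=0.25, iters=50):
    """Power iteration over hyperlink and containment edges.

    nodes:      list of element ids
    hyperlinks: dict, element -> list of hyperlink targets
    children:   dict, element -> list of subelements (containment edges)
    """
    n = len(nodes)
    parent = {c: p for p, kids in children.items() for c in kids}
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Random jump with probability 1 - d1 - d2 - d3 (uniform: an assumption).
        nxt = {v: (1.0 - d1 - d2 - d3) / n for v in nodes}
        for u in nodes:
            targets = hyperlinks.get(u, [])
            for v in targets:            # follow one of the hyperlinks
                nxt[v] += d1 * rank[u] / len(targets)
            kids = children.get(u, [])
            for v in kids:               # visit one of the subelements
                nxt[v] += d2 * rank[u] / len(kids)
            if u in parent:              # visit the parent element
                nxt[parent[u]] += d3 * rank[u]
        rank = nxt
    return rank

# With d2 = d3 = 0 the iteration reduces to PageRank over hyperlinks only.
cycle = elem_rank(["a", "b", "c"], {"a": ["b"], "b": ["c"], "c": ["a"]}, {},
                  d1=0.85, d2=0.0, d3=0.0)
tree = elem_rank(["workshop", "paper1", "paper2"], {},
                 {"workshop": ["paper1", "paper2"]})
print(cycle, tree)
```

In the small tree, the workshop element collects rank from both papers through the parent edges, which illustrates how containment lets importance flow upward as well as downward.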


XSearch [Cohen et al., VLDB 2003]
tf*ilf to compute weight of keyword for a leaf element.
A vector is associated with each non-leaf element.
sim(Q,N): sum of the cosine distances between the vectors associated with nodes in N and vectors associated with terms matched in Q.


Vector-based Scoring
JuruXML [Mass et al., INEX 2002]
Transform query into (term,path) conditions:
  article/bm/bib/bibl/bb[about(., hypercube mesh torus nonnumerical database)]
(term,path)-pairs:
  hypercube, article/bm/bib/bibl/bb
  mesh, article/bm/bib/bibl/bb
  torus, article/bm/bib/bibl/bb
  nonnumerical, article/bm/bib/bibl/bb
  database, article/bm/bib/bibl/bb
Modified cosine similarity as retrieval function for vague matching of path conditions.

JuruXML Vague Path Matching
Modified vector-based cosine similarity
Example of length normalization:
  cr(article/bibl, article/bm/bib/bibl/bb) = 3/6 = 0.5
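The length-normalization example can be reproduced with a few lines of Python. The slide gives only the value 3/6, so the formula used here is an assumption chosen to match it: cr(q, p) = (1 + |q|) / (1 + |p|) when the steps of the query path q occur in order within the data path p, and 0 otherwise. The function names are illustrative.

```python
def is_subsequence(query_steps, data_steps):
    """True if query_steps occur in data_steps in the same order (gaps allowed)."""
    remaining = iter(data_steps)
    # 'step in remaining' consumes the iterator up to the first match.
    return all(step in remaining for step in query_steps)

def context_resemblance(query_path, data_path):
    """Assumed form of cr: (1 + |q|) / (1 + |p|) for a subsequence match, else 0."""
    q = query_path.split("/")
    p = data_path.split("/")
    if not is_subsequence(q, p):
        return 0.0
    return (1 + len(q)) / (1 + len(p))

print(context_resemblance("article/bibl", "article/bm/bib/bibl/bb"))  # 0.5
```

The query path has 2 steps and the data path has 5, giving (1 + 2) / (1 + 5) = 0.5; an exact path match scores 1.0, and a query path whose steps appear out of order scores 0.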

Query Relaxation on Structure
Schlieder, EDBT 2002
Delobel & Rousset, 2002
Amer-Yahia et al., VLDB 2005

XML Query Relaxation [Amer-Yahia et al., EDBT 2002]
FlexPath [Amer-Yahia et al., SIGMOD 2004]
Tree pattern relaxations:
  –Leaf node deletion
  –Edge generalization
  –Subtree promotion
[Figure: a query tree pattern (book, edition "paperback", info, author "Dickens", with an optional "edition?" node) shown next to data trees such as book/info/author "C. Dickens" and book with edition (paperback) and info/author "Charles Dickens". Labels: Query, Data]

Adaptation of tf.idf to XML
Whirlpool [Marian et al., ICDE 2005]

Information Retrieval | XML
Document collection | XML document
Document | XML node (result is a subtree rooted at a returned node with a given tag and satisfying structural predicates in the query)
Keyword(s) | Tree pattern
idf (inverse document frequency) is a function of the fraction of documents that contain the keyword(s) | idf is a function of the fraction of returned nodes that match the query tree pattern
tf (term frequency) is a function of the number of occurrences of the keyword in the document | tf is a function of the number of ways the query tree pattern matches the returned node

A Family of XML Scoring Methods [Amer-Yahia et al., VLDB 2005]
Twig scoring
  –High quality
  –Expensive computation
Path scoring
Binary scoring
  –Low quality
  –Fast computation
[Figure: the query tree over book, info, edition (paperback) and author (Dickens), shown whole and decomposed into smaller patterns that are combined with "+"]


XIRQL + Relaxation
XIRQL proposes vague predicates, but it is not clear how to combine them with all of XQuery.
It is an open issue how to relax all of XQuery, including structured and scalar predicates.


Outline
Motivation
Full-Text Search Languages
Scoring
Query Processing
  –Simple Keyword Search
  –Tags + Keyword Search
  –Path Expressions + Keyword Search
  –XQuery + Complex Full-Text Search
Open Issues

Main Issue
Given: Query keywords
Compute: Least Common Ancestors (LCAs) that contain query keywords, in ranked order

Naïve Method
Naïve inverted lists:
  Ricardo: 1; 5; 6; 8
  XQL: 1; 5; 6; 7
Problems:
  1. Space Overhead
  2. Spurious Results
Main issue: Decouples representation of ancestors and descendants
[Figure: the example workshop document tree with numbered elements]

Dewey Encoding of IDs [1850s]
[Figure: the example workshop document tree with each element labeled by a Dewey ID, for example 0.0 for the date element]
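Dewey IDs are easy to generate with a tree walk: each element's ID is its parent's ID followed by its position among its siblings. The sketch below is a minimal illustration. The sample document is a made-up fragment loosely modeled on the tutorial's workshop example, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = (
    "<workshop><date>28 July 2000</date><title>XML and IR</title>"
    "<proceedings><paper><title>XQL and Proximal Nodes</title>"
    "<author>Ricardo Baeza-Yates</author></paper></proceedings></workshop>"
)

def assign_dewey_ids(root):
    """Map each element's Dewey ID (a tuple of child positions) to its tag."""
    ids = {}

    def visit(elem, dewey):
        ids[dewey] = elem.tag
        for position, child in enumerate(elem):
            visit(child, dewey + (position,))

    visit(root, (0,))  # the root element gets ID 0
    return ids

def is_ancestor(a, b):
    """a is a proper ancestor of b exactly when a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a

def fmt(dewey):
    return ".".join(str(part) for part in dewey)

ids = assign_dewey_ids(ET.fromstring(DOC))
for dewey in sorted(ids):
    print(fmt(dewey), ids[dewey])
```

Because an ancestor's ID is a prefix of every descendant's ID, the ancestor-descendant relationship can be tested from the IDs alone, which is what makes this encoding useful for the inverted lists on the following slides.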

XRank: Dewey Inverted List (DIL)
[Figure: inverted lists for the keywords "XQL" and "Ricardo"; each entry holds a Dewey ID, a score and a position list, and each list is sorted by Dewey ID]
Store IDs of elements that directly contain keyword
  –Avoids space overhead

DIL: Query Processing
Merge query keyword inverted lists in Dewey ID order
  –Entries with common prefixes are processed together
Compute longest common prefix of Dewey IDs during the merge
  –Longest common prefix ensures most specific results
  –Also suppresses spurious results
Keep top-m results seen so far in output heap
  –Calculate rank using two-dimensional proximity metric
  –Output contents of output heap after scanning inverted lists
Algorithm works in a single scan over inverted lists
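The role of the longest common prefix can be seen in a much simplified form. The sketch below is not the single-scan DIL merge: it ignores scores, positions and the output heap, and it computes results by intersecting ancestor sets instead of merging sorted lists. It only illustrates why common prefixes of Dewey IDs identify the most specific elements that contain all keywords. The posting lists are invented.

```python
def common_prefix(a, b):
    """Longest common prefix of two Dewey IDs given as tuples."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]

def most_specific_results(postings):
    """postings: dict keyword -> list of Dewey IDs of elements that directly
    contain the keyword. Returns the deepest elements containing all keywords."""
    if not postings:
        return []
    ancestor_sets = []
    for dewey_ids in postings.values():
        ancestors = set()
        for dewey in dewey_ids:
            # Every prefix of a Dewey ID names an ancestor-or-self element.
            for k in range(1, len(dewey) + 1):
                ancestors.add(dewey[:k])
        ancestor_sets.append(ancestors)
    candidates = set.intersection(*ancestor_sets)
    # Drop candidates that have a more specific candidate below them.
    return sorted(
        c for c in candidates
        if not any(o != c and o[:len(c)] == c for o in candidates)
    )

# Hypothetical posting lists.
postings = {
    "XQL": [(0, 1, 0, 0), (0, 1, 1, 2)],
    "Ricardo": [(0, 1, 0, 1)],
}
print(common_prefix((0, 1, 0, 0), (0, 1, 0, 1)))  # (0, 1, 0)
print(most_specific_results(postings))            # [(0, 1, 0)]
```

Element 0.1.0 is reported and its ancestors 0.1 and 0 are suppressed, even though they also contain both keywords. Processing the lists in Dewey ID order lets a real implementation find these prefixes in one pass, because entries that share a prefix are adjacent.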

XRank: Ranked Dewey Inverted List (RDIL)
[Figure: inverted list for the keyword "XQL" (and other keywords), sorted by score, with a B+-tree on Dewey ID]

RDIL: Algorithm
An element may be ranked highly in one list and low in another list
  –B+-tree helps search for low ranked element
When to stop scanning inverted lists?
  –Based on Threshold Algorithm [Fagin et al., 2002], which periodically calculates a threshold
  –Can stop if we have sufficient results above the threshold
  –Extension to most specific results
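The stopping rule borrowed from the Threshold Algorithm can be sketched on flat objects, leaving out the XML-specific extension to most specific results. In the sketch below, each list is sorted by descending score, a dictionary lookup stands in for the B+-tree on Dewey ID, scores are combined by summation, and an object missing from a list contributes 0. All of these choices and the sample data are assumptions for illustration.

```python
import heapq

def threshold_top_k(lists, k):
    """lists: one [(obj, score), ...] per keyword, sorted by descending score."""
    lookup = [dict(lst) for lst in lists]   # random access (B+-tree stand-in)
    seen = {}                               # obj -> total score
    depth = 0
    max_len = max(len(lst) for lst in lists)
    while depth < max_len:
        last_scores = []
        for lst in lists:
            if depth < len(lst):
                obj, score = lst[depth]     # sorted access
                last_scores.append(score)
                if obj not in seen:
                    seen[obj] = sum(table.get(obj, 0.0) for table in lookup)
            else:
                last_scores.append(0.0)     # list exhausted
        # No unseen object can score higher than the sum of the last scores.
        threshold = sum(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda item: item[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
        depth += 1
    return heapq.nlargest(k, seen.items(), key=lambda item: item[1])

# Hypothetical score-sorted lists for two keywords.
xql = [("x", 0.9), ("y", 0.8), ("z", 0.1)]
ricardo = [("y", 0.7), ("z", 0.6), ("x", 0.2)]
best = threshold_top_k([xql, ricardo], k=1)
print(best)
```

For k = 1 the scan stops after two rows: the best total seen so far (element y, about 1.5) is at least the threshold 0.8 + 0.6 = 1.4, so the third row of each list is never read.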

RDIL: Query Processing
[Figure: score-sorted inverted lists for "Ricardo" and "XQL", each with a B+-tree on Dewey ID, feeding a temp heap and an output heap. After entry P is read from one list: threshold = Score(P) + Max-Score. After entry R is read from the other list: threshold = Score(P) + Score(R). The heaps hold Rank(9.0.4)]

ID Order vs. Rank Order
Approaches that combine benefits
Long ID inverted list, short score inverted list
  –HDIL (Guo et al., SIGMOD 2003)
Chunk inverted list based on score, organize by ID within chunk
  –FlexPath (Amer-Yahia et al., SIGMOD 2004)
  –SVR (Guo et al., ICDE 2005)


XSearch Technique
Given: An interconnection relationship R between nodes (semantic relationship)
  –R is reflexive and symmetric
Node interconnection index
  –Given two nodes n and n' in a document d, find if (n,n') are in R*
Use dynamic programming to compute closure
  –Online vs. offline
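One standard dynamic program for a closure is Warshall's algorithm. The sketch below applies it to a small, invented relation. It reads R* as the transitive closure of R and computes it offline for all node pairs; this is an assumption, since the slide does not define R* or say which dynamic program XSearch uses.

```python
def interconnection_closure(nodes, related_pairs):
    """Warshall-style closure of a reflexive, symmetric relation R.

    nodes:         list of node ids
    related_pairs: iterable of (n, m) pairs in R (symmetry is added here)
    Returns a dict of dicts: reach[n][m] is True when (n, m) is in the closure.
    """
    reach = {a: {b: a == b for b in nodes} for a in nodes}  # reflexive
    for a, b in related_pairs:
        reach[a][b] = True
        reach[b][a] = True                                  # symmetric
    # Dynamic programming: allow one more intermediate node k per round.
    for k in nodes:
        for i in nodes:
            if reach[i][k]:
                for j in nodes:
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

# Hypothetical relation over four nodes.
reach = interconnection_closure(
    ["title", "author", "year", "editor"],
    [("title", "author"), ("author", "year")],
)
print(reach["title"]["year"], reach["title"]["editor"])  # True False
```

Precomputing the full table corresponds to the offline option and costs time cubic in the number of nodes; the online option would compute only the entries a query needs.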


XXL Indexing
Element Path Index (EPI)
  –Evaluates simple path expressions
Element Content Index (ECI)
  –Traditional inverted list (but replicates nested elements)
Ontology Index (OI)
  –Lookup similar concepts (for evaluating ~e)
  –Returned in ranked order

Myaeng et al. [SIGIR 1994]
[Figure: inverted list for the keyword "XQL"; each entry holds a document ID, an element ID and an element tag, together with per-tag probabilities (tags shown: act, play, plays)]

Integrating Structure and IL [Kaushik et al., SIGMOD 2004]
[Figure: a structure index over the elements book, title, edition, info and author, linked to an inverted list for the keyword "XQL" whose entries hold document ID, start ID, end ID, depth, index ID and score, with a B+ tree]


Scoring Functions Critical for Top-k Query Processing
Top-k answer quality depends on scoring function.
Efficient top-k query processing requires scoring function to be:
  –Monotone.
  –Fast to compute.

Structural Join Relaxation
//book[./info[./author ftcontains "Dickens"][./edition ftcontains "paperback"]]
Exact query predicates:
  pc(book,info)
  pc(info,author)
  pc(info,edition)
  contains(author,"Dickens")
  contains(edition,"paperback")
Relaxed query predicates:
  pc(book,info) or ad(book,info)
  pc(info,author)
  pc(info,edition) or ad(book,edition)
  contains(author,"Dickens")
  contains(edition,"paperback")
[Figure: the tree pattern over book, info, author (Dickens) and edition (paperback), before and after relaxation]

Quark/GalaTex Architecture
[Figure: a full-text query goes through the XQFT parser and is translated into an equivalent XQuery query, which runs on the Quark/Galax XQuery engine. Full-text primitives (FTWord, FTWindow, FTTimes etc.) are evaluated through an API on positions over inverted lists stored as .xml. The inverted lists are generated from the .xml text documents in a preprocessing step]


System Architecture
[Figure: an integration layer on top of two separate components, an XQuery engine and an IR engine]

System Architecture
[Figure: a single combined XQuery + IR engine]
Quark/GalaTex use this architecture

Structural Relaxation
FOR $b SCORE $s in FUZZY /pub/book[. ftcontains "Usability" with stems]
ORDER BY $s
RETURN $b

Search Over Views
[Figure: Data Source 1 and Data Source 2 combined into an Integrated View]

Other Open Issues
Extensive experimental evaluation of scoring functions and ranking algorithms for XML (INEX).
Joint scoring on full-text and scalar predicates.
Score-aware algebra for XML for the joint optimization of queries on both structure and text.

Backup Slides

Why not use SQL/MM (or variant)?
Key difference: No strict demarcation between structured and text data in XML
  –Can issue structured and text queries over same data
    Find books with year > 1995
    Find books containing keyword "1998"
  –Can embed structured queries in text queries
    Find books that contain the keywords that occur in the title of Richard Dawkins' books
Other important differences
  –XML/XQuery data model
  –Composability of full-text primitives

Scoring Function (monotonicity)
Required properties:
  –Exact matches should be scored higher than relaxed matches (idf)
  –Returned elements with several matches should be ranked higher than those with fewer matches (tf)
How to combine tf and idf?
  –tf.idf, as used by IR, violates above properties
  –Ranking based on idf, then breaking ties using tf satisfies the properties
[Figure, first pair: (a) book with title (Great Expectations), edition (paperback) and info/author (Dickens); (b) book with info/author (Dickens). score(a) >= score(b)]
[Figure, second pair: (a) book with title (Great Expectations), edition (paperback) and info; (b) the same book with edition (paperback) matched twice, once more under info. score(a) <= score(b)]
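The difference between the two ways of combining tf and idf shows up in a two-answer example. The numbers below are invented for illustration: an exact match with a high idf and a single occurrence, and a relaxed match with a lower idf and three occurrences. Ranking by idf first and using tf only to break ties is an ordinary lexicographic sort.

```python
# Hypothetical answers: idf reflects how exact the matched pattern is,
# tf counts the number of ways the pattern matches the returned node.
answers = [
    {"id": "relaxed", "idf": 1.0, "tf": 3},
    {"id": "exact", "idf": 2.0, "tf": 1},
]

def rank_by_product(items):
    """Classic tf.idf: a single score obtained by multiplying the two."""
    return sorted(items, key=lambda a: a["tf"] * a["idf"], reverse=True)

def rank_idf_then_tf(items):
    """Lexicographic ranking: idf decides, tf only breaks ties."""
    return sorted(items, key=lambda a: (a["idf"], a["tf"]), reverse=True)

print([a["id"] for a in rank_by_product(answers)])   # ['relaxed', 'exact']
print([a["id"] for a in rank_idf_then_tf(answers)])  # ['exact', 'relaxed']
```

With the product, the relaxed answer (3 × 1.0 = 3.0) overtakes the exact one (1 × 2.0 = 2.0), which breaks the first required property. The lexicographic order keeps the exact match on top regardless of how many times the relaxed pattern matches.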