Text Search over XML Documents
Jayavel Shanmugasundaram
Cornell University

The HTML World
XML and Information Retrieval: A SIGIR 2000 Workshop
The workshop was held on 28 July 2000. The editors of the workshop were David Carmel, Yoelle Maarek, and Aya Soffer.
XQL and Proximal Nodes
The paper was authored by Ricardo Baeza-Yates and Gonzalo Navarro. The abstract of this paper is given below.
We consider the recently proposed language …
The paper references the following papers: … …

The XML World
XML and Information Retrieval: A SIGIR 2000 Workshop
David Carmel, Yoelle Maarek, Aya Soffer
XQL and Proximal Nodes
Ricardo Baeza-Yates
Gonzalo Navarro
We consider the recently proposed language …
Searching on structured text is becoming more important with XML …
The XQL language … … …

Key Aspect of XML
Captures text and structure
Applications
– Digital libraries
– Content management
Many such XML repositories already available
– IEEE INEX collection
– Library of Congress documents
– Shakespeare’s plays
– SIGMOD, DBLP, …

Searching XML Repositories
Confluence of Information Retrieval (text) and Database (structure) techniques
A spectrum of possibilities
– “Pure” Keyword Search
– Keyword Search in Context
– Full-Text + DB Queries

Outline Pure Keyword Search Keyword Search in Context Full-Text + DB Queries Related Work and Conclusion

Keyword Search over HTML Query Keywords Ranked Results Hyperlinked HTML Documents

Keyword Search over XML [Guo, Shao, Botev, Shanmugasundaram, SIGMOD 2003] Query Keywords Ranked Results Mix of Hyperlinked XML and HTML Documents

Outline Pure Keyword Search –Design Principles –Indexing and Query Processing Keyword Search in Context Full-Text + DB Queries Conclusion

XML Document
XML and Information Retrieval: A SIGIR 2000 Workshop
David Carmel, Yoelle Maarek, Aya Soffer
XQL and Proximal Nodes
Ricardo Baeza-Yates
Gonzalo Navarro
We consider the recently proposed language …
Searching on structured text is becoming more important with XML …
The XQL language … … …

Design Principles
1) Return the most specific element containing the query keywords
2) Ranking has to be done at the granularity of elements
3) Generalize HTML keyword search

Outline Pure Keyword Search –Design Principles –Indexing and Query Processing Keyword Search in Context Full-Text + DB Queries Conclusion

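System Architecture
[Figure: XML/HTML documents feed the ElemRank Computation, which produces XML elements with ElemRanks; these are stored in the Hybrid Dewey Inverted List. A keyword query goes to the Query Evaluator, which accesses the inverted list (data access) and returns ranked results.]
Query Evaluator: compute top-k query results as per the definition of ranking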

Naïve Method
Naïve inverted lists:
  Ricardo: 1; 5; 6; 8
  XQL: 1; 5; 6; 7
Problems:
1. Space overhead
2. Spurious results
Main issue: decouples representation of ancestors and descendants
[Figure: the example document tree with its elements numbered 1 to 8 (“28 July …”, “XML and …”, “David Carmel …”, …, “XQL and …”, “Ricardo …”).]

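Dewey IDs [Dewey Decimal Classification, 1876]
Each element's ID is its parent's ID followed by the element's position among its siblings, so an ID encodes the full path from the root.
[Figure: the example document tree labeled with Dewey IDs: root 0; its children 0.0 (date, “28 July …”), 0.1 (“XML and …”), 0.2 (“David Carmel …”), 0.3; below 0.3 the elements 0.3.0, 0.3.1, …; below 0.3.0 the elements 0.3.0.0 (“XQL and …”), 0.3.0.1 (“Ricardo …”), …]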
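As a minimal sketch of the numbering scheme, the Python below assigns Dewey IDs to the elements of a small document and tests ancestry by prefix comparison. The tag names in `SAMPLE` and the helper names `assign_dewey_ids` and `is_ancestor` are assumptions made for illustration; the slides do not show the markup, and here the date is an attribute rather than a child element, so the IDs differ slightly from the figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the workshop example; tag names are assumed.
SAMPLE = """<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <editors>David Carmel, Yoelle Maarek, Aya Soffer</editors>
  <proceedings>
    <paper>
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
    </paper>
  </proceedings>
</workshop>"""


def assign_dewey_ids(root, doc_id=0):
    """Map each Dewey ID (a tuple of ints) to its element."""
    ids = {}

    def visit(elem, dewey):
        ids[dewey] = elem
        # A child's ID is the parent's ID plus the child's sibling position.
        for position, child in enumerate(elem):
            visit(child, dewey + (position,))

    visit(root, (doc_id,))
    return ids


def is_ancestor(ancestor, descendant):
    """True if `ancestor` is a proper prefix of `descendant`."""
    return (len(ancestor) < len(descendant)
            and descendant[:len(ancestor)] == ancestor)


dewey_ids = assign_dewey_ids(ET.fromstring(SAMPLE))
for dewey, elem in sorted(dewey_ids.items()):
    print(".".join(map(str, dewey)), elem.tag)
```

Because an ID carries the whole root-to-element path, ancestor and descendant checks need no access to the document itself.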

Dewey Inverted List (DIL)
Each keyword's list is sorted by Dewey ID:

  Keyword   Dewey ID    ElemRank  Position List
  XQL       5.0.3.0.0   85        32
            8.0.3.8.3   38        89, 91
            …
  Ricardo   5.0.3.0.1   82        38
            8.2.1.4.2   99        52
            …

Store IDs of elements that directly contain the keyword
– Avoids space overhead

DIL: Query Processing
Merge query keyword inverted lists in Dewey ID order
– Entries with common prefixes are processed together
Compute the longest common prefix of Dewey IDs during the merge
– The longest common prefix ensures most specific results
– It also suppresses spurious results
Keep the top-k results seen so far in an output heap
– Output the contents of the output heap after scanning the inverted lists
The algorithm works in a single scan over the inverted lists
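The role of the longest common prefix can be sketched in a few lines of Python. This is a deliberately simplified, set-based illustration, not the single-scan merge described on the slide: it computes the deepest common ancestor of two Dewey IDs, then finds the most specific elements that contain every query keyword. The function names and the sample IDs are assumptions for illustration, and ranking is left out.

```python
def parse(dewey):
    """Turn '5.0.3.0.0' into the tuple (5, 0, 3, 0, 0)."""
    return tuple(int(part) for part in dewey.split("."))


def longest_common_prefix(a, b):
    """Dewey ID of the deepest element that is an ancestor-or-self of both."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return a[:length]


def most_specific_results(inverted_lists):
    """Elements containing all keywords that have no descendant doing so.

    `inverted_lists` holds one list per keyword, each with the Dewey IDs of
    the elements that directly contain that keyword.
    """
    containing = []
    for ids in inverted_lists:
        # Every prefix of an ID names an element that contains the keyword.
        prefixes = {d[:i] for d in ids for i in range(1, len(d) + 1)}
        containing.append(prefixes)
    common = set.intersection(*containing)
    return sorted(
        c for c in common
        if not any(o != c and o[:len(c)] == c for o in common)
    )


xql = [parse("5.0.3.0.0"), parse("8.0.3.8.3")]
ricardo = [parse("5.0.3.0.1"), parse("8.2.1.4.2")]
results = most_specific_results([xql, ricardo])
```

With these sample lists, the first document yields the element 5.0.3.0 (the common parent of the two keyword occurrences), while in the second document the keywords only meet at the document root, 8. A merge in Dewey ID order reaches the same prefixes while reading each list once, because entries that share a prefix are adjacent.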

Ranked Dewey Inverted List (RDIL)
[Figure: for each keyword (XQL, Ricardo), an inverted list sorted by ElemRank, together with a B+-tree on Dewey ID over the same entries.]

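RDIL: Query Processing
[Figure: animation of RDIL query processing for the keywords Ricardo and XQL. Each inverted list (entries such as 8.2.1.4.2, 9.0.4.1.2, 9.0.5.6, 10.8.3) is scanned in ElemRank order. For the current entry P (9.0.4.2.0), the B+-tree on Dewey ID of the other keyword is probed to find the entry with the longest common prefix (9.0.4.1.2); the resulting element 9.0.4 is ranked, Rank(9.0.4), and placed in a temp heap or the output heap.]
Threshold for stopping early:
– Initially: threshold = ElemRank(P) + Max-ElemRank
– With current entries P and R in the two lists: threshold = ElemRank(P) + ElemRank(R)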

Motivation for DIL/RDIL Hybrid
Correlation of query keywords: probability that the query keywords occur in the same element
– High correlation: RDIL is likely to outperform DIL by stopping early
– Low correlation: DIL is likely to outperform RDIL because RDIL has to scan most of (or the entire) inverted list
Dilemma
– Each of DIL and RDIL is likely to outperform the other on some queries
– But they require inverted lists to be sorted in different orders
Challenges
– How to get the benefits of DIL and RDIL without doubling space?
– How can keyword correlation be determined?

Hybrid Dewey Inverted List (HDIL)
[Figure: for a keyword (XQL), the full inverted list sorted by Dewey ID, a B+-tree on Dewey ID, and a short list sorted by ElemRank.]
RDIL is better only when it scans little of the inverted list
– A short list sorted by ElemRank is enough, which saves space
Can reuse the full inverted list as the leaf level of the B+-tree
– Saves space

DBLP: High Correlation Keywords

DBLP: Low Correlation Keywords

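[Figure: a repository combining INEX IEEE, SIGMOD Record, ... and Shakespeare's Plays, in which Shakespeare's Plays make up less than 3%, next to a repository holding Shakespeare's Plays alone.]
Query: find relevant elements in Shakespeare’s plays about ‘the process of speech’
9 of the top 10 results for one repository were not in the top 10 results of the other repository
– Scoring: XIRQL’s [Fuhr & Großjohann, SIGIR 2001] TF-IDF scoring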

Explaining the Results
TF-IDF scoring for a keyword k:
– TF (Term Frequency): # occurrences of k in the element, usually normalized by some factor
– IDF (Inverse Document Frequency): (# elements)/(# elements that contain k)
Score = sum of TF*IDF over all query keywords
Main reason for skewed results
– Language of engineers is very different from the language of Shakespeare!
– ‘process’ is common in INEX, ‘speech’ is uncommon
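A minimal sketch of the effect, using the slide's definitions: IDF is (# elements)/(# elements that contain k) with no logarithm, and the score is the sum of TF*IDF. The toy elements, the length normalization of TF, and the name `tf_idf_score` are assumptions for illustration, not the scoring function used in the experiments.

```python
from collections import Counter


def tf_idf_score(element, query, context):
    """Score one element (a list of tokens) against the query keywords.

    `context` is the list of elements over which IDF is computed.
    TF is normalized by element length, one of several possible choices.
    """
    counts = Counter(element)
    total = 0.0
    for keyword in query:
        containing = sum(1 for other in context if keyword in other)
        if containing == 0:
            continue  # keyword absent from the context contributes nothing
        tf = counts[keyword] / len(element)
        idf = len(context) / containing
        total += tf * idf
    return total


query = ["process", "speech"]
line = ["the", "process", "of", "speech"]

# Context A: a small collection of play-like elements.
plays = [line, ["speech", "speech"]]
# Context B: the same elements plus many engineering-style elements.
combined = plays + [["process", "scheduling"]] * 8

score_in_plays = tf_idf_score(line, query, plays)
score_in_combined = tf_idf_score(line, query, combined)
```

The same element gets a different score in each context because ‘process’ becomes common and ‘speech’ becomes rare once the engineering elements are added, which is why rankings computed against the whole repository can differ sharply from rankings computed against the plays alone.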

[Figure: the combined repository of INEX IEEE, SIGMOD Record, ... and Shakespeare's Plays, in which Shakespeare's Plays make up less than 3%.]
Need a way to efficiently compute IDF (or another corpus scoring statistic) “on the fly”

Context-Sensitive Ranking [Botev & Shanmugasundaram, WebDB 2005]
Use Dewey inverted lists + context B+-trees
Two-pass algorithm
– First pass: collect statistics
– Second pass: compute results (entries cached from the first pass)
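The two passes can be sketched as follows, under several stated assumptions: a context is a single Dewey ID prefix, each inverted list is a Python list of (Dewey ID, term frequency) pairs sorted by Dewey ID, binary search on that list stands in for the context B+-tree, and the number of elements in the context is supplied by the caller. The names `context_range` and `rank_in_context` are hypothetical, and only elements that directly contain a keyword are scored.

```python
from bisect import bisect_left


def context_range(sorted_ids, context):
    """Index range of the IDs that lie inside the subtree rooted at `context`."""
    lo = bisect_left(sorted_ids, context)
    # The next sibling of `context` is the first ID after its whole subtree.
    next_sibling = context[:-1] + (context[-1] + 1,)
    hi = bisect_left(sorted_ids, next_sibling)
    return lo, hi


def rank_in_context(inverted, query, context, num_context_elements):
    # First pass: collect per-keyword statistics inside the context and
    # cache the matching entries.
    cached, idf = {}, {}
    for keyword in query:
        entries = inverted.get(keyword, [])
        lo, hi = context_range([dewey for dewey, _ in entries], context)
        cached[keyword] = entries[lo:hi]
        if cached[keyword]:
            idf[keyword] = num_context_elements / len(cached[keyword])

    # Second pass: score the cached entries with the context-specific IDF.
    scores = {}
    for keyword, entries in cached.items():
        for dewey, tf in entries:
            scores[dewey] = scores.get(dewey, 0.0) + tf * idf[keyword]
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


inverted = {
    "speech": [((1, 0), 2), ((1, 1), 1), ((2, 0), 3)],
    "process": [((1, 0), 1), ((2, 0), 1), ((2, 1), 1), ((2, 2), 1)],
}
ranked = rank_in_context(inverted, ["process", "speech"], (1,), 4)
```

Sorting by Dewey ID is what makes the context lookup cheap: all entries below a context element are contiguous in the list, so one range lookup per keyword isolates them.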

Motivation
Many new applications require sophisticated DB queries + “complex” full-text search
– Example: Library of Congress documents in XML
Current XML query languages are mostly “database” languages
– Examples: XQuery, XPath
They provide very rudimentary text/IR support
– fn:contains(e, keywords)
No support for complex IR queries
– Distance predicates, stemming, scoring, …

Example Queries
From the XQuery Full-Text Use Cases document
– Find the titles of the books whose body contains the phrases “Usability” and “Web site” in that order, in the same paragraph, using stemming if necessary to match the tokens
– Find the titles of the books published after 1999 whose body contains “Usability” and “testing” within a window of 3 words, and return them in score order

XQuery Full-Text [W3C Working Draft]
[Timeline from 2002 to 2005 showing: Quark Full-Text Language (Cornell); TeXQuery (Cornell, AT&T); IBM, Microsoft, Oracle proposals; XQuery Full-Text (Second Draft).]

Outline Pure Keyword Search Keyword Search in Context Full-Text + DB Queries –XQuery Full-Text Overview –Quark Implementation Related Work and Conclusion

XQuery Primer
Find the titles of books:
  //book/title
Find the titles of books with price < 25:
  //book[./price < 25]/title
Find books written by Dawkins, in order of price:
  for $b in //book[./author = 'Dawkins']
  order by $b/price
  return $b

Syntax Overview [Amer-Yahia, Botev, Shanmugasundaram, WWW 2004]
Two new XQuery constructs
1) FTContainsExpr
– Expresses “Boolean” full-text search predicates
– Seamlessly composes with other XQuery expressions
2) FTScoreClause
– Extension to the FOR expression
– Can score FTContainsExpr and other expressions

FTContainsExpr
ContextExpr ftcontains FTSelection
– ContextExpr (any XQuery expression) is the context spec
– FTSelection is the search spec
– Returns true iff at least one node in ContextExpr satisfies the FTSelection
Examples
– //book ftcontains 'Usability' && 'testing' distance 5
– //book[./content ftcontains 'Usability' with stems]/title
– //book ftcontains /article[author='Dawkins']/title
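To make the Boolean semantics concrete, here is a rough Python sketch of a check in the spirit of the first example ('Usability' && 'testing' distance 5). It is not the XQuery Full-Text semantics: the tokenizer, the case folding, and the reading of "distance" as the number of tokens between the two matches are all assumptions for illustration, as are the function names.

```python
import re


def tokenize(text):
    """Lower-case word tokens; a stand-in for a real full-text tokenizer."""
    return re.findall(r"\w+", text.lower())


def contains_within(text, first, second, max_between):
    """True if both words occur with at most `max_between` tokens between them."""
    tokens = tokenize(text)
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(
        abs(i - j) - 1 <= max_between
        for i in first_positions
        for j in second_positions
        if i != j
    )


def ftcontains(node_texts, first, second, max_between):
    """True iff at least one node in the context satisfies the selection."""
    return any(
        contains_within(text, first, second, max_between)
        for text in node_texts
    )


books = [
    "A field guide to birds",
    "Usability and web site testing in practice",
]
matched = ftcontains(books, "Usability", "testing", 5)
```

The outer `any` mirrors the rule on the slide: the expression is true as soon as a single node of the context expression satisfies the selection.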

FTScore Clause
FOR $v [SCORE $s]? IN Expr ORDER BY … RETURN
Example
  FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing"]
  ORDER BY $s
  RETURN $b

FTScore Clause
FOR $v [SCORE $s]? IN Expr ORDER BY … RETURN
Example
  FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing" and ./price < 10.00]
  ORDER BY $s
  RETURN $b

Outline Pure Keyword Search Keyword Search in Context Full-Text + DB Queries –XQuery Full-Text Overview –Quark Implementation Related Work and Conclusion

Quark
An open-source C++ implementation of XQuery Full-Text
– http://www.cs.cornell.edu/database/quark
– Compiles on Linux and Windows
Key features
– Mix of structured and full-text predicates
– Score all of XQuery!
– Full-text search over views

Quark Architecture
[Figure: XML documents enter through the Document Loader into File System Storage, with a Structure Index and an Inverted List Index. XQuery + XQFT queries go to Query Processing + Scoring, which returns ranked results.]

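Mix of Structure and Full-Text Queries
Query: /pub/book[. ftcontains "Usability" && "testing" and ./price < 10.00]
[Figure: the structural part, /pub/book[./price < 10.00], is evaluated on the Structure Index and the full-text part on the Inverted List Index; both produce Dewey IDs, which are combined into the results.]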

Scoring XQuery
  FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing" and ./price < 10.00]
  ORDER BY $s
  RETURN $b

Scoring XQuery
Extending the XQuery data model (internal)
– Original: sequence of items
– New: sequence of scored items
Scoring predicates
– Full-text: IR-style probabilistic scoring
– Structured: scoring functions, e.g., for a > 1000 the score reaches 1 only as a goes to infinity
Scoring XQuery expressions
– Probabilistic combination of scores [Fuhr and Rölleke], e.g., Exists is the “or” of all input scores
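One common way to combine probabilistic scores, sketched here under the assumption that scores lie in [0, 1] and are treated as independent probabilities, is to multiply them for "and" and to use the complement of the product of complements for "or". The function names are hypothetical, and the slides do not state which combination formulas Quark actually uses.

```python
from math import prod


def score_and(scores):
    """Probability that all independent events hold."""
    return prod(scores)


def score_or(scores):
    """Probability that at least one independent event holds."""
    return 1.0 - prod(1.0 - s for s in scores)


def score_exists(item_scores):
    """Exists over a scored sequence: the "or" of all input scores."""
    return score_or(item_scores)


# A book that matches the full-text predicate with score 0.5 and the
# structured predicate with score 0.5:
both = score_and([0.5, 0.5])
# A sequence of three scored items tested for existence:
any_item = score_exists([0.5, 0.5, 0.0])
```

Under this reading, an empty input sequence gives `score_exists([]) == 0.0`, matching the Boolean case where Exists over an empty sequence is false.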

Full-Text Search Over Views
[Figure: Data Source 1 and Data Source 2 are combined into an Integrated View, over which full-text search is performed.]

Related Work
Semi-structured ranked keyword search
– XIRQL [Fuhr and Großjohann]
– XXL [Theobald and Weikum, 2001]
– Commercial search engines [Luk et al.]
– INEX initiative
Keyword search over databases
– BANKS [Bhalotia et al.]
– DBXplorer [Agrawal et al.]
– DISCOVER [Hristidis et al.]
– LORE [Goldman et al.]

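10000 Foot View of Data Management
[Figure: a chart with two axes. Data: Structured vs. Unstructured. Queries: Complex and Structured vs. Ranked Search. Database Systems cover structured data with complex and structured queries; Information Retrieval Systems cover unstructured data with ranked search.]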
