Managing XML and Semistructured Data Lecture 16: Indexes Prof. Dan Suciu Spring 2001.



In this lecture
- Indexes
  - XSet
  - Region algebras
  - DataGuides
  - T-indexes
- Resources
  - Index Structures for Path Expressions, by Milo and Suciu, in ICDT'99
  - XSet description: Data on the Web, Abiteboul, Buneman, Suciu: section 8.2

The problem
- Input: large, irregular data graph
- Output: index structure for evaluating regular path expressions

The Data Semistructured data instance = a large graph

The queries
Regular expressions (using Lorel-like syntax):
SELECT X FROM (Bib.*.author).(lastname|firstname).Abiteboul X

Analyzing the problem
- what kind of data
  - tree data (XML)
  - graph data
- what kind of queries
  - restricted regular expressions (e.g. XPath)
  - arbitrary regular expressions

XSet: a simple index for XML
- Part of the Ninja project at Berkeley
Example XML data:

XSet: a simple index for XML
- Each node = a hashtable
- Each entry = list of pointers to data nodes (not shown)

XSet: Efficient query evaluation
- SELECT X FROM part.name X: yes
- SELECT X FROM part.supplier.name X: yes
- SELECT X FROM part.*.subpart.name X: maybe
- SELECT X FROM *.supplier.name X: maybe
Will gain when index fits in memory
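The structure described above (one hash table per index node, each entry leading to a list of data nodes) can be sketched roughly as follows. This is only an illustration, not the Ninja project's code: the names `PathNode`, `build_index` and `lookup` are hypothetical, and treating `*` as "any sequence of zero or more labels" is an assumption made for the sketch.

```python
import xml.etree.ElementTree as ET

class PathNode:
    """One index node: a hash table of children plus the data nodes it covers."""
    def __init__(self):
        self.children = {}  # tag -> PathNode (the per-node hash table)
        self.extent = []    # XML elements reached by this label path

def build_index(root_elem):
    """Build a trie over all root-to-node label paths of an XML tree."""
    top = PathNode()
    def add(elem, node):
        child = node.children.setdefault(elem.tag, PathNode())
        child.extent.append(elem)
        for sub in elem:
            add(sub, child)
    add(root_elem, top)
    return top

def lookup(index, path):
    """Evaluate a dotted path; '*' is assumed to match any label sequence."""
    steps = path.split(".")
    hits = {}  # id(index node) -> index node, to avoid duplicates
    def walk(node, i):
        if i == len(steps):
            hits[id(node)] = node
        elif steps[i] == "*":
            walk(node, i + 1)               # '*' matches nothing
            for child in node.children.values():
                walk(child, i)              # or one more label
        else:
            child = node.children.get(steps[i])
            if child is not None:
                walk(child, i + 1)          # plain hash lookup
    walk(index, 0)
    return [elem for node in hits.values() for elem in node.extent]
```

A path without wildcards costs one hash lookup per step, which is the "yes" case; a path with `*` has to explore part of the index itself, which is why the gain is only "maybe".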

Region Algebras
- structured text = text with tags (like XML)
- powerful indexing techniques [Baeza-Yates, Gonnet, Navarro, Salminen, Tompa, etc.]
- New Oxford English Dictionary
- critical limitation: ordered data only (like text)
- less critical limitation: restricted regular expressions

Region Algebras
- data = sequence of characters [c_1 c_2 c_3 …]
- region = interval in the text
  - representation: (x,y) = [c_x, c_{x+1}, …, c_y]
  - example: …
- region set = a set of regions
  - example: all regions (may be nested)
- region algebra = operators on region sets, s1 op s2

Representation of a region set Example: the region set:

Region algebra: some operators
- s1 intersect s2 = {r | r ∈ s1, r ∈ s2}
- s1 included s2 = {r | r ∈ s1, ∃ r' ∈ s2, r ⊆ r'}
- s1 including s2 = {r | r ∈ s1, ∃ r' ∈ s2, r ⊇ r'}
- s1 parent s2 = {r | r ∈ s1, ∃ r' ∈ s2, r is a parent of r'}
- s1 child s2 = {r | r ∈ s1, ∃ r' ∈ s2, r is a child of r'}
Examples: included = {s1, s2, s3, s5}; including = {p2, p3}
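The operator definitions can be written down almost literally over sets of `(start, end)` pairs. The sketch below is a naive, quadratic reading of the definitions, not an efficient implementation. One assumption is added: "parent" is taken to mean the smallest properly enclosing region within a given `universe` of all regions, which the slide does not spell out. All function names are hypothetical.

```python
def contains(outer, inner):
    """True if region `inner` lies within region `outer` (inclusive bounds)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def intersect(s1, s2):
    return {r for r in s1 if r in s2}

def included(s1, s2):
    return {r for r in s1 if any(contains(q, r) for q in s2)}

def including(s1, s2):
    return {r for r in s1 if any(contains(r, q) for q in s2)}

def is_parent(r, q, universe):
    """Assumed meaning: r properly contains q with no region in between."""
    if r == q or not contains(r, q):
        return False
    return not any(
        m != r and m != q and contains(r, m) and contains(m, q)
        for m in universe
    )

def parent(s1, s2, universe):
    return {r for r in s1 if any(is_parent(r, q, universe) for q in s2)}

def child(s1, s2, universe):
    return {r for r in s1 if any(is_parent(q, r, universe) for q in s2)}

# Hypothetical region sets: two parts, one subpart, two names.
parts = {(0, 100), (110, 150)}
subparts = {(20, 60)}
names = {(5, 10), (25, 30)}
universe = parts | subparts | names

print(included(names, subparts))      # names nested inside some subpart
print(including(parts, subparts))     # parts that contain a subpart
print(child(names, parts, universe))  # names directly under a part
```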

Efficient computation of Region Algebra Operators
Example: s1 included s2
- s1 = {(x1,x1'), (x2,x2'), …}
- s2 = {(y1,y1'), (y2,y2'), …}
- (i.e. assume each consists of disjoint regions, sorted by start position)
Algorithm (start with i = j = 1, repeat until either set is exhausted):
- if xi < yj then i := i + 1
- else if xi' > yj' then j := j + 1
- otherwise: print (xi,xi'), do i := i + 1
Can do in sub-linear time when one region set is very small
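The merge-style algorithm for `s1 included s2` can be sketched as below. It assumes, as the slide does, that each input consists of disjoint regions, and additionally that both lists are sorted by start position. The function name `included_merge` is hypothetical.

```python
def included_merge(s1, s2):
    """Regions of s1 that lie inside some region of s2.

    Both inputs are lists of (start, end) pairs, each list sorted by start
    and made of pairwise disjoint regions. Runs in O(len(s1) + len(s2)).
    """
    out = []
    i = j = 0
    while i < len(s1) and j < len(s2):
        x, x_end = s1[i]
        y, y_end = s2[j]
        if x < y:
            i += 1            # s1[i] starts too early to fit in s2[j] or later
        elif x_end > y_end:
            j += 1            # s1[i] runs past s2[j]; try the next s2 region
        else:
            out.append(s1[i])  # y <= x and x_end <= y_end: included
            i += 1
    return out

print(included_merge([(1, 2), (4, 6), (8, 9), (12, 15)],
                     [(3, 7), (8, 10), (11, 13)]))
```

When one list is much shorter than the other, the linear scan over the long list could be replaced by a binary search for the next candidate, which is one way to get the sub-linear behaviour mentioned above.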

From path expressions to region expressions
- part.name → name child (part child root)
- part.supplier.name → name child (supplier child (part child root))
- *.supplier.name → name child supplier
- part.*.subpart.name → name child (subpart included (part child root))
Region expressions correspond to simple XPath expressions
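The translation pattern in these four examples can be captured by a small rewriting function. The rule used here is inferred from the examples only and is an assumption: a plain step becomes `child`, a step preceded by `*` becomes `included`, and a leading `*` drops the anchoring at `root`. The name `to_region_expr` is hypothetical.

```python
def to_region_expr(path):
    """Rewrite a dotted path expression into a region-algebra expression."""
    expr = None        # region expression built so far (None = at the root)
    wildcard = False   # True if the previous step was '*'
    for step in path.split("."):
        if step == "*":
            wildcard = True
            continue
        if expr is None:
            # First label: anchored at root unless a '*' came before it.
            expr = step if wildcard else f"{step} child root"
        else:
            op = "included" if wildcard else "child"
            inner = f"({expr})" if " " in expr else expr
            expr = f"{step} {op} {inner}"
        wildcard = False
    return expr

for p in ["part.name", "part.supplier.name",
          "*.supplier.name", "part.*.subpart.name"]:
    print(p, "=>", to_region_expr(p))
```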

DataGuides
Goldman & Widom [VLDB 97]
- graph data
- arbitrary regular expressions

DataGuides
Definition: given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.:
- every path in DB also occurs in G
- every path in G occurs in DB
- every path in G is unique

Dataguides Example:

DataGuides Multiple DataGuides for the same data:

DataGuides
Definition: let w, w' be two words (i.e. word queries) and G a graph. Then w ≡_G w' if w(G) = w'(G).
Definition: G is a strong DataGuide for a database DB if ≡_G is the same as ≡_DB.

DataGuides
Example:
- G1 is a strong DataGuide
- G2 is not strong: ≡_DB and ≡_G2 disagree on the pair person.project, dept.project

DataGuides
Constructing the strong DataGuide G:
  Nodes(G) = {{root}}
  Edges(G) = ∅
  while changes do
    choose s in Nodes(G), a in Labels
    let s' = {y | x in s, (x -a-> y) in Edges(DB)}
    if s' is not empty: add s' to Nodes(G) and add (s -a-> s') to Edges(G)
Use a hash table for Nodes(G).
This is precisely the powerset automaton construction.
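The construction can be sketched as a worklist version of the powerset (subset) construction. This is an illustrative sketch: the edge-triple input format and the name `strong_dataguide` are assumptions, and a Python `set` of `frozenset`s plays the role of the hash table for Nodes(G).

```python
from collections import deque

def strong_dataguide(edges, root):
    """Build the strong DataGuide of a data graph.

    edges: iterable of (source, label, target) triples of the data graph.
    Returns (nodes, guide_edges); each guide node is a frozenset of data
    nodes (its extent), each guide edge is a (node, label, node) triple.
    """
    out = {}  # data node -> label -> set of successor data nodes
    for x, a, y in edges:
        out.setdefault(x, {}).setdefault(a, set()).add(y)

    start = frozenset([root])
    nodes = {start}            # hash table of DataGuide nodes
    guide_edges = set()
    queue = deque([start])
    while queue:
        s = queue.popleft()
        by_label = {}          # label -> all data nodes reachable from s
        for x in s:
            for a, ys in out.get(x, {}).items():
                by_label.setdefault(a, set()).update(ys)
        for a, ys in by_label.items():
            t = frozenset(ys)
            guide_edges.add((s, a, t))
            if t not in nodes:
                nodes.add(t)
                queue.append(t)
    return nodes, guide_edges

# Hypothetical data graph: two persons and one department sharing a project.
db = [("r", "person", "p1"), ("r", "person", "p2"), ("r", "dept", "d1"),
      ("p1", "project", "x"), ("p2", "project", "y"), ("d1", "project", "x")]
nodes, guide_edges = strong_dataguide(db, "r")
print(len(nodes), len(guide_edges))
```

Because every guide node is a set of data nodes, the number of guide nodes can in the worst case grow exponentially in the size of a graph-shaped database, as with any subset construction.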

DataGuides
How large are the DataGuides?
- if DB is a tree, then size(G) <= size(DB)
  - why? answer: every node is in exactly one extent of G
  - here: DataGuide = XSet
- How many nodes does the strong DataGuide have for this DB? 20 nodes (least common multiple of 4 and 5)
DataGuides usually fail on data with cyclic schemas, like:

T-Indexes
Milo & Suciu [ICDT 99]
- 1-index:
  - data graph
  - arbitrary regular expressions
- 2-index, T-index: for more complex queries, consisting of several regular expressions

1-Indexes
A first attempt:
- Database: DB = (V, E, Roots)
- Queries: regular path expressions q(DB)
- ∀ u ∈ V. L_u = {a_1 … a_n | v_0 -a_1-> … -a_n-> v_n ∈ DB, v_0 ∈ Roots, v_n = u}
- ∀ u, v ∈ V. u ≡ v ⇔ L_u = L_v
- ∀ u ∈ V. [u] = {v | u ≡ v}

1-Indexes
- Nodes(I) = { [u] | u in Nodes(DB) }
- Edges(I) = { s -a-> s' | ∃ u ∈ s, ∃ u' ∈ s', (u -a-> u') ∈ Edges(DB) }
- Evaluate queries on I: q(DB) = { u | ∃ s ∈ q(I), u ∈ s }
Example:
Inefficient: construction cost (PSPACE)

1-Indexes
IDEA: use simulation or bisimulation instead of ≡
- Fact: u ≈_b v ⇒ u ≈_s v ⇒ u ≡ v
- Use the same construction, but [u] now refers to ≈_b instead of ≡
- Works because L_u = L_[u]
- Efficient PTIME algorithms exist for computing ≈_b and ≈_s [Paige & Tarjan; Henzinger, Henzinger & Kopke]
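A 1-index based on bisimulation can be sketched with naive partition refinement. This is not the Paige & Tarjan algorithm cited above: it is a simpler quadratic loop chosen for readability. It assumes the bisimulation is taken over incoming edges (so that equivalent nodes have the same set L_u of incoming label paths), and the name `one_index` and the edge-triple format are hypothetical.

```python
def one_index(nodes, edges, roots):
    """Group data nodes by (backward) bisimulation and build the index graph.

    nodes: list of data nodes; edges: (source, label, target) triples;
    roots: set of root nodes. Returns (classes, index_edges).
    """
    preds = {u: [] for u in nodes}  # node -> incoming (label, source) pairs
    for x, a, y in edges:
        preds[y].append((a, x))

    # Start by separating roots from non-roots, then refine until stable.
    block = {u: int(u in roots) for u in nodes}
    while True:
        # A node's signature: its current block plus the blocks it is
        # reachable from in one step, per label.
        sig = {u: (block[u], frozenset((a, block[p]) for a, p in preds[u]))
               for u in nodes}
        ids = {}
        new_block = {u: ids.setdefault(sig[u], len(ids)) for u in nodes}
        if len(set(new_block.values())) == len(set(block.values())):
            break  # no block was split: the partition is stable
        block = new_block

    classes = {}
    for u in nodes:
        classes.setdefault(block[u], set()).add(u)
    index_edges = {(block[x], a, block[y]) for x, a, y in edges}
    return {frozenset(c) for c in classes.values()}, index_edges

# Hypothetical tree-shaped database.
db_nodes = ["r", "p1", "p2", "d1", "n1", "n2", "n3"]
db_edges = [("r", "person", "p1"), ("r", "person", "p2"), ("r", "dept", "d1"),
            ("p1", "name", "n1"), ("p2", "name", "n2"), ("d1", "name", "n3")]
classes, index_edges = one_index(db_nodes, db_edges, {"r"})
print(len(classes), len(index_edges))
```

Each refinement round only ever splits blocks, so the loop stops after at most one round per node; the cited algorithms reach the same partition far more efficiently.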

1-Indexes Example

1-Indexes
Analyzing the 1-index:
- always: size(I) <= size(DB) (unlike DataGuide)
- always: can compute in O(n log n) time, n = size(DB)
- when DB is a tree: ≈_b, ≈_s, ≡ coincide
  - no penalty for ≈_b, ≈_s
  - 1-index = DataGuide = XSet

1-Indexes
Analyzing the 1-index: do we have size(I) << size(DB)?
- No. Two worst cases:
- Facts:
  - in theory: except for these two DBs, size(I) << size(DB)
  - in practice: it's a different story. Experiments: size(I) ≈ 1/3 size(DB)

Conclusions
- work on structured text: relevant but restrictive
- trees are simple: XSet = DataGuides = 1-index (conceptually)
- 1-index: scales to cyclic data too
- more complex queries: 2-index, T-index
- T-index: space/generality tradeoff
- Problem: how to use a specific T-index to answer a given query. Query rewriting (see [ICDT'99]).
- Need external-memory algorithm for bisimulation/simulation.