Inferring XML Schema Definitions from XML Data

1 Inferring XML Schema Definitions from XML Data
Geert Jan Bex, Frank Neven, Stijn Vansummeren
Hasselt Univ. and Transnational Univ. of Limburg, Belgium
VLDB 2007
Summarized and presented by Chulki Lee, IDS Lab., Seoul National University
Many parts are taken from the authors' slides (http://www.vldb2007.org/program/slides/s998-bex.pdf)

2 Inferring XML Schema
Why schemas?
  automation & optimization of search
  integration of XML data sources
Why infer schemas?
  50% of XML documents on the web have no schema
  33% of schemas are not valid
Why infer XSD (XML Schema Definition)?
  DTD (Document Type Definition) has limitations
  in a DTD, an element's type depends only on the element's name (the path is not considered)

3 Example: DTD vs. XSD
[Figure not preserved in the transcript; surviving labels: name, type]

4 Theorem
Inferring an XSD from an XML corpus is impossible in general: XSDs cannot be learned from positive data only
In an XSD, the content model of an element is uniquely determined by the path from the root to that element

5 Observation: local context
An XSD is k-local if its content models depend only on labels up to the k-th ancestor
For 98% of XSDs, k = 2
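To make the idea of local context concrete, the sketch below groups the child sequences of every element in a document by its k-local context. It is only an illustration, not the authors' code: the function name `child_strings_by_context` and the sample document are made up, and it assumes that the context is the last k labels on the root-to-element path, counting the element itself (so k = 1 corresponds to a DTD).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def child_strings_by_context(xml_text, k):
    """Group each element's sequence of child labels by its k-local context.

    Assumption (not from the slides): the context is the last k labels on
    the path from the root to the element, including the element itself.
    """
    root = ET.fromstring(xml_text)
    samples = defaultdict(list)

    def visit(elem, path):
        path = path + [elem.tag]
        context = tuple(path[-k:])
        # One sample string per element occurrence: its children, in order.
        samples[context].append([child.tag for child in elem])
        for child in elem:
            visit(child, path)

    visit(root, [])
    return dict(samples)


# Hypothetical document: "name" has different children under author and editor.
doc = (
    "<book><title/>"
    "<author><name><first/><last/></name></author>"
    "<editor><name><full/></name></editor>"
    "</book>"
)

by_name_only = child_strings_by_context(doc, 1)    # DTD-like view
by_parent_too = child_strings_by_context(doc, 2)   # 2-local view
```

With k = 1 both kinds of `name` element land in one context and their samples are mixed; with k = 2 the contexts `("author", "name")` and `("editor", "name")` are kept apart, which is what lets an XSD give them different types.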

6 Observation: SORE
Single Occurrence Regular Expression (SORE): every element name occurs at most once
What is a SORE
  title, (author, affiliation?)+, abstract
What is not a SORE (duplicated element names)
  title, ((author, affiliation)+ + (editor, affiliation)+), abstract
99% of regular expressions are single occurrence
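The single occurrence property can be checked mechanically by counting element names. The following is a minimal sketch under the assumption that a content model is written as a plain string whose element names are identifier-like tokens; the helper names `repeated_names` and `is_single_occurrence` are hypothetical.

```python
import re
from collections import Counter


def repeated_names(expression):
    """Return the element names that occur more than once in the expression."""
    # Assumption: element names are identifier-like tokens; operators such as
    # ',', '+', '?', '*', '|' and parentheses are skipped by the pattern.
    names = re.findall(r"[A-Za-z_][\w.-]*", expression)
    return sorted(name for name, count in Counter(names).items() if count > 1)


def is_single_occurrence(expression):
    """True if every element name occurs at most once."""
    return not repeated_names(expression)


sore = "title, (author, affiliation?)+, abstract"
not_sore = "title, ((author, affiliation)+ + (editor, affiliation)+), abstract"

print(is_single_occurrence(sore))      # True
print(repeated_names(not_sore))        # ['affiliation']
```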

7 Proposed Algorithms
Theorem: XSDs with local context and SORE content models are learnable from positive examples only (the corpus needs to be 'sufficiently large')
iLocal = iSOA + TOSORE + MINIMIZE
  infers a k-local, single occurrence target XSD
iXSD = iLocal & REDUCE
  REDUCE = unify sufficiently similar types
SOA: Single Occurrence Automaton

8 Algorithm: iLocal (1/4)

9 Algorithm: iLocal (2/4)

10 Algorithm: iLocal (3/4)
iSOA: make SOA from strings
TOSORE: translate SOA → SORE
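One plausible reading of "make SOA from strings" is the classical construction that records which names start a string, which end it, and which pairs of names occur next to each other. The sketch below follows that reading. The representation and the names `infer_soa` and `accepts` are assumptions for illustration, and the TOSORE translation step is not attempted here.

```python
def infer_soa(strings):
    """Build a single occurrence automaton from sample strings.

    There is one state per element name, so the automaton is described by
    its initial names, final names, and name-to-name edges.
    """
    initial, final, edges = set(), set(), set()
    accepts_empty = False
    for s in strings:
        if not s:
            accepts_empty = True
            continue
        initial.add(s[0])
        final.add(s[-1])
        edges.update(zip(s, s[1:]))  # every pair of adjacent names
    return initial, final, edges, accepts_empty


def accepts(soa, s):
    """Check whether the automaton accepts the sequence of names s."""
    initial, final, edges, accepts_empty = soa
    if not s:
        return accepts_empty
    return (
        s[0] in initial
        and s[-1] in final
        and all(pair in edges for pair in zip(s, s[1:]))
    )


# Hypothetical sample strings for one context.
sample = [
    ["title", "author", "affiliation", "author", "abstract"],
    ["title", "author", "abstract"],
]
soa = infer_soa(sample)

# Generalizes beyond the sample: more (author, affiliation) repetitions.
longer = ["title", "author", "affiliation", "author", "affiliation",
          "author", "abstract"]
print(accepts(soa, longer))                                    # True
print(accepts(soa, ["title", "author", "author", "abstract"])) # False
```

The second call is rejected because the pair (author, author) never appears in the sample, which shows how the result depends on the corpus being large enough.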

11 Algorithm: iLocal (4/4)

12 Algorithm: iXSD
Incomplete data: iLocal derives too many types
REDUCE: practical heuristics
  define a distance between types
  for types s and t: if distance(s, t) < ε then unify s and t
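The unification rule can be sketched as follows. The slides do not say how the distance is defined, so this example swaps in a Jaccard distance over the edge sets of the types purely as a stand-in; `jaccard_distance`, `reduce_types`, and the type names are hypothetical. For simplicity it compares only the original types and does not recompute distances after a merge.

```python
def jaccard_distance(a, b):
    """Stand-in distance between two types given as sets (an assumption)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def reduce_types(types, epsilon, distance=jaccard_distance):
    """Unify types whose pairwise distance is below epsilon.

    types maps a type name to a set describing its content model.
    Returns the groups of unified type names.
    """
    names = sorted(types)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, s in enumerate(names):
        for t in names[i + 1:]:
            if distance(types[s], types[t]) < epsilon:
                parent[find(t)] = find(s)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Hypothetical inferred types, each described by the edges of its automaton.
inferred = {
    "book/author": {("first", "last"), ("last", "email")},
    "article/author": {("first", "last"), ("last", "email"), ("email", "phone")},
    "book/chapter": {("title", "section")},
}

merged = reduce_types(inferred, epsilon=0.5)
strict = reduce_types(inferred, epsilon=0.2)
```

With ε = 0.5 the two author types (distance 1/3) are unified while the chapter type stays separate; with ε = 0.2 nothing is unified.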

13 Experiments
8 schemas & 200 generated documents for each schema
  schema: 12~23 types with unbounded depth and width
  local with k = 2, 3
Types of iXSD imprecisions:
  the content models of a target type and its inferred type can differ (inference is based on positive examples; can't be avoided)
  a type in the target XSD can correspond to multiple types in the inferred XSD: false positives
  a type in the inferred XSD can correspond to multiple types in the target XSD: false negatives
  a type in the target XSD is not derived (incomplete corpus; can't be avoided)

14 Experiments
k = 3, parsing 697 XSDs (40 MB), Pentium M 1.73 GHz → 17 seconds
k = 2, without REDUCE → 29 false positives (shows the power of REDUCE)
Sensitivity to parameters
  context size k ↑ ⇒ false positives ↑, false negatives ↓
  ε ↑ ⇒ false positives ↓, false negatives ↑

15 Experiments
iXSD derives good XSDs from small training sets (from about 50 documents)

16 Conclusions
Two algorithms are proposed
  iLocal: sound & k-complete
  iXSD: deals with poor data
    good performance on real-world data
    good runtime performance
Future work
  determine the best locality k

