Inferring XML Schema Definitions from XML Data. Geert Jan Bex, Frank Neven, Stijn Vansummeren. Hasselt Univ. and transnational Univ. of Limburg, Belgium. VLDB.


1 Inferring XML Schema Definitions from XML Data
Geert Jan Bex, Frank Neven, Stijn Vansummeren
Hasselt Univ. and transnational Univ. of Limburg, Belgium
VLDB
Summarized and presented by Chulki Lee, IDS Lab., Seoul National University

2 Inferring XML Schema
Copyright 2007 by CEBT
- Why schemas?
  - automation and optimization of search
  - integration of XML data sources
  - …
- Why infer schemas?
  - 50% of XML on the web has no schema
  - 33% of schemas are not valid
- Why infer XSD (XML Schema Definition)?
  - DTD (Document Type Definition) has limitations: an element's type depends only on the element's name, not on the path leading to it

3 Example: DTD vs. XSD
[Figure: DTD vs. XSD example; callout labels: type, name]

4 Theorem
- XSDs cannot be learned from positive data (an XML corpus) only
- In an XSD, the content model of an element is uniquely determined by the path from the root to that element

5 Observation: local context
- An XSD is k-local if its content models depend only on labels up to the k-th ancestor
- 98% of XSDs are k-local with k = 2
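The idea of a local context can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `child_strings_by_context`, the sample document, and the convention that a context is the last k labels on the root path (counting the element's own label, so that k = 1 behaves like a DTD) are all assumptions made here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def child_strings_by_context(xml_text, k):
    """Group every element's child-label sequence by its k-label context.

    Assumed convention: the context is the last k labels of the path from
    the root, including the element's own label. Requires k >= 1.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    root = ET.fromstring(xml_text)
    samples = defaultdict(list)

    def visit(elem, path):
        path = path + (elem.tag,)
        context = path[-k:]  # shorter than k near the root
        samples[context].append(tuple(child.tag for child in elem))
        for child in elem:
            visit(child, path)

    visit(root, ())
    return dict(samples)


# Hypothetical document: <title> has different content under <book> and <film>.
doc = """
<library>
  <book><title><main/><sub/></title></book>
  <film><title><main/></title></film>
</library>
"""

by_name = child_strings_by_context(doc, 1)  # DTD-like: one type per name
by_path = child_strings_by_context(doc, 2)  # name plus parent label
print(by_name[("title",)])
print(by_path[("book", "title")], by_path[("film", "title")])
```

With k = 1 both `title` elements fall into the same context, so their child sequences are mixed; with k = 2 they are separated, which is what lets an XSD assign them different types.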

6 Observation: SORE
- Single Occurrence Regular Expression (SORE): every element name occurs at most once
- A SORE: title, (author, affiliation?)+, abstract
- Not a SORE: title, ((author, affiliation)+ + (editor, affiliation)+), abstract (duplicated element name: affiliation)
- 99% of regular expressions are single occurrence
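The single occurrence condition is purely syntactic, so it can be checked by counting element names. The sketch below is an illustration only; the helper names `repeated_names` and `is_sore` and the name-matching pattern are assumptions, and the check says nothing about whether the expression is otherwise well-formed.

```python
import re
from collections import Counter


def repeated_names(expr):
    """Return the element names that occur more than once in expr."""
    # Assumed token shape: a letter or underscore followed by word
    # characters, dots, or hyphens. Operators and parentheses are skipped.
    names = re.findall(r"[A-Za-z_][\w.-]*", expr)
    return sorted(name for name, n in Counter(names).items() if n > 1)


def is_sore(expr):
    """True if no element name is repeated in expr."""
    return not repeated_names(expr)


good = "title, (author, affiliation?)+, abstract"
bad = "title, ((author, affiliation)+ + (editor, affiliation)+), abstract"

print(is_sore(good))        # True
print(repeated_names(bad))  # ['affiliation']
```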

7 Proposed Algorithms
- Theorem: XSDs with local context and SORE content models are learnable from positive examples only (given a 'sufficiently large' sample)
- iLocal = iSOA + TOSORE + MINIMIZE
  - infers a k-local, single occurrence target XSD
- iXSD = iLocal + REDUCE
  - REDUCE: unify sufficiently similar types
- SOA: Single Occurrence Automaton

8 Algorithm: iLocal (1/4)

9 Algorithm: iLocal (2/4)

10 Algorithm: iLocal (3/4)
- iSOA: build an SOA from strings
- TOSORE: translate SOA → SORE
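The iSOA step can be pictured as follows. This is a minimal sketch that assumes an SOA is a graph with one state per element name plus a source and a sink, and that an edge is added for every pair of adjacent labels seen in the sample. The names `infer_soa`, `accepts`, `SRC`, and `SINK` are made up for this example, and the procedure in the paper may differ in detail.

```python
SRC, SINK = "<src>", "<sink>"  # hypothetical markers for start and end


def infer_soa(strings):
    """Build the edge set of a single occurrence automaton from samples.

    Each sample is a sequence of element names (the children of one element).
    """
    edges = set()
    for s in strings:
        prev = SRC
        for label in s:
            edges.add((prev, label))
            prev = label
        edges.add((prev, SINK))
    return edges


def accepts(edges, s):
    """Check whether the automaton described by edges accepts sequence s."""
    prev = SRC
    for label in s:
        if (prev, label) not in edges:
            return False
        prev = label
    return (prev, SINK) in edges


samples = [
    ("title", "author", "abstract"),
    ("title", "author", "affiliation", "author", "abstract"),
]
soa = infer_soa(samples)

# The automaton generalizes beyond the samples: the author/affiliation
# loop can be taken any number of times.
print(accepts(soa, ("title", "author", "affiliation", "author",
                    "affiliation", "author", "abstract")))  # True
print(accepts(soa, ("title", "abstract")))                   # False
```

Because every element name is a single state, the automaton stays small however many samples are read. TOSORE would then turn such a graph into a regular expression, which is not shown here.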

11 Algorithm: iLocal (4/4)

12 Algorithm: iXSD
- With incomplete data, iLocal derives too many types
- REDUCE: a practical heuristic
  - define a distance between types
  - for types s and t: if distance(s, t) < ε, unify s and t
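The merging rule of REDUCE can be sketched as below. The slide does not say which distance is used, so the Jaccard distance over SOA edge sets is purely an assumption chosen for illustration, as are the names `jaccard_distance` and `reduce_types` and the sample types.

```python
def jaccard_distance(a, b):
    """Assumed distance: 1 - |a ∩ b| / |a ∪ b| over two edge sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def reduce_types(types, eps):
    """Group types whose pairwise distance is below eps.

    types maps a type name to its set of SOA edges. Distances are taken
    between the original types, and groups are closed transitively.
    """
    names = list(types)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, s in enumerate(names):
        for t in names[i + 1:]:
            if jaccard_distance(types[s], types[t]) < eps:
                parent[find(t)] = find(s)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical inferred types: t1 and t2 differ by a single edge.
types = {
    "t1": {("a", "b"), ("b", "c")},
    "t2": {("a", "b"), ("b", "c"), ("c", "d")},
    "t3": {("x", "y")},
}
print(reduce_types(types, eps=0.5))  # [['t1', 't2'], ['t3']]
print(reduce_types(types, eps=0.1))  # [['t1'], ['t2'], ['t3']]
```

A larger ε merges more aggressively, so fewer types survive.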

13 Experiments
- 8 schemas and 200 generated documents for each schema
  - schemas: 12~23 types with unbounded depth and width
  - local with k = 2, 3
- Types of iXSD imprecision:
  - the content models of a target type and its inferred type can differ (learning from positive examples only, cannot be avoided)
  - a type in the target XSD can correspond to multiple types in the inferred XSD: false positives
  - a type in the inferred XSD can correspond to multiple types in the target XSD: false negatives
  - a type in the target XSD is not derived (incomplete corpus, cannot be avoided)

14 Experiments
- k = 3, parsing 697 XSDs (40Mb) on a Pentium M 1.73 → 17 seconds
- k = 2, without REDUCE → 29 false positives (the power of REDUCE)
- Sensitivity to parameters
  - context size k ↑ ⇒ false positives ↑, false negatives ↓
  - ε ↑ ⇒ false positives ↓, false negatives ↑

15 Experiments
- iXSD derives good XSDs from small training sets (from about 50 documents)

16 Conclusions
- Two algorithms are proposed
  - iLocal: sound and k-complete
  - iXSD: deals with poor data
    - good performance on real-world data
    - good runtime performance
- Future work
  - determine the best locality k

