Inference of Concise DTDs from XML data Geert Jan Bex 1 Frank Neven 1 Thomas Schwentick 2 Karl Tuyls 3 1 Hasselt University and Transnational University.


1 Inference of Concise DTDs from XML data. Geert Jan Bex¹, Frank Neven¹, Thomas Schwentick², Karl Tuyls³. ¹Hasselt University and Transnational University of Limburg, ²Dortmund University, ³Maastricht University and Transnational University of Limburg

2 Outline –Goals & motivation –Problem setting –iDTD: Sample → SOA → SORE –CRX: Sample → CHARE –Experiments –Extensions –Conclusions

3 Aims & requirements Problem: infer DTD from XML corpus Requirements: –Concise: humans can interpret/validate –Work on large data sets –Work on small data sets –Robust to noise [Figure: XML corpus → DTD]

4 Why DTD inference? Schema inference –≈ 50 % of XML documents : no schema [Barbosa et al. 2005] –≈ 66 % of DTDs and XSDs : not valid [Bex et al. 2005] –Improving existing schemas –“Noisy” XML documents ≈ 90 % of XHTML docs : not valid Related work –Fails on real-world, large data sets –Results not concise

5 Why schemas? Validation : efficiency, security Optimization : search, processing Static analysis, type checking (e.g., XQuery) Software development : modeling, OR-mapping Integration : (meta-)data sources Schema matching Semantics

6 Outline –Goals & motivation –Problem setting –iDTD: Sample → SOA → SORE –CRX: Sample → CHARE –Experiments –Extensions –Conclusions

7 XML documents [Figure: two book element trees, one with children title, author, author, year and one with children title, editor, year, isbn] Learning a regular expression from a set of strings: title (author⁺ + editor⁺) year isbn?

8 Learning automata? Well studied, but… Learning automata ≠ learning regular expressions, e.g. ((b? (a+c))⁺ d)⁺ e

9 Learning regular languages? S = { abbb, abbd, acd, ac } –abbb + abbd + acd + ac: most specific regex for S –a (b* + c) d?: somewhere in between ??? –(a + b + c + d)*: most general regex for S Generalization vs. specificity, from positive examples only! Impossible… in general
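The trade-off on this slide can be checked mechanically. The sketch below rewrites the three expressions in Python's `re` syntax (where `|` plays the role of the slide's `+`), confirms that each accepts all of S, and counts how many words of length at most 4 each one accepts. The helper names and the length bound are illustrative choices, not part of the talk.

```python
import re
from itertools import product

# Sample S from the slide; every candidate expression must accept all of it.
S = ["abbb", "abbd", "acd", "ac"]

# The slide's "+" (disjunction) becomes "|" in Python's re syntax.
CANDIDATES = {
    "most specific": r"abbb|abbd|acd|ac",
    "in between": r"a(b*|c)d?",
    "most general": r"[abcd]*",
}

def accepts_all(pattern, words):
    """True if the pattern matches every word in full."""
    return all(re.fullmatch(pattern, w) is not None for w in words)

def count_accepted(pattern, alphabet="abcd", max_len=4):
    """Count the words over the alphabet, of length <= max_len, that the pattern accepts."""
    total = 0
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            if re.fullmatch(pattern, "".join(letters)):
                total += 1
    return total

# Larger counts mean a more general (less specific) expression.
sizes = {name: count_accepted(p) for name, p in CANDIDATES.items()}
for name, pattern in CANDIDATES.items():
    print(name, pattern, accepts_all(pattern, S), sizes[name])
```

All three expressions are consistent with the positive examples, so the sample alone gives no reason to prefer one over the others; that is the sense in which learning is impossible in general.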

10 Subclasses –Single Occurrence Regular Expressions (SOREs): 99 % of regular expressions in DTDs/XSDs → infer with iDTD –CHAin Regular Expressions (CHAREs): 90 % of regular expressions in DTDs/XSDs → infer with CRX

11 Outline –Goals & motivation –Problem setting –iDTD: Sample → SOA → SORE –CRX: Sample → CHARE –Experiments –Extensions –Conclusions

12 SOREs What's a SORE –header. protein. organism. reference*. comment*. genetics*. complex*. function*. classification?. keywords?. feature*. summary. sequence –authors. citation. volume?. month?. year. pages?. (title + descr)?. xrefs? –title. (author. affiliation?)⁺. abstract … and what's not –title. ((author. affiliation)⁺ + (editor. affiliation)⁺). abstract (duplicate element names)
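The single-occurrence condition from this slide is easy to test on a content model. The following is a minimal sketch that only checks that no element name is repeated; it does not parse or validate the expression itself. The function names and the tokenising pattern are assumptions made for illustration.

```python
import re
from collections import Counter

def element_names(expr):
    """Extract element names from a content-model string.

    Assumption: a name is a letter or underscore followed by word
    characters or hyphens; operators and punctuation are skipped.
    """
    return re.findall(r"[A-Za-z_][\w-]*", expr)

def duplicate_names(expr):
    """Element names that occur more than once, in alphabetical order."""
    counts = Counter(element_names(expr))
    return sorted(name for name, n in counts.items() if n > 1)

def is_single_occurrence(expr):
    """True if every element name occurs at most once in the expression."""
    return not duplicate_names(expr)

sore = "title. (author. affiliation?)+. abstract"
not_sore = "title. ((author. affiliation)+ + (editor. affiliation)+). abstract"

print(is_single_occurrence(sore))      # no repeated names
print(duplicate_names(not_sore))       # the names that break the condition
```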

13 Sample → SOA W = {bacacdacde, cbacdbacde, abccaadcde} [Figure: SOA with states a, b, c, d, e] Single Occurrence Automaton, constructed with 2T-Inf [Garcia & Vidal 1990]
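One way to read the construction on this slide: every symbol becomes a state, and every pair of adjacent symbols in a sample word becomes a transition. The sketch below follows that reading with hypothetical `SRC` and `SINK` marker states for the start and end of a word; it is an illustration under those assumptions, not the authors' implementation.

```python
# Hypothetical marker states for "before the first symbol" and "after the last".
SRC, SINK = "<src>", "<sink>"

def two_t_inf(sample):
    """Collect the transitions of a single occurrence automaton.

    Each symbol labels exactly one state, so the automaton is fully
    described by its set of (from_state, to_state) edges.
    """
    edges = set()
    for word in sample:
        path = [SRC, *word, SINK]
        edges.update(zip(path, path[1:]))
    return edges

def accepts(edges, word):
    """A word is accepted iff every adjacent pair on its path is an edge."""
    path = [SRC, *word, SINK]
    return all(pair in edges for pair in zip(path, path[1:]))

W = ["bacacdacde", "cbacdbacde", "abccaadcde"]
soa = two_t_inf(W)
states = {s for edge in soa for s in edge}

print(len(states))                         # symbols plus the two marker states
print(all(accepts(soa, w) for w in W))     # the sample itself is accepted
print(accepts(soa, "ade"), "ade" in W)     # a word outside the sample
```

Each word is read once, left to right, and only the edge set is kept, which fits a single streaming pass over the sample.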

14 Sample → SOA SOA size –|Σ| + 2 states –O(|Σ|²) transitions Complexity of algorithm –O(||W||) –streaming Algorithm sound –W ⊆ L(SOA) –in general: |S| << |L(SOA)|

15 SOA → SORE: REWRITE [Figure: SOA over a, b, c, d, e, rewritten step by step] –optional: b becomes b? –disjunction of a, c: a+c –concatenation of b?, a+c: b? (a+c) –self-loop: (b? (a+c))⁺ –result: ((b? (a+c))⁺ d)⁺ e

16 REWRITE: properties Theorem –REWRITE transforms SOA into equivalent SORE for sufficient data, reports failure otherwise (sound & complete) –Complexity: O(|Σ|⁴) SORE size –|Σ| symbols –O(|Σ|) operators

17 REWRITE + repairs = iDTD W = {bacacdacde, cbacdbacde} [Figure: SOA over a, b, c, d, e built from W] –no rules apply!!! –a, c: almost a disjunction –Fix: repair rules enable-disjunction, enable-optional –result: ((b? (a+c))⁺ d)⁺ e

18 iDTD: properties Theorem –iDTD transforms SOA into SORE such that L(SOA) ⊆ L(SORE) iDTD can be parameterized for performance

19 Outline –Goals & motivation –Problem setting –iDTD: Sample → SOA → SORE –CRX: Sample → CHARE –Experiments –Extensions –Conclusions

20 CHAREs Definition: A chain regular expression is a sequence of factors f₁,…,fₙ such that no alphabet symbol occurs more than once and a factor is one of –(a₁ + … + aₖ) –(a₁ + … + aₖ)? –(a₁ + … + aₖ)⁺ –(a₁ + … + aₖ)* CRX (Chain Regular expression eXtraction) derives CHAin Regular Expressions
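The definition translates directly into a small data structure: a chain is a list of factors, each a set of alternative symbols plus a multiplicity. The sketch below checks the single-occurrence condition and matches a list of child element names against a chain by compiling it to a Python regular expression. The `Factor` class, the function names, and the space-delimited encoding of element names are all assumptions made for this illustration.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Factor:
    symbols: tuple          # the alternatives a_1 ... a_k of the factor
    multiplicity: str = ""  # "" (exactly once), "?", "+" or "*"

def is_chare(factors):
    """Check the definition: valid factors, and no symbol used twice."""
    seen = [s for f in factors for s in f.symbols]
    valid = all(f.symbols and f.multiplicity in ("", "?", "+", "*") for f in factors)
    return valid and len(seen) == len(set(seen))

def matches(factors, children):
    """Match a sequence of element names against the chain.

    Each name is followed by a space so that one name cannot be
    confused with a prefix of another.
    """
    pattern = "".join(
        "(?:(?:" + "|".join(re.escape(s) for s in f.symbols) + ") )" + f.multiplicity
        for f in factors
    )
    text = "".join(name + " " for name in children)
    return re.fullmatch(pattern, text) is not None

# authors. citation. volume?. month?. year. pages?. (title + descr)?. xrefs?
chain = [
    Factor(("authors",)), Factor(("citation",)), Factor(("volume",), "?"),
    Factor(("month",), "?"), Factor(("year",)), Factor(("pages",), "?"),
    Factor(("title", "descr"), "?"), Factor(("xrefs",), "?"),
]

print(is_chare(chain))
print(matches(chain, ["authors", "citation", "year", "title"]))
print(matches(chain, ["authors", "citation", "year", "title", "descr"]))
```

Because a factor is only a flat disjunction with one multiplicity, an expression such as (author. affiliation?)⁺ cannot be written as a `Factor` at all, which is exactly why it falls outside the class.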

21 CHAREs What's a chain –header. protein. organism. reference*. comment*. genetics*. complex*. function*. classification?. keywords?. feature*. summary. sequence –authors. citation. volume?. month?. year. pages?. (title + descr)?. xrefs? … and what's not –title. (author. affiliation?)⁺. abstract (not a factor) –title. ((author. affiliation)⁺ + (editor. affiliation)⁺). abstract (duplicate element names)

22 CRX run: pre-order relation Sample W = {abccde, cccad, bfeg, bfhi} Pre-order relation →_W: a→b, b→c, c→d, d→e, c→a, a→d, b→f, f→e, e→g, f→h, h→i [Figure: graph of →_W over the nodes a, b, c, d, e, f, g, h, i]

23 CRX run: transitive closure Sample W = {abccde, cccad, bfeg, bfhi} If a →_W b and b →_W c then a →_W c [Figure: graph over the nodes a, b, c, d, e, f, g, h, i with the transitive edges added]

24 CRX run: transitive closure Sample W = {abccde, cccad, bfeg, bfhi} Equivalence class: if a →_W b and b →_W a then a ≈_W b; here {a, b, c} Each symbol occurs in exactly one equivalence class [Figure: graph with a, b, c collapsed into a single node]
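The first steps of the CRX run can be reproduced in a few lines: collect the adjacent-symbol pairs, close them transitively, and group symbols that reach each other. The sketch below assumes that the sample on these slides splits into the four words abccde, cccad, bfeg and bfhi, which is the split consistent with the listed pairs; the function names are illustrative.

```python
from itertools import product

def successor_pairs(sample):
    """a -> b whenever ab occurs in some word; every symbol also reaches itself."""
    pairs = set()
    for word in sample:
        pairs.update((a, a) for a in word)
        pairs.update(zip(word, word[1:]))
    return pairs

def transitive_closure(pairs):
    """Warshall's algorithm; the intermediate symbol k is the outermost loop."""
    closure = set(pairs)
    symbols = sorted({s for pair in pairs for s in pair})
    for k, i, j in product(symbols, repeat=3):
        if (i, k) in closure and (k, j) in closure:
            closure.add((i, j))
    return closure

def equivalence_classes(closure):
    """Group symbols that reach each other in both directions."""
    symbols = {s for pair in closure for s in pair}
    return {
        frozenset(t for t in symbols if (s, t) in closure and (t, s) in closure)
        for s in symbols
    }

# Assumed reading of the sample shown on the slides.
W = ["abccde", "cccad", "bfeg", "bfhi"]
closure = transitive_closure(successor_pairs(W))
classes = equivalence_classes(closure)

print(sorted("".join(sorted(c)) for c in classes))
```

The closure step is cubic in the alphabet size and independent of the sample size, while collecting the pairs is a single pass over the sample.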

25 CRX run: folding Sample W = {abccde, cccad, bfeg, bfhi} Partial order →_W on the equivalence classes –predecessor set: pred(γ) = {γ' | γ' →_W γ} –successor set: succ(γ) = {γ' | γ →_W γ'} [Figure: graph over the nodes {a,b,c}, d, e, f, g, h, i]

26 CRX run: folding Sample W = {abccde, cccad, bfeg, bfhi} Partial order →_W on the equivalence classes –pred(γ) = {γ' | γ' →_W γ} –succ(γ) = {γ' | γ →_W γ'} [Figure: graph after folding, over the nodes {a,b,c}, {d,f}, e, g, h, i]

27 CRX run: multiplicity & RE Sample W = {abccde, cccad, bfeg, bfhi} Multiplicities: {a,b,c}: ⁺, {d,f}: exactly once, e: ?, g: ?, h: ?, i: ? Topological sort gives the Chain Regular Expression (a + b + c)⁺ . (d + f) . e? . g? . h? . i?

28 CRX algorithm: properties Optimality: →_W linearly ordered ⇒ ∀ CHARE r, W ⊆ L(r) and L(r) ⊆ L(r_W): r_W = r Performance: O(||W|| + |Σ|³) Training set size: any CHARE r can be learned from {w | w ∈ L(r) ∧ ∀ w' ∈ L(r): |w| ≤ |w'| + 2}

29 Outline –Goals & motivation –Problem setting –iDTD: Sample → SOA → SORE –CRX: Sample → CHARE –Experiments –Extensions –Conclusions

30 Related work XTRACT [Garofalakis et al. 2000] –Pioneer –More general than iDTD –Focuses on regular expressions that don't occur in real DTDs → no concise schemas Trang: roughly equivalent to CRX –Inconsistent results

31 Data Real world regular expressions –SOREs –Non SOREs Real world data when available Synthetic data otherwise

32 real world data

33 real world regexes

34 Experiments: generalization CRX iDTD no repairs

35 Experiments: generalization CRX iDTD

36 Outline –Goals & motivation –Problem setting –iDTD: Sample → SOA → SORE –CRX: Sample → CHARE –Experiments –Extensions –Conclusions

37 Extensions Incremental computation –new data → update internal representation (SOA or partial order) Noise –Support for element name too small → ignore element –SOA: support for edges too small → delete edges before repair Numerical predicates –Bookkeeping: minOccurs, maxOccurs Generating XSDs –Infer data types (integer, double, date,…)
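The incremental and noise-handling ideas combine naturally if the SOA keeps a support count per edge. The sketch below assumes that the support of an edge is the number of words in which it occurs and that pruning is a simple threshold; the slide does not spell out either choice, and the class and parameter names are hypothetical.

```python
from collections import Counter

# Hypothetical marker states for the start and end of a word.
SRC, SINK = "<src>", "<sink>"

class IncrementalSOA:
    """SOA edge set with per-edge support, updated one word at a time."""

    def __init__(self):
        self.edge_support = Counter()  # edge -> number of words containing it
        self.words_seen = 0

    def add(self, word):
        """Update the internal representation with one new word."""
        path = [SRC, *word, SINK]
        # Count each edge once per word, however often it repeats inside it.
        self.edge_support.update(set(zip(path, path[1:])))
        self.words_seen += 1

    def edges(self, min_support=1):
        """Edges whose support reaches the threshold; the rest count as noise."""
        return {e for e, n in self.edge_support.items() if n >= min_support}

soa = IncrementalSOA()
for word in ["abc"] * 5 + ["acb"]:   # five regular words, one noisy one
    soa.add(word)

print(sorted(soa.edges()))               # all observed edges
print(sorted(soa.edges(min_support=2)))  # edges kept after pruning
```

Pruning before the rewrite step keeps a single malformed document from forcing a more general expression than the bulk of the data supports.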

38 Outline –Goals & motivation –Problem setting –iDTD: Sample → SOA → SORE –CRX: Sample → CHARE –Experiments –Extensions –Conclusions

39 iDTD + CRX –learns robust class of regexes from positive examples –complete in their target class for sufficient data –deals with insufficient data –performs well on real world data –runs efficiently Future work: inferring XML Schemas

