Inference of Concise DTDs from XML data Geert Jan Bex 1 Frank Neven 1 Thomas Schwentick 2 Karl Tuyls 3 1 Hasselt University and Transnational University.

Presentation transcript:

Inference of Concise DTDs from XML data
Geert Jan Bex (1), Frank Neven (1), Thomas Schwentick (2), Karl Tuyls (3)
(1) Hasselt University and Transnational University of Limburg
(2) Dortmund University
(3) Maastricht University and Transnational University of Limburg

Outline
Goals & motivation
Problem setting
iDTD: Sample → SOA → SORE
CRX: Sample → CHARE
Experiments
Extensions
Conclusions

Aims & requirements
Problem: infer DTD from XML corpus
Requirements:
– Concise: humans can interpret/validate
– Work on large data sets
– Work on small data sets
– Robust to noise
[Figure: XML corpus → DTD]

Why DTD inference?
Schema inference
– ≈ 50% of XML documents: no schema [Barbosa et al. 2005]
– ≈ 66% of DTDs and XSDs: not valid [Bex et al. 2005]
– Improving existing schemas
– “Noisy” XML documents: ≈ 90% of XHTML docs not valid
Related work
– Fails on real-world, large data sets
– Results not concise

Why schemas?
Validation: efficiency, security
Optimization: search, processing
Static analysis, type checking (e.g., XQuery)
Software development: modeling, OR-mapping
Integration: (meta-)data sources
Schema matching
Semantics

XML documents
[Figure: two book elements, one with children title, author, author, year and one with children title, editor, year, isbn]
Learning regular expression from set of strings:
title (author⁺ + editor⁺) year isbn?

Learning automata?
Well studied, but…
Learning automata ≠ learning regular expressions
((b? (a + c))⁺ d)⁺ e

Learning regular languages?
S = { abbb, abbd, acd, ac }
– abbb + abbd + acd + ac: most specific regex for S
– a (b* + c) d?: somewhere in between ???
– (a + b + c + d)*: most general regex for S
abbb + abbd + acd + ac < a (b* + c) d? < (a + b + c + d)*
Generalization vs. specificity
Positive examples only!
Impossible… in general
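The tension between specificity and generalization can be made concrete by counting how many strings each candidate accepts. The following is a minimal sketch, assuming Python's `re` syntax (the slide's union `+` is written `|`) and an arbitrary length bound of 4 chosen purely for illustration; the helper name `accepted` is hypothetical.

```python
import re
from itertools import product

S = ["abbb", "abbd", "acd", "ac"]

# The three candidates from the slide, translated to Python regex syntax.
candidates = {
    "most specific": r"abbb|abbd|acd|ac",
    "in between": r"a(b*|c)d?",
    "most general": r"(a|b|c|d)*",
}

def accepted(pattern, alphabet="abcd", max_len=4):
    """Count the strings over `alphabet` of length <= max_len matching `pattern`."""
    return sum(
        re.fullmatch(pattern, "".join(chars)) is not None
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
    )

# Every candidate covers the sample, yet they differ widely in how much else they accept.
covers = all(re.fullmatch(p, w) for p in candidates.values() for w in S)
counts = {name: accepted(p) for name, p in candidates.items()}
```

All three expressions are consistent with the positive examples, so the sample alone cannot single one out; some restriction on the target class is needed.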

Subclasses
Single Occurrence Regular Expressions (SOREs)
– 99% of regular expressions in DTDs/XSDs
– Infer with iDTD
CHAin Regular Expressions (CHAREs)
– 90% of regular expressions in DTDs/XSDs
– Infer with CRX

SOREs
What’s a SORE
– header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence
– authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs?
– title . (author . affiliation?)⁺ . abstract
… and what’s not
– title . ((author . affiliation)⁺ + (editor . affiliation)⁺) . abstract (duplicate element names)
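The single-occurrence property illustrated above is easy to test mechanically. A minimal sketch, assuming the expression is already syntactically valid and that element names are plain identifiers; the function name `is_single_occurrence` is hypothetical, and the one-or-more operator is written as an ASCII `+` suffix.

```python
import re
from collections import Counter

def is_single_occurrence(expr):
    """True if every element name occurs at most once in `expr`."""
    names = re.findall(r"[A-Za-z_][\w-]*", expr)  # operators and brackets are skipped
    return all(n == 1 for n in Counter(names).values())

good = "title . (author . affiliation?)+ . abstract"
bad = "title . ((author . affiliation)+ + (editor . affiliation)+) . abstract"

good_ok = is_single_occurrence(good)  # each name appears once
bad_ok = is_single_occurrence(bad)    # 'affiliation' appears twice
```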

Sample → SOA
W = {bacacdacde, cbacdbacde, abccaadcde}
[Figure: Single Occurrence Automaton (SOA) with states b, a, c, d, e]
Algorithm: 2T-INF [Garcia & Vidal 1990]

Sample → SOA
SOA size
– |Σ| + 2 states
– O(|Σ|²) transitions
Complexity of algorithm
– O(||W||)
– streaming
Algorithm sound
– W ⊆ L(SOA)
– in general: |S| << |L(SOA)|
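The construction can be sketched in a few lines. This is an illustrative reading of the 2T-INF idea rather than the authors' implementation: one state per symbol plus a source and a sink, and one edge per pair of adjacent symbols seen in the sample. The names `build_soa`, `accepts`, `SRC` and `SNK` are hypothetical, and element names are assumed to be single characters.

```python
SRC, SNK = "<src>", "<snk>"  # artificial source and sink states

def build_soa(sample):
    """Collect the SOA edges in one pass over the sample (so it can be streamed)."""
    edges = set()
    for word in sample:
        path = [SRC, *word, SNK]
        edges.update(zip(path, path[1:]))
    states = {SRC, SNK} | {s for edge in edges for s in edge}
    return states, edges

def accepts(edges, word):
    """A word is accepted iff every adjacent pair along its path is an edge."""
    path = [SRC, *word, SNK]
    return all(pair in edges for pair in zip(path, path[1:]))

W = ["bacacdacde", "cbacdbacde", "abccaadcde"]
states, edges = build_soa(W)
```

With this sample the automaton has |Σ| + 2 = 7 states and accepts every word of W, but it also accepts words such as `ade` that were never observed, which is the generalization the slide refers to.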

SOA → SORE: REWRITE
[Figure: SOA over b, a, c, d, e, rewritten step by step]
– optional: b → b?
– disjunction: a, c → a + c
– concatenation: b?, a + c → b? (a + c)
– self-loop: b? (a + c) → (b? (a + c))⁺
– result: ((b? (a + c))⁺ d)⁺ e

REWRITE: properties
Theorem
– REWRITE transforms SOA into equivalent SORE for sufficient data, reports failure otherwise (sound & complete)
– Complexity: O(|Σ|⁴)
SORE size
– |Σ| symbols
– O(|Σ|) operators
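The equivalence claim can be spot-checked for the running example. The sketch below does not implement REWRITE; it only compares the SOA built from W = {bacacdacde, cbacdbacde, abccaadcde} against the expression ((b? (a + c))⁺ d)⁺ e on all strings up to an arbitrary length bound of 5. The helper names and the bound are assumptions made for illustration.

```python
import re
from itertools import product

SRC, SNK = "<src>", "<snk>"
W = ["bacacdacde", "cbacdbacde", "abccaadcde"]

# SOA edges: every pair of adjacent symbols, plus source/sink edges.
edges = set()
for word in W:
    path = [SRC, *word, SNK]
    edges.update(zip(path, path[1:]))

def soa_accepts(word):
    path = [SRC, *word, SNK]
    return all(pair in edges for pair in zip(path, path[1:]))

# The SORE from the slides in Python regex syntax (union '+' becomes '|').
SORE = r"((b?(a|c))+d)+e"

def sore_accepts(word):
    return re.fullmatch(SORE, word) is not None

# Exhaustive comparison on all strings over {a, b, c, d, e} of length <= 5.
words = ("".join(t) for n in range(6) for t in product("abcde", repeat=n))
disagreements = [w for w in words if soa_accepts(w) != sore_accepts(w)]
```

An empty `disagreements` list is only evidence up to the chosen bound, not a proof of equivalence.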

REWRITE + repairs = iDTD
W = {bacacdacde, cbacdbacde}
[Figure: SOA over b, a, c, d, e] no rules apply!!!
Almost disjunction: a, c
Fix: enable-disjunction, enable-optional
→ ((b? (a + c))⁺ d)⁺ e

iDTD: properties
Theorem
– iDTD transforms SOA into SORE such that L(SOA) ⊆ L(SORE)
iDTD can be parameterized for performance

CHAREs
CRX derives CHAin Regular Expressions (CRX: Chain Regular expression eXtraction)
Definition: A chain regular expression is a sequence of factors f_1, …, f_n such that no alphabet symbol occurs more than once and a factor is one of
– (a_1 + … + a_k)
– (a_1 + … + a_k)?
– (a_1 + … + a_k)⁺
– (a_1 + … + a_k)*

CHAREs
What’s a chain
– header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence
– authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs?
… and what’s not
– title . (author . affiliation?)⁺ . abstract (not a factor)
– title . ((author . affiliation)⁺ + (editor . affiliation)⁺) . abstract (duplicate element names)

CRX run: pre-order relation
Sample W = {abccde, cccad, bfeg, bfhi}
Pre-order relation →_W: a→b, b→c, c→d, d→e, c→a, a→d, b→f, f→e, e→g, f→h, h→i
[Figure: graph of →_W over a, b, c, d, e, f, g, h, i]

CRX run: transitive closure
Sample W = {abccde, cccad, bfeg, bfhi}
If a →_W b and b →_W c then a →_W c
[Figure: graph of →_W over a, b, c, d, e, f, g, h, i]

CRX run: transitive closure
Sample W = {abccde, cccad, bfeg, bfhi}
If a →_W b and b →_W a then a ≈_W b
Equivalence class: {a, b, c}
Symbol occurs in exactly one equivalence class
[Figure: graph with nodes {a, b, c}, d, e, f, g, h, i]

CRX run: folding
Sample W = {abccde, cccad, bfeg, bfhi}
Partial order →_W on equivalence classes
Predecessor set: pred(γ) = {γ′ | γ′ →_W γ}
Successor set: succ(γ) = {γ′ | γ →_W γ′}
[Figure: graph with nodes {a, b, c}, d, e, f, g, h, i]

CRX run: folding
Sample W = {abccde, cccad, bfeg, bfhi}
Partial order →_W on equivalence classes
pred(γ) = {γ′ | γ′ →_W γ}
succ(γ) = {γ′ | γ →_W γ′}
[Figure: graph after folding, with nodes {a, b, c}, {d, f}, e, g, h, i]

CRX run: multiplicity & RE
Sample W = {abccde, cccad, bfeg, bfhi}
Multiplicities: {a, b, c}: ⁺; {d, f}: exactly one; e, g, h, i: ?
Topological sort → Chain Regular Expression: (a + b + c)⁺ . (d + f) . e? . g? . h? . i?
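The run above can be approximated in code. The sketch below is a deliberate simplification, not the CRX algorithm itself: it computes the equivalence classes, a topological order and the multiplicities, but omits the folding step, so d and f stay separate optional factors instead of being merged into (d + f). The function names are hypothetical, element names are assumed to be single characters, and `graphlib` requires Python 3.9 or later.

```python
import re
from graphlib import TopologicalSorter  # Python 3.9+

def crx_no_folding(sample):
    symbols = sorted({s for w in sample for s in w})
    succ = {s: set() for s in symbols}          # a -> b if "ab" occurs in some word
    for w in sample:
        for a, b in zip(w, w[1:]):
            succ[a].add(b)

    def reachable(start):                       # reflexive-transitive closure
        seen, stack = {start}, [start]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    closure = {s: reachable(s) for s in symbols}
    # Two symbols are equivalent if each reaches the other.
    cls = {s: frozenset(t for t in closure[s] if s in closure[t]) for s in symbols}

    preds = {c: set() for c in cls.values()}    # predecessor classes, for sorting
    for a in symbols:
        for b in succ[a]:
            if cls[a] != cls[b]:
                preds[cls[b]].add(cls[a])

    factors = []
    for c in TopologicalSorter(preds).static_order():
        counts = [sum(ch in c for ch in w) for w in sample]
        always, at_most_once = min(counts) >= 1, max(counts) <= 1
        quantifier = {(True, True): "", (False, True): "?",
                      (True, False): "+", (False, False): "*"}[always, at_most_once]
        factors.append((sorted(c), quantifier))
    return factors

def to_python_regex(factors):
    """Render the factors as a Python regex, one character class per factor."""
    return "".join("[" + "".join(syms) + "]" + q for syms, q in factors)

W = ["abccde", "cccad", "bfeg", "bfhi"]
factors = crx_no_folding(W)
pattern = to_python_regex(factors)
```

The first factor is the class {a, b, c} with multiplicity `+`; the order of the remaining factors depends on which topological order is chosen, but every such order yields an expression that accepts all of W.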

CRX algorithm: properties
Optimality: →_W linearly ordered ⇒ ∀ CHARE r with W ⊆ L(r) and L(r) ⊆ L(r_W): r_W = r
Performance: O(||W|| + |Σ|³)
Training set size: any CHARE r can be learned from {w | w ∈ L(r) ∧ ∀ w′ ∈ L(r): |w| ≤ |w′| + 2}

Related work
XTRACT [Garofalakis et al. 2000]
– Pioneer
– More general than iDTD
– Focuses on regular expressions that don’t occur in real DTDs → no concise schemas
Trang: roughly equivalent to CRX
– Inconsistent results

Data
Real world regular expressions
– SOREs
– Non SOREs
Real world data when available
Synthetic data otherwise

real world data

real world regexes

Experiments: generalization
[Figure: CRX and iDTD, no repairs]

Experiments: generalization
[Figure: CRX and iDTD]

Extensions
Incremental computation
– new data → update internal representation (SOA or partial order)
Noise
– Support for element name too small → ignore element
– SOA: support for edges too small → delete edges before repair
Numerical predicates
– Bookkeeping: minOccurs, maxOccurs
Generating XSDs
– Infer data types (integer, double, date,…)
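The noise-handling idea for SOA edges might look as follows. This is a sketch under stated assumptions: "support" is taken here to mean the number of sample words in which an edge occurs, which the slides do not specify, and the names `edge_support`, `prune` and `min_support` are hypothetical.

```python
from collections import Counter

SRC, SNK = "<src>", "<snk>"

def edge_support(sample):
    """Count, for every SOA edge, the number of words in which it occurs."""
    support = Counter()
    for word in sample:
        path = [SRC, *word, SNK]
        support.update(set(zip(path, path[1:])))  # each word counts once per edge
    return support

def prune(support, min_support):
    """Keep only the edges whose support reaches the threshold."""
    return {edge for edge, n in support.items() if n >= min_support}

sample = ["ab", "ab", "ab", "ba"]   # "ba" plays the role of a noisy document
support = edge_support(sample)
kept = prune(support, min_support=2)
```

Because the counter can be updated with further words at any time, the same bookkeeping also fits the incremental setting: new data only increments the counts, and pruning is applied just before the repair and rewrite steps.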

Conclusions
iDTD + CRX
– learn a robust class of regexes from positive examples
– are complete in their target class for sufficient data
– deal with insufficient data
– perform well on real world data
– run efficiently
Future work: inferring XML Schemas