POPL 2006 The Next 700 Data Description Languages Yitzhak Mandelbaum, David Walker Princeton University Kathleen Fisher AT&T Labs Research.

Slides:



Advertisements
Similar presentations
Configuration management
Advertisements

CH-4 Ontologies, Querying and Data Integration. Introduction to RDF(S) RDF stands for Resource Description Framework. RDF is a standard for describing.
C O N T E X T - F R E E LANGUAGES ( use a grammar to describe a language) 1.
1 How to transform an analyzer into a verifier. 2 OUTLINE OF THE LECTURE a verification technique which combines abstract interpretation and Park’s fixpoint.
1 XML Data Management Course Outline and Organisation Werner Nutt.
Kathleen Fisher AT&T Labs Research Yitzhak Mandelbaum, David Walker Princeton The Next 700 Data Description Languages.
DSLs: The Good, the Bad, and the Ugly Kathleen Fisher AT&T Labs Research.
2440: 211 Interactive Web Programming JavaScript Fundamentals.
ISBN Chapter 3 Describing Syntax and Semantics.
CS 355 – Programming Languages
From Dirt to Shovels: Automatic Tool Generation for Ad Hoc Data David Walker Princeton University with David Burke, Kathleen Fisher, Peter White & Kenny.
The KB on its way to Web 2.0 Lower the barrier for users to remix the output of services. Theo van Veen, ELAG 2006, April 26.
--What is a Database--1 What is a database What is a Database.
David Walker Princeton University Computer Science Pads: Simplified Data Processing For Scientists.
CS 290C: Formal Models for Web Software Lecture 10: Language Based Modeling and Analysis of Navigation Errors Instructor: Tevfik Bultan.
David Walker Princeton University In Collaboration with AT&T Research Pads: Simplified Data Processing For Scientists.
Pads/ML: A Functional Data Description Language David Walker Princeton University with: Yitzhak Mandelbaum (Princeton), Kathleen Fisher and Mary Fernandez.
Kathleen Fisher AT&T Labs Research PADS: A System for Managing Ad Hoc Data.
From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data David Walker Pamela Dragosh Mary Fernandez Kathleen Fisher Andrew Forrest Bob Gruber.
Input Validation For Free Text Fields ADD Project Members: Hagar Offer & Ran Mor Academic Advisor: Dr Gera Weiss Technical Advisors: Raffi Lipkin & Nadav.
David Walker Princeton University In Collaboration with AT&T Research Pads: Simplified Data Processing For Scientists.
CS 330 Programming Languages 09 / 16 / 2008 Instructor: Michael Eckmann.
Describing Syntax and Semantics
CSC 8310 Programming Languages Meeting 2 September 2/3, 2014.
Knowledge Mediation in the WWW based on Labelled DAGs with Attached Constraints Jutta Eusterbrock WebTechnology GmbH.
Pemrograman Berbasis WEB XML part 2 -Aurelio Rahmadian- Sumber: w3cschools.com.
RDF (Resource Description Framework) Why?. XML XML is a metalanguage that allows users to define markup XML separates content and structure from formatting.
© Janice Regan, CMPT 128, Jan CMPT 128 Introduction to Computing Science for Engineering Students Creating a program.
1 Introduction to Parsing Lecture 5. 2 Outline Regular languages revisited Parser overview Context-free grammars (CFG’s) Derivations.
Chapter 6 Text and Multimedia Languages and Properties
EARTH SCIENCE MARKUP LANGUAGE “Define Once Use Anywhere” INFORMATION TECHNOLOGY AND SYSTEMS CENTER UNIVERSITY OF ALABAMA IN HUNTSVILLE.
Chapter 1 Introduction Dr. Frank Lee. 1.1 Why Study Compiler? To write more efficient code in a high-level language To provide solid foundation in parsing.
1 XML Data Management Course Outline and Organisation Werner Nutt.
Processing of structured documents Spring 2002, Part 2 Helena Ahonen-Myka.
Configuration Management (CM)
Design and Programming Chapter 7 Applied Software Project Management, Stellman & Greene See also:
Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Machine-independent code improvement Target code generation Machine-specific.
Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
EARTH SCIENCE MARKUP LANGUAGE Why do you need it? How can it help you? INFORMATION TECHNOLOGY AND SYSTEMS CENTER UNIVERSITY OF ALABAMA IN HUNTSVILLE.
ISBN Chapter 3 Describing Semantics -Attribute Grammars -Dynamic Semantics.
CS 363 Comparative Programming Languages Semantics.
Unit-1 Introduction Prepared by: Prof. Harish I Rathod
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
PADS: Processing Arbitrary Data Streams Kathleen Fisher Robert Gruber.
Copyright OpenHelix. No use or reproduction without express written consent1.
1. 2 Preface In the time since the 1986 edition of this book, the world of compiler design has changed significantly 3.
Overview of Previous Lesson(s) Over View  A program must be translated into a form in which it can be executed by a computer.  The software systems.
1 Compiler Design (40-414)  Main Text Book: Compilers: Principles, Techniques & Tools, 2 nd ed., Aho, Lam, Sethi, and Ullman, 2007  Evaluation:  Midterm.
User Profiling using Semantic Web Group members: Ashwin Somaiah Asha Stephen Charlie Sudharshan Reddy.
Problems with XML & XML Schemas XML falls apart on the Scalability design goal. 1.The order in which elements appear in an XML document is significant.
Scientific Debugging. Errors in Software Errors are unexpected behaviors or outputs in programs As long as software is developed by humans, it will contain.
Kathleen Fisher AT&T Labs Research PADS: A System for Managing Ad Hoc Data.
The PADS-Galax Project Enabling XQuery over Ad-hoc Data Sources Yitzhak Mandelbaum.
The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker.
Differences and distinctions: metadata types and their uses Stephen Winch Information Architecture Officer, SLIC.
CS223: Software Engineering
Overview of Compilation Prepared by Manuel E. Bermúdez, Ph.D. Associate Professor University of Florida Programming Language Principles Lecture 2.
Jackson, Web Technologies: A Computer Science Perspective, © 2007 Prentice-Hall, Inc. All rights reserved Chapter 7 Representing Web Data:
Lecture #1: Introduction to Algorithms and Problem Solving Dr. Hmood Al-Dossari King Saud University Department of Computer Science 6 February 2012.
XML and Distributed Applications By Quddus Chong Presentation for CS551 – Fall 2001.
PADL 2008 A Generic Programming Toolkit for PADS/ML Mary Fernández, Kathleen Fisher, Yitzhak Mandelbaum AT&T Labs Research J. Nathan Foster, Michael Greenberg.
Unit 4 Representing Web Data: XML
Vocabulary Prototype: A preliminary sketch of an idea or model for something new. It’s the original drawing from which something real might be built or.
Chapter 7 Representing Web Data: XML
What is an Ontology An ontology is a set of terms, relationships and definitions that capture the knowledge of a certain domain. (common ontology ≠ common.
COMP 150-IDS: Internet Scale Distributed Systems (Spring 2016)
Announcements Quiz 5 HW6 due October 23
Compilers Principles, Techniques, & Tools Taught by Jing Zhang
Presentation transcript:

POPL 2006 The Next 700 Data Description Languages Yitzhak Mandelbaum, David Walker Princeton University Kathleen Fisher AT&T Labs Research

POPL Data, data everywhere! Incredible amounts of data stored in well-behaved formats: Tools Schema Browsers Query languages Standards Libraries Books, documentation Conversion tools Vendor support Consultants… Databases: XML:

POPL Ad hoc data Vast amounts of data in ad hoc formats. Ad hoc data is semi-structured: –Not free text. –Not as structured as XML. –Different than PL syntax. Examples from many different areas: –Data mining –Consumer electronics –Computer science –Computational biology –Finance –More!

POPL Ad Hoc Data in Biology format-version: 1.0 date: 11:11: :24 auto-generated-by: DAG-Edit rev 3 default-namespace: gene_ontology subsetdef: goslim_goa "GOA and proteome slim" [Term] id: GO: name: mitochondrion inheritance namespace: biological_process def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID: ,PMID: , SGD:mcc] is_a: GO: ! organelle inheritance is_a: GO: ! mitochondrion distribution

POPL Ad Hoc Data in Finance HA START OF TEST CYCLE aA BXYZ U1AB B HL START OF OPEN INTEREST d FZYX G1AB HM END OF OPEN INTEREST HE START OF SUMMARY f NYZX B1QB B B HF END OF SUMMARY k LYXW B1KB G HB END OF TEST CYCLE

POPL Ad Hoc Data from Web Server Logs (CLF) [15/Oct/1997:18:46: ] "GET /tk/p.txt HTTP/1.0" tj62.aol.com - - [16/Oct/1997:14:32: ] "POST HTTP/1.0" [15/Oct/1997:18:53: ] "GET /tr/img/gift.gif HTTP/1.0” [15/Oct/1997:18:39: ] "GET /tr/img/wool.gif HTTP/1.0" [16/Oct/1997:12:59: ] "GET / HTTP/1.0" ekf - [17/Oct/1997:10:08: ] "GET /img/new.gif HTTP/1.0" 304 -

POPL Ad Hoc Data: DNS packets : 9192 d8fb d r : f 6d00 esearch.att.com : 00fc 0001 c00c e ' : 036e 7331 c00c 0a68 6f73 746d ns1...hostmaste : 72c0 0c77 64e e r..wd.I : 36ee e 10c0 0c00 0f e : a00 0a05 6c69 6e75 78c0 0cc0 0c linux : 0f e c00 0a07 6d61 696c mail : 6d61 6ec0 0cc0 0c e 1000 man : 0487 cf1a 16c0 0c e a0: e73 30c0 0cc0 0c e..ns b0: c0 2e03 5f67 63c0 0c _gc...! c0: d c c X.....d...phys d0: f.research.att.co

POPL Challenges of Ad hoc Data Data arrives “as is.” Documentation is often out-of-date or nonexistent. –Hijacked fields. –Undocumented “missing value” representations. Data is buggy. –Missing data, “extra” data, … –Human error, malfunctioning machines, software bugs (e.g. race conditions on log entries), … –Errors are sometimes the most interesting portion of the data.

POPL Data Description Languages Binary Format Description Language (BFD) Data Format Description Language (DFDL) PacketTypes (SIGCOMM ‘00) DataScript (GPCE ‘02) Erlang binaries (ESOP ‘04) PADS (PLDI ‘05)

POPL Describing Data with Types Types can simultaneously describe both external and internal forms of data.

POPL The Next 700 DDLs Develop semantic framework to understand data description languages and guide future development. Explain other languages with Data Description Calculus (DDC). Give denotational semantics to DDC. PacketTypes PADS Datascript DDC

POPL Data Description Calculus DDC: calculus of dependent types for describing data. –Base types: atomic pieces of data. –Type constructors: richer structures. Specify well-formed types with kinding judgment. t ::= unit | bottom | C(e) |  x:t.t | t + t | t & t | {x:t | e} | t seq(t,e,t) |  | .t | x.e | t e | compute (e:  ) | absorb(t) | scan(t)

POPL Base Types and Pairs Base types: C (e). –Abstract. –Parameterized by host-language expression. –Concrete examples: int, char, stringFW(n), stringUntil(c). Ordered pairs: t * t’ and  x:t.t’. –Variable x gives name to first value in pair. –Use t * t’ when x does not appear in t’.

POPL Joe|Wright|45|95|79 n/aEd|Wood|10|47|31 124Chris|Nolan|80|93|85 Burton|30|82|71 126George|Lucas|32|62|40 Base Types and Pairs Tim int* stringUntil(‘|’)* char 125|

POPL C Programming31Types and Programming Languages20Twenty Years of PLDI36Modern Compiler Implementation in ML 27Elements o f ML Programming Base Types and Pairs 13C Programming  length:.stringFW(length)int

POPL Constraints Constrained types: {x:t | e}. –Enforce the constraint e on the underlying type t. BurtonTim125|30| {c:char | c = ‘|’}  min:int.S c (‘|’) *  max:{m:int | min ≤ m}.S c (‘|’) * {avg:int | min ≤ avg & avg ≤ max} 8271|| char S c (‘|’)

POPL Unions t + t’ : deterministic or : try t; on failure, try t’. Example: 122Joe|Wright|45|95|79 n/aEd|Wood|10|47|31 124Chris|Nolan|80|93|85 125Tim|Burton|30|82|71 126George|Lucas|32|62|40 S s (“n/a”) + int n/aEd|Wood|10|47|31 124Chris|Nolan|80|93|8 5

POPL Absorb, Compute and Scan absorb(t) : consume data from source; produce nothing. compute(e:  ) : consume nothing; output result of computation e. scan(t) : scan data source for type t. scan(S c (‘|’))  width:int.S c (‘|’) *  length:int. compute(width  length:int) absorb(S c (‘|’)) ^%$!&_ 10|12 | | 120

POPL Choosing a Semantics Set semantics? But fails to account for: –Relationship between internal and external data. –Error handling. –Types of representation and parse descriptor. Parsing semantics. Representation semantics.

POPL Semantics Overview t Parser Representation Parse Descriptor Description [ t ] [ t ] rep [ t ] pd Interpretations of t[ {x:t | e} ] rep = [ t ] rep + [ t ] rep [  x:t.t’ ] rep = [ t ] rep * [ t’ ] rep [ {x:t | e} ] pd = hdr * [ t ] pd [  x:t.t’ ] pd = hdr * [ t ] pd * [ t’ ] pd

POPL Type Correctness t Parser Representation Parse Descriptor Description Theorem [ t ] : data  [ t ] rep * [ t ] pd [ t ] [ t ] rep [ t ] pd Interpretations of t

POPL Error Correctness x x Parser Theorem: Parsers report errors accurately. –Errors recorded in parse descriptor correspond to errors in data. –Parsers check all semantic constraints. –More …

POPL Encoding DDLs in DDC We distill idealized version of PADS (iPADS) and formalize its encoding in DDC. –PADS manual: ~350 pages –iPADS encoding : 1/2 page Majority of other languages encoded in DDC. –Either has direct parallel in iPADS, –Or, we present encodings directly. Remaining cases could be handled with straightforward extension to DDC.

POPL Semantics: Practical Benefits Crystallized invariants - uncovered bugs. –Semantics forced us to formalize the error counting methodology. Guided our design decisions in the subsequent implementation efforts. –Added recursion in PADS. –Designing version of PADS for O’Caml. Helped us revisit design decisions from new perspective.

POPL Future work What set of languages are recognized by DDC? How does the expressive power of DDC relate to CFGs and regular expressions? Add polymorphism to DDC. PADS/ML: next generation of PADS.

POPL Summary Type-based data description languages are well- suited to describing ad hoc data. No one DDL will ever be right - a la Landin, different domains and applications will demand different languages with differing levels of expressiveness and abstraction. Our work defines the first semantics for data description languages. For more information, visit

POPL The End

POPL Cut slides follow

POPL Existing Approaches Most people use C,Perl, or shell scripts. Time consuming & error prone to hand code parsers. –Binary data particularly so: Programs break in subtle and machine-specific ways (endien-ness, word-sizes). Difficult to maintain (worse than the ad hoc data itself in some cases!). Often incomplete, particularly with respect to errors. –Error code, if written, swamps main-line computation. If not written, errors can corrupt “good” data.

POPL Type Kinding Kinding ensures types are well formed.  |- t :type ,x:s |- t’ : type  |-  x:t.t’: type  |- t : type  |- t’ : type  |- t + t’: type  |- t : type ,x:s |- e : bool (s = …)  |- {x:t | e}: type

POPL Parsing Semantics Each type interpreted as a parsing function in the polymorphic -calculus. –[  ] : DDC Type  Function –Input data and offset, output new offset, value and parse descriptor. [  x:t 1.t 2 ] = (bits,offset 0 ). let (offset 1,data 1,pd 1 ) = [ t 1 ](bits,offset 0 ) in let x = (data 1,pd 1 ) in let (offset 2,data 2,pd 2 ) = [ t 2 ](bits,offset 1 ) in (offset 2, R  (data 1,data 2 ), P  (pd 1,pd 2 ))

POPL Representation Semantics Each type interpreted as a two types in the host language (polymorphic -calculus). –[  ] rep : type of parsed data. –[  ] pd : type of parse descriptor (meta-data). Examples: –[ C(e) ] rep = HostType(C) [ C(e) ] pd = hdr * unit –[  x:t.t’ ] rep = [ t ] rep * [ t’ ] rep [  x:t.t’ ] pd = hdr * [ t ] pd * [ t’ ] pd

POPL Properties of the Calculus Theorem: If  |- t : k then –[t] = F well formed types yield parsers –  |- F : bits * offset  offset * [t] rep * [t] pd a t-Parser returns values with types that correspond to t. Theorem: Parsers report errors accurately. –Errors in parse descriptor correspond to errors found in data. –Parsers check all semantic constraints. –More …

POPL Representation Semantics Parsers produce values with following type in the host language: DDCHost Language [C(e)] rep I(C) + none [unit] rep unit [  x:T.T’] rep [T] rep * [T’] rep [T + T’] rep [T] rep + [T’] rep [{x:T | e}] rep [T] rep + [T] rep [absorb(T)] rep unit + none unrecoverable error error dependency erased Base Types Pairs Union Set types ok

POPL Representation Semantics Parse descriptors have the following type in the host language: DDCHost Language [C(e)] pd hdr * unit [unit] pd hdr * unit [  x:T.T’] pd hdr * [T] pd * [T’] pd [T + T’] pd hdr * ([T] pd + [T’] pd ) [{x:T | e}] pd hdr * [T] pd [absorb(T)] pd hdr * unit Base Types Pairs Union Set types