Presentation is loading. Please wait.

Presentation is loading. Please wait.

The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker.

Similar presentations


Presentation on theme: "The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker."— Presentation transcript:

1 The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

2 “The Next 700 …” Program(s)Data Format(s) Programming Language(s) PL Semantics Data Description Language(s) DDL Semantics

3 What Data Needs Describing? There's much data in databases and common formats like XML; there’s much data that’s ad hoc. Ad hoc data lacks readily available parsing, querying, analysis or transformation tools It’s all over the place: financial, telecomm, chemistry, physics, biology, etc.

4 Ad Hoc Data in Biology !autogenerated-by: DAG-Edit version 1.419 rev 3 !saved-by: gocvs !date: Fri Mar 18 21:00:28 PST 2005 !version: $Revision: 3.223 $ !type: % is_a is a !type: < part_of part of !type: ^ inverse_of inverse of !type: | disjoint_from disjoint from $Gene_Ontology ; GO:0003673 <biological_process ; GO:0008150 %behavior ; GO:0007610 ; synonym:behaviour %adult behavior ; GO:0030534 ; synonym:adult behaviour %adult feeding behavior ; GO:0008343 ; synonym:adult feeding behaviour % feeding behavior ; GO:0007631 %adult locomotory behavior ; GO:0008344 ;... from www.geneontology.org

5 Ad Hoc Data in Chemistry O=C([C@@H]2OC(C)=O)[C@@]3(C)[C@]([C@](CO4) (OC(C)=O)[C@H]4C[C@@H]3O)([H])[C@H] (OC(C7=CC=CC=C7)=O)[C@@]1(O)[C@@](C)(C)C2=C(C) [C@@H](OC([C@H](O)[C@@H](NC(C6=CC=CC=C6)=O) C5=CC=CC=C5)=O)C1

6 Ad Hoc Data from Web Server Logs (CLF) 207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30 tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941

7 Ad Hoc Data: DNS packets 00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872...............r 00000010: 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com. 00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027...............' 00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465.ns1...hostmaste 00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I......... 00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6............... 00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00......linux..... 00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c............mail 00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000 man............. 00000090: 0487 cf1a 16c0 0c00 0200 0100 000e 1000................ 000000a0: 0603 6e73 30c0 0cc0 0c00 0200 0100 000e..ns0........... 000000b0: 1000 02c0 2e03 5f67 63c0 0c00 2100 0100......_gc...!... 000000c0: 0002 5800 1d00 0000 640c c404 7068 7973..X.....d...phys 000000d0: 0872 6573 6561 7263 6803 6174 7403 636f.research.att.co

8 Data Description Languages Data description languages describe many ad hoc formats and provide the following features: –Descriptions serves as documentation, including semantic of data –Compiler generates tools from description: parser, printer, query engine, converter to XML, statistical profiler, etc. –Parser includes robust error detection and recovery. –Parsers can handle high data volume. > 1GB/second Netflow traffic from Cisco routers.

9 Many Data Description Languages Logical Descriptions –ASN.1 –ASDL Physical Descriptions –PacketTypes (SIGCOMM ‘00) –DataScript (GPCE ‘02) –PADS (PLDI ‘05) Basis for current work 001010101 101001001 111001010 001010100 001010101 101001001 111001010 001010100 Logical Physical

10 Contributions A core data description calculus (DDC) –Based on dependent type theory –Simple, orthogonal, composable types –Types are transducers from external data source to internal data representation. Encodings of high-level DDLs in low-level DDC –Explain semantics of PADS language in particular. PacketTypes PADS Datascript DDC

11 Base Types and Sequences C(e): base type can be parameterized by expression e.  x:T.T’: dependent product describes sequence of values. –Variable x gives name to first value in sequence. Examples: “123hello|”int * string(‘|’) * char(123, “hello”, ‘|’) “3513”  width:int_fw(1). int_fw(width) (3,513) “:hello:”  term:char.string(term) * char (‘:’,“hello”,‘:’)

12 Constraints {x:T | e}: set types allow you to constrain the type T and express relationships between elements of the data. Examples: ‘a’{c:char | c = ‘a’} (abbrev: S c (‘a’))inl ‘a’ “101”, “82” {x:int | x > 100} inl 101, inr error(82) “43|105|67”  min:int.S c (‘|’) *  max:{m:int | min ≤ m}.S c (‘|’) * {avg:int | min ≤ avg & avg ≤ max} (43, inl ‘|’, inl 105, inl ‘|’, inl 67)

13 Unions and the Empty String true: matches the empty string. T + T’ : deterministic, exclusive or: try T; on failure, try T’. Examples: “54”, “n/a”int + S s (“n/a”)inl 54, inr “n/a” “2341”, “”int + trueinl 2341, inr ()

14 Array Features What features do we need to handle data sequences? –Elements –Separator between elements –Termination condition (“are we done yet?”) –Terminator after sequence Examples: “192.168.1.1” “Bill|Cathy|Jane|Bob;”

15 False and Arrays T seq(T s ; e, T t ) specifies: –Element type T –Separator types T s. –Termination condition e. –Terminator type T t. false: reads nothing, flagging an error. Example: IP address. “192.168.1.1”int seq(S c (‘.’); len 4, false)[192,168,1,1]

16 Abstraction and Application Can parameterize types over values: x.T Correspondingly, can apply types to values: T e Example: IP address with terminator none term.int seq(S c (‘.’); len 4, S c (term)) none “1.2.3.4|”IP_addr ‘|’ * S c (‘|’)([1,2,3,4],inl ‘|’)

17 Absorb, Compute and Scan Absorb, Compute and Scan are active types. –absorb(T) : consume data from source; produce nothing. –compute(e:  ) : consume nothing; output result of computation e. –scan(T) : scan data source for type T. Examples: “|”absorb(S c (‘|’))() “10|12”  width:int.S c (‘|’) *  length:int. area:compute(width  length:int) (10,12,120) “^%$!&_|”scan(S c (‘|’))(6,inl ‘|’)

18 Type Kinding Kinding ensures types are well formed.  |- T : s  k  |- e : s  |- T e: k  |- T : type  |- T’ : type  |- T + T’: type  |- T : type ,x:s |- e : bool (s = …)  |- {x:T | e}: type

19 Parsing Semantics of Types Semantics expressed as parsing functions written in the polymorphic -calculus. –Sem(T) : DDC Type  Function –Input data and offset, output new offset, value and parse descriptor. –For specifics, see upcoming technical report.

20 Types of Parser Output Parsers produce values with following type in the host language: DDCHost Language [C(e)] rep I ( C) + noval [true] rep unit [  x:T.T’] rep [T] rep * [T’] rep [ x.T] rep, [T e] rep [T] rep [T + T’] rep [T] rep + [T’] rep + noval [{x:T | e}] rep [T] rep + ([T] rep error) unrecoverable error semantic error dependency erased Base Types Products Union Abs. and App. Set types

21 Properties of the Calculus Theorem: If  |- T : k then –[T] = F well formed types yield parsers –  |- F : bits * offset  offset * [T] rep * [T] pd a T-Parser returns values with types that correspond to T. Theorem: Parsers report errors accurately. –Errors in parse descriptor correspond to actual errors in data. –Parsers check all semantic constraints. –More …

22 Making Use of the Calculus IPADS DDC  |- t  T IPADS t ::= C(e) | Pfun(x:s) = t | t e | Pstruct{fields} | Punion{fields} | Pswitch e of {alts t def ;} | Popt t | t Pwhere x.e | Palt{fields} | t [t; e,t] | Pcompute e | Plit c fields ::= | fields x : t; alts ::= | alts e => t;

23 Example: Popt and Plit  |- Popt t  T + true  |- t  T  |- Plit c  scan(absorb({x:char | x = c }))  |- c : char true T 1 + T 2 C(e) {x:T | e} absorb(T) scan(T)

24 Example: Pswitch  |- Pswitch e of {e 1 => t 1 ; e 2 => t 2 ; … t def }  ( c.{x:T 1 | c = e 1 } + {x:T 2 | c = e 2 } + …+ T def ) e  |- t i  T i (i = 1…n) T + T’ x.T {x:T|e}  |- t def  T def

25 Future work What are the set of languages recognized by the DDC? How does the expressive power of the DDC relate to CFGs and regular expressions? Implement recursive types in PADS system based on the recursive types of the DDC. Add polymorphism to DDC and PADS.

26 Summary Data description languages are well-suited to describing ad hoc data. No one DDL will ever be right - different domains and applications will demand different languages with differing levels of expressiveness and abstraction. Our work defines the first semantics for data description languages. For more information, visit www.padsproj.org.www.padsproj.org

27 Cut slides follow

28 A Brief History In the beginning, there was just one program (maybe two). No need for programming language. That program was copied and changed until there were many programs. High-level programming language was invented. Nice, but not right for all situations - many new programming languages appeared. How do these languages related to each other? –Programming language semantics was born.

29 A Brief History In the beginning, there was just one data format (binary). No need for data description language. That format was evolved until there were many formats. Data description language was invented. One language did not suit all and many new data description languages appeared. –This is where we are today We’d like to help answer that question by devising the first data description language semantics.


Download ppt "The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker."

Similar presentations


Ads by Google