Presentation is loading. Please wait.

Presentation is loading. Please wait.

From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data David Walker Pamela Dragosh Mary Fernandez Kathleen Fisher Andrew Forrest Bob Gruber.

Similar presentations


Presentation on theme: "From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data David Walker Pamela Dragosh Mary Fernandez Kathleen Fisher Andrew Forrest Bob Gruber."— Presentation transcript:

1 From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data David Walker Pamela Dragosh Mary Fernandez Kathleen Fisher Andrew Forrest Bob Gruber Yitzhak Mandelbaum Peter White Kenny Q. Zhu www.padsproj.org

2 Data, data, everywhere AT&T and other information technology companies spend huge amounts of time and energy processing “ad hoc data” Ad hoc data = data in non-standard formats with no a priori data processing tools/libraries available –not free text; not html; not xml Common problems: no documentation, evolving formats, huge volume, error-filled... Web Logs Network Monitoring Billing Info Router Configs Call Details

3 Data, data, everywhere 207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30 tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941 234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0” 200 409 240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178 188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082 214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 - web server common log format

4 Data, data, everywhere AT&T phone call provisioning data 9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii1522 72|EDTF_6|0|MARVINS1|UNO|10|1000295291 9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii1522 2|EDTF_6|0|MARVINS1|UNO|10|1000295291|20|1000295291|17|1001649600|19|10 01 649600|27|1001649600|29|1001649600|IA0288|1001714400|IE0288|1001714400|E DTF_CRTE|1001908800|EDTF_OS_1|1001995201|16|1021309814|26|1054589982

5 Data, data, everywhere HA00000000START OF TEST CYCLE aA00000001BXYZ U1AB0000040000100B0000004200 HE00000005START OF SUMMARY f 00000006NYZX B1QB00052000120000070000B000050000000520000 00490000005100+00000100B00000005300000052500000535000 HF00000007END OF SUMMARY k 00000008LYXW B1KB0000065G0000009900100000001000020000 HB00000009END OF TEST CYCLE www.opradata.com www.opradata.com

6 Data, data, everywhere format-version: 1.0 date: 11:11:2005 14:24 auto-generated-by: DAG-Edit 1.419 rev 3 default-namespace: gene_ontology subsetdef: goslim_goa "GOA and proteome slim" [Term] id: GO:0000001 name: mitochondrion inheritance namespace: biological_process def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc] is_a: GO:0048308 ! organelle inheritance is_a: GO:0048311 ! mitochondrion distribution www.geneontology.org

7 Goal Billing Info Raw Data ASCII log files Call Detail XML CSV Standard formats & schema Visual Information End-user tools We want to create this arrow

8 Half-way there: The PADS System 1.0 [FG pldi 05, FMW popl 06, MFWFG popl 07] “Ad Hoc” Data Source Analysis Report XML PADS Data Description PADS Compiler Generated Libraries (Parsing, Printing, Traversal) PADS Runtime System (I/O, Error Handling) XML Converter Data Profiler Graphing Tool Query Engine Custom App GraphInformation ? generic description- directed programs coded once

9 PADS Language Overview Rich base type library: –integers: Pint8, Puint32, … –strings: Pstring(’|’), Pstring_FW(3),... –systems data: Pdate, Ptime, Pip, … Type constructors describe complex data sources: –sequences: Pstruct, Parray, –choices: Punion, Penum, Pswitch –constraints: arbitrary predicates describe expected semantic properties –parameterization: allows definition of generic descriptions Data formats are described using a specialized language of types A formal semantics gives meaning to descriptions in terms of both external format and internal data structures generated.

10 The Last Mile: The PADS System 2.0 Chunking & Tokenization Structure Discovery Format Refinement PADS Data Description Scoring Function Raw Data PADS Compiler Profiler XMLifier Analysis Report XML Format Inference Engine Chunking & Tokenization Structure Discovery

11 Convert raw input into sequence of “chunks.” Supported divisions: –Various forms of “newline” –File boundaries Also possible: user-defined “paragraphs” Chunking Process

12 Tokenization Tokens/Base types expressed as regular expressions. Basic tokens Integer, white space, punctuation, strings Distinctive tokens IP addresses, dates, times, MAC addresses,...

13 Histograms

14 Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height. Clustering Cluster 1 Group clusters with similar frequency distributions Cluster 2Cluster 3 Rank clusters by metric that rewards high coverage and narrower distributions. Chose cluster with highest score.

15 Partition chunks In our example, all the tokens appear in the same order in all chunks, so the union is degenerate.

16 Find subcontexts Tokens in selected cluster: Quote(2) Comma White

17 Then Recurse...

18 Inferred type

19 Structure Discovery Review Compute frequency distribution for each token. Cluster tokens with similar frequency distributions. Create hypothesis about data structure from cluster distributions –Struct –Array –Union –Basic type (bottom out) Partition data according to hypothesis & recurse Once structure discovery is complete, later phases massage & rewrite candidate description to create final form “123, 24” “345, begin” “574, end” “9378, 56” “12, middle” “-12, problem” …

20 Testing and Evaluation Evaluated overall results qualitatively –Compared with Excel -- a manual process with limited facilities for representation of hierarchy or variation –Compared with hand-written descriptions –- performance variable depending on tokenization choices & complexity Evaluated accuracy quantitatively –For many formats: 95%+ accuracy from 5% of available data Evaluated performance quantitatively –Hours to days to hand-write formats –after fixing the format, appears to scale linearly with data size –<1 min on 300K data

21 Technical Summary [www.padsproj.org] PADS 1.0 is an effective implementation framework for many data processing tasks PADS 2.0 improves programmer productivity further by automatically inferring formats & generating many tools & libraries Email ASCII log files Binary Traces struct {......................... } XML CSV

22 End

23 Execution Time Data sourceSD (s)Ref (s)Tot (s)HW (h) 1967Transactions.short0.202.322.564.0 MER_T01_01.cvs0.112.822.920.5 Ai.30001.9726.3528.641.0 Asl.log2.9052.0755.261.0 Boot.log0.112.402.531.0 Crashreporter.log0.123.583.732.0 Crashreporter.log.mod0.153.834.002.0 Sirius.10002.245.698.001.5 Ls-l.txt0.010.100.111.0 Netstat-an0.070.740.821.0 Page_log0.080.550.650.5 quarterlypersonalincome0.075.115.1848 Railroad.txt0.062.692.762.0 Scrollkeeper.log0.133.243.401.0 Windowserver_last.log0.379.6510.071.5 Yum.txt0.111.912.035.0 SD: structure discovery Ref: refinement Tot: total HW: hand-written

24 Training Time

25 Minimum Necessary Training Sizes Data source90%95% Sirius.1000510 1967Transaction.short55 Ai.3000510 Asl.log510 Scrollkeeper.log55 Page_log55 MER_T01_01.csv55 Crashreporter.log1015 Crashreporter.log.mod515 Windowserver_last.log515 Netstat-an2535 Yum.txt3045 quarterlypersonalincome10 Boot.log4560 Ls-l.txt5065 Railroad.txt6075

26 Problem: Tokenization Technical problem: –Different data sources assume different tokenization strategies –Useful token definitions sometimes overlap, can be ambiguous, aren’t always easily expressed using regular expressions –Matching tokenization of underlying data source can make a big difference in structure discovery. Current solution: –Parameterize learning system with customizable configuration files –Automatically generate lexer file & basic token types Future solutions: –Use existing PADS descriptions and data sources to learn probabilistic tokenizers –Incorporate probabilities into sophisticated back-end rewriting system Back end has more context for making final decisions than the tokenizer, which reads 1 character at a time without look ahead


Download ppt "From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data David Walker Pamela Dragosh Mary Fernandez Kathleen Fisher Andrew Forrest Bob Gruber."

Similar presentations


Ads by Google