Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University.

Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

2 Target audience In 2012, we project that there will be 90 million computer end users (“EUs”) in American workplaces. Of these, at least half will create spreadsheets, databases, and/or web applications. These are called end-user programmers (“EUPs”). For professional programmers, programs are a deliverable. For EUPs, programs are a means to an end. motivation ● overview ● algorithm ● evaluation ● conclusion

3 Current practice: Storing data as strings Typical tools of EUPs: –Excel“text” cells –Access/MSSQL“varchar” fields –FrontPage/Dreamweaver“textfield” inputs Validation involves… –Learning an exotic new notation (VBScript, regexps, etc) –Writing cumbersome expressions in that notation Most EUPs do not know these notations and have no time, interest, or incentive to learn the notations. motivation ● overview ● algorithm ● evaluation ● conclusion

4 Our Topes system to date… Formats are presented in human-readable notation in our format editor –Format = sequence of parts with constraints on parts –Constraints can be “often” true (rather than “always”) The format is automatically converted to a context-free grammar, with constraints attached to productions. At runtime, our parser checks values against formats, returning a confidence in the range [0,1] for each value. motivation ● overview ● algorithm ● evaluation ● conclusion

5 Needed: Format inference Problem: To date, the system offers limited support for helping users to get started. Although users do not need to learn specialized notation, there is still the cognitive work of… –examining data –breaking it into parts –representing parts in the format Solution: Infer a boilerplate format from examples motivation ● overview ● algorithm ● evaluation ● conclusion

6 Talk Outline Motivation / Problem Solution –Overview of Topei –Inference algorithm Evaluation Conclusion motivation ● overview ● algorithm ● evaluation ● conclusion

7 Prototype Task flow diagram Algorithm infers a format from cell values User reviews and customizes format User creates a format from scratch User loads an existing format from a file Plug-in flags cells that don’t match format User highlights spreadsheet cells [1][6] or motivation ● overview ● algorithm ● evaluation ● conclusion

8 Sample task: validating a spreadsheet with the prototype we have built The second column is “supposed” to contain first names, but some outlier values containing initials have snuck in. motivation ● overview ● algorithm ● evaluation ● conclusion

9 Sample task: validating a spreadsheet Customizing an inferred format Inferred format is presented in editor with sentence-like prompts to improve human-readability User can specify meaningful names for parts motivation ● overview ● algorithm ● evaluation ● conclusion

10 Sample task: validating a spreadsheet Customizing constraints in our prototype User can add/edit constraints motivation ● overview ● algorithm ● evaluation ● conclusion

11 Sample task: validating a spreadsheet Flagging potential errors A red flag (reviewer comment, actually) appears on cells that do not match the format; mouse over for message motivation ● overview ● algorithm ● evaluation ● conclusion

12 Our algorithm has 2 phases Input: An array of strings Phase 1: Identify format parts Phase 2: Identify constraints on each part of each format Output: An array of formats –Sorted according to how many examples they match motivation ● overview ● algorithm ● evaluation ● conclusion

13 Phase 1: Identify format parts For each string, replace each character with its class, then collapse runs, generating a string “signature” –Supported character classes: Auppercase letter alowercase letter 0digit Example: apple@gmail.coma@a.a banana1@hotmail.coma0@a.a carrot@company.coma@a.a DATE@UNIVERSITY.eduA@A.a eggplant@firm-name.coma@a-a.a fig.plant@mail.univ.edua.a@a.a.a motivation ● overview ● algorithm ● evaluation ● conclusion

14 Phase 1: Identify format parts Pack strings with identical signatures (often leads to significant performance improvement) Example: apple@gmail.coma@a.a banana1@hotmail.coma0@a.a carrot@company.com > DATE@UNIVERSITY.eduA@A.a eggplant@firm-name.coma@a-a.a fig.plant@mail.univ.edua.a@a.a.a motivation ● overview ● algorithm ● evaluation ● conclusion

15 Phase 1: Identify format parts Align signatures based on separators Example: apple@gmail.coma @ a. a banana1@hotmail.coma0 @ a. a DATE@UNIVERSITY.eduA @ A. a eggplant@firm-name.coma@a-a.a fig.plant@mail.univ.edua.a@a.a.a motivation ● overview ● algorithm ● evaluation ● conclusion

16 Phase 1: Identify format parts Abstract to least general composite character class, yielding the parts of each format. Example (3 formats below): apple@gmail.coma @ a. a banana1@hotmail.coma0 @ a. a DATE@UNIVERSITY.eduA @ A. a a0A aA a eggplant@firm-name.coma @ a - a. a fig.plant@mail.univ.edua. a @ a. a. a motivation ● overview ● algorithm ● evaluation ● conclusion

17 Phase 2: Identify constraints on each part Constrain each part’s contents to the character classes Require indicated separators before/after parts Infer an additional content constraint that is “often” true: –Must be in a set of 3 or fewer literals? –Must be in a numeric range? –Must start with or end with certain characters? A content constraint is inferred if it covers at least 95% of the examples supporting that format’s signature. Afterward, the user can review/customize format. motivation ● overview ● algorithm ● evaluation ● conclusion

18 Evaluation as an outlier finder Outlier finding: –Infer a format from example values –Use the inferred format to check the examples  Reveals “outliers” that might contain typos or other errors Comparison algorithm: Lapis Lapis example @DayOfMonth is Number equal to /[12][0-9]|3[01]|0?[1-9]/ ignoring nothing @ShortMonth is Number equal to /1[012]|0?[1-9]/ ignoring nothing @ShortYear is Number equal to /\d\d/ ignoring nothing Date is flatten @ShortMonth then @DayOfMonth then @ShortYear ignoring either Spaces or Punctuation motivation ● overview ● algorithm ● evaluation ● conclusion

19 Evaluation data Drawn from EUSES spreadsheet corpus 6288 US phone numbers in 37 columns –First cell in column contains “phone” –And at least 20 cells have exactly 10 digits –And at least 2/3 of cells have exactly 10 digits 1124 country names in 7 columns –First cell in column contains “country” –And there are at least 20 cells –And at least one cell contains “Portugal” motivation ● overview ● algorithm ● evaluation ● conclusion

20 Run each algorithm (Topei & Lapis) and compare their output to hand-labeling For determining “true outliers” in calculating accuracy: –Outlier phone numbers have an area code that is not in service, or if they contain errant separators such as spaces not shared by most cells in the column. –Outlier country names contain abbreviations, misspellings or a different name than the one usually used by English- speakers, except for a specific list of allowed exceptions that are very commonly used (e.g.: Brasil, US, UK) [note: allowing these exceptions hurts Topei’s accuracy] motivation ● overview ● algorithm ● evaluation ● conclusion

21 Results: Topei’s precision/recall exceed Lapis’s Standard machine learning measures for outlier finding –Precision = # outliers found / # outliers claimed –Recall = # outliers found / # true outliers TaskAlgorithmPrecision (%)Recall (%) Country Topei56.594.6 Country Lapis46.77.6 Phone Topei97.799.8 Phone Lapis44.02.4 motivation ● overview ● algorithm ● evaluation ● conclusion

22 Limitations & future work Topei still makes mistakes: –Doesn’t infer constraints aggressively enough –Doesn’t recognize non-ASCII chars in character classes –Doesn’t handle formats with repeating parts Need deeper integration with EUPs’ tools Computational complexity: –Is intended to be O(# examples), seems to be true –More careful verification needed Usability has not yet been evaluated in user study motivation ● overview ● algorithm ● evaluation ● conclusion

23 Thank You… …to you for your interest and attention …to INSTICC for the opportunity to present …to NSF and EUSES for funding (ITR-0325273 and CCF-0438929) motivation ● overview ● algorithm ● evaluation ● conclusion

24 Another example: Carnegie Mellon University phone #: 8-1234 motivation ● overview ● algorithm ● evaluation ● conclusion

25Integration Formats can be inferred from… –Spreadsheet cells –Database queries (e.g.: Access/MSSQL) –Arbitrary collection of text strings (via C#) Formats can then be used without modification in other venues, as well. –E.g.: infer a format from spreadsheet cells, then use it to create a trigger for a database table motivation ● overview ● algorithm ● evaluation ● conclusion

26 Related Work Many algorithms train a recognizer to notice features –See (Mitchell, 1997) for a summary –Such algorithms do not infer a human-editable format. Others generate formats in specialized notation. –(Miller, 2001) (Blackwell, 2001) (Lerman, 2000), (Lieberman, 2001) (Nardi, 1998) –Regular expressions and CFGs have limited readability. Several tools recognize or manipulate some of the same kinds of data as Topei. –(Hong, 2006) (Pandit, 1997) (Stylos 2004) –Custom formats are unsupported (only hardcoded formats) motivation ● overview ● algorithm ● evaluation ● conclusion

Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University.

Similar presentations

Presentation on theme: "Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University.

Similar presentations

Presentation on theme: "Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University."— Presentation transcript:

Similar presentations

About project

Feedback