Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie.

Presentation transcript:

Topes: Enabling End-User Programmers to Validate and Reformat Data
Christopher Scaffidi
Committee:
Mary Shaw (chair), Institute for Software Research, Carnegie Mellon University
Sebastian Elbaum, Computer Science & Engineering, University of Nebraska-Lincoln
Jim Herbsleb, Institute for Software Research, Carnegie Mellon University
Brad Myers, Human-Computer Interaction Institute, Carnegie Mellon University

2 Target population
In 2012, there will be 90 million computer end users in American workplaces.
Of these, at least 55 million will create spreadsheets, databases, web applications, or other programs.
–Spreadsheets for computing budgets
–Spreadsheets and databases for storing information
–Web applications for collecting data from coworkers
And similar programs for automating a wide range of tedious or error-prone work tasks.
Introduction  Requirements  Topes  Tools  Evaluation  Conclusion

3 Contextual inquiry: What are the problems of end users?
Observed 3 administrative assistants, 4 managers, and 3 webmasters/graphic designers (1-3 hrs each)

4 Lots of manual labor: validating and reformatting strings
Building a staff roster, merging data from web sites:
–Had to scrutinize data to identify questionable values (e.g.: CMU campus phone numbers are usually 268-xxxx but 269-xxxx might be right)
–Had to manually transform data to consistent format (e.g.: Put person names in Lastname, Firstname format)

5 Another person’s task: validate web forms, but he didn’t know JavaScript / regexps

6 Collaborations of programmers with widely varying skills, interests, concerns
Interviewing creators of Hurricane Katrina “person locator” sites (helping survivors publish their status)
4 managers in IT firms, 1 student, 1 graphic designer
–2 people each created a site on their own
–4 people collaborated with other programmers (principally on site aggregation)

7 Hurricane Katrina “Person Locator” site: Many inputs unvalidated

8 Hurricane Katrina sites are not alone in lacking input validation.
Eg: Google Base web application
–13 primary web forms
–Even numeric fields accept unreasonable inputs (such as a salary of “-45”)
If professional programmers can’t get this right, then it’s unsurprising that those 90 million end users also have so much trouble.
So many unvalidated inputs. So many data errors. So much time to find mistakes. So many millions of people laboriously reformatting data by hand.
We need a better way!

9 Outline
1. Requirements for a better model
2. Topes
–Model for describing data
–Tools for creating/using topes
3. Evaluations
4. Conclusion

10 Underlying problem: abstraction mismatch
Tools support strings, integers, floats, maybe dates.
Problem domain involves higher-level data categories:
–Person names: “Scaffidi, Chris”, “Chris Scaffidi”
–CMU phone numbers: “ ”, “x8-1234”
–CMU room numbers: “WeH 4623”, “Wean 4623”

11 Approach: Create a new abstraction for each category of data
Like software “libraries,” implementations of these abstractions could be reused in many programs.
Abstractions would need to include functions for:
–Recognizing instances of the category (for automating data validation)
–Transforming instances among various formats (for automating data reformatting)

12 1. Identify valid, invalid, and questionable values
Data is sometimes questionable… yet valid.
–E.g.: an unusually long address
–In practice, person names and other proper nouns are never validated with regexps… too brittle.
–Life is full of corner cases and exceptions.
If code can identify questionable data, then it can double-check the data:
–Ask an application end user to confirm the input
–Flag the input for checking by a system administrator
–Compare the value to a list of known exceptions
–Call up a server and see if it can confirm the value

13 2. Capture reformatting rules
Two different strings can be equivalent.
–What if an end user types a date in the wrong format?
–“Jan ” and “1/3/2007” mean the same thing because of the category that they are in: date.
–Sometimes the interpretation is ambiguous. In real life, preferences and experience guide interpretation.
If code can transform among formats, then it can put data in an unambiguous format as needed.
–Display result so users can check/fix interpretation

14 3. User-extensibility
Many kinds of data are organization-specific.
But users at those organizations know what the data values mean, so take advantage of what they know…
Users can describe the constrained parts of data.
–Eg: CMU room numbers, “EDSH 303”, have a building name and an internal room number
–Valid data obeys intra- and inter-part constraints.

15 4. Reusability across programming environments (“platforms”)
Validity does not depend on whether the string is in a spreadsheet or a webform or a database.
To validate a kind of data, people don’t want to write
–JavaScript for webforms on the client side
–C#/Java/PHP for webforms on the server side
–Stored procedures for databases
–VBScript for spreadsheets

16 Limitations of existing approaches
Types do not support questionable values.
Grammars (eg: regexps, CFGs, Lapis) do not either, and cannot reformat.
Tools to integrate heterogeneous databases require a professional DBA and are specific to database systems (ie: not spreadsheets, webforms, etc).
Cues, Forms/3,  -calculus, Slate, etc, infer numerical constraints but not constraints on strings, and they are tied to specific programming platforms.
Information extraction algorithms rely on grammatical cues that are absent during validation.

17 Topes
A “tope” = a platform-independent abstraction that describes how to recognize and reformat instances of a data category
Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain

18 A tope is a graph. Node = format, edge = transformation
Notional representation for a CMU room number tope…
–Formal building name & room number: Elliot Dunlap Smith Hall 225
–Colloquial building name & room number: Smith 225
–Building abbreviation & room number: EDSH 225

19 A tope has functions for recognizing and transforming instances of a data category
Each tope implementation has executable functions:
–1 isa: string → [0,1] function per format, for recognizing instances of the format (a fuzzy set)
–0 or more trf: string → string functions linking formats, for transforming values from one format to another
Validation function: φ(str) = max(isa_f(str)), where f ranges over the tope’s formats
–Valid when φ(str) = 1
–Invalid when φ(str) = 0
–Questionable when 0 < φ(str) < 1
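
A minimal Python sketch of the isa / trf / validation-function idea, using the CMU room number formats from the earlier slides. The regular expressions, the 0.5 score for unknown building names, and all function names are illustrative assumptions, not the generated tope implementations.

```python
import re

# Hypothetical lookup between two of the room-number formats (assumed data).
COLLOQUIAL_TO_ABBREV = {"Smith": "EDSH", "Wean": "WeH"}

def isa_abbrev(s: str) -> float:
    """isa for 'building abbreviation & room number', e.g. 'EDSH 225'."""
    m = re.fullmatch(r"([A-Za-z]{2,4}) (\d{3,4})", s)
    if m is None:
        return 0.0
    # Known abbreviations are valid; unknown ones are merely questionable.
    return 1.0 if m.group(1) in COLLOQUIAL_TO_ABBREV.values() else 0.5

def isa_colloquial(s: str) -> float:
    """isa for 'colloquial building name & room number', e.g. 'Smith 225'."""
    m = re.fullmatch(r"([A-Z][a-z]+) (\d{3,4})", s)
    if m is None:
        return 0.0
    return 1.0 if m.group(1) in COLLOQUIAL_TO_ABBREV else 0.5

def trf_colloquial_to_abbrev(s: str) -> str:
    """trf edge from the colloquial format to the abbreviation format."""
    name, room = s.split(" ")
    return f"{COLLOQUIAL_TO_ABBREV[name]} {room}"

def phi(s: str) -> float:
    """Validation function: the best isa score over all formats."""
    return max(isa(s) for isa in (isa_abbrev, isa_colloquial))

def classify(s: str) -> str:
    score = phi(s)
    if score == 1.0:
        return "valid"
    if score == 0.0:
        return "invalid"
    return "questionable"
```

Under these assumptions, classify("Doherty 225") comes back "questionable" rather than being rejected outright, which is the behavior a yes/no regexp cannot express.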

20 Common kinds of topes: enumerations and proper nouns
Multi-format enumerations, e.g.: US states
–“New York”, “CA”, maybe “Guam”
Open-set proper nouns, e.g.: company names
–Whitelist of definitely valid names (“Google”), with alternate formats (e.g. “Google Corp”, “GOOG”)
–Augmented with a pattern for promising inputs that are not yet on the whitelist

21 Two other common kinds of topes: numeric and hierarchical
Numeric, e.g.: human masses
–Numeric and in a certain range
–Values slightly outside range might be questionable
–Sometimes labeled with an explicit unit
–Transformation usually by multiplication
Hierarchical, e.g.: address lines
–Parts described with other topes (e.g.: “100 Main St.” uses a numeric, a proper noun, and an enum)
–Simple isas can be implemented with regexps.
–Transformations involve permutation of parts, lookup tables, and changes to separators & capitalization.
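
The numeric case can be sketched in a few lines of Python. The range limits, the 0.3 score for values slightly outside the range, and the function names are assumptions chosen for illustration only.

```python
def isa_mass_kg(s: str) -> float:
    """isa for a human mass in kilograms (all thresholds are assumed)."""
    try:
        value = float(s)
    except ValueError:
        return 0.0                      # not numeric at all
    if 2.0 <= value <= 200.0:
        return 1.0                      # inside the expected range
    if 0.0 < value <= 400.0:
        return 0.3                      # slightly outside: questionable
    return 0.0                          # negative, zero, or absurdly large

def trf_kg_to_lb(s: str) -> str:
    """Transformation by multiplication: kilograms to pounds."""
    return f"{float(s) * 2.20462:.1f}"
```

With these thresholds a mass of “-45” scores 0, while “250” is flagged for a second look instead of being rejected.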

22 Role of good tool support
Some simple isa functions could be implemented as
–Enumerations
–Regular expressions / formal grammars
But for many topes, we also need to support questionable values and reformatting.
And usability can almost always be improved by tailoring the tools to the problem domain
–Integrate with users’ familiar tools
–Match the user interface to the problem’s structure

23 Topes in action
1. Users create data descriptions (abstract, user-friendly descriptions of data categories).
2. Users publish data descriptions on repositories.
3. Other users download data descriptions to cache.
4. System automatically generates tope implementations from data descriptions.
5. Tool add-ins help users browse their cache and associate topes with variables and input fields.
6. Add-ins get topes from local cache and call them at runtime to validate and reformat data.

24 What the user sees
User highlights cells
Clicks “New” button on our Validation toolbar

25 System infers a boilerplate tope and presents it for review and customization
Induction steps:
1. Identify number & word parts
2. Align parts based on punctuation
3. Infer simple constraints on parts
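
The three induction steps can be approximated as follows. This is a rough sketch under simplifying assumptions (ASCII letters and digits only, alignment by exact punctuation signature, length ranges as the only inferred constraint). The actual inference algorithm is not shown on the slide, and every name here is hypothetical.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\d+|[A-Za-z]+|[^A-Za-z\d]")

def parts(s):
    """Step 1: split a string into number parts, word parts, and separators."""
    out = []
    for tok in TOKEN.findall(s):
        if tok.isdigit():
            out.append(("NUM", tok))
        elif tok.isalpha():
            out.append(("WORD", tok))
        else:
            out.append(("SEP", tok))
    return out

def signature(s):
    """Step 2: strings align when they share part kinds and punctuation."""
    return tuple(kind if kind != "SEP" else text for kind, text in parts(s))

def infer(examples):
    """Step 3: infer a (kind, min length, max length) constraint per part."""
    groups = defaultdict(list)
    for s in examples:
        groups[signature(s)].append(parts(s))
    formats = {}
    for sig, rows in groups.items():
        constraints = []
        for i, kind in enumerate(sig):
            if kind in ("NUM", "WORD"):
                lengths = [len(row[i][1]) for row in rows]
                constraints.append((kind, min(lengths), max(lengths)))
        formats[sig] = constraints
    return formats
```

Given “EDSH 225”, “WeH 4623”, and “268-1234”, this yields two candidate formats: a word of 3-4 letters followed by a 3-4 digit number, and a 3-digit number, a hyphen, and a 4-digit number. The user would then review and rename these parts.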

26 User gives names to the parts and edits constraints
Features
–Part names
–Value whitelists
–Testing features
–Soft constraints (never / rarely / often / almost always / always)

27 System identifies typos
Features
–Targeted messages
–Overridable
–Filterable
–Can add to “whitelist”
–Integrated with Excel’s “reviewing” functionality
Checking inputs
1. Convert description to CFG w/ constraints on productions
2. Parse each input string
3. For each constraint violation, downgrade parse’s isa score
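
A small sketch of the "downgrade the isa score per violated constraint" step, using the campus phone number example from slide 4. A regular expression stands in for the CFG parse, and the multiplicative 0.5 penalty is an assumption. The slide does not say how scores are combined.

```python
import re

# Soft constraints: (message, predicate over parsed parts, penalty factor).
# Both the constraint and its penalty are illustrative assumptions.
CONSTRAINTS = [
    ("exchange is usually 268", lambda p: p["exchange"] == "268", 0.5),
]

def isa_phone(s: str):
    """Return (isa score, targeted messages) for a ddd-dddd phone number."""
    m = re.fullmatch(r"(?P<exchange>\d{3})-(?P<line>\d{4})", s)
    if m is None:
        return 0.0, ["does not parse as ddd-dddd"]
    score, messages = 1.0, []
    for description, holds, penalty in CONSTRAINTS:
        if not holds(m.groupdict()):
            score *= penalty            # downgrade, do not reject
            messages.append(f"unusual: {description}")
    return score, messages
```

The returned messages are what an add-in could show as targeted, overridable warnings.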

28 Easy access to reformatting functionality
Reformatting string
1. Parse with input format’s CFG
2. For each part in target format,
a) Get node from parse tree
b) Reformat node if needed (recurse)
c) Concatenate (with separators if needed)
3. Validate result with target format’s CFG
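
The reformatting steps can be illustrated with the person-name example from slide 4. This sketch assumes two hypothetical formats described by regular expressions with named groups instead of CFGs, and it omits the recursive reformatting of individual parts (step 2b).

```python
import re

# Hypothetical formats of a person-name tope: a pattern plus a part layout.
FORMATS = {
    "first_last": re.compile(r"(?P<first>[A-Z][a-z]+) (?P<last>[A-Z][a-z]+)"),
    "last_first": re.compile(r"(?P<last>[A-Z][a-z]+), (?P<first>[A-Z][a-z]+)"),
}
LAYOUT = {
    "first_last": (["first", "last"], " "),
    "last_first": (["last", "first"], ", "),
}

def reformat(s: str, source: str, target: str) -> str:
    # 1. Parse with the input format.
    parsed = FORMATS[source].fullmatch(s)
    if parsed is None:
        raise ValueError(f"{s!r} is not in format {source}")
    # 2. Fetch each part needed by the target format and concatenate
    #    the parts with the target format's separator.
    order, separator = LAYOUT[target]
    result = separator.join(parsed.group(name) for name in order)
    # 3. Validate the result with the target format.
    if FORMATS[target].fullmatch(result) is None:
        raise ValueError(f"{result!r} is not a valid {target}")
    return result
```

For example, reformat("Chris Scaffidi", "first_last", "last_first") produces "Scaffidi, Chris".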

29 Recommending topes based on label and examples-to-match
Efficient recommendation
Only consider a tope if its instances could possibly have the “character content” of each example string. (eg.: could this have 12 letters & 1 space?)
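
One way to read the “character content” filter is as a cheap pre-check on character counts before any expensive parsing. The sketch below assumes each tope carries precomputed (min, max) bounds per character class. The bounds, tope names, and function names are invented for illustration.

```python
from collections import Counter

def char_content(s: str) -> Counter:
    """Count letters, digits, spaces, and other characters in a string."""
    counts = Counter()
    for ch in s:
        if ch.isalpha():
            counts["letter"] += 1
        elif ch.isdigit():
            counts["digit"] += 1
        elif ch == " ":
            counts["space"] += 1
        else:
            counts["other"] += 1
    return counts

# Assumed per-tope bounds on how many characters of each class can occur.
BOUNDS = {
    "room number": {"letter": (2, 30), "digit": (3, 4), "space": (1, 4), "other": (0, 0)},
    "phone number": {"letter": (0, 1), "digit": (5, 10), "space": (0, 2), "other": (0, 3)},
}

def candidates(examples):
    """Keep only topes whose bounds admit every example string."""
    counts = [char_content(e) for e in examples]
    return [
        name
        for name, bounds in BOUNDS.items()
        if all(lo <= c[k] <= hi for c in counts for k, (lo, hi) in bounds.items())
    ]
```

Only the surviving candidates would then be run through their full isa functions and ranked.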

30 Search repository by label and/or examples
Note: many repositories will be organization-specific

31 Integration with Visual Studio.NET
Features
–Targeted messages
–Overridable
–Drag & drop code generation

32 Other integrations to date: CoScripter, Robofox, XML/HTML library

33 Other integration underway
RedRover
–Spreadsheet auditing
–They already support formula auditing
–Goal: Using topes for checking strings
LogicBlox
–Decision-support
–Helping users enter data & make decisions from it
–Goal: Using topes for validating data
–Goal: Using topes for data de-duplication

34 Evaluation
Many evaluations rely on the EUSES Spreadsheet Corpus (collected by Univ. Nebraska)
–In particular, 4250 spreadsheet columns that contained at least 20 strings
These evaluations generally use the F1 statistic as a measure of accuracy
1. Get strings from the corpus
2. Manually validate the strings
3. Automatically validate the strings (eg: with topes)
4. Compute F1 to check agreement
F1 = precision * recall / ( (precision + recall)/2 )
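
For concreteness, the F1 computation can be written out as below. The sketch assumes that the "positive" class is an invalid string, so that recall measures how many truly invalid inputs the validator catches. The function name and the boolean-list encoding are illustrative choices.

```python
def f1_score(manual, automatic):
    """F1 agreement between manual and automatic 'is invalid' judgments.

    Both arguments are equal-length lists of booleans, where True means
    the string was judged invalid.
    """
    pairs = list(zip(manual, automatic))
    tp = sum(1 for m, a in pairs if m and a)          # invalid, caught
    fp = sum(1 for m, a in pairs if not m and a)      # valid, wrongly flagged
    fn = sum(1 for m, a in pairs if m and not a)      # invalid, missed
    if tp == 0:
        return 0.0                                    # precision or recall is 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision * recall / ((precision + recall) / 2)
```

A validator that accepts every input never flags anything, so its recall and therefore its F1 are 0 under this definition.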

35 Evaluating accuracy
Implemented topes for spreadsheet data
–Created 32 topes for the most common categories
  Covering 1199 columns, which was ~69% of the 1713 categorized columns, or ~28% of all 4250 columns
  Up to 5 formats per tope
–Compared to current practice
  Validate w/ tope, simulate asking user on questionable inputs, F1=0.7
  Validate w/ regexps or enumerations if available, but accept all inputs when no regexp or enumeration is available, F1=0.19
–Tope-based validation was 3 times as accurate
  Big benefit from supporting multi-format topes
  Moderate benefit from validating currently-unvalidated categories
  Small benefit from double-checking questionable values (~ 3% of inputs)

36 Evaluating reusability
Reused spreadsheet-based topes on webform data
–Downloaded data for 8 data categories on Google Base and 5 in Hurricane Katrina website
–Reused spreadsheet-based topes on the web data
–Validation was even more accurate than on spreadsheets
  F1=0.75 for Google Base, 0.92 for Hurricane Katrina
  Website data had less formatting diversity than spreadsheets

37 Evaluating support for data cleaning
Used topes to put web data into consistent formats
–Again with the 5 columns in Hurricane Katrina website
–Used transformation functions to put each string into the most common format for that data category
–Increased number of duplicate strings found by 10%

38 Evaluating usability for data validation
Users validating data with single-format topes
–Between-subjects lab study
–8 users validated spreadsheet data with our tools; for comparison, 8 users validated with Lapis patterns
–Yes/no validation tasks (no questionable data)
–Our tool users vs Lapis users
  Found three times as many typos (comparable F1 scores)
  Were twice as fast
  Reported significantly higher user satisfaction
–Our tool users vs users in earlier regexp study
  Faster & more accurate (similar but not identical tasks: not statistically comparable)

39 Evaluating usability for data reformatting
Users reformatting data with multi-format topes
–Within-subjects lab study
–9 users reformatted spreadsheet data by creating & using topes; for comparison, they then did it manually
–Effort of creating a tope “pays off” at only 47 strings (further reuse is essentially “free”)
–Every participant strongly preferred using our tools instead of doing tasks manually

40 Evaluating tope recommendations
Quickly recommend existing tope for data at hand
–Supports keyword-based search + search-by-match (eg: topes that match “ ”)
–Evaluated by searching through topes for the 32 most common data categories in EUSES spreadsheet corpus, using strings from corpus
–High accuracy: Recall over 80% (result set size = 5)
–Adequate speed: User is likely to have a few dozen topes on computer, taking under 1 sec to search

41 Closing the mismatch between data abstractions and the real world
People often work with strings that are possibly-questionable instances of multi-format categories.
These categories are application-agnostic and often common to many people.
By capturing rules for validating and reformatting strings (including distinguishing questionable strings and multiple formats), topes…
–Increase the accuracy of validation
–Help users to accomplish validation and reformatting activities quickly and effectively
–Improve the reusability of validation code

42 Thank You…
To my committee and the entire EUSES Consortium for helpful suggestions
To NSF for funding

43 References
For more information on end users and topes
-End users’ counts and needs: VL/HCC’05, VL/HCC’07
-Topes model: ICSE’08
-Format inference: ICEIS’07
-Integration with other systems: WEUSE’08 & FSE’08
-Our latest tools + usability validation: ISEUD’09 & IUI’09
For more information on some related work
-Dependent types, eg: X. Ou, Dynamic Typing with Dependent Types, Tech Rpt TR , Princeton Univ
-Regexp induction, eg: K. Lerman, S. Minton, Learning the Common Structure of Data, Proc. AAAI
-Lapis system: R. Miller, Lightweight structure in text, Tech Rpt CMU-CS , Carnegie Mellon Univ.
-SWYN regexp editor: A. Blackwell, See What You Need: Helping End-users to Build Abstractions, JVLC
-Federated databases, eg: A. Sheth, J. Larson, Federated database systems for managing distributed, heterogeneous, and autonomous databases, CSUR
-ETL Tools, eg: E. Rahm, H. Do, Data Cleaning: Problems and Current Approaches, IEEE Data Eng. Bulletin
-Potter’s Wheel: V. Raman, J. Hellerstein, Potter's Wheel: An Interactive Data Cleaning System, VLDB
-Forms/3: M. Burnett et al, End-user software engineering with assertions in the spreadsheet paradigm, ICSE
- -calculus: M. Erwig, M. Burnett, Adding Apples and Oranges, Symp. Practical Aspects of Declarative Lang.
-Named entities, eg: Message Understanding Conference series.

44 Professional programmers use lots of tricks to simplify validation code. Eg: njtransit.com
Split inputs into many easy-to-validate fields. Who cares if the user has to type tabs now, or if he can’t just copy-paste into one field?
Make users pick from drop-downs. Who cares if it’s faster for users to type “NJ” or “1/2007”? (Disclaimer: drop-downs sometimes are good!)
I implemented this site in

45 Even with these tricks, writing validation is still very time-consuming.
Overall, the site had over 1100 lines of JavaScript just for validation…. Plus equivalent server-side Java code (too bad code isn’t platform-independent)
if (!rfcCheck (frm.primary .value))
    return messageHelper(frm.primary , "Please enter a valid Primary address.");
var atloc =
if (atloc > 31 || atloc < frm.primary .value.length-33)
    return messageHelper(frm.primary , "Sorry. You may only enter 32 characters or less for your name\r\n"+ "and 32 characters or less for your domain

46 That was worst case. Best case: reusable regexps.
Many IDEs allow the programmer to enter one regular expression for validating each input field.
–Usually, this drastically reduces the amount of code, since most validation ain’t fancy.
–Yet programmers don’t validate most inputs.

47 Users’ spreadsheets are rife with formatting inconsistencies & other typos
In one study by Univ Nebraska, nearly 40% of spreadsheet cell values were strings (not numbers or dates).
Part of an actual spreadsheet on Carnegie Mellon’s public web site

48 Evaluating expressiveness
Implemented topes for common webform inputs
–Instrumented web browsers of 4 administrative assistants for 3 weeks
–Logged strings that they typed into forms, in a regexp-masked format e.g.:  
–Also logged strings near textfields
–Semi-automatically grouped strings into categories e.g.: project number, expense type, address, zip code
–Implemented 14 most common topes
–Found 22 probable typos in user inputs (0.5%)

49 Tope Development Environment (TDE)
Architecture diagram. Components: Toped++, Topei, Topeg, add-ins, local repository, remote repositories.
End user applications: Microsoft Excel (spreadsheets), Visual Studio.NET (web forms), Robofox (web macros), Vegemite/CoScripter (web macros), ...
Data flows: strings from applications; boilerplate data description; user’s finished data description; tope implementation and data description; error messages and reformatted strings; data description (upload); data description (download).

50 As a tool builder, what do I have to do so that people can use topes in my tool?
You need to make an add-in
1. Figure out what kind of fields you want to help your users validate/reformat (eg: spreadsheets’ cells; webforms’ textboxes)
2. Download our open source C# or Java API (library)
3. In your tool’s UI, add buttons and other widgets so user can select a tope for the fields; in your event handler, call our API methods
4. At runtime, pass field’s value (a string) to our API methods to validate or reformat strings
5. Display validation error messages; update value in UI

51 Recognizing multiple formats and questionable inputs raises accuracy
Condition 4: Hypothetical user has to help on ~ 3% of inputs
Condition 1: Recall = 0 (fails to identify any invalid inputs)

53 Imagine a world where…
Code can ask an oracle, “Is this a person name?”, and the oracle replies yes, no, almost definitely, probably not, and other shades of gray.
Code allows input in any reasonable format, since the code can ask the oracle to put the input into the format that is actually needed.
Regardless of whether they are working in spreadsheets, webforms, or another programming environment, end users can teach the oracle about a new data category by concisely stating its parts and constraints.

54 Data errors reduce the usefulness of data.
Even little typos impede data de-duplication.
Age is not useful for flying my helicopter to come rescue you. Nor is a “city name” with 1 letter.

60 A Word-like part that almost always contains 1-6 words that each always have 1-8 lowercase letters per word and only hyphens or ampersands between words:
#PART : #WORDLIST : COUNT(#WORD)>=1 && COUNT(#WORD)<=6 {90}
#WORDLIST : #WORD | #WORD #SEP #WORDLIST
#WORD : #CHLIST : COUNT(#CH)>=1 && COUNT(#CH)<=8 {100}
#CHLIST : #CH | #CH #CHLIST
#CH : a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
#SEP : - | &
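
A hand-written Python equivalent of this grammar may make the annotations easier to read. It rests on an assumed interpretation of the bracketed numbers: a violated {100} constraint makes the string invalid, and a violated {90} constraint leaves 10% of the score. The slide does not define the exact scoring, so treat the penalty as a placeholder.

```python
import re

SOFT_PENALTY = 0.1   # assumed effect of violating the {90} constraint

def isa_wordlist(s: str) -> float:
    """isa score for the Word-like part described by the grammar above."""
    # #WORDLIST: words separated by a single '-' or '&' (#SEP).
    words = re.split(r"[-&]", s)
    for word in words:
        # #CHLIST: one or more lowercase letters (#CH).
        if re.fullmatch(r"[a-z]+", word) is None:
            return 0.0                  # string does not parse at all
        # {100}: every word always has 1-8 letters.
        if len(word) > 8:
            return 0.0
    score = 1.0
    # {90}: the part almost always has 1-6 words.
    if not 1 <= len(words) <= 6:
        score *= SOFT_PENALTY
    return score
```

So “food-drink” scores 1.0, a seven-word string such as “a-b-c-d-e-f-g” is questionable, and “Food” is invalid because uppercase letters are not in #CH.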
