Topes: Meeting the Challenges of User Input Validation Christopher Scaffidi Key collaborators: Brad Myers, Mary Shaw Carnegie Mellon University.

Presentation transcript:

Topes: Meeting the Challenges of User Input Validation Christopher Scaffidi Key collaborators: Brad Myers, Mary Shaw Carnegie Mellon University

2 Hurricane Katrina “Person Locator” site: Many inputs unvalidated... and error-ful Introduction  Challenges  Topes  Tools  Evaluation  Conclusion

3 Data errors reduce the usefulness of data. Even little typos impede data de-duplication. Age is not useful for flying my helicopter to come rescue you. Nor is a “city name” with 1 letter.

4 Hurricane Katrina sites are not alone in lacking input validation. Eg: Google Base web application –13 primary web forms –Even numeric fields accept unreasonable inputs (such as a salary of “-45”) Eg: Spreadsheets –40% of cells are non-numeric, non-date textual data –Often used to gather/organize textual data for reports

5 Outline 1. Challenges of data validation 2. Topes: model for describing data; tools for creating/using topes 3. Evaluation 4. Conclusion

6 Digging into the details: real user inputs that need validation. Sources: –Interviews of Hurricane Katrina website creators –Survey of Information Week readers –Contextual inquiry of information workers who created and used websites –Logs of what admin assistants typed into browsers –Exploration of the EUSES spreadsheet corpus Validating user inputs has 3 primary challenges…

7 1. Inputs don’t always conform well to the simple “binary” validation model. Data is sometimes questionable… yet valid. –Eg: a suspiciously long address –In practice, person names and other proper nouns are never validated with regexps… too brittle. –Life is full of corner cases and exceptions. If code can identify questionable data, then it can double-check the data: –Ask an application end user to confirm the input –Flag the input for checking by a system administrator –Compare the value to a list of known exceptions –Call up a server and see if it can confirm the value

8 2. User inputs often can occur in multiple different formats. Two different strings can be equivalent. –How many ways can you write a date? –What if an end user types a date in the wrong format? –“Jan ” and “1/1/2007” mean the same thing because of the category that they are in: date. –Sometimes the interpretation is ambiguous. In real life, preferences and experience guide interpretation. If code can transform among formats (ie: not just recognize formats with regexps), then it can put data in an unambiguous format as needed. –Display result so users can check/fix interpretation

9 3. The meaning of data is often tied to its “parts”, not directly to its characters. Data often has parts, each with a meaning. –What are the parts of a date, 12/31/2008? –Valid data obeys intra- and inter-part constraints. –Constraints are usually platform-independent. –Writing regexps requires you to translate constraints into a character sequence… tough in many cases, practically or truly impossible in others. If code could succinctly state the parts, as well as mandatory and optional constraints on the parts, wouldn’t the code be easier to write and maintain? –Especially if it was platform-independent!
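To make the "parts and constraints" idea concrete, here is a minimal sketch in JavaScript. It is not the tope implementation from the talk. The function names (`parseDateParts`, `daysInMonth`, `isValidDate`) and the choice of a month/day/year layout are assumptions made for illustration only.

```javascript
// Hypothetical sketch: name the parts of a date first, then state the
// constraints on the parts instead of on raw characters.
function parseDateParts(str) {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(str);
  if (m === null) return null;
  return { month: Number(m[1]), day: Number(m[2]), year: Number(m[3]) };
}

function daysInMonth(month, year) {
  const isLeap = (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;
  const lengths = [31, isLeap ? 29 : 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
  return lengths[month - 1];
}

function isValidDate(str) {
  const parts = parseDateParts(str);
  if (parts === null) return false;
  // Intra-part constraint: the month stands on its own.
  if (parts.month < 1 || parts.month > 12) return false;
  // Inter-part constraint: the legal day range depends on month and year.
  return parts.day >= 1 && parts.day <= daysInMonth(parts.month, parts.year);
}
```

The inter-part rule (day versus month and year) takes one line here, whereas encoding it directly in a regular expression would require enumerating the cases.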

10 Limitations of existing approaches. Types do not support questionable values. Grammars do not, either, nor can they reformat. Information extraction algorithms rely on grammatical cues that are absent during validation. Cues, Forms/3,  -calculus, Slate, pollution markers, etc, infer numerical constraints but not constraints on strings, nor are they platform-independent.

11 Imagine a world where… Code can ask an oracle, “Is this a company name?”, and the oracle replies yes, no, almost definitely, probably not, and other shades of gray. Code allows input in any reasonable format, since the code can ask the oracle to put the input into the format that is actually needed. People teach the oracle about a new data category by concisely stating its parts and constraints.

12 New Approach: Topes. A tope = a platform-independent abstraction describing how to recognize and transform strings in one category of data. Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain. Validating with topes improves –Accuracy of validation –Reusability of validation code –Subsequent duplicate identification

13 A tope is a graph. Node = format, edge = transformation. Notional representation for a CMU room number tope… –Formal building name & room number: “Elliot Dunlap Smith Hall 225” –Colloquial building name & room number: “Smith 225” –Building abbreviation & room number: “EDSH 225”
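The graph view can be sketched in a few lines of JavaScript. The `BUILDINGS` table, `parseAs`, and `transform` are hypothetical names, and the table holds only the one building shown on the slide. This illustrates the idea and is not how the talk's tools implement it.

```javascript
// Hypothetical lookup table with the single building from the slide.
const BUILDINGS = [
  { formal: "Elliot Dunlap Smith Hall", colloquial: "Smith", abbrev: "EDSH" },
];

// Node: a format ("formal", "colloquial" or "abbrev") that can split a
// string into a building and a room number, or report that it does not match.
function parseAs(format, str) {
  for (const building of BUILDINGS) {
    const prefix = building[format] + " ";
    const room = str.slice(prefix.length);
    if (str.startsWith(prefix) && /^\d+$/.test(room)) {
      return { building, room };
    }
  }
  return null;
}

// Edge: rewrite a string from one format into another by re-emitting its parts.
function transform(str, from, to) {
  const parsed = parseAs(from, str);
  if (parsed === null) return null;
  return parsed.building[to] + " " + parsed.room;
}
```

For example, `transform("EDSH 225", "abbrev", "formal")` yields the formal name followed by the room number, and a string that does not match the source format yields `null`.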

14 A tope is a conceptual abstraction. A tope implementation is code. Each tope implementation has executable functions: –1 isa: string → [0,1] function per format, for recognizing instances of the format (a fuzzy set) –0 or more trf: string → string functions linking formats, for transforming values from one format to another. Validation function: φ(str) = max(isa_f(str)), where f ranges over the tope’s formats –Valid when φ(str) = 1 –Invalid when φ(str) = 0 –Questionable when 0 < φ(str) < 1
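A minimal sketch of the validation function, assuming two hypothetical formats for the room number example. The regular expressions, the 0.5 score, and the names `roomFormats`, `validationScore`, and `classify` are illustrative assumptions, not taken from the talk.

```javascript
// Each isa function maps a string to a score in [0, 1].
const roomFormats = {
  // Hypothetical pattern: 2-4 capital letters, a space, 3-4 digits.
  abbrev: (s) => (/^[A-Z]{2,4} \d{3,4}$/.test(s) ? 1 : 0),
  colloquial: (s) => {
    if (!/^[A-Z][a-z]+ \d{3,4}$/.test(s)) return 0;
    // Hypothetical soft constraint: an unknown building name is questionable.
    return s.startsWith("Smith ") ? 1 : 0.5;
  },
};

// Validation function: the maximum isa score over all formats of the tope.
function validationScore(formats, str) {
  return Math.max(...Object.values(formats).map((isa) => isa(str)));
}

// Three-way outcome instead of a binary accept/reject.
function classify(formats, str) {
  const score = validationScore(formats, str);
  if (score === 1) return "valid";
  if (score === 0) return "invalid";
  return "questionable";
}
```

Taking the maximum means a string is valid as soon as any one format fully recognizes it, and questionable only if its best match is partial.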

15 Common kinds of topes: enumerations and proper nouns. Multi-format enumerations, e.g.: US states –“New York”, “CA”, maybe “Guam”. Open-set proper nouns, e.g.: company names –Whitelist of definitely valid names (“Google”), with alternate formats (e.g. “Google Corp”, “GOOG”) –Augmented with a pattern for promising inputs that are not yet on the whitelist
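An open-set proper noun can be sketched as a whitelist plus a fallback pattern. The whitelist entries come from the slide. The capitalized-words pattern and the 0.5 score are assumptions chosen for illustration.

```javascript
// Whitelist of definitely valid names, including alternate formats.
const KNOWN_COMPANIES = new Set(["Google", "Google Corp", "GOOG"]);

function isaCompanyName(str) {
  if (KNOWN_COMPANIES.has(str)) return 1;
  // Hypothetical pattern for promising inputs not yet on the whitelist:
  // one to four words, each starting with a capital letter.
  if (/^[A-Z][A-Za-z&.]*( [A-Z][A-Za-z&.]*){0,3}$/.test(str)) return 0.5;
  return 0;
}
```

Inputs that score 0.5 are the ones a program could double-check, for instance by asking the user to confirm before adding the name to the whitelist.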

16 Two other common kinds of topes: numeric and hierarchical. Numeric, e.g.: human masses –Numeric and in a certain range –Values slightly outside range might be questionable –(Very rarely) labeled with an explicit unit –Transformation usually by multiplication. Hierarchical, e.g.: address lines –Parts described with other topes (e.g.: “100 Main St.” uses a numeric, a proper noun, and an enum) –Simple isas can be implemented with regexps. –Transformations involve permutation of parts, changes to separators, arithmetic, and lookup tables.
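A numeric tope might look like the following sketch for human masses. The range limits (2 to 200 kg as clearly valid, up to 650 kg as questionable), the 0.3 score, and the function names are assumptions for illustration and do not come from the talk.

```javascript
// Hypothetical isa function for a mass written as a plain number of kilograms.
function isaMassKg(str) {
  if (!/^\d+(\.\d+)?$/.test(str)) return 0; // not numeric at all
  const kg = Number(str);
  if (kg >= 2 && kg <= 200) return 1;   // inside the expected range
  if (kg > 0 && kg <= 650) return 0.3;  // outside it, but not impossible
  return 0;
}

// Transformation between formats by multiplication.
const LB_PER_KG = 2.20462;
function kgToLb(str) {
  return (Number(str) * LB_PER_KG).toFixed(1);
}
```

A value such as "300" is not rejected outright. It gets a low score so that the surrounding program can decide whether to ask for confirmation.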

17 Tope Development Environment (TDE) –Topei module: infers tope from examples –Toped module: enables end-user programmers (EUPs) to create/edit topes –Topeg module: generates context-free grammars and transformations –Topep module: parses data against grammars, performs transformations –Plug-ins (read/write program data): Robofox (web macros), Vegemite/CoScripter (web macros), Microsoft Excel (spreadsheets), Visual Studio.NET (web applications), …

18 Toped User Interface. Features: –Format inference –Format/part names –Soft constraints –Testing features –Format reusability. User study: EUPs are fast & accurate at creating tope formats.

19 Integration with programming platforms. Microsoft Excel: buttons and menus. Visual Studio: drag-and-drop code generation.

20 Other integrations to date: CoScripter, Robofox, XML/HTML library

21 Evaluating accuracy, reusability, and usefulness for data cleaning. Implemented topes for spreadsheet data –32 topes based on 720 online spreadsheets –Tested accuracy. Reused topes on web application data –8 data categories in Google Base and 5 data categories in Hurricane Katrina site –Tested accuracy. Used transformations to reformat data –5 data categories in Hurricane Katrina site –Measured increase in number of duplicates identified

22 Extracting spreadsheet test data. Cluster spreadsheet columns based on data category –EUSES spreadsheet corpus “database” section –Hierarchical agglomerative clustering –Manual inspection –Result = 1713 columns in 246 clusters (1 cluster per data category). Created 1 tope for each of 32 most common categories –Yielding 32 topes –Covered 70% of clustered columns

23 We considered 5 validation strategies. Strategy 1: Current spreadsheet practice (accept all inputs). Strategy 2: Current webapp practice (validate with regexp or fixed list, when available; accept all other inputs) –36 regexps + 35 fixed lists, in 7 categories. Strategy 3A: Tope rejecting questionable (accept when φ(str) = 1). Strategy 3B: Tope accepting questionable (accept when φ(str) > 0). Strategy 4: Tope warn on questionable (simulate double-check by user when 0 < φ(str) < 1).
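The tope-based strategies differ only in how they treat a score strictly between 0 and 1. A minimal sketch follows, assuming a hypothetical `accepts` function that takes the strategy label, the validation score, and a callback standing in for the user's double-check. Strategy 2 is omitted because it depends on per-category regexps and lists that the slides do not reproduce.

```javascript
// Hypothetical decision rule for each strategy, given a score in [0, 1].
function accepts(strategy, score, userConfirms = () => false) {
  switch (strategy) {
    case "1": // current spreadsheet practice: accept everything
      return true;
    case "3A": // reject questionable inputs
      return score === 1;
    case "3B": // accept questionable inputs
      return score > 0;
    case "4": // ask the user only when the input is questionable
      if (score === 1) return true;
      if (score === 0) return false;
      return userConfirms();
    default:
      throw new Error("unknown strategy: " + strategy);
  }
}
```

Under strategy 4 the user is consulted only for the questionable band, so clearly valid and clearly invalid inputs cost the user nothing.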

24 Measurements. Based on 100 random values per category. Used F1 to measure accuracy –standard measure of accuracy for classifiers = (precision*recall)/avg(precision,recall). Considered topes with 1, 2, 3, 4, or 5 formats.
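The F1 formula on this slide can be computed as below. The function name `f1Score` and its count-based parameters are assumptions. The zero guards are an added convention for the case where precision or recall is undefined.

```javascript
// F1 = (precision * recall) / avg(precision, recall),
// which is the same as 2PR / (P + R).
function f1Score(truePos, falsePos, falseNeg) {
  const precision =
    truePos + falsePos === 0 ? 0 : truePos / (truePos + falsePos);
  const recall =
    truePos + falseNeg === 0 ? 0 : truePos / (truePos + falseNeg);
  const avg = (precision + recall) / 2;
  // Convention assumed here: F1 is 0 when both precision and recall are 0.
  return avg === 0 ? 0 : (precision * recall) / avg;
}
```

A validator that accepts every input flags nothing as invalid, so it has no true positives for the "invalid" class and its F1 under this definition is 0.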

25 Recognizing multiple formats and questionable inputs raises accuracy. Strategy 4: Hypothetical user has to help on ~3% of inputs. Strategy 1: Recall = 0 (fails to identify any invalid inputs).

26 Topes based on spreadsheet data were accurate on web application data. [Chart panels: Google Base; Hurricane Katrina]

27 Putting data in a consistent format improves duplicate identification. Randomly extracted values for each of 5 Hurricane Katrina data categories. Implemented transformations for each 5-format tope from the less commonly used formats to the most commonly used. Found approximately 8% more duplicates after transformation.
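The effect of normalizing before de-duplication can be sketched as follows. The state-name lookup and the sample values are invented for illustration and are not the Hurricane Katrina data. `toAbbrev` stands in for a tope transformation into the most commonly used format.

```javascript
// Hypothetical transformation into one target format (two-letter abbreviation).
const STATE_ABBREV = new Map([
  ["California", "CA"],
  ["New York", "NY"],
]);

function toAbbrev(value) {
  return STATE_ABBREV.has(value) ? STATE_ABBREV.get(value) : value;
}

// Number of values that exactly repeat an earlier value.
function countDuplicates(values) {
  return values.length - new Set(values).size;
}

const raw = ["CA", "California", "NY", "New York", "CA"];
const duplicatesBefore = countDuplicates(raw);
const duplicatesAfter = countDuplicates(raw.map(toAbbrev));
```

Exact string matching finds only the repeated "CA" in the raw values. After transformation, the long and short spellings collapse together and more duplicates become visible.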

28 Conclusion: Topes improve data validation. Validating with topes improves –Accuracy of validation –Reusability of validation code –Subsequent duplicate identification. Contributions: –Support for ambiguous data categories –Support for transforming values –Platform-independent validation

29 Primary Limitations. Topes are only appropriate… For string-ish, categorical data –Not for validating images, audio files, etc –Values must appear in a single field or variable –Validation rules derive from categorical constraints. When validation rules are known by programmer –Who must label the field/variable with a tope –Who must implement the tope, which runs locally (future work…)

30 Future Work: Sharing topes. Future topes development/use process: 1. People implement new topes by using the basic tope editor (or another language such as JavaScript). 2. People publish tope implementations on repositories. 3. People download tope implementations to local cache. 4. Tool plug-ins let people browse their local cache and associate topes with variables and input fields. 5. Plug-ins use tope implementations to validate data. Stay tuned (or come collaborate!!)

31 Thank You… To Margaret Burnett, Martin Erwig, and many others for suggestions over the past 3 years. To Oregon State University for this opportunity. To NSF for funding.

32 Professional programmers use lots of tricks to simplify validation code. Eg: njtransit.com Split inputs into many easy-to-validate fields. Who cares if the user has to type tabs now, or if he can’t just copy-paste into one field? Make users pick from drop-downs. Who cares if it’s faster for users to type “NJ” or “1/2007”? (Disclaimer: drop-downs sometimes are good!) I implemented this site in

33 Even with these tricks, writing validation is still very time-consuming. Overall, the site had over 1100 lines of JavaScript just for validation…. Plus equivalent server-side Java code (too bad code isn’t platform-independent) if (!rfcCheck (frm.primary .value)) return messageHelper(frm.primary , "Please enter a valid Primary address."); var atloc = if (atloc > 31 || atloc < frm.primary .value.length-33) return messageHelper(frm.primary , "Sorry. You may only enter 32 characters or less for your name\r\n”+ ”and 32 characters or less for your domain

34 That was worst case. Best case: reusable regexps. Many IDEs allow the programmer to enter one regular expression for validating each input field. –Usually, this drastically reduces the amount of code, since most validation ain’t fancy. –So why don’t programmers validate most inputs?